Is Dynamic Area 2 cacheable / bufferable?
Jon Abbott (1421) 2651 posts |
RO4 introduced a cached/buffered DA2 – presumably to speed up the Wimp. Has this carried over to RO5? Also, is there any difference in these memory flags between the RO5 IOMD and GPU-based OS versions? |
Jeffrey Lee (213) 6048 posts |
All screen memory under RISC OS 5 (whether it’s DA 2 or an ‘external’ framebuffer) is non-cacheable, bufferable.

I recently had a look at the screen caching code in the Ursula branch of the kernel to see how it worked (since it’s one of the things I’m considering adding to RISC OS 5). The implementation looks like a bit of a hack – due to both hardware limitations and software limitations (or an unwillingness to change the software). Implementing cacheable screen memory for modern machines (ARMv7, perhaps ARMv6 too) should be easy (no need to worry about cache inconsistency with doubly-mapped areas), but I don’t particularly fancy trying to implement it for older machines (unless we drop support for doubly-mapped screen memory… which, apart from IOMD, would be quite nice, since IOMD is the only platform where the hardware actually supports VIDC-style wrap-around screen buffers). |
Colin (478) 2433 posts |
Is there a reason IOMD, and perhaps ARMv6, can’t continue to use the current system whilst everything else uses the new one? |
Jon Abbott (1421) 2651 posts |
It’s certainly a nasty hack, relying on Aborts to detect writes to the screen, although it does work for the most part – except gaming of course ;) – made worse by bugs in the OS which prevent caching being turned off legally. I ended up modifying L2PT directly as a short-term fix, although I have subsequently dropped support for RO4 due to other novel design decisions made with memory mapping that break most games.
That would explain the very poor performance I’m seeing with the blitter on the Iyonix. As RO5 doesn’t support RO4’s SWIs to enable/disable screen caching, I’ll modify L2PT directly for the time being and see if there’s an improvement. If I recall correctly, RO4’s hack was purely to get around the hit of cleaning the cache at VSync on StrongARM. Under newer CPUs where a ranged clean can be done, enabling caching shouldn’t be a big deal; it just needs the DCache cleaning for the screen memory at VSync. Possibly a quick win in desktop performance, assuming IO memory is cacheable of course. Is it? |
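For context, a minimal sketch of what ‘modifying L2PT directly’ amounts to on the short-descriptor MMU format – flipping the C and B bits of a 4KB small-page entry. The helper name is illustrative only; a real implementation also has to deal with the TEX bits, cache/TLB maintenance and the OS’s own view of the page flags, as discussed later in the thread:

```c
#include <stdint.h>

/* ARM short-descriptor L2 small-page entry: bit 2 = B (bufferable),
   bit 3 = C (cacheable). This only shows the bit-flipping; after the
   write, the cache line holding the entry must be cleaned and the TLB
   entry for the page invalidated before the change reliably applies. */
#define L2_B  (1u << 2)
#define L2_C  (1u << 3)

/* Hypothetical helper: 'entry' points at the L2PT word for one 4KB page. */
static inline void set_page_cacheable(volatile uint32_t *entry, int cacheable)
{
    uint32_t e = *entry;
    if (cacheable)
        e |= (L2_C | L2_B);
    else
        e &= ~(L2_C | L2_B);
    *entry = e;
}
```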
Jeffrey Lee (213) 6048 posts |
I’m not sure. At the least we’d have the issue that there’d be two different implementations, which would add a bit of a maintenance cost.

> I recently had a look at the screen caching code in the Ursula branch of the kernel to see how it worked

I don’t mind it using aborts – that was the only real option for IOMD hardware (IIRC Phoebe/IOMD 2 was going to have a hardware register which would detect screen writes). I’m more concerned about the fact that the code to make the screen cacheable is a post-process performed after each DA 2 grow/shrink – it goes through the page tables and converts the entries from 4K page mappings to 1MB section mappings.
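For anyone unfamiliar with that conversion, here’s a rough sketch (not the kernel’s code) of what a 1MB short-descriptor section entry looks like; the TEX/APX/shareability bits a real ARMv6+ kernel would also set are left out for brevity:

```c
#include <stdint.h>

/* ARM short-descriptor L1 section entry (one 1MB mapping):
   bits 31:20 = physical section base, bits 1:0 = 0b10 (section),
   bit 2 = B, bit 3 = C, bits 8:5 = domain, bits 11:10 = AP[1:0]. */
static uint32_t make_section_entry(uint32_t phys_base, unsigned domain,
                                   unsigned ap, int cacheable, int bufferable)
{
    uint32_t entry = (phys_base & 0xFFF00000u) | 0x2u;   /* section descriptor */
    entry |= (domain & 0xFu) << 5;
    entry |= (ap & 0x3u) << 10;
    if (bufferable) entry |= 1u << 2;                    /* B */
    if (cacheable)  entry |= 1u << 3;                    /* C */
    return entry;
}
```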
Well, the problem is that it’s not really a ranged cache clean – the OS has to iterate through the address range, performing the clean/invalidate on a per-cache-line basis (and then potentially repeat that process for each level of cache). So if you blindly do it across the full range of VRAM (which for a 4K display would be almost 32MB) it’ll be quite slow (and the OS will most likely realise that doing a ranged cache clean for that much RAM is a bit silly and switch to doing a full cache clean instead).

For most machines we’d be able to alleviate the problem by using a write-through cache policy, but we’d still need to worry about doing a clean/invalidate for any hardware-accelerated render ops, so (especially when considering the size of modern screens) we’d still most likely want to use aborts to track which pages are dirty and limit the cache cleaning to just the required area.

I think this is something the page table ‘access flag’ is designed to help with – this is a feature (at one time managed by hardware, but now ARM seem to have deprecated that, so it requires manual management via an abort handler instead) which updates a flag in the page table entry once that entry is loaded into the TLB. So we could use that to keep better track of which pages are in the cache than the Ursula-style domain system would be able to.
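To illustrate the per-line iteration being described, here’s a minimal ARMv7 sketch (assuming a GCC-style toolchain and privileged execution – the CP15 operations are not available from user mode); it is not the kernel’s implementation:

```c
#include <stdint.h>

/* Walk an address range and clean each data cache line by MVA (DCCMVAC,
   which cleans to the point of coherency), then issue a DSB so the data
   is visible to other observers. The line size comes from the Cache Type
   Register's DminLine field. Privileged mode only. */
void clean_dcache_range(uintptr_t start, uintptr_t end)
{
    uint32_t ctr;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 1" : "=r"(ctr));   /* CTR */
    uintptr_t line = 4u << ((ctr >> 16) & 0xFu);                  /* bytes per D-cache line */

    for (uintptr_t addr = start & ~(line - 1); addr < end; addr += line)
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1"             /* DCCMVAC */
                         :: "r"(addr) : "memory");

    __asm__ volatile("dsb" ::: "memory");
}
```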
Yes, IO memory can be made cacheable. Although now you mention it, it does raise the issue that we’re using 1MB section mappings for IO memory – which wouldn’t be ideal for keeping the cache maintenance to sensible levels. But that is one of the things I’d be changing with the new PMP-based screen memory management. So no, unfortunately it’s not a quick win – there are a few other pieces that need to come together to allow it to all work cleanly. |
Jon Abbott (1421) 2651 posts |
I recall it switched DA2 to Domain 1 so it could quickly detect screen writes early on in the Abort handler. Whilst games are running under ADFFS, I swap it back to Domain 0 and turn cacheable off in L2PT. Mapping DA2 in 1MB chunks breaks quite a lot of games, as they’re generally expecting that if ScreenSize is set to 160K, it is actually 160K and not 1MB. The knock-on effect is that frames invariably end up at a different address to the one expected.
Are you referring to the IMB_Range ARMop? Last week I started recoding ADFFS to make use of the exposed ARMops in 5.23, but immediately spotted that there was no D-cache ranged clean exposed. Are any of the MCRR operations exposed? c12 / c14? c12 should be sufficient for cleaning the screen at VSync, surely?
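For reference, the c12/c14 operations being asked about are the ARM11 ranged cache maintenance ops – a sketch below, assuming an ARM11 (ARMv6) target and privileged execution. As noted further down the thread, these encodings are ARM11-specific and trap as undefined instructions on the Pi 2’s Cortex-A7:

```c
#include <stdint.h>

/* ARM11 (ARMv6) ranged cache maintenance: MCRR with CRm = c12 cleans the
   data cache over [start, end]; CRm = c14 cleans and invalidates it.
   Addresses are virtual and should be cache-line aligned. */
static inline void arm11_clean_dcache_range(uintptr_t start, uintptr_t end)
{
    __asm__ volatile("mcrr p15, 0, %0, %1, c12" :: "r"(end), "r"(start) : "memory");
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 4" :: "r"(0) : "memory");  /* ARMv6 DSB */
}
```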
Will 1MB section mappings have any impact given the size of screen memory in 32-bit modes – 1920×1080 is 8MB? I would have thought that as long as MCRRs are being used (if available) there shouldn’t be a big overhead on a VSync screen scrub. How the Pi2’s GPU L2 cache comes into play here though, I’m not sure. I am seeing screen corruption under the blitter on the first line (with cache/buffer off), so I’m guessing I need to sync the GPU cache on the Pi2 somehow. I need to find some documentation on it to see what’s required. |
Jon Abbott (1421) 2651 posts |
I had a play with enabling caching/buffering on both DA2 and the GPU IO memory earlier (ignoring cache cleaning for the minute), with rather inconsistent results:

Iyonix – No screen tearing and a marked improvement in performance (the blitter now achieves 50fps)

I don’t understand why I’m not seeing tearing on the Iyonix or Pi; it’s as if the IO memory can’t be cached/buffered. Having said that, the Iyonix and Pi are on 5.21 and the Pi2 on 5.23 – not that I’m expecting that to make a difference. |
Jeffrey Lee (213) 6048 posts |
> Well, the problem is that it’s not really a ranged cache clean – the OS has to iterate through the address range, performing the clean/invalidate on a per-cache line basis (and then potentially repeat that process for each level of cache)

All the ranged cache ops – IMB_Range, IMB_List, MMU_ChangingEntry, and MMU_ChangingEntries.
Yeah, there aren’t any calls specifically for ranged cache cleaning. However you can perform cache cleaning & invalidation on a per-page basis by co-opting the MMU_ChangingEntry and MMU_ChangingEntries ARMops.
Unfortunately those are an ARMv6/ARM11-specific feature – there’s no support for them in ARMv7 (and I don’t think they’re supported in ARMv5 or below either – but then again I hadn’t even heard of them until you mentioned them). It looks like they’re a good feature to take advantage of for the Pi (RISC OS currently doesn’t use those ops), but the fact that none of the other machines support them means that we can’t really design cross-architecture features which only offer acceptable performance when those ops are available.

> Yes, IO memory can be made cacheable. Although now you mention it, it does raise the issue that we’re using 1MB section mappings for IO memory – which wouldn’t be ideal for keeping the cache maintenance to sensible levels.

It’s all down to the mechanism we use to identify dirty pages. If we use the access flag style of approach (use aborts to detect the first read or write to each page) then the size of the page will have a big impact on how much cache maintenance we perform – if you change one byte of screen memory then we’ll only see that as a 4KB page being dirty or a 1MB section being dirty. And if it’s only one byte which has changed then it’s clearly better to do a 4KB flush than a 1MB flush. If you’re changing lots of 4KB pages (e.g. a vertical line) then there’ll be extra overhead in terms of the abort handler, but I think the savings in terms of cache maintenance would outweigh that (especially since the MCRR ranged cache ops aren’t supported by most machines).
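As a concrete illustration of the dirty-page idea – a sketch only, with hypothetical names and a fixed-size bitmap; in reality the marking would happen in the abort or access-flag handler, and the pages would be write-protected again after cleaning:

```c
#include <stdint.h>
#include <string.h>

extern void clean_dcache_range(uintptr_t start, uintptr_t end);  /* earlier sketch */

#define PAGE_SHIFT   12                      /* 4KB pages */
#define SCREEN_PAGES 2048                    /* e.g. 8MB of screen memory */

static uint8_t dirty[SCREEN_PAGES];          /* one flag per screen page */

/* Called from the abort (or access-flag) handler on the first write to a
   page since the last VSync: just record which page was touched. */
void screen_page_touched(uintptr_t offset_into_screen)
{
    dirty[offset_into_screen >> PAGE_SHIFT] = 1;
}

/* Called at VSync: clean only the pages that were written, rather than the
   whole framebuffer. The pages would also be re-protected here so the next
   write to each one aborts again. */
void clean_dirty_screen_pages(uintptr_t screen_base)
{
    for (unsigned i = 0; i < SCREEN_PAGES; i++) {
        if (dirty[i]) {
            uintptr_t page = screen_base + ((uintptr_t)i << PAGE_SHIFT);
            clean_dcache_range(page, page + (1u << PAGE_SHIFT));
        }
    }
    memset(dirty, 0, sizeof dirty);
}
```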
I’m not sure if the GPU on the Pi 2 has an L2 cache (didn’t they remove it when they gave the ARM its own dedicated L2 cache?) But if it does have one, then AFAIK it (or the GPU) handles coherency with main memory automatically. We certainly haven’t had to do anything to adjust RISC OS to work with it.
It’s possibly because they have relatively tiny caches compared to the Pi 2, so the data is more likely to naturally get flushed out to memory by the time the GPU needs to use it. The fact that you’re seeing a performance improvement is a clear indicator that the caching is working!
For the Pi it might be relevant – a couple of months ago I did improve our support for the different VMSA memory attributes, which delivered a marked improvement for operations involving IO memory (screen operations, RAM disc, etc.) |
Jon Abbott (1421) 2651 posts |
I had a play with using caching/buffering and cleaning the data cache at VSync on the Pi2. Although it all appears to work on the Iyonix, the Pi2 is a whole different can of worms. I couldn’t test the Pi as they’re all packed up for the London Show.

Firstly, the Pi2’s core doesn’t appear to implement MCRR – it throws an undefined instruction. The PDF for the Pi’s core and ARMv7-M, -A and -R do detail the instruction, so I don’t know if you need to be in a virtualized CPU mode, as it does use VA and not MVA.

On the Pi2 it would seem that cleaning the data cache isn’t sufficient to get the GPU to see the changes; I suspect the GPU cache is possibly coming into play here, but I can’t find any documentation to clarify. Essentially, you can clean the entire CPU data cache, but the GPU doesn’t see the change for several VSyncs. I conducted these tests using OS_SynchroniseCodeAreas in both ranged and full mode, Cache_CleanAll and DMB_Write. |
Jeffrey Lee (213) 6048 posts |
The MCRR instruction is supported, but not the specific operations which you’re trying to use. (As I said above, the ranged cache maintenance ops appear to be an ARMv6/ARM11-specific feature)
OS_SynchroniseCodeAreas won’t help you – it only cleans/invalidates the L1 caches. DMB_Write won’t really help either (IIRC all the cache/TLB maintenance ARMops perform a full memory barrier equivalent to DSB_ReadWrite). Cache_CleanAll should be doing what you need, so I’m not sure why you’re seeing issues. |
Jon Abbott (1421) 2651 posts |
I’ve posted a YouTube video using Cache_CleanAll so you can see for yourself. The same code on the Pi B+ works as expected, so it’s Pi2-specific. We need a TRM for the GPU; at a guess it has a cache that needs flushing separately.

EDIT: There’s this for the BCM2835; I expect this now controls the dedicated GPU L2 cache on the BCM2836. Where does 7ee01000 map to in logical space? Do I need to map that block as physical IO first via OS_Memory?

EDIT2: Going by p93 of the Pi B VideoCore PDF, writing %100 to V3D_L2CACTL (GPU address 0x00020) should clear the GPU L2 cache. Are the GPU registers mapped in IO already? Looking at the FFT example it looks like the physical address of V3D_L2CACTL is 0x3F0C0020? |
Jon Abbott (1421) 2651 posts |
I’ve been thinking about this issue overnight and have come to the conclusion that one of two things is probably the cause:
Which raises the following questions:
One other thing that may be coming into play here is how I’m mapping the RO3.1 screen memory in. Due to the issues in OS_Memory with flag corruption and double mapping, I’m modifying L2PT directly to map 1F88000 to 1FFFFFF and the double map at 2000000 to 2077FFF, and also setting cacheable/bufferable on DA2, as RISC OS forces them off for graphics memory.

Does RISC OS need to be aware that these pages are now cacheable/bufferable? Should I be calling MMU_ChangingUncachedEntries instead of TLB_InvalidateEntry on the ranges, perhaps? Is it possible to resolve the issues in OS_Memory, so I can legally map these ranges? It would require a means to mirror memory ranges, as the RO3.1 double-mapped range is mirroring DA2.

EDIT: I’ve just realised there’s another potential gotcha. Where I’m remapping the video memory with cacheable/bufferable set, OS_Memory 13 is probably changing the logical address of the video memory and causing problems in RISC OS. |
Jeffrey Lee (213) 6048 posts |
That is a distinct possibility – how old is the ROM that you’re using? When I extended the ARMops and exposed them via OS_MMUControl 2 there were a few days (bug introduced at 22:02:28 on Aug 14, bug fixed at 23:33:37 on Aug 17) where the cache identification code was broken, so no L2 cache maintenance would have been performed. I probably should have mentioned this earlier, but I figured that if it took you two months to get round to using the ARMops then it would also have taken you two months to get round to downloading a ROM containing support for them! |
Jon Abbott (1421) 2651 posts |
I’m on the 23-10-15 build and am updating it weekly. |
Jeffrey Lee (213) 6048 posts |
By doing something like what you’re doing :-) Make a buffer which is used for DMA cacheable, write to it and flush the cache, then make sure the DMA master sees the correct data. For a more cast-iron test you could probably use memory-to-memory DMA, to allow code to verify that the data is being read correctly (rather than relying on your eyes spotting glitches).
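A rough illustration of that sort of check – purely hypothetical, in that it assumes you already have a second, uncached logical mapping of the same physical buffer standing in for the DMA master’s view:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

extern void clean_dcache_range(uintptr_t start, uintptr_t end);  /* earlier sketch */

/* 'cached' and 'uncached' are assumed to be two logical mappings of the
   same physical buffer. Write a known pattern through the cached view,
   clean the cache, then check that the uncached view sees every byte. */
int verify_cache_clean(volatile uint8_t *cached, volatile uint8_t *uncached, size_t size)
{
    for (size_t i = 0; i < size; i++)
        cached[i] = (uint8_t)(i ^ 0x5A);

    clean_dcache_range((uintptr_t)cached, (uintptr_t)cached + size);

    for (size_t i = 0; i < size; i++) {
        if (uncached[i] != (uint8_t)(i ^ 0x5A)) {
            printf("mismatch at offset %zu\n", i);
            return 1;
        }
    }
    return 0;
}
```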
Correct.
MMU_ChangingEntry is the closest to this. E.g. when DMAManager uses OS_Memory 0 to mark pages as temporarily uncacheable, the kernel updates the L2PT entry and then uses MMU_ChangingEntry to clean+invalidate all cache levels for that page.
If manual cleaning of GPU caches is required, I’d hope that the GraphicsV driver would be smart enough to know when to clean them itself. But I don’t think we’re yet in the situation where GPU cache cleaning is required.
RISC OS doesn’t really care, but one thing you do need to be careful of is that there are some extra rules on ARMv6+ for how to let the CPU know about updated page table entries. IIRC, the full rules for page table updates are:

1. Write the new page table entry.
2. Clean the data cache line containing the entry (if the page tables are in cacheable memory), followed by a DSB, so the table walker can see the update.
3. Invalidate the TLB entry/entries for the affected address range.
4. Invalidate the branch predictor if executable mappings were changed.
5. Finish with a DSB (and an ISB if the new mapping is about to be used).
If you don’t follow those rules then you can end up with odd aborts where the CPU is using stale prefetched data/TLB entries.
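A minimal ARMv7 sketch of that publishing sequence, assuming privileged execution, the short-descriptor page table format and a global (non-ASID-tagged) mapping; the function name is illustrative and the ARM ARM is the authoritative reference:

```c
#include <stdint.h>

/* Publish an updated page table entry on ARMv7: store the entry, clean the
   cache line holding it (needed if the page tables are in cacheable memory),
   DSB, invalidate the TLB entry for the page, invalidate the branch
   predictor (relevant for executable mappings), then DSB + ISB. */
void publish_page_entry(volatile uint32_t *l2pt_entry, uint32_t new_entry,
                        uintptr_t page_va)
{
    *l2pt_entry = new_entry;

    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(l2pt_entry) : "memory"); /* DCCMVAC */
    __asm__ volatile("dsb" ::: "memory");

    __asm__ volatile("mcr p15, 0, %0, c8, c7, 1" :: "r"(page_va) : "memory");     /* TLBIMVA */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0) : "memory");           /* BPIALL */
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb" ::: "memory");
}
```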
It’s a bit of a tricky one – RISC OS isn’t very good at keeping track of doubly-mapped areas. When I get home tonight I’ll try and focus on writing up some design ideas I’ve been having for the improved GraphicsV memory management, as I think they’ll offer a suitable solution to the problem. |
Jon Abbott (1421) 2651 posts |
Trouble is, I’m not really proving anything, as the L2 cache on the GPU may be having an impact. I’m struggling to rule that out because there’s no way to get the logical address from a physical address via OS_Memory without remapping it – and by remapping the VideoCore with different flags, it looks like OS_Memory 13 is altering the logical address and causing BCMVideo to crash :( Could OS_Memory be extended so you can translate a physical to a logical address, if it’s already mapped? e.g. OS_Memory 17, <physical>, returning <logical> in R1. OS_Memory 9/25 don’t have the VideoCore as a Controller, so you can’t get the addresses via that route.
If it’s the GraphicsV driver requesting the clean, then yes. In this scenario, where we have a driver sitting between the OS and GraphicsV to provide legacy MODE support, there needs to be a means to get the underlying GraphicsV driver to clean the GPU cache. This is of course somewhat academic until I can prove it is the GPU L2 cache that’s causing the issue; it may well still turn out to be the CPU L2 cache.
I think I’m okay here, as I’m only changing the access to existing pages from User R/W to User R and back, so I can trap self-modifying code. It would fail miserably if this wasn’t working. For the changes to DA2, however, I will change it to use MMU_ChangingUncachedEntry or MMU_ChangingEntry as appropriate.
Does RISC OS need to track them? We could leave the maintenance to the app/driver that’s requesting the map – it’s a bit of a niche market for circular buffers. I’m using them for both Sound and Video. Essentially, we just need a means to map the same physical pages to multiple logical address ranges. It doesn’t even need to be aware it’s doubly mapped – the app/driver can just map it twice itself.
I’d be interested in your ideas on this. When you start looking at providing legacy MODE support within the OS, you may come across some of these niggly issues. Speaking of which, do you have an idea of timescales for providing legacy mode support? I ask because I was approached by several people at the London Show, asking if I could spin out my blitter into a separate Module. With modern RISC OS hardware supporting 16M colours only, all legacy MODEs are now unavailable. My blitter is obviously way beyond what’s needed, as I’m emulating VIDC20 / IOMD down to raster level for games. A straight blit per VSync is all that’s really required.

So, to recap the main point: I need to confirm it is the GPU L2 cache that needs cleaning – I just need to figure out how to get its logical address and enable caching/buffering without breaking the GraphicsV driver. I might have a poke around in BCMVideo’s private space to see if I can get it from there – I only need to prove it either way, then we can figure out what needs adding if it does need cleaning. |
Jon Abbott (1421) 2651 posts |
I’ve now confirmed Cache_CleanAll is cleaning the CPU L2 cache. I’m not sure if it’s cleaning the GPU L2 cache, but I don’t believe that’s required. The root cause of the problem was that I was creating the double map at 1F88000 to 1FFFFFF and 2000000 to 2077FFF with cacheable/bufferable, but wasn’t altering DA2’s cacheable/bufferable to match; it turns out this breaks cache cleaning on the Pi2. Setting cacheable/bufferable on the matching memory in DA2 cures the problem. I expect the L2 CPU cache on the Pi2 is working on MVAs and not physical addresses. This does pose another potential problem, in that the secondary and tertiary maps could get out of sync with the primary – which might explain the random screen corruption I’m now seeing (it could also be unaligned memory accesses, as I’ve yet to verify my unaligned Abort handler). |
Jeffrey Lee (213) 6048 posts |
Yeah, we could probably do that (there’s already something similar for the PCI module, I believe – a way to get the logical mapping of some physical PCI memory, but only if it already exists).

> RISC OS isn’t very good at keeping track of doubly-mapped areas.

It doesn’t need to track them, but it would be nice if it could.

> When I get home tonight I’ll try and focus on writing up some design ideas I’ve been having for the improved GraphicsV memory management, as I think they’ll offer a suitable solution to the problem.

Over here – although I did kind of gloss over a few of the things.
No idea. Overhauling the GraphicsV memory management is the next big thing I want to do (and is a stepping stone to the legacy mode support). But other tasks are likely to pop up and get in the way. And once the memory management overhaul is complete, I’m not really sure what my next task will be (multi-monitor support? hardware overlays? bigger/better mouse pointer images? Legacy mode support is important, but it’s just one of many things!) Of course, once the memory management overhaul is complete, things will hopefully be in a state where you or anyone else would be able to write a fairly clean and generic support module to provide the required functionality. |
Jeffrey Lee (213) 6048 posts |
According to some raw notes of mine from the ARM ARM (in the context of how to support double mapping for cacheable screen memory):

- ARMv7:
  - data caches are PIPT; multiple mappings are fine as long as the attributes match (for simplicity, require all attributes to match, even if in reality it’s only the cacheability & memory type attributes that matter). Cache maintenance performed on a particular VA affects all aliases of that VA.
  - instruction caches typically need invalidation for writing of new instructions, and typically don’t invalidate aliases when VA maintenance is performed (and if it’s VIVT the only guaranteed way to get rid of all aliases is to do a full invalidate) – but it’s highly unlikely you’d want to doubly-map code, so make it a rule that any cacheable doubly-mapped area must be non-executable.
- ARMv6:
  - for multiply-mapped cacheable non-shareable regions, page colouring may be in effect:
    - bits 13:12 of the VA of all mappings must be identical
    - and if the page size != 4K, bits 13:12 of the PA must match bits 13:12 of the VA
- ARMv5 and below appear to be VIVT for both I+D, so multiply-mapped cacheable regions = bad.

Possibly, with the conflicting cacheability attributes you had, the CPU was automatically invalidating the cache lines (or losing track of them) when it saw the conflicting entry get loaded into the TLB. |
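To make the ARMv6 page-colouring constraint concrete, a tiny sketch of the check (the function name is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv6 page colouring: for multiply-mapped cacheable (non-shareable)
   regions, bits 13:12 of every virtual alias must be identical, otherwise
   the same physical line can land in different cache sets. */
static bool aliases_have_same_colour(uintptr_t va1, uintptr_t va2)
{
    return ((va1 ^ va2) & 0x3000u) == 0;   /* compare bits 13:12 */
}
```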