Cache maintenance learnings

18 posts, 5 voices

Mar 10, 2016 2:37pm Jeffrey Lee (213) 6048 posts	Recently I’ve been looking at some cache maintenance issues which seem to be the root cause of the SATA issues seen on IGEPv5 and (to a lesser extent) Titanium. From looking at how the OS currently handles cache maintenance, and how many current CPUs implement their caches, it’s becoming pretty clear to me that the OS is operating in a manner that doesn’t work very well on modern machines (and even some old ones). For the short-term I should soon have a fix available that will make SATA happy, but longer-term we should look to change the way we’re dealing with cache maintenance within the OS – that’s what this thread is for. Key problems In ARMv7 and above, cache maintenance operations are guaranteed to affect disabled caches, and address-based operations are guaranteed to affect all memory types (i.e. you can make something non-cacheable and then flush it). Prior to ARMv7 this behaviour was implementation defined, and at least one CPU (ARM11) ignores address-based maintenance ops which target non-cacheable memory. It’s implementation defined whether unexpected cache hits are ignored or not (i.e. when a page is marked noncacheable but hits a cache entry). In particular, all the ARMv7+ TRMs that I’ve looked at when looking at this issue suggest that they all ignore the unexpected hit, and I have a test app which suggests that XScale ignores unexpected hits as well (interrupt hole in OS_Memory 0 “make temporarily uncacheable” operation) For now I’ve fixed the OS_Memory 0 issue by disabling interrupts around the critical part, but in the multi-core future which we all dream of, this isn’t really a workable solution – we’d have to temporarily halt all the other cores so we can make sure they aren’t accessing the memory. So unless the caller can guarantee that nobody is actively using the page which is being made uncacheable, using OS_Memory 0 to make a page uncacheable is inherently unsafe (Also I haven’t yet worked out what to do for FIQs – claiming the vector within OS_Memory 0 may cause issues if it was the FIQ owner who is making the call to OS_Memory!) If a page is marked cacheable, there’s no way of guaranteeing that the cache doesn’t contain data for that page. So cleaning the cache before making a page noncacheable isn’t going to work (this is the main issue that was tripping up SATA/Cortex-A15 – we clean the page from the cache and then the CPU appeared to be prefetching the page back into the cache prior to us making it noncacheable) AIUI, address-based cache maintenance operations are the only maintenance operations which will be coherent across multiple cores (i.e. doing a full cache clean via set/way based ops won’t clean the L1 caches of the other cores). I believe set/way based ops are also bad news for hypervisors (saw it mentioned in some slides from an ARM presentation somewhere – something like the hypervisor has to do a full cache clean for each single set/way based op that it sees, I guess because it can’t be certain which cores need cleaning – see page 8/slide 6#2) RISC OS doesn’t understand the concept of physically tagged caches, so cleans the cache whenever pages are being mapped out. If the OS understood physically tagged caches a lot of these problems would go away, but there’d still be some tricky edge cases (OS_Memory 0, and dynamic areas claiming specific pages) Potential solutions Note that I’m trying to only consider solutions which will work well for multi-core. OS_Memory 0 We almost certainly want to deprecate the OS_Memory 0 “make temporarily uncacheable” operation and start using a different approach for DMA. Potentially we could just go with the following approach: For DMA which reads from RAM, all that’s necessary is to clean the cache for the required region before starting the DMA. For DMA which writes to RAM, we’d have to require that all the regions involved are cache-line aligned (everything else will have to go via a bounce buffer). Then we can simply do a cache invalidate at the start of the operation (to prevent any dirty cache lines from writing back over the incoming data) followed by another invalidate at the end of the operation (to get rid of any stale prefetched data) (edit – this seems to be ARM’s preferred way of doing things) Physically tagged caches The solution which I’ve been thinking of for some time is to make it so that the CAM retains the cacheability attributes of pages which are mapped out. So on systems with physically tagged caches (all ARMv7 IIRC, maybe ARMv6 too?) it can avoid flushing the (data) cache when mapping a page out, including when the page changes owner (e.g. gets put back into the free pool). If there’s a page in the free pool which is marked as cacheable, and it’s due to be added to a non-cacheable DA, then after the page is mapped in to its new location we can perform an invalidate operation. For PMPs which are claiming a page but not mapping it in yet, this would probably require the kernel to make a temporary mapping to another location like the physical access window. The observant of you will remember that ARM11 ignores address-based cache maintenance ops on noncacheable memory. So we’d probably want the “make page uncacheable” logic to be in an ARMop-style utility routine so that we can implement different workarounds for different CPUs as required. Possibly it’s only ARMv6/ARM11 which will suffer from this, since I don’t think there are any older CPUs which use physically tagged caches. Dynamic areas claiming specific pages This is a tricky one. The current solution (disable interrupts around the page replacement code) is fine for single-core but will obviously fail for multi-core. Perhaps for this the only workable solution would be to temporarily halt the other cores – at least for situations where the page is marked as shareable. Considering that we’d need to deal with situations where a page is mapped in on one core but not another, we’d probably need to make a cross-core call anyway (e.g. if core A requests a page that’s currently being used by just core B). Set/way based operations not working with multi-core Whenever RISC OS needs to do cache maintenance it generally looks at the size of the area and makes a judgement call (based on the value returned by the Cache_RangeThreshold ARMop) as to whether a ranged clean or a full clean would be fastest. So the obvious solution here would be to get rid of that code so that only the ranged clean is used. But for older or single core machines where multi-core safety isn’t an issue we’d probably want to keep the old logic present – I guess we can just fudge it by making the Cache_RangeThreshold ARMop return 4G-1, so the runtime checks for will still be there, but the optimisation will never be taken? Anyone else have any useful insights on how things should/shouldn’t be done?

Mar 10, 2016 5:26pm Rick Murray (539) 14047 posts	The small question is – how does Linux do it? A multi core system with multiple processes and threads must have an interesting way of shuffling it all around. How would that approach work with RISC OS’s task switching (etc)? I have a feeling that part of this may involve expanding the kernel’s understanding of multiple tasks to support multiple threads (may help with USB/network updates) so maybe this may provide ideas for future directions?

Mar 11, 2016 10:32am Colin (478) 2433 posts	My understanding of cache complexities is rudimetary but I have a few questions/thoughts. Quote For DMA which reads from RAM, all that’s necessary is to clean the cache for the required region before starting the DMA. For DMA which writes to RAM, we’d have to require that all the regions involved are cache-line aligned (everything else will have to go via a bounce buffer). Then we can simply do a cache invalidate at the start of the operation (to prevent any dirty cache lines from writing back over the incoming data) followed by another invalidate at the end of the operation (to get rid of any stale prefetched data) (edit – this seems to be ARM’s preferred way of doing things) EndQuote Isn’t the above only required if the memory used for dma is cachable? The link says nothing about an invalidate being required both before and after DMA writes do you have a link which discusses this point? I can’t find one yet. If non-cacheable memory is used for dma there is only a problem with cache coherency if the non_cachable memory was previously cacheable and so may have cache hits – memory that has always been non-cachable doesn’t have cache hits does it? So if the os used a non-cacheable and cacheable memory pool it would require no cache maintenance. You would only need cache maintenance if you moved memory from the cached pool to the non-cached pool.

Mar 11, 2016 11:17am Jeffrey Lee (213) 6048 posts	The small question is – how does Linux do it? A multi core system with multiple processes and threads must have an interesting way of shuffling it all around. I haven’t really looked at Linux’s memory management in any depth, but I get the feeling that (a) like RISC OS it suffers from its own legacy design issues, and (b) I’m not sure if it supports claiming pages that are in active use by someone else. I’m always mindful of the fact that on OMAP boards, you have to tell the Linux kernel how much memory to reserve for the GPU/framebuffer – Linux was born on a platform where the GPU had dedicated memory, whereas RISC OS was born on a platform with unified memory, so the ability to claim a page which is in use by someone else (in order to extend screen memory) was never a design concern for them. Possibly this situation has improved (I’m mainly thinking back to 2009 when I started the BeagleBoard port), but considering that I’m always consistently impressed by how poor the desktop Linux experience is on ARM I highly doubt it (OMAP3, OMAP4, OMAP5, iMX6 – whenever I’ve had reason to check, none of the default Linux distributions had the ability to change the screen resolution from the screen settings in their configure app, and none of them seemed to be configured to use EDID, you had to manually select a mode in the u-boot config. Raspbian I don’t think I’ve checked, and chances are you probably can change the resolution there, but for one reason or another I’ve always found desktop Linux on ARM to be a pretty terrible user experience) Isn’t the above only required if the memory used for dma is cachable? Correct. The link says nothing about an invalidate being required both before and after DMA writes do you have a link which discusses this point? I can’t find one yet. The first invalidate is needed if you aren’t sure what the memory has been used for beforehand. In most cases it’ll be a writable page which you’ll be DMA-ing to, so you’d need to either clean or invalidate the cache to make sure any dirty cache lines don’t get written back over the incoming DMA data. If the page is read-only, or write-through cacheable, this won’t be required. If non-cacheable memory is used for dma there is only a problem with cache coherency if the non_cachable memory was previously cacheable and so may have cache hits – memory that has always been non-cachable doesn’t have cache hits does it? Correct. However using non-cacheable memory for DMA can cause you to lose out on one of the main performance benefits – most file I/O operations specify a buffer in the client’s memory to use for the source/destination data. If your DMA can target that memory directly then it will keep the amount of data being moved around to a minimum and you’ll get the best performance. If you have to have a CPU routine which copies the data to/from a temporary non-cacheable buffer (also known as a “bounce buffer”) then that will slow things down quite a bit (generally more than the cost of the cache maintenance), and may end up being slower than filling/emptying the hardware FIFO manually (it’s the difference between copying the data once vs. copying the data twice)

Mar 11, 2016 12:43pm Colin (478) 2433 posts	The first invalidate is needed if you aren’t sure what the memory has been used for beforehand. It was the second invalidate I had problems with. I couldn’t work out why there would be a prefetch but my best guess so far is I gather these cpu’s can do a speculative prefetch so a cache read from memory could happen before the memory is written to by DMA making the cache incoherent by the time DMA has finished.

Mar 11, 2016 1:06pm Jeffrey Lee (213) 6048 posts	Correct.

Mar 11, 2016 3:38pm Ben Avison (25) 445 posts	Just for the record, I feel I should point out that RISC OS’s cache maintenance around DMA has changed before, at RISC OS 3.7. This was because StrongARM introduced write-back caching; prior to that, there was no cache maintenance needed in the RAM→IO direction. I believe it was only ever really the case that all that was needed was a cache clean at that point in order to support StrongARM. Making DMAManager set pages temporarily uncacheable in the RAM→IO direction as well as in the IO→RAM was always an easy solution, because it could be implemented by simply removing the check inside DMAManager for which direction we’re going in and doing it unconditionally. The alternative would have involved creating a new kernel API to expose cache clean functionality. Not having been involved at the time, I don’t know how much this was down to laziness, versus time pressure, or even cautiousness. Bear in mind, even within Acorn, there were a limited number of engineers competent and confident enough to work on such low-level code. I can’t say I’m a massive fan of the idea of restricting IO→RAM DMA to cacheline alignment. This might be OK for some things like sound buffers, but any DMA that’s used to implement file transfers is quite likely to fall foul of that. Just think of the typical use case when an application loads or saves a file: typically it’ll be to or from a heap block in memory, and whilst these are invariably word-aligned, it’s rare that a heap manager deals in cacheline aligned blocks (in fact OS_Heap never does). I doubt much software bothers to round its heap blocks up to cacheline boundaries either. Also, typically on the IO side, you can only seek with sector size granularity, so it’s not like there’s the option to do the first sub-cacheline via a bounce buffer and then the rest of the transfer by DMA, because subsequent sectors will be just as unaligned as the first was. Here’s a rather ugly workaround if the transfer starts mid-cacheline: increment the transfer start address in RAM by one sector round down to the next cacheline, and invalidate from there upwards DMA the first sector to a bounce buffer, and remaining sectors directly to their destination addresses only after DMA has completed (and the same cachelines have been invalidated again), do the memcpy from the bounce buffer to the first sector’s destination address (with similar special treatment for where transfers end mid-cacheline of course). To avoid the danger of the bounce buffer itself being remapped during DMA, it’s probably best to always allocate it from IO memory (via the PCIManager module). I’m not buying the argument that the way it’s currently done is uniquely a problem for marking a page temporarily uncacheable though. In a multi-core world, you’d still need to clean or invalidate the caches of all cores before doing DMA. If it’s true that doing cache maintenance by address will propagate to other cores, then that will work, but I’d be worried that could potentially be very slow. It might be better to issue some sort of inter-core interrupt to get the attention of all of them, and request that they each do a full cache clean/invalidate. That mechanism would be just as easy to use in order to request page table updates. FIQs: I can see there’s a problem there. You certainly don’t want to be disabling FIQs for something as long as a cache flush and page table manipulation. I wonder how OSes like Linux handle a FIQ routine performing a write to a copy-on-write page for example? However, I’m not quite clear how often a FIQ owner would be calling OS_Memory 0 for a page which its own FIQ handler is accessing though; OS_Memory 0 is mostly useful for DMA, and using DMA and FIQs for the same device sounds unlikely to me.

Mar 11, 2016 5:23pm Jeffrey Lee (213) 6048 posts	I can’t say I’m a massive fan of the idea of restricting IO→RAM DMA to cacheline alignment. Yes, I think I might have worded that poorly. Clearly the key thing is to make sure that if DMA is touching an address it’s not in the same cache line as an address the CPU is touching – so you don’t need to mess with the alignment of the data from the user’s point of view, you just need to redirect the first/last few bytes of the transfer to a bounce buffer. And there’s not necessarily any need for it to be a full sectors worth of data – if you’ve got a decent list-based DMA controller then it should be happy with just transferring a small number of bytes to the bounce buffer to get cache line alignment, followed by the main block straight to the (aligned) destination, then the last few bytes to another bounce buffer. I’m not buying the argument that the way it’s currently done is uniquely a problem for marking a page temporarily uncacheable though. In a multi-core world, you’d still need to clean or invalidate the caches of all cores before doing DMA. If it’s true that doing cache maintenance by address will propagate to other cores, then that will work, but I’d be worried that could potentially be very slow. I believe all multi-core ARMv7+ CPUs have a snoop control unit which manages the coherency of the L1 data caches between the cores, for pages marked as shareable. This includes cache/TLB maintenance operations (if the option is enabled – for the A9 this seems to be hidden away in the CP15 auxiliary control register, but I’d assume all the other cores have equivalent settings). If the SCU can efficiently maintain coherency for read/write accesses between the cores, I’d hope that using it to perform cache maintenance would be more efficient than interrupting the other cores and making them do it all manually. Maybe the core’s memory access will slow down a bit due to increased pressure on its cache, but at least it will still be crunching user code. Of course it’s only shareable pages which the SCU cares about, so if the OS is conservative in how it marks pages as shareable then that will completely avoid any performance impact for the other cores. Hanging off of the SCU there’s also the accelerator coherency port (so external DMA bus masters can be cache-coherent), but I’m not sure if we have any machines which have ACP integrated. I don’t think there are any ARMv6 CPUs which have SCUs. To avoid the need for software coherency, multi-core ARM11 chips actually treat shareable pages as non-cacheable. Edit: Unless you’re more concerned about cache maintenance performance when there’s many megabytes of data to transfer? (i.e. much more than the cache size). In which case, yes, maybe interrupting the cores so they can do a full cache clean would be better.

Mar 11, 2016 8:06pm Ben Avison (25) 445 posts	Unless you’re more concerned about cache maintenance performance when there’s many megabytes of data to transfer? Yes, I was thinking again about filing systems where certain operations (loading large data files to/from RAM, copying large files etc) can routinely involve very large amounts of data, and doing cache maintenance by address over such a large range could be more of a bottleneck than the DMA itself is (or even if it isn’t, it would be tying up the CPU when it could otherwise be doing something more useful, especially if we ever get non-blocking IO). Of course, it remains to be seen where the threshold between addressed and whole cache maintenance being fastest would fall in practice. Interrupting other cores would make it slower than in the single-core world, but I suspect it would still cross over sooner or later.

Mar 26, 2016 12:29pm Jeffrey Lee (213) 6048 posts	Today’s cache maintenance learning: Cortex-A53 generates an “unsupported exclusive access” data abort when LDREX/STREX targets a page which is temporarily unacheable (i.e. Normal, non-cacheable – documented in the TRM). It also aborts if a LDREX/STREX targets a cacheable page while the data cache is disabled (not documented in the TRM?). So both OS_Memory 0 “make temporarily uncacheable” and *Cache Off need to die.

Jul 24, 2016 12:03pm Jeffrey Lee (213) 6048 posts	Some more observations, this time to do with multicore: MVA-based cache/branch predictor maintenance operations for Normal, non-cacheable memory are (generally) not broadcast to other cores. This means our approach of “make non-cacheable, clean cache” will require us to interrupt the other cores so that they can perform maintenance on their own L1 caches. Wrong In addition to the fact that set/way based operations aren’t broadcast to other cores, there’s also the problem that some types of cache (e.g. those which use the MOESI protocol) allow dirty cache lines to change ownership from one cache to another, without requiring the data to be written to main memory. This means that if one core writes data to its cache and then attempts a full cache flush, the data might not actually hit memory because another core had taken ownership of it in the meantime. (I’m also not sure whether a set/way maintenance op which hits a MOESI ‘shared’ cache line would result in a broadcast or not, so even if the original owner knew that another cache now owned it I’m not sure if it would ask the new owner to perform maintenance or not). So if we stick with using set/way based operations we’re also going to need to interrupt the other cores – although I’m not really sure how that would work if the data can theoretically hop between cores at any time. Disabling the caches would presumably prevent the movement of the data, but then you’d have the trouble that loads/stores would bypass the cache, so you’d have to perform the operation with interrupts disabled (ouch), and you’d have to use a non-cacheable area of memory for synchronising the cache clean completion with the other cores.

Jul 26, 2016 11:54am Jon Abbott (1421) 2661 posts	MVA-based cache/branch predictor maintenance operations Are the physical based cacheops broadcast? Broadcasting MVA based wouldn’t be very useful as it assumes the memory map is identical across the cores. This sort of lines up with the other discussion around when the cache should be flushed, in that to support multi-core not only would it be beneficial to flush at release, but to do it based on physical if its broadcast. MOESI Wasn’t the Owned flag added to MESI in ARMv7 so you could tell which core owned the cache entry? And wasn’t the Snoop Control Unit added to maintain consistency across the cores? One thing I do recall is that the SCU only maintains the data cache, so modified or self-modified code consistency might be an issue. SpriteOp and Sound conversion spring to mind.

Jul 26, 2016 12:43pm Jeffrey Lee (213) 6048 posts	Are the physical based cacheops broadcast? There are no physical cache ops. Broadcasting MVA based wouldn’t be very useful as it assumes the memory map is identical across the cores. Data caches are guaranteed to be PIPT, so I’d assume the broadcast form of the operation would use the physical address (as looked up by the originating core) rather than the logical one. For (non-PIPT) instruction caches, considering that for single-core devices the only guaranteed way of making sure all aliases of an address are flushed from the cache is to do a full cache flush, perhaps they send both the virtual and physical address? So if the other core has (or had) a mapping at the same address then it’s guaranteed to find it, but if it has (or had) a mapping at a different address then there’s no guarantee it will be found, just as with the single-core case. Wasn’t the Owned flag added to MESI in ARMv7 so you could tell which core owned the cache entry? And wasn’t the Snoop Control Unit added to maintain consistency across the cores? Yes, but both of those are only any good if you use cache maintenance ops which broadcast to the other cores. To elaborate on the problem I was describing: Core 1 could write some data to its cache Core 2 could take ownership of that data Core 1 could then evict the data from its cache, safe in the knowledge that Core 2 now owns the dirty cache line Set/way based cache flushes (which are the only way of doing a full data/unified cache clean, other than iterating over the entirety of the logical address space) don’t broadcast to other cores. So if a set/way based op is now used to clean the cache, it won’t find the data that was moved to core 2’s cache. If you use a ranged cache flush then it’ll be fine, because it will get broadcast to all the cores (regardless of whether the data is in the local core’s cache, IIRC), so as long as the core hasn’t been removed from the cluster the data will get flushed correctly. One thing I do recall is that the SCU only maintains the data cache, so modified or self-modified code consistency might be an issue. SpriteOp and Sound conversion spring to mind. The basic rule is the same as since the StrongARM came out. Flush the data cache to the point of coherency, then invalidate the I cache for the region. There are broadcast cache operations for both of these.

Jul 27, 2016 3:07am Jon Abbott (1421) 2661 posts	Regards set/way, I’m sure I read somewhere that it can only be used to bring the machine up/down and that MVA should be used at all other times in multi-core. ARM state their cores are cache coherent, so provided core 1 cleans to L2 (assuming L2 is shared across the cores, which is obviously platform dependent – I think in ARMish its the Point of Unification or something similar), core 2 will simply refresh from L2 and take Ownership. In which case, core 1 doesn’t need to worry about core 2’s L1. Didn’t ARM also introduce non-privileged cache cleaning in ARMv8 to assist with this.

Jul 28, 2016 9:07am Jon Abbott (1421) 2661 posts	Whilst looking through the Cortex-A9 r4 erratum I noticed an errata that comes into play here: 764369 Data or unified cache line maintenance by MVA fails on Inner Shareable memory* – Under certain timing circumstances, a data or unified cache line maintenance operation by MVA that targets an Inner Shareable memory region might fail to propagate to either the Point of Coherency or to the Point of Unification of the system. It looks like it impacts external memory access (DMA etc), as the L1’s remain coherent, L2 however may not. The workaround is interesting as it entails using an undocumented control register that disables the migratory feature (which I believe is what you’re concerned about), forcing propagation to PoC and PoU when another processor reads the location.

Jul 28, 2016 12:31pm Jeffrey Lee (213) 6048 posts	Regards set/way, I’m sure I read somewhere that it can only be used to bring the machine up/down and that MVA should be used at all other times in multi-core. Yes, I believe that’s true. Didn’t ARM also introduce non-privileged cache cleaning in ARMv8 to assist with this. Correct (only available in AArch64 state) It looks like it impacts external memory access (DMA etc), as the L1’s remain coherent, L2 however may not. I think it would affect writing code too – if it fails to propagate to the point of unification then the I-cache won’t be able to see the new instructions.

Sep 13, 2016 12:45pm Jeffrey Lee (213) 6048 posts	MVA-based cache/branch predictor maintenance operations for Normal, non-cacheable memory are (generally) not broadcast to other cores. I think I may be wrong about this – re-reading the ARM ARM last night (and looking at the older v7 rev B version for further clarity – the rev C text seems like it misses several possible configurations from its explanation) suggests that under all circumstances MVA-based operations on Normal, non-cacheable memory will broadcast to other cores. But there is some variance between different CPUs in terms of which shareability domain will be used – whether it uses the domain indicated in the page tables, or whether it always uses the outer shareable domain. So as long as the page tables are correct you can guarantee that the operation will affect the desired shareability domain (the outer shareabiltiy domain implicitly contains the inner shareability domain of the CPU making the request), and the only difference is whether the CPU will “over-share” the request and potentially cause unintentional cache maintenance to be performed for other CPUs (which could cause problems if you’re e.g. invalidating cache lines without writing them back, but for us it’s probably OK)

Oct 19, 2016 11:09am Jeffrey Lee (213) 6048 posts	I have a feeling that part of this may involve expanding the kernel’s understanding of multiple tasks to support multiple threads I don’t think that fixing cache management will require the kernel to be taught about threads. But, fixing cache management is an important pre-requisite for allowing useful multi-core.