Cache maintenance learnings
Jeffrey Lee (213) 6048 posts |
Recently I’ve been looking at some cache maintenance issues which seem to be the root cause of the SATA issues seen on IGEPv5 and (to a lesser extent) Titanium. From looking at how the OS currently handles cache maintenance, and how many current CPUs implement their caches, it’s becoming pretty clear to me that the OS is operating in a manner that doesn’t work very well on modern machines (and even some old ones). In the short term I should soon have a fix available that will make SATA happy, but longer-term we should look to change the way we’re dealing with cache maintenance within the OS – that’s what this thread is for.

Key problems
Potential solutions

Note that I’m trying to only consider solutions which will work well for multi-core.

OS_Memory 0

We almost certainly want to deprecate the OS_Memory 0 “make temporarily uncacheable” operation and start using a different approach for DMA. Potentially we could just go with the following approach:
(edit – this seems to be ARM’s preferred way of doing things)

Physically tagged caches

The solution which I’ve been thinking of for some time is to make it so that the CAM retains the cacheability attributes of pages which are mapped out. So on systems with physically tagged caches (all ARMv7 IIRC, maybe ARMv6 too?) it can avoid flushing the (data) cache when mapping a page out, including when the page changes owner (e.g. gets put back into the free pool). If there’s a page in the free pool which is marked as cacheable, and it’s due to be added to a non-cacheable DA, then after the page is mapped in to its new location we can perform an invalidate operation. For PMPs which are claiming a page but not mapping it in yet, this would probably require the kernel to make a temporary mapping to another location like the physical access window. The observant of you will remember that ARM11 ignores address-based cache maintenance ops on non-cacheable memory. So we’d probably want the “make page uncacheable” logic to be in an ARMop-style utility routine so that we can implement different workarounds for different CPUs as required. Possibly it’s only ARMv6/ARM11 which will suffer from this, since I don’t think there are any older CPUs which use physically tagged caches.

Dynamic areas claiming specific pages

This is a tricky one. The current solution (disable interrupts around the page replacement code) is fine for single-core but will obviously fail for multi-core. Perhaps for this the only workable solution would be to temporarily halt the other cores – at least for situations where the page is marked as shareable. Considering that we’d need to deal with situations where a page is mapped in on one core but not another, we’d probably need to make a cross-core call anyway (e.g. if core A requests a page that’s currently being used by just core B).

Set/way based operations not working with multi-core

Whenever RISC OS needs to do cache maintenance it generally looks at the size of the area and makes a judgement call (based on the value returned by the Cache_RangeThreshold ARMop) as to whether a ranged clean or a full clean would be fastest. So the obvious solution here would be to get rid of that code so that only the ranged clean is used. But for older or single-core machines where multi-core safety isn’t an issue we’d probably want to keep the old logic present – I guess we can just fudge it by making the Cache_RangeThreshold ARMop return 4G-1, so the runtime checks will still be there, but the optimisation will never be taken? (See the sketch after this post.)

Anyone else have any useful insights on how things should/shouldn’t be done? |
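As a rough illustration of the Cache_RangeThreshold fudge described in the post above, the decision might look something like the sketch below. The names armop_cache_rangethreshold(), armop_clean_range() and armop_clean_all() are hypothetical stand-ins for the kernel’s ARMop vectors, not real RISC OS entry points:

```c
#include <stddef.h>
#include <stdint.h>

extern uint32_t armop_cache_rangethreshold(void);          /* fudged to 0xFFFFFFFF (4G-1) on multi-core builds */
extern void     armop_clean_range(void *start, void *end); /* MVA-based ops, broadcast by the hardware */
extern void     armop_clean_all(void);                     /* set/way ops, visible to the local core only */

static void clean_for_dma(void *start, size_t length)
{
    /* With the threshold fudged to 4G-1 the first branch is effectively always
     * taken, so the set/way path stays in the binary but is never exercised. */
    if (length < armop_cache_rangethreshold())
        armop_clean_range(start, (char *)start + length);
    else
        armop_clean_all();
}
```

The runtime check survives unchanged for older or single-core machines, which is the point of fudging the threshold rather than deleting the code.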
Rick Murray (539) 13839 posts |
The small question is – how does Linux do it? A multi core system with multiple processes and threads must have an interesting way of shuffling it all around. I have a feeling that part of this may involve expanding the kernel’s understanding of multiple tasks to support multiple threads (may help with USB/network updates) so maybe this may provide ideas for future directions? |
Colin (478) 2433 posts |
My understanding of cache complexities is rudimentary but I have a few questions/thoughts.

Quote
(edit – this seems to be ARM’s preferred way of doing things)
EndQuote

Isn’t the above only required if the memory used for DMA is cacheable? The link says nothing about an invalidate being required both before and after DMA writes – do you have a link which discusses this point? I can’t find one yet. If non-cacheable memory is used for DMA there is only a problem with cache coherency if the non-cacheable memory was previously cacheable and so may have cache hits – memory that has always been non-cacheable doesn’t have cache hits, does it? So if the OS used a non-cacheable and a cacheable memory pool it would require no cache maintenance. You would only need cache maintenance if you moved memory from the cached pool to the non-cached pool. |
Jeffrey Lee (213) 6048 posts |
I haven’t really looked at Linux’s memory management in any depth, but I get the feeling that (a) like RISC OS it suffers from its own legacy design issues, and (b) I’m not sure if it supports claiming pages that are in active use by someone else. I’m always mindful of the fact that on OMAP boards you have to tell the Linux kernel how much memory to reserve for the GPU/framebuffer – Linux was born on a platform where the GPU had dedicated memory, whereas RISC OS was born on a platform with unified memory, so the ability to claim a page which is in use by someone else (in order to extend screen memory) was never a design concern for them. Possibly this situation has improved (I’m mainly thinking back to 2009 when I started the BeagleBoard port), but considering that I’m consistently impressed by how poor the desktop Linux experience is on ARM, I highly doubt it. On OMAP3, OMAP4, OMAP5 and iMX6 – whenever I’ve had reason to check – none of the default Linux distributions had the ability to change the screen resolution from the screen settings in their configure app, and none of them seemed to be configured to use EDID; you had to manually select a mode in the u-boot config. I don’t think I’ve checked Raspbian, and chances are you probably can change the resolution there, but for one reason or another I’ve always found desktop Linux on ARM to be a pretty terrible user experience.
Correct.
The first invalidate is needed if you aren’t sure what the memory has been used for beforehand. In most cases it’ll be a writable page which you’ll be DMA-ing to, so you’d need to either clean or invalidate the cache to make sure any dirty cache lines don’t get written back over the incoming DMA data. If the page is read-only, or write-through cacheable, this won’t be required.
Correct. However, using non-cacheable memory for DMA can cause you to lose out on one of the main performance benefits – most file I/O operations specify a buffer in the client’s memory to use for the source/destination data. If your DMA can target that memory directly then it will keep the amount of data being moved around to a minimum and you’ll get the best performance. If you have to have a CPU routine which copies the data to/from a temporary non-cacheable buffer (also known as a “bounce buffer”) then that will slow things down quite a bit (generally more than the cost of the cache maintenance), and may end up being slower than filling/emptying the hardware FIFO manually (it’s the difference between copying the data once vs. copying the data twice). |
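To make the sequence being discussed concrete, here is a minimal sketch of address-based maintenance around an IO→RAM transfer straight into a client buffer, assuming an ARMv7 core and GCC inline assembly. It is only an illustration of the technique, not the DMAManager code:

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest data cache line size, from the Cache Type Register (DminLine, bits 19:16). */
static size_t dcache_line_size(void)
{
    uint32_t ctr;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 1" : "=r"(ctr));
    return 4u << ((ctr >> 16) & 0xF);
}

/* Clean+invalidate the buffer before starting an IO->RAM transfer, so that no
   dirty line can later be evicted on top of the incoming DMA data. */
static void dma_prepare_read(void *buf, size_t len)
{
    size_t line = dcache_line_size();
    uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)(line - 1);
    uintptr_t end  = (uintptr_t)buf + len;

    for (; addr < end; addr += line)
        __asm__ volatile("mcr p15, 0, %0, c7, c14, 1" :: "r"(addr) : "memory"); /* DCCIMVAC */
    __asm__ volatile("dsb" ::: "memory");
}

/* Invalidate again once the transfer has completed, in case speculation pulled
   stale copies of the buffer back into the cache while the DMA was running. */
static void dma_complete_read(void *buf, size_t len)
{
    size_t line = dcache_line_size();
    uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)(line - 1);
    uintptr_t end  = (uintptr_t)buf + len;

    for (; addr < end; addr += line)
        __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(addr) : "memory");  /* DCIMVAC */
    __asm__ volatile("dsb" ::: "memory");
}
```

The clean-and-invalidate beforehand deals with dirty lines; the second invalidate afterwards deals with lines that speculative prefetching may have pulled back in while the DMA was running, which is the point raised in the next post. Note that invalidating a partial line at either end of the buffer would discard whatever else shares that line – exactly the alignment hazard discussed further down the thread.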
Colin (478) 2433 posts |
It was the second invalidate I had problems with. I couldn’t work out why there would be a prefetch, but my best guess so far is that these CPUs can do speculative prefetches, so a cache read from memory could happen before the memory is written to by DMA, making the cache incoherent by the time the DMA has finished. |
Jeffrey Lee (213) 6048 posts |
Correct. |
Ben Avison (25) 445 posts |
Just for the record, I feel I should point out that RISC OS’s cache maintenance around DMA has changed before, at RISC OS 3.7. This was because StrongARM introduced write-back caching; prior to that, there was no cache maintenance needed in the RAM→IO direction. I believe all that was really needed at that point to support StrongARM was a cache clean. Making DMAManager set pages temporarily uncacheable in the RAM→IO direction as well as in the IO→RAM direction was always an easy solution, because it could be implemented by simply removing the check inside DMAManager for which direction we’re going in and doing it unconditionally. The alternative would have involved creating a new kernel API to expose cache clean functionality. Not having been involved at the time, I don’t know how much this was down to laziness, versus time pressure, or even cautiousness. Bear in mind, even within Acorn, there were a limited number of engineers competent and confident enough to work on such low-level code.

I can’t say I’m a massive fan of the idea of restricting IO→RAM DMA to cacheline alignment. This might be OK for some things like sound buffers, but any DMA that’s used to implement file transfers is quite likely to fall foul of that. Just think of the typical use case when an application loads or saves a file: typically it’ll be to or from a heap block in memory, and whilst these are invariably word-aligned, it’s rare that a heap manager deals in cacheline-aligned blocks (in fact OS_Heap never does). I doubt much software bothers to round its heap blocks up to cacheline boundaries either. Also, typically on the IO side you can only seek with sector-size granularity, so it’s not like there’s the option to do the first sub-cacheline part via a bounce buffer and then the rest of the transfer by DMA, because subsequent sectors will be just as unaligned as the first was.

Here’s a rather ugly workaround if the transfer starts mid-cacheline:
(with similar special treatment for where transfers end mid-cacheline, of course). To avoid the danger of the bounce buffer itself being remapped during DMA, it’s probably best to always allocate it from IO memory (via the PCIManager module).

I’m not buying the argument that this is uniquely a problem for the way it’s currently done (marking a page temporarily uncacheable), though. In a multi-core world, you’d still need to clean or invalidate the caches of all cores before doing DMA. If it’s true that doing cache maintenance by address will propagate to other cores, then that will work, but I’d be worried that it could potentially be very slow. It might be better to issue some sort of inter-core interrupt to get the attention of all of them, and request that they each do a full cache clean/invalidate. That mechanism would be just as easy to use in order to request page table updates.

FIQs: I can see there’s a problem there. You certainly don’t want to be disabling FIQs for something as long as a cache flush and page table manipulation. I wonder how OSes like Linux handle a FIQ routine performing a write to a copy-on-write page, for example? I’m not quite clear how often a FIQ owner would be calling OS_Memory 0 for a page which its own FIQ handler is accessing, though; OS_Memory 0 is mostly useful for DMA, and using DMA and FIQs for the same device sounds unlikely to me. |
Jeffrey Lee (213) 6048 posts |
Yes, I think I might have worded that poorly. Clearly the key thing is to make sure that if DMA is touching an address, it’s not in the same cache line as an address the CPU is touching – so you don’t need to mess with the alignment of the data from the user’s point of view, you just need to redirect the first/last few bytes of the transfer to a bounce buffer. And there’s not necessarily any need for it to be a full sector’s worth of data – if you’ve got a decent list-based DMA controller then it should be happy with just transferring a small number of bytes to the bounce buffer to get cache line alignment, followed by the main block straight to the (aligned) destination, then the last few bytes to another bounce buffer.
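A sketch of the kind of descriptor-list split described here might look like the following. The types, the fixed 64-byte line size and the pre-allocated uncacheable bounce buffers are all hypothetical, and a real implementation would read the line size from the CPU and translate the addresses to physical ones for the controller:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size; real code would read DminLine from the CTR */

struct dma_seg { uintptr_t addr; size_t len; };

/* Split a client buffer into up to three descriptor-list entries: an unaligned
   head redirected to a bounce buffer, a cache-line-aligned middle transferred
   directly, and an unaligned tail redirected to a second bounce buffer.
   head_bounce/tail_bounce are suitably sized, uncacheable scratch buffers. */
static int build_read_list(uintptr_t buf, size_t len,
                           uintptr_t head_bounce, uintptr_t tail_bounce,
                           struct dma_seg seg[3])
{
    uintptr_t aligned_start = (buf + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t aligned_end   = (buf + len) & ~(uintptr_t)(CACHE_LINE - 1);
    int n = 0;

    if (aligned_start >= aligned_end) {
        /* The whole transfer lies within one or two cache lines: bounce it all. */
        seg[n++] = (struct dma_seg){ head_bounce, len };
        return n;
    }
    if (buf != aligned_start)
        seg[n++] = (struct dma_seg){ head_bounce, aligned_start - buf };        /* unaligned head */
    seg[n++] = (struct dma_seg){ aligned_start, aligned_end - aligned_start };  /* aligned middle, direct */
    if (buf + len != aligned_end)
        seg[n++] = (struct dma_seg){ tail_bounce, (buf + len) - aligned_end };  /* unaligned tail */
    return n;
}
```

After the transfer completes the CPU copies the head/tail bounce contents into the client buffer; only the aligned middle segment needs cache maintenance on the client’s own pages.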
I believe all multi-core ARMv7+ CPUs have a snoop control unit which manages the coherency of the L1 data caches between the cores, for pages marked as shareable. This includes cache/TLB maintenance operations (if the option is enabled – for the A9 this seems to be hidden away in the CP15 auxiliary control register, but I’d assume all the other cores have equivalent settings). If the SCU can efficiently maintain coherency for read/write accesses between the cores, I’d hope that using it to perform cache maintenance would be more efficient than interrupting the other cores and making them do it all manually. Maybe the core’s memory access will slow down a bit due to increased pressure on its cache, but at least it will still be crunching user code. Of course it’s only shareable pages which the SCU cares about, so if the OS is conservative in how it marks pages as shareable then that will completely avoid any performance impact for the other cores. Hanging off of the SCU there’s also the accelerator coherency port (so external DMA bus masters can be cache-coherent), but I’m not sure if we have any machines which have ACP integrated. I don’t think there are any ARMv6 CPUs which have SCUs. To avoid the need for software coherency, multi-core ARM11 chips actually treat shareable pages as non-cacheable. Edit: Unless you’re more concerned about cache maintenance performance when there’s many megabytes of data to transfer? (i.e. much more than the cache size). In which case, yes, maybe interrupting the cores so they can do a full cache clean would be better. |
Ben Avison (25) 445 posts |
Yes, I was thinking again about filing systems, where certain operations (loading or saving large data files, copying large files, etc.) can routinely involve very large amounts of data, and doing cache maintenance by address over such a large range could be more of a bottleneck than the DMA itself is (or even if it isn’t, it would be tying up the CPU when it could otherwise be doing something more useful, especially if we ever get non-blocking IO). Of course, it remains to be seen where the threshold between addressed and whole-cache maintenance being fastest would fall in practice. Interrupting other cores would make it slower than in the single-core world, but I suspect it would still cross over sooner or later. |
Jeffrey Lee (213) 6048 posts |
Today’s cache maintenance learning: Cortex-A53 generates an “unsupported exclusive access” data abort when LDREX/STREX targets a page which is temporarily uncacheable (i.e. Normal, non-cacheable – documented in the TRM). It also aborts if an LDREX/STREX targets a cacheable page while the data cache is disabled (not documented in the TRM?). So both OS_Memory 0 “make temporarily uncacheable” and *Cache Off need to die. |
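For context, the kind of code that trips this abort is a perfectly ordinary exclusive-access read-modify-write, sketched here with GCC inline assembly:

```c
#include <stdint.h>

/* A standard LDREX/STREX increment.  If 'counter' lives in a page that has
   been made Normal, non-cacheable (e.g. by OS_Memory 0), a Cortex-A53 raises
   an "unsupported exclusive access" data abort on the exclusive access. */
static void atomic_increment(volatile uint32_t *counter)
{
    uint32_t value, failed;
    do {
        __asm__ volatile(
            "ldrex %0, [%2]\n\t"
            "add   %0, %0, #1\n\t"
            "strex %1, %0, [%2]"
            : "=&r"(value), "=&r"(failed)
            : "r"(counter)
            : "memory");
    } while (failed);
}
```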
Jeffrey Lee (213) 6048 posts |
Some more observations, this time to do with multicore:
|
Jon Abbott (1421) 2651 posts |
Are the physical-based cache ops broadcast? Broadcasting MVA-based ops wouldn’t be very useful, as it assumes the memory map is identical across the cores. This sort of lines up with the other discussion around when the cache should be flushed, in that to support multi-core not only would it be beneficial to flush at release, but to do it based on physical address if that’s what gets broadcast.
Wasn’t the Owned flag added to MESI in ARMv7 so you could tell which core owned the cache entry? And wasn’t the Snoop Control Unit added to maintain consistency across the cores? One thing I do recall is that the SCU only maintains the data cache, so modified or self-modifying code consistency might be an issue. SpriteOp and Sound conversion spring to mind. |
Jeffrey Lee (213) 6048 posts |
There are no physical cache ops.
Data caches are guaranteed to be PIPT, so I’d assume the broadcast form of the operation would use the physical address (as looked up by the originating core) rather than the logical one. For (non-PIPT) instruction caches, considering that for single-core devices the only guaranteed way of making sure all aliases of an address are flushed from the cache is to do a full cache flush, perhaps they send both the virtual and physical address? So if the other core has (or had) a mapping at the same address then it’s guaranteed to find it, but if it has (or had) a mapping at a different address then there’s no guarantee it will be found, just as with the single-core case.
Yes, but both of those are only any good if you use cache maintenance ops which broadcast to the other cores. To elaborate on the problem I was describing:
If you use a ranged cache flush then it’ll be fine, because it will get broadcast to all the cores (regardless of whether the data is in the local core’s cache, IIRC), so as long as the core hasn’t been removed from the cluster the data will get flushed correctly.
The basic rule is the same as it has been since the StrongARM came out. Flush the data cache to the point of coherency, then invalidate the I-cache for the region. There are broadcast cache operations for both of these. |
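As a reminder of what that rule looks like in practice, here is a hedged ARMv7 sketch (GCC inline assembly; the hypothetical ‘line’ argument should be the smaller of the I- and D-cache line sizes) of cleaning the data cache to the point of coherency and then invalidating the instruction cache for the range. On an SMP system with maintenance broadcasting enabled, these MVA-based operations are also sent to the other cores in the shareability domain:

```c
#include <stddef.h>
#include <stdint.h>

/* Make newly written code in [start, end) visible to instruction fetch:
   clean the D-cache by MVA to the point of coherency, then invalidate the
   I-cache and branch predictor, with barriers in between. */
static void sync_code_range(uintptr_t start, uintptr_t end, size_t line)
{
    uintptr_t addr;

    for (addr = start & ~(uintptr_t)(line - 1); addr < end; addr += line)
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(addr) : "memory"); /* DCCMVAC */
    __asm__ volatile("dsb" ::: "memory");

    for (addr = start & ~(uintptr_t)(line - 1); addr < end; addr += line)
        __asm__ volatile("mcr p15, 0, %0, c7, c5, 1" :: "r"(addr) : "memory");  /* ICIMVAU */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0) : "memory");         /* BPIALL */
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb" ::: "memory");
}
```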
Jon Abbott (1421) 2651 posts |
Regarding set/way, I’m sure I read somewhere that it can only be used to bring the machine up/down and that MVA should be used at all other times in multi-core. ARM state their cores are cache coherent, so provided core 1 cleans to L2 (assuming L2 is shared across the cores, which is obviously platform dependent – I think in ARMish it’s the Point of Unification or something similar), core 2 will simply refresh from L2 and take Ownership. In which case, core 1 doesn’t need to worry about core 2’s L1. Didn’t ARM also introduce non-privileged cache cleaning in ARMv8 to assist with this? |
Jon Abbott (1421) 2651 posts |
Whilst looking through the Cortex-A9 r4 errata I noticed an erratum that comes into play here:
It looks like it impacts external memory access (DMA etc.), as the L1s remain coherent but the L2 may not. The workaround is interesting, as it entails using an undocumented control register that disables the migratory feature (which I believe is what you’re concerned about), forcing propagation to PoC and PoU when another processor reads the location. |
Jeffrey Lee (213) 6048 posts |
Yes, I believe that’s true.
Correct (only available in AArch64 state)
I think it would affect writing code too – if it fails to propagate to the point of unification then the I-cache won’t be able to see the new instructions. |
Jeffrey Lee (213) 6048 posts |
I think I may be wrong about this – re-reading the ARM ARM last night (and looking at the older v7 rev B version for further clarity – the rev C text seems to miss several possible configurations in its explanation) suggests that under all circumstances MVA-based operations on Normal, non-cacheable memory will broadcast to other cores. But there is some variance between different CPUs in terms of which shareability domain will be used – whether it uses the domain indicated in the page tables, or whether it always uses the outer shareable domain. So as long as the page tables are correct you can guarantee that the operation will affect the desired shareability domain (the outer shareability domain implicitly contains the inner shareability domain of the CPU making the request), and the only difference is whether the CPU will “over-share” the request and potentially cause unintentional cache maintenance to be performed for other CPUs (which could cause problems if you’re e.g. invalidating cache lines without writing them back, but for us it’s probably OK). |
Jeffrey Lee (213) 6048 posts |
I don’t think that fixing cache management will require the kernel to be taught about threads. But fixing cache management is an important prerequisite for allowing useful multi-core. |