Extending OS_MMUControl
Jon Abbott (1421) 2651 posts |
Following on from this thread and now this thread, both of which require CPU-specific cache operations, I’d like to kick off a formal discussion about exposing some of the ARMops via SWI. The need for low-level cache ops in drivers and the like is increasing with every SoC release. Having spent weeks essentially duplicating everything that’s already in RISCOS, I wouldn’t want anyone else to go through the pain and frustration of the intricacies of cross-CPU cache/barrier operations. My proposal is to extend OS_MMUControl to allow single/ranged memory, TLB and barrier operations:
OS_MMUControl 2 – Ranged/Single cache flush request
OS_MMUControl 3 – Barrier operations
OS_MMUControl 4 – Prefetch operations
This would add full flexibility to cache ops and make a lot of the internal RISCOS ARMops publicly available. OS_SynchroniseCodeAreas duplicates some of this functionality, so it might be worth making it a veneer to OS_MMUControl and centralising all cache/TLB ops in one place.
Jeffrey Lee (213) 6048 posts |
A few quick thoughts (will have a proper look later):
OS_SynchroniseCodeAreas is just a veneer to an ARMop call, so there isn’t really much point in adding an extra step of indirection.
Jon Abbott (1421) 2651 posts |
I added it at the last minute for completeness, I can’t think of a use for it either unless adding/locking code into the cache by line.
Looks like I didn’t save my final edit; have now corrected. The scenario in my case is to precache and lock the Abort veneer in ADFFS into the instruction cache, so it can quickly determine if code is self-modifying. At the 40-100k Aborts a second that some games produce, locking the two lines it would require will substantially reduce the performance hit.
Jeffrey Lee (213) 6048 posts |
Here’s the doc describing the existing ARMops: https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Kernel/Docs/HAL/Attic/ARMop_API?rev=1.1.2.2;content-type=text%2Fx-cvsweb-markup;hideattic=0#l136 As you can see, there aren’t really any ranged cache operations, apart from those done implicitly by the MMU ARMops. So that will cause some issues for OS_MMUControl 2. Another issue with OS_MMUControl 2, if it’s aiming to be a general-purpose cache maintenance SWI, is that it doesn’t say which cache level(s) should be affected by the operation. Rather than have APIs which require the cache levels to be specified explicitly, the ARM ARM lists two different ‘points of interest’ which general-purpose cache maintenance ops should aim to target:
At the moment the cache & MMU ARMops do cache maintenance to the PoC (Point of Coherency), while the IMB ARMops only do cache maintenance to the PoU (Point of Unification). For MMU ops the behaviour of flushing to PoC is often wasteful (99% of MMU ops will be for task swapping in the Wimp and so PoU will suffice), so it’s one of the things on my “would be nice to fix” list.
It might be worth looking at the ops you’re performing (and why) and seeing if they fit with the existing ARMops before we spend time implementing a bunch of new ops for this SWI.
For OS_MMUControl 4 – if you’re adding the ability to lock cache lines, I think it would be a good idea to allow unlocking as well ;-)
Other than that it sounds like a good addition (even if it does mean new ARMops we (I) would have to implement!)
Jon Abbott (1421) 2651 posts |
Everything you say makes sense to me.
When looking through the BCM source, I’m certain I spotted ranged ops that use the new MCRR instructions – I lifted them and tried them in ADFFS. Anyhow, it doesn’t really matter; ranged is going to be essential for efficient cache coherency. Thankfully the days of full cache flushes on 80321/SA are long gone :)
Definitely avoid specifying cache levels; it’s too processor-specific. Where there are two in a particular processor and OS_SynchroniseCodeAreas is called, for example, it should handle both caches if required. The idea behind the changes I’m proposing is to offer more granularity around what’s flushed but obfuscate the bulk of the low-level ARMop actions that need to happen, and avoid userland code doing any kind of direct ARMop on the CPU.
Agree with you there, I’m not familiar enough with them as they’re not publicly exposed, so some of what I require may already be there. ADFFS performs the following cache operations:
As you can see, I’ve had to make some processor-specific. With the changes proposed above, the SWIs can determine if the processor supports individual/ranged D/I actions and act accordingly, so if Invalidate I entry is called on a StrongARM, for example, the SWI will perform Invalidate I instead. What I’d also like to do, but have held off for the time being, is to lock the core Abort handler code into the I cache, as it’s hit potentially millions of times a second. I’d like to implement this when I code the misaligned memory action handler for ARMv7.
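The fallback described here (a request for a fine-grained operation being widened to whatever the CPU actually supports) can be modelled as a simple capability lookup. This is only an illustrative sketch in Python, not RISC OS code: the `CacheOp` names, the `COARSER` table and the example capability sets are all hypothetical.

```python
from enum import Enum, auto


class CacheOp(Enum):
    """Hypothetical instruction-cache operations, finest to coarsest."""
    INVALIDATE_I_ENTRY = auto()   # a single cache line
    INVALIDATE_I_RANGE = auto()   # a range of addresses
    INVALIDATE_I_FULL = auto()    # the whole instruction cache


# Next coarser operation to try when the requested one is unsupported.
COARSER = {
    CacheOp.INVALIDATE_I_ENTRY: CacheOp.INVALIDATE_I_RANGE,
    CacheOp.INVALIDATE_I_RANGE: CacheOp.INVALIDATE_I_FULL,
}


def resolve_op(requested, supported):
    """Return the finest supported op that still covers the request."""
    op = requested
    while op not in supported:
        if op not in COARSER:
            raise ValueError(f"no supported fallback for {requested.name}")
        op = COARSER[op]
    return op


# Assumed capability sets, for illustration only.
FULL_ONLY_CPU = {CacheOp.INVALIDATE_I_FULL}
PER_LINE_CPU = {CacheOp.INVALIDATE_I_ENTRY, CacheOp.INVALIDATE_I_FULL}
```

With these assumed sets, a request for `INVALIDATE_I_ENTRY` resolves to `INVALIDATE_I_FULL` on the full-only CPU and is passed through unchanged on the per-line one, which is the StrongARM behaviour described above.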
Good point, now corrected.
Sorry about that, I honestly feel bad that you’re the only person that can pick this sort of thing up. From searching the OS source code, it’s littered with OS_SynchroniseCodeAreas calls, and I’m sure the additions I’m proposing could improve the efficiency of some of those by switching to more appropriate cleans/flushes.
Jeffrey Lee (213) 6048 posts |
It sounds like 1-4 in your list should be replaceable with a single OS_SynchroniseCodeAreas. 5 is equivalent to the TLB_InvalidateEntry ARMop, so we could easily expose that via OS_MMUControl. 6 will almost certainly only see useful benefits if you use the PLI instruction directly. Neither XScale nor ARM11 appear to have CP15 ops for instruction preload, so you only really need to deal with two cases: pre-ARMv7 where PLI isn’t available, and ARMv7 where PLI is. Although if you’re lucky PLI might be interpreted as a NOP (or PLD) on ARMv5/v6 (and on <=ARMv4 it’ll definitely NOP due to being an unconditional instruction). 7 & 8 also sound like they should be using OS_SynchroniseCodeAreas. Note that although you might think you can get away with doing less work than OS_SynchroniseCodeAreas does when dealing with certain behaviour patterns (e.g. avoid cleaning I cache if you know the code hasn’t been executed before), the ARM ARM makes it clear that (for ARMv7, at least) the only way of guaranteeing that the CPU won’t spuriously precache some code/data is to make sure it’s marked as non-cacheable or to make sure the access permissions prevent the (privileged/unprivileged) access from succeeding (see section B2.2.2, “Cache behavior”. Also, I’ve just realised that the ARM ARM uses American spelling for behaviour, colour, and presumably most other words. Yuck!)
Funnily enough, I suspect all of those are used in places where new instructions have (potentially) been written to memory, and so use of OS_SynchroniseCodeAreas is completely appropriate ;-)
Jon Abbott (1421) 2651 posts |
I opted to go direct to avoid the SWI overhead, considering it’s within the Abort handler and is called potentially millions of times per second. Unless the entry point for OS_SynchroniseCodeAreas were exposed directly so it could LDR PC to it, I’ll probably leave these as-is for the moment. It’s an area I need to revisit when I move onto ARMv7 support, as previously mentioned. Would exposing the OS_SynchroniseCodeAreas address be a possibility?
I must have misread the TRM here then, I find ARM’s documents incredibly difficult to distinguish which variation of which chip supports which ops. On p3-71 of the ARM1176JZF-S Rev. r0p7 TRM is this:
That is the ARM core in the Pi, isn’t it?
7 & 8 also sound like they should be using OS_SynchroniseCodeAreas.
I hear what you’re saying. One SWI is probably quicker in this instance anyhow for ARMv5+, and I’ve already switched it to OS_SynchroniseCodeAreas in the next Pi release. However, on StrongARM/80321, repeatedly flushing small blocks of D cache and then doing one final I invalidation would be a lot more efficient than repeated I invalidation, due to the cache flush overhead. So my line of thinking here is to improve performance on legacy CPUs, not ARM11+.
7/8 are called potentially 128 times per JIT code pass, so 128 D flushes for the codelets and one final I flush on exit is how I’d like to code it, although at the minute they both call OS_SynchroniseCodeAreas. The codelet area is separate from everything else, all codelets are cache aligned and it’s cleaned when codelets are removed, so I believe I can avoid I cache flushes. I’ll do some testing on StrongARM over the next few days and see how reliable it is in a soak test.
My recent rewrite to use OS_SynchroniseCodeAreas was to add Iyonix support, where I thought, as you mention here, that it could be predictive caching ahead that was causing the problems I was seeing. Sadly it’s still broken, so I think I have a compiler issue that’s unrelated to caching. Where I was previously using MCR to clean D / I separately it was working correctly on the Pi.
I withdraw my PM for that patronising comment ;) … seriously for a minute, I’m aware of why the calls are there; some just seemed oddly placed and went against the usage advice in the SWI documentation. Perhaps the SWI documentation needs rewording slightly to not preclude IRQ handlers.
EDIT: Forgot to add how I’ve coded this in ADFFS. In the upcoming release, it builds up a heap of cache ops to perform before exit. These detail the range and the flush or clean required: I, D or I&D. Prior to handing back execution, it scans the heap and decides if it’s quicker to perform full cache flushes or lots of D/I cache flushes, so I have flexibility around what’s flushed and can optimise for performance. For example, if it writes 16KB of instructions, it doesn’t need to clean the ranges that have already dropped out of the cache. I’ve not built that up fully yet, but intend to scan it backwards until it hits a point the cache is known not to contain. How I determine that cut-off I don’t know yet, but the option is there should I wish to add it.
At the moment it simply calls OS_SynchroniseCodeAreas repeatedly for each heap entry, although as the heap is built up, if it hits a threshold (currently 1KB of pending ranges) it stops adding to the heap and simply triggers one full cache flush on exit. Due to the way (ADFFS’) branch prediction/walking works and the way code in general triggers the JIT, there can be lots of small runs of code both in the codelet space and Appspace. From the tests I’ve done, the bulk of the flush operations are small D / I MVA ranges. The split is 1xxx:1 (ranged vs full) and higher, and can be tuned by altering both the branch prediction level and the max number of instructions the JIT will handle in one entry. This really only affects games on the 1st pass though; on the initial pass there are large chunks of code being written out and it will issue full cache flushes/invalidation. On 2nd+ it’s simply picking up conditional branches and subroutines not previously seen.
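The deferred scheme described in the EDIT (queue ranged ops, give up and request one full flush once the pending total reaches a threshold) can be sketched roughly as below. This is a hypothetical Python model and not the ADFFS implementation: the class and method names are invented, and treating "hits a threshold" as "greater than or equal to 1024 bytes" is an assumption.

```python
FULL_FLUSH_THRESHOLD = 1024  # bytes of pending ranges (the "1kb" from the post)


class PendingCacheOps:
    """Hypothetical queue of cache ops to perform before handing back execution."""

    def __init__(self, threshold=FULL_FLUSH_THRESHOLD):
        self.threshold = threshold
        self.ranges = []          # (start, end, kind); end is exclusive
        self.pending_bytes = 0
        self.full_flush = False

    def add(self, start, end, kind):
        """Queue a ranged op; kind is "I", "D" or "I&D"."""
        if self.full_flush:
            return                # already committed to one full flush
        self.pending_bytes += end - start
        if self.pending_bytes >= self.threshold:
            # Too much pending work: stop tracking ranges altogether.
            self.full_flush = True
            self.ranges.clear()
        else:
            self.ranges.append((start, end, kind))

    def plan(self):
        """Return the ops to issue on exit, in queue order."""
        if self.full_flush:
            return [("full",)]
        return [("range", start, end, kind) for start, end, kind in self.ranges]
```

In this model each `("range", ...)` entry would correspond to one ranged call on exit, and a `("full",)` entry to a single full cache flush. The backwards scan for ranges that have already dropped out of the cache is deliberately left out, since the post says that part is not yet designed.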
With branch prediction/walking now implemented, 99% of Zarch’s code, as an example, is fully running in native code within a second of the demo starting. It then takes a good ten mins to see about a dozen small blocks of code that are rare case uses. You can see screen shots of how this has progressed as I’ve coded it over the past week, with the final result I’ve settled on, here. I’m not actually showing the small block flushes as a number – it’s a spinning graphic, so you can’t tell the split from the screenshot, but you can see that it opted for 85 full cache flushes during the 1st pass as large blocks of code were translated.
Jon Abbott (1421) 2651 posts |
One last thing: is it currently possible to get the cache line size through RISCOS? I need to know that for the codelet cache alignment; for the time being I’ve hardcoded it at 32 bytes but really need to change it to a variable.
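For illustration, aligning a codelet address to a line size held in a variable rather than a hardcoded 32 is a small rounding step. The sketch below is generic Python, not ADFFS code; `align_up` is a hypothetical helper name, and it assumes the line size is a power of two.

```python
def align_up(address, line_size):
    """Round address up to the next multiple of line_size (a power of two)."""
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line size must be a power of two")
    return (address + line_size - 1) & ~(line_size - 1)
```

The same call then works unchanged whether the line size is 32 bytes or some other power of two, so only the value passed in needs to come from the system.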
Jeffrey Lee (213) 6048 posts |
I can’t think of any reason not to. Potentially we could expose the IMB ARMops directly, since they have a well-defined calling convention (and it’s not that far from OS_SynchroniseCodeAreas anyway).
I must have misread the TRM here then, I find ARM’s documents incredibly difficult to distinguish which variation of which chip supports which ops. On p3-71 of the ARM1176JZF-S Rev. r0p7 TRM is this:
Yes, you’re right – I foolishly only searched for “preload” in the ARM11/XScale docs and forgot that ARM also liked to use the term “prefetch”.
Very true. Perhaps we should have an optimised version of OS_SynchroniseCodeAreas which allows you to pass in a list of address ranges to act on?
Assuming LRU or round-robin replacement strategy :-)
Not at the moment, no. That should be easy enough to add to OS_PlatformFeatures – maybe a reason code which allows you to query the line size & cache size for a given cache level.
Jon Abbott (1421) 2651 posts |
Great, that would allow me to switch the Abort handler to RISCOS routines.
That works in my scenario, it may not work for others though. Both options if possible for flexibility.
True, I’ve not looked at it yet and wasn’t planning on implementing anytime soon. It would need more information about cache strategy to work, as you point out.
Perfect
Jeffrey Lee (213) 6048 posts |
It took me a while to get round to it, but OS_MMUControl 2 (for getting ARMops) and OS_PlatformFeatures 33 (for reading cache information) are now available (and in today’s ROMs, too). Let me know if you spot any issues (the IMB_List implementations are actually completely untested, but they’re simple enough transformations of IMB_Range that I don’t think I would have introduced any bugs!)
Jon Abbott (1421) 2651 posts |
Excellent, thanks for implementing this. I’ll make use of them when I recode ADFFS for Page Zero support; the two combined should increase the JIT performance by an order of magnitude.
Jon Abbott (1421) 2651 posts |
I’ve finally got around to coding this up, whilst adding Page Zero Relocation and ARMv7 support. Either I’m doing something daft, or have bugs in my code, as I can’t get these ARMops to achieve the same result as MCRs and OS_SynchroniseCodeAreas. Just to confirm I am calling the correct ARMops: OS_SynchroniseCodeAreas, 0 (Full sync) becomes IMB_Full. And what do I need to do about this:
How does IMB_List work in this scenario? I initially rewrote everything as above, which locked up immediately, so I’ve rolled back and have only switched OS_SynchroniseCodeAreas to IMB_Range, which fails after the first set of instructions is written back. When calling the ARMops, the CPU is in either Abort or SVC and I’m using the following in-line calling routine:
EDIT: |
Jeffrey Lee (213) 6048 posts |
Careful – that will invalidate the data cache, without cleaning it first, so will most likely crash the machine. It would be better to use IMB_Full (although obviously not as optimal, if you know the D cache doesn’t need cleaning). And what do I need to do about this: You don’t really have to do anything; you can just call the routine as normal. But if you wanted to optimise for machines with unified caches, you could check to see if IMB_List is a dummy routine, and if so switch to a different version of your code which skips building the list of dirty cache lines and skips calling IMB_List with it.
Excellent! Glad you got it sorted.
Jon Abbott (1421) 2651 posts |
Didn’t know that; by luck my recode did away with this ARMop. It’s turned out better than I’d expected, as using ARMops has allowed me to remove a lot of code.