Extending OS_MMUControl

15 posts, 2 voices

Mar 20, 2015 5:52am

Jon Abbott (1421) 2651 posts

Following on from this thread and now this thread, both of which require CPU specific cache operations, I’d like to kick off a formal discussion about exposing some of the ARMOp’s via SWI.

The requirement for low level cache ops for drivers etc is increasing with every SoC release and having spent weeks myself essentially duplicating everything that’s already in RISCOS, I wouldn’t want anyone else to go through the pain and frustration of the intricacies of cross-CPU cache/barrier operations.

My proposal is to extend OS_MMUControl to allow single/ranged memory, TLB and barrier operations:

Entry
R0	Reason code (bits 0 -7),
	Flags (bits 8 – 31) which are reason code specific
R1	Start line / MVA
R2	End line / MVA if bit 25 set (inclusive)

OS_MMUControl 2 – Ranged/Single cache flush request

Bits	Meaning
0-7	2
24	Lines (otherwise MVA)
25	Ranged R1/R2 specify range inclusive (single otherwise, R1 specified only)
26	Invalidate
27	Clean
28	Drain write buffer
29	Data cache
30	TLB
31	Instruction cache
All other bits are reserved, and must be zero

OS_MMUControl 3 – Barrier operations

Bits	Meaning
0-7	3
30	Data Synchronization Barrier
31	Data Memory Barrier
All other bits are reserved, and must be zero

OS_MMUControl 4 – Prefetch operations

Bits	Meaning
0-7	4
29	Unlock instructions from cache (MVA in R1, R2)
30	Lock instructions into cache (MVA in R1, R2)
31	Prefetch instruction cache lines (MVA in R1, R2)
All other bits are reserved, and must be zero

This would add full flexibility to cache ops and make a lot of the internal RISCOS ARMOps publicly available.

OS_SynchroniseCodeAreas duplicates some of this functionality so it might be worth making it a veneer to OS_MMUControl and centralise all cache/TLB ops in one place.

Mar 20, 2015 10:19am

Jeffrey Lee (213) 6048 posts

A few quick thoughts (will have a proper look later):

OS_MMUControl 2 bit 24 looks a bit out of place – none of the ARMops allow you to explicitly specify cache lines to operate on, they only use MVA. Did you have a particular use case in mind for this flag? To sensibly support it we’d need to add a call to read the highest cache line number (and we’d need to consider what to do in multi-level cache situations, or even for odd cases like the instruction and data cache sizes not being identical)
OS_MMUControl 4 – the overhead of the SWI will almost certainly offset any performance gains you were hoping to get. Just use the PLD/PLI/PLW instructions directly (depending on desired behaviour & architecture support!). About the only SWI which would make sense would be one which returns an instruction which you can then write out into your code, e.g. “give me a PLI instruction for an address in R4”. But that won’t fully be able to cope with the fact that PLI can have an offset applied while an MCR op (is there one on some architectures?) can’t.

OS_SynchroniseCodeAreas duplicates some of this functionality so it might be worth making it a veneer to OS_MMUControl and centralise all cache/TLB ops in one place.

OS_SynchroniseCodeAreas is just a veneer to an ARMop call, so there isn’t really much point in making adding an extra step of indirection.

Mar 20, 2015 2:11pm

Jon Abbott (1421) 2651 posts

OS_MMUControl 2 bit 24 looks a bit out of place – none of the ARMops allow you to explicitly specify cache lines to operate on, they only use MVA. Did you have a particular use case in mind for this flag?

I added it at the last minute for completeness, I can’t think of a use for it either unless adding/locking code into the cache by line.

OS_MMUControl 4 – the overhead of the SWI will almost certainly offset any performance gains you were hoping to get.

Looks like I didn’t save my final edit, have now corrected.

The scenario here in my case, is to precache and lock the Abort veneer in ADFFS into the instruction cache, so it can quickly determine if code is self-modifying. At 40-100k Aborts a second that some games produce, by locking the two lines it would require it will substantially reduce the performance hit.

Mar 23, 2015 2:01pm

Jeffrey Lee (213) 6048 posts

Here’s the doc describing the existing ARMops: https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Kernel/Docs/HAL/Attic/ARMop_API?rev=1.1.2.2;content-type=text%2Fx-cvsweb-markup;hideattic=0#l136

As you can see, there aren’t really any ranged cache operations, apart from those done implicitly by the MMU ARMops. So that will cause some issues for OS_MMUControl 2.

Another issue with OS_MMUControl 2, if it’s aiming to be a general-purpose cache maintenance SWI, is that it doesn’t say which cache level(s) should be affected by the operation. Rather than have APIs which require the cache levels to be specified explicitly, the ARM ARM lists two different ‘points of interest’ which general-purpose cache maintenance ops should aim to target:

Point of coherency – “For a particular MVA, the point at which all agents that can access memory are guaranteed to see the same copy of a memory location”. E.g. a cache flush to PoC would generally involve flushing to main memory.
Point of unification – “For a particular processor, the point by which the instruction and data caches and the translation table walks of that processor are guaranteed to see the same copy of a memory location”. E.g. on all current systems OS_SynchroniseCodeAreas doesn’t need to bother flushing the data from the L2 cache, only the L1 cache.

At the moment the cache & MMU ARMops do cache maintenance to the PoC, while the IMB ARMops only do cache maintenance to PoU. For MMU ops the behaviour of flushing to PoC is often wasteful (99% of MMU ops will be for task swapping in the Wimp and so PoU will suffice), so it’s one of the things on my “would be nice to fix” list.

It might be worth looking at the ops you’re performing (and why) and seeing if they fit with the existing ARMops before we spend type implementing a bunch of new ops for this SWI.

For OS_MMUControl 4 – if you’re adding the ability to lock cache lines, I think it would be a good idea to allow unlocking as well ;-) Other than that it sounds like a good addition (even if it does mean new ARMops we (I) would have to implement!)

Mar 23, 2015 8:01pm

Jon Abbott (1421) 2651 posts

Everything you say makes sense to me.

As you can see, there aren’t really any ranged cache operations

When looking through the BCM source, I’m certain I spotted ranged ops that use the new MRCC instructions – I lifted them and tried it in ADFFS. Anyhow, doesn’t really matter, ranged is going to be essential for efficient cache coherency. Thankfully the days of full cache flushes on 80321/SA are long gone :)

Another issue with OS_MMUControl 2, if it’s aiming to be a general-purpose cache maintenance SWI, is that it doesn’t say which cache level(s) should be affected by the operation.

Definitely avoid specifying cache levels its too processor specific, where there are two in a particular processor and OS_SynchroniseCodeAreas is called for example, it should handle both caches if required. The idea behind the changes I’m proposing is to offer more granularity around what’s flushed but obviscate the bulk of the low level ArmOp actions that need to happen and avoid userland code doing any kind of direct ARMOp on the CPU.

It might be worth looking at the ops you’re performing (and why) and seeing if they fit with the existing ARMops before we spend time implementing a bunch of new ops for this SWI.

Agree with you there, I’m not familiar enough with them as they’re not publicly exposed, so some of what I require may already be there. ADFFS performs the following cache operations:

Clean D entry – in the Abort handler for self-modifying code. It will clean 1 or 2 MVA’s depending on the cache alignment and length in the case of STM
Invalidate I entry – in the Abort handler for self-modifying code. It will invalidate 1 or 2 MVA’s depending on the cache alignment and length in the case of STM (80321+ only)
Drain Write Buffer – in the Abort handler after cleaning D entries
Invalidate I Cache – in the Abort handler for self-modifyin code when running on StrongARM
Invalidate TLB Entry – in the JIT when it marks a page as read only that contains code, or when switching the pages back to R/W in paravirtualized SWI’s that affect memory ranges (eg OS_File)
Precache I – in the JIT exit handler shortly before it passes control back to the code its just translated
Clean D range – once its written a codelet to memory
Clean D & Invalidate I range – when it hits a branch to follow or is about to exit. It’s also used in various paravirtualized SWI’s such as OS_File that affect ranges of memory that potentially contain code

As you can see, I’ve had to make some processor specific. With the changes proposed above, the SWI’s can determine if the processor supports individual/ranged D/I actions and act accordingly, so if Invalidate I entry is called on a StrongARM for example, the SWI will perform Invalidate I instead.

What I’d also like to do, but have held off for the time being, is to lock the core Abort handler code into the I cache as it’s hit potentially millions of times a second. I’d like to implement this when I code the misaligned memory action handler for ARM7

For OS_MMUControl 4 – if you’re adding the ability to lock cache lines, I think it would be a good idea to allow unlocking as well

Good point, now corrected.

even if it does mean new ARMops we (I) would have to implement!

Sorry about that, I honest feel bad that you’re the only person that can pick this sort of thing up.

From searching the OS source code, it’s littered with OS_SynchroniseCodeArea and I’m sure the additions I’m proposing could improve the efficiency of some of those by switching to more appropriate clean/flushes.

Mar 24, 2015 12:38am

Jeffrey Lee (213) 6048 posts

It sounds like 1-4 in your list should be replaceable with a single OS_SynchroniseCodeAreas.

5 is equivalent to the TLB_InvalidateEntry ARMop, so we could easily expose that via OS_MMUControl.

6 will almost certainly only see useful benefits if you use the PLI instruction directly. Neither XScale nor ARM11 appear to have CP15 ops for instruction preload, so you only really need to deal with two cases: pre-ARMv7 where PLI isn’t available, and ARMv7 where PLI is. Although if you’re lucky PLI might be interpreted as a NOP (or PLD) on ARMv5/v6 (and on <=ARMv4 it’ll definitely NOP due to being an unconditional instruction).

7 & 8 also sound like they should be using OS_SynchroniseCodeAreas.

Note that although you might think you can get away with doing less work than OS_SynchroniseCodeAreas does when dealing with certain behaviour patterns (e.g. avoid cleaning I cache if you know the code hasn’t been executed before), the ARM ARM makes it clear that (for ARMv7, at least) the only way of guaranteeing that the CPU won’t spuriously precache some code/data is to make sure it’s marked as non-cacheable or to make sure the access permissions prevent the (privileged/unprivileged) access from succeeding (see section B2.2.2, “Cache behavior”. Also, I’ve just realised that the ARM ARM uses American spelling for behaviour, colour, and presumably most other words. Yuck!)

From searching the OS source code, it’s littered with OS_SynchroniseCodeArea and I’m sure the additions I’m proposing could improve the efficiency of some of those by switching to more appropriate clean/flushes.

Funnily enough, I suspect all of those are used in places where new instructions have (potentially) been written to memory, and so use of OS_SynchroniseCodeAreas is completely appropriate ;-)

Mar 24, 2015 3:54am

Jon Abbott (1421) 2651 posts

It sounds like 1-4 in your list should be replaceable with a single OS_SynchroniseCodeAreas.

I opted to go direct to avoid the SWI overhead, considering its within the Abort handler and is called potentially millions of times per second. Unless the entry point for OS_SynchroniseCodeAreas were exposed directly so it could LDR PC to it, I’ll probably leave these as-is for the moment. It’s an area I need to revisit when I move onto ARMv7 support as previously mentioned. Would exposing the OS_SynchroniseCodeAreasaddress be a possibility?

6 will almost certainly only see useful benefits if you use the PLI instruction directly. Neither XScale nor ARM11 appear to have CP15 ops for instruction preload, so you only really need to deal with two cases: pre-ARMv7 where PLI isn’t available, and ARMv7 where PLI is. Although if you’re lucky PLI might be interpreted as a NOP (or PLD) on ARMv5/v6 (and on <=ARMv4 it’ll definitely NOP due to being an unconditional instruction).

I must have misread the TRM here then, I find ARM’s documents incredibly difficult to distinguish which variation of which chip supports which ops. On p3-71 of the ARM1176JZF-S Rev. r0p7 TRM is this:

MCR p15, 0, <Rd>, c7, c14, 1 MVA Prefetch Instruction Cache Line, using MVA

That is the ARM core in the Pi isn’t it?

7 & 8 also sound like they should be using OS_SynchroniseCodeAreas.

Note that although you might think you can get away with doing less work than OS_SynchroniseCodeAreas does when dealing with certain behaviour patterns (e.g. avoid cleaning I cache if you know the code hasn’t been executed before), the ARM ARM makes it clear that (for ARMv7, at least) …

I hear what you’re saying. One SWI is probably quicker in this instance anyhow for ARMv5+, I’ve already switched it to OS_SynchroniseCodeAreas in the next Pi release….however. On StrongARM/80321 repeatedly flushing small blocks of D cache and then one final I invalidation would be a lot more efficient than repeated I invalidation due to the cache flush overhead. So my line of thinking here is to improve performance on legacy CPU’s not ARM11+

7/8 are called potentially 128 times per JIT code pass, so 128 D flushes for the Codelets and one final I flush on exit is how I’d like to coded it – although at the minute they both call OS_SynchroniseCodeAreas.

The codelet area is separate to everything else, all codelets are cache aligned and its cleaned when codelets are removed so I believe I can avoid I cache flushes. I’ll do some testing on StrongARM over the next few days and see how reliable it is in a soak test.

My recent rewrite to use OS_SynchroniseCodeAreas was to add Iyonix support, where I thought as you mention here, it could be predictive caching ahead that was causing the problems I was seeing. Sadly it’s still broken, so I think I have a compiler issue that’s unrelated to caching. Where I was previously using MCR to clean D / I separately it was working correctly on the Pi.

Funnily enough, I suspect all of those are used in places where new instructions have (potentially) been written to memory, and so use of OS_SynchroniseCodeAreas is completely appropriate ;-)

I withdraw my PM for that patronising comment ;) … seriously for a minute, I’m aware of why the calls are there, some just seemed oddly placed and went against the usage advise in the SWI documentation. Perhaps the SWI documentation needs rewording slightly to not preclude IRQ handlers.

EDIT: Forgot to add how I’ve coded this in ADFFS:

In the upcoming release, it builds up a heap of cache ops to perform before exit. These detail the range and flush clear required: I, D or I&D.

Prior to handing back execution, it scans the heap and decides if its quicker to perform full cache flushes or lots of D/I cache flushes – so I have flexibility around what’s flushed and can optimize for performance. For example, if it writes 16kb of instructions, it doesn’t need to clean the ranges that have already dropped out of the cache.

I’ve not built that up fully yet, but intend to scan it backwards until it hits a point the cache is known to not contain. How I determine that cut-off I don’t know yet, but the option is there should I wish to add it.

At the moment it simply calls OS_SynchroniseCodeAreas repeatedly for each heap entry, although as the heap is built up, if it hit a threshold (currently 1kb of pending ranges) it stops adding to the heap and simply trigger one full cache flush on exit.

Due to the way (ADFFS’) branch prediction/walking works and code in general triggers the JIT, there can be lots of small runs of code both in the codelet space and Appspace. From the tests I’ve done the bulk of the flush operations are small D / I MVA ranges.

The split is 1xxx:1 (ranged vs full) and higher and can be tuned by altering both the branch prediction level and max number of instructions the JIT will handle in one entry. This really only affects games on the 1st pass though, on initial pass there’s large chunks of code being written out and it will issue full cache flushes/invalidation. On 2nd+ it’s simply picking up conditional branches and subroutines not previously seen.

With branch prediction/walking now implemented, 99% of Zarch’s code as an example is fully running in native code within a second of the demo starting. It then takes a good ten mins to see about a dozen small blocks of code that are rare case uses. You can see screen shots of how this has progressed as I’ve coded it over the past week with the final result I’ve setted on here

I’m not actually showing the small block flushes as a number – it’s a spinning graphic, so you can’t tell from the screenshot the split, but you can see that it opted for 85 full cache flushes during the 1st pass as large blocks of code where translated.

Mar 24, 2015 5:25am

Jon Abbott (1421) 2651 posts

One last thing, is it currently possible to get the cache line size through RISCOS?

I need to know that for the codelet cache alignment, for the time being I’ve hardcoded it at 32 bytes but really need to change it to a variable.

Mar 24, 2015 11:22am

Jeffrey Lee (213) 6048 posts

Would exposing the OS_SynchroniseCodeAreasaddress be a possibility?

I can’t think of any reason not to. Potentially we could expose the IMB ARMops directly, since they have well-defined calling convention (and it’s not that far from OS_SynchroniseCodeAreas anyway)

I must have misread the TRM here then, I find ARM’s documents incredibly difficult to distinguish which variation of which chip supports which ops. On p3-71 of the ARM1176JZF-S Rev. r0p7 TRM is this:

MCR p15, 0, , c7, c14, 1 MVA Prefetch Instruction Cache Line, using MVA

That is the ARM core in the Pi isn’t it?

Yes, you’re right – I foolishly only searched for “preload” in the ARM11/XScale docs and forgot that ARM also liked to use the term “prefetch”.

On StrongARM/80321 repeatedly flushing small blocks of D cache and then one final I invalidation would be a lot more efficient than repeated I invalidation due to the cache flush overhead. So my line of thinking here is to improve performance on legacy CPU’s not ARM11+

Very true. Perhaps we should have an optimised version of OS_SynchroniseCodeAreas which allows you to pass in a list of address ranges to act on?

For example, if it writes 16kb of instructions, it doesn’t need to clean the ranges that have already dropped out of the cache.

Assuming LRU or round-robin replacement strategy :-)

One last thing, is is currently possible to get the cache line size through RISCOS?

Not at the moment, no. That should be easy enough to add to OS_PlatformFeatures – maybe a reason code which allows you to query the line size & cache size for a given cache level.

Mar 24, 2015 12:25pm

Jon Abbott (1421) 2651 posts

Potentially we could expose the IMB ARMops directly, since they have well-defined calling convention

Great, that would allow me to switch the Abort handler to RISCOS routines.

Perhaps we should have an optimised version of OS_SynchroniseCodeAreas which allows you to pass in a list of address ranges to act on?

That works in my scenario, it may not work for others though. Both options if possible for flexibility.

Assuming LRU or round-robin replacement strategy :-)

True, I’ve not looked at it yet and wasn’t planning on implementing anytime soon. It would need more information about cache strategy to work, as you point out.

That should be easy enough to add to OS_PlatformFeatures – maybe a reason code which allows you to query the line size & cache size for a given cache level.

Perfect

Aug 6, 2015 9:52pm

Jeffrey Lee (213) 6048 posts

It took me a while to get round to it, but OS_MMUControl 2 (for getting ARMops) and OS_PlatformFeatures 33 (for reading cache information) are now available (and in today’s ROMs, too). Let me know if you spot any issues (the IMB_List implementations are actually completely untested, but they’re simple enough transformations of IMB_Range that I don’t think I would have introduced any bugs!)

Aug 8, 2015 10:34am

Jon Abbott (1421) 2651 posts

Excellent, thanks for implementing this. I’ll make use of them when I recode ADFFS for Page Zero support, the two combined should increase the JIT performance by an order of magnitude.

Oct 26, 2015 4:38am

Jon Abbott (1421) 2651 posts

I’ve finally got around to coding this up, whilst adding Page Zero Relocation and ARMv7 support. Either I’m doing something daft, or have bugs in my code, as I can’t get these ARMOps to achieve the same result as MCR’s and OS_SynchroniseCodeAreas.

Just to confirm I am calling the correct ARMOp’s:

OS_SynchoriseCodeAreas, 0 (Full sync) becomes IMB_Full
OS_SynchoriseCodeAreas, 1 (Ranged sync) becomes IMB_List (the ARMOp is using the same table structure I am)
MCR P15, 0, Rx, C8, C6, 1 (Invalidate TLB Entry by MVA) becomes TLB_InvalidateEntry
MCR P15, 0, R0, C7, C5, 0 (Invalidate Entire Instruction Cache) becomes Cache_InvalidateAll
MCR P15, 0, R0, C7, C10, 4 (Drain Write Buffer) becomes DSB_Write
MCR P15, 0, Rx, C7, C5, 1 (Invalidate I cache line by MVA) there’s no direct equivalent, but can be achieved with IMB_Range
MCR P15, 0, Rx, C7, C10, 1 (Clean D cache line by MVA) there’s no direct equivalent, but can be achieved with IMB_Range

And what do I need to do about this:

Unused ARMops (e.g. IMB on CPU with unified cache) will return a pointer to a dummy routine which does nothing – this can be detected as it will be a single MOV pc,lr instruction.

How does IMB_List work in this scenario?

I initially rewrote everything as above, which lock immediately, so I’ve rolled back and have only switched OS_SynchroniseCodeAreas to IMB_Range, which fails after the first set of instructions are written back.

When calling the ARMOp’s, the CPU is in either Abort or SVC and I’m using the following in-line calling routine:

MOV R14, PC
LDR PC, ArmOpX

EDIT:
I’ve recoded everything through the night to reduce the number of ARMOp’s required (now only using 1, 5, 7, 8 and 9) and have got it working. I’d missed one location where R14 needed preserving across the call and was reducing the last address one instruction too many in some situations.

Oct 26, 2015 11:55am

Jeffrey Lee (213) 6048 posts

MCR P15, 0, R0, C7, C5, 0 (Invalidate Entire Instruction Cache) becomes Cache_InvalidateAll

Careful – that will invalidate the data cache, without cleaning it first, so will most likely crash the machine.

It would be better to use IMB_Full (although obviously not as optimal, if you know the D cache doesn’t need cleaning).

And what do I need to do about this:

Unused ARMops (e.g. IMB on CPU with unified cache) will return a pointer to a dummy routine which does nothing – this can be detected as it will be a single MOV pc,lr instruction.

How does IMB_List work in this scenario?

You don’t really have to do anything; you can just call the routine as normal. But if you wanted to optimise for machines with unified caches, you could check to see if IMB_List is a dummy routine, and if so switch to a different version of your code which skips building the list of dirty cache lines and skips calling IMB_List with it.

I’ve recoded everything through the night to reduce the number of ARMOp’s required (now only using 1, 5, 7, 8 and 9) and have got it working.

Excellent! Glad you got it sorted.

Oct 26, 2015 4:53pm

Jon Abbott (1421) 2651 posts

Careful – that will invalidate the data cache, without cleaning it first, so will most likely crash the machine.

Didn’t know that, by luck my recode did away with this ARMOp. It’s turned out better that I’d expected as using ARMOp’s has allowed me to remove a lot of code.

Reply

To post replies, please first log in.

Forums → Code review →

Extending OS_MMUControl

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options