OS_SynchroniseCodeAreas speed issue
Jon Abbott (1421) 2651 posts
This example takes between 9 cs and 2 seconds on my Pi3 at an F12 prompt:
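The code itself wasn’t quoted here, so this is a sketch only: a plausible reconstruction of the kind of call being timed, assuming a plain full sync issued from BASIC.

    REM Assumed reconstruction, not the original code
    REM R0=0 asks OS_SynchroniseCodeAreas to synchronise everything
    T% = TIME
    SYS "OS_SynchroniseCodeAreas",0
    PRINT "Took ";TIME-T%;"cs"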
This second example takes between 4.5 and 62.5 seconds:
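Again the exact code is an assumption, but from the discussion below it appears to have been a ranged clean covering essentially the whole address map, along the lines of:

    REM Assumed reconstruction, not the original code
    REM R0 bit 0 set = clean the range R1 (start) to R2 (end, inclusive)
    T% = TIME
    SYS "OS_SynchroniseCodeAreas",1,0,&FFFFFFFF
    PRINT "Took ";TIME-T%;"cs"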
Why does the CPU speed have such an effect on invalidating the cache on a Pi3? And why doesn’t OS_SynchroniseCodeAreas set the CPU speed to full before doing whatever it’s doing that’s so CPU-intensive?
Jeffrey Lee (213) 6048 posts
I believe the CPU is already at full speed when you’re at the command line (unless it’s waiting for keyboard input). Besides, the difference between 9 cs and 2 seconds is a factor of ~20, and between 4.5 seconds and 62.5 seconds a factor of ~14. IIRC the default min & max speeds for a Pi 3 are 600MHz and 1200MHz – so unless you’ve set your min speed down to 60MHz (or have overclocked to 12GHz), there’s clearly more to this than just CPU speed.

Assuming you’re on a recent OS version (July 8 2018 or newer), OS_SynchroniseCodeAreas for large areas on a multi-core machine will be slower than before, because there’s no (trivial) SMP-friendly full data cache flush operation. On a single-core machine, large clean operations can be simplified to a full d-cache flush followed by a full i-cache invalidate; on multi-core machines the OS must instead walk the entire address range, flushing each MVA from the d-cache in turn, before the full i-cache invalidate.

To stop OS_SynchroniseCodeAreas 0 taking forever, it’s been modified so that on SMP machines it’ll only do a ranged clean of the RMA and application space (the two areas where code has historically been most likely to appear). So execution speed will mostly depend on how large your RMA & appslot are.

Additionally, on ARMv7 and newer, cache maintenance operations can trigger data aborts if they target unmapped memory. The kernel has a handler for this which causes it to skip the faulting instruction – but the cache maintenance loops are still pretty dumb, so it’ll keep trying all the other cache lines within that page. Your second example, which cleans pretty much every address in the system, will therefore be generating tens or hundreds of millions of aborts, since ~80% of the memory map is empty.
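The practical upshot of this is that passing the SWI the exact range you’ve modified is far cheaper than asking for a full sync. A small illustration (the buffer and its size are invented for the example, not taken from the thread):

    REM Sketch: synchronise only a freshly written code buffer
    DIM code% 255 : REM claim 256 bytes for generated code
    REM ... assemble or poke instructions into code% here ...
    SYS "OS_SynchroniseCodeAreas",1,code%,code%+255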
Jon Abbott (1421) 2651 posts
You can’t assume the CPU speed when you’re about to do something CPU-intensive. From the tests I’ve done, the CPU speed appears to be either high or low when you press F12 from the desktop; under a Task window, it fluctuates.
I have the low speed set to 100MHz, to fix the blanking issue.
Build date is 15th Feb 2019. Daft question, but why is it worried about the other cores when the OS is running on one? Shouldn’t it only invalidate the current core’s caches and any areas shared between cores? Shouldn’t it also be extended to have an explicit flag for flushing all cores? Any current uses of the SWI will be specific to the core the app/OS is running on.
Jeffrey Lee (213) 6048 posts
The Wimp should set the CPU back to a high speed when entering the command line – so that sounds like a bug.
For OS_SynchroniseCodeAreas 0, I was too lazy/paranoid to make the logic change only take effect once the other cores are started.
With SMP, “any areas shared between cores” is typically almost every page in the system. You’ll want the RMA and other dynamic areas to be shared so that programs don’t have to worry about which core they’re running on before they try to access the code/data held there. And any multi-threaded app will want its application space to be shareable so that it gets full multi-core performance instead of being restricted to running on a single core at a time.

For single-threaded apps it may be feasible to mark the wimpslot non-shareable, but that feels like an optimisation that we should only look into once we’ve got the basics up and running.