Flushing the cache correctly on the various chips
Jon Abbott (1421) 2651 posts |
From Googling, it appears that the method required to flush the cache changes from StrongARM through to the latest chips. Does anyone know what’s required for which chips? I need to flush just a few entries from the D and I caches to support self-modifying code. On StrongARM, I initially tried the following instruction, with R0 starting at the start address (with the low bits cleared to align it to a cache line) and increasing by 32 until it’s past the end address:
MCR P15, 0, R0, C7, C6, 1 (flush D-cache entry)
Followed by:
This doesn’t appear to flush the entries. So I then tried switching from Abort32 to SVC32 (on RO5.21) and calling OS_SynchroniseCodeAreas, 1, startaddr, endaddr. This had exactly the same issue; surely that’s a bug? The only reliable way I’ve found of flushing entries from the D-cache is to read 32KB of memory in 32-byte steps. Where am I going wrong? |
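For readers following along, the address walk described in the post above (align the start down to a 32-byte line, then step by 32 until past the end) can be sketched in C. This is only a model of the loop arithmetic: `for_each_cache_line` and the `line_op_fn` hook are hypothetical names, and on real hardware the hook would be where the per-line MCR is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* line size quoted in the post; an assumption elsewhere */

/* Hypothetical hook: stands in for the per-line cache operation. */
typedef void (*line_op_fn)(uintptr_t line_addr, void *ctx);

/* Visit every cache line that overlaps [start, end).
 * Returns the number of lines visited. */
static size_t for_each_cache_line(uintptr_t start, uintptr_t end,
                                  line_op_fn op, void *ctx)
{
    /* The mask trick below only works for power-of-two line sizes. */
    assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1)) == 0);

    if (end <= start)
        return 0;

    /* Clear the low bits so the first address sits on a line boundary. */
    uintptr_t first = start & ~(uintptr_t)(CACHE_LINE_BYTES - 1);

    /* Count lines up front so the loop cannot wrap at the top of memory. */
    size_t lines = (size_t)((end - 1 - first) / CACHE_LINE_BYTES) + 1;

    for (size_t i = 0; i < lines; i++) {
        if (op)
            op(first + (uintptr_t)i * CACHE_LINE_BYTES, ctx);
    }
    return lines;
}
```

Note that a range which starts or ends mid-line still touches the whole line at each end, so an 8-byte range straddling a boundary costs two operations.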
Jeffrey Lee (213) 6048 posts |
OS_SynchroniseCodeAreas should work fine. When you were calling it, were you also calling your cache flush code before it? Because your code is wrong. The “flush D-cache entry” operation merely invalidates the cache entry – it doesn’t trigger any writeback to memory. Instead you want to be using MCR P15, 0, R0, C7, C10, 1 (clean D-cache entry). This is what the kernel uses when you call OS_SynchroniseCodeAreas, and so far nobody else has reported any issues with it! (see IMB_Range_WB_Crd in s.ARMops) |
Jon Abbott (1421) 2651 posts |
OS_SynchroniseCodeAreas was the only call, and it’s not working for me. However… using clean D-cache entry has fixed the issue. The issue with OS_SynchroniseCodeAreas could of course be related to swapping in/out of Abort32; however, I’ve used the same code for the Abort handler when it swaps modes with no issues. When I get a chance, I’ll code a repro and see what happens. Knew I was doing something wrong. Mind you, in my defence, the words “clean” and “flush” aren’t the best choices to describe what they do. Thanks |
Jeffrey Lee (213) 6048 posts |
Yeah, I’m not quite sure why Intel decided on that wording. |
Jon Abbott (1421) 2651 posts |
I’ve had to fall back to reading a block of RAM; clean D-cache entry isn’t 100% reliable. It randomly doesn’t seem to work and I’ve no idea why. I’m now starting to understand why the Linux kernel ditched it in favour of reading a block of RAM! I’m not sure which chips this affects, as I’ve only tested on StrongARM so far. Once I’ve recoded the DA2 support following your OS bug fixes last night, I’ll see what happens on ARM11. I’ve also yet to repro the issue with OS_SynchroniseCodeAreas. I’ve looked at the OS code and it does what I’m doing, but has a completely different outcome, as if it’s not called in the first place. I’m running Zarch (with disc protection), which goes straight into self-modifying code and so crashes immediately if the cache isn’t flushed. That immediate crash is what I see when using OS_SynchroniseCodeAreas (on RO3.71, will confirm RO5.21 later). Using clean D-cache entry, it runs for a while and then crashes. Reading a block of RAM, it runs without problems for hours. All very odd. |
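The "read a block of RAM" fallback mentioned above amounts to touching one byte per cache line across a buffer. A minimal sketch follows, assuming a 32-byte line as in the earlier posts; `evict_by_reading` is a hypothetical name. Whether such a read sweep really forces dirty lines out depends on the cache size and replacement policy of the CPU, which this code does not check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed line size, taken from the posts above */

/* Read one byte from every line of buf so each line is pulled into the
 * D-cache. The volatile qualifier stops the compiler from removing the
 * reads. Returns the number of lines touched. */
static size_t evict_by_reading(const volatile uint8_t *buf, size_t bytes)
{
    assert(buf != NULL);

    size_t lines = 0;
    for (size_t off = 0; off < bytes; off += LINE_BYTES) {
        (void)buf[off];  /* the load itself is the point; the value is unused */
        lines++;
    }
    return lines;
}
```

With the 32KB buffer quoted in the thread, this performs 1024 loads per sweep, which gives a feel for the cost compared with cleaning only the handful of lines that were actually modified.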
Jon Abbott (1421) 2651 posts |
I’m having trouble flushing the caches on the 80321. Does anyone know the correct method? I based the following on the example given in the Intel XScale Core Developer’s Manual, but it seems to result in rubbish being written back to memory:
|
Jon Abbott (1421) 2651 posts |
Is there a cut-off point where you’re better off flushing the entire cache instead of individual cache lines? For example, is it quicker to flush the entire cache if half of the lines need flushing? |
Jeffrey Lee (213) 6048 posts |
Yes. The kernel has knowledge of a cut-off point for each CPU/cache type, but I don’t think anyone’s actually bothered to tune the values to match reality. |
Jon Abbott (1421) 2651 posts |
Any idea what the Iyonix and Pi cut-over would be? How would one even go about finding out? Randomly flushing and comparing times isn’t going to work, so it needs some Intel/ARM-defined figures, which don’t appear to be publicly available. |
Jeffrey Lee (213) 6048 posts |
1KB < N < 1MB
I guess there are a few different situations you’d need to test for:
Time how long it takes to do ranged flush & full flush for each of the above, for various data set sizes, and that should hopefully point you at where a sensible cutoff should be. Although those tests will only really be taking into account the impact on your code – measuring the impact on other code (when you do a full flush and end up evicting valid, active code/data which was in use by something else) will be a bit trickier.
Probably because too much of it will come down to implementation & circumstance. Memory speeds & latencies, how much data needs to be flushed, how much load there is on the memory bus, etc. |
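The measurement approach suggested above (time a ranged flush and a full flush over various data set sizes, then look for the crossover) could be post-processed along these lines. This is a sketch under assumptions: `flush_sample`, `find_cutoff` and `choose_flush` are hypothetical names, the timings must come from your own measurements on the target machine, and nothing here reflects how the kernel stores or applies its own thresholds.

```c
#include <assert.h>
#include <stddef.h>

/* One measurement: how long each strategy took for a given range size. */
typedef struct {
    size_t bytes;           /* size of the range that needed flushing */
    double ranged_seconds;  /* measured time for a line-by-line flush */
    double full_seconds;    /* measured time for a whole-cache flush */
} flush_sample;

/* Samples must be sorted by ascending size. Returns the smallest sampled
 * size at which the ranged flush was no faster than the full flush, or 0
 * if the ranged flush won at every sampled size. */
static size_t find_cutoff(const flush_sample *samples, size_t n)
{
    assert(samples != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (samples[i].ranged_seconds >= samples[i].full_seconds)
            return samples[i].bytes;
    }
    return 0;
}

typedef enum { FLUSH_RANGED, FLUSH_FULL } flush_kind;

/* Pick a strategy for a request; a cutoff of 0 means "no cutoff known". */
static flush_kind choose_flush(size_t range_bytes, size_t cutoff_bytes)
{
    if (cutoff_bytes != 0 && range_bytes >= cutoff_bytes)
        return FLUSH_FULL;
    return FLUSH_RANGED;
}
```

As noted in the post, a cutoff derived this way only reflects the cost to the code doing the flush; it says nothing about the cost to other code whose cached data is evicted by a full flush.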
Jon Abbott (1421) 2651 posts |
I’m writing code that’s about to be executed in small chunks (max 128 words), plus codelets to a separate area that the I cache has not previously seen. I think I can get away with cleaning the D cache and invalidating the I cache by MVA for the code about to be executed, and cleaning just the D cache by MVA for the code that’s not been previously seen. As it exits to execute the code it flushes the write buffer. I can’t get it working though. |
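The sequence described in the post above can be modelled in C to make the ordering and the per-line counts explicit. This is a sketch, not a diagnosis of the problem: `cache_ops`, `sync_new_code` and the counting hooks are hypothetical, the 32-byte line size is carried over from the earlier StrongARM posts and may not apply to other CPUs, and the order used here (clean D-cache lines, drain the write buffer, then invalidate I-cache lines) is the conventional one rather than something confirmed by the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed line size */

/* Hypothetical hooks standing in for the CP15 operations. */
typedef struct {
    void (*clean_dcache_line)(uintptr_t mva, void *ctx);
    void (*invalidate_icache_line)(uintptr_t mva, void *ctx);
    void (*drain_write_buffer)(void *ctx);
    void *ctx;
} cache_ops;

/* Make freshly written code at [start, start+bytes) visible to fetches.
 * Pass invalidate_icache = 0 for areas the I-cache has never seen. */
static void sync_new_code(const cache_ops *ops, uintptr_t start,
                          size_t bytes, int invalidate_icache)
{
    assert(ops && ops->clean_dcache_line && ops->invalidate_icache_line
           && ops->drain_write_buffer);
    if (bytes == 0)
        return;

    uintptr_t first = start & ~(uintptr_t)(LINE_BYTES - 1);
    size_t lines = (size_t)((start + bytes - 1 - first) / LINE_BYTES) + 1;

    /* 1. Push the new instructions out of the D-cache. */
    for (size_t i = 0; i < lines; i++)
        ops->clean_dcache_line(first + (uintptr_t)i * LINE_BYTES, ops->ctx);

    /* 2. Wait for the writes to reach memory. */
    ops->drain_write_buffer(ops->ctx);

    /* 3. Discard any stale copies held by the I-cache. */
    if (invalidate_icache) {
        for (size_t i = 0; i < lines; i++)
            ops->invalidate_icache_line(first + (uintptr_t)i * LINE_BYTES,
                                        ops->ctx);
    }
}

/* Stand-in hooks that only count calls, for experimenting off-target. */
typedef struct { size_t cleans, invalidates, drains; } op_counts;

static void count_clean(uintptr_t mva, void *ctx)
{ (void)mva; ((op_counts *)ctx)->cleans++; }

static void count_invalidate(uintptr_t mva, void *ctx)
{ (void)mva; ((op_counts *)ctx)->invalidates++; }

static void count_drain(void *ctx)
{ ((op_counts *)ctx)->drains++; }
```

One detail the model makes visible: a 128-word chunk that does not start on a line boundary spans 17 lines rather than 16, so a loop sized from the word count alone would miss the last line.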