RISC OS Open: Forum: Flushing the cache correctly on the various chips

Dec 18, 2013 10:43am

Jon Abbott (1421) 2651 posts

From Googling, it appears the method required to flush the cache changes from StrongARM through to the latest chips. Does anyone know what’s required for which chips? I need to flush just a few entries from the D and I caches to support self-modifying code.

On StrongARM, I initially tried the following instruction with R0 starting at the start address (with bits cleared to align to the cache) and increasing by 32 until it’s past the end address:

MCR P15, 0, R0, C7, C6, 1 (flush D-cache entry)

Followed by:
MCR P15, 0, R0, C7, C10, 4 (drain Write Buffer)
MCR P15, 0, R0, C7, C5, 0 (invalidate I-cache)

This doesn’t appear to flush the entries.

So, I then tried switching from Abort32 to SVC32 (on RO5.21) and calling OS_SynchroniseCodeAreas, 1, startaddr, endaddr. This had exactly the same issue – surely that’s a bug?

The only reliable way I’ve found of flushing entries from the D-cache is to read 32KB of memory in 32-byte steps. Where am I going wrong?
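
As an illustration of the address stepping described in the post above, here is a minimal Python sketch that lists the cache-line addresses a per-line loop would visit. It only models the arithmetic, not the coprocessor operations. The function name `cache_lines`, the exclusive end address and the 32-byte default line size are assumptions made for the example.

```python
def cache_lines(start, end, line_size=32):
    """Yield the address of every cache line overlapping [start, end).

    Assumption: `end` is exclusive and `line_size` is a power of two.
    """
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    addr = start & ~(line_size - 1)   # clear the low bits to align down
    while addr < end:
        yield addr
        addr += line_size
```

For example, `list(cache_lines(0x8004, 0x8044))` visits the three line addresses 0x8000, 0x8020 and 0x8040, so a range that is not line-aligned still covers its first and last partial lines.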

Dec 18, 2013 12:02pm

Jeffrey Lee (213) 6048 posts

OS_SynchroniseCodeAreas should work fine. When you were calling it, were you also calling your cache flush code before it? Because your code is wrong. The “flush D-cache entry” operation merely invalidates the cache entry – it doesn’t trigger any writeback to memory. Instead you want to be using MCR P15, 0, R0, C7, C10, 1 (clean D-cache entry). This is what the kernel uses when you call OS_SynchroniseCodeAreas, and so far nobody else has reported any issues with it! (see IMB_Range_WB_Crd in s.ARMops)
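
The difference between invalidating ("flush") and cleaning an entry can be shown with a toy model of a write-back cache. This is only a sketch in Python: the class `ToyWriteBackCache` and its methods are hypothetical names for illustration, it tracks single addresses rather than real 32-byte lines, and it does not model the I-cache or the write buffer.

```python
class ToyWriteBackCache:
    """Toy write-back data cache: one dict entry per cached address."""

    def __init__(self, memory):
        self.memory = memory      # backing store: address -> value
        self.lines = {}           # cached address -> (value, dirty flag)

    def write(self, addr, value):
        # The new value is held in the cache only and marked dirty.
        self.lines[addr] = (value, True)

    def clean(self, addr):
        """Write a dirty entry back to memory; the entry stays cached."""
        if addr in self.lines:
            value, dirty = self.lines[addr]
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)

    def invalidate(self, addr):
        """Discard the entry without any writeback (the 'flush' operation)."""
        self.lines.pop(addr, None)


memory = {0x8000: "old code"}
cache = ToyWriteBackCache(memory)

cache.write(0x8000, "new code")
cache.invalidate(0x8000)          # dirty data is thrown away
after_invalidate = memory[0x8000]

cache.write(0x8000, "new code")
cache.clean(0x8000)               # dirty data reaches memory
after_clean = memory[0x8000]
```

In this model `after_invalidate` is still the stale value while `after_clean` holds the newly written one, which is the behaviour described above: an instruction fetch from memory only sees the new code once the D-cache entry has been cleaned.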

Dec 18, 2013 12:56pm

Jon Abbott (1421) 2651 posts

OS_SynchroniseCodeAreas was the only call, and it’s not working for me. However… using flush D-cache entry has fixed the issue.

The issue with OS_SynchroniseCodeAreas could of course be related to swapping in/out of Abort32, however I’ve used the same code for the Abort handler when it swaps modes with no issues. When I get a chance, I’ll code a repro and see what happens.

Knew I was doing something wrong, mind you in my defense, the wording “clear” and “flush” aren’t the best choices to describe what they do.

Thanks

Dec 18, 2013 1:04pm

Jeffrey Lee (213) 6048 posts

> Knew I was doing something wrong, mind you in my defense, the wording “clear” and “flush” aren’t the best choices to describe what they do.

Yeah, I’m not quite sure why Intel decided on that wording.

Dec 19, 2013 7:17pm

Jon Abbott (1421) 2651 posts

I’ve had to fall back to reading a block of RAM; clean D-cache entry isn’t 100% reliable. It randomly doesn’t seem to work and I’ve no idea why. I’m now starting to understand why the Linux kernel ditched it in favour of reading a block of RAM!

I’m not sure which chips this affects, I’ve only tested on StrongARM at the minute. Once I’ve recoded the DA2 support following your OS bug fixes last night, I’ll see what happens on ARM11.

I’ve also yet to repro the issue with OS_SynchroniseCodeAreas – I’ve looked at the OS code and it does what I’m doing, but has a completely different outcome – as if it’s not called in the first place.

I’m running Zarch (with disc protection) – which will crash immediately if the cache isn’t flushed as it immediately goes into self-modifying code – this is what I see when using OS_SynchroniseCodeAreas (on RO3.71, will confirm RO5.21 later). Using clean D-cache entry it will run for a while then crash. Reading a block of RAM, it runs without problem for hours. All very odd.

Feb 20, 2015 7:33pm

Jon Abbott (1421) 2651 posts

I’m having trouble flushing the caches on the 80321. Does anyone know the correct method? I based the following on the example given in the Intel XScale Core Developer’s Manual, but it seems to result in rubbish being written back to memory:

 MOV     R0, #1024			;flush cache on Iyonix
 MOV     R1, #&700000			;use unallocated memory
 ._FC_L1
   MCR     P15, 0, R1, C7, C2, 5	;allocate cache line
   ADD     R1, R1, #ARM_CACHE_LINE_SIZE
   SUBS    R0, R0, #1
 BNE     _FC_L1

 MOV     R0, #64			;clean mini data cache
 ._FC_L2
   LDR     R2, [R1], #ARM_CACHE_LINE_SIZE
   SUBS    R0, R0, #1
 BNE     _FC_L2

 MCR     P15, 0, R0, C7, C5, 0 		;invalidate I cache & BTB
 MCR     P15, 0, R0, C7, C6, 0		;invalidate D cache
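
To make it easier to check which addresses the two loops above actually touch, here is a small Python sketch of the address arithmetic. It assumes `ARM_CACHE_LINE_SIZE` is 32 (the post does not show its value), and `loop_span` is a hypothetical helper name. It does not diagnose the problem, it only spells out the ranges.

```python
ARM_CACHE_LINE_SIZE = 32   # assumption: 32-byte cache lines


def loop_span(base, count, line_size=ARM_CACHE_LINE_SIZE):
    """Return (first address, address after the last line, bytes covered)."""
    size = count * line_size
    return base, base + size, size


# First loop: 1024 line allocations starting at &700000.
main_first, main_end, main_bytes = loop_span(0x700000, 1024)

# Second loop: R1 carries on from where the first loop left it.
mini_first, mini_end, mini_bytes = loop_span(main_end, 64)
```

Under that assumption the first loop covers 32KB from &700000 up to (but not including) &708000, and the second loop reads 2KB from &708000 up to &708800, which is consistent with a 32KB main data cache and a 2KB mini data cache. Both ranges need to be mapped the way the cleaning procedure expects for the loops to have the intended effect.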

Mar 10, 2015 4:41pm

Jon Abbott (1421) 2651 posts

Is there a cut off point where you’re better off flushing the entire cache instead of individual cache lines?

For example, is it quicker to flush the entire cache if half of the lines need flushing?

Mar 10, 2015 5:50pm

Jeffrey Lee (213) 6048 posts

> Is there a cut off point where you’re better off flushing the entire cache instead of individual cache lines?

Yes.

The kernel has knowledge of a cut-off point for each CPU/cache type, but I don’t think anyone’s actually bothered to tune the values to match reality.

Mar 10, 2015 7:32pm

Jon Abbott (1421) 2651 posts

Any idea what the Iyonix and Pi cut-over would be?

How would one even go about finding out? Randomly flushing and comparing times isn’t going to work, so it needs some Intel/ARM defined figures which don’t appear to be publicly available.

Mar 10, 2015 8:10pm

Jeffrey Lee (213) 6048 posts

> Any idea what the Iyonix and Pi cut-over would be?

1KB < N < 1MB

> How would one even go about finding out?

I guess there are a few different situations you’d need to test for:

Redundant flush (cost of a flush if the address range doesn’t intersect the D or I cache)
D cache flush (cost of flushing data from D cache, but area doesn’t intersect the I cache)
I cache flush (cost of flushing data from I cache, but area doesn’t intersect the D cache). At first glance this might seem like a programming error (someone’s written some code to RAM and then executed it before doing any cache maintenance), but it might also happen if you write a bunch of code, go off and do a bunch of other work (without evicting stale I cache entries from an earlier code run), and then return to do the cache flush before you start executing the new code
D+I cache flush (cost of flush if data intersects both D+I cache)

Time how long it takes to do ranged flush & full flush for each of the above, for various data set sizes, and that should hopefully point you at where a sensible cutoff should be. Although those tests will only really be taking into account the impact on your code – measuring the impact on other code (when you do a full flush and end up evicting valid, active code/data which was in use by something else) will be a bit trickier.
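
The measurement approach above could be organised roughly as follows. This is a Python sketch of the bookkeeping only: the scenario names, `find_cutoff`, `cutoffs_by_scenario` and the `measure(scenario, kind, size)` callback are all hypothetical, and the real timings would have to come from code running on the target machine.

```python
SCENARIOS = ("redundant", "d_cache_only", "i_cache_only", "d_and_i")


def find_cutoff(sizes, ranged_time, full_time):
    """Return the smallest size where a ranged flush is no faster than a
    full flush, or None if the ranged flush always wins."""
    for size in sorted(sizes):
        if ranged_time(size) >= full_time(size):
            return size
    return None


def cutoffs_by_scenario(sizes, measure):
    """Find a cutoff per scenario.

    `measure(scenario, kind, size)` is an assumed callback returning the
    elapsed time for kind 'ranged' or 'full' at the given data set size.
    """
    return {
        s: find_cutoff(
            sizes,
            lambda n, s=s: measure(s, "ranged", n),
            lambda n, s=s: measure(s, "full", n),
        )
        for s in SCENARIOS
    }
```

Comparing the per-scenario results then gives a range of candidate cutoffs rather than a single figure. As noted above, this only captures the cost to the code doing the flush, not the cost to whatever else had useful data in the cache.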

> so it needs some Intel/ARM defined figures which don’t appear to be publicly available.

Probably because too much of it will come down to implementation & circumstance. Memory speeds & latencies, how much data needs to be flushed, how much load there is on the memory bus, etc.

Mar 10, 2015 9:01pm

Jon Abbott (1421) 2651 posts

I’m writing code that’s about to be executed in small chunks (max 128 words) and codelets to a separate area that’s previously not seen the I cache.

I think I can get away with cleaning the D cache and invalidating the I cache by MVA for the code about to be executed and clean just the D cache by MVA for the code that’s not been previously seen. As it exits to execute the code it flushes the write buffer.

I can’t get it working though.
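
For comparison against what the real code does, the sequence described in the post above can be written down as an ordered list of steps. This Python sketch only records the plan and does not claim the plan is sufficient on any particular CPU. The operation names, the function `maintenance_plan`, the exclusive end addresses and the 32-byte line size are assumptions for illustration.

```python
def maintenance_plan(exec_range, new_range, line_size=32):
    """Return ordered (operation, address) steps for the two ranges.

    exec_range: (start, end) of code about to be executed.
    new_range:  (start, end) of code in an area the I cache has not seen.
    Assumption: end addresses are exclusive.
    """
    def lines(start, end):
        addr = start & ~(line_size - 1)   # align down to a line boundary
        while addr < end:
            yield addr
            addr += line_size

    plan = []
    for addr in lines(*exec_range):
        plan.append(("clean_d", addr))        # write new code back to RAM
        plan.append(("invalidate_i", addr))   # drop any stale instructions
    for addr in lines(*new_range):
        plan.append(("clean_d", addr))        # I cache assumed to hold nothing here
    plan.append(("drain_write_buffer", None)) # done last, on exit to the code
    return plan
```

Logging the operations the real routine issues and comparing them with this list is one way to spot a missed line at either end of a range, or an operation that is issued in a different order from the one intended.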

