Possible bug introduced in CLib with ARMv8 support
Jon Abbott (1421) 2651 posts |
I’ve noticed that some CLib-based code that relies on SWP behaviour is no longer working after the ARMv8 LDREX/STREX changes. The behaviour I’m seeing is that CLib doesn’t exit and may be stuck in a loop within AcquireMutex. |
Jeffrey Lee (213) 6048 posts |
What machine, and what software? A test case would be pretty useful. I can’t see anything obviously wrong in the CLib code. There are a few notes in the ARM ARM that suggest the OS should be manually clearing the exclusive monitor at certain points (after context switches, after data aborts, etc.). The OS currently doesn’t do this, and I can see how this could cause issues in some situations (AIUI, failing to clear the exclusive monitor may cause a STREX to succeed in a situation where it should have failed), but I can’t quite get my head around whether it would cause issues with AcquireMutex.

But that’s possibly because I can’t get my head around why AcquireMutex (or the SWP loop it replaces) is even needed – RISC OS/CLib doesn’t do threads, and the address of heapMutex is never leaked outside that file (alloc.c – the only file that uses AcquireMutex), so (a) the only situation where AcquireMutex will be called with the mutex already locked is if the heap code has been re-entered, and (b) because the code has been re-entered the mutex lock attempt will never succeed and the code will just sit there forever. |
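For anyone following along, a minimal sketch of the kind of SWP-based non-counting mutex being referred to (“the SWP loop it replaces”) – illustrative only, not the actual CLib source; the register choice and the 0 = unlocked / 1 = locked convention are assumptions:

        ; Illustrative only. r0 = address of the mutex word; 0 = unlocked, 1 = locked.
swp_lock
        MOV     r1, #1
swp_retry
        SWP     r2, r1, [r0]        ; atomically store 1 and read back the old value
        CMP     r2, #0              ; was it unlocked before our store?
        BNE     swp_retry           ; no: someone else holds it, so spin
        MOV     pc, lr              ; yes: we now own it

swp_unlock
        MOV     r1, #0
        STR     r1, [r0]            ; a plain store is enough to release it
        MOV     pc, lr

ARMv8 removes SWP altogether, which is why these sequences were rewritten in terms of LDREX/STREX in the first place.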
Jeffrey Lee (213) 6048 posts |
Although it is worth pointing out that not all control paths clear the mutex (probably a bug?), so if you’re hitting some heap corruption then that would explain why AcquireMutex is getting stuck (and the change from SWP to LDREX/STREX would just be a red herring).

[edit] Actually it doesn’t look like it’s possible for GetMoreOSHeap to return CORRUPT, so it shouldn’t be possible for that code to be hit. But I guess it’s worth considering that heapMutex itself may have become corrupt. |
Jon Abbott (1421) 2651 posts |
The code that’s not returning is:

When this is executed, I see over 4m loops/sec of the following loop within AcquireMutex:

Does anyone have the builds from just before and after the ARMv8 changes, so I can double-check it is this change causing the problem and not another? Builds dated 28-02-16 and 01-03-16 |
Steve Pampling (1551) 8170 posts |
29-02-16 and 01-03-16? Remind me of the email. |
Jon Abbott (1421) 2651 posts |
jon at jaspp dot org dot uk

Thanks Steve, 29 and 01 should hopefully be pre/post change. |
Steve Pampling (1551) 8170 posts |
Two zip archives en route |
Jon Abbott (1421) 2651 posts |
Steve, they didn’t arrive |
Steve Pampling (1551) 8170 posts |
Resent as two separate mails, 3.15MB each. Edit: Watched the Windows mail client send them and the percentage incrementing; both should be with you by now. (Or available for download) |
Jon Abbott (1421) 2651 posts |
I can confirm that the 29-02-16 build doesn’t get stuck in fread(); the 01-03-16 build does. An observation… to match the OS_UpCall 6 documentation, shouldn’t that loop be coded as:
|
Jeffrey Lee (213) 6048 posts |
That new loop should be functionally equivalent to the current loop. The pollword value is loaded into a1 (and is expected to be non-zero when the mutex is unlocked). a4 is the “did the write fail?” flag that’s returned by STREX (which should be 0 if the write succeeded, 1 if it failed). There are two situations I can think of where I’d expect the code to get stuck in a loop.
Do you have any abort handlers active which could be interfering with loads/stores? You’ve managed to work out that that loop is being executed 4m times/sec, so I guess you’ve got something going on. |
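To make that description concrete, here is a sketch of a claim loop with the behaviour described above (a1 holds the pollword, which is non-zero while the mutex is unlocked; a4 holds the STREX result, 0 on success) – an illustration only, not the actual AcquireMutex source:

        ; a2 = address of the mutex word; non-zero means unlocked.
acquire
        MOV     a3, #0              ; the value we write to mark it locked
try
        LDREX   a1, [a2]            ; a1 = current pollword value
        CMP     a1, #0
        BEQ     try                 ; zero: someone already holds it, try again
        STREX   a4, a3, [a2]        ; looks free: attempt to claim it
        CMP     a4, #0              ; a4 = 0 means the store succeeded
        BNE     try                 ; reservation lost: start again from the LDREX
        MOV     pc, lr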
Jon Abbott (1421) 2651 posts |
Oops, there’s nothing wrong with CLib. My Abort handler was setting Rd to the wrong value when the write succeeded. (walks off with tail between legs) |
Jeffrey Lee (213) 6048 posts |
No problem :-) (Although I am now super-paranoid about any code that’s using LDREX/STREX – the fact that any write to memory can potentially reset the exclusive monitor makes me think that there are at least a couple of bits of code which will need fixing. Plus there are all the places that CLREX needs using) |
Ben Avison (25) 445 posts |
Well, yes and no. It doesn’t provide threading functions, but it doesn’t preclude the use of another library which does, for example DThreads. That’s limited to use for task modules (which is exactly the case it was written to facilitate, unsurprisingly enough). We couldn’t use UnixLib threading in that case because UnixLib doesn’t support modules. A more complex threading library (which also coexists with SharedCLibrary) would be possible, but it would have to install its own timeslicing and non-transient callback handlers to perform thread switching; it would have been usable for pure applications as well as task modules, but since we were only interested in task modules at the time, it was far simpler to re-use TaskWindow to do that grunt work.

That’s all you’d need for a single-core system, but as soon as you go multi-core a threading library would probably be better off coordinating with the kernel in some way, because only the kernel would have a good picture of how the various cores are loaded and where threads might best be assigned. But I digress.
That’s not a worry – any code using LDREX/STREX is supposed to be written as a loop which may be executed multiple times, partly because the exclusive monitor can be reset in all sorts of circumstances, including an interrupt being handled, for example. It’s failure to reset the exclusive monitor when it should have been that’s the case you’ve got to watch out for, and that’s really only an issue for context-switching code (i.e. the Wimp, and any threading libraries that don’t use the Wimp to do the context switching).

To understand why, it’s best to look at an example. LDREX/STREX isn’t just a replacement for SWP, it allows for much richer concurrent programming concepts to be implemented. Let’s look at a counting semaphore: consider two threads accessing the same semaphore, one which is incrementing it and the other which is decrementing it:

Thread A:
loop
        LDREX   a1, [a2]
        ADD     a1, a1, #1
        STREX   a3, a1, [a2]
        TEQ     a3, #1
        BEQ     loop

Thread B:
loop
        LDREX   a1, [a2]
        SUB     a1, a1, #1
        STREX   a3, a1, [a2]
        TEQ     a3, #1
        BEQ     loop

Now consider what happens with a certain pattern of context switches:

Thread A: LDREX   a1, [a2]
Thread A: ADD     a1, a1, #1
          (context switch)
Thread B: LDREX   a1, [a2]
Thread B: SUB     a1, a1, #1
          (context switch)
Thread A: STREX   a3, a1, [a2]

Uh oh. As far as the CPU is concerned, the exclusive monitor points at the address in a2, so the STREX from thread A succeeds – except that the value in a1 wasn’t originally loaded using the LDREX that the CPU thinks is paired with the STREX.

I think if you’re only using LDREX/STREX for non-counting mutexes then you probably get away without resetting the monitor on a context switch, because each thread will only ever be trying to write the “locked” state to the memory location, so the only effect is that the “wrong” thread gets to lock the mutex. And since SWP could only achieve non-counting mutexes, any code which is directly replacing old SWP instructions should be OK. But yes, we should probably be doing CLREX in the Wimp in anticipation of more advanced uses coming along (or STREX to a dummy location for ARMv6/ARMv6T2 CPUs).

What nobody has yet picked me up on with all the LDREX/STREX code sequences I committed for ARMv8 compatibility is that ideally they should be using WFE/SEV to save power. |
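On the WFE/SEV point: the usual shape for such a lock, with the wait and wake-up added, is roughly the following. It is a sketch only – the label names and the 0 = unlocked / 1 = locked convention are assumptions, not anything from the actual sources:

        ; r0 = address of the lock word; 0 = unlocked, 1 = locked.
lock_mutex
        LDREX   r1, [r0]            ; read the current state
        CMP     r1, #1
        WFEEQ                       ; already locked: doze until an event (SEV) arrives
        BEQ     lock_mutex          ; then have another look
        MOV     r1, #1
        STREX   r2, r1, [r0]        ; free: try to claim it
        CMP     r2, #0
        BNE     lock_mutex          ; reservation lost: start again
        DMB                         ; don't let the critical section start early
        MOV     pc, lr

unlock_mutex
        DMB                         ; make the critical section's writes visible first
        MOV     r1, #0
        STR     r1, [r0]            ; release the lock
        DSB                         ; ensure the store is seen before the event
        SEV                         ; wake anything sleeping in WFE
        MOV     pc, lr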
Jeffrey Lee (213) 6048 posts |
“RISC OS/CLib doesn’t do threads”

Ah, fair enough.

“Although I am now super-paranoid about any code that’s using LDREX/STREX – the fact that any write to memory can potentially reset the exclusive monitor makes me think that there are at least a couple of bits of code which will need fixing.”

Are you sure? SyncLib’s atomic_process_smp calls a function pointer in the middle of the LDREX/STREX pair, and it’s reasonable to assume that the function might want to use the stack (especially if it’s written in C). When considering the worst case of a CPU where the exclusive monitor is always reset by any memory write (not sure if any such CPU exists), this can obviously lead to problems.
I think it’s still unsafe, even with non-counting mutexes. Consider the following mutex claim code:

        MOV     a3, #0
loop
        LDREX   a1, [a2]
        CMP     a1, #1
        STREXEQ a4, a3, [a2]
        CMPEQ   a4, #0
        BNE     loop

STREX may falsely succeed if the exclusive monitor is locked but the address in the STREX doesn’t match the locked address. So in a multi-core system, if core A and core B attempt to lock the mutex at the same time, but core A is interrupted between the LDREX and STREX by a routine which issues an LDREX (allowing control to return back to the thread with the exclusive monitor pointing at the wrong address), then core B will correctly lock the mutex (it may even lock it while core A is interrupted), and then core A may also be allowed to lock it due to the exclusive monitor pointing at the wrong address. I guess this could also happen in a single-core environment (thread 1 gets interrupted halfway through the lock, thread 2 runs through the full lock sequence and claims the mutex, and then issues a LDREX for somewhere else. Control then returns to thread 1 and thread 1 is incorrectly allowed to lock the mutex as well).

“What nobody has yet picked me up on with all the LDREX/STREX code sequences I committed for ARMv8 compatibility is that ideally they should be using WFE/SEV to save power.”

Well, I spotted that you weren’t using WFE, but decided to blame that on OS_UpCall 6 rather than on you directly :-) Much better to have the sleep logic all in one place than to have slightly different logic in every program. |
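As an aside, the atomic_process_smp pattern mentioned above – calling out to arbitrary code between the LDREX and the STREX – has roughly this shape. This is a sketch of the idea only, not SyncLib’s actual source:

        ; r0 = address of the word to update
        ; r1 = function taking the old value in r0 and returning the new value in r0
atomic_update_sketch
        STMFD   sp!, {r4, r5, lr}
        MOV     r4, r0              ; keep the word's address
        MOV     r5, r1              ; keep the function pointer
retry
        LDREX   r0, [r4]            ; read the old value; sets the exclusive monitor
        BLX     r5                  ; compute the new value -- the callee may push a
                                    ; stack frame, which is the concern raised above
        STREX   r1, r0, [r4]        ; try to publish the new value
        CMP     r1, #0
        BNE     retry               ; reservation lost: reload and recompute
        LDMFD   sp!, {r4, r5, pc}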
Ben Avison (25) 445 posts |
I think you’re perhaps being a bit pessimistic there. The ARM ARM talks about the impact of stores to an address that doesn’t match a preceding LDREX being implementation defined, but in practice I believe what happens is that the address of the LDREX is tagged, and a certain number of MS bits of the tag address are compared against subsequent stores. The number of significant tag bits varies from CPU to CPU, and typically the granularity matches the cache line size (which itself varies from core to core), but ARM don’t want to tie themselves down even to that equality. But since you wouldn’t be allocating mutexes anywhere near the stack, you’re almost certainly safe to push a stack frame whilst remaining in exclusive access state.

Also, you’re only going to run into trouble with the race condition you describe – between locking two mutexes at different addresses – if the two mutexes are allocated within the same tag granularity. Mutexes are so sparse in RISC OS software that this is probably quite unlikely to happen – but of course the absolutely safe option is to do CLREX on each context switch.
I hadn’t spotted that thread… I’ll reply there. I was mainly thinking of the cases where I didn’t insert an OS_UpCall 6 call though – typically because the previous code hadn’t done so either. Often some actual brain effort would have been required in order to determine exactly when the matching SEV instruction should have been inserted, whereas finding SWP instructions is a mere string search operation :) |
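Back on the tag-granularity point: if two mutex words ever did need to be forced into separate Exclusives Reservation Granules, the layout would be something like this hypothetical fragment. The 64-byte figure is an assumed granule size purely for illustration; the architecture leaves the real size implementation defined, which is why CLREX on context switch remains the safe option:

        ; Illustrative layout only; none of these names exist in any real source.
        AREA    mutexes, DATA, ALIGN=6   ; 2^6 = 64-byte aligned area (assumed granule)
mutex_one
        DCD     0
        ALIGN   64                       ; keep the second word in a different granule
mutex_two
        DCD     0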
Jeffrey Lee (213) 6048 posts |
“When considering the worst case of a CPU where the exclusive monitor is always reset by any memory write (not sure if any such CPU exists) this can obviously lead to problems.”

Yes and no :-) For the local exclusive monitor, you’re right about stores to memory which don’t match the address tag being OK and not affecting the state of the monitor (I was spending too much time looking at the state machine diagram and not at the preceding text). But just now I’ve had a look at the docs for the global monitor, which say that it’s implementation defined whether modification of non-shareable memory can cause the monitor to transition from exclusive to open state. So for the atomic_process case, an atomic update of a variable in a shareable page could be prevented if the callback function is using some non-shareable memory (core-local SVC/IRQ stack?)
Not according to the ARMv7 ARM; for both the local and global monitors, the big tables of state changes show that

The “Load-Exclusive and Store-Exclusive usage restrictions” section is also worth a read. |
Colin (478) 2433 posts |
If you need to put CLREX in any context switch, why can’t you do the above instead? Then if the mutex is re-entered between LDREX and STREX, the CLREX will cause the original STREX to fail. |
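Colin’s code fragment hasn’t survived here; going by the description, the proposal was presumably something along these lines – a reconstruction for illustration only, reusing Jeffrey’s earlier claim loop with a CLREX added at the start:

        ; A guess at the shape of the proposal, not Colin's actual code.
claim
        CLREX                       ; drop any reservation left by a caller that was
                                    ; interrupted between its LDREX and STREX
        MOV     a3, #0              ; "locked" value, as in the earlier example
loop
        LDREX   a1, [a2]
        CMP     a1, #1              ; 1 = unlocked
        STREXEQ a4, a3, [a2]
        CMPEQ   a4, #0
        BNE     loop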
Jeffrey Lee (213) 6048 posts |
This is the example I was thinking of (single-core version):

Thread A: MOV     a3, #0
Thread A: loop
Thread A: LDREX   a1, [a2]
          (control switches to thread B)
Thread B: MOV     a3, #0
Thread B: loop
Thread B: LDREX   a1, [a2]
Thread B: CMP     a1, #1
Thread B: STREXEQ a4, a3, [a2]
Thread B: CMPEQ   a4, #0
Thread B: BNE     loop
          (control switches to thread C)
Thread C: LDREX   xx, [somewhere_random]
          (control switches back to thread A)
Thread A: CMP     a1, #1
Thread A: STREXEQ a4, a3, [a2]
Thread A: CMPEQ   a4, #0
Thread A: BNE     loop

When the CPU switches back from thread C to thread A, thread A will think the mutex is free (because it was free at the time of the LDREX), and so will issue the STREX. But the exclusive monitor will be pointing at somewhere_random. For this case, it doesn’t actually matter what address somewhere_random is – if it’s in the same address range as a2 it will succeed (because nothing’s happened between the LDREX and STREX). If it’s not in the same address range it’ll be implementation defined whether it succeeds. So there’s always a chance both threads A and B will be allowed to lock the mutex.

Adding a CLREX before every LDREX won’t help, because LDREX always resets the exclusive monitor to point to the indicated address. This means the CLREX has to come after the LDREX, i.e. in the OS context switch code (or you need to do something horrible like disable IRQs around all LDREX/STREX sequences).

For the multi-core version of the same problem, thread B simply needs to be running on another core (and thread C, or the OS, needs to do enough work between leaving and re-entering thread A in order to allow B to complete the lock sequence). |
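For completeness, the fix being argued for lives in the context-switch path rather than in the lock itself – something like the following sketch. The routine names are hypothetical and this is not actual Wimp or kernel code; the dummy-STREX variant is the ARMv6/ARMv6T2 fallback Ben mentioned earlier:

        ; ARMv6K and later: CLREX discards whatever reservation the outgoing
        ; thread was holding.
drop_reservation_v6k
        CLREX
        MOV     pc, lr

        ; ARMv6/ARMv6T2 (LDREX/STREX but no CLREX): a STREX to a dummy word
        ; clears the local monitor whether or not the store "succeeds".
drop_reservation_v6
        ADR     r0, dummy_word
        STREX   r1, r0, [r0]        ; stored value and result are both ignored
        MOV     pc, lr

dummy_word
        DCD     0                   ; must live somewhere writable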
Ben Avison (25) 445 posts |
Ohhkayy… I’ll have some of what ARM’s having. LDREX and STREX used to make some sort of sense back in ARMv6, but it seems to have become rather convoluted!

On the one hand, they’re going to immense pains to describe in detail how the only guaranteed transitions of each per-core state machine in the global monitor are LDREX by the core associated with that state machine, and successful stores (STREX or otherwise) by other cores to the same Exclusives Reservation Granule as the LDREX. Fair enough: any stores by the same core that issued the LDREX will set the local monitor to open access, so who cares whether the global monitor for that core’s state machine is in open or exclusive access (I note that for the same reason, CLREX isn’t required to affect the global monitor at all).

On the other hand, it does look like they’re saying that a “modification” (sloppy terminology – nowhere do they define what that means, but I suppose we have to take it to mean Store or StoreExcl) to a non-shareable memory location might cause a transition of the global monitor state machine. Which raises the question: why would a core be running accesses to non-shareable memory locations past the global monitor in the first place? Could it be an attempt to “solve” a silicon bug by changing the documentation – and if so, is it any silicon we’re ever going to care about?

Or perhaps – looking at the other half of that bullet point list in the ARM ARM – what they’re really trying to say is that any state transition in the local monitor may cause an equivalent state transition in the global monitor, in which case “modification” would be better described as “Store or StoreExcl which sets the local monitor to Open Access”. |
Colin (478) 2433 posts |
It seems to me that the ARM document “ARM Synchronization Primitives” is like a mutex. You have to keep reading it until the apparent contradictions become clear. |
Rick Murray (539) 13840 posts |
I’m looking at the document and came across this:
I’m glad they added the final part of that paragraph, because the earlier part of the document says that old-style tricks (like disabling interrupts) are really bad on multitasking systems (hehe, hello RISC OS…), so to follow that up with a suggestion to stall the entire processor into a low-power waiting state seems even worse; not to mention that one should ideally only stall the processor when you know something is going to poke it.

As if the ARM version numbering wasn’t bad enough, that document describes “ARMv6K” and the like. I feel like I wanna go bang my head on the wall.
That, there, is like a textbook example of what is wrong with a PMT system. If the call fails, why can’t the process kick itself back to the scheduler to force a context switch? Instead it low-powers the processor and waits for the interrupt that triggers the scheduler? Are PMT systems really that braindead? I’ll leave it to somebody else to point out the more technical issues. I’m way too tired… ;-) |
Jeffrey Lee (213) 6048 posts |
Looking at the list of situations where CLREX is required, I think it basically boils down to the following:
Anybody spot anything I’ve missed? |
Jeffrey Lee (213) 6048 posts |
It looks like OS_SetCallBack will be a bit of a nuisance here. For the vectored callbacks (OS_AddCallBack) we can easily insert a CLREX after they’ve been processed, but for the callback environment handler it’s the handler itself which returns directly to the foreground code. So we’ll basically have to make it the rule that all callback environment handlers issue a CLREX before they return. If we were to perform a CLREX before entering the callback handler, then the rule could be weakened a bit so that it’s only handlers which call SWIs which need to CLREX. But I’m not sure how useful that would be (I’d imagine most callback handlers would want to use SWIs). Or we could make the kernel CLREX on return from every SWI, but that feels a bit heavy-handed! |