Possible bug introduced in CLib with ARMv8 support
Jon Abbott (1421) 2651 posts |
I’ve noticed that some CLib-based code that relies on SWP behaviour is no longer working after the ARMv8 LDREX/STREX changes. The behaviour I’m seeing is that CLib doesn’t exit and may be stuck in a loop within AcquireMutex. |
Jeffrey Lee (213) 6048 posts |
What machine, and what software? A test case would be pretty useful. I can’t see anything obviously wrong in the CLib code. There are a few notes in the ARM ARM that suggest the OS should be manually clearing the exclusive monitor at certain points (after context switches, after data aborts, etc.). The OS currently doesn’t do this, and I can see how this could cause issues in some situations (AIUI, failing to clear the exclusive monitor may cause a STREX to succeed in a situation where it should have failed), but I can’t quite get my head around whether it would cause issues with AcquireMutex.

But that’s possibly because I can’t get my head around why AcquireMutex (or the SWP loop it replaces) is even needed – RISC OS/CLib doesn’t do threads, and the address of heapMutex is never leaked outside that file (alloc.c – the only file that uses AcquireMutex), so (a) the only situation where AcquireMutex will be called with the mutex already locked is if the heap code has been re-entered, and (b) because the code has been re-entered the mutex lock attempt will never succeed and the code will just sit there forever. |
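For anyone following along, a minimal sketch of the kind of SWP-based non-counting mutex being referred to (“the SWP loop it replaces”) – illustrative only, not the actual CLib source; the register choice and the 0 = unlocked / 1 = locked convention are assumptions:

        ; Illustrative only. r0 = address of the mutex word; 0 = unlocked, 1 = locked.
swp_lock
        MOV     r1, #1
swp_retry
        SWP     r2, r1, [r0]        ; atomically store 1 and read back the old value
        CMP     r2, #0              ; was it unlocked before our store?
        BNE     swp_retry           ; no: someone else holds it, so spin
        MOV     pc, lr              ; yes: we now own it

swp_unlock
        MOV     r1, #0
        STR     r1, [r0]            ; a plain store is enough to release it
        MOV     pc, lr

ARMv8 removes SWP altogether, which is why these sequences were rewritten in terms of LDREX/STREX in the first place.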
Jeffrey Lee (213) 6048 posts |
Although it is worth pointing out that not all control paths clear the mutex (probably a bug?), so if you’re hitting some heap corruption then that would explain why AcquireMutex is getting stuck (and the change from SWP to LDREX/STREX would just be a red herring).

[edit] Actually it doesn’t look like it’s possible for GetMoreOSHeap to return CORRUPT, so it shouldn’t be possible for that code to be hit. But I guess it’s worth considering that heapMutex itself may have become corrupt. |
Jon Abbott (1421) 2651 posts |
The code that’s not returning is:

When this is executed, I see over 4m loops/sec of the following loop within AcquireMutex:

Does anyone have the builds from just before and after the ARMv8 changes, so I can double-check it is this change causing the problem and not another? Builds dated 28-02-16 and 01-03-16 |
Steve Pampling (1551) 8170 posts |
29-02-16 and 01-03-16? Remind me of the email. |
Jon Abbott (1421) 2651 posts |
jon at jaspp dot org dot uk

Thanks Steve, 29 and 01 should hopefully be pre/post change. |
Steve Pampling (1551) 8170 posts |
Two zip archives en route |
Jon Abbott (1421) 2651 posts |
Steve, they didn’t arrive |
Steve Pampling (1551) 8170 posts |
Resent as two separate mails, 3.15MB each. Edit: Watched the Windows mail client send them and the percentage incrementing; both should be with you by now. (Or available for download) |
Jon Abbott (1421) 2651 posts |
I can confirm that the 29-02-16 build doesn’t get stuck in fread(); the 01-03-16 build does. An observation… to match the OS_UpCall 6 documentation, shouldn’t that loop be coded as:
|
Jeffrey Lee (213) 6048 posts |
That new loop should be functionally equivalent to the current loop. The pollword value is loaded into a1 (and is expected to be non-zero when the mutex is unlocked). a4 is the “did the write fail?” flag that’s returned by STREX (which should be 0 if the write succeeded, 1 if it failed). There are two situations I can think of where I’d expect the code to get stuck in a loop.
Do you have any abort handlers active which could be interfering with loads/stores? You’ve managed to work out that that loop is being executed 4m times/sec, so I guess you’ve got something going on. |
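To make that description concrete, here is a sketch of a claim loop with the behaviour described above (a1 holds the pollword, which is non-zero while the mutex is unlocked; a4 holds the STREX result, 0 on success) – an illustration only, not the actual AcquireMutex source:

        ; a2 = address of the mutex word; non-zero means unlocked.
acquire
        MOV     a3, #0              ; the value we write to mark it locked
try
        LDREX   a1, [a2]            ; a1 = current pollword value
        CMP     a1, #0
        BEQ     try                 ; zero: someone already holds it, try again
        STREX   a4, a3, [a2]        ; looks free: attempt to claim it
        CMP     a4, #0              ; a4 = 0 means the store succeeded
        BNE     try                 ; reservation lost: start again from the LDREX
        MOV     pc, lr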
Jon Abbott (1421) 2651 posts |
Oops, there’s nothing wrong with CLib. My Abort handler was setting Rd to the wrong value when the write succeeded. (walks off with tail between legs) |
Jeffrey Lee (213) 6048 posts |
No problem :-) (Although I am now super-paranoid about any code that’s using LDREX/STREX – the fact that any write to memory can potentially reset the exclusive monitor makes me think that there are at least a couple of bits of code which will need fixing. Plus there are all the places that CLREX needs using) |
Ben Avison (25) 445 posts |
Well, yes and no. It doesn’t provide threading functions, but it doesn’t preclude the use of another library which does, for example DThreads. That’s limited to use for task modules (which is exactly the case it was written to facilitate, unsurprisingly enough). We couldn’t use UnixLib threading in that case because UnixLib doesn’t support modules. A more complex threading library (which also coexists with SharedCLibrary) would be possible, but it would have to install its own timeslicing and non-transient callback handlers to perform thread switching; it would have been usable for pure applications as well as task modules, but since we were only interested in task modules at the time, it was far simpler to re-use TaskWindow to do that grunt work.

That’s all you’d need for a single-core system, but as soon as you go multi-core a threading library would probably be better off coordinating with the kernel in some way, because only the kernel would have a good picture of how the various cores are loaded and where threads might best be assigned. But I digress.
That’s not a worry – any code using LDREX/STREX is supposed to be written as a loop which may be executed multiple times, partly because the exclusive monitor can be reset in all sorts of circumstances, including an interrupt being handled, for example. It’s failure to reset the exclusive monitor when it should have been that’s the case you’ve got to watch out for, and that’s really only an issue for context-switching code (i.e. the Wimp, and any threading libraries that don’t use the Wimp to do the context switching).

To understand why, it’s best to look at an example. LDREX/STREX isn’t just a replacement for SWP, it allows for much richer concurrent programming concepts to be implemented. Let’s look at a counting semaphore: consider two threads accessing the same semaphore, one which is incrementing it and the other which is decrementing it:

Thread A:
loop
        LDREX   a1, [a2]
        ADD     a1, a1, #1
        STREX   a3, a1, [a2]
        TEQ     a3, #1
        BEQ     loop

Thread B:
loop
        LDREX   a1, [a2]
        SUB     a1, a1, #1
        STREX   a3, a1, [a2]
        TEQ     a3, #1
        BEQ     loop

Now consider what happens with a certain pattern of context switches:

Thread A: LDREX   a1, [a2]
Thread A: ADD     a1, a1, #1
          (context switch)
Thread B: LDREX   a1, [a2]
Thread B: SUB     a1, a1, #1
          (context switch)
Thread A: STREX   a3, a1, [a2]

Uh oh. As far as the CPU is concerned, the exclusive monitor points at the address in a2, so the STREX from thread A succeeds – except that the value in a1 wasn’t originally loaded using the LDREX that the CPU thinks is paired with the STREX.

I think if you’re only using LDREX/STREX for non-counting mutexes then you probably get away without resetting the monitor on a context switch, because each thread will only ever be trying to write the “locked” state to the memory location, so the only effect is that the “wrong” thread gets to lock the mutex. And since SWP could only achieve non-counting mutexes, any code which is directly replacing old SWP instructions should be OK. But yes, we should probably be doing CLREX in the Wimp in anticipation of more advanced uses coming along (or STREX to a dummy location for ARMv6/ARMv6T2 CPUs).

What nobody has yet picked me up on with all the LDREX/STREX code sequences I committed for ARMv8 compatibility is that ideally they should be using WFE/SEV to save power. |
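On the WFE/SEV point: the usual shape for such a lock, with the wait and wake-up added, is roughly the following. It is a sketch only – the label names and the 0 = unlocked / 1 = locked convention are assumptions, not anything from the actual sources:

        ; r0 = address of the lock word; 0 = unlocked, 1 = locked.
lock_mutex
        LDREX   r1, [r0]            ; read the current state
        CMP     r1, #1
        WFEEQ                       ; already locked: doze until an event (SEV) arrives
        BEQ     lock_mutex          ; then have another look
        MOV     r1, #1
        STREX   r2, r1, [r0]        ; free: try to claim it
        CMP     r2, #0
        BNE     lock_mutex          ; reservation lost: start again
        DMB                         ; don't let the critical section start early
        MOV     pc, lr

unlock_mutex
        DMB                         ; make the critical section's writes visible first
        MOV     r1, #0
        STR     r1, [r0]            ; release the lock
        DSB                         ; ensure the store is seen before the event
        SEV                         ; wake anything sleeping in WFE
        MOV     pc, lr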
Jeffrey Lee (213) 6048 posts |
“RISC OS/CLib doesn’t do threads”

Ah, fair enough.

“Although I am now super-paranoid about any code that’s using LDREX/STREX – the fact that any write to memory can potentially reset the exclusive monitor makes me think that there are at least a couple of bits of code which will need fixing.”

Are you sure? SyncLib’s atomic_process_smp calls a function pointer in the middle of the LDREX/STREX pair, and it’s reasonable to assume that the function might want to use the stack (especially if it’s written in C). When considering the worst case of a CPU where the exclusive monitor is always reset by any memory write (not sure if any such CPU exists), this can obviously lead to problems.
I think it’s still unsafe, even with non-counting mutexes. Consider the following mutex claim code:

        MOV     a3, #0
loop
        LDREX   a1, [a2]
        CMP     a1, #1
        STREXEQ a4, a3, [a2]
        CMPEQ   a4, #0
        BNE     loop

STREX may falsely succeed if the exclusive monitor is locked but the address in the STREX doesn’t match the locked address. So in a multi-core system, if core A and core B attempt to lock the mutex at the same time, but core A is interrupted between the LDREX and STREX by a routine which issues an LDREX (allowing control to return back to the thread with the exclusive monitor pointing at the wrong address), then core B will correctly lock the mutex (it may even lock it while core A is interrupted), and then core A may also be allowed to lock it due to the exclusive monitor pointing at the wrong address. I guess this could also happen in a single-core environment (thread 1 gets interrupted halfway through the lock, thread 2 runs through the full lock sequence and claims the mutex, and then issues a LDREX for somewhere else. Control then returns to thread 1 and thread 1 is incorrectly allowed to lock the mutex as well).

“What nobody has yet picked me up on with all the LDREX/STREX code sequences I committed for ARMv8 compatibility is that ideally they should be using WFE/SEV to save power.”

Well, I spotted that you weren’t using WFE, but decided to blame that on OS_UpCall 6 rather than on you directly :-) Much better to have the sleep logic all in one place than to have slightly different logic in every program. |
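As an aside, the atomic_process_smp pattern mentioned above – calling out to arbitrary code between the LDREX and the STREX – has roughly this shape. This is a sketch of the idea only, not SyncLib’s actual source:

        ; r0 = address of the word to update
        ; r1 = function taking the old value in r0 and returning the new value in r0
atomic_update_sketch
        STMFD   sp!, {r4, r5, lr}
        MOV     r4, r0              ; keep the word's address
        MOV     r5, r1              ; keep the function pointer
retry
        LDREX   r0, [r4]            ; read the old value; sets the exclusive monitor
        BLX     r5                  ; compute the new value -- the callee may push a
                                    ; stack frame, which is the concern raised above
        STREX   r1, r0, [r4]        ; try to publish the new value
        CMP     r1, #0
        BNE     retry               ; reservation lost: reload and recompute
        LDMFD   sp!, {r4, r5, pc}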
Ben Avison (25) 445 posts |
I think you’re perhaps being a bit pessimistic there. The ARM ARM talks about the impact of stores to an address that doesn’t match a preceding LDREX being implementation defined, but in practice I believe what happens is that the address of the LDREX is tagged, and a certain number of MS bits of the tag address are compared against subsequent stores. The number of significant tag bits varies from CPU to CPU, and typically the granularity matches the cache line size (which itself varies from core to core), but ARM don’t want to tie themselves down even to that equality. But since you wouldn’t be allocating mutexes anywhere near the stack, you’re almost certainly safe to push a stack frame whilst remaining in exclusive access state.

Also, you’re only going to run into trouble with the race condition you describe – between locking two mutexes at different addresses – if the two mutexes are allocated within the same tag granularity. Mutexes are so sparse in RISC OS software that this is probably quite unlikely to happen – but of course the absolutely safe option is to do CLREX on each context switch.
I hadn’t spotted that thread… I’ll reply there. I was mainly thinking of the cases where I didn’t insert an OS_UpCall 6 call though – typically because the previous code hadn’t done so either. Often some actual brain effort would have been required in order to determine exactly when the matching SEV instruction should have been inserted, whereas finding SWP instructions is a mere string search operation :) |
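Back on the tag-granularity point: if two mutex words ever did need to be forced into separate Exclusives Reservation Granules, the layout would be something like this hypothetical fragment. The 64-byte figure is an assumed granule size purely for illustration; the architecture leaves the real size implementation defined, which is why CLREX on context switch remains the safe option:

        ; Illustrative layout only; none of these names exist in any real source.
        AREA    mutexes, DATA, ALIGN=6   ; 2^6 = 64-byte aligned area (assumed granule)
mutex_one
        DCD     0
        ALIGN   64                       ; keep the second word in a different granule
mutex_two
        DCD     0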
Jeffrey Lee (213) 6048 posts |
“When considering the worst case of a CPU where the exclusive monitor is always reset by any memory write (not sure if any such CPU exists) this can obviously lead to problems.”

Yes and no :-) For the local exclusive monitor, you’re right about stores to memory which don’t match the address tag being OK and not affecting the state of the monitor (I was spending too much time looking at the state machine diagram and not at the preceding text). But just now I’ve had a look at the docs for the global monitor, which say that it’s implementation defined whether modification of non-shareable memory can cause the monitor to transition from exclusive to open state. So for the atomic_process case, an atomic update of a variable in a shareable page could be prevented if the callback function is using some non-shareable memory (core-local SVC/IRQ stack?)
Not according to the ARMv7 ARM; for both the local and global monitors, the big tables of state changes show that

The “Load-Exclusive and Store-Exclusive usage restrictions” section is also worth a read. |
Colin (478) 2433 posts |
If you need to put CLREX in any context switch, why can’t you do the above instead? Then if the mutex is re-entered between LDREX and STREX, the CLREX will cause the original STREX to fail. |
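Colin’s code fragment hasn’t survived here; going by the description, the proposal was presumably something along these lines – a reconstruction for illustration only, reusing Jeffrey’s earlier claim loop with a CLREX added at the start:

        ; A guess at the shape of the proposal, not Colin's actual code.
claim
        CLREX                       ; drop any reservation left by a caller that was
                                    ; interrupted between its LDREX and STREX
        MOV     a3, #0              ; "locked" value, as in the earlier example
loop
        LDREX   a1, [a2]
        CMP     a1, #1              ; 1 = unlocked
        STREXEQ a4, a3, [a2]
        CMPEQ   a4, #0
        BNE     loop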
Jeffrey Lee (213) 6048 posts |
This is the example I was thinking of (single-core version):

Thread A: MOV     a3, #0
Thread A: loop
Thread A: LDREX   a1, [a2]
          (control switches to thread B)
Thread B: MOV     a3, #0
Thread B: loop
Thread B: LDREX   a1, [a2]
Thread B: CMP     a1, #1
Thread B: STREXEQ a4, a3, [a2]
Thread B: CMPEQ   a4, #0
Thread B: BNE     loop
          (control switches to thread C)
Thread C: LDREX   xx, [somewhere_random]
          (control switches back to thread A)
Thread A: CMP     a1, #1
Thread A: STREXEQ a4, a3, [a2]
Thread A: CMPEQ   a4, #0
Thread A: BNE     loop

When the CPU switches back from thread C to thread A, thread A will think the mutex is free (because it was free at the time of the LDREX), and so will issue the STREX. But the exclusive monitor will be pointing at somewhere_random. For this case, it doesn’t actually matter what address somewhere_random is – if it’s in the same address range as a2 it will succeed (because nothing’s happened between the LDREX and STREX). If it’s not in the same address range it’ll be implementation defined whether it succeeds. So there’s always a chance both threads A and B will be allowed to lock the mutex.

Adding a CLREX before every LDREX won’t help, because LDREX always resets the exclusive monitor to point to the indicated address. This means the CLREX has to come after the LDREX, i.e. in the OS context switch code (or you need to do something horrible like disable IRQs around all LDREX/STREX sequences).

For the multi-core version of the same problem, thread B simply needs to be running on another core (and thread C, or the OS, needs to do enough work between leaving and re-entering thread A in order to allow B to complete the lock sequence). |
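For completeness, the fix being argued for lives in the context-switch path rather than in the lock itself – something like the following sketch. The routine names are hypothetical and this is not actual Wimp or kernel code; the dummy-STREX variant is the ARMv6/ARMv6T2 fallback Ben mentioned earlier:

        ; ARMv6K and later: CLREX discards whatever reservation the outgoing
        ; thread was holding.
drop_reservation_v6k
        CLREX
        MOV     pc, lr

        ; ARMv6/ARMv6T2 (LDREX/STREX but no CLREX): a STREX to a dummy word
        ; clears the local monitor whether or not the store "succeeds".
drop_reservation_v6
        ADR     r0, dummy_word
        STREX   r1, r0, [r0]        ; stored value and result are both ignored
        MOV     pc, lr

dummy_word
        DCD     0                   ; must live somewhere writable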
Ben Avison (25) 445 posts |
Ohhkayy… I’ll have some of what ARM’s having. LDREX and STREX used to make some sort of sense back in ARMv6, but it seems to have become rather convoluted!

On the one hand, they’re going to immense pains to describe in detail how the only guaranteed transitions of each per-core state machine in the global monitor are LDREX by the core associated with that state machine, and successful stores (STREX or otherwise) by other cores to the same Exclusives Reservation Granule as the LDREX. Fair enough: any stores by the same core that issued the LDREX will set the local monitor to open access, so who cares whether the global monitor for that core’s state machine is in open or exclusive access (I note that for the same reason, CLREX isn’t required to affect the global monitor at all).

On the other hand, it does look like they’re saying that a “modification” (sloppy terminology – nowhere do they define what that means, but I suppose we have to take it to mean Store or StoreExcl) to a non-shareable memory location might cause a transition of the global monitor state machine. Which raises the question: why would a core be running accesses to non-shareable memory locations past the global monitor in the first place? Could it be an attempt to “solve” a silicon bug by changing the documentation – and if so, is it any silicon we’re ever going to care about?

Or perhaps – looking at the other half of that bullet point list in the ARM ARM – what they’re really trying to say is that any state transition in the local monitor may cause an equivalent state transition in the global monitor, in which case “modification” would be better described as “Store or StoreExcl which sets the local monitor to Open Access”. |
Colin (478) 2433 posts |
It seems to me that the ARM document “ARM Synchronization Primitives” is like a mutex. You have to keep reading it until the apparent contradictions become clear. |
Rick Murray (539) 13840 posts |
I’m looking at the document and came across this:
I’m glad they added the final part of that paragraph, because the earlier part of the document says that old-style tricks (like disabling interrupts) are really bad on multitasking systems (hehe, hello RISC OS…), so to follow that up with a suggestion to stall the entire processor into a low-power waiting state seems even worse; not to mention that one should ideally only stall the processor when you know something is going to poke it.

As if the ARM version numbering wasn’t bad enough, that document describes “ARMv6K” and the like. I feel like I wanna go bang my head on the wall.
That, there, is like a textbook example of what is wrong with a PMT system. If the call fails, why can’t the process kick itself back to the scheduler to force a context switch? Instead it low-powers the processor and waits for the interrupt that triggers the scheduler? Are PMT systems really that braindead? I’ll leave it to somebody else to point out the more technical issues. I’m way too tired… ;-) |
Jeffrey Lee (213) 6048 posts |
Looking at the list of situations where CLREX is required, I think it basically boils down to the following:
Anybody spot anything I’ve missed? |
Jeffrey Lee (213) 6048 posts |
It looks like OS_SetCallBack will be a bit of a nuisance here. For the vectored callbacks (OS_AddCallBack) we can easily insert a CLREX after they’ve been processed, but for the callback environment handler it’s the handler itself which returns directly to the foreground code. So we’ll basically have to make it the rule that all callback environment handlers issue a CLREX before they return. If we were to perform a CLREX before entering the callback handler, then the rule could be weakened a bit so that it’s only handlers which call SWIs which need to CLREX. But I’m not sure how useful that would be (I’d imagine most callback handlers would want to use SWIs). Or we could make the kernel CLREX on return from every SWI, but that feels a bit heavy-handed! |