Machine hang in USR mode C
Pages: 1 2
Adrian Lees (8595) 14 posts |
Firstly, apologies because I discovered this serious bug many months ago, having distilled it down from a programming error in real code… yet I still haven’t got around to investigating it properly:

    unsigned data[1];

    int main(void)
    {
      unsigned *p = data;
      while (1) *p++ = 0;
      return 0;
    }

Compiled with either Norcroft+CLib ‘cc -o boom c.boom’ or the GCC ELF toolchain (UnixLib, I believe) ‘gcc -o boome c.boom’, this obviously broken code will cause a machine hang, either from within a TaskWindow or at ShellCLI outside of the desktop. Clearly the code is broken, but it’s the kind of thing that does happen when loop termination conditions are wrong, is very difficult to debug, and is apt to deter new develop(ers|ment). Perhaps somebody else can make more progress investigating it, please? I do know that it only happens within C code; a simple assembler-only loop that fills the app slot and runs off the end just aborts as you’d expect, so it may be something relating to Environment Handlers or stack corruption? Either way this clearly shouldn’t be possible in USR mode! |
Rick Murray (539) 13840 posts |
Just off the top of my head, I think C programs are laid out like:
And what may be happening is that your program is running from ‘data’, through all the stack frames and stuff, and hitting the end. This raises an exception, which the library attempts to handle and fails miserably. The hang may be because we’re still in SVC mode as the exception is being ungracefully vomited out like a giant furball. Might be worth taking a peek to see how the system actually handles exceptions like that – maybe the answer will be clear if it’s trying to make sense of unwinding a stack where everything is null?
Looks like CLib needs to be a little more forgiving and actually check that the data it has read makes sense before acting upon it. |
André Timmermans (100) 655 posts |
Like David said the global “data” is probably at an address below the code so *p++ = 0 gets overwritten and you are just left with an infinite loop. And as we know an infinite loop with pure user code is unbreakable under RISC OS. If instead you had put data within the function, it would have been allocated on stack above the code and it would have worked as intended. |
Rick Murray (539) 13840 posts |
Alt-Break? |
Ian Cook (420) 11 posts |
Also the on/off switch, or removing the mains power, will work. |
Stuart Swales (1481) 351 posts |
Rick, as is usually the case, is correct. |
Adrian Lees (8595) 14 posts |
Code is not ordinarily relocated, and not in this case. As Rick says, the memory layout is [code, RW static data, ZI static data, heap] and the heap includes the chunked stack. When I started poking at this a while ago I was looking at the exception handling in SharedCLibrary, but I was surprised this time to find that it happens with GCC/UnixLib/ELF too. It would appear to indicate that both exception handlers/stack backtracers/whatever need scrutinising and need to be a lot less trusting. I suspect (/greatly hope) that it’s not a problem that has infected the ‘abort handling and dispatch to environment handler’ code of the RISC OS kernel itself! |
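For anyone wanting to check the [code, RW static data, ZI static data, heap] ordering on their own setup, here is a minimal portable sketch. The names `rw_data`, `zi_data` and `print_layout` are made up for illustration, and the ordering it prints is whatever the toolchain and OS in use happen to produce, so it is not guaranteed to match the layout described above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

unsigned rw_data[1] = { 1 };  /* initialised, so read-write (RW) static data */
unsigned zi_data[1];          /* uninitialised, so zero-initialised (ZI) static data */

/* Print one address from each region.
   Returns 0 on success, -1 if the heap allocation fails. */
int print_layout(void)
{
    void *heap_block = malloc(16);
    if (heap_block == NULL)
        return -1;

    /* Converting a function pointer to an integer is implementation-defined,
       but works on the flat address spaces this sketch assumes. */
    printf("code : %p\n", (void *)(uintptr_t)&print_layout);
    printf("RW   : %p\n", (void *)rw_data);
    printf("ZI   : %p\n", (void *)zi_data);
    printf("heap : %p\n", heap_block);

    free(heap_block);
    return 0;
}
```

Calling `print_layout()` from `main` and comparing the four addresses shows which regions sit above `data`, and therefore what a runaway `*p++ = 0` starting there would trample first.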
Adrian Lees (8595) 14 posts |
I think what may be happening is that an environment handler has been installed to handle Events and, whilst SharedCLibrary only wants the Escape event, other background events may well be enabled, and I suspect the ‘pass on’ details of the previous handler have been corrupted. I may try to confirm this experimentally at some point but it’s not immediately clear to me how to go about that. |
Jon Abbott (1421) 2651 posts |
Having spent more hours than I care to imagine debugging hangs in C code, without fail, all have been down to the environment handlers triggering locks or infinite looping errors. In several games, I went so far as to remove the error handlers as they created more problems than they solved, and in several others I “bug fixed” the code. From the original description it sounds like you’re triggering the same issue. My guess would be that the Error handler is internally aborting. The easiest way to prove this is to temporarily remove the C error handler; I just NOP out the Environment handler SWI calls in the compiled Absolute. |
Julie Stamp (8365) 474 posts |
I’d like to see these hangs stop as well. In a previous thread, Jeffrey identified the problem as an abort in the abort handler, due to the stubs, and all of application space subsequently, being overwritten with NOPs. He explained that there are two reasons Alt-Break doesn’t work in this situation:
Some fixes were proposed including:
There was also discussion of how to make the Alt-Break watchdog work in more situations, but it looks to me that there are unsolved problems with doing that. |
Rick Murray (539) 13840 posts |
Would it not be possible to check that there’s a valid instruction before jumping to it in a way that might otherwise toast the machine? If not, the abort handler can pick up the pieces by itself (just cancel the app entirely, it’s clearly trashed). |
Jon Abbott (1421) 2651 posts |
The way to work around this scenario is to check if the Aborting address is within the abort handler (ie self-generated) and collapse the stack, enable IRQ etc and hand off to the OS Exit handler. Making executable code read-only is a bit too “implementation defined”, so can’t be relied on. Putting stubs into a DA won’t work as the code needs to branch to the LDR’s. |
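A minimal sketch of the "is the aborting address inside the abort handler?" test might look like the following. This is not kernel or CLib code: `handler_range`, `abort_action` and both functions are hypothetical names, and the sketch assumes the handler's code occupies one contiguous address range known in advance. The stack collapse and hand-off to the exit handler are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of where the abort handler's code lives. */
typedef struct {
    uintptr_t start;  /* address of the first byte of handler code */
    uintptr_t end;    /* one past the last byte of handler code */
} handler_range;

typedef enum {
    ABORT_PASS_TO_HANDLER,  /* ordinary abort: let the handler deal with it */
    ABORT_BAIL_OUT          /* handler itself aborted: give up on the app */
} abort_action;

/* True when the aborting address lies inside the handler's own code. */
bool abort_is_self_generated(const handler_range *h, uintptr_t abort_pc)
{
    assert(h->start <= h->end);
    return abort_pc >= h->start && abort_pc < h->end;
}

/* Decide what to do with an abort raised at abort_pc. */
abort_action classify_abort(const handler_range *h, uintptr_t abort_pc)
{
    return abort_is_self_generated(h, abort_pc) ? ABORT_BAIL_OUT
                                                : ABORT_PASS_TO_HANDLER;
}
```

The point of keeping the decision separate from the recovery is that the check itself touches no application memory, so it still works when the application's workspace has been wiped.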
Steve Pampling (1551) 8170 posts |
Sort of begs to be done as a thing handled by a pre-emptive task switcher that hands ‘control’ to “normal” RISC OS at intervals. |
Rick Murray (539) 13840 posts |
Or have a secondary task killer that is able to attempt to recover the system. It seems there are a number of places that flatten the supervisor stack, drop to user mode, and fall back into application space. I wonder if a special secret hotkey might be able to do the same at a time when interrupts are still working yet something has upset the system to the point where it’s just frozen up. |
Alan Adams (2486) 1149 posts |
Surely the problem is that detecting hotkeys needs system cooperation. Maybe something more akin to a watchdog routine, to detect that something (like TickerV) hasn’t happened recently. |
Jeffrey Lee (213) 6048 posts |
Yes, but that wouldn’t help for the example that Adrian gave. &00000000 = ANDEQ R0,R0,R0. You could have it check for more instructions, but then you find yourself in danger of trying to solve the halting problem.
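To see why a zeroed word still counts as a valid instruction, here is a small sketch that picks apart the classic 32-bit ARM data-processing fields. `format_simple_insn` is a made-up helper that only handles three-operand register forms with no shift and no S bit; it ignores multiplies and everything else, and returns -1 for those.

```c
#include <stdint.h>
#include <stdio.h>

/* Condition field (bits 31-28); AL is conventionally written as nothing. */
static const char *const cond_names[16] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "",   "NV"
};

/* Opcode field (bits 24-21); NULL marks forms this sketch does not print
   (TST, TEQ, CMP, CMN, MOV, MVN take fewer operands). */
static const char *const op_names[16] = {
    "AND", "EOR", "SUB", "RSB", "ADD", "ADC", "SBC", "RSC",
    NULL,  NULL,  NULL,  NULL,  "ORR", NULL,  "BIC", NULL
};

/* Write a mnemonic for a simple register-operand data-processing
   instruction into buf. Returns 0 on success, -1 if unsupported. */
int format_simple_insn(uint32_t insn, char *buf, size_t len)
{
    unsigned cond   = insn >> 28;
    unsigned klass  = (insn >> 26) & 0x3;   /* 00 = data processing */
    unsigned imm    = (insn >> 25) & 0x1;   /* 1 = immediate operand */
    unsigned opcode = (insn >> 21) & 0xF;
    unsigned s      = (insn >> 20) & 0x1;   /* 1 = update flags */
    unsigned rn     = (insn >> 16) & 0xF;
    unsigned rd     = (insn >> 12) & 0xF;
    unsigned shift  = (insn >> 4) & 0xFF;   /* shift applied to Rm */
    unsigned rm     = insn & 0xF;

    if (klass != 0 || imm != 0 || s != 0 || shift != 0 ||
        op_names[opcode] == NULL)
        return -1;

    snprintf(buf, len, "%s%s R%u,R%u,R%u",
             op_names[opcode], cond_names[cond], rd, rn, rm);
    return 0;
}
```

Every field of &00000000 is zero, which gives condition EQ, opcode AND and R0 for all three registers, so a "valid instruction" check passes happily over memory that has been zeroed.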
That could work; on entry to the abort handlers the kernel could read one of the hardware timers and check how long it’s been since everything was OK (e.g. last successful timer interrupt, or last SWI which exited to usermode with IRQs enabled, etc.). And if it’s been too long it can try and take action to break out of the abort loop. One of the problems with the processor vector & environment handlers is that they’re black boxes; the kernel knows when they’re entered, but it’s hard for it to know when or where they exit. This will make it tricky to develop any kind of intelligent system which is able to identify the specific handler/program which is broken. |
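As a rough sketch of that timer check, assuming a free-running 32-bit tick counter is readable from the abort handler: `abort_watchdog` and both functions are hypothetical, and how the kernel would obtain `now` or what "break out" means in practice is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state kept by the kernel. */
typedef struct {
    uint32_t last_ok_ticks;  /* timer value when things last looked healthy */
    uint32_t limit_ticks;    /* how long we tolerate before taking action */
} abort_watchdog;

/* Call at a known-good point, e.g. a successful timer interrupt or a SWI
   exiting to user mode with IRQs enabled. */
void watchdog_note_ok(abort_watchdog *w, uint32_t now)
{
    w->last_ok_ticks = now;
}

/* Call on entry to an abort handler. Unsigned subtraction keeps the
   elapsed time correct even when the counter has wrapped round. */
bool watchdog_should_break_out(const abort_watchdog *w, uint32_t now)
{
    return (uint32_t)(now - w->last_ok_ticks) > w->limit_ticks;
}
```

This only answers "has it been too long?"; it does nothing to identify which handler or program is at fault, which is the black-box problem described above.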
Jon Abbott (1421) 2651 posts |
Ironically, when I’m debugging C based games under ADFFS I turn on a debug feature that terminates the process if the JIT sees this instruction on entry. |
Clive Semmens (2335) 3276 posts |
I vaguely remember discussions among engineers at ARM about possibly dedicating this instruction for some special purpose in debugging; not sure whether anything was ever decided. There are after all plenty of no-ops, and this one is special. |
David Feugey (2125) 2709 posts |
Or ask the user what to do (quit the application or wait for the next alert message). A la Firefox. |
Steve Pampling (1551) 8170 posts |
“A web page is taking too long…” and sadly it doesn’t tell you which page is taking so much time. |
Jon Abbott (1421) 2651 posts |
A watchdog would need to be on an FIQ timer to work around IRQ being turned off. How it would actually detect if a process is hung under CMT I don’t know…the process might just take time or be single-tasking. |
Steve Pampling (1551) 8170 posts |
Wouldn’t the state of registers be changing over time on a long process? As to single tasking surely that is yet another reason to ensure that what the task sees as single tasking is in fact just a PMT time slice set. |
Jon Abbott (1421) 2651 posts |
You’d hope. There is however the scenario where registers are changing, but it’s stuck in an infinite loop. You also couldn’t check if IRQs were disabled for too long as it’s impossible to tell how long they’ve been disabled. I think the Pi keyboard handler is running under FIQ, so you could possibly pick up on ALT-BREAK; I’m not sure if that’s possible on many other machines though. |
Steve Pampling (1551) 8170 posts |
CTRL-ALT-BREAK required? Along with a general rehash of the keyboard handling and mapping. |
Adrian Lees (8595) 14 posts |
I think in the common case the exception handler code itself lies in the SharedCLibrary module and is thus relatively safe; the problem is that the CLib workspace is in the application workspace and vulnerable to my (deliberately) errant code. That workspace contains the details of the ‘pass on/previous’ handlers. I’ve prodded this a little more:

    #include "swis.h"

    unsigned data[1];

    int main(void)
    {
      unsigned *p = data;
      int limit;
      _swix(OS_GetEnv, _OUT(1), &limit);
      const unsigned *ep = (unsigned*)limit;
      while (p < ep) *p++ = 0;
      _swix(OS_WriteI+'.', 0);
      while (1);
      return 0;
    }

Now that I’m not inducing an abort it will sit there after the ‘.’ is displayed, but pressing Return after ‘Alt-Break’ then induces the hang, presumably because I have wiped out other CLib workspace contents such as the stored exit handler, so I would guess that people are correct that it was the Data Abort that was leading to the hang earlier. There’s fortunately no evidence to support my concern that a background Event Handler invocation caused it at an unpredictable time. (Nudging the WimpSlot up to 511M it still displays the ‘.’ and sits there with cursor/keyboard LEDs responsive.)

I guess checks could be introduced into OS_ChangeEnvironment when registering/reinstating handlers and/or OS_Exit when the address of the registered handler looks suspect (OS_ValidateAddress?) but that still doesn’t address the problem of exactly what the kernel should do when things have gone so pear-shaped, and I’m inclined to shy away from any ad hoc/quick bodges for this specific problem without more careful consideration. It seems to me that we’re nudging up against the fundamental fact that the RISC OS kernel has no real knowledge of tasks or how to do a last-ditch killing of such because it has only ever concerned itself with one. |