Machine hang in USR mode C
Jeffrey Lee (213) 6048 posts |
The OS calls the (data/prefetch/instruction) abort environment handler in a privileged CPU mode. The stub code which gets called has been overwritten with &0 (essentially NOPs), so execution continues until it falls off the end of the application slot (or hits some non-zeroed instruction which causes some other type of abort). Repeat ad infinitum.
I can’t say why they can’t, but I can say why they shouldn’t: Logical address space exhaustion. Rather than simply moving the code elsewhere and hoping that it won’t get overwritten by a different stray pointer, a better proposition would be to protect the code by making the page read-only. Better yet, make all of the code pages for the program read-only (and make the data pages non-executable). |
Jeffrey Lee (213) 6048 posts |
Patching a SWI OS_NewLine into the compiled program just before the wait loop fixes the problem, which suggests that the start of the new application has left TaskWindow in a funny state. I’ve seen this behaviour for a long time – programs which run within task windows typically only start multitasking (and responding to Escape!) once the first character is output.
UnixLib or SCL? On RISC OS 5, UnixLib does multitask (SCL doesn’t), and neither responds to Escape. I’ll continue investigating. |
nemo (145) 2556 posts |
-MSTUBS – so SCL But as Adrian said, But that’s just which end of the string I’d start pulling.
Oh do get a grip. Even if it were one page per CLib client (and it wouldn’t be) that’s not going to… come on now.
It’s not a ‘stray pointer’ (as in completely random and terribly bad luck) it’s literally next to writeable memory. That isn’t going to happen somewhere else. I don’t want to look into SharedCLibrary, I really don’t, but I presume we are talking about these four stubs in particular:
|
Jeffrey Lee (213) 6048 posts |
Nope – it’s definitely in user mode, IRQs enabled. |
nemo (145) 2556 posts |
Uggh. Yet |
Jeffrey Lee (213) 6048 posts |
address space exhaustion There’s this thing called memory fragmentation, maybe you’ve heard of it? UnixLib likes to use dynamic areas for its heap, which has caused problems in the past (prior to the introduction of PMPs the ARMX6 would have only had a few hundred MB of space available for program-created DAs, thanks to the 2GB free pool mapping). Admittedly I’m not sure how much of that Otter problem was down to lack of free space vs. lack of free contiguous space. But it doesn’t change the fact that if CLib was to start dynamically creating DAs then our chances of troublesome levels of fragmentation would increase. The stubs are only accessed when the task is paged in. To place the stubs outside of application space would be inappropriate. |
nemo (145) 2556 posts |
Yeah, except I didn’t say put the heap in a DA, I said put those four stubs in a DA. |
Jeffrey Lee (213) 6048 posts |
I know. |
nemo (145) 2556 posts |
So the vulnerability seems to be that the addresses in the language descriptor are (typically) of those stubs, so are indirected through the application slot. Whereas CLib could have seen that they are the standard addresses and avoided that indirection. It’s not possible to stop the application following pointers it has corrupted, but it is possible to stop CLib from relying on pointers in the appslot. This is analogous to the long-standing MessageTrans misdesign, that had it relying on the integrity of a linked list of blocks of memory belonging to its clients. The fix is similar – copy the important things away from where the client might trample them. Language descriptors are very small. Fears of address space exhaustion are frankly incredible. |
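As a rough illustration of the "copy the important things away" idea above, here is a minimal C sketch. It is not CLib code: `descriptor_t`, `lib_register`, `lib_raise_abort` and `demo_abort_handler` are hypothetical names, and the two-pointer descriptor is only a stand-in for a real language descriptor. The point is simply that the library snapshots the pointers once and never dereferences the client's copy again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a language descriptor: a couple of handler pointers. */
typedef int (*handler_fn)(int);

typedef struct {
    handler_fn on_abort;
    handler_fn on_exit;
} descriptor_t;

/* Library-owned copy, kept away from the client's writable data. */
static descriptor_t private_copy;
static int have_copy = 0;

/* Snapshot the client's descriptor once, at registration time. */
void lib_register(const descriptor_t *client)
{
    assert(client != NULL);
    private_copy = *client;
    have_copy = 1;
}

/* Dispatch through the private copy, never through the client's memory. */
int lib_raise_abort(int code)
{
    if (!have_copy || private_copy.on_abort == NULL)
        return -1;                      /* nothing registered: caller falls back */
    return private_copy.on_abort(code);
}

/* Example handler, for illustration only. */
int demo_abort_handler(int code)
{
    return code + 100;
}
```

If the client later zeroes its own descriptor (as in the overwritten-stub case discussed here), `lib_raise_abort` still reaches the handler that was registered, because it only consults `private_copy`.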
Adrian Lees (1349) 122 posts |
Re the hang, my guess would be that TaskWindow notices calls to OS_ChangeEnvironment and that causes the behaviour change? In all cases note that I have been invoking my test programs from the CLI within a TaskWindow, so text output has already occurred, just not from my application. I confirm that just introducing ‘swix(OSWriteI+10,0)’ does restore multitasking and Escape functionality, and I vaguely recall having observed this myself once a number of years ago.
For clarity, it’s the ‘_stub_XYHandlerInData’ registered by ‘_kernel_init’ (‘risc_oslib.kernel.s.k_body’) at which I have been looking, rather than the C language-specific support code. I can’t see a way to reorder the C$$data and Stub$$Data areas using the linker as a quick-and-dirty workaround (-first and -last are insufficiently flexible).
However… If my reasoning is correct, those stubs don’t need to exist anyway, since they assume the validity of R13, and can thus vacate enough registers to discover the location of the static data simply by calling OS_ChangeEnvironment to pick up a pointer to a buffer that has been registered with one of the other handlers (s.k_body/InstallHandlers wants pretty much everything!). Then use that to locate the static base? Or the OS_ChangeEnvironment API could be modified to declare R2/R3 as stored and returned for the low-level exception handlers. I note that the kernel code (‘AdjustOurSet’) already does this, and is just unable to supply either value in a register to the handler code. (Perhaps this approach could be problematic if something tries to provide an alternative implementation; TaskWindow or Wimp perhaps?) |
Jon Abbott (1421) 2651 posts |
If USER code is allowed to install privileged handlers, is a fix even possible? Would a better solution not be either getting CTRL-BREAK to work when in an Abort loop, so a locked machine can be recovered, or putting a check in the OS exception handlers to prevent abort loops? Edit: CTRL-BREAK should read ALT-BREAK |
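One way the "check in the OS exception handlers" idea could look, sketched in portable C rather than kernel assembler. Everything here is an assumption for illustration: `abort_guard_t`, `abort_guard_enter`, `abort_guard_leave` and the limit of 3 are hypothetical, and a real kernel would keep this state somewhere the application cannot write. The guard counts aborts taken while an earlier abort is still being handled and, past a threshold, stops calling the installed environment handler.

```c
#include <assert.h>
#include <stddef.h>

#define ABORT_NEST_LIMIT 3u   /* arbitrary threshold, chosen for illustration */

typedef enum {
    ABORT_PASS_ON,        /* call the installed environment handler as usual */
    ABORT_FORCE_DEFAULT   /* bypass it and use the default handler instead   */
} abort_action_t;

typedef struct {
    unsigned depth;       /* aborts entered but not yet completed */
} abort_guard_t;

/* Call on entry to the low-level abort path, before the installed handler. */
abort_action_t abort_guard_enter(abort_guard_t *g)
{
    assert(g != NULL);
    g->depth++;
    return (g->depth > ABORT_NEST_LIMIT) ? ABORT_FORCE_DEFAULT : ABORT_PASS_ON;
}

/* Call when a handler has finished cleanly. */
void abort_guard_leave(abort_guard_t *g)
{
    assert(g != NULL);
    if (g->depth > 0)
        g->depth--;
}
```

With a handler that never completes (the zeroed-stub case), `depth` only ever increases, so the fourth consecutive abort would be routed to the default handler instead of looping.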
Jeffrey Lee (213) 6048 posts |
There is definitely some logic in there which tries to keep track of the environment handlers, but I haven’t pulled it apart yet to see where it’s going wrong.
Yep, those are the ones.
It’s dangerous for abort handlers to call SWIs, at least until they’ve determined that it’s safe to do so (e.g. the abort could have been due to SVC stack overflow). (Sigh – CLib calls SWI FPEmulator_Abort from within a data abort handler. Good luck recovering from SVC stack overflows if a C app is active.) However, if CLib is able to detect when its abort/environment handlers are being swapped out, then it could just use a small patch of global memory to store the details of the active client (since currently, there can only be one application client “active” at a time)
Possibly the kernel could re-enable IRQs if the abort occurred with IRQs enabled – but I’m not sure we’d gain much, considering Ctrl-Break is just a quicker way for someone to reset the machine than go reaching for the power button/reset switch. |
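The "small patch of global memory" suggestion above could be sketched roughly as follows. This is a hedged illustration only: `client_info_t`, `handlers_swapped_in`, `handlers_swapped_out` and `active_client` are hypothetical names, and how CLib would actually detect its handlers being swapped is not shown. The abort path then reads the slot directly and needs no SWI to find the active client.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-client details an abort handler would need. */
typedef struct {
    void *static_base;   /* assumed: where the client's static data lives */
    int   client_id;
} client_info_t;

/* Single global slot: only one application client is active at a time. */
static client_info_t active_slot;
static int slot_valid = 0;

/* Called when this client's environment handlers become the live ones. */
void handlers_swapped_in(const client_info_t *c)
{
    assert(c != NULL);
    active_slot = *c;    /* copy, so later client-side corruption is harmless */
    slot_valid = 1;
}

/* Called when the handlers are replaced or the task is switched out. */
void handlers_swapped_out(void)
{
    slot_valid = 0;
}

/* Used from the abort path: a plain memory read, no SWI required. */
const client_info_t *active_client(void)
{
    return slot_valid ? &active_slot : NULL;
}
```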
Adrian Lees (1349) 122 posts |
Fair point. I was under the false impression that those handlers were at a higher level with SVC stack already flattened. And to correct a couple of other errors on my part: (i) they are not called in SVC mode now (I consulted only PRM1), rather the appropriate privileged mode, and (ii) there are no stored R2/R3 values for the ‘hardware vector’ handlers at present; I skim-read that code too quickly. The abort handler in kernel/clib actually stores to R13_svc even before executing its first SWI. I think I’d argue an overflowed SVC stack cannot and should not be for clib, or any environment handler, to worry about. If it happens, there’s no way that it can be resolved without calling a SWI (no go) or putting yet more low-level hackery into clib (et al) which has already strayed a long way from being user-level application support code. A job for the kernel abort handling code, instead?
Jon: My concerns are two-fold really; not just system stability, but also helping coders track down their programming errors during development. It’s hard enough writing code on RISC OS without machine hangs flummoxing and demoralising application writers. Abort handling and reporting really needs to be the most robust and least vulnerable code in the system, IMHO. (Yes, I’m looking at you, DDT!) |
Jon Abbott (1421) 2651 posts |
What’s the key for breaking into an app? Could have sworn it was CTRL-BREAK…I don’t have a machine to hand to try.
Tell me about it, I’ve spent all week debugging code trying to figure out why the machine is locking solid. I must have rebooted my pi-top several hundred times trying random code changes as it’s impossible to diagnose a fault when the machine stiffs. What’s exacerbating the issue further is the screen is blanking, so the debug info I’m writing to screen is also useless. All I’ve figured out in a week of debugging is that sound IRQs are involved – I’m no closer to figuring out how or why that’s blanking the screen and stiffing the machine! |
Steffen Huber (91) 1953 posts |
Alt+Break. |
Rick Murray (539) 13850 posts |
I don’t know if it works as it is years old, but I made a hack of DADebug that would periodically spit info to the HAL debug serial port. That might help, if interrupts (and the tickers) are still working. |
Jeffrey Lee (213) 6048 posts |
My memory of them was a bit fuzzy too (it doesn’t help that the PRM doesn’t really give a good explanation of them, and likewise our wiki). The precedence is:
The first four occur in the corresponding abort mode. The default environment handler in the kernel is responsible for setting the OS back to a safe state (flattening SVC & IRQ stacks) and then raising the default abort error message. I haven’t looked yet to see why exactly CLib wants to get in on the action (does it allow aborts to be recovered from, or is it just so it can customise the error handling?), but it does seem to be a bit of a flaw in the OS that the environment handler has to take on so much responsibility.
Alt-break. Which I think relies on callbacks to function, so unless we can kick the abort handling down to user mode, isn’t likely to be very helpful. |
Jon Abbott (1421) 2651 posts |
That’s interesting, I’ve had to add explicit code to handle Aborts raised by PMP’d pages at 3, it doesn’t look like they should be reaching my code if they’re handled at 2.
No wonder it’s next to useless, I implemented a handler for CTRL-SHIFT-F12 in the IRQ handler so apps could be terminated… doesn’t help when the Abort handler gets in a loop with IRQs disabled though!! |
Rick Murray (539) 13850 posts |
Just looking at this, nothing better to do. :-) I think grumpy cat might have missed a few things? The official documentation (PRM), the one I have, says nothing about reading colours (and specifies bits 6-31 must be zero). Oh, and the documentation I’ve seen (wiki here, StrongHelp) that does mention the ability to read notes: When reading the colours, text colours are returned as a colour number in R1, but you must supply a pattern block to read the graphics colours. Yeah… VduVars is a lot simpler… |
nemo (145) 2556 posts |
Unlikely.
You keep using that word. I do not think it means what you think it means.
No further questions m’lud. |
Rick Murray (539) 13850 posts |
So clearly you want me to think that the PRMs don’t count as documentation; as I’m pretty certain that Acorn called it the Programmer’s REFERENCE for a reason.
So… There’s no “documentation” to be relied upon and the “documentation” available is wrong? Maybe this is why there aren’t many developers? |
Rick Murray (539) 13850 posts |
BTW, you still have not replied to the noted comments that the description of SetColour makes no mention of reading (so one shouldn’t be using this call to read) and the third party descriptions say that if you do read the graphics colour you need to specify a memory block… so, no, the SWI only writes to random locations if you use it incorrectly and don’t pay attention to what others have written about it… |
Colin Ferris (399) 1818 posts |
To Jon CTRL-SHIFT-F12 in the IRQ handler – to quit errant Apps. |
nemo (145) 2556 posts |
Of course they do… right up until there’s a change. Then it becomes out of date documentation.
Have I not mentioned this on an almost weekly basis? :-)
Oh sorry, missed that. It’s certainly an unfortunate SWI name, but that’s because it was introduced in RISC OS 3, and could only set the colour. The ability to read the colour was added in 3.50 – so it’s unfortunate, but when things change, the existing documentation becomes out of date. As I demonstrated with OS_ReadLine, did I not?
Sorry, you’ve misunderstood. The documentation you are looking at is out of date. That’s what happens to stuff printed on paper. |
Steve Pampling (1551) 8172 posts |
Fixed that for you. It’s been happening to us too, although I’ve noted at work that newer often fails to be better. Apparently working brains are an optional extra these days. Note: If it’s written correctly it is fully in date forever for the stated use. Acorn didn’t define that so its use in a modern framework may be at odds with the actuality. |