Fatal error recovery
Jon Abbott (1421) 2651 posts |
What method should I use for fatal error recovery? Am I right in thinking I need to reset the IRQ stack, SVC stack and IRQsema, then switch to USER and use OS_Exit to get back to a known point? Do I need to do anything special with regard to RTSupport or other IRQ-stack-sensitive modules? |
Jeffrey Lee (213) 6048 posts |
Yeah, pretty much. When the kernel generates a data abort, undefined instruction, etc. error it goes through the following recovery process:
The sequence of events isn’t quite the same as listed above, but to an outside observer everything will appear to happen in that order. Also note that the OS performs most of its recovery using the (pre-flattened) ABT/UND stacks – the OS doesn’t contain any code to attempt to deal with situations where the ABT/UND stacks are full or their stack pointers have been corrupted. So if you need a safe stack to use during the recovery, and don’t feel like having your own private stack somewhere, the system ABT/UND stacks should be fine for use. |
Jon Abbott (1421) 2651 posts |
Thanks for the info. I have a few more queries based on this.
I can get the initial SVC (15), IRQ (67) and UND (16) stacks via OS_ReadSysInfo 6, but what should ABT be set to?
Do I simply write zero to it, or are there any relevant SWIs to either reset it or get the reset value?
Am I correct in reading that I should be passing R0=< error pointer > R2=1 to SeriousErrorV? Do I need to do any recovery in CLib / UnixLib if a process was using them? Or do they clean up from the SeriousErrorV_Recover call? I’m presuming there’s a service call passed around so Modules know a task ID has failed. |
Jeffrey Lee (213) 6048 posts |
I guess either make a note of the current value when your code starts, or query OS_Memory 16 for area 4.
“Reset IRQsema”: Just write zero.
“Call SeriousErrorV_Recover”: Correct.
For application code, the error and exit environment handlers are what will trigger any necessary cleanup. OS_GenerateError will invoke the error handler and OS_Exit will invoke the exit handler. It’s worth noting that the error environment handler can recover from the error if it wants, so if you want to force an app to terminate then OS_Exit would be the way to go (only a very naughty app would refuse to terminate when its exit handler is called). For module code, I think it all has to be tied into service calls like Service_Error (which is issued before the error environment handler, so again no guarantee that the app is exiting), or Service_WimpCloseDown (issued by the Wimp). I’m not sure if there’s any way for a module to detect when a program exits but the wimp task remains (e.g. a program in a task window exits). SeriousErrorV is a recent addition. Currently the only thing that pays attention to SeriousErrorV_Recover is RTSupport, because the old approach (doing recovery in ErrorV) is a bit dangerous. |
Jon Abbott (1421) 2651 posts |
I’ve coded it up and it’s working on RO3.x and RO5 IOMD. On the Pi, however, it’s not working correctly. Testing at a GOS prompt, invoking the code results in it returning to the * prompt with a flashing cursor, but neither the keyboard nor the mouse works. Any ideas? Code-wise, I’m doing the following
I’ve left the UND/ABT stacks for the time being, as neither are active at the time my code executes.
I’d rather not call any app environment handlers, as the error may have occurred within one of them. I’m tracking all vectors, voices, channel handlers etc so release them on behalf of the app. With CLib, I track all entry points, such as _atexit, _signal, but am not sure how to undo them within CLib for the current task. From the C code I’ve looked at over the years, quite a few error handlers weren’t tested and contain bugs, some simply don’t have error handlers and some don’t even have exit handlers as they were never designed to exit. Based on that, I need to clean up the environment myself.
I think I’m okay with Modules, as any Modules loaded by the task are killed cleanly or forcibly, and any vectors, voices, channel handlers etc. claimed by them are released. What happens with CLib if a task it was supporting simply “disappears”? Does it leak RMA space? How does CLib track entry points on a per-task basis? I’m assuming it claims some RMA space to track any task-specific entry points. |
Jeffrey Lee (213) 6048 posts |
Steps 5 and 6 look iffy. You should enable FIQs, call SeriousErrorV_Recover, and then enable IRQs. It’s not guaranteed that that’s the cause of the problem, but it’s definitely a deviation from how the kernel does things. (You don’t show where/if you re-enable FIQs, so I guess it’s possible they’re remaining disabled – which would definitely kill USB on the Pi)
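The ordering described above (enable FIQs, call SeriousErrorV_Recover, then enable IRQs) can be modelled in a few lines of C. This is only a sketch on a plain integer standing in for the CPSR; real recovery code would be ARM assembler changing the actual CPSR and calling the vector. The names recover_sequence, stub_recover and seen_cpsr are hypothetical; only the I and F bit positions are taken from the ARM architecture.

```c
#include <stdint.h>

#define CPSR_I_BIT 0x80u  /* set = IRQs disabled */
#define CPSR_F_BIT 0x40u  /* set = FIQs disabled */

/* Hypothetical callback type standing in for the SeriousErrorV_Recover call. */
typedef void (*recover_fn)(uint32_t cpsr_at_call);

/* Records the interrupt state that the stand-in recover call observed. */
uint32_t seen_cpsr;

/* Stand-in for the real vector call: it only notes the CPSR it was given. */
void stub_recover(uint32_t cpsr_at_call)
{
    seen_cpsr = cpsr_at_call;
}

/* Model of the ordering: FIQs on, recover call, then IRQs on. */
uint32_t recover_sequence(uint32_t cpsr, recover_fn call_recover)
{
    cpsr &= ~CPSR_F_BIT;   /* 1. enable FIQs */
    call_recover(cpsr);    /* 2. recover call runs with IRQs still disabled */
    cpsr &= ~CPSR_I_BIT;   /* 3. only now enable IRQs */
    return cpsr;
}
```

The point of the model is simply that the recover call sees FIQs enabled but IRQs still masked, and that neither disable bit is left set afterwards.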
atexit and signal only affect the internal state of the C program, so as far as the OS is concerned there’s nothing to undo. I think the only slightly advanced thing CLib does on exit (other than restore any environment handlers it’s installed) is to close any open FILEs. So if you want to kill a C app without going through CLib you could capture all calls to fopen and fclose and manually call fclose to close any still-open files on exit.
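The fopen/fclose tracking suggested above might look something like this portable C sketch. The names tracked_fopen, tracked_fclose and close_all_tracked are hypothetical, and the fixed-size table is an assumption made for brevity. How the app’s own calls get redirected to these wrappers is not shown here.

```c
#include <stdio.h>
#include <stddef.h>

#define MAX_TRACKED 64  /* assumed limit, chosen only for illustration */

static FILE *tracked[MAX_TRACKED];

/* Open a file and remember the handle so it can be closed later. */
FILE *tracked_fopen(const char *path, const char *mode)
{
    FILE *f = fopen(path, mode);
    if (f == NULL)
        return NULL;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == NULL) {
            tracked[i] = f;
            return f;
        }
    }
    /* Table full: refuse rather than hand out an untracked handle. */
    fclose(f);
    return NULL;
}

/* Close a file and forget the handle. */
int tracked_fclose(FILE *f)
{
    if (f == NULL)
        return EOF;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == f) {
            tracked[i] = NULL;
            break;
        }
    }
    return fclose(f);
}

/* Close everything still open; returns how many files were closed. */
int close_all_tracked(void)
{
    int closed = 0;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] != NULL) {
            fclose(tracked[i]);
            tracked[i] = NULL;
            closed++;
        }
    }
    return closed;
}
```

On a forced exit, a single call to close_all_tracked() would then stand in for the file cleanup that CLib would otherwise have done.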
CLib has minimal global state – IIRC it’s little more than just a msgtrans block and a counter for use by tmpnam. There’s no master list of clients, and CLib has no idea how many clients are active or what/where they are. For application tasks the client state will be held entirely within its application slot, and for module tasks it will be in the RMA. So it will only be module tasks that leak memory if they’re killed improperly. |
Jon Abbott (1421) 2651 posts |
Apologies, step 5 (now corrected) should have read enable FIQ. I believe I’ve figured out where I’m going wrong. I’m invoking the code with a key combination under EventV, which I’m guessing is causing RISC OS to leave the USB in an unknown state. I’m purposely not using Callback as it would not occur if the machine is stuck in an SVC or IRQ loop. I think my only option is to hang off the IRQ vector and get it to hand over to the recovery process after RISC OS has handled the IRQ, if the current task is marked for termination. |
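The deferral idea above (mark the task under EventV, act only once the IRQ has been handled) can be sketched as a simple flag hand-off. This is a plain C model, not RISC OS code: on_key_event, after_irq_handled, terminate_requested and recoveries_run are all hypothetical names, and the counter merely stands in for handing over to the real recovery process.

```c
/* Set from the (modelled) EventV handler, consumed after IRQ handling. */
volatile int terminate_requested;

/* Counts how often the modelled recovery was entered. */
int recoveries_run;

/* Modelled EventV handler: only marks the task, does no recovery itself. */
void on_key_event(void)
{
    terminate_requested = 1;
}

/* Modelled hook that runs once the OS has finished handling an IRQ. */
void after_irq_handled(void)
{
    if (terminate_requested) {
        terminate_requested = 0;  /* consume the request exactly once */
        recoveries_run++;         /* stand-in for starting recovery */
    }
}
```

Because the event handler only sets a flag, repeated key events before the next IRQ still result in a single recovery.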
Jeffrey Lee (213) 6048 posts |
Yeah, that will almost certainly be the cause. |
Jon Abbott (1421) 2651 posts |
I’ve got this working reliably, although I have a problem with the Obey Module. After recovering and handing back via OS_Exit, the Obey Module immediately carries on executing the Obey file that triggered the fatal error. Is there a way to tell Obey to terminate execution? |
Rick Murray (539) 13840 posts |
Might be a hack, but can Obey support multiple contexts? If not, try Obey’ing an empty file? Or maybe just *Obey on its own without any filename? |
Colin Ferris (399) 1814 posts |
I’m not sure it’s what you want – but what about putting in :- *Error 1 msg obey stopped |
Jon Abbott (1421) 2651 posts |
I was trying to avoid generating an error as I’m trying to terminate the task cleanly. I’ll see if generating an error makes any difference though. Looking at the Obey Module source, it looks like the Error and Exit handler should terminate the current Obey file, which is odd as an app run from an Obey file that exits via OS_Exit does not cause the Obey file to terminate. |
Rick Murray (539) 13840 posts |
Test Obey file:
  Echo One
  Echo Two
  Obey
  Echo Three
Result:
  One
  Two
  Press SPACE or click mouse to continue
Not odd at all. When an app is running, the Obey module is not the current exit handler; plus pretty much everything eventually terminates using OS_Exit… |
Rick Murray (539) 13840 posts |
Also seems to work nested.
Obey1:
  Echo One
  Echo Two
  Obey $.Obey2
  Echo Four
Obey2:
  Echo Three
  Obey
Result:
  One
  Two
  Three
  Press SPACE or click mouse to continue
(doesn’t go back to carry on with Obey1; take out the solitary |
Jon Abbott (1421) 2651 posts |
Thanks Rick, I’ve updated the Wiki page accordingly. |