Fatal error recovery

15 posts, 4 voices

Aug 31, 2016 3:12pm Jon Abbott (1421) 2651 posts	What method should I use for fatal error recovery? Am I right in thinking I need to reset the IRQ stack, SVC stack and IRQsema, then switch to USER and use OS_Exit to get back to a known point? Do I need to do anything special with regard to RTSupport or other IRQ-stack-sensitive Modules?

Aug 31, 2016 4:02pm Jeffrey Lee (213) 6048 posts
> Am I right in thinking I need to reset the IRQ stack, SVC stack, IRQsema then switch to USER and use OS_Exit to get back to a known point.
Yeah, pretty much. When the kernel generates a data abort, undefined instruction, etc. error it goes through the following recovery process:
- Disable IRQs & FIQs
- Flatten all the privileged mode stacks (SVC, IRQ, ABT, UND)
- Make sure we have a valid stack for the following bits (e.g. you can switch to SVC mode)
- Call HAL_FIQDisableAll if the abort was from FIQ mode (i.e. the FIQ handler is most likely broken, so make sure it can’t cause any further crashes)
- Reset IRQsema
- Enable FIQs
- Call SeriousErrorV_Recover – this lets things like RTSupport know that the stacks have been flattened (RTSupport can operate without this call, but for future-proofing it’s probably best to include it)
  - SeriousErrorV isn’t documented on the wiki yet, but it’s vector &2C and the reason codes are listed here
  - The recover reason is passed an error block pointer, but nothing actually uses it yet, so let’s say that it’s also safe to pass a null pointer if you don’t feel like providing an error block.
- It should now be safe to call arbitrary SWIs
- Enable IRQs
- Call OS_GenerateError to report the error
The sequence of events isn’t quite the same as listed above, but to an outside observer everything will appear to happen in that order.
Also note that the OS performs most of its recovery using the (pre-flattened) ABT/UND stacks – the OS doesn’t contain any code to attempt to deal with situations where the ABT/UND stacks are full or their stack pointers have been corrupted. So if you need a safe stack to use during the recovery, and don’t feel like having your own private stack somewhere, the system ABT/UND stacks should be fine for use.
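The ordering in the post above can be written down as a small set of "must happen before" rules and checked mechanically. The following C sketch is only a host-side model, not kernel code: the `recovery_step` labels, the `must_precede` table and `recovery_order_ok` are hypothetical names invented for illustration, nothing here touches interrupts or calls a SWI, and the rules are simply read off the order in which the post lists the steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the steps in the post; these are not kernel identifiers. */
typedef enum {
    STEP_DISABLE_IRQ_FIQ,
    STEP_FLATTEN_STACKS,
    STEP_RESET_IRQSEMA,
    STEP_ENABLE_FIQ,
    STEP_SERIOUSERRORV_RECOVER,
    STEP_ENABLE_IRQ,
    STEP_REPORT_ERROR            /* OS_GenerateError in the kernel's case */
} recovery_step;

/* Each pair means "first must come before second" when both steps are present.
   Assumption: these rules are taken from the listed order, nothing more. */
static const recovery_step must_precede[][2] = {
    { STEP_DISABLE_IRQ_FIQ,       STEP_FLATTEN_STACKS },
    { STEP_FLATTEN_STACKS,        STEP_ENABLE_FIQ },
    { STEP_RESET_IRQSEMA,         STEP_ENABLE_FIQ },
    { STEP_ENABLE_FIQ,            STEP_SERIOUSERRORV_RECOVER },
    { STEP_SERIOUSERRORV_RECOVER, STEP_ENABLE_IRQ },
    { STEP_ENABLE_IRQ,            STEP_REPORT_ERROR },
};

/* Position of the first occurrence of s in seq, or -1 if it is absent. */
static int index_of(const recovery_step *seq, size_t n, recovery_step s)
{
    for (size_t i = 0; i < n; i++)
        if (seq[i] == s)
            return (int)i;
    return -1;
}

/* Returns 1 if seq violates none of the rules above, 0 otherwise. */
int recovery_order_ok(const recovery_step *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    size_t pairs = sizeof must_precede / sizeof must_precede[0];
    for (size_t p = 0; p < pairs; p++) {
        int a = index_of(seq, n, must_precede[p][0]);
        int b = index_of(seq, n, must_precede[p][1]);
        if (a >= 0 && b >= 0 && a >= b)
            return 0;
    }
    return 1;
}
```

A sequence that omits a step (for example the optional SeriousErrorV_Recover call) still passes; one that enables IRQs before calling SeriousErrorV_Recover is rejected.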

Aug 31, 2016 5:59pm Jon Abbott (1421) 2651 posts
Thanks for the info. I have a few more queries based on this.
> Flatten all the privileged mode stacks (SVC, IRQ, ABT, UND)
I can get the initial SVC (15), IRQ (67) and UND (16) stacks via OS_ReadSysInfo 6, but what should ABT be set to?
> Reset IRQsema
Do I simply write zero to it, or are there any relevant SWIs to either reset it or get the reset value?
> Call SeriousErrorV_Recover
Am I correct in reading that I should be passing R0=<error pointer>, R2=1 to SeriousErrorV?
Do I need to do any recovery in CLib / UnixLib if a process was using them? Or do they clean up from the SeriousErrorV_Recover call? I’m presuming there’s a service call passed around so Modules know a task ID has failed.

Aug 31, 2016 6:59pm Jeffrey Lee (213) 6048 posts
> I can get the initial SVC (15), IRQ (67) and UND (16) stacks via OS_ReadSysInfo 6 but what should ABT be set to?
I guess either make a note of the current value when your code starts, or query OS_Memory 16 for area 4.
> Reset IRQsema Do I simply write zero to it, or are there any relevant SWIs to either reset or get the reset value?
Just write zero.
> Call SeriousErrorV_Recover Am I correct in reading that I should be passing R0=<error pointer>, R2=1 to SeriousErrorV?
Correct.
> Do I need to do any recovery in CLib / UnixLib if a process was using them? Or do they clean up from the SeriousErrorV_Recover call? I’m presuming there’s a service call passed around so Modules know a task ID has failed.
For application code, the error and exit environment handlers are what will trigger any necessary cleanup. OS_GenerateError will invoke the error handler and OS_Exit will invoke the exit handler. It’s worth noting that the error environment handler can recover from the error if it wants, so if you want to force an app to terminate then OS_Exit would be the way to go (only a very naughty app would refuse to terminate when its exit handler is called).
For module code, I think it all has to be tied into service calls like Service_Error (which is issued before the error environment handler, so again no guarantee that the app is exiting), or Service_WimpCloseDown (issued by the Wimp). I’m not sure if there’s any way for a module to detect when a program exits but the wimp task remains (e.g. a program in a task window exits).
SeriousErrorV is a recent addition. Currently the only thing that pays attention to SeriousErrorV_Recover is RTSupport, because the old approach (doing recovery in ErrorV) is a bit dangerous.
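The register setup confirmed above (R0 = error block pointer or null, R2 = 1, vector &2C) can be sketched as plain data. This is a host-side illustration in C, not code that runs the vector: the `os_error_block` layout follows the usual RISC OS convention of a 32-bit error number followed by a zero-terminated message (conventionally at most 256 bytes in total), while `recover_call`, `make_error_block` and `make_recover_call` are hypothetical names. Putting the vector number in R9 assumes the call is made through OS_CallAVector, which the thread does not state.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SERIOUSERRORV          0x2C  /* vector number given in the thread */
#define SERIOUSERRORV_RECOVER  1     /* reason code passed in R2, per the thread */

/* Conventional RISC OS error block: error number, then a zero-terminated message. */
typedef struct {
    uint32_t errnum;
    char     errmess[252];
} os_error_block;

/* Hypothetical image of the registers that matter for the call. */
typedef struct {
    uintptr_t r0;  /* error block pointer, or 0 if none is supplied */
    uint32_t  r2;  /* reason code */
    uint32_t  r9;  /* vector number (assumes OS_CallAVector is used) */
} recover_call;

/* Fill in an error block, truncating the message if it is too long. */
void make_error_block(os_error_block *blk, uint32_t errnum, const char *msg)
{
    assert(blk != NULL && msg != NULL);
    blk->errnum = errnum;
    strncpy(blk->errmess, msg, sizeof blk->errmess - 1);
    blk->errmess[sizeof blk->errmess - 1] = '\0';
}

/* Build the register image; blk may be NULL, which the thread says is acceptable. */
recover_call make_recover_call(const os_error_block *blk)
{
    recover_call c;
    c.r0 = (uintptr_t)blk;
    c.r2 = SERIOUSERRORV_RECOVER;
    c.r9 = SERIOUSERRORV;
    return c;
}
```

On a real system these values would be loaded into the ARM registers before the vector is called, with FIQs enabled and IRQs still disabled, as described earlier in the thread.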

Aug 31, 2016 8:14pm Jon Abbott (1421) 2651 posts
I’ve coded it up and it’s working on RO3.x and RO5 IOMD. On the Pi, however, it’s not working correctly. Testing at a GOS prompt, invoking the code results in it returning to the * prompt with a flashing cursor, but neither the keyboard nor the mouse works. Any ideas?
Code-wise, I’m doing the following:
1. enter SVC with IRQ/FIQ disabled
2. reset SVC stack
3. reset IRQsema
4. reset IRQ stack
5. enable FIQ
6. issue SeriousErrorV_Recover
7. enter USER, IRQ/FIQ enabled
8. OS_Exit
I’ve left the UND/ABT stacks for the time being, as neither is active at the time my code executes.
> For application code, the error and exit environment handlers are what will trigger any necessary cleanup
I’d rather not call any app environment handlers, as the error may have occurred within one of them. I’m tracking all vectors, voices, channel handlers etc., so I release them on behalf of the app. With CLib, I track all entry points, such as _atexit, _signal, but am not sure how to undo them within CLib for the current task.
From the C code I’ve looked at over the years, quite a few error handlers weren’t tested and contain bugs, some apps simply don’t have error handlers and some don’t even have exit handlers as they were never designed to exit. Based on that, I need to clean up the environment myself.
> For module code, I think it all has to be tied into service calls like Service_Error
I think I’m okay with Modules, as any Modules loaded by the task are killed cleanly, or forcibly, and any vectors, voices, channel handlers etc. claimed by them are released.
What happens with CLib, if a task it was supporting simply “disappears”? Does it leak RMA space? How does CLib track entry points on a per-task basis? I’m assuming it claims some RMA space to track any task-specific entry points.

Aug 31, 2016 8:39pm Jeffrey Lee (213) 6048 posts
> Code wise, I’m doing the following
Steps 5 and 6 look iffy. You should enable FIQs, call SeriousErrorV_Recover, and then enable IRQs. It’s not guaranteed that that’s the cause of the problem, but it’s definitely a deviation from how the kernel does things. (You don’t show where/if you re-enable FIQs, so I guess it’s possible they’re remaining disabled – which would definitely kill USB on the Pi)
> With CLib, I track all entry points, such as _atexit, _signal, but am not sure how to undo them within CLib for the current task.
atexit and signal only affect the internal state of the C program, so as far as the OS is concerned there’s nothing to undo. I think the only slightly advanced thing CLib does on exit (other than restore any environment handlers it’s installed) is to close any open FILEs. So if you want to kill a C app without going through CLib you could capture all calls to fopen and fclose and manually call fclose to close any still-open files on exit.
> What happens with CLib, if a task it was supporting simply “disappears”? Does it leak RMA space? How does CLib track entry points on a task basis, I’m assuming it claims some RMA space to track any app specific entry points?
CLib has minimal global state – IIRC it’s little more than just a msgtrans block and a counter for use by tmpnam. There’s no master list of clients, and CLib has no idea how many clients are active or what/where they are. For application tasks the client state will be held entirely within its application slot, and for module tasks it will be in the RMA. So it will only be module tasks that leak memory if they’re killed improperly.
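The suggestion of capturing fopen and fclose so that still-open files can be closed on a forced exit can be sketched in standard C. This is a minimal illustration, not how CLib or any interception layer actually works: `tracked_fopen`, `tracked_fclose`, `close_all_tracked` and the fixed-size table are all hypothetical, and a real implementation would hook the app's calls rather than ask it to call wrappers.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TRACKED 64                 /* arbitrary illustrative limit */

static FILE *tracked[MAX_TRACKED];     /* NULL marks a free slot */

/* Open a file and remember the stream so it can be closed later. */
FILE *tracked_fopen(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
    FILE *f = fopen(path, mode);
    if (f == NULL)
        return NULL;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == NULL) {
            tracked[i] = f;
            return f;
        }
    }
    fclose(f);                         /* table full: refuse rather than lose track */
    return NULL;
}

/* Close a stream and forget it. Returns what fclose returns, or EOF for NULL. */
int tracked_fclose(FILE *f)
{
    if (f == NULL)
        return EOF;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == f) {
            tracked[i] = NULL;
            break;
        }
    }
    return fclose(f);
}

/* Close every stream that is still open; returns how many were closed. */
int close_all_tracked(void)
{
    int closed = 0;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] != NULL) {
            fclose(tracked[i]);
            tracked[i] = NULL;
            closed++;
        }
    }
    return closed;
}
```

Calling `close_all_tracked()` just before forcing the task to exit would then stand in for the cleanup CLib normally performs on a normal exit.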

Aug 31, 2016 10:25pm Jon Abbott (1421) 2651 posts
> Steps 5 and 6 look iffy. You should enable FIQs, call SeriousErrorV_Recover, and then enable IRQs.
Apologies, step 5 (now corrected) should have read enable FIQ.
I believe I’ve figured out where I’m going wrong. I’m invoking the code with a key combination under EventV, which I’m guessing is causing RISC OS to leave the USB in an unknown state. I’m purposely not using Callback as it would not occur if the machine is stuck in an SVC or IRQ loop.
I think my only option is to hang off the IRQ vector and get it to hand over to the recovery process after RISC OS has handled the IRQ, if the current task is marked for termination.

Aug 31, 2016 11:35pm Jeffrey Lee (213) 6048 posts
> I believe I’ve figured out where I’m going wrong. I’m invoking the code with a key combination under EventV, which I’m guessing is causing RISC OS to leave the USB in an unknown state.
Yeah, that will almost certainly be the cause.

Sep 15, 2016 7:16pm Jon Abbott (1421) 2651 posts	I’ve got this working reliably, although I have a problem with the Obey Module. After recovering and handing back via OS_Exit, the Obey Module immediately carries on executing the Obey file that triggered the fatal error. Is there a way to tell Obey to terminate execution?

Sep 15, 2016 7:42pm Rick Murray (539) 13840 posts
> Is there a way to tell Obey to terminate execution?
Might be a hack, but can Obey support multiple contexts? If not, try Obey’ing an empty file? Or maybe just *Obey on its own without any filename?

Sep 16, 2016 10:35am Colin Ferris (399) 1814 posts	I’m not sure it’s what you want, but what about putting in: *Error 1 msg obey stopped

Sep 16, 2016 7:46pm Jon Abbott (1421) 2651 posts	I was trying to avoid generating an error as I’m trying to terminate the task cleanly. I’ll see if generating an error makes any difference though. Looking at the Obey Module source, it looks like the Error and Exit handler should terminate the current Obey file, which is odd as an app run from an Obey file that exits via OS_Exit does not cause the Obey file to terminate.

Sep 16, 2016 8:33pm Rick Murray (539) 13840 posts
Test Obey file:
  Echo One
  Echo Two
  Obey
  Echo Three
Result:
  One
  Two
  Press SPACE or click mouse to continue
> which is odd as an app run from an Obey file that exits via OS_Exit does not cause the Obey file to terminate.
Not odd at all. When an app is running, the Obey module is not the current exit handler; plus pretty much everything eventually terminates using OS_Exit…

Sep 16, 2016 8:36pm Rick Murray (539) 13840 posts
Also seems to work nested.
Obey1:
  Echo One
  Echo Two
  Obey $.Obey2
  Echo Four
Obey2:
  Echo Three
  Obey
Result:
  One
  Two
  Three
  Press SPACE or click mouse to continue
(It doesn’t go back to carry on with Obey1; if you take out the solitary `Obey` command, it does.)

Sep 16, 2016 9:38pm Jon Abbott (1421) 2651 posts	Thanks Rick, I’ve updated the Wiki page accordingly.
