Solution against fatal crashes
|
I was just thinking about what the biggest disadvantage of pre-emptive multitasking (RISC OS) is, and the answer is that buggy software can crash all of RISC OS, causing data loss in another program. If it is detected that a program didn’t return control in time, then it could be terminated and the WIMP loop could continue working for the other applications. What are your thoughts? |
|
I would say… not having it?
This depends upon the kind of crash. See below.
You know, Alt-Break is a thing. It can help with these sorts of situations.
How long is a piece of string?
There is a class of crash that I refer to as a “cascading crash” where every app dies, one by one, until you’re left with an empty grey screen. I’m not sure what causes it, but that’s one that nothing will cater for. The truth is, some parts of RISC OS are inherently unstable. If you don’t believe me, write a simple module to hook into the keyboard vector. Stack three registers, restore by pulling only two, then watch the machine completely freeze when you poke a key.
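To illustrate why that push/pull mismatch is fatal, here is a minimal Python model of a stack, not real ARM code. The choice of registers (R0, R1 and the return address in LR) and the function name `run_handler` are assumptions made for the sketch; the point is only that pulling one register too few hands the wrong word to the program counter and leaves the stack unbalanced.

```python
def run_handler(r0, r1, lr, balanced):
    """Model a vector handler that stacks three words on entry.

    Assumed registers (hypothetical, for illustration): R0, R1 and LR.
    The end of the list is the top of the stack.
    Returns (value loaded into PC on exit, words left on the stack).
    """
    stack = []
    # Entry: stack three registers. The lowest-numbered register ends up
    # on top, as it would with an ARM store-multiple.
    for value in (lr, r1, r0):
        stack.append(value)

    # Exit: restore registers, finishing with the return address into PC.
    stack.pop()          # restores R0
    if balanced:
        stack.pop()      # restores R1 (the buggy handler skips this)
    pc = stack.pop()     # whatever is on top now becomes the return address
    return pc, len(stack)


# Balanced handler: PC receives the saved return address, stack is empty.
good = run_handler(0x11, 0x22, 0x8000, balanced=True)
# Buggy handler: PC receives the saved R1 value, one word is left behind.
bad = run_handler(0x11, 0x22, 0x8000, balanced=False)
```

In the buggy case the handler "returns" to whatever happened to be in R1, and because this happens on a vector called in a privileged mode there is no task boundary to contain the damage.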
A solution looking for a problem. If you think something has crashed, Alt-Break it. If that doesn’t work, you’re stuffed either way. |
|
I would bet that the incidence of that “untraceable cause for a crash” is much less common than it was before certain changes a few years ago. Trying to invent solutions to deal with the symptoms of a crash is rather missing the point, when what is needed is to make the bottom end of the OS more robust and insert a similarly robust layer above that interfaces the base to the nasty application level and even nastier humans. |
|
I’m not sure page zero changes did much here. It seems to throw repeated exceptions somewhere in the RMA that just takes out everything one by one. That being said, the sheer amount of stuff hanging off the SWI mechanism and running in SVC mode when it has no real reason to, yikes.
I think it’s a case of “too late, too late” the maiden cried. I’m all for hardening the OS, but I can’t help but think that doing any large scale change (like pushing most of the RMA over to SYS mode and making the true RMA inaccessible to user mode) will not only be more man-hours than we have, but will also break everything. So…um… |
|
There are several low-effort possibilities that are incremental steps to improving stability. e.g. Modules could have a boolean to state they should run in USR mode (the majority). Introduce a second RMA for USR mode only access and protect the SYS one (I’m sure this will break a handful of apps – they can be excluded somehow – maybe an opt-out flag in Wimp_Initialise?). Existing apps/modules can continue to run and crash as they do now. Ideally modules should have memory protection and paged memory (like a task), but that’s a big change. And there’s the inevitable conflict between improving the OS for stability and security, vs. functioning on RiscPC-era hardware with 32k pages. You can’t have both. Similarly dynamic areas. We were going to page dynamic areas, but ran out of time. It was clear the logical memory map could be exhausted. It would also avoid memory corruption from another task, but I doubt I’d considered that at the time. I pushed for sparse DAs as a priority, which was more fun but probably less useful. I doubt anything ever used them! https://www.riscosopen.org/wiki/documentation/show/Dynamic%20Area%20Flags#fnr4 I don’t see a link between crashing and the lack of pre-emption (if I understood that correctly). The crashes will undoubtedly be more frequent with pre-emption since the state is less deterministic. |
|
The problem is RISC OS is still fundamentally a single kernel context shared across all the running tasks; if an application causes something to get messed up in the machine, the whole system goes down. The only way to solve it is for the OS, and by extension every module, to have a separate context for each task. Then if an application causes a crash, everything to do with that task in the OS and modules is thrown away, so it can’t affect any other tasks. The only way I can think of implementing that with RISC OS, barring a fundamental and incompatible rearchitecting, is to essentially invoke a separate VM instance of RISC OS for every task that is run, which contains stub copies of things which need to be shared between applications, such as the filing systems, the Wimp and hardware access. These stubs would pass calls back to an implementation in the hypervisor, which would arbitrate access and present consistent machine state (it looks like a single desktop running multiple applications). That would cure crashing and, as a bonus, enable use of multiple cores, although each task can only use one. I don’t really see this happening though. |
|
The only “easy” improvement that I see is the modifications Gerph made in Select so that errors occurring in callbacks and interrupts don’t kill the foreground task. |
|
If it’s needless then it probably should be in the dog house ;) |
|
This has been discussed before – an extra flag could be added to allow modules to run in user mode with a suitable stack. Linux RO seems to be running totally in user mode. |
|
Returning to Stefan’s original post, the technology he’s describing is generally known as a “watchdog”. RISC OS has made use of watchdogs in the past, specifically in its set-top box and NC flavours back at Acorn. However, it’s not a silver bullet; it’s typically more about rebooting the machine automatically when RISC OS itself becomes unresponsive (to interrupts). |
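As a rough illustration of the watchdog idea, here is a minimal Python sketch. It is not how any RISC OS watchdog was implemented; the names `Watchdog` and `first_overrun` and the simulated clock are assumptions made for the example. Each task's time holding control is modelled as a number of seconds, and the watchdog must be "kicked" more often than its timeout or it counts as expired.

```python
class Watchdog:
    """Hypothetical watchdog: expires if not kicked within `timeout` seconds."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        # Called whenever control returns to the poll loop.
        self.last_kick = now

    def expired(self, now):
        return now - self.last_kick > self.timeout


def first_overrun(poll_durations, timeout):
    """Simulate a cooperative poll loop with a fake clock.

    poll_durations[i] is how long task i holds control before yielding.
    Returns the index of the first task that held control longer than
    the timeout, or None if every task yielded in time.
    """
    clock = 0.0
    dog = Watchdog(timeout, now=clock)
    for index, duration in enumerate(poll_durations):
        clock += duration          # the task runs and eventually yields
        if dog.expired(clock):
            return index           # this task overran its allowance
        dog.kick(clock)            # control came back in time
    return None


well_behaved = first_overrun([0.1, 0.2, 0.1], timeout=1.0)
one_hog = first_overrun([0.1, 0.2, 5.0, 0.1], timeout=1.0)
```

Note that this sketch only notices the overrun after the task has yielded. A real watchdog has to fire from a timer while the offending code is still running, which is exactly where the difficulty lies: at that point the system has to decide what is safe to tear down, and rebooting is often the only reliable answer.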