Solution against fatal crashes

10 posts, 9 voices

Feb 21, 2025 10:56am Stefan Fröhling (7826) 169 posts	I was just thinking about the biggest disadvantage of pre-emptive multitasking (RISC OS), and the answer is that buggy software can crash the whole of RISC OS and cause data loss in another program. Wouldn’t it be possible to have a WIMPControl module with an interrupt that checks every second whether each WIMP program has given control back to the WIMP loop within a specific time? New programs could add a parameter to the WIMP initialization stating the maximum time after which they give control back to the WIMP. Old programs that don’t specify the parameter would default to infinite time (0 = infinite). If it is detected that a program didn’t return control in time, it could be terminated and the WIMP loop would carry on working for the other applications. So there would be no catastrophic crash. What are your thoughts?
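
To make the proposal above concrete, here is a minimal sketch of the bookkeeping it describes, written as a plain Python simulation rather than a real RISC OS module. Every name in it (`WimpWatchdog`, `initialise`, `polled`, `tick`, `max_poll_gap`) is hypothetical and not part of any existing Wimp API; the only detail taken from the post is that a limit of 0 means "infinite", so old programs are never terminated.

```python
from dataclasses import dataclass

INFINITE = 0  # per the post: 0 means "no limit", the default for old programs


@dataclass
class Task:
    name: str
    max_poll_gap: float = INFINITE  # seconds the task promises between polls
    last_poll: float = 0.0          # time the task last gave control back
    alive: bool = True


class WimpWatchdog:
    """Hypothetical stand-in for the proposed WIMPControl module."""

    def __init__(self):
        self.tasks = {}

    def initialise(self, name, now, max_poll_gap=INFINITE):
        # Models the extra parameter on WIMP initialization.
        self.tasks[name] = Task(name, max_poll_gap, last_poll=now)

    def polled(self, name, now):
        # Called whenever a task hands control back to the WIMP loop.
        self.tasks[name].last_poll = now

    def tick(self, now):
        # Models the once-a-second interrupt: returns the names of the
        # tasks that overran their declared limit and marks them as dead.
        killed = []
        for task in self.tasks.values():
            if not task.alive or task.max_poll_gap == INFINITE:
                continue
            if now - task.last_poll > task.max_poll_gap:
                task.alive = False
                killed.append(task.name)
        return killed
```

In this sketch a task registered without a limit is never reported by `tick`, while a task that declared a limit of 2 seconds and last polled 4 seconds ago is. The sketch only covers detection; it says nothing about whether a stuck task can actually be terminated safely.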

Feb 21, 2025 12:14pm Rick Murray (539) 14047 posts	“I was just thinking about the biggest disadvantage of pre-emptive multitasking (RISC OS)”
I would say… not having it?
“buggy software can crash the whole of RISC OS and cause data loss in another program.”
This depends upon the kind of crash. See below.
“that checks every second whether each WIMP program has given control back to the WIMP loop within a specific time?”
You know, Alt-Break is a thing. It can help with these sorts of situations.
“New programs could add a parameter to the WIMP initialization stating the maximum time after which they give control back to the WIMP.”
How long is a piece of string?
“So there would be no catastrophic crash.”
There is a class of crash that I refer to as a “cascading crash”, where every app dies, one by one, until you’re left with an empty grey screen. I’m not sure what causes it, but that’s one that nothing will cater for. The truth is, some parts of RISC OS are inherently unstable. If you don’t believe me, write a simple module to hook into the keyboard vector. Stack three registers, restore by pulling only two, then watch the machine completely freeze when you poke a key. Unfortunately there is no separation between user and kernel, and while this makes it incredibly easy to extend the system in all sorts of exciting ways, it does also come with potential penalties.
“What are your thoughts?”
A solution looking for a problem. If you think something has crashed, Alt-Break it. If that doesn’t work, you’re stuffed either way.

Feb 21, 2025 1:15pm Steve Pampling (1551) 8272 posts	“I’m not sure what causes it, but that’s one that nothing will cater for.”
I would bet that the incidence of that “untraceable cause for a crash” is much less common than it was before certain changes a few years ago. One of those, Zero reason for a crash, situations…
Trying to invent solutions to deal with the symptoms of a crash is rather missing the point, when what is needed is to make the bottom end of the OS more robust and insert a similarly robust layer above that interfaces the base to the nasty application level and even nastier humans.

Feb 21, 2025 4:38pm Rick Murray (539) 14047 posts	I’m not sure page zero changes did much here. It seems to throw repeated exceptions somewhere in the RMA that just take out everything one by one. Like I said, it’s rare, and my completely unfounded plucked-from-my-posterior opinion would be casting a critical eye in the direction of MessageTrans (well, it wouldn’t be the first time). That being said, the sheer amount of stuff hanging off the SWI mechanism and running in SVC mode when it has no real reason to, yikes.
“insert a similarly robust layer above that interfaces the base to the nasty application level”
I think it’s a case of “too late, too late” the maiden cried. As said, plenty of stuff needlessly runs with kennel privileges and access. All sorts of bits of the OS hand out pointers to bits of the RMA (indeed, it’s one of the few ways that pollwords work). And, as is well known, there’s not a lot of input validation going on. Oh, and a rather weird error handling mechanism means that an error occurring at some low level can bubble up and take out some random innocent application that gets an error it can’t handle because it is an error from something else (or in the case of CLib, the library bombs out disgracefully without even asking the app what it might like to do about this).
I’m all for hardening the OS, but I can’t help but think that doing any large scale change (like pushing most of the RMA over to SYS mode and making the true RMA inaccessible to user mode) will not only take more man-hours than we have, but will also break everything. So…um…

Feb 21, 2025 6:42pm Piers (3264) 75 posts	There are several low-effort possibilities that are incremental steps to improving stability. E.g. modules could have a boolean to state that they should run in USR mode (the majority). Introduce a second RMA for USR-mode-only access and protect the SYS one (I’m sure this will break a handful of apps; they can be excluded somehow, maybe via an opt-out flag in Wimp_Initialise?). Existing apps/modules can continue to run and crash as they do now. Ideally modules should have memory protection and paged memory (like a task), but that’s a big change. And there’s the inevitable conflict between improving the OS for stability and security, vs. functioning on RiscPC-era hardware with 32k pages. You can’t have both. Similarly for dynamic areas. We were going to page dynamic areas, but ran out of time. It was clear the logical memory map could be exhausted. Paging would also avoid memory corruption from another task, but I doubt I’d considered that at the time. I pushed for sparse DAs as a priority, which was more fun but probably less useful. I doubt anything ever used them! https://www.riscosopen.org/wiki/documentation/show/Dynamic%20Area%20Flags#fnr4 I don’t see a link between crashing and the lack of pre-emption (if I understood that correctly). The crashes will undoubtedly be more frequent with pre-emption since the state is less deterministic.

Feb 21, 2025 8:14pm David J. Ruck (33) 1696 posts	The problem is that RISC OS is still fundamentally a single kernel context shared across all the running tasks: if an application causes something to get messed up in the machine, the whole system goes down. The only way to solve it is for the OS, and by extension every module, to have a separate context for each task. Then if an application causes a crash, everything to do with that task in the OS and modules is thrown away, so it can’t affect any other tasks. The only way I can think of implementing that with RISC OS, barring a fundamental and incompatible rearchitecting, is to essentially invoke a separate VM instance of RISC OS for every task that is run, which contains stub copies of things which need to be shared between applications, such as the filing systems, the Wimp and hardware access. These stubs would pass calls back to an implementation in the hypervisor, which would arbitrate access and present consistent machine state (it looks like a single desktop running multiple applications). That would cure crashing and, as a bonus, enable use of multiple cores, although each task could only use one. I don’t really see this happening though.
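
The structure described above can be illustrated with a toy model. This is only a sketch of the idea under loud assumptions: `Hypervisor`, `TaskInstance`, `private_state`, `save` and `run` are all hypothetical names, a Python exception stands in for a crash, and a dictionary stands in for the shared filing systems. It shows the two properties the post relies on: each task has its own throwaway context, and shared services are reached only through stubs that forward to a single arbiter.

```python
class TaskInstance:
    """Hypothetical per-task 'VM': private OS/module state plus stubs."""

    def __init__(self, name, hypervisor):
        self.name = name
        self.private_state = {}  # this task's own copy of OS/module context
        self._hv = hypervisor

    def save(self, path, data):
        # Stub for a shared service: forwards the call to the hypervisor
        # instead of touching shared state directly.
        self._hv.write_file(path, data)


class Hypervisor:
    """Hypothetical arbiter that owns the genuinely shared state."""

    def __init__(self):
        self.files = {}      # stands in for the shared filing systems
        self.instances = {}  # one isolated instance per running task

    def start_task(self, name):
        self.instances[name] = TaskInstance(name, self)
        return self.instances[name]

    def write_file(self, path, data):
        self.files[path] = data

    def run(self, name, work):
        # Run some work inside one task's context. If it "crashes",
        # discard that task's whole context and leave the rest alone.
        instance = self.instances[name]
        try:
            work(instance)
        except Exception:
            del self.instances[name]
            return False
        return True
```

With two tasks started, a failure inside one removes only that task's instance; the other task's `private_state` and anything already written through the `save` stub are untouched.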

Feb 21, 2025 10:28pm André Timmermans (100) 658 posts	The only “easy” improvement that I see is the modifications Gerph made in Select so that errors occurring in callbacks and interrupts don’t kill the foreground task.
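
As a rough illustration of the general idea in the post above (and not a description of how Select actually implements it, which is not covered in this thread), the sketch below contains errors raised by background callbacks so that they are recorded instead of propagating into whichever task happens to be in the foreground. The names `run_callbacks` and `foreground_task` are hypothetical, and Python exceptions stand in for RISC OS errors.

```python
def run_callbacks(callbacks, error_log):
    # Run each pending callback in turn. An error in one callback is
    # logged and contained; it never reaches the caller.
    for callback in callbacks:
        try:
            callback()
        except Exception as exc:
            name = getattr(callback, "__name__", repr(callback))
            error_log.append((name, str(exc)))


def foreground_task(pending_callbacks):
    # A stand-in for an innocent application that is running when the
    # callbacks fire. It should complete regardless of callback errors.
    error_log = []
    steps = ["before"]
    run_callbacks(pending_callbacks, error_log)
    steps.append("after")
    return steps, error_log
```

Without the `try`/`except` in `run_callbacks`, the first failing callback would abort `foreground_task` partway through, which is the behaviour the post describes as killing the foreground task.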

Feb 22, 2025 3:42am Chris Mahoney (1684) 2177 posts	“As said, plenty of stuff needlessly runs with kennel privileges and access.”
If it’s needless then it probably should be in the dog house ;)

Feb 22, 2025 8:31am Colin Ferris (399) 1847 posts	This has been discussed before: an extra flag could be added to allow modules to run in user mode with a suitable stack. Linux RO seems to run entirely in user mode.

Feb 22, 2025 6:41pm Steve Revill (20) 1393 posts	Returning to Stefan’s original post, the technology he’s describing is generally known as a “watchdog”. RISC OS has made use of watchdogs in the past, specifically in its set-top box and NC flavours back at Acorn. However, it’s not a silver bullet; it’s typically more about rebooting the machine automatically when RISC OS itself becomes unresponsive (to interrupts).
