Machine hang in USR mode C
Adrian Lees (1349) 122 posts |
    unsigned buf[0x100000];

    int main(void)
    {
      unsigned *dp = (unsigned*)buf;
      while (1) *dp++ = 0;
      return 0;
    }

Now, obviously that code is broken (I’ve stripped it down from in-development test code) but it locks the entire machine solid, from USR application code (RISC OS 5.23 i.MX6 though I doubt that’s important). I’ve not yet made the effort to confirm how, but my suspicion is that trampling all over the heap (which of course contains the stack) stops background SharedCLibrary handlers from executing properly…whether vectors or the exception/signal handling I don’t know*.

My purpose in posting is to stimulate thought/discussion on how to prevent it locking the machine solid, because this strikes me as a much more serious threat to system stability than zero page being readable from USR application code.
|
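For contrast, a bounded version of the loop in the first post stops at the end of the array, so nothing stored after `buf` is touched. This is only a sketch: `BUF_WORDS` and `clear_buf` are names introduced here for illustration, and the array is shrunk to keep the example small.

```c
#include <stddef.h>

#define BUF_WORDS 0x1000u  /* illustrative size; the post used 0x100000 */

static unsigned buf[BUF_WORDS];

/* Clear the array and return the number of words written.
   The end pointer bounds the loop, so no store lands past buf. */
size_t clear_buf(void)
{
    unsigned *dp = buf;
    unsigned *const end = buf + BUF_WORDS;

    while (dp < end)
        *dp++ = 0;

    return (size_t)(dp - buf);
}
```

Comparing against a one-past-the-end pointer keeps the same pointer-walking shape as the original, which makes it easy to see that the only difference is the termination test.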
Jon Abbott (1421) 2651 posts |
What does the machine code look like? |
Adrian Lees (1349) 122 posts |
Unexciting (STR ,[],#4 : B ) and not helpful in isolation. I also get a lock up with
Although, interestingly, with just ‘0x100’ bytes cleared the application code gets stuck in a loop, not aborting, and interrupts still work; executed in a TaskWindow, it still multitasks. There are any number of low-level data items (abort handler data, interrupts, ‘stub_und|svcstack’) stored immediately after the ‘buf’ global, so perhaps it’s not even heap/stack corruption. Obviously, in neither case is the ‘sl’ register itself directly corrupted. The ‘Stub$$Data’ area starts immediately after the single word ‘buf’. |
Rick Murray (539) 13850 posts |
I don’t think it matters… The code will, in short order, set everything from buf to the end of appspace to zero. User mode code should not be capable of stiffing the machine, so clearly there is something lurking in the heap that C (or some part of the environment handlers?) requires, and it’s that getting trampled upon that is causing the stiffage.
There seems to be a certain degree of inherent shonkiness in CLib. A bunch of moons ago I had instant stiffing which was tracked down to something like using a function such as ftell() and an expired (closed) file handle. That should not stiff the machine, but it did. My memory is cloudy here, but I seem to recall that there is no such ftell() function. Instead the compiler emits code to blindly load data from some magic internal offset, but it requires that the file handle make sense. While this is not unusual (using stale handles is a bug), there’s no validation or attempt to check the handle is valid, and it’s this that leads to the stiffage. I think. I ought to revisit this and see if I can track it down in proper detail. Thing is, like a lot of RISC OS: you can have it safe, or you can have it fast, but you can’t have both. |
nemo (145) 2552 posts |
Well said that man. RISC OS vulnerabilities:
You don’t say. |
Rick Murray (539) 13850 posts |
Found my earlier post on the stiffing problem – https://www.riscosopen.org/forum/forums/4/topics/3218 As nemo (and me, and you, and ’im over there in the corner) says: No checking of inputs. |
Jon Abbott (1421) 2651 posts |
You’re presuming we all know C!
It’s perfectly possible to stiff a machine in User mode, in fact, I’ve been debugging a game all day that does just that. Assuming there’s nothing beyond buf that C is dependent on, the code is going to attempt to write past the end of appspace which should trigger an abort. If you set buf to a few words below the end of appspace, does it result in a stiffed machine? And does it also stiff if buf starts at the end of appspace? Does switching it to read buf instead of write change the behaviour? If you write to a random address up high, does C handle the abort correctly? |
nemo (145) 2552 posts |
“User mode code should not be capable of stiffing the machine”
What was meant was that it shouldn’t be possible to stiff the machine by writing to your own application slot, when there’s no handlers/claimants/callbacks in there. But CLib makes (possibly undocumented) assumptions, it seems. |
Steve Pampling (1551) 8172 posts |
Just to emphasise the point – even I’ve noted that large portions of RO don’t check inputs. There’s only one way it could be noted as more obvious and that involves my cats giving the code a look over. |
Jon Abbott (1421) 2651 posts |
Last week I was seeing a stiffed machine when writing to PMP’d appspace pages after return from a Wimp Poll. I’ve yet to diagnose what’s causing the issue, I instead cheated and read all of appspace first to force the pages in.
Is CLib the root cause? |
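The workaround mentioned in this post (reading all of appspace first to force the pages in) could be sketched roughly as below. `touch_pages` is a hypothetical helper, the page size is passed in by the caller as an assumption rather than queried from the OS, and the base address is assumed to be page aligned.

```c
#include <stddef.h>

/* Read one byte from each page in [base, base + len) so that every
   page is accessed before anything is written to it.
   Returns the number of pages touched; a page_size of 0 touches none. */
size_t touch_pages(const volatile unsigned char *base, size_t len,
                   size_t page_size)
{
    size_t touched = 0;

    if (page_size == 0)
        return 0;

    for (size_t off = 0; off < len; off += page_size) {
        (void)base[off];  /* volatile read, so the access is not optimised away */
        touched++;
    }
    return touched;
}
```

This only hides the symptom, as the post says; it does nothing to explain why the writes stiff the machine.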
nemo (145) 2552 posts |
And my nomination for the “it does what?” of the week goes to OS_SetColour which (and don’t believe any documentation you’ve ever seen, it’s all wrong) will happily write anywhere in memory, even when you’ve told it not to. You might think that OS_SetColour sets the colour. It does, but it can also Get the colour. Surprise! When it is reading the colour (flags b7) the other flags magically change meaning. So:
To SET an ECF b7 (read/write) clear = write
To GET the ECF b7 (read/write) set = read
So if you naïvely try to read the current colour number using OS_SetColour, you’ll overwrite 32 bytes of memory somewhere… probably 0 if called from Basic. It doesn’t check the pointer is sensible. It doesn’t even check if it’s aligned. Joy. |
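A caller-side guard against the trap described above might look like the following. It is a sketch only: `SETCOLOUR_READ_FLAG`, `ECF_BLOCK_BYTES` and `setcolour_args_ok` are hypothetical names, bit 7 as the read flag and the 32-byte write are taken from the post, and nothing here is checked against the real SWI interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SETCOLOUR_READ_FLAG (1u << 7)  /* b7 set = read, per the post */
#define ECF_BLOCK_BYTES     32u        /* bytes the post says get written */

/* Return true if it looks safe to pass these arguments on.
   Only the read-back case is checked, because that is the case
   in which the SWI writes through the pointer. */
bool setcolour_args_ok(unsigned flags, const void *block, size_t block_bytes)
{
    if ((flags & SETCOLOUR_READ_FLAG) == 0)
        return true;                        /* setting: not covered by this guard */

    if (block == NULL)
        return false;                       /* would write at address 0 */
    if (((uintptr_t)block & 3u) != 0)
        return false;                       /* not word aligned */
    return block_bytes >= ECF_BLOCK_BYTES;  /* room for the whole block */
}
```

The check is deliberately conservative: it rejects an unaligned block even though the post does not say alignment is strictly required, only that it is never checked.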
Steve Pampling (1551) 8172 posts |
Nemo, your pictured furry person appears to have seen the source for either the client or server code of our pathology system. |
Rick Murray (539) 13850 posts |
Oh yeah. The furry one looks suitably miffed. That is not a cat that enjoys debugging other people’s half-assery. |
Clive Semmens (2335) 3276 posts |
I’d love to meet one that does! |
Jeffrey Lee (213) 6048 posts |
Yes, that’s caught out other people as well. The APCS chapter in PRM4 says “This standard does not define the values of fp, sp and sl at arbitrary moments during a procedure’s execution, but only at the instants of (external) call and return. Further standards and restrictions may apply under particular operating systems, to aid event handling or debugging. In general, you are strongly encouraged to preserve fp, sp and sl, at all times.” However, the CLib chapter doesn’t appear to make any mention of any requirement to ensure that the registers are always valid. Which is a bit of an oversight, considering that the APCS chapter doesn’t give a clear answer as far as RISC OS is concerned.
For this case, I suspect the pertinent assumption is “my workspace hasn’t been trampled”. I can probably look at this in a day or two (i.e. once I’ve worked up the courage). |
Adrian Lees (1349) 122 posts |
Okay, yes, fair enough. The memory layout is basically:
Code
Other data
'buf' <- the array in question. The broken code is writing beyond the end of this array
Stubs data for the SharedCLibrary to use
Heap, for dynamic allocation, including program stack which is allocated in 'chunks' that are chained together

Simply attempting to access beyond the end of application space will take down just the guilty application. It’s understood – if lamented – that passing rogue pointers to RISC OS SWIs is apt to kill the system, but the fact that overflowing a global static array can take out the system is not just a big concern, it’s also a mare for C programmers to debug! I’ll try to spend some time trawling through the background/exception handlers to see if they can be made more robust. |
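Given the layout above, one debugging aid for bounded overruns is a guard word placed directly after the array and checked at convenient points. This is a generic technique, not something the thread proposes, and all names here (`guarded_buf`, `GUARD_MAGIC`, `guarded_init`, `guarded_intact`) are made up for the sketch. It would not help with the runaway loop from the first post, which never returns to run a check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORDS 64u          /* illustrative size */
#define GUARD_MAGIC 0xDEADBEEFu  /* arbitrary pattern, unlikely to be written by accident */

struct guarded_buf {
    uint32_t data[GUARD_WORDS];
    uint32_t guard;              /* declared after data, so it sits at a higher address */
};

/* Zero the array and arm the guard word. */
void guarded_init(struct guarded_buf *g)
{
    for (size_t i = 0; i < GUARD_WORDS; i++)
        g->data[i] = 0;
    g->guard = GUARD_MAGIC;
}

/* True while nothing has overwritten the guard word. */
bool guarded_intact(const struct guarded_buf *g)
{
    return g->guard == GUARD_MAGIC;
}
```

A sequential overrun that writes zeros would clobber the guard on its first step past `data`, so a failed `guarded_intact` check points at the array rather than at whatever data happened to follow it.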
Jon Abbott (1421) 2651 posts |
If Stubs and the Heap are beyond buf and overwritten, won’t the behaviour of C’s Abort handler be unpredictable? What aborts are seen at the hardware vector? Does an abort occur and then stiff the machine when RISC OS or C’s abort handler deal with it? I’m interested to see if it’s the OS or C that’s stiffing the machine. Logic says it’s C, but PMP still needs to be ruled out as a factor on the OS side. |
nemo (145) 2552 posts |
Put them somewhere else! A one-page DA per CLib client would be a small price to pay. |
Rick Murray (539) 13850 posts |
…and make softloaded CLib refuse to die, so we can at least get rid of that lurking stupidity. [A CLib that refuses to quit with no clients is less destructive than one that does and leaves every jump table pointing at who knows what…] |
nemo (145) 2552 posts |
It would be possible for stubs in DAs to go into a detached ‘safe mode’ that would reconnect after the new CLib starts. |
Jeffrey Lee (213) 6048 posts |
Good luck! If we can work out a valid set of restrictions for applications (e.g. “application clients must always be in application space”) then that would allow CLib to more easily detect invalid pointers in its workspace.
Possible but not feasible. Too many complications to deal with (suspended wimp tasks, interrupt handlers which call into CLib while the module is being reloaded, etc.) A much more reasonable proposition would be to have CLib store its core code outside of the CLib module (e.g. in a DA or another RMA block), so that the existing clients can continue to use the old version, and only new clients are made to use the new version. This could also be used as a way of solving the “your CLib bugfix broke my buggy code” problem, by allowing apps to request specific CLib versions to be used (or have some app compatibility thing in the OS to control this selection). |
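The restriction floated earlier in this post (“application clients must always be in application space”) would make a range check like the following possible. It is a minimal sketch, not CLib code: `ptr_in_appspace` is a hypothetical helper, `APPSPACE_BASE` assumes application space starts at &8000, and the upper limit is passed in because how CLib would obtain it is not discussed in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APPSPACE_BASE ((uintptr_t)0x8000u)  /* assumed start of application space */

/* True if the len bytes starting at addr lie wholly inside
   [APPSPACE_BASE, appspace_limit). The subtraction form avoids
   overflow when addr + len would wrap. */
bool ptr_in_appspace(uintptr_t addr, size_t len, uintptr_t appspace_limit)
{
    if (addr < APPSPACE_BASE || addr > appspace_limit)
        return false;
    return len <= appspace_limit - addr;
}
```

A handler could apply a check of this kind to each pointer it loads from its workspace before following it, and refuse to continue if any fails.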
Rick Murray (539) 13850 posts |
It’s possible to write module applications in C. I have one, and there’s some part of the standard OS that does the same sort of thing.
The slippery slope to DLL hell.
Now that memory is cheap and plentiful… Yeah, why not? |
Jeffrey Lee (213) 6048 posts |
That’ll be a module client as far as CLib is concerned (assuming the standard stubs are used). |
Adrian Lees (1349) 122 posts |
And down the rabbit hole he falls… The clib code is copying some stub code to the locations immediately following the abused array, and registering environment handlers which are dutifully called by the kernel in SVC mode. Pretty much game over, although exactly how that hangs the box in this specific case is not yet clear. ISTR that it registers an event handler too, which is perhaps even more dangerous…(ongoing) Anyway, whilst experimenting, I’ve discovered that a simple busy loop, executed within a TaskWindow, does not multitask?!
I’ve tried that on three different boxes and even with an older CLib module (5.83, 23 Aug 2014), so I really don’t understand that…anyone? Within a TaskWindow, a simple ‘branch to self’ (B &8000) instruction built using *MemoryA and executed with *Go, permits multitasking to continue as expected. |
nemo (145) 2552 posts |
Can someone explain why those handlers couldn’t be in a DA?
Well it does in RO4, using GCC. Is it not in USR mode for some reason? I was surprised that it was slightly choppy when dragging the taskwindow around, so I went back to a couple of versions from 2001 – one I made that was supposed to be ‘faster’ (and no, I can’t remember how/why) and one that Dan had made. Convinced myself they were slightly less choppy than 0.76 from 2003. How much do you need to put back in before it behaves again? Does the watchdog keypress kill it? Does a KeyV-triggered OS_Exit kill it? Strangeness. |