Machine hang in USR mode C
Jon Abbott (1421) 2651 posts |
I’ve just tried turning off the code that touches every page to prevent PMP aborts and then start seeing aborts I shouldn’t see. Are you sure PMP aborts are handled before OS_ClaimProcessorVector? Where’s the source so I can see what it’s doing when an abort is raised and how it retries the instruction? Could this be added to the RO download? Not easily, its for shutting down tasks running under the ADFFS JIT. Its something that could easily be implemented as ALT-BREAK within the OS however, its not exactly rocket science as the OS already knows about every entry point into appspace (voices, vectors, environment handlers etc) and just needs to kill them before collapsing the stacks and doing an OS_Exit. |
Rick Murray (539) 13850 posts |
Okay, it’s a fair point. So I looked in the 5a PRM, and it says: For graphics colours you must supply a pattern block, whereas text colours are returned in R1 as colour numbers.The point being that it shouldn’t be writing to unknown memory locations. Something documented somewhere does make mention…
Yup. That plus random insanity. Thank you for breaking your brain on this stuff. Saves us from wandering too far into the abyss. |
nemo (145) 2552 posts |
Indeed. It is unfortunate that all that has not been pulled together into a controlling API (a kind of all-flavours DelinkApplication) coordinated through a Service Call (or something more modern). Because although you say ‘every entry point’ that’s certain to not be the case. Have a controlling protocol though, and any other systems the Kernel is unaware of could be tied in too. |
Jeffrey Lee (213) 6048 posts |
2. Kernel pre-veneer (deals with lazy task swapping) Looks like I got it wrong; the kernel pre-veneers occur after OS_ClaimProcessorVector. I’ll correct the post. |
Jeffrey Lee (213) 6048 posts |
Anyway, whilst experimenting, I’ve discovered that a simple busy loop, executed within a TaskWindow, does not multitask?! Actually, it looks like it’s the kernel that’s been left in the funny state. It looks like the last SWI that’s executed by the C runtime startup is returning an error, causing the callback postpone flag to be set. And since all that main() does is sit in a loop, no other SWIs are executed to trigger that postpone flag to be cleared, blocking TaskWindow’s task switch callback (and all other callbacks) from occurring. |
Jeffrey Lee (213) 6048 posts |
I’ve now submitted a change to CLib to try and prevent callbacks being blocked – but it’s not a great solution for a number of reasons (e.g. errors returned by _swix will still block them, since that doesn’t use CopyError). But it is at least something that can be softloaded on older machines. For a proper fix, I’m thinking of going straight to the source. Either by adding a safety timer to the postponement flag (so callbacks can only be postponed for so many centiseconds), or changing the kernel to make a copy of the error block in some kernel workspace. However that second option would also require the contents of that error block to be preserved by anything which performs task swapping, which hits the “we don’t have a proper DelinkApplication call” issue that nemo mentions. |
Jon Abbott (1421) 2651 posts |
If I understand correctly, an SWI X that returns with V set, blocks all further callbacks until another SWI is successful? Are they blocked system wide, or just for the current task? If the reasoning behind that is to prevent an error block being overwritten, why not allocate a new error block from the RMA, track the RMA allocations and remove them on the next SWI success, without blocking callbacks? …or is there more to it? Is this behaviour documented somewhere, there’s no mention of it in the CallBack Handler documentation. |
nemo (145) 2552 posts |
s/stop thing from exploding/stop thing from exploding for a short time/
That’s the one. The perceived problem is that the error block that will be returned to the caller after the callbacks could be corrupted by the callbacks. I’m not terribly impressed by the could there – copying all error blocks because of theoretical error buffer reuse is a fix in the wrong place, but it is a fix that will work.
There are a number of scenarios: R0 is in ROM – do nothing. R0 is in appslot – do nothing. SWI is XOS_GenerateError – do nothing. No pending callbacks – do nothing. Otherwise, copy the error into a cyclic array of error buffers, and update R0. As for ‘task swapping’, that was already the case – error buffers that can be reused during callbacks can definitely be reused by another task. |
Rick Murray (539) 13850 posts |
Just out of interest… What SWI is failing? |
Jeffrey Lee (213) 6048 posts |
It’s a bit more complicated than that.
So to clear the flag you either need to call a SWI (from USR mode with interrupts enabled) which doesn’t return an error, or call a non-X SWI which will cause an error to be generated. However, as outlined in the CallbackChange document the behaviour in earlier OS versions has been different. The flag isn’t explicitly cleared when swapping task, but since swapping tasks usually involves calling a number of SWIs (most/all of which should succeed) I think it’s safe to say that the flag should be cleared as a side-effect. But maybe it’s worth investigating just to make sure there aren’t any edge cases.
I don’t think there’s any need to use an RMA block (compared to the kernel just having a static buffer somewhere), although doing so might help reduce the amount of data that needs copying during task switches.
Judging by the comments in the CallbackChange document, no it wasn’t/isn’t documented anywhere.
MessageTrans is responsible for handing out most of the error blocks that programs use. It has one pool of error buffers which are used by interrupt handlers, and another pool which is used by foreground tasks. But as far as it’s concerned a callback is a foreground task, so any callbacks that occur will run the risk of overwriting errors which the foreground program hasn’t had a chance to examine yet. Things could be tweaked so that the system is more reliable (e.g. it should be possible for the kernel to expose a flag for whether you’re in a transient callback or not), potentially that would be enough to produce a reliable solution to the problem. Except, all bets are off for anything that dynamically generates errors but doesn’t use MessageTrans to manage the blocks.
OS_Args 2, because it was getting called with a file handle of 0. |
Jon Abbott (1421) 2651 posts |
Could you not allocate a page of memory that’s way above appspace, but is swapped along with it, to hold that tasks error block? |
nemo (145) 2552 posts |
I don’t know why various people are talking about task swapping, it’s completely irrelevant. |
Jon Abbott (1421) 2651 posts |
Because you want the contents of memory pointers returned by the OS to be persistent on a per task basis, without reserving large chunks of kernel space for cyclic buffers? This goes for Error blocks, MODE specifiers and countless other transient items. Is there a way to programmatically determine if CallBacks are blocked? |
nemo (145) 2552 posts |
In what way do you think anything has changed? The perceived vulnerability is that a TaskWindow client could be pre-empted before it has finished with an error block – normal Wimp tasks wouldn’t dream of holding onto an error pointer over a call to Wimp_Poll. So in the case of a TaskWindow client, is it your belief that error pointers are currently longer-lasting than they would be if copied to a cyclic buffer? Because that’s not true. It is already the case that having received an error pointer returned from an XSWI (with callbacks avoided) the TaskWindow client may well call some other SWI (in fact it’s almost inevitable) before it has finished with the error pointer. Assuming that this SWI does not return an error, callbacks will be triggered, and the TaskWindow client may be pre-empted. This has always been the case. MessageTrans uses cyclic buffers – 32 if memory serves (for foreground callers) – but any non-ROM module may only have a single error buffer which could be overwritten by the actions of other tasks before TaskWindow gets its idle poll back again. So the suggested change only makes error pointers more reliable, not less, and if the SWI Handler had cyclic error block copying, then MessageTrans could probably do away with its similar scheme. (MessageTrans uses a cyclic array of buffers, but a truly cyclic buffer would be much more memory efficient). Don’t get me wrong, I do believe that task management should be much nearer to the Kernel than it is at the moment (and I’ve argued that it should be removed from the Wimp and bundled into TaskManager, but that’s a detail). There are things that would benefit from spotting context switches (via some mechanism more efficient than a Service Call) – a system-wide DelinkApplication would be one of the benefits. And error buffering during the callbacks would be neater if integrated into that task swap… but it’s not on a per-task basis, but a per-pre-emption basis. It ought to be TaskWindow’s responsibility, not the Wimp’s, for example. But this is finessing, it is not necessary. A cyclic buffer is already a huge improvement. Ultimately the extent of the cyclic buffer is dictated by how many TaskWindow clients have been simultaneously (serially) pre-empted during their error handling… so I’m going to stick my neck out and say MessageTrans’ choice of 32 does not look recklessly small. |
Jeffrey Lee (213) 6048 posts |
Which suggested change?
Under RISC OS 5, MessageTrans uses 4 buffers for interrupt handlers and 16 for “foreground” code. |
nemo (145) 2552 posts |
Copying of error buffer (if there is one) around callbacks (if there are any).
Ah so. RO4 is (I’ve just checked) 8 and 32. Either because “it goes up to 11” or post hoc ergo propter hoc – I know not which. |
Rick Murray (539) 13850 posts |
I have a module (written in C) which is reading its data wrongly. Something that should have zero indexed items is claiming to have 208. I’m not sure why yet, in the middle of debugging it. However… It was setting a backpointer to a previous index entry. Which didn’t exist. So it was returning the pointer as a file offset to 538976288. The C code was attempting to fseek() to this location, and then fwrite() if the seek succeeded. This would invariably trigger a crash with an undefined instruction around &30 or so (I’m using a low vector image). I don’t know if it’s RISC OS or CLib at fault. I do know that passing a crap file offset (no matter how out of range) to fseek() should never muck up the system that badly. [my current anti-stiffage fix is to simply bail out if the previous message pointer (514MB) is greater than the file size (64 bytes)] Edit: Oops. I was looking up array offsets using the non-consequential array references rather than the actual array offsets. So, oops, trying to read index value 60 from an array with maybe 15 elements ain’t gonna work. ;-) |
nemo (145) 2552 posts |
I had the pleasure of mentoring a graduate in C, who had been taught nothing but Java. “Array bounds checking? What’s that?” he infamously asked. |
Steffen Huber (91) 1953 posts |
Hopefully you answered “It’s the mechanism that throws those IndexOutOfBoundsExceptions”. |