SWI &2B returned a bad error pointer
François Vanzeveren (2221) 241 posts |
Hello, I am following the book “A Beginner’s Guide to Wimp Programming on RISC OS Computers (Second Edition)” by Martyn Fox.
I need help to understand why this piece of code, copied from the book, does not work. It generates the error in the thread title (“SWI &2B returned a bad error pointer”), while the err% reported by *Report looks like
Thank you for your help. François |
Jeffrey Lee (213) 6048 posts |
Recently RISC OS 5 has started doing sanity checks on error pointers, to try and avoid crashes (or other bad behaviour) as a result of buggy code (typically assembler code which sets the V flag but fails to set R0 to the error pointer). In this case, I think it’s complaining because bit 30 of the error number is reserved and should be zero. https://www.riscosopen.org/wiki/documentation/show/SWI%20Error%20Block |
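As a concrete illustration of the block format described on that wiki page, here is a minimal BASIC sketch; the error number and message are arbitrary examples rather than anything from the book:

REM Word 0 holds the error number; offsets 4 onwards hold a zero-terminated message.
DIM err% 256
err%!0 = 1 + (1 << 30)            : REM arbitrary number with the contested bit 30 set
$(err%+4) = "Example internal error" + CHR$(0)
SYS "OS_GenerateError", err%

On the RISC OS 5 builds discussed in this thread, that final call is rejected with the “SWI &2B returned a bad error pointer” message from the thread title; with bit 30 clear, the “Example internal error” message is reported as expected.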
François Vanzeveren (2221) 241 posts |
Hi Jeffrey, well spotted! I have replaced this with a global variable fatal%. Thank you for your help. Regards, François |
Colin (478) 2433 posts |
According to the PRM (1-43), bit 30 can be used by programmers to flag internal errors. |
François Vanzeveren (2221) 241 posts |
Hi Colin, yes, but setting this bit to 1 causes the error. Regards |
Colin (478) 2433 posts |
My point – poorly made – is that it’s a bug in the OS. Someone has read the wiki instead of the PRM when coding the changes. It should be changed so that bit 30’s behaviour is unchanged. |
Martin Avison (27) 1494 posts |
In the Wiki, bit 30 in the error number is ‘defined to be zero’ – are you saying that should now be ‘Reserved. Must be zero’? That would prohibit the previously documented use in the PRMs. The Wiki only says that the error number validation checks that no reserved bits are set.
So should it also be exempt from error number validity checks? |
nemo (145) 2547 posts |
XOS_GenerateError must do no checks whatsoever. OS_GenerateError is entitled to check the validity of the error ptr in R0, as it may not be valid. However if it finds any reserved bits set in the error number (including b30) then this merely indicates that it cannot interpret what the error number means… it does not mean the error should be obliterated by another. In other words, if any reserved bits are set (including b30) the error number should be treated as a basic error number.
There seems to be some confusion on the part of the developer between what a programmer might want to see, and what a user definitely doesn’t need to see. “SWI &2b returned a bad pointer” is a chocolate teapot. It’s not correct or helpful.
Does the checking also ensure that the text message is printable characters only, terminated within a reasonable length? I’d have thought that was more useful than quibbling about a bit in a number that probably isn’t going to be shown to the user anyway. |
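To make the distinction nemo draws concrete, here is a small BASIC sketch; the block contents are arbitrary, and the flags% idiom simply reads back the processor flags so the V flag (bit 0) can be tested:

DIM err% 256
err%!0 = 1                        : REM arbitrary illustrative error number
$(err%+4) = "Demo error" + CHR$(0)
REM X form: returns to the caller with V set and R0 still pointing at the block,
REM so there is nothing for the OS to check or rewrite.
SYS "XOS_GenerateError", err% TO returned% ; flags%
IF (flags% AND 1) <> 0 THEN PRINT "V flag set; the block came back untouched"
REM Non-X form: hands the block to ErrorV / the current error handler, which is
REM where OS_GenerateError may reasonably sanity-check what it has been given.
REM SYS "OS_GenerateError", err%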
Steve Pampling (1551) 8170 posts |
Well, saying that OS_GenerateError is the source of an error is a bit, erm, “recursive”, certainly, but I’m not sure that’s totally unhelpful. I suppose it’s one of those things – you have to be very, very sure of how the error handling will behave, and particularly that it won’t cause errors, because any such errors will mask the original error that sent you into the error handler. More of a concrete cushion than a chocolate fireguard, perhaps? |
Jeffrey Lee (213) 6048 posts |
But as soon as you call OS_GenerateError, surely it is no longer an internal error? You’re broadcasting the error to world + dog via ErrorV.
Perhaps the behaviour should be changed, but raising objections when ROOL are on the cusp of making a stable release (if it’s not already too late) is not helpful. These changes originally went in over two years ago, after first being mooted on the forums several months prior. We have a code review forum and a CVS “recent changes” view for a reason. https://www.riscosopen.org/forum/forums/3/topics/3540
“For compatibility with existing programs, XOS_GenerateError is exempt from SWI error pointer validity checks”
Correct. XOS_GenerateError allows you to do anything. XMessageTrans_ErrorLookup will have the pointer checked (a bit redundant, unless MessageTrans is completely broken) but won’t check the error number. Everything else will be subject to both pointer + error checks (with exceptions for things like FileCore’s disc error handling). |
Colin (478) 2433 posts |
I have always interpreted the meaning of bit 30 in the error number the same way as my interpretation of bits 18:19 = 3 in the SWI number, i.e. I can use any number with these bits set in-house, so I use them while developing. Are you saying that if I’m writing programs for my personal consumption I need an error chunk from ROOL to guarantee I won’t break 3rd-party programs? Is my interpretation of bits 18:19 for the SWI chunk also wrong? The mere fact that you defend the change makes me doubt my interpretations. |
Chris Mahoney (1684) 2165 posts |
The PRM says that “bit 30 is defined to be clear”, so my interpretation is that once you set it yourself, it’s no longer a valid error as far as the OS is concerned (and is therefore only valid in your own custom error handling code). |
Jeffrey Lee (213) 6048 posts |
The PRM doesn’t really explain itself in any detail, which doesn’t help much.
Bits 18-19 of the SWI number set to 3 are described as for “user applications”, which is generally interpreted as meaning you can pick any SWI chunk within that range for local development, and it’s your responsibility to make sure it doesn’t clash with SWI chunks used by any other software on your machine (which should be easy, as long as released software sticks to registered chunk numbers).
Meanwhile bit 30 of the error number “is defined to be clear, and can therefore be used by programmers to flag internal errors”. I’d interpret this in a similar way to how Chris interprets it – those error numbers will be valid inside your code, but should never be passed outside of your code. If module A uses an internal error number to mean one thing, and module B uses that same internal error number to mean something else, and somehow an error from module A finds its way into module B (SWI, service call, etc.) then module B could get confused about what the problem is. Unless module B is expected to recognise that the error came from an external source, in which case the “internal error” flag is essentially useless since code which uses internal errors will have to store a flag outside of the error number/block in order to identify whether the error was self-generated or came from some external source (i.e. whether it’s an internal error or not!)
However, I do agree that banning bit 30 isn’t terribly useful for developers who haven’t yet registered an error chunk and want to return errors from their code (e.g. if they’re developing two programs which will be talking to each other). There’s no “user error” chunk, so all it’ll do is force them to pick some other arbitrary error chunk, with no greater chance of safety than if they were to use internal error numbers. |
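For readers unfamiliar with the two conventions being compared, a short sketch follows; both values are hypothetical, chosen purely to exercise the bit tests:

swi% = &C0000                     : REM hypothetical SWI number with bits 18-19 set to 3
IF ((swi% >> 18) AND 3) = 3 THEN PRINT "SWI number lies in the user-application range"
errnum% = 1 OR (1 << 30)          : REM hypothetical error number with bit 30 set
IF (errnum% AND (1 << 30)) <> 0 THEN PRINT "Error number carries the internal-error flag"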
nemo (145) 2547 posts |
There is a difference between what a developer might want reported, and what a user might want reported. If a program emits an error saying “You have forgotten to insert your woggle”, why would a user want that to be reported as “SWI &2b returned a bad pointer”? Can someone explain why the user would want that?
There’s a good argument for a Developer Mode which issues warnings (though preferably through SysLog) about deprecated APIs (of which I suspect this is one – I don’t remember Arthur having this restriction, but I could be wrong), unaligned pointers (which should have been – a repeated sin in RO4), etc. That would be useful. However, the user is not a QA engineer. They will find the real error report more useful.
Jeffrey lamented
It’s not convenient.
And if you call OS_GenerateError with R0=0 is it no longer a null ptr? The fact that the code can tell it’s an internal error means
Steve cautioned
So the error has to be errored because the error might cause an error in things that ignore b30 in error? Chris interpreted
It says … This is the source of Jeffrey’s confusion when he said
Yes of course it is. It is an error flagged in the body of the application, not returned to it by the OS. That is what an internal error is. My error handlers prefix “Internal error:” to that kind of thing (though I haven’t used that bit myself) before passing it on. The OS defines that it will not return an error number with that bit set, but it also defines that the bit can be used by the programmer. To claim that it cannot be passed to OS_GenerateError (or Wimp_ReportError, by extension) is an interpretation of “use” that I don’t understand. If b31 is set, it’s a fatal error with a restricted range of reason codes. |
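A sketch of the kind of handler nemo describes might look like this; the procedure name, scratch buffer and task name are illustrative rather than taken from his code, and length checking of the prefixed message is omitted for brevity:

DEF PROCreport(err%, scratch%)
REM err%     -> error block received from somewhere else
REM scratch% -> caller-supplied 256-byte buffer for the rewritten block
LOCAL i%, msg$
msg$ = ""
i% = 4
WHILE err%?i% <> 0 AND i% < 256
  msg$ = msg$ + CHR$(err%?i%)
  i% += 1
ENDWHILE
IF (err%!0 AND (1 << 30)) <> 0 THEN msg$ = "Internal error: " + msg$
scratch%!0 = err%!0
$(scratch%+4) = msg$ + CHR$(0)
SYS "Wimp_ReportError", scratch%, %01, "Example"
ENDPROC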
nemo (145) 2547 posts |
And?
Wouldn’t that be the sensible thing to check, if checking is to be done? |
Steve Pampling (1551) 8170 posts |
Why would a developer want something as standard as a known SWI to be identified as &“something in hex”? Wouldn’t it be so much more useful if it was in ‘human’?
Steve cautioned
My intention was to point to the obvious case that any error handling code MUST be robust and pretty much guaranteed not to cause strange errors itself. Mind you, it’s probably our fault we got to this point in time without properly testing this error handling; after all, I think the less informed user/twiddler/developer is better placed for producing errors to test such things.
1 It’s probably a little strange that someone like me who merely dibbles on the edge of coding actually knows what SWI &2B represents. |
Rick Murray (539) 13841 posts |
I’ve been watching this discussion, and now it’s time to throw in my 2p. Essentially, it is this: Whoever made the change to RISC OS to flag it as wrong if bit 30 is set should have thought a little more carefully about the consequences.
>DIM x% 255
>x%!0 = 17
>$(x%+4) = "Oh noes!"+CHR$(0)
>SYS "OS_GenerateError", x%
Oh noes!
>x%!0 = 17 + (1 << 30)
>SYS "OS_GenerateError", x%
SWI &2B returned a bad error pointer
>
Maybe in the PRM index? I wrote this, called it WhatSWI and put it in my Library:
DIM x% 255
REPEAT
  INPUT "SWI number : "s$
  IF (s$ = "") THEN END
  s% = EVAL(s$)
  SYS "OS_SWINumberToString", s%, x%, 256 TO ,,l%
  x%?l% = 13
  REM &72657355 and &65735558 spell "User" and "XUse" as little-endian words
  IF ((x%!0 = &72657355) OR (x%!0 = &65735558)) THEN
    PRINT "This SWI is undefined."'
  ELSE
    PRINT "This is """+$x%+"""."'
  ENDIF
UNTIL FALSE
END |
edwardx (1628) 37 posts |
According to the wiki: … It’s not in the PRMs though, so I don’t know where this comes from. |
Jeffrey Lee (213) 6048 posts |
So in the very real situation where a program returns a bad error pointer, which then causes a crash in RTSupport’s ErrorV handler, you’re saying that a message of “Internal error: Abort on data transfer at &xxx” (where &xxx is inside RTSupport) is more useful than “SWI &xxx has returned a bad error pointer” (where &xxx almost certainly correctly identifies the buggy SWI)? And you’re saying that users would prefer to see “Internal error: Abort on data transfer” over the other error?
The aim of the changes is to catch bad error pointers which will cause the system to crash, because (a) crashes are bad, and (b) crashing while handling an error is likely to make it very hard to identify what the original error was. Primarily this works by sanity checking the error pointer, to make sure the pointer is safe. But as an extension to that it also aims to check that it points to a valid error block, since SWIs which attempt to pass off garbage memory as being a valid error block are just as useless as SWIs which attempt to pass off invalid pointers as being valid error blocks. It was not the aim of the change to hide valid error pointers with the “SWI &xxx has returned a bad error pointer” message.
Only because the kernel now checks for such nonsense and replaces the null pointer with something that will avoid a crash in completely unrelated code further down the line. In this case the error message isn’t particularly useful (OS_GenerateError isn’t the one that created the bad error pointer, it’s the code that called OS_GenerateError), but it’s still better than having an abort occur.
“Does the checking also ensure that the text message is printable characters only, terminated within a reasonable length? I’d have thought that was more useful than quibbling about a bit in a number that probably isn’t going to be shown to the user anyway.”
Yes, checking that the error message is valid would be sensible. But checking the error message for validity while avoiding false positives is much harder than checking the error number, and as this thread has shown, checking the error number for false positives isn’t straightforward either.
Yes, using a SWI name would be convenient. But I decided it would be better to go with the safe approach of just using the SWI number, rather than risk stack overflows / recursion / etc. from attempting to perform SWI name → number translation. |
Colin (478) 2433 posts |
It’s more than that. If I was an organisation that was developing a suite of programs for in-house use, I don’t necessarily want to inform the OS developer about anything I do. So I can set bit 30 for all error chunks that I allocate for my organisation, in the knowledge that ROOL-registered error chunks will always have bit 30 clear. But I need to be able to display the error. To me, internal always meant internal to the organisation, not to the program, and I can’t see how you can interpret it any other way that makes it useful. Together with SWI bits 18:19 = 3, it makes it possible to have an organisation-wide module suite. |
Jeffrey Lee (213) 6048 posts |
There’s no “user error” chunk
Looks like it was me who added that, most likely after I spotted that it was listed in CVS.
But if your interpretation is correct, why does the PRM describe the two (error blocks & SWI chunks) using different terminology? |
Colin (478) 2433 posts |
Because they were written by different people and/or at different times? It doesn’t actually say ‘internal to the program’. Using my interpretation causes no problems if you interpreted it the other way, but using your interpretation and stopping error reporting with bit 30 set causes problems for those who interpreted the phrase my way. So it’s sensible to allow errors to be shown with bit 30 set, as it’s much more useful, isn’t it? |
nemo (145) 2547 posts |
Steve historically proclaimed
And that, ladies and gentlemen, is my vote for the official RISC OS motto. Rick leaned out and waved his fist
I’ve always liked the witty criticism “it’s not even wrong”. This, however, is not even not even wrong. Jeffrey asked, hyperbolically
I’m sorry to quote that in its entirety, but every word I omitted made it less silly, so I thought I’d leave them all. I will make it clear what I’m saying by quoting it again:
Or to put it another way as I already did: The fact that the code can tell it’s an internal error means
Please don’t do the straw man thing. I have already said regarding bad pointers:
All APIs should check all their parameters for bad pointers. Now that’s clear, can we get back to the subject of throwing perfectly functional and informative error reports away because somebody didn’t say the magic word? You seem to think b30 is a Shibboleth. It is not. Colin confessed
No, that’s not what “internal error” means. It means something going wrong with the internals of a program, as opposed to an error being reported to it from outside the program, or the correct refusal to perform some action. e.g. IBM’s usage in this document
Here’s my final word on the matter. An error block contains two fields: a machine-readable one, and a human-readable one. The human-readable one can be guided by convention and style, but ultimately the developer (acting upon feedback from their users) is best placed to formulate the most useful terminology. The machine-readable one must be defined if it is to be useful.
An OS call can fail in multiple ways – OS_File might complain “File not found”, or “That’s a directory you fool”, but a program can only reasonably deal with those failures (if it wishes to deal with them differently) via the machine-readable part – the error number. Defined error ranges make it easier to construct that kind of error handling, allowing one to quickly tell the difference between an FS error (can’t save there for some reason) and a systemic one (your hair is on fire).
However, an internal error is one which no other system can deal with. By definition it is not generic, it is not generalised, it is not part of a defined API and it is not intended to be handled uniquely, in the way one might handle “Read only”. The whole point of b30 is to state “don’t bother reading anything into this error number, it only means something to me”. That is what an internal error is. That’s what b30 is for. Somebody has misunderstood. |
Steve Fryatt (216) 2105 posts |
Despite the apparent certainty about this now, I don’t seem to be able to find Nemo making any effort to offer constructive suggestions back in 2015 and 2016 when these changes were being discussed before implementation… |
Jeffrey Lee (213) 6048 posts |
Yes, we’re in agreement on that.
Sorry. I think I conflated a few different arguments and ended up poorly presenting a point I was trying to make.
And it is the lack of (robust, unambiguous) definition which is causing problems here. My interpretation of bit 30 is that if it is set, then it must not be passed outside of your program. Colin’s interpretation is that bit 30 can be set, but if it is set then you can only safely interpret the error number if you know who generated the error. nemo’s interpretation is that bit 30 can be set, but if it is set then the error number is intended more as a diagnostic to the programmer/user (like the error message) rather than being a machine-readable code. As I said in my first post:
Detecting unaligned pointers is easy. Detecting pointers which don’t point to an area of memory containing a 32 bit word followed by a null-terminated string (no longer than 252 bytes, including terminator) is also relatively easy. But there are many pointers which satisfy those requirements without actually being pointers to valid error blocks (where “a valid error block” is an error number and error message which has been deliberately written to memory, in the format dictated by PRM1, for passing across OS-level API boundaries, e.g. by returning from a SWI with V set and R0 as a pointer to the block). An invalid error block/pointer is useless, regardless of whether it’s something that will cause an abort on data transfer or not.
So with that in mind I thought it would be useful to have the kernel check to see if the error number is invalid. However this has proven to be much less straightforward, because not everything follows the rules stated in the PRMs, and we can’t even agree on what the rules stated in the PRMs actually are.
Outcome: The error number checks have now been disabled, but it’s too late for the changes to make it into 5.24. |
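For concreteness, here is a sketch of the shape test described above (an aligned, non-null pointer to a 32-bit word followed by a terminator within the 252-byte message limit); it assumes the pointer is already known to be safely readable, which is the part a BASIC program cannot establish on its own:

DEF FNplausible_error_block(ptr%)
LOCAL i%, ok%
IF ptr% = 0 OR (ptr% AND 3) <> 0 THEN =FALSE
ok% = FALSE
i% = 4
WHILE i% <= 255 AND NOT ok%
  IF ptr%?i% = 0 THEN ok% = TRUE
  i% += 1
ENDWHILE
=ok%

The word at offset 0 is the error number; as the outcome above notes, the kernel no longer polices its reserved bits.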