SWI &2B returned a bad error pointer

78 posts, 15 voices

Pages: 1 2 3 4

Apr 23, 2018 1:29pm François Vanzeveren (2221) 241 posts	Hello I am following the book “A Beginner’s Guide to Wimp Programming on RISC OS Computers (Second Edition)” from Martyn Fox. In chapter 14, there are a few procedures to intercept and handle non-fatal errors. One of them is `DEFPROCload LOCAL err%,flags% SYS "XOS_CLI","LOAD Shapefile "+STR$~list% TO err%;flags% Report flags%,!err%,$(err%+4) IF (flags% AND 1)<>0 THEN !err%=1<<30 Report !err%,$(err%+4) SYS "OS_GenerateError",err% ENDIF PROCforce_redraw(main%) ENDPROC` This generates the following error: SWI &2b returned a bad pointer at line 369 which refers to statement `SYS "OS_GenerateError",err%` I need help to understand why this piece of code, copied from the book, does not work while the err% reported by *Report looks like `!err%=1073741824,$(err%+4)="File 'Shapefile' not found[00] not found[00]ilingSystem' not found[00]....."` Thank you for your help François

Apr 23, 2018 1:37pm Jeffrey Lee (213) 6048 posts	Recently RISC OS 5 has started doing sanity checks on error pointers, to try and avoid crashes (or other bad behaviour) as a result of buggy code (typically assembler code which sets the V flag but fails to set R0 to the error pointer). In this case, I think it’s complaining because bit 30 of the error number is reserved and should be zero. https://www.riscosopen.org/wiki/documentation/show/SWI%20Error%20Block

Apr 23, 2018 1:54pm François Vanzeveren (2221) 241 posts	Hi Jeffrey Well spotted! So, the trick give by Martyn Fox in the book to distinguish between fatal and non-fatal errors is not allowed anymore… Have replaced this by a global variable fatal%. Thank you for your help. Regards François

Apr 23, 2018 1:59pm Colin (478) 2433 posts	According to the PRM1-43 bit 30 can be used by programmers to flag internal errors

Apr 23, 2018 2:03pm François Vanzeveren (2221) 241 posts	Hi Colin, yes, but setting this bit to 1 causes the error. Could it be the programmer cannot create his own error code? Regards

Apr 23, 2018 2:07pm Colin (478) 2433 posts	My point – poorly made – is that its a bug in the OS. Someone has read wiki instead of the PRM when coding the changes. It should be changed so that bit 30’s behaviour is unchanged.

Apr 23, 2018 2:25pm Martin Avison (27) 1494 posts	In the Wiki bit30 in the error number is ‘defined to be zero’ – are you saying that should now be ‘Reserved. Must be zero’? Which would prohibit the previously documented use in the PRMs. The error number validation only says in the Wiki it checks no reserved bits are set. Mind you, the Wiki page also says For compatibility with existing programs, XOS_GenerateError is exempt from SWI error pointer validity checks so should it also be exempt from error number validity checks?

Apr 23, 2018 2:36pm nemo (145) 2546 posts	XOS_GenerateError must do no checks whatsoever. OS_GenerateError is entitled to check the validity of the error ptr in R0, as it may not be valid. However if it finds any reserved bits set in the error number (including b30) then this merely indicates that it cannot interpret what the error number means… it does not mean the error should be obliterated by another. In other words, if any reserved bits are set (including b30) the error number should be treated as a basic error number. There seems to be some confusion on the part of the developer between what a programmer might want to see, and what a user definitely doesn’t need to see. “SWI &2b returned a bad pointer” is a chocolate teapot. It’s not correct or helpful. Does the checking also ensure that the text message is printable characters only, terminated within a reasonable length? I’d have thought that was more useful than quibbling about a bit in a number that probably isn’t going to be shown to the user anyway.

Apr 23, 2018 3:56pm Steve Pampling (1551) 8170 posts	“SWI &2b returned a bad pointer” is a chocolate teapot. It’s not correct or helpful. Well, saying that OS_GenerateError is the source of an error is a bit erm, “recursive” certainly but I’m not sure that’s totally unhelpful. I suppose it’s one of those things – you have to be very, very sure of how the error handling will behave and particularly that it won’t cause errors because any such errors will mask the original error that sent you into the error handler. More of a concrete cushion than chocolate fireguard perhaps?

Apr 23, 2018 7:23pm Jeffrey Lee (213) 6048 posts	According to the PRM1-43 bit 30 can be used by programmers to flag internal errors But as soon you call OS_GenerateError, surely it is no longer an internal error? You’re broadcasting the error to world + dog via ErrorV. Perhaps the behaviour should be changed, but raising objections when ROOL are on the cusp of making a stable release (if it’s not already too late) is not helpful. These changes originally went in over two years ago, after first being mooted on the forums several months prior. We have a code review forum and a CVS “recent changes” view for a reason. https://www.riscosopen.org/forum/forums/3/topics/3540 https://www.riscosopen.org/viewer/revisions/ For compatibility with existing programs, XOS_GenerateError is exempt from SWI error pointer validity checks so should it also be exempt from error number validity checks? Correct. XOS_GenerateError allows you to do anything. XMessageTrans_ErrorLookup will have the pointer checked (a bit redundant, unless MessageTrans is completely broken) but won’t check the error number. Everything else will be subject to both pointer + error checks (with exceptions for things like FileCore’s disc error handling).

Apr 24, 2018 7:54am Colin (478) 2433 posts	I have always interpreted the meaning of bit 30 in the error number to be the same as my interpretation of bits 18:19 = 3 in the swi number ie I can use any number with these bits set in house so I use them while developing. Are you saying that if I’m writing programs for my personal consumption I need an error chunk from ROOL to guarantee I won’t break 3rd party programs? Is my interpretation of bits 18:19 for the swi chunk also wrong? The mere fact that you defend the change makes me doubt my interpretations.

Apr 24, 2018 8:34am Chris Mahoney (1684) 2165 posts	The PRM says that “bit 30 is defined to be clear”, so my interpretation is that once you set it yourself, it’s no longer a valid error as far as the OS is concerned (and is therefore only valid in your own custom error handling code).

Apr 24, 2018 12:58pm Jeffrey Lee (213) 6048 posts	The PRM doesn’t really explain itself in any detail, which doesn’t help much. Bits 18-19 of the SWI number set to 3 are described as for “user applications”, which is generally interpreted as meaning you can pick any SWI chunk within that range for local development, and it’s your responsibility to make sure it doesn’t clash with SWI chunks used by any other software on your machine (which should be easy, as long as released software sticks to registered chunk numbers). Meanwhile bit 30 of the error number “is defined to be clear, and can therefore be used by programmers to flag internal errors”. I’d interpret this in a similar way to how Chris interprets it – those error numbers will be valid inside your code, but should never be passed outside of your code. If module A uses an internal error number to mean one thing, and module B uses that same internal error number to mean something else, and somehow an error from module A finds its way into module B (SWI, service call, etc.) then module B could get confused about what the problem is. Unless module B is expected to recognise that the error came from an external source, in which case the “internal error” flag is essentially useless since code which uses internal errors will have to store a flag outside of the error number/block in order to identify whether the error was self-generated or came from some external source (i.e. whether it’s an internal error or not!) However, I do agree that banning bit 30 isn’t terribly useful for developers who haven’t yet registered an error chunk and want to return errors from their code (e.g. if they’re developing two programs which will be talking to each other). There’s no “user error” chunk, so all it’ll do is force them to pick some other arbitrary error chunk, with no greater chance of safety than if they were to use internal error numbers.

Apr 24, 2018 1:04pm nemo (145) 2546 posts	There is a difference between what a developer might want reported, and what a user might want reported. If a program emits an error saying “You have forgotten to insert your woggle” why would a user want that to be reported as “SWI &2b returned a bad pointer”. Can someone explain why the user would want that? There’s a good argument for Developer Mode which issues warnings (though preferably through SysLog) about deprecated APIs (of which I suspect this is one – I don’t remember Arthur having this restriction, but I could be wrong), unaligned pointers (which should have been – a repeated sin in RO4), etc. That would be useful. However, the user is not a QA engineer. They will find the real error report more useful. Jeffrey lamented raising objections when ROOL are on the cusp of making a stable release (if it’s not already too late) is not helpful It’s not convenient. But as soon you call OS_GenerateError, surely it is no longer an internal error? And if you call OS_GenerateError with R0=0 is it no longer a null ptr? The fact that the code can tell it’s an internal error means It shouldn’t try to interpret any other bits in the error number It doesn’t therefore have to complain about it Steve cautioned you have to be very, very sure…that it won’t cause errors So the error has to be errored because the error might cause an error in things that ignore b30 in error? Chris interpreted The PRM says that “bit 30 is defined to be clear” It says `Bit 30 is defined to be clear, and can therefore be used by programmers to flag internal errors.` This is the source of Jeffrey’s confusion when he said as soon you call OS_GenerateError, surely it is no longer an internal error? Yes of course it is. It is an error flagged in the body of the application, not returned to it by the OS. That is what in internal error is. My error handlers prefix “Internal error:" to that kind of thing (though I haven’t used that bit myself) before passing it on. The OS defines that it will not return an error number with that bit set, but it also defines that the bit can be used by the programmer. To claim that it cannot be passed to OS_GenerateError (or Wimp_ReportError, by extension) is an interpretation of “use” that I don’t understand. If b31 is set, it’s a fatal error with a restricted range of reason codes. If b30 is set, it’s an internal error. Treat it as a basic error. If restricted bits are set, there’s no way of interpreting it as anything but a basic error. Otherwise, error ranges can be interpreted.

Apr 24, 2018 1:09pm nemo (145) 2546 posts	And? Does the checking also ensure that the text message is printable characters only, terminated within a reasonable length? I’d have thought that was more useful than quibbling about a bit in a number that probably isn’t going to be shown to the user anyway. Wouldn’t that be the sensible thing to check, if checking is to be done?

Apr 24, 2018 1:31pm Steve Pampling (1551) 8170 posts	If a program emits an error saying “You have forgotten to insert your woggle” why would a user want that to be reported as “SWI &2b returned a bad pointer”. Why would a developer want something as standard as a known SWI to be identified as &“something in hex” ? Wouldn’t it be so much more useful if was in ‘human’ ? Steve cautioned you have to be very, very sure…that it won’t cause errors So the error has to be errored because the error might cause an error in things that ignore b30 in error? My intention was to point to the obvious case that any error handling code MUST be robust and pretty much guaranteed not to cause strange errors itself. In the original case quoted when presented with dodgy input it proceeded to generate an error that cryptically¹ said “I’ve done something wrong” and not bother reporting the original user error thus proving the error handler can’t handle all possible user input. Mind you it’s probably our fault we got to this point in time without properly testing this error handling, after all I think the less informed user/twiddler/developer is better placed for producing errors to test such things. ¹ It’s probably a little strange that someone like me that merely dibbles on the edge of coding actually knows what SWI &2B represents. I’m not sure we have a easily accessible lookup table for SWI numbers in the documentation

Apr 24, 2018 2:35pm Rick Murray (539) 13840 posts	I’ve been watching this discussion, and now it’s time to throw in my 2p. Essentially, it is this: Whoever made the change to RISC OS to flag it as wrong if bit 30 is set should have thought a little more carefully about the consequences. For a start, the error returned is entirely wrong. “SWI &2b returned a bad error pointer”. No, the pointer is not wrong. A bit in the error number is wrong. This message is misleading. >DIM x% 255 >x%!0 = 17 >$(x%+4) = "Oh noes!"+CHR$(0) >SYS "OS_GenerateError", x% Oh noes! >x%!0 = 17 + (1 << 30) >SYS "OS_GenerateError", x% SWI &2B returned a bad error pointer > The very obvious question – if bit 30 can be used for something to flag an internal error, then it is perfectly valid to ban its propagation through other things and other places, however the two places that must accept it is Wimp_ReportError and OS_GenerateError. Why? Because if not, at which point does the program report this error to the user? To throw away an error message to replace it with a different message should only happen in two very specific circumstances: The Wimp when it ‘prettifies’ the messages (and you can still see the original) If something catastrophic happens that takes precedence. For example, an unhandled exception happening within the exception handler. That sort of thing. And, as Steve suggests, it isn’t too much to have the system look up the value of the SWI and output it as text instead of hex. I’m not sure we have a easily accessible lookup table for SWI numbers in the documentation Maybe in the PRM index? I wrote this, called it WhatSWI and put it in my Library: DIM x% 255 REPEAT INPUT "SWI number : "s$ IF (s$ = "") THEN END s%=EVAL(s$) SYS "OS_SWINumberToString", s%, x%, 256 TO ,,l% x%?l% = 13 IF ((x%!0 = &72657355) OR (x%!0 = &65735558)) THEN PRINT "This SWI is undefined."' ELSE PRINT "This is """+$x%+"""."' ENDIF UNTIL FALSE END

Apr 24, 2018 2:43pm edwardx (1628) 37 posts	There’s no “user error” chunk According to the wiki: “&7FFF00-&7FFFFF User programs – not for public distribution” It’s not in the PRMs though, so I don’t know where this comes from.

Apr 24, 2018 2:58pm Jeffrey Lee (213) 6048 posts	However, the user is not a QA engineer. They will find the real error report more useful. So in the very real situation where a program returns a bad error pointer, which then causes a crash in RTSupport’s ErrorV handler, you’re saying that a message of “Internal error: Abort on data transfer at &xxx” (where &xxx is inside RTSupport) is more useful than “SWI &xxx has returned a bad error pointer” (where &xxx almost certainly correctly identifies the buggy SWI)? And you’re saying that user’s would prefer to see “Internal error: Abort on data transfer” over the other error? The aim of the changes is to catch bad error pointers which will cause the system to crash, because (a) crashes are bad, and (b) crashing while handling an error is likely to make it very hard to identify what the original error was. Primarily this works by sanity checking the error pointer, to make sure the pointer is safe. But as an extension to that it also aims to check that it points to a valid error block, since SWIs which attempt to pass off garbage memory as being a valid error block are just as useless as SWIs which attempt to pass of invalid pointers as being valid error blocks. It was not the aim of the change to hide valid error pointers with the “SWI &xxx has returned a bad error pointer” message. And if you call OS_GenerateError with R0=0 is it no longer a null ptr? Only because the kernel now checks for such nonsense and replaces the null pointer with something that will avoid a crash in completely unrelated code further down the line. In this case the error message isn’t particularly useful (OS_GenerateError isn’t the one that created the bad error pointer, it’s the code that called OS_GenerateError), but it’s still better than having an abort occur. Does the checking also ensure that the text message is printable characters only, terminated within a reasonable length? I’d have thought that was more useful than quibbling about a bit in a number that probably isn’t going to be shown to the user anyway. Wouldn’t that be the sensible thing to check, if checking is to be done? Yes, checking that the error message is valid would be sensible. But checking the error message for validity while avoiding false-positives is much harder than checking the error number, and as this thread has shown checking the error number for false-positives isn’t straightforward either. Why would a developer want something as standard as a known SWI to be identified as &“something in hex” ? Wouldn’t it be so much more useful if was in ‘human’ ? Yes, using a SWI name would be convenient. But I decided it would be better to go with the safe approach of just using the SWI number, rather than risk stack overflows / recursion / etc. from attempting to perform SWI name → number translation.

Apr 24, 2018 4:56pm Colin (478) 2433 posts	I do agree that banning bit 30 isn’t terribly useful for developers who haven’t yet registered an error chunk. It’s more than that. If I was an organisation that was developing a suite of programs for use in house I don’t necessarily want to inform the os developer about anything I do. So I can set bit 30 for all error chunks that I allocate for my organisation in the knowledge that ROOL registered error chunks will always have bit 30 clear. But I need to be able to display the error. To me internal always meant internal to the organisation not to the program. And I can’t see how you can interpret it any other way that makes it useful. Together with swi bits 18:19 = 3 it makes it possible to have an organisation wide module suite.

Apr 24, 2018 5:26pm Jeffrey Lee (213) 6048 posts	There’s no “user error” chunk According to the wiki: “&7FFF00-&7FFFFF User programs – not for public distribution” It’s not in the PRMs though, so I don’t know where this comes from. Looks like it was me who added that, most likely after I spotted that it was listed in CVS. https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Programmer/HdrSrc/hdr/NewErrors?rev=4.34;content-type=text%2Fx-cvsweb-markup#l1339 To me internal always meant internal to the organisation not to the program. But if your interpretation is correct, why does the PRM describe the two (error blocks & SWI chunks) using different terminology?

Apr 24, 2018 5:59pm Colin (478) 2433 posts	why does the PRM describe the two (error blocks & SWI chunks) using different terminology? Because they were written by different people and or at different times? It doesn’t actually say ‘internal to the program’. Using my interpretation causes no problems if you interpreted it the other way, but using your interpretation and stopping error reporting with bit 30 set causes problems for those who interpreted the phrase my way. So it’s sensible to allow errors to be shown with bit 30 set as it’s much more useful isnt it?

Apr 24, 2018 9:04pm nemo (145) 2546 posts	Steve historically proclaimed it’s probably our fault we got to this point in time without properly testing this And that, ladies and gentlemen, is my vote for the official RISC OS motto. Rick leaned out and waved his fist For a start, the error returned is entirely wrong I’ve always liked the witty criticism “it’s not even wrong”. This, however, is not even not even wrong. Jeffrey asked, hyperbolically So in the very real situation where a program returns a bad error pointer, which then causes a crash in RTSupport’s ErrorV handler, you’re saying that a message of “Internal error: Abort on data transfer at &xxx” (where &xxx is inside RTSupport) is more useful than “SWI &xxx has returned a bad error pointer” (where &xxx almost certainly correctly identifies the buggy SWI)? I’m sorry to quote that in its entirety, but every word I omitted made it less silly, so I thought I’d leave them all. I will make it clear what I’m saying by quoting it again: If a program emits an error saying “You have forgotten to insert your woggle” why would a user want that to be reported as “SWI &2b returned a bad pointer”. Or to put it another way as I already did: The fact that the code can tell it’s an internal error means 1 It shouldn’t try to interpret any other bits in the error number 2 It doesn’t therefore have to complain about it Please don’t do the straw man thing. I have already said regarding bad pointers: OS_GenerateError is entitled to check the validity of the error ptr in R0, as it may not be valid. All APIs should check all their parameters for bad pointers. Now that’s clear, can we get back to the subject of throwing perfectly functional and informative error reports away because somebody didn’t say the magic word? You seem to think b30 is a Shibboleth. It is not. Colin confessed To me internal always meant internal to the organisation not to the program No, that’s not what “internal error” means. It means something going wrong with the internals of a program, as opposed to an error being reported to it from outside the program, or the correct refusal to perform some action. eg, IBM’s usage in this document Here’s my final word on the matter. An error block contains two fields, a machine readable one, and a human readable one. The human readable one can be guided by convention and style, but ultimately the developer (acting upon feedback from their users) is best placed to formulate the most useful terminology. The machine readable one must be defined if it is to be useful. An OS call can fail in multiple ways – OS_File might complain “File not found”, or “That’s a directory you fool”, but a program can only reasonably deal with those failures (it it wishes to deal with them differently) via the machine-readable part – the error number. Defined error ranges make it easier to construct that kind of error handling, allowing one to quickly tell the difference between an FS error (can’t save there for some reason) and a systemic one (your hair is on fire). However, an internal error is one which no other system can deal with. By definition it is not generic, it is not generalised, it is not part of a defined API and it is not intended to be handled uniquely, in the way one might handle “Read only”. The whole point of b30 is to state don’t bother reading anything into this error number, it only means something to me. That is what an internal error is. That’s what b30 is for. Somebody has misunderstood.

Apr 24, 2018 11:01pm Steve Fryatt (216) 2105 posts	This, however, is not even not even wrong. Despite the apparent certainty about this now, I don’t seem to be able to find Nemo making any effort to offer constructive suggestions back in 2015 and 2016 when these changes were being discussed before implementation…

Apr 24, 2018 11:15pm Jeffrey Lee (213) 6048 posts	So it’s sensible to allow errors to be shown with bit 30 set as it’s much more useful isnt it? Yes, we’re in agreement on that. Please don’t do the straw man thing Sorry. I think I conflated a few different arguments and ended up poorly presenting a point I was trying to make. An error block contains two fields, a machine readable one, and a human readable one. […] The machine readable one must be defined if it is to be useful. And it is the lack of (robust, unambiguous) definition which is causing problems here. My interpretation of bit 30 is that if it is set, then it it must not be passed outside of your program. Colin’s interpretation is that bit 30 can be set, but if it is set then you can only safely interpret the error number if you know who generated the error. nemo’s interpretation is that bit 30 can be set, but if it is set then the error number is intended more as a diagnostic to the programmer/user (like the error message) rather than being a machine-readable code. As I said in my first post: Recently RISC OS 5 has started doing sanity checks on error pointers, to try and avoid crashes (or other bad behaviour) as a result of buggy code (typically assembler code which sets the V flag but fails to set R0 to the error pointer). Detecting unaligned pointers is easy. Detecting pointers which don’t point to an area of memory containing a 32 bit word followed by a null-terminated string (no longer than 252 bytes, including terminator) is also relatively easy. But there are many pointers which satisfy those requirements without actually being pointers to valid error blocks (where “a valid error block” is an error number and error message which has been deliberately written to memory, in the format dictated by PRM1, for passing across OS-level API boundaries, e.g. by returning from a SWI with V set and R0 as a pointer to the block). An invalid error block/pointer is useless, regardless of whether it’s something that will cause an abort on data transfer or not. So with that in mind I thought it would be useful to have the kernel check to see if the error number is invalid. However this has proven to be much less straightforward, because not everything follows the rules stated in the PRMs, and we can’t even agree on what the rules stated in the PRMs actually are. Outcome: The error number checks have now been disabled, but it’s too late for the changes to make it into 5.24.