SWI &2B returned a bad error pointer
Rick Murray (539) 13841 posts |
The only reason this issue arrived at all is because a listing in a book disobeyed the advice of the PRM and did Quite often the dusty corners and edge cases aren’t spotted until somebody tries using them and it goes weird. Some examples of my own: localtime() with a different timezone, that must have been broken forever. OS_BreakPt which is probably the least used SWI in the OS, and probing different IIC buses (which originally went spectacularly wrong). It’s an operating system. It’s complicated. Not every issue will be found first time and the significance of a change is not always fully evident in a first discussion. |
Steve Pampling (1551) 8170 posts |
I think we already knew that one (Brexit), just wondering how soon the slow learners will figure it out. On topic – yes it took a while for someone to do the trigger action. The result of the trigger wasn’t what it should have been and now the change in the OS (nightly build) means the same code should give the correct response and point out the programming mistake. If it doesn’t there’s still a problem. |
Colin (478) 2433 posts |
I’m quite happy to be shown that my interpretation is wrong but I’d like to know what is best practice. I gather from this discussion that b30 is for use by a program but should not be returned by the program. From edwardx’s post I gather I have use of errorblock 0x7fff00 for prototyping my programs. Presumably I could use errorblock 0x7fff00 internally in my program as long as I never return an error from that block from my program? If so what problem does setting b30 solve? If I leak an error with b30 set what does that say? If as Nemo states
Then how does the default error handler or Wimp_ReportError for example read into it that the human part should be displayed? But then again it can’t read into it that it shouldn’t be displayed either. I ‘think’ Jeffrey wants that reported as an error but Nemo wants the human part displayed. We may as well clear up other points of confusion I have around errors. If 0x7fff00 is only for private use do I need a registered error block in any public application where I use Wimp_ReportError, OS_GenerateError, ERROR in BASIC or only if I’m returning errors? When a C program returns I usually return EXIT_SUCCESS or EXIT_FAILURE; should the EXIT_FAILURE really be a registered RISC OS error number? I don’t think anyone does this but it has always bothered me. In C you get situations where you don’t get an os_error returned, e.g. opening a file or allocating memory, where you just get a NULL pointer if it fails. Should I convert that failure to a newly allocated number or can I reuse a number, e.g. 0x101, ‘No room in RMA’? I was just wondering if it is possible to write a C module without registering an error block. |
David Pitt (3386) 1248 posts |
This is what the original author said on the matter.
Making Errors Non-Fatal
So how do we distinguish between a fatal and non-fatal error? The answer is RISC OS error numbers are allocated rather like filetypes, in that all the lower We will use error number &40000000, which we can also write as 1<<30, for
  DEF PROCload
  SYS "XOS_CLI","LOAD Shapefile "+STR$~list% TO err%;flags%
  IF (flags% AND 1)<>0 !err%=1<<30:SYS "OS_GenerateError",err%
  PROCforce_redraw(main%)
  ENDPROC
  :
  DEF PROCsave
  n%=FNend+4
  SYS "XOS_CLI","SAVE Shapefile "+STR$~list%+" "+STR$~n% TO err%;flags%
  IF (flags% AND 1)<>0 !err%=1<<30:SYS "OS_GenerateError",err%
  SYS "XOS_CLI","SETTYPE Shapefile &012" TO err%;flags%
  IF (flags% AND 1)<>0 !err%=1<<30:SYS "OS_GenerateError",err%
  ENDPROC
  :
  DEF FNerror
  !b%=ERR
  CASE !b% OF
  WHEN 1<<30:err_str$="":box%=3
  OTHERWISE:err_str$=" at line "+STR$ ERL:box%=2
  ENDCASE
  $(b%+4)=REPORT$+err_str$+CHR$0
  SYS "Wimp_ReportError",b%,box%,"Shapes" TO ,response%
  =(response%=2)
  : |
Colin (478) 2433 posts |
Ok. So in your example I don’t see an advantage to b30 over rethrowing the error using 0x7fff00, though I suppose you could preserve the error number if you just set that bit. Should you actually be modifying the returned error block to rethrow it? I didn’t know you could do that. If you use ERROR 1 << 31, “…” instead of OS_GenerateError then the external error handler isn’t called if the error is trapped by ON ERROR, so you could I suppose at that point clear the bit before displaying the error. So it would be possible to use it with Jeffrey’s new error handler – though that’s no help to existing programs. |
Chris Evans (457) 1614 posts |
CVS change last night: |
Jeffrey Lee (213) 6048 posts |
I’d assume that the rules for 0x7fff00 are roughly the same as those for the “user” SWI range, and the “personal undistributed” filetypes, etc. I.e. you can use them however you wish, but it’s the user’s responsibility to deal with any issues that may arise from allocation clashes. And to make things easier for users, authors shouldn’t (openly?) distribute software which uses those values.
My interpretation of bit 30 is that it should be used to flag errors that never leave your program. E.g. if you have a C function which returns a _kernel_oserror *, and that function is capable of returning errors which are relevant to the user (where “user” is whoever/whatever has invoked your program), along with internal errors (which your program can deal with before e.g. retrying the operation) then bit 30 can be used as a way to differentiate between the two. The rest of the error number / error block could be used for other purposes (e.g. storing more detail about the error so that the recovery can be performed). If an error with bit 30 set finds its way to the outer boundary of your program, you could have a utility function which replaces the error number with a regular error number and writes an appropriate message (“internal error in foo: wibble has gone pop with code 1234”) Whether that interpretation is correct (or safe), is still unknown.
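To make that interpretation concrete, here is a minimal C sketch of such a boundary utility. It is only an illustration of the idea described above, not code from the kernel or from any library: `app_error` is a hypothetical stand-in for `_kernel_oserror` (so the sketch compiles without RISC OS headers), and `ERR_APP_INTERNAL` is a placeholder number taken from the private 0x7fff00 block, where a distributed program would use a registered allocation.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical stand-in for _kernel_oserror: error number, then message. */
typedef struct {
    uint32_t errnum;
    char errmess[252];
} app_error;

#define ERR_INTERNAL_FLAG (UINT32_C(1) << 30)  /* bit 30: internal to the program */
#define ERR_APP_INTERNAL  UINT32_C(0x7FFF00)   /* placeholder public error number */

/* True if the error is flagged as internal (bit 30 set). */
static int error_is_internal(const app_error *e)
{
    return e != NULL && (e->errnum & ERR_INTERNAL_FLAG) != 0;
}

/* Call at the outer boundary of the program. Ordinary errors pass through
 * untouched; internal ones are rewritten into a regular error block that
 * keeps the original number and text as diagnostic detail. */
static const app_error *error_at_boundary(const app_error *e)
{
    static app_error public_error;

    if (!error_is_internal(e)) {
        return e;
    }
    public_error.errnum = ERR_APP_INTERNAL;
    snprintf(public_error.errmess, sizeof public_error.errmess,
             "Internal error &%08lX: %.200s",
             (unsigned long) e->errnum, e->errmess);
    return &public_error;
}
```

Anything that returns an error to the caller would be wrapped as `return error_at_boundary(err);`, so a bit 30 error can only escape in its rewritten form.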
If I tell you that I’m pointing at a picture of a cat, but I’m actually pointing at a picture of a dog, what does that say? Is my pointing wrong? (i.e. I’ve returned a bad pointer and attempted to pass it off as a valid error pointer) When faced with the above, the OS (used to) generalise and say “this is not a valid error pointer”. Because in many cases there’s no way for it to give a more specific message. If as Nemo states Nemo said not to read anything into the (low 29 bits) of the error number; he didn’t say anything about not displaying the human part of the error block (i.e. the message text).
Correct (because under my definition it’s not a valid error block – it could just be a pointer to ~256 random bytes). The example I gave earlier was a null pointer causing an abort inside RTSupport – but it probably should have been an example of a null pointer which returns the famous “oflaoflaofla” message due to page zero being present. Would a user prefer to see “oflaoflaofla”, or would they prefer to see “SWI &xxx has returned a bad error pointer”? (And this problem isn’t exclusive to null pointers, since there are many pointers which point to garbage error blocks)
If you’re only returning errors which you’ve received from elsewhere (or are able to re-use generic error numbers), then no. If you’ve got some other error to return (“You have forgotten to insert your woggle”) then yes you should, otherwise a program which knows how to insert a woggle (computer controlled robot arm?) will have no sensible way of detecting that error and inserting the woggle on the user’s behalf.
Return codes from programs are not required to be error numbers. If you wanted to you could return a RISC OS error number, but it’s unlikely you’d find anything which automatically interprets the return code in that way. The convention is that 0 is success and any non-zero value is some other condition (usually failure), but the presence of things like Sys$RCLimit means that there’s no hard boundary between “success” and “failure” as far as the system is concerned. (This is also true on other OS’s) If you want to return an error block from a C program, you may have to invoke OS_Exit directly.
In many cases _kernel_last_oserror() will return the underlying RISC OS error that caused the C standard library function to fail. The value is cleared when read, so you could go with the generic approach of:
  _kernel_last_oserror(); // Throw away any stored error
  buffer = malloc(123); // Call arbitrary library function
  if (!buffer)
  {
    // Decide whether to use OS error, or a custom one
    _kernel_oserror *oserr = _kernel_last_oserror();
    if (oserr)
    {
      return oserr;
    }
    return new_error("something bad");
  }
In most cases you probably could just use a standard OS error and nobody would get upset, as long as it’s most likely accurate. The above code could also be extended to check errno (and IIRC UnixLib has an errno value which means ‘check _kernel_last_oserror’) |
nemo (145) 2547 posts |
Gadammit I said I wasn’t going to say any more, but
If it is a pointer to “random bytes” then your b30 interpretation will fire 50% of the time. There’s literally no way for that to be less reliable. |
Jeffrey Lee (213) 6048 posts |
Please show your working. |
Steve Pampling (1551) 8170 posts |
Oooh, [ sits in comfy seat, grabs popcorn and beer… ] |
Rick Murray (539) 13841 posts |
Please show your working. My mind is on other stuff, but for what it’s worth I threw this together:
  s% = 2 * 1024 * 1024
  y% = 0 : n% = 0
  b% = &20000000
  FOR l% = 0 TO s% STEP 4
    x% = b%!l%
    IF x% AND (1<<30) THEN y%=y%+1 ELSE n%=n%+1
  NEXT
  PRINT y%, n%
It whizzes through 2MB of the RMA, counting how many words do and do not have bit 30 set. On my machine, that’s 88,457 for DO and 435,832 for NOT. I’ll let somebody else work out the percentages, suffice to say that there’s zero chance of having one bit in a random selection of words being set equating to a 50% possibility. It’s not a binary choice like heads or tails. We cannot say it is 1-in-32 (because there are 32 possible bits in the word) as the values will be skewed by the data. In this case, since we’re looking at the RMA, bit 30 is in the ARM condition flags, so will be set for the following: MI, PL, VS, VC, GT, LE, AL, and NV (obsolete). Note well the AL code. That may explain why the occurrence in the RMA is around one in six.
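For anyone who wants the percentages: 88,457 out of 88,457 + 435,832 = 524,289 words is roughly 16.9%, or about one word in 5.9. The same count can be sketched in C over an arbitrary buffer. This is an illustrative equivalent, not a port that peeks at the RMA; the function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many 32-bit words in a buffer have bit 30 set. */
static size_t count_bit30_set(const uint32_t *words, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (words[i] & (UINT32_C(1) << 30)) {
            hits++;
        }
    }
    return hits;
}

/* Express a count as a percentage of a total (0 if the total is 0). */
static double percent_of(size_t part, size_t total)
{
    return total ? 100.0 * (double) part / (double) total : 0.0;
}
```

Calling `percent_of(88457, 88457 + 435832)` reproduces the figure quoted for the machine above.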
That you don’t know what a cat looks like?
Invalid comparison. You don’t know the entire error structure is wrong because of one bit. That would be like trying to determine whether the animal is a cat or a dog by looking at a photograph of the tip of the tail…
Conflated issue. We can easily discard the Chant Of The God Of Ofla because the error pointer will be a NULL pointer (which may or may not be a valid address). If the address is bogus, then “a bad error pointer” is a completely justifiable response. If a part of the error number is considered bogus, then it’s an illogical and unconstructive message as any programmer receiving it may wonder why it’s a bad pointer (well, it’s what the error says). How to tell if an error block is valid? It is a word of essentially unknowable contents (the bits are defined, but with various caveats and exceptions), followed by a sequence of printable characters terminated by a NULL, the entire structure being no more than 256 bytes and word aligned (as specified in PRM 1-42). Should probably accept BASIC’s termination as well… ;-) |
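As a rough illustration of that description, a structural check might look like the C sketch below. It is an assumption-laden sketch, not the kernel’s actual test: it takes “printable” to mean any byte of 32 or above other than 127, accepts only a NUL terminator (not BASIC’s), deliberately ignores the error number word, and assumes the memory being read has already been checked to be readable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ERROR_BLOCK_MAX 256u  /* whole block: number word plus message */

/* Structural sanity check for an error block: word aligned, and a message
 * of printable characters that is NUL terminated within 256 bytes. */
static bool error_block_looks_valid(const void *block)
{
    const unsigned char *p = block;

    if (p == NULL || ((uintptr_t) p & 3u) != 0) {
        return false;                 /* null or not word aligned */
    }
    for (size_t i = 4; i < ERROR_BLOCK_MAX; i++) {
        unsigned char c = p[i];
        if (c == '\0') {
            return true;              /* terminated in time */
        }
        if (c < 32 || c == 127) {
            return false;             /* control character in the message */
        }
    }
    return false;                     /* no terminator within 256 bytes */
}
```

Note that a check like this says nothing about bit 30 either way, which is the point being argued: the error number is left alone and only the shape of the block is judged.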
Steve Pampling (1551) 8170 posts |
Fascinating, however statistically speaking the probability of a bit (which can have only one of two values, 0 or 1) being 1 is 50% and in a truly random set of samples one value may present more for any number of samples but, statistically, eventually the other value will match the number of appearances.
Funnily enough it is precisely a binary choice (0 or 1) unless you can make bit 30 assume a different value that I haven’t mentioned. All the other bits in the word are irrelevant in the case study.
Um. Things ain’t always a good match to the label. Most people would state that a cat does a miaow and a dog a bark (wuff or whatever). No matter how often we told him, and ourselves, that he was a cat Moet never seemed to accept the label and I still have doubts. BTW.
Turned out not to be since I was at work and my to do list, on which I just ticked off two of four, now has seven items. |
Rick Murray (539) 13841 posts |
The thing I learned with probability is that if it looks obvious and simple, you’re doing it wrong. Consider the Monty Hall Problem.
Dogs, actually being a good example. Cats more or less look a lot like cats (even those that deny their cat nature), but dogs are a freaky mixture of quadrupeds. There’s probably more in common with a cat and a poodle than with a poodle and a labrador… |
nemo (145) 2547 posts |
I previously stated the obvious truism
Anyone doubting that should consider whether they thought I had written
But I didn’t write that. They are two different things. I’d not expect a programmer to get the two confused though. The first has a known probability: 50%. The second is rather dependent on what you are doing with your memory, but let’s take a snapshot and find out. There’s zero page, then the application space for one of the tasks, then various dynamic areas, and then the ROM. Let’s save all that out and see what we have. I’m running 4.39, so here’s mine:
Now if we have a random pointer into that lot, that has already been checked to be aligned and in valid memory, then we can check to see what proportion of those words have ‘the offending’ b30 set. If the contents were random bytes, the probability would be exactly 50%. For the test to be useful, the probability would have to be higher than 50%. Considerably higher, I would suggest. Let’s count the actual occurrences, the total number of words, and hence calculate the actual probability of a random validated pointer pointing to a word with b30 set… are you ready? It’s 9.66%. There’s a less than 10% chance that the ‘b30 check’ will identify a random pointer (as opposed to “a pointer to random bytes”, which is what Jeffrey wrote and what I responded to). There is a greater than 90% chance that checking b30 will utterly fail to spot a ‘random pointer’ (which, to be fair, is not what Jeffrey wrote. Perhaps it is what he meant. It is difficult to tell.) I have no data as to what proportion of error pointers are ‘random’… but I’m going to guess it’s a very rare occurrence. So we’re talking less than 10% of a very rare event. Consequently, if the error number has b30 set, there is a far greater than 90% chance that it is completely deliberate, knowing nothing about any documentation or convention. However, given that the documentation says that b30 “can be used by the programmer”, I think the probability that b30 is accidental is so vanishingly tiny it can be completely disregarded. I really am going to say nothing further on this matter. I think it’s settled. |
Rick Murray (539) 13841 posts |
That seems remarkably low for the case where an instruction with no specific conditional code (hence AL) would have the bit set.
Ah, but the error pointer is not the same as the bit 30 check (which is in the first word of what the pointer points to).
Ah, but yet again the sample data is essentially non random. I looked in the RMA (and got 1 in 6), and you looked at a random sample (and got 1 in 10). If we look at error numbers then the odds should be longer than one in hundreds of thousands, if not more (as the PRM says the bit should be zero). That said, faulting it does seem excessive – for if bit 30 is assigned a purpose at some time in the future, it would go wrong on earlier versions that fault bit 30. But, then, the check has been disabled so I guess this discussion is just going to go around in circles now… ;-) |
Jeffrey Lee (213) 6048 posts |
nemo seems to have completely failed to grasp the meaning behind my words. Is anyone else having similar difficulties, or is it just him? |
Chris (121) 472 posts |
I don’t really understand any of the stuff linking the b30 check and the random pointer issue. As I saw it, the (recently partially revised) code performs two separate checks: (a) that the pointer in R0 is valid, and (b) that the error number is in a valid format. The disagreement, if I’ve got it right, seems to be over whether it’s clear just what a valid format is for the error number (specifically, whether b30 can be set or not), and on the basis of that whether we ought to be faulting non-conforming error numbers. The majority view seems to be that checking the error number for b30 being set is too zealous. If it’s re-instated, though, would it help to have two different error messages, to reflect the two different checks? After all, in the OP’s example, there wasn’t actually anything wrong with the pointer. |
Colin (478) 2433 posts |
I find this a problem with most discussions I enter into here. I’ve come to the conclusion that I’m absolute rubbish at explaining stuff. |
Colin (478) 2433 posts |
I think Jeffrey disagrees because the bad error number makes it a bad pointer. The pointer is pointing to something as all pointers do but you can’t recognise it as an error pointer if b30 is set. So it is a bad pointer. |
Chris (121) 472 posts |
Ah – get it! (See – you’re great at explaining stuff :) ) |
Jeffrey Lee (213) 6048 posts |
Yes, both Chris and Colin are correct. I’m not against using two or more different error messages as Chris suggests, but have (mostly) been trying to make sure that people are aware of the fact that upon seeing a pointer to an invalid error block, the kernel can’t say for sure whether it’s the pointer that’s wrong or the memory being pointed to that’s wrong. On the subject of my “Correct (because under my definition it’s not a valid error block – it could just be a pointer to ~256 random bytes)” comment: As far as I can tell, nemo’s become confused because he’s failed to associate my comment with the wider context of my other comments. The background to my statement is as follows:
Now, assume that f(x) always returns true if bit 30 of the error number is set. If we were to take an implementation of g(x) that currently ignores bit 30 (e.g. the simple version that always returns true), and introduce the same bit 30 check that f(x) performs, then this would make g(x) a more reliable approximation of f(x). (Proof available on request). Since g(x) has become more reliable as a result of introducing the bit 30 check, it proves that nemo’s statement that “There’s literally no way for that to be less reliable” is incorrect. We can easily make g(x) less reliable by removing the check we just added. |
Colin (478) 2433 posts |
The conclusion I come to is that I’ll never use b30 again – I can’t think I ever had a use for an invalid error. Unfortunately I find myself agreeing with Jeffrey that it should be faulted for the same reason that zero page access is faulted. |
Andrew Conroy (370) 740 posts |
It might be worth noting that Martyn’s book is supplied as part of the RC16 disc image, so it probably should now come with an addendum to say that the code contained therein is no longer correct! |
nemo (145) 2547 posts |
Rick said
You are assuming that the memory of a RISC OS computer is mostly full of ARM code. It isn’t. It’s mostly full of data.
The PRMs do not say the bit should be zero, they say the exact opposite:
Errors are returned to the program from X-SWIs. The PRM specifically defines b30 to be used by the programmer for flagging internal errors… but flagging to what? To the error handler, of course. By definition that requires going through OS_GenerateError. The bits in the error number, whatever construction is in use, are nothing to do with OS_GenerateError. They are none of its business. OS_GenerateError does only two things: It issues a Service Call and then calls the Error Vector. It does not interpret error numbers. The Error Handler does, but GenerateError is a simple bit of plumbing. (Typically, the default handler for ErrorV will copy the error block and then call the error handler, if there aren’t any pending callbacks) I really can’t stress this strongly enough… the PRMs say b30 is to be used for flagging internal errors, and for your error handler to get that error, you call OS_GenerateError. eg:
  ONERRORPROCerr
  ERROR&40000001,"Woggle"
  DEFPROCerr
  A$=REPORT$:IFERRAND(1<<30):A$="Internal error: "+A$
  SYS"Wimp_ReportError",A$
  IFERRAND(1<<30):END
  ENDPROC
The PRMs are extremely clear in the distinction between b30 – “can therefore be used by programmers”, and b29 – “should be cleared for compatibility with any future extensions”. Anyone arguing that b30 “should be clear” is plain wrong. Read it again. |
nemo (145) 2547 posts |
Chris asked
The theory was expounded that b30 must be zero in a valid error number. This is wrong. According to this theory, finding b30 set would indicate that the error pointer was not pointing at an error block, but at “random bytes”. If a pointer was pointing at “random bytes”, then b30 would be set 50% of the time, and clear 50% of the time. In reality, the memory of the computer isn’t filled with perfectly random bytes, so it turns out that a random pointer would be pointing at something with b30 set less than 10% of the time. IF it were wrong… which we’ve no particular reason to expect. So given that
the former test for b30 on entry to OS_GenerateError was bogus. It is now gone, all is well. |