Catching bad error pointers
Pages: 1 2
Jeffrey Lee (213) 6048 posts |
I’m trying to work out what the best solution would be for having the kernel catch bad error pointers – e.g. when a SWI performs its operation correctly but accidentally returns with V set (programmer didn’t realise a CMP could have resulted in V being set), or when a SWI fails to pass on an error from another SWI (e.g. accidentally restores R0 to the original value instead of preserving the error pointer). At the moment these aren’t particularly easy to track down – either the error will go unnoticed or you’ll get a crash in a completely different area of code when it tries to dereference the error pointer. I’ve implemented a basic system that checks if R0 is < 256 (or < &4000 if zero page is relocated) and generates a “Bad error pointer returned by SWI xxx” message instead. The address range check is simple, but it would certainly be enough to catch one case I saw a couple of days ago (R0 was &C7 or thereabouts – presumably a reason code passed to a SWI). But I’m wondering if there’s a better approach. The three approaches I can think of are:
What are people’s thoughts? Of course one problem with the above ideas is that they only catch illegal pointers, not pointers to valid memory which doesn’t contain an error block. So maybe (e.g. as part of option 3?) we could have the kernel at least check that the error number looks sensible – we know that bits 24-30 should be zero, if the pointer is pointing to random code/data then there’s a good chance those bits will be non-zero. |
Jon Abbott (1421) 2651 posts |
There’s also the potential issue caused by unaligned error pointers, as returned by FileCore_DiscOp for example. |
Jeffrey Lee (213) 6048 posts |
I’ve realised that (if taken on their own) options 2 and 3 aren’t going to be that useful for systems which aren’t using high processor vectors, or if something is providing emulated zero page. We need to include an explicit check for null pointers. So maybe a reasonable check would be: CMP r0, #&4000 BLO bad_error_pointer TST r0, #3 LDREQ temp, [r0] TSTEQ temp, #&7f<<24 BNE bad_error_pointer That will deal with “null” pointers, misaligned pointers, pointers to invalid addresses (kernel will crash – but the crash will occur at a point which is easier to debug compared to where I’ve seen the OS crash in the past), and some cases of pointers to things which aren’t actually error blocks. |
Martin Avison (27) 1494 posts |
I thought the page size was 4k, so should this be |
Jeffrey Lee (213) 6048 posts |
Yes, the page size is 4K, but using zero page relocation will relocate 16K of workspace. Also it doesn’t look like there are any errors blocks stored in there, so I shouldn’t run into any problems with using the &4000 cutoff on both high & low zero page versions. |
Martin Avison (27) 1494 posts |
So should any checks for potential ZeroPain problems be checking for references to the first 16K or just the 4k size of the first page? |
Jeffrey Lee (213) 6048 posts |
It all depends on what you’re checking, and why. If you’ve got some crusty old code which is full of zero page accesses then adding some debug code which checks against 16K would be the way to go, since that’s the amount which is being moved away from &0. But for well-written code you should only really need to check against NULL (i.e. 0), because you should be able to assume that any pointer you’re using is either (a) valid or (b) NULL. Dealing with input variables (e.g. parameters passed into your SWI calls if you’re writing a module) is generally where the more complicated checks need to go – if a programmer gives your SWI a bad pointer then he’d usually appreciate a useful error message rather than a crash. For this case you can either use a simple range check or a call to something like OS_ValidateAddress – generally it depends on whether you’re happy to take the performance hit of the SWI call or not. For a quick check for whether an external pointer is safe I’d say that checking against 16K is good enough (or even 32K, if you know you won’t be supplied a pointer to something held in scratch space). But be aware that there are some structures held in kernel workspace which the kernel will return pointers to – e.g. the OS_ChangedBox structure. So if you think it’s possible you’ve been given a pointer to kernel workspace then 256 is a better check value, as that’s the boundary where the processor vectors end and kernel workspace starts. |
Martin Avison (27) 1494 posts |
Thanks for the clarification. My current interest is in trying to trap the occasional access that would cause ZeroPain, to help debugging them. Back to yor original posts: I agree it would be a good idea to trap invalid error blocks in the OS – some time ago I had to add code to Reporter to avoid problems when trying to Report error blocks which were invalid, which suprised me at the time as being neccesary. |
Jon Abbott (1421) 2651 posts |
Interestingly, the solution that’s been implemented breaks Burn ’Out on RO5. SWI XBurnOut_LoadPacked,“< archive >”,0,“< file >” returns with V set and R0=-1 if the file isn’t found in the archive. With the new Bad Pointer detection, R0 is obviously being changed to a string pointer reporting the SWI returned a bad error pointer. I’ve changed the Module, so it doesn’t set V on exit, but it makes you wonder how many other private Modules might be exiting with V set on purpose. |
Jon Abbott (1421) 2651 posts |
This is breaking FileCore_DiscOp which returns an error number instead of error pointer. Could an exclusion be added for this SWI please? There’s probably other disc operations that return error numbers, possibly all FileSystems. eg FileCore_SectorDiscOp |
Jeffrey Lee (213) 6048 posts |
Curses! I knew FileCore used its own error reporting scheme but hadn’t twigged it was being used with SWIs.
Yeah, it looks like this will affect all (FileCore-based) filing systems. ADFS_DiscOp is just a wrapper around FileCore_DiscOp, so any non-standard error which FileCore returns will also be returned by ADFS. I guess the best option for now is to just disable the error validation when the X form of a SWI is used. Then in the future maybe we can find a better way of doing things (some kind of whitelist system so modules can indicate they use a non-standard error scheme?) |
Jon Abbott (1421) 2651 posts |
Excluding the relevant FileCore SWI’s might be sufficient as I believe they’re translated to a message with a valid error pointer by the filesystem, before it’s passed back to the caller. Having said that, the documentation for ADFS_IDEUserOp says it returns an error number. |
Jeffrey Lee (213) 6048 posts |
Yeah, ADFS doesn’t seem to do any error translation. The SWI handler consists many routines that just call the corresponding FileCore SWI and then return. Possibly this is a bug, but it wouldn’t surprise me if it was by design. |
Jon Abbott (1421) 2651 posts |
They’re stub entries into FileCore, FileCore then calls the relevant filesystem DiscOp/MiscOp entry point. I’ve just double checked PRM2, which states: Your module has to return errors through FileCore as follows: I’m beginning to wonder if this is actually a bug elsewhere, as FileCore should be translating the error before it exits. EDIT: FileCore should translate the error by calling FindErrBlock EDIT2: The error scheme changed in FileCore 4.6, so this is possibly my own code at fault. EDIT3: I’ve checked my code and it does use the new error scheme if the filesystem is using them, I’ve also confirmed that it’s passing a valid error number back to FileCore, so the issue is possibly in FileCore itself. |
Jon Abbott (1421) 2651 posts |
What build did this go live? I want to see what FileCore is passing back without this trapping it. |
Jeffrey Lee (213) 6048 posts |
April 5th is when the code was committed to CVS. |
Jon Abbott (1421) 2651 posts |
I don’t have 5th/6th April to test, however… 19th March works as expected – the error address and block look valid. Doesn’t look like FileCore changed during that period. |
Jeffrey Lee (213) 6048 posts |
What’s the error pointer and the error number? Those are the two things the kernel checks (pointer must be word aligned and >= &4000, number must have bits 24-30 clear) |
Jon Abbott (1421) 2651 posts |
That will be the issue, the error number is &220108C7 in this case. You might want to consider dropping that check as it’s going to catch some FileCore errors. |
Jeffrey Lee (213) 6048 posts |
It checks those bits because they’re meant to be reserved :) I guess I’ll have to dig through the code a bit and see if I can work out where it’s coming from. |
Jon Abbott (1421) 2651 posts |
See my post above, with the paste from PRM2. The error number details the disc address the error occurred at and is converted into a textual error in FindErrBlock I believe the only fix is to remove the error number check from the error pointer validation, as any upstream SWI could be passing the error back up the chain. |
Jeffrey Lee (213) 6048 posts |
Yeah, but filesystem errors are meant to be of the form &1XXYY where XX is the filesystem number and YY is the error code. FindErrBlock is meant to translate the “internal” error number to a valid filesystem error block/number (and it looks like it gets it half right – &C7 = disc error, FS number 8 = ADFS) but it has corrupted the upper bits. https://www.riscosopen.org/wiki/documentation/show/Error%20Generators |
Jon Abbott (1421) 2651 posts |
The textual version of it is “Disc error 22 at :0/00034800”, which explains the &22. I’ve tested RO6 and all the way back to RO3.11, they all have that same error number, so it’s been like this since day 1. Certainly looks like a bug in FileCore going by the Error Generators description of what it should be returning. |
Jon Abbott (1421) 2651 posts |
Is it line 909/910 in FindErrBlock that’s setting bits 30..24 of the error number? |
Jeffrey Lee (213) 6048 posts |
Yep – well spotted. Some closer examination of the code suggests that for most errors FileCore sticks to the documented &0001XXYY format, but for disc errors it uses &ZZ01XXC7 (ZZ = disc error code, &C7 = “disc error”), so it’s been pretty easy to add an extra rule to the kernel to allow that format of error number through. So with tomorrow’s ROM everything should be back to normal. https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Kernel/s/Kernel.diff?r1=4.18&r2=4.19 |
Pages: 1 2