RISC OS Open: Forum: Catching bad error pointers

Aug 8, 2015 11:28am

Jeffrey Lee (213) 6048 posts

I’m trying to work out what the best solution would be for having the kernel catch bad error pointers – e.g. when a SWI performs its operation correctly but accidentally returns with V set (programmer didn’t realise a CMP could have resulted in V being set), or when a SWI fails to pass on an error from another SWI (e.g. accidentally restores R0 to the original value instead of preserving the error pointer). At the moment these aren’t particularly easy to track down – either the error will go unnoticed or you’ll get a crash in a completely different area of code when it tries to dereference the error pointer.

I’ve implemented a basic system that checks if R0 is < 256 (or < &4000 if zero page is relocated) and generates a “Bad error pointer returned by SWI xxx” message instead. The address range check is simple, but it would certainly be enough to catch one case I saw a couple of days ago (R0 was &C7 or thereabouts – presumably a reason code passed to a SWI). But I’m wondering if there’s a better approach.

The three approaches I can think of are:

Do a simple address range check. This is fast (only two instructions added to the error return code in the SWI dispatcher, or four if we wanted to detect unaligned pointers too), but could still allow lots of bad pointers through.
Do a full address range check (i.e. OS_Memory 24). This should catch almost all the cases where dereferencing the pointer would lead to a crash, but may have an unacceptable performance impact (I know the kernel caches error blocks for commonly-issued errors); how far are we willing to go to make debugging easier?
A slightly off-the-wall idea: Have the kernel dereference the error pointer. If it’s fine, there’ll be minimal performance impact (just one LDR). If it’s bad then you’ll get a data abort… but if the pointer had gone unchecked you would have aborted anyway, and the register dump will contain the SWI number. Potentially the kernel could even detect that the abort came from the error checking code and swap out the standard data abort message with a “Bad error pointer” message. ZeroPain might also need some tweaking so that it won’t attempt to emulate the zero page access.

What are people’s thoughts?

Of course one problem with the above ideas is that they only catch illegal pointers, not pointers to valid memory which doesn’t contain an error block. So maybe (e.g. as part of option 3?) we could have the kernel at least check that the error number looks sensible – we know that bits 24-30 should be zero, if the pointer is pointing to random code/data then there’s a good chance those bits will be non-zero.

Oct 24, 2015 1:48am

Jon Abbott (1421) 2651 posts

There’s also the potential issue caused by unaligned error pointers, as returned by FileCore_DiscOp for example.

Jan 12, 2016 1:42pm

Jeffrey Lee (213) 6048 posts

I’ve realised that (if taken on their own) options 2 and 3 aren’t going to be that useful for systems which aren’t using high processor vectors, or if something is providing emulated zero page. We need to include an explicit check for null pointers.

So maybe a reasonable check would be:

CMP r0, #&4000
BLO bad_error_pointer
TST r0, #3
LDREQ temp, [r0]
TSTEQ temp, #&7f<<24
BNE bad_error_pointer

That will deal with “null” pointers, misaligned pointers, pointers to invalid addresses (kernel will crash – but the crash will occur at a point which is easier to debug compared to where I’ve seen the OS crash in the past), and some cases of pointers to things which aren’t actually error blocks.

Jan 12, 2016 5:49pm

Martin Avison (27) 1494 posts

CMP r0, #&4000

I thought the page size was 4k, so should this be
CMP r0, #&1000
or am I confused again?!

Jan 12, 2016 6:22pm

Jeffrey Lee (213) 6048 posts

Yes, the page size is 4K, but using zero page relocation will relocate 16K of workspace.

Also it doesn’t look like there are any errors blocks stored in there, so I shouldn’t run into any problems with using the &4000 cutoff on both high & low zero page versions.

Jan 12, 2016 6:30pm

Martin Avison (27) 1494 posts

So should any checks for potential ZeroPain problems be checking for references to the first 16K or just the 4k size of the first page?

Jan 12, 2016 7:00pm

Jeffrey Lee (213) 6048 posts

It all depends on what you’re checking, and why.

If you’ve got some crusty old code which is full of zero page accesses then adding some debug code which checks against 16K would be the way to go, since that’s the amount which is being moved away from &0.

But for well-written code you should only really need to check against NULL (i.e. 0), because you should be able to assume that any pointer you’re using is either (a) valid or (b) NULL.

Dealing with input variables (e.g. parameters passed into your SWI calls if you’re writing a module) is generally where the more complicated checks need to go – if a programmer gives your SWI a bad pointer then he’d usually appreciate a useful error message rather than a crash. For this case you can either use a simple range check or a call to something like OS_ValidateAddress – generally it depends on whether you’re happy to take the performance hit of the SWI call or not. For a quick check for whether an external pointer is safe I’d say that checking against 16K is good enough (or even 32K, if you know you won’t be supplied a pointer to something held in scratch space). But be aware that there are some structures held in kernel workspace which the kernel will return pointers to – e.g. the OS_ChangedBox structure. So if you think it’s possible you’ve been given a pointer to kernel workspace then 256 is a better check value, as that’s the boundary where the processor vectors end and kernel workspace starts.

Jan 12, 2016 9:01pm

Martin Avison (27) 1494 posts

Thanks for the clarification. My current interest is in trying to trap the occasional access that would cause ZeroPain, to help debugging them.

Back to yor original posts: I agree it would be a good idea to trap invalid error blocks in the OS – some time ago I had to add code to Reporter to avoid problems when trying to Report error blocks which were invalid, which suprised me at the time as being neccesary.

Jun 8, 2016 10:01am

Jon Abbott (1421) 2651 posts

Interestingly, the solution that’s been implemented breaks Burn ’Out on RO5.

SWI XBurnOut_LoadPacked,“< archive >”,0,“< file >” returns with V set and R0=-1 if the file isn’t found in the archive. With the new Bad Pointer detection, R0 is obviously being changed to a string pointer reporting the SWI returned a bad error pointer.

I’ve changed the Module, so it doesn’t set V on exit, but it makes you wonder how many other private Modules might be exiting with V set on purpose.

Oct 19, 2016 9:44pm

Jon Abbott (1421) 2651 posts

This is breaking FileCore_DiscOp which returns an error number instead of error pointer.

Could an exclusion be added for this SWI please?

There’s probably other disc operations that return error numbers, possibly all FileSystems. eg FileCore_SectorDiscOp

Oct 20, 2016 1:06pm

Jeffrey Lee (213) 6048 posts

Curses! I knew FileCore used its own error reporting scheme but hadn’t twigged it was being used with SWIs.

There’s probably other disc operations that return error numbers, possibly all FileSystems.

Yeah, it looks like this will affect all (FileCore-based) filing systems. ADFS_DiscOp is just a wrapper around FileCore_DiscOp, so any non-standard error which FileCore returns will also be returned by ADFS.

I guess the best option for now is to just disable the error validation when the X form of a SWI is used. Then in the future maybe we can find a better way of doing things (some kind of whitelist system so modules can indicate they use a non-standard error scheme?)

Oct 20, 2016 2:11pm

Jon Abbott (1421) 2651 posts

Excluding the relevant FileCore SWI’s might be sufficient as I believe they’re translated to a message with a valid error pointer by the filesystem, before it’s passed back to the caller.

Having said that, the documentation for ADFS_IDEUserOp says it returns an error number.

Oct 20, 2016 4:42pm

Jeffrey Lee (213) 6048 posts

Yeah, ADFS doesn’t seem to do any error translation. The SWI handler consists many routines that just call the corresponding FileCore SWI and then return. Possibly this is a bug, but it wouldn’t surprise me if it was by design.

Oct 20, 2016 6:48pm

Jon Abbott (1421) 2651 posts

The SWI handler consists many routines that just call the corresponding FileCore SWI and then return.

They’re stub entries into FileCore, FileCore then calls the relevant filesystem DiscOp/MiscOp entry point.

I’ve just double checked PRM2, which states:

Your module has to return errors through FileCore as follows:
The V flag must be set, and R0 is used to indicate the error:

If bit 30 of R0 is set then, after clearing bit 30 of R0, it is a pointer to an error block.

If bit 31 of R0 is set and bit 30 is clear, then R0 is a disc error:
bits 0 – 20 are the disc byte address / 256
bits 21 – 23 are the drive number
bits 24 – 29 are the disc error number

Else bits 30 – 31 are clear, and R0 is an error number:
bits 0 – 7 are an error number (see list below)
bits 8 – 29 are clear
In the latter two cases FileCore will generate a suitable error block.

I’m beginning to wonder if this is actually a bug elsewhere, as FileCore should be translating the error before it exits.

EDIT: FileCore should translate the error by calling FindErrBlock

EDIT2: The error scheme changed in FileCore 4.6, so this is possibly my own code at fault.

EDIT3: I’ve checked my code and it does use the new error scheme if the filesystem is using them, I’ve also confirmed that it’s passing a valid error number back to FileCore, so the issue is possibly in FileCore itself.

Oct 20, 2016 8:10pm

Jon Abbott (1421) 2651 posts

What build did this go live? I want to see what FileCore is passing back without this trapping it.

Oct 20, 2016 8:14pm

Jeffrey Lee (213) 6048 posts

April 5th is when the code was committed to CVS.

Oct 20, 2016 8:43pm

Jon Abbott (1421) 2651 posts

I don’t have 5th/6th April to test, however…

19th March works as expected – the error address and block look valid.
24th April – “SWI &40540 returned an invalid error block”

Doesn’t look like FileCore changed during that period.

Oct 20, 2016 10:11pm

Jeffrey Lee (213) 6048 posts

What’s the error pointer and the error number? Those are the two things the kernel checks (pointer must be word aligned and >= &4000, number must have bits 24-30 clear)

Oct 21, 2016 8:08am

Jon Abbott (1421) 2651 posts

number must have bits 24-30 clear

That will be the issue, the error number is &220108C7 in this case. You might want to consider dropping that check as it’s going to catch some FileCore errors.

Oct 21, 2016 8:41am

Jeffrey Lee (213) 6048 posts

It checks those bits because they’re meant to be reserved :) I guess I’ll have to dig through the code a bit and see if I can work out where it’s coming from.

Oct 21, 2016 11:29am

Jon Abbott (1421) 2651 posts

I guess I’ll have to dig through the code a bit and see if I can work out where it’s coming from.

See my post above, with the paste from PRM2. The error number details the disc address the error occurred at and is converted into a textual error in FindErrBlock

I believe the only fix is to remove the error number check from the error pointer validation, as any upstream SWI could be passing the error back up the chain.

Oct 21, 2016 12:35pm

Jeffrey Lee (213) 6048 posts

Yeah, but filesystem errors are meant to be of the form &1XXYY where XX is the filesystem number and YY is the error code. FindErrBlock is meant to translate the “internal” error number to a valid filesystem error block/number (and it looks like it gets it half right – &C7 = disc error, FS number 8 = ADFS) but it has corrupted the upper bits.

https://www.riscosopen.org/wiki/documentation/show/Error%20Generators

Oct 21, 2016 8:48pm

Jon Abbott (1421) 2651 posts

The textual version of it is “Disc error 22 at :0/00034800”, which explains the &22. I’ve tested RO6 and all the way back to RO3.11, they all have that same error number, so it’s been like this since day 1.

Certainly looks like a bug in FileCore going by the Error Generators description of what it should be returning.

Oct 22, 2016 6:44pm

Jon Abbott (1421) 2651 posts

Is it line 909/910 in FindErrBlock that’s setting bits 30..24 of the error number?

Oct 26, 2016 8:06pm

Jeffrey Lee (213) 6048 posts

Yep – well spotted.

Some closer examination of the code suggests that for most errors FileCore sticks to the documented &0001XXYY format, but for disc errors it uses &ZZ01XXC7 (ZZ = disc error code, &C7 = “disc error”), so it’s been pretty easy to add an extra rule to the kernel to allow that format of error number through. So with tomorrow’s ROM everything should be back to normal.

https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Kernel/s/Kernel.diff?r1=4.18&r2=4.19

Catching bad error pointers

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options

Aug 8, 2015 11:28am Jeffrey Lee (213) 6048 posts	I’m trying to work out what the best solution would be for having the kernel catch bad error pointers – e.g. when a SWI performs its operation correctly but accidentally returns with V set (programmer didn’t realise a CMP could have resulted in V being set), or when a SWI fails to pass on an error from another SWI (e.g. accidentally restores R0 to the original value instead of preserving the error pointer). At the moment these aren’t particularly easy to track down – either the error will go unnoticed or you’ll get a crash in a completely different area of code when it tries to dereference the error pointer. I’ve implemented a basic system that checks if R0 is < 256 (or < &4000 if zero page is relocated) and generates a “Bad error pointer returned by SWI xxx” message instead. The address range check is simple, but it would certainly be enough to catch one case I saw a couple of days ago (R0 was &C7 or thereabouts – presumably a reason code passed to a SWI). But I’m wondering if there’s a better approach. The three approaches I can think of are: Do a simple address range check. This is fast (only two instructions added to the error return code in the SWI dispatcher, or four if we wanted to detect unaligned pointers too), but could still allow lots of bad pointers through. Do a full address range check (i.e. OS_Memory 24). This should catch almost all the cases where dereferencing the pointer would lead to a crash, but may have an unacceptable performance impact (I know the kernel caches error blocks for commonly-issued errors); how far are we willing to go to make debugging easier? A slightly off-the-wall idea: Have the kernel dereference the error pointer. If it’s fine, there’ll be minimal performance impact (just one LDR). If it’s bad then you’ll get a data abort… but if the pointer had gone unchecked you would have aborted anyway, and the register dump will contain the SWI number. Potentially the kernel could even detect that the abort came from the error checking code and swap out the standard data abort message with a “Bad error pointer” message. ZeroPain might also need some tweaking so that it won’t attempt to emulate the zero page access. What are people’s thoughts? Of course one problem with the above ideas is that they only catch illegal pointers, not pointers to valid memory which doesn’t contain an error block. So maybe (e.g. as part of option 3?) we could have the kernel at least check that the error number looks sensible – we know that bits 24-30 should be zero, if the pointer is pointing to random code/data then there’s a good chance those bits will be non-zero.

Oct 24, 2015 1:48am Jon Abbott (1421) 2651 posts	There’s also the potential issue caused by unaligned error pointers, as returned by FileCore_DiscOp for example.

Jan 12, 2016 1:42pm Jeffrey Lee (213) 6048 posts	I’ve realised that (if taken on their own) options 2 and 3 aren’t going to be that useful for systems which aren’t using high processor vectors, or if something is providing emulated zero page. We need to include an explicit check for null pointers. So maybe a reasonable check would be: CMP r0, #&4000 BLO bad_error_pointer TST r0, #3 LDREQ temp, [r0] TSTEQ temp, #&7f<<24 BNE bad_error_pointer That will deal with “null” pointers, misaligned pointers, pointers to invalid addresses (kernel will crash – but the crash will occur at a point which is easier to debug compared to where I’ve seen the OS crash in the past), and some cases of pointers to things which aren’t actually error blocks.

Jan 12, 2016 5:49pm Martin Avison (27) 1494 posts	CMP r0, #&4000 I thought the page size was 4k, so should this be CMP r0, #&1000 or am I confused again?!

Jan 12, 2016 6:22pm Jeffrey Lee (213) 6048 posts	Yes, the page size is 4K, but using zero page relocation will relocate 16K of workspace. Also it doesn’t look like there are any errors blocks stored in there, so I shouldn’t run into any problems with using the &4000 cutoff on both high & low zero page versions.

Jan 12, 2016 6:30pm Martin Avison (27) 1494 posts	So should any checks for potential ZeroPain problems be checking for references to the first 16K or just the 4k size of the first page?

Jan 12, 2016 7:00pm Jeffrey Lee (213) 6048 posts	It all depends on what you’re checking, and why. If you’ve got some crusty old code which is full of zero page accesses then adding some debug code which checks against 16K would be the way to go, since that’s the amount which is being moved away from &0. But for well-written code you should only really need to check against NULL (i.e. 0), because you should be able to assume that any pointer you’re using is either (a) valid or (b) NULL. Dealing with input variables (e.g. parameters passed into your SWI calls if you’re writing a module) is generally where the more complicated checks need to go – if a programmer gives your SWI a bad pointer then he’d usually appreciate a useful error message rather than a crash. For this case you can either use a simple range check or a call to something like OS_ValidateAddress – generally it depends on whether you’re happy to take the performance hit of the SWI call or not. For a quick check for whether an external pointer is safe I’d say that checking against 16K is good enough (or even 32K, if you know you won’t be supplied a pointer to something held in scratch space). But be aware that there are some structures held in kernel workspace which the kernel will return pointers to – e.g. the OS_ChangedBox structure. So if you think it’s possible you’ve been given a pointer to kernel workspace then 256 is a better check value, as that’s the boundary where the processor vectors end and kernel workspace starts.

Jan 12, 2016 9:01pm Martin Avison (27) 1494 posts	Thanks for the clarification. My current interest is in trying to trap the occasional access that would cause ZeroPain, to help debugging them. Back to yor original posts: I agree it would be a good idea to trap invalid error blocks in the OS – some time ago I had to add code to Reporter to avoid problems when trying to Report error blocks which were invalid, which suprised me at the time as being neccesary.

Jun 8, 2016 10:01am Jon Abbott (1421) 2651 posts	Interestingly, the solution that’s been implemented breaks Burn ’Out on RO5. SWI XBurnOut_LoadPacked,“< archive >”,0,“< file >” returns with V set and R0=-1 if the file isn’t found in the archive. With the new Bad Pointer detection, R0 is obviously being changed to a string pointer reporting the SWI returned a bad error pointer. I’ve changed the Module, so it doesn’t set V on exit, but it makes you wonder how many other private Modules might be exiting with V set on purpose.

Oct 19, 2016 9:44pm Jon Abbott (1421) 2651 posts	This is breaking FileCore_DiscOp which returns an error number instead of error pointer. Could an exclusion be added for this SWI please? There’s probably other disc operations that return error numbers, possibly all FileSystems. eg FileCore_SectorDiscOp

Oct 20, 2016 1:06pm Jeffrey Lee (213) 6048 posts	Curses! I knew FileCore used its own error reporting scheme but hadn’t twigged it was being used with SWIs. There’s probably other disc operations that return error numbers, possibly all FileSystems. Yeah, it looks like this will affect all (FileCore-based) filing systems. ADFS_DiscOp is just a wrapper around FileCore_DiscOp, so any non-standard error which FileCore returns will also be returned by ADFS. I guess the best option for now is to just disable the error validation when the X form of a SWI is used. Then in the future maybe we can find a better way of doing things (some kind of whitelist system so modules can indicate they use a non-standard error scheme?)

Oct 20, 2016 2:11pm Jon Abbott (1421) 2651 posts	Excluding the relevant FileCore SWI’s might be sufficient as I believe they’re translated to a message with a valid error pointer by the filesystem, before it’s passed back to the caller. Having said that, the documentation for ADFS_IDEUserOp says it returns an error number.

Oct 20, 2016 4:42pm Jeffrey Lee (213) 6048 posts	Yeah, ADFS doesn’t seem to do any error translation. The SWI handler consists many routines that just call the corresponding FileCore SWI and then return. Possibly this is a bug, but it wouldn’t surprise me if it was by design.

Oct 20, 2016 6:48pm Jon Abbott (1421) 2651 posts	The SWI handler consists many routines that just call the corresponding FileCore SWI and then return. They’re stub entries into FileCore, FileCore then calls the relevant filesystem DiscOp/MiscOp entry point. I’ve just double checked PRM2, which states: Your module has to return errors through FileCore as follows: The V flag must be set, and R0 is used to indicate the error: If bit 30 of R0 is set then, after clearing bit 30 of R0, it is a pointer to an error block. If bit 31 of R0 is set and bit 30 is clear, then R0 is a disc error: bits 0 – 20 are the disc byte address / 256 bits 21 – 23 are the drive number bits 24 – 29 are the disc error number Else bits 30 – 31 are clear, and R0 is an error number: bits 0 – 7 are an error number (see list below) bits 8 – 29 are clear In the latter two cases FileCore will generate a suitable error block. I’m beginning to wonder if this is actually a bug elsewhere, as FileCore should be translating the error before it exits. EDIT: FileCore should translate the error by calling FindErrBlock EDIT2: The error scheme changed in FileCore 4.6, so this is possibly my own code at fault. EDIT3: I’ve checked my code and it does use the new error scheme if the filesystem is using them, I’ve also confirmed that it’s passing a valid error number back to FileCore, so the issue is possibly in FileCore itself.

Oct 20, 2016 8:10pm Jon Abbott (1421) 2651 posts	What build did this go live? I want to see what FileCore is passing back without this trapping it.

Oct 20, 2016 8:14pm Jeffrey Lee (213) 6048 posts	April 5th is when the code was committed to CVS.

Oct 20, 2016 8:43pm Jon Abbott (1421) 2651 posts	I don’t have 5th/6th April to test, however… 19th March works as expected – the error address and block look valid. 24th April – “SWI &40540 returned an invalid error block” Doesn’t look like FileCore changed during that period.

Oct 20, 2016 10:11pm Jeffrey Lee (213) 6048 posts	What’s the error pointer and the error number? Those are the two things the kernel checks (pointer must be word aligned and >= &4000, number must have bits 24-30 clear)

Oct 21, 2016 8:08am Jon Abbott (1421) 2651 posts	number must have bits 24-30 clear That will be the issue, the error number is &220108C7 in this case. You might want to consider dropping that check as it’s going to catch some FileCore errors.

Oct 21, 2016 8:41am Jeffrey Lee (213) 6048 posts	It checks those bits because they’re meant to be reserved :) I guess I’ll have to dig through the code a bit and see if I can work out where it’s coming from.

Oct 21, 2016 11:29am Jon Abbott (1421) 2651 posts	I guess I’ll have to dig through the code a bit and see if I can work out where it’s coming from. See my post above, with the paste from PRM2. The error number details the disc address the error occurred at and is converted into a textual error in FindErrBlock I believe the only fix is to remove the error number check from the error pointer validation, as any upstream SWI could be passing the error back up the chain.

Oct 21, 2016 12:35pm Jeffrey Lee (213) 6048 posts	Yeah, but filesystem errors are meant to be of the form &1XXYY where XX is the filesystem number and YY is the error code. FindErrBlock is meant to translate the “internal” error number to a valid filesystem error block/number (and it looks like it gets it half right – &C7 = disc error, FS number 8 = ADFS) but it has corrupted the upper bits. https://www.riscosopen.org/wiki/documentation/show/Error%20Generators

Oct 21, 2016 8:48pm Jon Abbott (1421) 2651 posts	The textual version of it is “Disc error 22 at :0/00034800”, which explains the &22. I’ve tested RO6 and all the way back to RO3.11, they all have that same error number, so it’s been like this since day 1. Certainly looks like a bug in FileCore going by the Error Generators description of what it should be returning.

Oct 22, 2016 6:44pm Jon Abbott (1421) 2651 posts	Is it line 909/910 in FindErrBlock that’s setting bits 30..24 of the error number?

Oct 26, 2016 8:06pm Jeffrey Lee (213) 6048 posts	Yep – well spotted. Some closer examination of the code suggests that for most errors FileCore sticks to the documented &0001XXYY format, but for disc errors it uses &ZZ01XXC7 (ZZ = disc error code, &C7 = “disc error”), so it’s been pretty easy to add an extra rule to the kernel to allow that format of error number through. So with tomorrow’s ROM everything should be back to normal. https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Kernel/s/Kernel.diff?r1=4.18&r2=4.19