Zero page protection
Jeffrey Lee (213) 6048 posts |
Should be fixed now. I’m not sure exactly when I last ran a build with zero page relocation, so don’t be surprised if a bug or two have crept into the ROM modules. |
Jon Abbott (1421) 2651 posts |
Having spent the best part of the past six months looking into this issue and coding specific support for it in the next release of ADFFS, I can probably add some detail into the problem you’re facing, all be it from a legacy game standpoint. Although some of the issues are game specific, the bulk aren’t and could affect any software legacy or new. Of the 50+ games ADFFS now supports on the Pi, I’ve detailed the page zero access issues in this post on the JASPP forum. In summary, the bulk of the issues stem from C code failing to initialise pointers before use and a bug that was probably introduced into the RISCOS 3.0+ sound stack, where the GateOn entry doesn’t seem to be called prior to Fill, resulting in the SCCB not being initialized by the GateOn code. These two cover around 95% of all page zero access, the remaining 5% are down to coding mistakes. To resolve these issues, I’ve used a combination of fixing the original code, coding an Abort handler to handle writes and an pre-interpreter to handle reads – the later would be better implemented by relocating page zero and handling reads via an Abort as currently every LDR/LDM has to be executed too and then interpreted by the JIT prior to copying/re-interpretation. I intend to revisit the read handler once page zero relocation is implemented and recode it via Aborts. Addressing the key points detailed in posts above:
In my experience, the bulk of the problem is writes and from the code I’ve looked at to date, the majority of the code is under IRQ or CallEvery’s and may trigger repeatedly. This may result in log overload if you’re not careful.
To cover the bulk of the page zero access, any load/store instruction that is used by a C compiler will need handling, otherwise it’s going to be ineffective against modern reincarnations of issue. However, it’s probably fair to assume that ARMv5+ specific instructions could be left to Abort, as it’s likely the programmer or source is available to fix the problem. StrongARM support does add complications to the instruction interpreter and considering the number of 32bit StrongARM code about could possibly be dropped, leaving the bulk of the issues coming from Iyonix era code. One thing you need to be aware of, if StrongARM is considered, is that it may break under RPCEmu as it doesn’t implement the MMU correctly, some Abort types aren’t handled Although we’re discussing Page Zero here, consideration should also be given to the scratch space at 4000-8000. This has been read only for a while however, so it may no longer be an issue. Of the 50+ games I’ve looked at, only one that I’m aware of (Jahangir Khan World Championship Squash) uses the scratch space.
From my investigations I don’t believe it will get 90% working, as the bulk of the issues are writes, but it will certainly highlight how widespread the issue is in compiled code. From my experience over 50% of C code suffers from inadvertent page zero access.
I’d write the entries to a FIFO buffer in the Abort handler and write out to the log via IRQ driven code. I’d also code into the IRQ side an Abort tracker, so it’s not continually writing the same Abort to the logs.
From experience, next to no software actively reads/writes to Page Zero. As Jeffrey has pointed out, there are some special cases such as IRQsema, but the bulk of the issue is coding errors.
I wouldn’t typecast legacy software with the problem, the bulk of the issues I’ve seen are C related and down to coding mistakes or compiler issues. I’d expect a fair percentage of compiled code to exhibit inadvertent Page Zero access.
I concur with this statement, it’s pretty much what I’ve seen in my investigations. Page-protect test on RPCEmu IOMD would seem the first “see what you can break option” I agree, moving Page Zero is the correct way forward as read access needs to be handled for legal reads.
I don’t believe this is realistic, the problem is simply too widespread to create patches. You’d have to analyse the code and try to interpret what its trying to do – this isn’t always straight forward and is very time consuming. To give you some idea of the task you’d face, I’ve probably spent a whole man month over the past six months, coding patches for the Page Zero access issues I’ve detailed on the JASPP forum. You’ll also have to deal with protected/encrypted code, meaning any patching will need to be implemented on-the-fly. In the unlikely event the code is self-modifying or self-checking, patching won’t work – admittedly this probably isn’t an issue in post Iyonix code.
I’d agree with this approach, it’s not the responsibility of the OS developers to patch software bugs. You may however need to make exception for widely used software that breaks, where the developer or source isn’t available to fix it. |
Jeffrey Lee (213) 6048 posts |
Thanks for the feedback, Jon. Interesting that you say that writes were your main problem – although that’s perhaps not too unsurprising considering the age of the software you have to work with. Hopefully I’ll be able to knock out the first version of the compatibility module over the weekend so that I can start getting some real-world testing results.
Hmm, is it the abort restart bug that you’re talking about? The one that stops lazy task swapping from being used on pre rev-T CPUs? Yeah, I guess that would be a problem. Although I’m not actually sure if it counts in this case, considering that I’ll be returning to the instruction after the aborting one. Luckily(?) the current version of the zero page relocation code in the kernel won’t work on StrongARM anyway, as it’s only supported on CPUs where the processor vectors can also be moved up high – which is all ARMv6+ CPUs and some ARMv5 (including Iyonix). At a later date I might look into support for older CPUs (move zero page kernel workspace up high, but keep processor vectors low, and make them fully inaccessible from user mode). There’s provision in the API for this (OS_PlatformFeatures 0 bit 20 only mentions the processor vectors, kernel workspace locations are kept separate via OS_ReadSysInfo 6), and the kernel modifications should be fairly straightforward, so it shouldn’t be a large amount of work to get it working. But I guess I’ll have to wait and see if StrongARM support causes problems for the compatibility module. Or maybe we’ll get lucky and all the buggy software will have been fixed by the time ARMv3/v4 zero page relocation is implemented!
Yeah, moving/deleting scratch space is one of my future goals. |
Rick Murray (539) 13840 posts |
Given the propensity for zero page access to be broken C programs – would it be possible to have an option to output a backtrace? |
Jeffrey Lee (213) 6048 posts |
I think that would fall into the category of “Don’t trap the access and let the C runtime deal with it instead”. CLib and UnixLib are both capable of producing stack traces when things go wrong, and adding the necessary backtrace logic into the module would be a lot of work. Having said that, I am still entertaining the idea of writing my own stack trace code at some point – something which can work with my growing collection of debug tools (which may themselves one day find their way into the OS as standard features). I’ve got the equivalent of ‘addr2line’ for working out which code corresponds to an address within the ROM (see this thread), but the process of turning a raw stack dump into a usable stack trace is still a manual process. |
Sprow (202) 1158 posts |
I’d not meant to imply that, rather to use !Patch as a vehicle to deliver patches provided by others. If the author/source is still around then clearly fixing it at its origin makes much more sense, and only in exceptional cases be staring at a disassembly.
Having had a brief look at !Patch’s keyword support it doesn’t currently appear to allow search by magic fingerprint (only by absolute offset) but adding a new fingerprint keyword would be pretty simple to do from the looks of it. There’s also the AppPatcher module which does hunt for fingerprints at time of expansion but before execution. It has 6 patches in its repertoire at the moment. It’s currently loaded by !Boot on anything pre RISC OS 5 but that’s not set in stone. Personally I’d rather patch the executable once and be done with it – hence the !Patch suggestion. There’s also UnsqueezeAIF which does fingerprint match & patch just before execution too, though gaffing in patches there feels a little inappropriate given the module’s original purpose & name.
Cunning, yes, that could be useful in those oflaofla’s. At least in all this it’s only 32 bit apps (ie. post 2002) that are going to get snared, any 26 bit stuff will be nesting in a cosy emulation/JIT which can fake up zero page however it sees fit. |
Jon Abbott (1421) 2651 posts |
Not specifically although this can be an issue on physical SA with Aborts crossing page boundaries with paging implemented. The complications come from handling early/late abort mode – which isn’t implemented correctly in some emulators. The workaround is fairly straightforward – actively generate aborts for LDR and LDM in your Module initialization, so you know how instructions are behaving. Your interpreter then needs to rollback LDM and LDR with writeback as appropriate before interpreting the instruction and updating the index register on exit. You can’t assume LDM is in late Abort just because LDR is – under emulation they may be different.
Is there a reason to move the scratch space? I fully understand and support relocating Page Zero, but can’t see any advantages to moving the scratch space unless there’s a plan to use the pages for something else. It was my understanding that this area was free to use as per Acorn documentation, I don’t think I’ve seen anything post Acorn that changes this – with the exception of RO5 making the pages read only. |
Jeffrey Lee (213) 6048 posts |
Yeah, I remembered about that over the weekend.
Ah, that’s a shame. However since one of the points of moving zero page is to expose lots of nasty bugs, I think I’d rather implement late abort handling properly (should an ARMv3/v4 version of the code be needed) and leave the emulator bugs for the emulator authors to fix!
Do you happen to have a reference for that documentation? I’m not doubting you, but I’m sure that I tried looking in the PRMs once and couldn’t find any reference to scratch space (Maybe it’s only mentioned in an application note or something?) The fact that the OS uses it itself in some situations means that the rules under which applications can use it must be pretty non-trivial, to the point where I’d be surprised if anyone really did use it – especially post-Archimedes where computers started getting more memory than the programmers would know what to do with. There’s no pressing need for me to move scratch space, it’s just something that it would seem sensible to try and move/get rid of since (a) it’s next to zero page, and (b) is one of the last few hardcoded addresses within the OS. It might also get in the way a bit if we were to implement the plan of giving apps full control over the mapping of application space (from 0 right up to the app space limit). |
Rick Murray (539) 13840 posts |
“An example client would be FileCore using the scratch space to hold structures while working out how to allocate some free space. Another example would be the Filer using the scratch space to hold structures for OS_HeapSort.” I think there’s a doc here that says similar. |
Dave Higton (1515) 3525 posts |
re. scratch space being free to use Use by what processes? Certainly, if it’s read only to applications in RO5, applications can’t use it. It follows that any app that has been running successfully under RO5 doesn’t need it. So whose scratch space is it? |
Jeffrey Lee (213) 6048 posts |
I think we’re getting a bit confused between apps and modules. Judging by the doc Rick found, scratch space is free for modules to use (under certain circumstances), but not for applications. That’s how it’s been possible to make it read-only to user mode in RO 5/6 without breaking anything (except the odd Arc game!) Anyway, the fact that it’s mentioned as being free for use in the RISC OS 6 docs is probably good enough reason for me to leave it alone until we have a more compelling reason to mess with it. |
Steve Drain (222) 1620 posts |
The paragraph about the scratch space that Rick quotes is also in the PRM Vol 5a p39. He omitted this paragraph from his quotation:
So that is the same as RO5, I think. In days long past the definition of this area was much less closely defined and it was suggested as a handy place for the equivalent of DIM LOCAL from BASIC. I have written applications that did this, but none released publicly. Also, BASIC will not deal with $string% below &8000. I am now ‘clean’. ;-) |
Jon Abbott (1421) 2651 posts |
From the PRM it’s used in two OS locations:
It’s use is covered in the memory map: PRM5a-41 describes it as a “Public” area which may be used by any module that is not:
Up to RISCOS 4 is was R/W and free to use by any application, from RISCOS 4 on, it became Read only for applications, which complies with the documentation, but did break some legacy software which made use of the area as temporary space.
You’d still be implementing it properly, you just need to have separate flags for the late/early state of LDR and LDM – it’s a few extra lines of code in your Module Init and one extra flag. If you don’t implement it this way, some emulators will fail. I certainly wouldn’t count on bugs getting fixed, I’ve highlighted several major bugs in actively developed emulators and none have been fixed in several years. I implemented the Early/Late Abort handler in ADFFS back in 2012, here’s my post on the issues with emulators. |
Rick Murray (539) 13840 posts |
To my mind, this area ought to be flagged as “for OS use only”. While it may be useful to have some general “scratch space” (shouldn’t this be a DA these days?), its use should not be encouraged as it could end up a mess with functions trashing other function’s working data. It made sense, back in the days when memory was limited, DAs did not exist, the RMA was obviously never going to be able to self-compact, slot claims worked in multiples of 32K, and a little bit of scratch space was required for things that did not ordinarily lay claim to any RMA allocations (OS_HeapSort being a good example). Would it be possible to have an option, when trapping low page zero addresses, to also trap and report scratch space accesses? In this way, it can be seen exactly how often this part of memory is used, and by what. BTW, I noticed an interesting comment here: https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Lib/RISC_OSLib/kernel/s/k_body?rev=4.28#l361 I suspect the comment is out of date, for it looks as if the command line is copied to the stack, and when invoking OS_GetEnv to read the command line, it returned the address &FAF40400 – which is in the “Kernel buffers” DA – so… um… hasn’t half of this work already been done? ;-) |
Jon Abbott (1421) 2651 posts |
It wouldn’t be possible as its already read only for User, you’d have to remove the pages to trap access from IRQ/FIQ and SVC which would mean the OS would have to me modified first, as would any FileCore Modules which use the area. The fact it’s been read only for over 10 years and not caused a problem tells me it’s being used as documented in the PRM, okay there’s some legacy games that use it from the RISCOS 2 era, but that’s my problem to fix, should the area ever be removed. |
Steve Drain (222) 1620 posts |
Certainly the scratch space is writeable from User mode under RO4 and from what I have read, I do not think read-only was introduced until RO6, although that is hardly relevant now. ;-) From poking about in the scratch space, I seem to recall that Fresco and perhaps other ANT applications used it, and I certainly ran those under RO4. |
Steve Drain (222) 1620 posts |
It is not mentioned in the PRM, but from Gerph’s rambles I discovered that OS_EvaluateExpression used it until he extracted that as a separate module. Does it still in RO5? |
Dave Higton (1515) 3525 posts |
Scratch space looks to me very 1980s. Its use has not been properly defined. I can’t see any reason for a module written today to use it – there’s enough RAM available from RMA and DA, so why would you? And apps can’t use it because it’s read only. I’d suggest that we should officially deprecate its use. Given the limitations on its use, and that the OS does appear to use it in one or two places, I wonder if it’s responsible for any of these rare non-repeatable crashes that we occasionally see? |
Steve Pampling (1551) 8170 posts |
Actually, that needs checking. If people know RO6 has that lock down and 4.39 does not then the applications that are already known to have problems under RO6 but not RO4.39 can be noted as likely to have a problem with this change (assuming they haven’t already died because they aren’t 32 bit) Handy information derived from obscure sources sort of thing. |
Jeffrey Lee (213) 6048 posts |
It is not mentioned in the PRM, but from Gerph’s rambles I discovered that OS_EvaluateExpression used it until he extracted that as a separate module. Yes, it looks like RO5 OS_EvaluateExpression is still using it. There’s a (partial, perhaps also inaccurate) list of users in the KernelWS header. But from a quick search of the OS sources last night I know that there are lots of other places where it’s used (e.g. various bits of filesystem code)
It’s possible. There was a bug I fixed in FileCore where it was sometimes reading off of the start of scratch space. In that case I think it was harmless (for regular builds), but with scratch space being such a poorly controlled block of memory I wouldn’t be surprised if there are other issues still yet to be found (e.g. anything running off the end and into application space, or cases where two systems might try using it at the same time). Maybe the way forward is to keep the current scratch space where it is, but deprecate its use by third-party code, and update the OS to use a new location. The new location can have unmapped memory on either side, and access can be controlled via a simple mutex SWI to ensure only one system is allowed to use it at a time. Then over time review why and how it’s being used by the OS and see if modules could be updated to use private memory instead. |
Jeffrey Lee (213) 6048 posts |
I do not think read-only was introduced until RO6, although that is hardly relevant now. ;-) Scratch space (and the rest of zero page) was change to be read-only to user mode during RISC OS 5 development (2001/2002, ‘lots of Tungsten work’, line 462 in the right-hand column). |
Steve Pampling (1551) 8170 posts |
I was thinking that if Justin implemented similar change work then the effect would be similar and give clues where to look. Two branches both changed in a similar area, both would have similar effect on applications. No? |
Jeffrey Lee (213) 6048 posts |
Which change are you talking about? I’m getting confused here. Finding where scratch space is used in the OS source is trivial, it’s just a case of searching for ‘ScratchSpace’, as that’s the constant that defines its address. |
Steve Pampling (1551) 8170 posts |
Reference to breaking of legacy applications. It may not be applicable anyway as they may not have survived the 26/32 bit change. I may be getting vague – I’ve been up since 03:30, in work at 05:00 and the only people sleeping round here are small and furry. |
Steve Drain (222) 1620 posts |
Although the PRM places those limitations on its use, it is not beyond possibility that a routine called by a routine etc… does re-use the scratch space and cause those rare crashes. So, yes, let’s get free of it if we can. |