Zero page protection

122 posts, 25 voices


Jun 19, 2015 12:50am Jeffrey Lee (213) 6048 posts	“Building an OMAP4 rom with zero page protection fails at various places with the following or similar,”

Should be fixed now. I’m not sure exactly when I last ran a build with zero page relocation, so don’t be surprised if a bug or two have crept into the ROM modules.

Jun 19, 2015 8:34am Jon Abbott (1421) 2651 posts	Having spent the best part of the past six months looking into this issue and coding specific support for it in the next release of ADFFS, I can probably add some detail into the problem you’re facing, albeit from a legacy game standpoint. Although some of the issues are game specific, the bulk aren’t and could affect any software, legacy or new. Of the 50+ games ADFFS now supports on the Pi, I’ve detailed the page zero access issues in this post on the JASPP forum.

In summary, the bulk of the issues stem from C code failing to initialise pointers before use and a bug that was probably introduced into the RISCOS 3.0+ sound stack, where the GateOn entry doesn’t seem to be called prior to Fill, resulting in the SCCB not being initialized by the GateOn code. These two cover around 95% of all page zero access, the remaining 5% are down to coding mistakes.

To resolve these issues, I’ve used a combination of fixing the original code, coding an Abort handler to handle writes and a pre-interpreter to handle reads – the latter would be better implemented by relocating page zero and handling reads via an Abort as currently every LDR/LDM has to be executed too and then interpreted by the JIT prior to copying/re-interpretation. I intend to revisit the read handler once page zero relocation is implemented and recode it via Aborts.

Addressing the key points detailed in posts above:

“Writes will not be trapped and will trigger an abort as normal”

In my experience, the bulk of the problem is writes and from the code I’ve looked at to date, the majority of the code is under IRQ or CallEvery’s and may trigger repeatedly. This may result in log overload if you’re not careful.

“Only LDM/LDR/LDR[S]B/LDR[S]H/LDRD instructions will be trapped. More exotic stuff (FPA/VFP/NEON/LDREX/RFE/LDRT/etc.) will be left to abort.”

To cover the bulk of the page zero access, any load/store instruction that is used by a C compiler will need handling, otherwise it’s going to be ineffective against modern reincarnations of the issue. However, it’s probably fair to assume that ARMv5+ specific instructions could be left to Abort, as it’s likely the programmer or source is available to fix the problem. StrongARM support does add complications to the instruction interpreter and, considering the amount of 32bit StrongARM code about, could possibly be dropped, leaving the bulk of the issues coming from Iyonix era code. One thing you need to be aware of, if StrongARM is considered, is that it may break under RPCEmu as it doesn’t implement the MMU correctly, some Abort types aren’t handled.

Although we’re discussing Page Zero here, consideration should also be given to the scratch space at 4000-8000. This has been read only for a while however, so it may no longer be an issue. Of the 50+ games I’ve looked at, only one that I’m aware of (Jahangir Khan World Championship Squash) uses the scratch space.

“In my mind that should be enough to get 90% of legacy software working without any impact.”

From my investigations I don’t believe it will get 90% working, as the bulk of the issues are writes, but it will certainly highlight how widespread the issue is in compiled code. From my experience over 50% of C code suffers from inadvertent page zero access.

“How/when to write to the log file will need a bit of thought”

I’d write the entries to a FIFO buffer in the Abort handler and write out to the log via IRQ driven code. I’d also code into the IRQ side an Abort tracker, so it’s not continually writing the same Abort to the logs.

“Why is software in 2015 reading page zero locations instead of the legal SWI?”

From experience, next to no software actively reads/writes to Page Zero. As Jeffrey has pointed out, there are some special cases such as IRQsema, but the bulk of the issue is coding errors.

“Because it’s legacy? (where legacy can mean ‘written in the previous century’)”

I wouldn’t typecast legacy software with the problem, the bulk of the issues I’ve seen are C related and down to coding mistakes or compiler issues. I’d expect a fair percentage of compiled code to exhibit inadvertent Page Zero access.

“Judging by my experience with fixing up the OS, I expect 90% of the breakages to be down to apps performing ‘harmless’ null pointer dereferences. There certainly won’t be any user mode code writing to zero page (since it’s already read-only in RISC OS 5), but there will be a small percentage of code which either reads or writes to known zero page locations (processor vectors, kernel workspace, etc.)”

I concur with this statement, it’s pretty much what I’ve seen in my investigations.

“Page-protect test on RPCEmu IOMD would seem the first ‘see what you can break option’”

“That will actually result in more breakage than simply moving zero page – there are a few locations which need to be readable from user mode (CLib workspace pointer, CLib tmpnam() counter, OS_ChangedBox buffer, etc.)”

I agree, moving Page Zero is the correct way forward as read access needs to be handled for legal reads.

“From reading your bullet list I think the aim here is to spend a watching for problematic apps, and using the logged data to formulate patches (applied with !Patch, that’s what it’s there for) like Acorn did when StrongARM came out.”

I don’t believe this is realistic, the problem is simply too widespread to create patches. You’d have to analyse the code and try to interpret what it’s trying to do – this isn’t always straightforward and is very time consuming. To give you some idea of the task you’d face, I’ve probably spent a whole man month over the past six months, coding patches for the Page Zero access issues I’ve detailed on the JASPP forum. You’ll also have to deal with protected/encrypted code, meaning any patching will need to be implemented on-the-fly. In the unlikely event the code is self-modifying or self-checking, patching won’t work – admittedly this probably isn’t an issue in post Iyonix code.

“Actually I was hoping it was more a case of me writing the compatibility module and then forgetting about it, leaving the app developers to fix their broken code themselves. If the onus is on us to patch everyone else’s code then we’ll never get any other work done!”

I’d agree with this approach, it’s not the responsibility of the OS developers to patch software bugs. You may however need to make an exception for widely used software that breaks, where the developer or source isn’t available to fix it.
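As a rough illustration of the "FIFO buffer in the Abort handler, written out via IRQ driven code, with an Abort tracker" idea from the post above, here is a minimal C sketch. It is not ADFFS code and not the compatibility module: the names `abort_rec`, `abort_log`, `abort_log_push` and `abort_log_drain`, the buffer sizes, and the choice to identify an abort by its (PC, address) pair are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one trapped access. */
typedef struct {
    uint32_t pc;    /* address of the aborting instruction */
    uint32_t addr;  /* faulting address */
} abort_rec;

#define FIFO_SIZE 64u  /* assumed size; must be a power of two */
#define SEEN_SIZE 32u  /* assumed size of the "already logged" table */

typedef struct {
    abort_rec buf[FIFO_SIZE];
    volatile uint32_t head;     /* advanced only by the abort handler */
    volatile uint32_t tail;     /* advanced only by the drain routine */
    uint32_t dropped;           /* records lost because the FIFO was full */
    abort_rec seen[SEEN_SIZE];  /* aborts that have already been reported */
    uint32_t seen_count;
} abort_log;

/* Producer side: cheap enough to call from an abort handler. Never blocks;
   returns 0 and counts a drop if the FIFO is full. */
static int abort_log_push(abort_log *log, uint32_t pc, uint32_t addr)
{
    uint32_t head = log->head;
    if (head - log->tail == FIFO_SIZE) {
        log->dropped++;
        return 0;
    }
    log->buf[head & (FIFO_SIZE - 1u)].pc = pc;
    log->buf[head & (FIFO_SIZE - 1u)].addr = addr;
    log->head = head + 1u;
    return 1;
}

/* Consumer side: intended to run later (e.g. from IRQ/callback context).
   Empties the FIFO and calls emit() only for aborts not reported before.
   Returns the number of records reported. Once the seen table is full,
   new aborts are still reported but are no longer remembered. */
static unsigned abort_log_drain(abort_log *log, void (*emit)(const abort_rec *))
{
    unsigned emitted = 0;
    while (log->tail != log->head) {
        abort_rec r = log->buf[log->tail & (FIFO_SIZE - 1u)];
        int duplicate = 0;
        log->tail = log->tail + 1u;
        for (uint32_t i = 0; i < log->seen_count; i++) {
            if (log->seen[i].pc == r.pc && log->seen[i].addr == r.addr) {
                duplicate = 1;
                break;
            }
        }
        if (duplicate)
            continue;
        if (log->seen_count < SEEN_SIZE)
            log->seen[log->seen_count++] = r;
        if (emit != NULL)
            emit(&r);
        emitted++;
    }
    return emitted;
}
```

With this split the abort handler only ever touches `head` and the drain routine only ever touches `tail`, so a repeatedly firing IRQ or CallEvery routine fills the FIFO at worst and produces a single log line once drained.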

Jun 19, 2015 9:16am Jeffrey Lee (213) 6048 posts	Thanks for the feedback, Jon. Interesting that you say that writes were your main problem – although that’s perhaps not too surprising considering the age of the software you have to work with. Hopefully I’ll be able to knock out the first version of the compatibility module over the weekend so that I can start getting some real-world testing results.

“StrongARM support does add complications to the instruction interpreter and, considering the amount of 32bit StrongARM code about, could possibly be dropped, leaving the bulk of the issues coming from Iyonix era code. One thing you need to be aware of, if StrongARM is considered, is that it may break under RPCEmu as it doesn’t implement the MMU correctly, some Abort types aren’t handled.”

Hmm, is it the abort restart bug that you’re talking about? The one that stops lazy task swapping from being used on pre rev-T CPUs? Yeah, I guess that would be a problem. Although I’m not actually sure if it counts in this case, considering that I’ll be returning to the instruction after the aborting one.

Luckily(?) the current version of the zero page relocation code in the kernel won’t work on StrongARM anyway, as it’s only supported on CPUs where the processor vectors can also be moved up high – which is all ARMv6+ CPUs and some ARMv5 (including Iyonix). At a later date I might look into support for older CPUs (move zero page kernel workspace up high, but keep processor vectors low, and make them fully inaccessible from user mode). There’s provision in the API for this (OS_PlatformFeatures 0 bit 20 only mentions the processor vectors, kernel workspace locations are kept separate via OS_ReadSysInfo 6), and the kernel modifications should be fairly straightforward, so it shouldn’t be a large amount of work to get it working. But I guess I’ll have to wait and see if StrongARM support causes problems for the compatibility module. Or maybe we’ll get lucky and all the buggy software will have been fixed by the time ARMv3/v4 zero page relocation is implemented!

“Although we’re discussing Page Zero here, consideration should also be given to the scratch space at 4000-8000. This has been read only for a while however, so it may no longer be an issue. Of the 50+ games I’ve looked at, only one that I’m aware of (Jahangir Khan World Championship Squash) uses the scratch space.”

Yeah, moving/deleting scratch space is one of my future goals.

Jun 19, 2015 12:03pm Rick Murray (539) 13840 posts	Given the propensity for zero page access to be broken C programs – would it be possible to have an option to output a backtrace?

Jun 19, 2015 12:31pm Jeffrey Lee (213) 6048 posts	“Given the propensity for zero page access to be broken C programs – would it be possible to have an option to output a backtrace?”

I think that would fall into the category of “Don’t trap the access and let the C runtime deal with it instead”. CLib and UnixLib are both capable of producing stack traces when things go wrong, and adding the necessary backtrace logic into the module would be a lot of work.

Having said that, I am still entertaining the idea of writing my own stack trace code at some point – something which can work with my growing collection of debug tools (which may themselves one day find their way into the OS as standard features). I’ve got the equivalent of ‘addr2line’ for working out which code corresponds to an address within the ROM (see this thread), but the process of turning a raw stack dump into a usable stack trace is still a manual process.

Jun 19, 2015 7:55pm Sprow (202) 1158 posts	“If the onus is on us to patch everyone else’s code then we’ll never get any other work done!”

I’d not meant to imply that, rather to use !Patch as a vehicle to deliver patches provided by others. If the author/source is still around then clearly fixing it at its origin makes much more sense, and only in exceptional cases be staring at a disassembly.

“I’m not sure how well !Patch is able to cope with that situation – I’d assume that the dodgy code sequence will be at different locations for different apps.”

Having had a brief look at !Patch’s keyword support it doesn’t currently appear to allow search by magic fingerprint (only by absolute offset) but adding a new fingerprint keyword would be pretty simple to do from the looks of it. There’s also the AppPatcher module which does hunt for fingerprints at time of expansion but before execution. It has 6 patches in its repertoire at the moment. It’s currently loaded by !Boot on anything pre RISC OS 5 but that’s not set in stone. Personally I’d rather patch the executable once and be done with it – hence the !Patch suggestion. There’s also UnsqueezeAIF which does fingerprint match & patch just before execution too, though gaffing in patches there feels a little inappropriate given the module’s original purpose & name.

“For ofla errors I’ve realised that we could quite easily add some code to the kernel (perhaps for odd-numbered development versions only?) which checks for null pointers on the SWI error exit and replaces it with something more useful (‘SWI xxx returned null error pointer’?).”

Cunning, yes, that could be useful in those oflaofla’s.

At least in all this it’s only 32 bit apps (ie. post 2002) that are going to get snared, any 26 bit stuff will be nesting in a cosy emulation/JIT which can fake up zero page however it sees fit.
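To make the difference between patching at an absolute offset and patching by "magic fingerprint" concrete, here is a small C sketch of the latter. It does not reflect the real !Patch file format, AppPatcher or UnsqueezeAIF; the function names and the rule of refusing to patch when the fingerprint matches more than once are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Search a loaded image for a byte sequence, stepping a word at a time
   (ARM instructions are word aligned). Returns the byte offset of the
   first match, or -1 if there is none. */
static long find_fingerprint(const uint8_t *image, size_t image_len,
                             const uint8_t *fp, size_t fp_len)
{
    if (fp_len == 0 || fp_len > image_len)
        return -1;
    for (size_t off = 0; off + fp_len <= image_len; off += 4) {
        if (memcmp(image + off, fp, fp_len) == 0)
            return (long)off;
    }
    return -1;
}

/* Overwrite the fingerprint with a same-length replacement, but only if
   the fingerprint occurs exactly once in the image. Returns 1 if the
   patch was applied, 0 if it was missing or ambiguous. */
static int patch_by_fingerprint(uint8_t *image, size_t image_len,
                                const uint8_t *fp, const uint8_t *replacement,
                                size_t len)
{
    long first = find_fingerprint(image, image_len, fp, len);
    if (first < 0)
        return 0;
    size_t next = (size_t)first + 4;
    if (next < image_len &&
        find_fingerprint(image + next, image_len - next, fp, len) >= 0)
        return 0;  /* more than one candidate: do not guess */
    memcpy(image + first, replacement, len);
    return 1;
}
```

Because the search is by content, the same patch description can apply to builds where the offending sequence sits at different offsets, which is the situation the quoted concern describes.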

Jun 22, 2015 7:59am Jon Abbott (1421) 2651 posts	“is it the abort restart bug that you’re talking about?”

Not specifically, although this can be an issue on physical SA with Aborts crossing page boundaries with paging implemented. The complications come from handling early/late abort mode – which isn’t implemented correctly in some emulators. The workaround is fairly straightforward – actively generate aborts for LDR and LDM in your Module initialization, so you know how instructions are behaving. Your interpreter then needs to roll back LDM and LDR with writeback as appropriate before interpreting the instruction and updating the index register on exit. You can’t assume LDM is in late Abort just because LDR is – under emulation they may be different.

“moving/deleting scratch space is one of my future goals.”

Is there a reason to move the scratch space? I fully understand and support relocating Page Zero, but can’t see any advantages to moving the scratch space unless there’s a plan to use the pages for something else. It was my understanding that this area was free to use as per Acorn documentation, I don’t think I’ve seen anything post Acorn that changes this – with the exception of RO5 making the pages read only.
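The rollback step for LDM can be sketched in C as follows. This is only an illustration of the idea in the post above, not ADFFS code: the `abort_model` structure and function names are hypothetical, the two flags are assumed to have been filled in by deliberately provoking aborts at module initialisation, and awkward cases (base register in the register list, user-bank transfers) are ignored. The bit positions used are those of the standard ARM LDM/STM encoding (U = bit 23, W = bit 21, register list = bits 0-15).

```c
#include <stdint.h>

/* Hypothetical result of probing at module initialisation: whether the
   base register has already been written back when the abort is taken.
   Kept as two separate flags because LDR and LDM may differ. */
typedef struct {
    int ldr_base_updated;
    int ldm_base_updated;
} abort_model;

/* True for the ARM block data transfer class (LDM/STM): bits 27-25 = 100. */
static int is_ldm_stm(uint32_t instr)
{
    return (instr & 0x0E000000u) == 0x08000000u;
}

/* Number of registers named in the 16-bit register list. */
static unsigned reg_count(uint32_t instr)
{
    unsigned n = 0;
    for (uint32_t list = instr & 0xFFFFu; list != 0; list >>= 1)
        n += list & 1u;
    return n;
}

/* Recover the base register value from before the aborted LDM/STM. */
static uint32_t ldm_original_base(uint32_t instr, uint32_t base_now,
                                  const abort_model *model)
{
    uint32_t delta = 4u * reg_count(instr);
    if (!model->ldm_base_updated)       /* base was left untouched */
        return base_now;
    if ((instr & (1u << 21)) == 0)      /* W clear: no writeback requested */
        return base_now;
    /* U set means the base was incremented, so undo by subtracting. */
    return (instr & (1u << 23)) ? base_now - delta : base_now + delta;
}
```

For example, `0xE8B0000E` is `LDMIA R0!,{R1-R3}`: if the abort leaves R0 at `0x100C` with the base already updated, the function returns `0x1000`, the value the interpreter should start from.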

Jun 22, 2015 11:07am Jeffrey Lee (213) 6048 posts	“The complications come from handling early/late abort mode”

Yeah, I remembered about that over the weekend.

“You can’t assume LDM is in late Abort just because LDR is – under emulation they may be different.”

Ah, that’s a shame. However since one of the points of moving zero page is to expose lots of nasty bugs, I think I’d rather implement late abort handling properly (should an ARMv3/v4 version of the code be needed) and leave the emulator bugs for the emulator authors to fix!

“It was my understanding that this area was free to use as per Acorn documentation, I don’t think I’ve seen anything post Acorn that changes this – with the exception of RO5 making the pages read only.”

Do you happen to have a reference for that documentation? I’m not doubting you, but I’m sure that I tried looking in the PRMs once and couldn’t find any reference to scratch space (Maybe it’s only mentioned in an application note or something?) The fact that the OS uses it itself in some situations means that the rules under which applications can use it must be pretty non-trivial, to the point where I’d be surprised if anyone really did use it – especially post-Archimedes where computers started getting more memory than the programmers would know what to do with.

There’s no pressing need for me to move scratch space, it’s just something that it would seem sensible to try and move/get rid of since (a) it’s next to zero page, and (b) is one of the last few hardcoded addresses within the OS. It might also get in the way a bit if we were to implement the plan of giving apps full control over the mapping of application space (from 0 right up to the app space limit).

Jun 22, 2015 12:24pm Rick Murray (539) 13840 posts	“An example client would be FileCore using the scratch space to hold structures while working out how to allocate some free space. Another example would be the Filer using the scratch space to hold structures for OS_HeapSort.” http://www.riscos.com/support/developers/riscos6/memory/logicalmemorymap.html I think there’s a doc here that says similar. Given that FileCore uses scratch space to work out allocations – is it / was it ever safe for user applications, or was it just “some free memory for OS use”?

Jun 22, 2015 12:39pm Dave Higton (1515) 3525 posts	re. scratch space being free to use Use by what processes? Certainly, if it’s read only to applications in RO5, applications can’t use it. It follows that any app that has been running successfully under RO5 doesn’t need it. So whose scratch space is it?

Jun 22, 2015 12:55pm Jeffrey Lee (213) 6048 posts	I think we’re getting a bit confused between apps and modules. Judging by the doc Rick found, scratch space is free for modules to use (under certain circumstances), but not for applications. That’s how it’s been possible to make it read-only to user mode in RO 5/6 without breaking anything (except the odd Arc game!) Anyway, the fact that it’s mentioned as being free for use in the RISC OS 6 docs is probably good enough reason for me to leave it alone until we have a more compelling reason to mess with it.

Jun 22, 2015 3:55pm Steve Drain (222) 1620 posts	The paragraph about the scratch space that Rick quotes is also in the PRM Vol 5a p39. He omitted this paragraph from his quotation:

“Under RISC OS 4, the system workspace (also known as ‘zero page’) and scratch space were only protected against being written to from USR mode for the first 256 bytes. The entire 32K of this region is now only readable in USR mode.”

So that is the same as RO5, I think.

In days long past the definition of this area was much less closely defined and it was suggested as a handy place for the equivalent of DIM LOCAL from BASIC. I have written applications that did this, but none released publicly. Also, BASIC will not deal with $string% below &8000. I am now ‘clean’. ;-)

Jun 22, 2015 4:44pm Jon Abbott (1421) 2651 posts	“Do you happen to have a reference for that documentation?”

From the PRM it’s used in two OS locations:

- OS_HeapSort if R1 bit 29 isn’t set (PRM1-972)
- FileCore modules when FileCore allocates a temporary buffer, if bit 3 in the FileCore Descriptor Block flags is set (PRM2-598)

Its use is covered in the memory map: PRM5a-41 describes it as a “Public” area which may be used by any module, provided it is not:

- used in an IRQ routine
- used if you call something else that might also use it

Up to RISCOS 4 it was R/W and free to use by any application, from RISCOS 4 on, it became Read only for applications, which complies with the documentation, but did break some legacy software which made use of the area as temporary space.

“I think I’d rather implement late abort handling properly (should an ARMv3/v4 version of the code be needed) and leave the emulator bugs for the emulator authors to fix!”

You’d still be implementing it properly, you just need to have separate flags for the late/early state of LDR and LDM – it’s a few extra lines of code in your Module Init and one extra flag. If you don’t implement it this way, some emulators will fail. I certainly wouldn’t count on bugs getting fixed, I’ve highlighted several major bugs in actively developed emulators and none have been fixed in several years. I implemented the Early/Late Abort handler in ADFFS back in 2012, here’s my post on the issues with emulators.

Jun 22, 2015 8:01pm Rick Murray (539) 13840 posts	To my mind, this area ought to be flagged as “for OS use only”. While it may be useful to have some general “scratch space” (shouldn’t this be a DA these days?), its use should not be encouraged as it could end up a mess with functions trashing other functions’ working data.

It made sense, back in the days when memory was limited, DAs did not exist, the RMA was obviously never going to be able to self-compact, slot claims worked in multiples of 32K, and a little bit of scratch space was required for things that did not ordinarily lay claim to any RMA allocations (OS_HeapSort being a good example). It makes a little less sense these days – perhaps “System workspace” (the 32K at the bottom of TaskManager) could be made a true DA and the few things that did changed to use that?

Would it be possible to have an option, when trapping low page zero addresses, to also trap and report scratch space accesses? In this way, it can be seen exactly how often this part of memory is used, and by what.

BTW, I noticed an interesting comment here: https://www.riscosopen.org/viewer/view/castle/RiscOS/Sources/Lib/RISC_OSLib/kernel/s/k_body?rev=4.28#l361 I suspect the comment is out of date, for it looks as if the command line is copied to the stack, and when invoking OS_GetEnv to read the command line, it returned the address &FAF40400 – which is in the “Kernel buffers” DA – so… um… hasn’t half of this work already been done? ;-)

Jun 23, 2015 6:25am Jon Abbott (1421) 2651 posts	“Would it be possible to have an option, when trapping low page zero addresses, to also trap and report scratch space accesses?”

It wouldn’t be possible as it’s already read only for User, you’d have to remove the pages to trap access from IRQ/FIQ and SVC, which would mean the OS would have to be modified first, as would any FileCore Modules which use the area. The fact it’s been read only for over 10 years and not caused a problem tells me it’s being used as documented in the PRM, okay there’s some legacy games that use it from the RISCOS 2 era, but that’s my problem to fix, should the area ever be removed.

Jun 23, 2015 10:12am Steve Drain (222) 1620 posts	“Up to RISCOS 4 it was R/W and free to use by any application, from RISCOS 4 on, it became Read only for applications, which complies with the documentation, but did break some legacy software which made use of the area as temporary space.”

Certainly the scratch space is writeable from User mode under RO4 and from what I have read, I do not think read-only was introduced until RO6, although that is hardly relevant now. ;-) From poking about in the scratch space, I seem to recall that Fresco and perhaps other ANT applications used it, and I certainly ran those under RO4.

Jun 23, 2015 10:17am Steve Drain (222) 1620 posts	“From the PRM it’s used in two OS locations:”

It is not mentioned in the PRM, but from Gerph’s rambles I discovered that OS_EvaluateExpression used it until he extracted that as a separate module. Does it still in RO5?

Jun 23, 2015 11:50am Dave Higton (1515) 3525 posts	Scratch space looks to me very 1980s. Its use has not been properly defined. I can’t see any reason for a module written today to use it – there’s enough RAM available from RMA and DA, so why would you? And apps can’t use it because it’s read only. I’d suggest that we should officially deprecate its use. Given the limitations on its use, and that the OS does appear to use it in one or two places, I wonder if it’s responsible for any of these rare non-repeatable crashes that we occasionally see?

Jun 23, 2015 12:47pm Steve Pampling (1551) 8170 posts	“I do not think read-only was introduced until RO6, although that is hardly relevant now. ;-)”

Actually, that needs checking. If people know RO6 has that lock down and 4.39 does not then the applications that are already known to have problems under RO6 but not RO4.39 can be noted as likely to have a problem with this change (assuming they haven’t already died because they aren’t 32 bit). Handy information derived from obscure sources sort of thing.

Jun 23, 2015 12:50pm Jeffrey Lee (213) 6048 posts	“It is not mentioned in the PRM, but from Gerph’s rambles I discovered that OS_EvaluateExpression used it until he extracted that as a separate module. Does it still in RO5?”

Yes, it looks like RO5 OS_EvaluateExpression is still using it. There’s a (partial, perhaps also inaccurate) list of users in the KernelWS header. But from a quick search of the OS sources last night I know that there are lots of other places where it’s used (e.g. various bits of filesystem code).

“Given the limitations on its use, and that the OS does appear to use it in one or two places, I wonder if it’s responsible for any of these rare non-repeatable crashes that we occasionally see?”

It’s possible. There was a bug I fixed in FileCore where it was sometimes reading off of the start of scratch space. In that case I think it was harmless (for regular builds), but with scratch space being such a poorly controlled block of memory I wouldn’t be surprised if there are other issues still yet to be found (e.g. anything running off the end and into application space, or cases where two systems might try using it at the same time).

Maybe the way forward is to keep the current scratch space where it is, but deprecate its use by third-party code, and update the OS to use a new location. The new location can have unmapped memory on either side, and access can be controlled via a simple mutex SWI to ensure only one system is allowed to use it at a time. Then over time review why and how it’s being used by the OS and see if modules could be updated to use private memory instead.
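The "simple mutex" idea in the post above could look something like the following C sketch. No such SWI exists in the thread; `scratch_lock`, `scratch_claim` and `scratch_release` are hypothetical names, the owner ID scheme is an assumption, and C11 atomics are used here only to keep the example portable (a kernel implementation would more likely rely on running with interrupts disabled).

```c
#include <stdatomic.h>

/* Hypothetical lock word: 0 means free, anything else is the ID of the
   module that currently owns the shared scratch area. */
typedef struct {
    atomic_uint owner;
} scratch_lock;

/* Try to claim the scratch area for owner_id (must be non-zero).
   Returns 1 on success, 0 if someone else already holds it. Never blocks,
   so a caller in an awkward context can fall back to private memory. */
static int scratch_claim(scratch_lock *lock, unsigned owner_id)
{
    unsigned expected = 0;
    if (owner_id == 0)
        return 0;
    return atomic_compare_exchange_strong(&lock->owner, &expected, owner_id);
}

/* Release the scratch area. Only the current owner may release it.
   Returns 1 on success, 0 if owner_id does not hold the lock. */
static int scratch_release(scratch_lock *lock, unsigned owner_id)
{
    unsigned expected = owner_id;
    if (owner_id == 0)
        return 0;
    return atomic_compare_exchange_strong(&lock->owner, &expected, 0u);
}
```

A failed claim is exactly the "two systems using it at the same time" case mentioned above, so a development build could log it rather than silently sharing the memory.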

Jun 23, 2015 1:13pm Jeffrey Lee (213) 6048 posts	“I do not think read-only was introduced until RO6, although that is hardly relevant now. ;-)”

“Actually, that needs checking. If people know RO6 has that lock down and 4.39 does not then the applications that are already known to have problems under RO6 but not RO4.39 can be noted as likely to have a problem with this change (assuming they haven’t already died because they aren’t 32 bit)”

Scratch space (and the rest of zero page) was changed to be read-only to user mode during RISC OS 5 development (2001/2002, ‘lots of Tungsten work’, line 462 in the right-hand column).

Jun 23, 2015 2:02pm Steve Pampling (1551) 8170 posts	“Scratch space (and the rest of zero page) was changed to be read-only to user mode during RISC OS 5 development (2001/2002, ‘lots of Tungsten work’, line 462 in the right-hand column).”

I was thinking that if Justin implemented similar change work then the effect would be similar and give clues where to look. Two branches both changed in a similar area, both would have similar effect on applications. No?

Jun 23, 2015 2:22pm Jeffrey Lee (213) 6048 posts	Which change are you talking about? I’m getting confused here. Finding where scratch space is used in the OS source is trivial, it’s just a case of searching for ‘ScratchSpace’, as that’s the constant that defines its address.

Jun 23, 2015 2:36pm Steve Pampling (1551) 8170 posts	“Which change are you talking about? I’m getting confused here”

Reference to breaking of legacy applications. It may not be applicable anyway as they may not have survived the 26/32 bit change. No great importance. I may be getting vague – I’ve been up since 03:30, in work at 05:00 and the only people sleeping round here are small and furry.

Jun 24, 2015 9:46am Steve Drain (222) 1620 posts	“I’d suggest that we should officially deprecate its use. Given the limitations on its use, and that the OS does appear to use it in one or two places, I wonder if it’s responsible for any of these rare non-repeatable crashes that we occasionally see?”

Although the PRM places those limitations on its use, it is not beyond possibility that a routine called by a routine etc… does re-use the scratch space and cause those rare crashes. So, yes, let’s get free of it if we can.
