OS_ChangeDynamicArea freezes the system
Terje Slettebø (285) 275 posts |
Hi all. I’m working on a small graphics demo for the BeagleBoard, which uses double-buffering for the screen, so I’m using OS_ChangeDynamicArea to set the screen memory to at least twice the size of a screen buffer. This works fine the first time the call is made, but running the program again freezes the system. Here’s a BASIC program to demonstrate the problem:

SYS "OS_ReadDynamicArea",2 TO ,size%
new_size%=1280*720*4*2-size%
PRINT "size, new size=",size%,new_size%
SYS "OS_ChangeDynamicArea",2,new_size%
PRINT "Done"

Could somebody confirm that the problem happens for them, as well? What could be the cause of this, and might there be some other way to achieve the same effect (changing the available screen memory)? If I drag the slider in the Task Manager, it’s still reset to the minimal value when the application is run. Regards, Terje |
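For readers following the arithmetic in the program above: OS_ChangeDynamicArea takes a signed amount to change the area by, not an absolute size, so `new_size%` is really a delta. The sketch below is plain Python rather than RISC OS code, and only reproduces that calculation. The function names and the example current sizes are assumptions made for illustration; they are not values reported in the thread.

```python
# Figures taken from the BASIC program: two 1280x720 buffers at 4 bytes per pixel.
WIDTH, HEIGHT, BYTES_PER_PIXEL, BUFFERS = 1280, 720, 4, 2


def required_bytes():
    """Total screen memory needed for double buffering, in bytes."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL * BUFFERS


def change_amount(current_size):
    """Signed delta that the BASIC program passes in R1.

    Positive means the screen area must grow, negative means it shrinks.
    """
    return required_bytes() - current_size


if __name__ == "__main__":
    # Hypothetical current sizes, chosen only to show both signs of the delta.
    for current in (3686400, 8388608):
        print(current, change_amount(current))
```

A negative delta corresponds to the case described later in the thread, where the area has already been enlarged in the Task Manager and the program shrinks it instead.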
Jeffrey Lee (213) 6048 posts |
It looks like it might be a bug in one of the ‘fixes’ I made to the cache/TLB maintenance ops a week or two ago, since the machine also freezes if I follow the steps from this bug. Changing the abort handler to ignore aborts caused by all CP15 MCR ops doesn’t seem to fix the issue, so it must be a regular abort caused by some bad code. But at the same time, I can’t see anything obviously wrong with the new cache/TLB code. I’ll have a proper look tomorrow. |
Terje Slettebø (285) 275 posts |
Thanks for the reply. I forgot to mention: I’m running an older version of RISC OS from a couple of months back, so this problem can’t be due to changes made a few weeks ago. When I get home tonight, I’ll update to the latest version, and see if that changes anything. Could you confirm whether the above BASIC program also freezes your system when run twice? |
Jeffrey Lee (213) 6048 posts |
I didn’t really test the BASIC program properly – I did it all from the command line, so the second OS_ChangeDynamicArea call didn’t have to move any memory at all.
How old is the ROM image? If it’s from before March then it won’t contain the initial fix for the aborting MCR ops, which is the only reason I can think of why the code could crash. |
Terje Slettebø (285) 275 posts |
That’s hard to tell, because both the one I had and the new one I just downloaded say “RISC OS 5.17 (19 Jan 2010)” when using *FX0. A couple of questions: 1) Am I doing something wrong, such as downloading the wrong RISC OS image? If I’m getting the right one, how come the version or date hasn’t been updated to reflect the new version? 2) I read a while back that it would now boot into the Desktop, but it doesn’t. Has that perhaps been reversed? And, yes, it unfortunately still crashes when running the above BASIC program twice… (each time it displays the same numbers) |
Jeffrey Lee (213) 6048 posts |
No, you’re downloading it from the right location. I’m not sure off the top of my head where *FX0 gets the date from (probably from the UtilityModule), but it’s not from somewhere that gets updated for every ROM that’s built. That’s probably something we should change, as it could easily result in confusion. We’ve standardised on using odd version numbers for development ROM images, so it shouldn’t be too hard to make *FX0 report a different date if it detects a development build (I’m fairly certain the ROM linker places a date in the checksum at the end of the ROM image, so that would be a good candidate). In the meantime, the easiest way to work out when the ROM image was built/downloaded is just to check the timestamp of the ROM file!
It’ll boot into the desktop if you have a USB mass storage device connected which SCSIFS treats as a removable device (i.e. it appears as SCSI::0). For non-removable devices like hard discs you’ll currently need to edit the default CMOS settings and compile your own ROM image. Obviously this will get fixed once we finally write some code for handling CMOS/NVRAM settings! |
Terje Slettebø (285) 275 posts |
Thanks for the info. Having experimented a little more, I’ve found the following: 1) Repeatedly resizing the screen buffer (back and forth) in the same program does not cause the system to hang. However, running this program again hangs the system on the first call to OS_ChangeDynamicArea. 2) If the screen buffer is altered to its desired size or above, in the Task Manager, prior to running the program (leading the program to reduce the allocated memory instead of increasing it), then even if it has run before, it will not hang the system. It seems the only current workaround is the latter one: manually increasing the screen memory in the Task Manager prior to running the program, every time the program is run. Of course, this is rather cumbersome, but it’s at least easier than a complete reboot every second run… |
Sprow (202) 1158 posts |
The *FX0 text is manually set in the kernel and becomes the date/version of the utility module as noted. What you want is SYS "OS_ReadSysInfo",9,1 TO builddate$ |
Jeffrey Lee (213) 6048 posts |
This bug’s proving to be a bit tricky – the machine seems to be hanging somewhere with FIQs disabled, or the processor vectors are getting trashed, so the tricks I usually use to find out what’s going on aren’t working. And rolling back my most recent Kernel changes doesn’t make the problem go away! |
Jeffrey Lee (213) 6048 posts |
After spending a few hours stepping through code, I think I’ve found the cause of the problem. It looks like it’s a flaw in the way that the AMBControl stuff works. AMBControl seems to have only two behaviours when an abort is encountered:
I’m still not 100% sure why OS_ChangeDynamicArea does what it does, but in the situation where it’s failing it’s attempting to clean the ‘nowhere’ page from the cache. This triggers a data abort due to the page having no mapping. AMBControl correctly detects the page as not being one of its own, and (assuming that the environment abort handler is about to be triggered) attempts to map in all of the other pages, using a fairly dumb piece of code that just LDRs one word from each page, relying on nested aborts to trigger any missing pages to be mapped in. This piece of code doesn’t even bother checking to see if the page is already marked as mapped in. Except that, for one reason or another, one of the pages which AMBControl thinks is mapped in, actually isn’t. This causes the above-mentioned loop to trigger an abort for that page, which triggers AMBControl again. But since AMBControl thinks the page is mapped in, instead of attempting to map it in it starts running through the dumb loop again, causing the machine to hang in a runaway sequence of nested aborts. So:
A few extra observations:
|
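The runaway sequence of nested aborts described above can be illustrated with a toy model. This is a Python sketch, not the Kernel's actual code: the class `ToyAMB`, its fields, and the depth limit are all invented for illustration, and the model ignores the fact that a real abort handler retries the faulting instruction. It only captures the two behaviours described in the post: map in a page that AMBControl knows is missing, or otherwise touch every page and rely on nested aborts.

```python
class RunawayAbort(Exception):
    """Raised by the model where the real machine would simply hang."""


class ToyAMB:
    def __init__(self, pages, max_depth=50):
        # AMBControl's own bookkeeping: does it believe each page is mapped in?
        self.believed_mapped = {p: False for p in pages}
        # What the page tables actually contain.
        self.really_mapped = set()
        self.max_depth = max_depth

    def load(self, page, depth=0):
        """Model a one-word load: it aborts if the page is not really mapped."""
        if page not in self.really_mapped:
            self.abort(page, depth + 1)

    def abort(self, page, depth):
        if depth > self.max_depth:
            raise RunawayAbort(f"nested abort depth exceeded {self.max_depth}")
        if page in self.believed_mapped and not self.believed_mapped[page]:
            # One of our pages, recorded as missing: map it in and return.
            self.really_mapped.add(page)
            self.believed_mapped[page] = True
            return
        # Not our page, or a page we believe is already mapped: touch every
        # page without checking the bookkeeping, relying on nested aborts.
        for p in self.believed_mapped:
            self.load(p, depth)


def run_cache_clean(amb):
    """Return True if touching the 'nowhere' page completes, False if it runs away."""
    try:
        amb.load("nowhere")
        return True
    except RunawayAbort:
        return False


if __name__ == "__main__":
    healthy = ToyAMB(["A", "B", "C"])
    print(run_cache_clean(healthy))

    broken = ToyAMB(["A", "B", "C"])
    broken.believed_mapped["B"] = True  # recorded as mapped, but it is not
    print(run_cache_clean(broken))
```

With consistent bookkeeping the loop maps every page in and returns. With a single page recorded as mapped but absent from the page tables, each pass through the loop aborts on that page and starts the loop again, which is the hang described in the post.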
Jeffrey Lee (213) 6048 posts |
I think I’ve worked out why AMBControl is going wrong. ChangeDynamicArea builds a list of page mappings on the stack, creating a list of what pages need to be moved where. At the time it builds the mappings, some of the pages in the AMB aren’t mapped in, i.e. they are mapped to ‘nowhere’. Even though the page isn’t mapped in the OS knows that it’s in use, so it temporarily maps the page in and copies the contents to a replacement page, then releases the temporary mapping. Then the code foolishly tries to flush the page from the cache, using its real address (i.e. the ‘nowhere’ address since the page isn’t mapped in). This triggers an abort, which causes AMBControl to panic and map in all of its pages. Once the cache clean is complete, the code calls BangCamUpdate, which performs the actual page table/CAM updates, to move the page to its new location in the screen DA. But BangCamUpdate does its own check for whether the page is currently mapped in, causing it to detect the copy of the page that AMBControl just mapped in. To avoid the page being doubly mapped it unmaps the page, leading to the situation where AMBControl thinks the page is mapped in but the CPU doesn’t. Then the next time ChangeDynamicArea tries to clean the nowhere page, everything dies horribly. I think the easiest way to fix this at the moment is to do the following:
Another option (perhaps for when I finish up the abort handler reworking for the unaligned load/store handler) would be to split the AMBControl code in two parts – one part which maps in an aborting page, and another part which maps in all pages just before the environment handler gets called. That way it should play nice with any other handlers that deal with expected aborts. |
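The order of events in this post (temporary copy, aborting cache clean, then BangCamUpdate) can be traced step by step. The sketch below is a Python illustration only: the two booleans and the function names are assumptions standing in for AMBControl's bookkeeping and the real page tables, and the step labels paraphrase the post rather than quote Kernel routines.

```python
def trace_desync():
    """Follow one AMB page P through the sequence described in the post.

    Each log entry is (step, amb_believes_mapped, cpu_has_mapping).
    """
    state = {"amb_believes_mapped": False, "cpu_has_mapping": False}
    log = []

    def record(step):
        log.append((step, state["amb_believes_mapped"], state["cpu_has_mapping"]))

    # P belongs to the AMB but is currently mapped to 'nowhere'.
    record("start: P is at 'nowhere'")

    # The OS copies P via a temporary mapping, then releases it.
    # Neither view of P's application mapping changes.
    record("contents copied via temporary mapping")

    # The cache clean uses the 'nowhere' address and aborts, so
    # AMBControl maps in all of its pages, including P.
    state["amb_believes_mapped"] = True
    state["cpu_has_mapping"] = True
    record("abort during cache clean: AMBControl maps P in")

    # BangCamUpdate sees P mapped and unmaps it before moving it to the
    # screen area. AMBControl's record is not updated.
    state["cpu_has_mapping"] = False
    record("BangCamUpdate unmaps P")

    return log


def is_consistent(entry):
    """True when AMBControl's belief matches the real mapping."""
    _, believes, actual = entry
    return believes == actual


if __name__ == "__main__":
    for entry in trace_desync():
        print(entry, "consistent" if is_consistent(entry) else "MISMATCH")
```

Only the final step leaves the two views disagreeing, which is the state that makes the next clean of the 'nowhere' page fatal.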
Jeffrey Lee (213) 6048 posts |
...and those fixes are now in CVS! |
Terje Slettebø (285) 275 posts |
Excellent work! Most of that went over my head, but I say as Admiral Benson: “I don’t have a clue what you’re talkin’ about, Phil. Not a fucking clue. [...] you just go ahead and do what you do.” :) If my above test code no longer crashes with the CVS version, could you build a new version of the ROM and either put it on the upload page or mail it to me? The reason I’m asking is that I haven’t yet got my system set up for building the ROM locally. |
Jeffrey Lee (213) 6048 posts |
Here you go: http://www.phlamethrower.co.uk/misc2/riscos.zip It’s completely untested though – I left RPCEmu building overnight since it was taking so long! |
Terje Slettebø (285) 275 posts |
Hi Jeffrey. Thanks a lot. I’m currently on vacation, but I’ll test this when I get back. Regards, Terje |
Terje Slettebø (285) 275 posts |
Sorry for the late reply. I’ve now got around to testing this, and it works perfectly, thanks! :) |