OS_ChangeDynamicArea freezes the system

16 posts, 3 voices

Jun 29, 2010 11:58pm Terje Slettebø (285) 275 posts	Hi all. I’m working on a small graphics demo for the BeagleBoard, which uses double-buffering for the screen, so I’m using OS_ChangeDynamicArea to set the screen memory to at least twice the size of a screen buffer. This works fine the first time the call is made, but running the program again freezes the system. Here’s a BASIC program to demonstrate the problem:
    SYS "OS_ReadDynamicArea",2 TO ,size%
    new_size%=12807204*2-size%
    PRINT "size, new size=",size%,new_size%
    SYS "OS_ChangeDynamicArea",2,new_size%
    PRINT "Done"
Could somebody confirm that the problem happens for them, as well? What could be the cause of this, and might there be some other way to achieve the same effect (changing the available screen memory)? If I drag the slider in the Task Manager, it’s still reset to the minimal value when the application is run. Regards, Terje
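A note on the arithmetic in the program above: OS_ChangeDynamicArea takes a signed change in size rather than an absolute size, so the value passed is the target size minus the current size. A minimal sketch of that calculation follows, written in Python purely as a model. The function name and the 1280x720, 32bpp figures are assumptions for illustration and are not taken from the post.

```python
def screen_area_change(current_size: int, buffer_bytes: int, buffers: int = 2) -> int:
    """Signed byte change to request so the area holds `buffers` screen buffers."""
    target = buffer_bytes * buffers
    return target - current_size


# Hypothetical figures: a 1280x720 mode at 4 bytes per pixel.
buffer_bytes = 1280 * 720 * 4

# Area currently holds one buffer: the change is positive, so the area must grow.
print(screen_area_change(buffer_bytes, buffer_bytes))

# Area was enlarged beforehand: the change is negative, so the area would shrink.
print(screen_area_change(3 * buffer_bytes, buffer_bytes))
```

The sign matters later in the thread: the hang is only reported when the call has to grow the screen area, not when it shrinks it.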

Jun 30, 2010 1:01am Jeffrey Lee (213) 6048 posts	It looks like it might be a bug in one of the ‘fixes’ I made to the cache/TLB maintenance ops a week or two ago, since the machine also freezes if I follow the steps from this bug. Changing the abort handler to ignore aborts caused by all CP15 MCR ops doesn’t seem to fix the issue, so it must be a regular abort caused by some bad code. But at the same time, I can’t see anything obviously wrong with the new cache/TLB code. I’ll have a proper look tomorrow.

Jun 30, 2010 8:48am Terje Slettebø (285) 275 posts	Thanks for the reply. I forgot to mention: I’m running an older version of RISC OS from a couple of months back, so this problem can’t be due to changes done a few weeks ago. When I get home tonight, I’ll update to the latest version, and see if that changes anything. Could you confirm if the above BASIC program also freezes your system when run twice?

Jun 30, 2010 9:28am Jeffrey Lee (213) 6048 posts	I didn’t really test the BASIC program properly – I did it all from the command line, so the second OS_ChangeDynamicArea call didn’t have to move any memory at all.
> I’m running an older version of RISC OS from a couple of months back, so this problem can’t be due to changes done a few weeks ago.
How old is the ROM image? If it’s from before March then it won’t contain the initial fix for the aborting MCR ops, which is the only reason I can think of why the code could crash.

Jun 30, 2010 6:04pm Terje Slettebø (285) 275 posts	> How old is the ROM image?
That’s hard to tell, because both the one I had and the new one I just downloaded say “RISC OS 5.17 (19 Jan 2010)” when using *FX0. A couple of questions:
1) Am I doing something wrong, such as downloading the wrong RISC OS image? If I’m getting the right one, how come the version or date hasn’t been updated to reflect the new version?
2) I read a while back that it would now boot into the Desktop, but it doesn’t. Has that perhaps been reversed?
And, yes, it unfortunately still crashes when running the above BASIC program twice… (each time it displays the same numbers)

Jun 30, 2010 6:32pm Jeffrey Lee (213) 6048 posts	> 1) Am I doing something wrong, such as downloading the wrong RISC OS image? If I’m getting the right one, how come the version or date hasn’t been updated to reflect the new version?
No, you’re downloading it from the right location. I’m not sure off the top of my head where FX0 gets the date from (probably from the UtilityModule), but it’s not from somewhere that gets updated for every ROM that’s built. That’s probably something we should change, as it could easily result in confusion. We’ve standardised on using odd version numbers for development ROM images, so it shouldn’t be too hard to make FX0 report a different date if it detects a development build (I’m fairly certain the ROM linker places a date in the checksum at the end of the ROM image, so that would be a good candidate).
In the meantime, the easiest way to work out when the ROM image was built/downloaded is just to check the timestamp of the ROM file!
> 2) I read a while back that it would now boot into the Desktop, but it doesn’t. Has that perhaps been reversed?
It’ll boot into the desktop if you have a USB mass storage device connected which SCSIFS treats as a removable device (i.e. it appears as SCSI::0). For non-removable devices like hard discs you’ll currently need to edit the default CMOS settings and compile your own ROM image. Obviously this will get fixed once we finally write some code for handling CMOS/NVRAM settings!
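The "odd version number means development build" convention mentioned above can be sketched as a simple check on the banner text. This is a Python illustration only: the function name is hypothetical, and it parses the banner string quoted earlier in the thread rather than anything the kernel itself exposes.

```python
import re


def is_development_build(banner: str) -> bool:
    """True if the minor version number in a banner such as the *FX0 text is odd."""
    match = re.search(r"RISC OS (\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no version number found in {banner!r}")
    minor = int(match.group(2))
    return minor % 2 == 1


# The banner reported in the thread has minor version 17, which is odd.
print(is_development_build("RISC OS 5.17 (19 Jan 2010)"))
```

A check like this only says that the ROM is a development build; it says nothing about when that particular image was built, which is why the file timestamp is suggested as the practical answer.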

Jun 30, 2010 8:21pm Terje Slettebø (285) 275 posts	Thanks for the info. Having experimented a little more, I’ve found the following:
1) Repeatedly resizing the screen buffer (back and forth) in the same program does not cause the system to hang. However, running this program again hangs the system on the first call to OS_ChangeDynamicArea.
2) If the screen buffer is set to its desired size or above in the Task Manager prior to running the program (leading the program to reduce the allocated memory instead of increasing it), then even if it has run before, it will not hang the system.
It seems the only current workaround is the latter one: manually increasing the screen memory in the Task Manager prior to running the program, every time the program is run. Of course, this is rather cumbersome, but it’s at least easier than a complete reboot every second run…

Jun 30, 2010 9:40pm Sprow (202) 1158 posts	> I’m not sure off the top of my head where FX0 gets the date from (probably from the UtilityModule), but it’s not from somewhere that gets updated for every ROM that’s built. That’s probably something we should change
The FX0 text is manually set in the kernel and becomes the date/version of the utility module as noted. What you want is:
    SYS "OS_ReadSysInfo",9,1 TO builddate$

Jul 1, 2010 12:20am Jeffrey Lee (213) 6048 posts	This bug’s proving to be a bit tricky – the machine seems to be hanging somewhere with FIQs disabled, or the processor vectors are getting trashed, so the tricks I usually use to find out what’s going on aren’t working. And rolling back my most recent Kernel changes doesn’t make the problem go away!

Jul 3, 2010 3:13pm Jeffrey Lee (213) 6048 posts	After spending a few hours stepping through code, I think I’ve found the cause of the problem. It looks like it’s a flaw in the way that the AMBControl stuff works. AMBControl seems to have only two behaviours when an abort is encountered:
- Map in the aborting page, because it was a member of an AMB
- Map in all pages of the current AMB, in preparation for running the environment abort handler (to prevent recursive aborts if the abort handler was in application space)
I’m still not 100% sure why OS_ChangeDynamicArea does what it does, but in the situation where it’s failing it’s attempting to clean the ‘nowhere’ page from the cache. This triggers a data abort due to the page having no mapping. AMBControl correctly detects the page as not being one of its own, and (assuming that the environment abort handler is about to be triggered) attempts to map in all of the other pages, using a fairly dumb piece of code that just LDRs one word from each page, relying on nested aborts to trigger any missing pages to be mapped in. This piece of code doesn’t even bother checking to see if the page is already marked as mapped in.
Except that, for one reason or another, one of the pages which AMBControl thinks is mapped in, actually isn’t. This causes the above-mentioned loop to trigger an abort for that page, which triggers AMBControl again. But since AMBControl thinks the page is mapped in, instead of attempting to map it in it starts running through the dumb loop again, causing the machine to hang in a runaway sequence of nested aborts.
So:
Why is the kernel attempting to clean the nowhere page in the first place?
It’s coming from line 3900 of (the Cortex branch of) ChangeDyn.s. As a rough guess (i.e. without spending hours examining page tables over the serial console!) the screen DA handler requested a physical page which is currently owned by an inactive Wimp task. This means the page will be mapped to ‘nowhere’ in the CAM map. OS_ChangeDynamicArea will detect that the page isn’t part of the free pool, so the code around line 3900 will be triggered, to copy the contents of the page elsewhere before adding it to the screen DA. So I guess the cleaning of the nowhere page is just a simple bug which nobody’s bothered to fix in the past because it never aborted.
Why does AMBControl think that a page is mapped in, but in reality it isn’t?
This one is a bit trickier to try and work out, since if OS_ChangeDynamicArea is swapping one page for another it doesn’t leave any gaps where the logical page is left unmapped. Looking at the AMBControl code I can see that it marks the page as being mapped in before actually mapping it in, so if AMBControl aborts while attempting to map in a page then it would result in failure. But as I say I’m not really sure if this is possible, since ChangeDynamicArea doesn’t leave any windows where a previously valid logical page is left unmapped. I have a feeling I’ll only be able to solve this one by trying to track down where the missing page has been moved to, and why.
A few extra observations:
- The CAM mapping is always kept in sync with the page tables, and it’s only AMBControl which knows that there are lazily mapped pages which need mapping into the CAM map/page table.
- Looking at the code, I can see that physical pages which aren’t currently mapped in get mapped to ‘nowhere’ in the CAM.
- Looking at the code in s.AMBControl.service, it looks like, for the situation where an AMB has had some of its pages replaced, AMBControl only updates its page mappings when Service_PagesSafe is called. Except that service call is only made after all the pages have been moved by OS_ChangeDynamicArea. This could cause AMBControl to map in old pages by accident, but due to the way the dumb loop operates I don’t think it will ever cause an updated page to be overwritten with its old mapping. However it wouldn’t surprise me if there’s a bug somewhere in this behaviour.
- The lazy map in code doesn’t perform any TLB maintenance after mapping in the page! This could be bad if the TLB caches unmapped/invalid entries (not that I think that it does)
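The runaway sequence of nested aborts described in this post can be modelled in a few lines. This is a Python sketch under stated assumptions, not the AMBControl code: the class, its fields and the depth limit are all hypothetical, and a Python exception stands in for the machine hanging.

```python
class RunawayAborts(Exception):
    """Raised when nested aborts exceed the model's depth limit."""


class AmbModel:
    def __init__(self, pages, max_depth=50):
        self.pages = list(pages)       # logical pages owned by the AMB
        self.believed_mapped = set()   # AMBControl's own bookkeeping
        self.really_mapped = set()     # what the page tables actually hold
        self.max_depth = max_depth
        self.depth = 0

    def load(self, page):
        """Model an LDR: it aborts if the page tables have no mapping."""
        if page not in self.really_mapped:
            self.abort(page)

    def abort(self, page):
        """Model the two behaviours of the abort handler."""
        self.depth += 1
        try:
            if self.depth > self.max_depth:
                raise RunawayAborts(f"gave up after {self.max_depth} nested aborts")
            if page in self.pages and page not in self.believed_mapped:
                # Behaviour 1: lazily map in the aborting page.
                self.believed_mapped.add(page)
                self.really_mapped.add(page)
            else:
                # Behaviour 2: touch every page, relying on nested aborts
                # to map in whichever ones are missing.
                for p in self.pages:
                    self.load(p)
        finally:
            self.depth -= 1


# Consistent bookkeeping: an abort on a foreign page maps everything in.
healthy = AmbModel(pages=[0, 1, 2])
healthy.abort("nowhere")
print(sorted(healthy.really_mapped))

# Page 1 is believed mapped but is not: the loop re-enters itself forever.
broken = AmbModel(pages=[0, 1, 2])
broken.believed_mapped.add(1)
try:
    broken.abort("nowhere")
except RunawayAborts as exc:
    print("hang:", exc)
```

In the model, the second case never reaches behaviour 1 for page 1, because the bookkeeping already claims the page is mapped; that is the same reason the real loop cannot recover.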

Jul 3, 2010 6:31pm Jeffrey Lee (213) 6048 posts	I think I’ve worked out why AMBControl is going wrong. ChangeDynamicArea builds a list of page mappings on the stack, creating a list of what pages need to be moved where. At the time it builds the mappings, some of the pages in the AMB aren’t mapped in, i.e. they are mapped to ‘nowhere’. Even though the page isn’t mapped in, the OS knows that it’s in use, so it temporarily maps the page in and copies the contents to a replacement page, then releases the temporary mapping. Then the code foolishly tries to flush the page from the cache, using its real address (i.e. the ‘nowhere’ address since the page isn’t mapped in). This triggers an abort, which causes AMBControl to panic and map in all of its pages.
Once the cache clean is complete, the code calls BangCamUpdate, which performs the actual page table/CAM updates, to move the page to its new location in the screen DA. But BangCamUpdate does its own check for whether the page is currently mapped in, causing it to detect the copy of the page that AMBControl just mapped in. To avoid the page being doubly mapped it unmaps the page, leading to the situation where AMBControl thinks the page is mapped in but the CPU doesn’t. Then the next time ChangeDynamicArea tries to clean the nowhere page, everything dies horribly.
I think the easiest way to fix this at the moment is to do the following:
- Fix ChangeDynamicArea so it doesn’t try cleaning the nowhere page!
- Move the cache/TLB MCR abort handler to just before the AMBControl handler, so that if a bug elsewhere causes an invalid page to be cleaned then AMBControl shouldn’t be triggered at all
Another option (perhaps for when I finish up the abort handler reworking for the unaligned load/store handler) would be to split the AMBControl code in two parts – one part which maps in an aborting page, and another part which maps in all pages just before the environment handler gets called. That way it should play nice with any other handlers that deal with expected aborts.
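The sequence in this post, and the effect of the first proposed fix, can be sketched as a small state model. This is Python for illustration only: the class, the method names other than the BangCamUpdate analogue, and the addresses are assumptions, and the model covers only the "don't clean the nowhere page" fix, not the reordering of the abort handlers.

```python
NOWHERE = None  # stand-in for the 'nowhere' address of an unmapped page


class Machine:
    def __init__(self, amb):
        self.amb = dict(amb)          # AMBControl's view: logical address -> physical page
        self.believed_mapped = set()  # logical addresses AMBControl has marked as mapped in
        self.page_table = {}          # real mappings: logical address -> physical page

    def address_of(self, phys):
        for addr, page in self.page_table.items():
            if page == phys:
                return addr
        return NOWHERE

    def amb_map_all(self):
        # AMBControl's reaction to an abort on a page that is not its own.
        for addr, phys in self.amb.items():
            self.believed_mapped.add(addr)
            self.page_table[addr] = phys

    def clean_cache(self, addr):
        if addr is NOWHERE:
            self.amb_map_all()  # cleaning 'nowhere' aborts and triggers AMBControl

    def bang_cam_update(self, phys, new_addr):
        old = self.address_of(phys)
        if old is not NOWHERE:
            del self.page_table[old]  # avoid a double mapping; AMBControl is not told
        self.page_table[new_addr] = phys

    def claim_for_screen(self, phys, skip_nowhere_clean):
        addr_when_listed = self.address_of(phys)  # snapshot from the mapping list
        if not (skip_nowhere_clean and addr_when_listed is NOWHERE):
            self.clean_cache(addr_when_listed)
        self.bang_cam_update(phys, "screen")

    def desynced(self):
        """Addresses AMBControl believes are mapped but the page table lacks."""
        return {a for a in self.believed_mapped if a not in self.page_table}


buggy = Machine({0x8000: "P1", 0x9000: "P2"})
buggy.claim_for_screen("P1", skip_nowhere_clean=False)
print(sorted(hex(a) for a in buggy.desynced()))

fixed = Machine({0x8000: "P1", 0x9000: "P2"})
fixed.claim_for_screen("P1", skip_nowhere_clean=True)
print(sorted(hex(a) for a in fixed.desynced()))
```

In the buggy run the address that used to hold P1 ends up in AMBControl's "mapped" set with no real mapping behind it, which is the inconsistent state that makes the next abort fatal. With the clean skipped, AMBControl is never triggered and the two views stay in step.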

Jul 3, 2010 7:43pm Jeffrey Lee (213) 6048 posts	...and those fixes are now in CVS!

Jul 4, 2010 11:58am Terje Slettebø (285) 275 posts	Excellent work! Most of that went over my head, but I say as Admiral Benson: “I don’t have a clue what you’re talkin’ about, Phil. Not a fucking clue. [...] you just go ahead and do what you do.” :) If my above test code no longer crashes with the CVS version, could you build a new version of the ROM and either put it on the upload page or mail it to me? The reason I’m asking is that I haven’t yet got my system set up for building the ROM locally.

Jul 6, 2010 10:00am Jeffrey Lee (213) 6048 posts	Here you go: http://www.phlamethrower.co.uk/misc2/riscos.zip It’s completely untested though – I left RPCEmu building overnight since it was taking so long!

Jul 14, 2010 6:24pm Terje Slettebø (285) 275 posts	Hi Jeffrey. Thanks a lot. I’m currently on vacation, but I’ll test this when I get back. Regards, Terje

Sep 22, 2010 6:36pm Terje Slettebø (285) 275 posts	Sorry for the late reply. I’ve now got around to testing this, and it works perfectly, thanks! :)
