Edit, the VDU driver, and ARMv7 cache maintenance aborts

12 posts, 4 voices

Feb 25, 2010 1:42am Jeffrey Lee (213) 6048 posts	Here’s a fun one. While testing out the new OMAP video driver, I discovered that if you have the wimp in mode 25, open an !Edit window, hit F12, and enter `*wimpmode 28`, the machine will get stuck in an infinite abort loop. I haven’t tracked it all down yet, but it looks like there are at least three bugs at work (oh, joy):

Something in Edit is causing the MMU_ChangingEntry cache op to be called for the page at &FAFF8000 – except that page doesn’t seem to exist, and (I believe) should never exist. This isn’t necessarily a serious problem, but…

...although I can’t find any mention of it in the manual, it looks like ARMv7 MVA-based cache maintenance operations cause an abort if the destination page doesn’t exist. Apart from being a potential pain in the ass if we want to retain the ability to safely clean invalid addresses, this causes the first abort.

Following that abort, the ShellCLI module tries to print out the error message, only to end up crashing in the VDU code (specifically Wrch1bit), when it tries to write to screen memory. Although I can’t be sure, it looks like it was trying to write to an address that would be valid for the 8bpp mode 28, but invalid for the 1bpp mode 25. This second abort then invokes the ShellCLI error reporting code again, which results in another abort, etc.

I don’t suppose there are any fools out there who want to take over looking into all this while I concentrate on the much more enjoyable pursuit of writing the video driver? ;)

Feb 25, 2010 7:10am John-Mark Bell (94) 36 posts	Well, &FAFF8000 is “Nowhere” on 32-bit HAL builds of the kernel; see the top of NewReset.s in the kernel sources. No page should ever exist there.

Section B3.4.2 of DDI0406B suggests that aborts may happen when performing cache maintenance operations with MVA:

“For maximum portability, ARM recommends that operating systems always provide an abort handler to process Data Abort exceptions on instruction cache maintenance operations by MVA, even though some ARMv7 implementations might not be capable of generating these aborts.”

This suggests to me that the least invasive fix would be to modify the kernel’s data abort handler to catch these kinds of aborts and recover appropriately.

As for the ShellCLI stuff, it sounds to me like the abort is happening mid-mode change, so it’s confused :)
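For anyone following along, the "catch these kinds of aborts" idea can be pictured as a check on the instruction that faulted: is it one of the ARMv7 CP15 c7 maintenance-by-MVA writes? The sketch below is only an illustration in C, not the kernel's handler (which is written in assembler and isn't shown in this thread). The function name and the particular set of operations accepted are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if 'instr' is an ARM-state
 * "MCR p15, 0, Rt, c7, CRm, 1" instruction, i.e. one of the ARMv7
 * cache maintenance operations that take an MVA in Rt. */
bool is_cache_maint_by_mva(uint32_t instr)
{
    /* MCR layout: cond[31:28] 1110[27:24] opc1[23:21] 0[20]
     * CRn[19:16] Rt[15:12] coproc[11:8] opc2[7:5] 1[4] CRm[3:0] */
    const uint32_t mask  = 0x0FFF0FF0u; /* all fields except cond, Rt, CRm */
    const uint32_t match = 0x0E070F30u; /* p15, opc1=0, CRn=c7, opc2=1 */

    if ((instr & mask) != match)
        return false;
    if ((instr >> 28) == 0xFu)          /* cond=1111 is not a plain MCR */
        return false;

    switch (instr & 0xFu) {             /* CRm selects the operation */
    case 5:   /* ICIMVAU:  invalidate I-cache line by MVA to PoU */
    case 6:   /* DCIMVAC:  invalidate D-cache line by MVA to PoC */
    case 10:  /* DCCMVAC:  clean D-cache line by MVA to PoC */
    case 11:  /* DCCMVAU:  clean D-cache line by MVA to PoU */
    case 14:  /* DCCIMVAC: clean and invalidate D-cache line by MVA */
        return true;
    default:
        return false;
    }
}
```

A handler built along these lines would fetch the word at the aborting PC, call something like this, and only then decide whether the abort can be quietly skipped. An ordinary load or store that hits the same unmapped page would still be reported as a normal data abort.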

Feb 25, 2010 1:44pm Jeffrey Lee (213) 6048 posts	Cheers JMB. It shouldn’t be too hard to modify the abort handler to detect the cache cleaning ops – and with a bit of work I could even get the handler to skip the entire page if it recognises that the abort came from one of the kernel cache ops.

Looking at that section of the manual a bit further, it looks like we may be able to make some further improvements to the cache cleaning code – specifically, if “PIPT data & unified caches” means what I think it does, it means we don’t have to worry about flushing the data cache at all when remapping pages. This would hopefully get rid of the several-second delay that occurs when starting the WIMP (when I had a quick look at it a while ago I tracked it down to the kernel remapping and cleaning lots of pages one at a time – but I didn’t look much further to work out how difficult it would be to change the dynamic area handler to use a more optimal solution for remapping large numbers of pages).

Now all I need to do is find an easy way to track down the bit of the mode change code that leaves the mode variables in an inconsistent state :( I suppose the easiest way would be to trigger a breakpoint when an attempt is made to clean the “nowhere” page – that way I’ll be able to find out what Edit’s up to at the same time.
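The "skip the entire page" idea boils down to rounding the faulting address up to the next page boundary and resuming the clean loop from there. A minimal sketch, assuming 4KB pages; the constant and function name are made up for illustration and the real kernel cache ops are in assembler.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumption: 4KB small pages */

/* Given the address a clean-by-MVA loop faulted on, return the address
 * at which the loop could resume: the start of the following page. */
uint32_t resume_after_unmapped_page(uint32_t fault_addr)
{
    return (fault_addr & ~(PAGE_SIZE - 1u)) + PAGE_SIZE;
}
```

For the page in question, a fault anywhere in &FAFF8000 to &FAFF8FFF would resume at &FAFF9000, so the handler takes one abort per unmapped page rather than one per cache line.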

Feb 25, 2010 10:01pm Jeffrey Lee (213) 6048 posts	“I suppose the easiest way would be to trigger a breakpoint when an attempt is made to clean the “nowhere” page – that way I’ll be able to find out what Edit’s up to at the same time.”

After trying this, I’ve discovered that the cache cleaning is being triggered by the OS_ChangeDynamicArea call that attempts to grow screen memory. It must just be some side-effect of having Edit loaded that causes the page tables to be arranged in such a way that OS_ChangeDynamicArea decides to try cleaning the nowhere page; although I’m no expert on the code, it looks like some or all of the memory it wants to add to the screen dynamic area had previously been mapped to ‘nowhere’, and it’s (perhaps erroneously) attempting to clean the nowhere region before remapping the pages.

Also of interest is that the VduWrch aborts seem to be a result of the way screen memory is doubly-mapped. The PreGrow handler shifts the first mapping down to make way for the new pages, but the cursor address (and all the other VDU address variables) don’t get updated – and since the new mode uses 8x as much memory as the previous one, this means that none of them are pointing to valid addresses. This could probably be fixed without too much hassle (e.g. make sure they point at the start of the second mapping of screen memory), but since OS_ChangeDynamicArea shouldn’t abort in the first place it probably isn’t worth the hassle/risk.
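The "8x as much memory" figure follows directly from the pixel depths. A quick sketch of the arithmetic, taking both modes as 640x480 with tightly packed rows (the thread doesn't state the resolution, so treat that as an assumption); the function name is illustrative only.

```c
#include <stdint.h>

/* Bytes of screen memory for a packed-pixel mode with no padding
 * between rows. */
uint32_t screen_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return (width * height * bits_per_pixel) / 8u;
}
```

Under that assumption, `screen_bytes(640, 480, 1)` gives 38400 for mode 25 and `screen_bytes(640, 480, 8)` gives 307200 for mode 28, a factor of eight.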

Feb 26, 2010 10:00pm Steve Revill (20) 1394 posts	Good investigative work there. I think you might want to at least add a comment in the sources relating to the latter – save someone the same investigation work if it needs fixing later.

Feb 27, 2010 6:08pm Jeffrey Lee (213) 6048 posts	“Good investigative work there.”

It isn’t over yet! After writing the abort handler I was still getting stuck in an infinite abort loop. After a few more hours of investigation, it looked like the OS_ChangeDynamicArea + abort handler combination resulted in IRQs being left disabled long enough for multiple IRQs to be queued. Then when IRQs were enabled again, something was going “bang” and corrupting the stack (in some cases causing the PC to be loaded with the PSR, in other cases causing it to be loaded with 0, and in other cases causing it to be loaded with some other value).

After some more investigation I managed to track the location of the crash down to PointerV, some time after calling the USB driver’s handler. Although I’m still none the wiser as to when and why the stack corruption is occurring, I have at least found a bug that, when fixed, prevents the crash from occurring (both in my “disable IRQs for a while” testbed and the real-world *wimpmode test) – the pointerv() implementation in USBDriver assumed that interrupts were enabled on entry, and unconditionally called _kernel_irqs_on() after performing its atomic pointer position update (see here, around line 419). This assumption was obviously false for the situation in which the code was failing (video IRQ → video driver → GraphicsV_VSync → kernel → PollPointer in Kernel.s.PMF.mouse → PointerV_Request).

Looking at the code a bit further I can see that PollPointer restores the full PSR after calling PointerV – which must mean that the stack corruption occurs between USBDriver enabling IRQs and PointerV disabling them again (which makes sense, since the crash was occurring with R14 pointing to the loop in OS_CallAVector). But more importantly, it gives me a window in which to enable more comprehensive debugging code – so that I can find out exactly what interrupts occur during the window, and then hopefully work out where the stack corruption comes from.
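The pointerv() problem is the classic "enable on exit" versus "restore on exit" mistake. The sketch below models it on a host machine with a fake IRQ flag, so every `sim_` function and both `update_pointer_` functions are hypothetical stand-ins rather than the USBDriver source. The real code calls _kernel_irqs_on(), as described above.

```c
#include <stdbool.h>

/* Stand-in for the CPU's IRQ-disable flag so the sketch runs anywhere. */
static bool irq_disabled = false;

static bool sim_irqs_disabled(void) { return irq_disabled; }
static void sim_irqs_off(void)      { irq_disabled = true; }
static void sim_irqs_on(void)       { irq_disabled = false; }

static int pointer_x, pointer_y;

/* Buggy pattern: assumes IRQs were enabled on entry, so always
 * re-enables them on exit. */
void update_pointer_unconditional(int x, int y)
{
    sim_irqs_off();
    pointer_x = x;
    pointer_y = y;
    sim_irqs_on();
}

/* Safer pattern: remember the state on entry and only re-enable
 * IRQs if they were enabled to begin with. */
void update_pointer_restoring(int x, int y)
{
    bool were_disabled = sim_irqs_disabled();

    sim_irqs_off();
    pointer_x = x;
    pointer_y = y;
    if (!were_disabled)
        sim_irqs_on();
}
```

If the caller arrives with IRQs already off (as in the video IRQ path above), the first version hands control back with IRQs on, while the second leaves them as it found them.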

Feb 27, 2010 6:13pm Jeffrey Lee (213) 6048 posts	And thinking about it even further, the fact that I was still getting stuck in an abort loop suggests that after OS_ChangeDynamicArea finishes and re-enables interrupts the VDU drivers must still be out of sync with the memory map, otherwise the error report wouldn’t have triggered any further aborts. Bah.

Feb 27, 2010 6:45pm Jeffrey Lee (213) 6048 posts	OK, forget that last bit – it looks like (with the abort handler) *wimpmode only gives me problems if I have some of my debug code enabled. With the debug code disabled it looks like everything is fine, and it’s only my testbed that’s able to trigger the PointerV crash. This means I should be able to forget about my current theory that something turns interrupts on before the VDU driver updates its pointers/finishes remapping memory, and it’s only the PointerV/stack corruption that needs further investigation at this point in time.

Feb 27, 2010 10:26pm Steve Revill (20) 1394 posts	I remember the days when I was actually able to indulge in this kind of work. Now I spend all day (and night) on conference calls and updating fault report tickets. :( Good luck.

Feb 28, 2010 12:08am Jeffrey Lee (213) 6048 posts	Well you’ll be pleased to hear it wasn’t too hard to track down. Turned out I’d forgotten to preserve R14_svc before calling a SWI from IRQ mode :(
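For readers who haven't been bitten by this one: a SWI enters SVC mode and writes its return address into R14_svc, so if the IRQ interrupted SVC-mode code, that code's link register is lost unless it was saved first. The toy model below only illustrates the effect in C; the struct, the function names, and the addresses are invented. On real hardware the save has to be done in assembler after switching into SVC mode, since R14_svc is a banked register.

```c
#include <stdint.h>

/* Toy model of the one banked register that matters here. */
typedef struct {
    uint32_t r14_svc;
} cpu_model;

/* Taking a SWI overwrites R14_svc with the SWI's return address. */
static void take_swi(cpu_model *cpu, uint32_t return_addr)
{
    cpu->r14_svc = return_addr;
}

/* Handler that calls a SWI without saving R14_svc: the interrupted
 * SVC-mode code's return address is clobbered. */
void irq_handler_unsafe(cpu_model *cpu)
{
    take_swi(cpu, 0x1000u);
}

/* Handler that preserves R14_svc around the SWI. */
void irq_handler_safe(cpu_model *cpu)
{
    uint32_t saved = cpu->r14_svc;

    take_swi(cpu, 0x1000u);
    cpu->r14_svc = saved;
}
```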

Nov 4, 2010 1:37am Trevor Johnson (329) 1645 posts	I don’t know if this is related, but I’ve just filed Ticket #256

Nov 17, 2010 8:58am Trevor Johnson (329) 1645 posts	Thanks for investigating this and fixing it :-)
