Edit, the VDU driver, and ARMv7 cache maintenance aborts
Jeffrey Lee (213) 6048 posts |
Here’s a fun one. While testing out the new OMAP video driver, I discovered that if you have the wimp in mode 25, open an !Edit window, hit F12, and enter I haven’t tracked it all down yet, but it looks like there are at least three bugs at work (oh, joy):
|
John-Mark Bell (94) 36 posts |
Well, &FAFF8000 is “Nowhere” on 32-bit HAL builds of the kernel; see the top of NewReset.s in the kernel sources. No page should ever exist there. Section B3.4.2 of DDI0406B suggests that aborts may happen when performing cache maintenance operations with MVA:
This suggests to me that the least invasive fix would be to modify the kernel’s data abort handler to catch these kinds of aborts and recover appropriately. As for the ShellCLI stuff, it sounds to me that the abort is happening mid-mode change, so it’s confused :) |
Jeffrey Lee (213) 6048 posts |
Cheers JMB. It shouldn’t be too hard to modify the abort handler to detect the cache cleaning ops, and with a bit of work I could even get the handler to skip the entire page if it recognises that the abort came from one of the kernel cache ops. Looking at that section of the manual a bit further, it looks like we may be able to make some further improvements to the cache cleaning code. Specifically, if “PIPT data & unified caches” means what I think it does, it means we don’t have to worry about flushing the data cache at all when remapping pages. This would hopefully get rid of the several-second delay that occurs when starting the WIMP. (When I had a quick look at it a while ago I tracked it down to the kernel remapping and cleaning lots of pages one at a time, but I didn’t look much further to work out how difficult it would be to change the dynamic area handler to use a more optimal solution for remapping large numbers of pages.) Now all I need to do is find an easy way to track down the bit of the mode change code that leaves the mode variables in an inconsistent state :( I suppose the easiest way would be to trigger a breakpoint when an attempt is made to clean the “nowhere” page; that way I’ll be able to find out what Edit’s up to at the same time. |
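For illustration, the "detect the cache cleaning ops" step could look roughly like the sketch below. It is written in C for readability; the real handler would be ARM assembly inside the kernel, and nothing here is taken from the kernel sources. It assumes the handler has already fetched the faulting ARM-state instruction word, and the function names are hypothetical. The encodings checked are the ARMv7 CP15 c7 operations that take an MVA (ICIMVAU, DCIMVAC, DCCMVAC, DCCMVAU, DCCIMVAC); Thumb encodings and the branch predictor op are not covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if 'instr' is an ARM-state
 * "MCR p15, 0, Rt, c7, CRm, 1" that performs cache maintenance by MVA. */
bool is_cache_op_by_mva(uint32_t instr)
{
    /* Condition field 0xF selects MCR2, which is not a CP15 cache op. */
    if ((instr >> 28) == 0xFu)
        return false;

    /* Match MCR (not MRC), coprocessor 15, opc1 = 0, CRn = c7.
     * Rt, opc2 and CRm are left out of the mask and checked below. */
    if ((instr & 0x0FFF0F10u) != 0x0E070F10u)
        return false;

    uint32_t opc2 = (instr >> 5) & 0x7u;
    uint32_t crm  = instr & 0xFu;

    /* All of the by-MVA cache operations use opc2 = 1. */
    if (opc2 != 1u)
        return false;

    switch (crm) {
    case 5:   /* ICIMVAU:  invalidate I-cache line to PoU        */
    case 6:   /* DCIMVAC:  invalidate D-cache line to PoC        */
    case 10:  /* DCCMVAC:  clean D-cache line to PoC             */
    case 11:  /* DCCMVAU:  clean D-cache line to PoU             */
    case 14:  /* DCCIMVAC: clean and invalidate D-cache line     */
        return true;
    default:
        return false;
    }
}

/* Small self-check using two well-known encodings. */
void cache_op_self_check(void)
{
    assert(is_cache_op_by_mva(0xEE070F3Au));   /* MCR p15,0,r0,c7,c10,1 */
    assert(!is_cache_op_by_mva(0xEE070F9Au));  /* MCR p15,0,r0,c7,c10,4 (CP15 DSB) */
}
```

A handler built on a check like this would still need to decide how to resume (for example, stepping past the faulting instruction, or skipping the rest of the page as suggested above); that part is deliberately left out.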
Jeffrey Lee (213) 6048 posts |
After trying this, I’ve discovered that the cache cleaning is being triggered by the OS_ChangeDynamicArea call that attempts to grow screen memory. It must just be some side-effect of having Edit loaded that causes the page tables to be arranged in such a way that OS_ChangeDynamicArea decides to try cleaning the nowhere page; although I’m no expert on the code, it looks like some or all of the memory it wants to add to the screen dynamic area had previously been mapped to ‘nowhere’, and it’s (perhaps erroneously) attempting to clean the nowhere region before remapping the pages. Also of interest is that the VduWrch aborts seem to be a result of the way screen memory is doubly-mapped. The PreGrow handler shifts the first mapping down to make way for the new pages, but the cursor address (and all the other VDU address variables) don’t get updated – and since the new mode uses 8x as much memory as the previous one, this means that none of them are pointing to valid addresses. This could probably be fixed without too much hassle (e.g. make sure they point at the start of the second mapping of screen memory), but since OS_ChangeDynamicArea shouldn’t abort in the first place it probably isn’t worth the hassle/risk. |
Steve Revill (20) 1361 posts |
Good investigative work there. I think you might want to at least add a comment in the sources relating to the latter – save someone the same investigation work if it needs fixing later. |
Jeffrey Lee (213) 6048 posts |
It isn’t over yet! After writing the abort handler I was still getting stuck in an infinite abort loop. After a few more hours of investigation, it looked like the OS_ChangeDynamicArea + abort handler combination resulted in IRQs being left disabled long enough for multiple IRQs to be queued. Then when IRQs were enabled again, something was going “bang” and corrupting the stack (in some cases causing the PC to be loaded with the PSR, in other cases causing it to be loaded with 0, and in other cases causing it to be loaded with some other value). After some more investigation I managed to track the location of the crash down to PointerV, some time after calling the USB driver’s handler. Although I’m still none the wiser as to when and why the stack corruption is occurring, I have at least found a bug that, when fixed, prevents the crash from occurring (both in my “disable IRQs for a while” testbed and the real-world *wimpmode test): the pointerv() implementation in USBDriver assumed that interrupts were enabled on entry, and unconditionally called _kernel_irqs_on() after performing its atomic pointer position update (see here, around line 419). This assumption was obviously false for the situation in which the code was failing (video IRQ → video driver → GraphicsV_VSync → kernel → PollPointer in Kernel.s.PMF.mouse → PointerV_Request). Looking at the code a bit further I can see that PollPointer restores the full PSR after calling PointerV, which must mean that the stack corruption occurs between USBDriver enabling IRQs and PointerV disabling them again (which makes sense, since the crash was occurring with R14 pointing to the loop in OS_CallAVector). But more importantly, it gives me a window in which to enable more comprehensive debugging code, so that I can find out exactly what interrupts occur during the window, and then hopefully work out where the stack corruption comes from. |
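The pointerv() problem described above comes down to a critical section that turns IRQs back on without checking whether they were on to begin with. The sketch below contrasts the two shapes. It is not the USBDriver code: the IRQ state is simulated with a plain flag so the example runs anywhere, and every name in it (sim_irqs_*, update_pointer_*) is a hypothetical stand-in. On RISC OS the real calls would be along the lines of _kernel_irqs_on() and its companions from the C library's kernel.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated IRQ-disable flag, standing in for the I bit of the PSR. */
bool sim_irq_disabled = false;

bool sim_irqs_disabled(void) { return sim_irq_disabled; }
void sim_irqs_off(void)      { sim_irq_disabled = true; }
void sim_irqs_on(void)       { sim_irq_disabled = false; }

/* Shared state that has to be updated atomically. */
int pointer_x, pointer_y;

/* Problematic shape: assumes IRQs were enabled on entry, so it
 * always leaves them enabled on exit. */
void update_pointer_unconditional(int x, int y)
{
    sim_irqs_off();
    pointer_x = x;
    pointer_y = y;
    sim_irqs_on();              /* wrong if the caller had IRQs off */
}

/* Safer shape: remember the entry state and only re-enable IRQs
 * if they were enabled when we were called. */
void update_pointer_preserving(int x, int y)
{
    bool was_disabled = sim_irqs_disabled();

    sim_irqs_off();
    pointer_x = x;
    pointer_y = y;
    if (!was_disabled)
        sim_irqs_on();
}

/* Demonstrates the difference when called with IRQs already off,
 * as happens when the call chain starts from an IRQ handler. */
void irq_state_demo(void)
{
    sim_irqs_off();
    update_pointer_unconditional(1, 2);
    assert(!sim_irqs_disabled());   /* IRQs were turned on behind the caller's back */

    sim_irqs_off();
    update_pointer_preserving(3, 4);
    assert(sim_irqs_disabled());    /* caller's state is intact */

    sim_irqs_on();                  /* leave the simulation in its default state */
}
```

With the unconditional version, a caller running with IRQs disabled gets them re-enabled partway through its own work, which is the window in which the stack corruption was observed.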
Jeffrey Lee (213) 6048 posts |
And thinking about it even further, the fact that I was still getting stuck in an abort loop suggests that after OS_ChangeDynamicArea finishes and re-enables interrupts the VDU drivers must still be out of sync with the memory map, otherwise the error report wouldn’t have triggered any further aborts. Bah. |
Jeffrey Lee (213) 6048 posts |
OK, forget that last bit – it looks like (with the abort handler) *wimpmode only gives me problems if I have some of my debug code enabled. With the debug code disabled it looks like everything is fine, and it’s only my testbed that’s able to trigger the PointerV crash. This means I should be able to forget about my current theory that something turns interrupts on before the VDU driver updates its pointers/finishes remapping memory, and it’s only the PointerV/stack corruption that needs further investigation at this point in time. |
Steve Revill (20) 1361 posts |
I remember the days when I was actually able to indulge in this kind of work. Now I spend all day (and night) on conference calls and updating fault report tickets. :( Good luck. |
Jeffrey Lee (213) 6048 posts |
Well you’ll be pleased to hear it wasn’t too hard to track down. Turned out I’d forgotten to preserve R14_svc before calling a SWI from IRQ mode :( |
Trevor Johnson (329) 1645 posts |
I don’t know if this is related, but I’ve just filed Ticket #256 |
Trevor Johnson (329) 1645 posts |
Thanks for investigating this and fixing it :-) |