Pi 3 shutting down-ish
Jon Abbott (1421) 2651 posts |
Via mailbox does not resolve the issue. If you’re running Jeffrey’s latest test build, it will be the reduced number of mailbox events that is substantially reducing the chance of it occurring. |
David Pitt (3386) 1248 posts |
The fault did not occur here with the first test build in three days. I have now reverted to that ROM for further testing. |
Jon Abbott (1421) 2651 posts |
After five hours left idle it didn’t blank; however, once I started using it, it blanked within minutes. So VCHIQ updates might be triggering it, but as I hadn’t RMKilled BCMSupport I can’t rule out a combination of the two. CPU/IO load does not appear to be a causal factor, but does appear to contribute, as I can force it to blank with heavy CPU/IO. As per the tests the other day, RMKilling BCMSound and Portable does substantially lower the number of times it blanks. What might be worth testing at this stage is a command-line app that floods VCHIQ with pointer updates, to see if it will repro the issue outside of the desktop.
What resolution are you running at? To force the issue, I’m running both the Pi and the Desktop at 1360×768×16M, then stressing the CPU/IO with the pointer showing. That triggers it within 2 mins in my case. If I turn the pointer off, the issue doesn’t occur and if I just generally use the desktop, it will randomly occur. |
Dominic Plunkett (2556) 34 posts |
I’m on my phone so can’t check the source, but is there a suitable data sync barrier before accessing the mailbox? Also, any IRQ or FIQ handler needs a data sync barrier at the beginning and end. |
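To illustrate the ordering being asked about, here is a minimal host-side sketch, not the RISC OS HAL code. The `fake_mailbox` struct and `mbox_post` are hypothetical names, the channel number 8 is the property-tag channel as I understand the firmware documentation, and the portable C11 fence merely stands in for the DSB instruction that bare-metal ARM code would issue at that point.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the mailbox registers. On real hardware these
   are memory-mapped registers at a SoC-specific address, not a plain struct. */
typedef struct {
    volatile uint32_t status;  /* bit 31 set = mailbox full */
    volatile uint32_t write;   /* message bus address | channel */
} fake_mailbox;

#define MBOX_FULL    0x80000000u
#define MBOX_CH_PROP 8u        /* assumed: property-tag channel (ARM to VC) */

/* Post a message buffer address to the mailbox.
   Returns 0 on success, -1 if the address is not 16-byte aligned. */
int mbox_post(fake_mailbox *mb, uint32_t msg_bus_addr)
{
    if (msg_bus_addr & 0xFu)
        return -1;             /* the low 4 bits are needed for the channel */

    while (mb->status & MBOX_FULL)
        ;                      /* spin until the mailbox has room */

    /* Order every earlier write to the message buffer before the register
       write below. Bare-metal ARM code would use a DSB here; the C11 fence
       is only a portable approximation for this sketch. */
    atomic_thread_fence(memory_order_seq_cst);

    mb->write = msg_bus_addr | MBOX_CH_PROP;
    return 0;
}
```

The point of the sketch is only the position of the barrier: after the buffer has been filled in and before the write that hands it to the GPU.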
David Pitt (3386) 1248 posts |
Well that is the thing, I have so far been unable to forcibly reproduce the fault. All the faults here have occurred while the Pi was idling. I have not got anywhere beyond simply having to wait and see if the Pi spontaneously blanks itself. It can happen shortly after power-on, but the Pi was not in use at the time. |
Jon Abbott (1421) 2651 posts |
It’s bizarre that I seem to be the only one who can repro the issue at will. It has me stumped though, as I’m still no closer to knowing the root cause. I have a nagging feeling this is somehow related to the other issue I reported with VCHIQ, where it will lock under certain conditions. I started looking at its code last night and noticed it has some issues, such as none of the atomic functions being truly atomic. Shutting IRQs off around an LDR/STR, for example, is not guaranteed to be atomic, as IRQs may be queued; I believe you need many NOPs or a barrier in between. Strictly speaking they should be using LDREX / STREX. For example:
|
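As a point of comparison for the LDREX / STREX suggestion above, here is a minimal sketch of a read-modify-write done as a compare-and-exchange retry loop. It is not the VCHIQ module's code, and `atomic_add_return` is a hypothetical name. On ARM cores that have exclusive accesses, compilers typically lower this kind of C11 loop to an LDREX / STREX pair.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add `delta` to *value and return the new value.
   No interrupt masking is involved: if another writer gets in between the
   load and the store, the exchange fails and the loop retries. */
uint32_t atomic_add_return(_Atomic uint32_t *value, uint32_t delta)
{
    uint32_t old = atomic_load(value);

    /* On failure, `old` is refreshed with the current contents of *value,
       so the next attempt uses up-to-date data. */
    while (!atomic_compare_exchange_weak(value, &old, old + delta))
        ;

    return old + delta;
}
```

Whether this is actually needed on a single core with IRQs disabled is a separate question, which is taken up later in the thread.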
Chris Hall (132) 3554 posts |
What are registers a1, a2 and a3? Is the assembler clever enough to decide on the fly which registers are currently unused and allocate a1, a2 and a3 to those registers? Like some sort of macro? |
Matthew Phillips (473) 721 posts |
No, I think the a1, a2, a3, a4 will be the argument registers of APCS, the ARM Procedure Call Standard which is used in code generated by C compilers and compilers for other languages (if they follow the standard). The arguments of a function are passed directly in a1 to a4, and if there are more than four arguments the rest are passed on the stack. The labels a1, a2, a3, a4 map to R1, R2, R3, R4. EDIT: Stuart’s pointed it out below. They map to R0 to R3. That was what I had in my head when I was typing it, but it clearly didn’t come out right! |
Stuart Swales (1481) 351 posts |
a1, a2, a3, a4 map to R0, R1, R2, R3 :-) |
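For reference, the alias mapping can be written down as a small lookup table. The a1 to a4 entries are as stated above; the remaining aliases (v1 to v6, sl, fp, ip, sp, lr, pc) are the usual APCS names as I understand them and are not taken from this thread. `apcs_reg_number` is a hypothetical helper name.

```c
#include <string.h>

/* Return the ARM register number (0-15) for an APCS alias,
   or -1 if the alias is not recognised. */
int apcs_reg_number(const char *alias)
{
    static const struct { const char *name; int reg; } map[] = {
        {"a1", 0},  {"a2", 1},  {"a3", 2},  {"a4", 3},   /* arguments/results */
        {"v1", 4},  {"v2", 5},  {"v3", 6},  {"v4", 7},   /* variables */
        {"v5", 8},  {"v6", 9},
        {"sl", 10}, {"fp", 11}, {"ip", 12},              /* stack limit, frame, scratch */
        {"sp", 13}, {"lr", 14}, {"pc", 15},
    };

    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
        if (strcmp(alias, map[i].name) == 0)
            return map[i].reg;
    return -1;
}
```

So the assembler is not allocating registers on the fly; the names are fixed aliases for fixed registers.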
Jon Abbott (1421) 2651 posts |
mutex_lock is another bit that could possibly do with a recode, as it calls RT_Yield with IRQs disabled. Won’t that cause an issue if it yields to a higher priority thread?
Surely lines 159/160 are redundant? up also calls RT_Yield with IRQs disabled:
|
Jon Abbott (1421) 2651 posts |
An interesting observation this morning. With VCHIQ unplugged, I RMReInit’d VCHIQ and the screen instantly blanked. We need to rule the VCHIQ module out. I need some code that will set the hardware pointer directly. Jeffrey – you need to update your RO source code, as I’m not sure if the Pi 3 undervolt issue has any knock-on effect when testing your builds. |
Jon Abbott (1421) 2651 posts |
I’ve been testing the 16-01-19 build for the past hour and cannot get it to blank with my usual repro method. Is the blanking issue related to the undervolt issue that was fixed in Dec? i.e. is the SD usage LED update causing the blanking? I’m going to continue testing to see if it does eventually blank, but it’s looking promising that the issue might have incidentally been resolved (famous last words!) EDIT: Spoke too soon, it blanked whilst I was manually copying some files. EDIT2: When it blanks, the SD activity light stops working. |
Jeffrey Lee (213) 6048 posts |
I believe suitable barriers should be present in all the right places.
The current atomic functions should be fine – VCHIQ is only running on a single core, and (unless I’m mistaken) they aren’t used on memory which the GPU is also updating. Calling RT_Yield with IRQs disabled is certainly dangerous, but that would result in a deadlock of the ARM core (as you’ve mentioned) – the GPU should continue running (or at least continue to output the correct video display).
159 marks the mutex as locked, 160 is for debug.
Sorry – hadn’t realised that it was significantly out of date. I’ve now submitted my changes to CVS, and uploaded a new ROM built from the latest sources.
There are a couple of mailbox messages for a GPU-managed hardware pointer; it might be worth experimenting with that. I’d imagine that the GPU will create a dispmanx overlay for it, just like the dispmanx-based hardware pointer that RISC OS uses. n.b. I’m not sure what to make of the mistake in “Width and height should be >= 16 and (width * height) <= 64” – presumably it’s meant to read that the max width & height is 64.
Not surprising – on the Pi 3 the SD activity LED is handled by the GPU. The ARM writes a status value to a buffer which the GPU regularly polls to determine the desired LED state. |
Jon Abbott (1421) 2651 posts |
With VCHIQ unplugged, the screen still blanks when enabling the cursor via mailbox tag &8011, which rules out VCHIQ. I’m currently testing with BCMSupport RMKilled once I’ve sent the mailbox message, which is taking longer to blank. |
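For anyone wanting to repeat the mailbox test, here is a rough sketch of how a property message carrying tag &8011 might be laid out. The tag number comes from the post above; the field order (enable, x, y, flags) and the overall buffer framing are my reading of the published firmware property interface and should be treated as assumptions. `build_cursor_state_msg` is a hypothetical name, and the sketch only fills in the buffer, it does not send it.

```c
#include <stdint.h>

#define TAG_SET_CURSOR_STATE 0x00008011u

/* Fill `buf` (10 words; must be 16-byte aligned on real hardware) with a
   property message containing a single cursor-state tag.
   Returns the total message size in bytes. */
uint32_t build_cursor_state_msg(uint32_t buf[10], uint32_t enable,
                                uint32_t x, uint32_t y, uint32_t flags)
{
    buf[0] = 10 * sizeof(uint32_t);   /* total buffer size in bytes */
    buf[1] = 0;                       /* request code: process request */
    buf[2] = TAG_SET_CURSOR_STATE;    /* tag identifier */
    buf[3] = 16;                      /* value buffer size in bytes */
    buf[4] = 0;                       /* tag request code */
    buf[5] = enable;                  /* assumed: 1 = visible, 0 = hidden */
    buf[6] = x;                       /* assumed: pointer X position */
    buf[7] = y;                       /* assumed: pointer Y position */
    buf[8] = flags;                   /* assumed: coordinate-space flags */
    buf[9] = 0;                       /* end tag */
    return buf[0];
}
```

Sending a buffer like this repeatedly, with VCHIQ and BCMSupport out of the picture, is one way to exercise the GPU pointer path in isolation.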
Jon Abbott (1421) 2651 posts |
It didn’t blank after two hours with BCMSupport killed, but within seconds of RMReIniting it, it blanked. Given that I’ve previously not been able to reproduce the issue with BCMSupport active and VCHIQ killed, I have to conclude there’s some interplay between the mailbox messages RISCOS is sending and the hardware pointer being active. I then tried manually enabling gamma and noticed something is constantly changing it back. Why is RISCOS sending gamma messages virtually every vsync? I’m now testing with both BCMSupport and VCHIQ killed and the hardware pointer and gamma set via mailbox. EDIT: I can’t get it to blank when enabling the hardware pointer and gamma via mailbox with BCMSupport and VCHIQ killed, so the root cause surely has to be something RISCOS is doing? Given there’s no issue with the hardware pointer off, is it possible the GPU is getting overloaded with updates? I suppose the other possibility is a timing issue. |
Jeffrey Lee (213) 6048 posts |
Looks like it was the flashing colour logic in the kernel (which meant that you could have theoretically had a flashing gamma table if you wanted!) I’ve fixed it now so that it won’t try flashing the gamma. |
Jon Abbott (1421) 2651 posts |
I’ve narrowed this down to WimpTask. If you run a task via WimpTask, every few ms a GraphicsV 11 R0=0 R3=256 is sent. I know I’ve raised this previously, but can gamma please be given its own R0 value (R0=3) to distinguish it from normal palette changes? I’ve tried hammering mailbox cursor and gamma changes with the pointer off and can’t get it to blank, implying it’s definitely something RISCOS is doing. This isn’t being caused by an issue in BCMVideo, is it? Maybe I’m barking up the wrong tree trying to eliminate VCHIQ and BCMSupport. EDIT:
I’m still seeing them when running something via WimpTask EDIT2: I’m currently testing with GraphicsV 5 replaced with a mailbox equivalent, which should at least narrow it down to GraphicsV 5 in BCMVideo or VCHIQ – if I can’t get it to blank. |
Jon Abbott (1421) 2651 posts |
With VCHIQ and BCMSupport active, it’s not blanking with GraphicsV 5 intercepted and switched to using mailbox messages. We already know killing VCHIQ resolves the issue, so I think the focus has to be on the pointer code in BCMVideo and relevant code in VCHIQ that it calls. EDIT: Another way to force the issue is to repeatedly press F12 then ENTER; it will eventually blank when it goes back to the desktop. For some reason, having a few StrongEd windows open seems to increase the frequency with which it blanks. |
Jon Abbott (1421) 2651 posts |
I’ve just realised that I think I’ve seen this issue dozens of times before. This may well be completely unrelated to this issue, but in hindsight it’s very similar. When debugging game crashes, I run the debug build of ADFFS which displays various info direct to the GPU buffer. Certain crashes cause the screen to blank, which I’ve never got to the bottom of. I know there’s no way the debug info could magically disappear from the screen, so for a long time I thought it was an issue in my GraphicsV driver. While debugging the issue, I noticed RTSupport also mysteriously stopped calling my VSync trigger, but that still doesn’t explain why the screen goes blank in the first place. Of course, I now can’t think of a game that did this, to go back and see if it is related. Back to this issue…I tried constantly turning the pointer off/on last night which didn’t seem to trigger the issue, but it can only be done every VSync so wasn’t exactly stressing anything. Leaving the pointer in one spot doesn’t seem to trigger it since the change to drop redundant updates – certainly not within five hours. In my case, I can trigger the issue in one of three ways:
We know the issue only occurs if both gamma and the pointer are active, but it doesn’t seem to occur if gamma is active and the pointer position is handled via mailbox – certainly over the three hour period I tested. Ignoring the fact the display is handled in hardware, the random nature of it sounds like either a timing issue, IRQ issue or reentrancy issue. VCHIQ seems to be at the centre of it, but I’m still puzzled why enabling gamma could trigger the issue other than it increases the time the frame takes to composite – that sort of implies a timing issue, possibly related to the time into the frame that VCHIQ is sending the update. Should we not be able to trigger it by constantly sending pointer updates via VCHIQ? |
Jon Abbott (1421) 2651 posts |
I’ve tried constantly sending pointer changes, pointer palette changes and gamma changes to BCMVideo and can’t get it to blank without moving the mouse. I’ve noticed an issue though, a rogue pointer is left on screen every time you send a GraphicsV 5 with Y set to the screen height – you can end up with dozens of them. |
Jeffrey Lee (213) 6048 posts |
Yeah, I can recreate that here – although it seems to be timing specific for me (only seems to happen if I flood the system with GraphicsV 5 calls), so it looks like it’ll be a bit tedious to track down the cause. |
Jon Abbott (1421) 2651 posts |
Is it ever likely to happen in daily use though, isn’t the Y position capped to height-1? Do you think it’s related? If it is a timing issue, it’s likely there are other issues that aren’t immediately obvious. I noticed another issue as well, although couldn’t reproduce it. Whilst moving the mouse around while flooding requests, the GPU suddenly zoomed in on a small section of the task bar. This got me thinking, I wonder if it’s not actually blanking but displaying a section of memory it can’t access so shuts down. A reboot restarts the GPU, but I’m guessing it’s the bootstrap that’s doing that part, so we can’t actually reset the GPU manually? Do you think it’s worth simplifying BCMVideo and moving the pointer to use the mailbox instead of VCHIQ? It’s possibly worth testing privately at least, as I couldn’t reproduce the issue when intercepting GraphicsV 5 and using the mailbox. If that does resolve the issue, it would at least confirm the root cause is VCHIQ related. |
Jeffrey Lee (213) 6048 posts |
You can change the mouse rectangle so that it allows the pointer to travel off-screen (e.g. dragging a selection box in a filer window allows this)
Possibly. It’s hard to say without knowing where the bug is (I couldn’t see any logic bugs which would explain the behaviour, so some more debugging is needed to try and get a trace of the events that cause it to go wrong)
For me, if I continue to let the rogue pointers build up, eventually the desktop vanishes from the screen so that all that’s left are the pointer images. But this is distinctly different from the “shutting down-ish” behaviour (where the entire video signal cuts out and the GPU seems to die completely), so it’s hard to say whether the bug is the cause of both.
AFAIK there’s no way of resetting the GPU without also resetting the ARM.
Yeah, a version which uses the mailbox for pointer updates would be a good test. But probably only worth doing once this duplicate pointer bug has been fixed, just in case the two are related. |
jan de boer (472) 78 posts |
“You can change the mouse rectangle so that it allows the pointer to travel off-screen (e.g. dragging a selection box in a filer window allows this).” IMHO, I doubt whether the mouse rectangle, defined as ‘where the tip of the pointer can go’, normally extends beyond the screen coordinates in a multitasking program, as there is a feature in the Wimp manager that makes this less likely: |
Jon Abbott (1421) 2651 posts |
I’m going to change tack today. Instead of trying to figure out what’s triggering the issue, I’m going to see if I can figure out why my setup exhibits it so readily. One question that’s been nagging me: how is gamma handled in low bpp modes, given it’s impossible to distinguish a gamma change from a 256 entry palette change? Does RISCOS simply not broadcast gamma in low bpp modes? That would at least explain why the issue goes away when using a 256 colour desktop mode. I had to ignore 256 entry palette changes in my GraphicsV driver, to filter out the gamma broadcasts, but I may just have been seeing them because it’s in a 24bit screen mode when emulating low bpp modes.
Shouldn’t BCMVideo be capping the displayed pointer to the screen dimensions? VC does some really odd stuff if you ask it to display a pointer too far off screen – that’s what I observed when playing around with the mailbox method. |
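The capping being suggested could look something like the sketch below. This is not BCMVideo's code; `point` and `clamp_pointer` are hypothetical names, and it assumes the coordinates refer to the pointer's hotspot and that the visible area runs from 0 to width-1 and 0 to height-1. A real driver may well want to let the pointer image hang partly off screen, so treat the strict clamp as one option to test rather than a fix.

```c
#include <stdint.h>

typedef struct { int32_t x, y; } point;

/* Clamp a pointer position so it always lies inside the visible screen,
   i.e. 0 <= x <= width-1 and 0 <= y <= height-1. */
point clamp_pointer(point p, int32_t width, int32_t height)
{
    if (p.x < 0) p.x = 0;
    if (p.y < 0) p.y = 0;
    if (p.x > width - 1)  p.x = width - 1;
    if (p.y > height - 1) p.y = height - 1;
    return p;
}
```

With a 1360×768 mode, a request with Y equal to the screen height (the case that leaves rogue pointers behind) would be pulled back to Y = 767 before being passed on.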