Pi 3 shutting down-ish

209 posts, 30 voices


Jan 18, 2019 6:15pm Jon Abbott (1421) 2651 posts	“gamma via a mailbox does fix the shutdown-ish thing, as was seen here anyway” Via mailbox does not resolve the issue. If you’re running Jeffrey’s latest test build, it will be the reduced number of mailbox events that are substantially reducing the chance of it occurring.

Jan 18, 2019 7:09pm David Pitt (3386) 1248 posts	“If you’re running Jeffrey’s latest test build, it will be the reduced number of mailbox events that are substantially reducing the chance of it occurring.” The fault did not occur here with the first test build in three days. I have now reverted to that ROM for further testing.

Jan 18, 2019 8:24pm Jon Abbott (1421) 2651 posts	It’s been sat idle for over three hours and not blanked, so CPU/IO load or related elements appear to be a factor. After five hours it didn’t blank when left idle; the minute I started using it, however, it blanked within minutes. So VCHIQ updates might be triggering it, but as I hadn’t RMKilled BCMSupport I can’t rule out a combination of them. CPU/IO load does not appear to be a causal factor, but does appear to contribute, as I can force it to blank with heavy CPU/IO. As per tests the other day, RMKilling BCMSound and Portable does substantially lower the number of times it blanks. What might be worth testing at this stage is a command-line app that floods VCHIQ pointer updates, to see if it will repro the issue outside of the desktop.
“The fault did not occur here with the first test build in three days.” What resolution are you running at? To force the issue, I’m running both the Pi and the Desktop at 1360×768×16M, then stressing the CPU/IO with the pointer showing. That triggers it within 2 mins in my case. If I turn the pointer off, the issue doesn’t occur, and if I just generally use the desktop, it will randomly occur.

Jan 18, 2019 8:36pm Dominic Plunkett (2556) 34 posts	I’m on my phone so can’t check the source, but is there a suitable data sync barrier before accessing the mailbox? Also, any IRQ or FIQ handler needs a data sync barrier at the beginning and end.

Jan 18, 2019 9:18pm David Pitt (3386) 1248 posts	“To force the issue” Well that is the thing, I have so far been unable to forcibly reproduce the fault. All the faults here have been while the Pi is idling. I have not got anywhere beyond simply having to wait and see if the Pi spontaneously blanks itself. It can happen shortly after turn on but the Pi was not in user use.

Jan 19, 2019 8:54am Jon Abbott (1421) 2651 posts	“Well that is the thing, I have so far been unable to forcibly reproduce the fault.” It’s bizarre that I seem to be the only one that can repro the issue at will. It has me stumped though, as I’m still no closer to knowing the root cause. I have a nagging feeling this is somehow related to the other issue I reported with VCHIQ, where it will lock under certain conditions. I started looking at its code last night and noticed it has some issues, such as none of the atomic functions are truly atomic. Shutting IRQ off around an LDR/STR for example is not guaranteed to be atomic, as IRQs may be queued; I believe you need many NOPs or a barrier in between. Strictly speaking they should be using LDREX / STREX. For example:
`atomic_xchg
        MRS     a4, CPSR
        ORR     ip, a4, #I32_bit
        MSR     CPSR_c, ip
        LDR     a3, [a1]
        STR     a2, [a1]
        MSR     CPSR_c, a4
        MOV     a1, a3
        MOV     pc, lr`
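
For comparison with the IRQ-masking routine quoted above, here is a minimal sketch of the same exchange written with C11 atomics. It is not taken from the VCHIQ sources: `xchg_word` is a hypothetical name, and the remark about LDREX/STREX describes what compilers targeting ARMv6 and later typically generate, which would need checking against the toolchain actually used.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for atomic_xchg: store new_value into *word and
   return the value that was there before, as one indivisible step. */
uint32_t xchg_word(_Atomic uint32_t *word, uint32_t new_value)
{
    /* On ARMv6 and later, compilers typically expand this into an
       LDREX/STREX retry loop (an assumption about the toolchain, not
       something confirmed by the thread). */
    return atomic_exchange(word, new_value);
}
```

A caller would use it exactly like the assembler version: pass the address of the word and the new value, and act on the old value that comes back.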

Jan 19, 2019 9:28am Chris Hall (132) 3554 posts	What are registers a1, a2 and a3? Is the assembler clever enough to decide on the fly which registers are currently unused and allocate a1, a2 and a3 to those registers? Like some sort of macro?

Jan 19, 2019 9:46am Matthew Phillips (473) 721 posts	No, I think the a1, a2, a3, a4 will be the argument registers of APCS, the ARM Procedure Call Standard which is used in code generated by C compilers and compilers for other languages (if they follow the standard). The arguments of a function are passed directly in a1 to a4, and if there are more than four arguments the rest are passed on the stack. The labels a1, a2, a3, a4 map to R1, R2, R3, R4. EDIT: Stuart’s pointed it out below. They map to R0 to R3. That was what I had in my head when I was typing it, but it clearly didn’t come out right!

Jan 19, 2019 10:00am Stuart Swales (1481) 351 posts	a1, a2, a3, a4 map to R0, R1, R2, R3 :-)

Jan 19, 2019 10:29am Jon Abbott (1421) 2651 posts	mutex_lock is another bit that could possibly do with a recode, as it calls RT_Yield with IRQs disabled. Won’t that cause a potential issue if it yields to a higher priority thread?
`147: int mutex_lock(struct mutex *m)
148: {
149:     uint32_t rt_handle = _swi(RT_ReadInfo,_IN(0)|_RETURN(0),RTReadInfo_Handle);
150:     int irqs = ensure_irqs_off();
151:     while(!m->pollword && (m->rt_handle != rt_handle))
152:     {
153:         if(_swix(RT_Yield,_IN(1),&m->pollword))
154:         {
155:             restore_irqs(irqs);
156:             return -1;
157:         }
158:     }
159:     m->pollword = 0;
160:     m->rt_handle = rt_handle;
161:     restore_irqs(irqs);
162:     return 0;
163: }`
Surely lines 159/160 are redundant? up also calls RT_Yield with IRQs disabled:
`210: void up(struct semaphore *s)
211: {
212:     int irqs = ensure_irqs_off();
213:     if(!(s->pollword++))
214:     {
215:         _swix(RT_Yield,_IN(1),&dummy_pollword_1);
216:     }
217:     restore_irqs(irqs);
218: }`

Jan 19, 2019 11:43am Jon Abbott (1421) 2651 posts	An interesting observation this morning. With VCHIQ unplugged, I RMReInit’d VCHIQ and the screen instantly blanked. We need to rule the VCHIQ Module out. I need some code that will set the hardware pointer directly. Jeffrey – you need to update your RO source code as I’m not sure if the Pi3 undervolt issue has any knock on effect when testing your builds.

Jan 19, 2019 12:52pm Jon Abbott (1421) 2651 posts	I’ve been testing the 16-01-19 build for the past hour and cannot get it to blank with my usual repro method. Is the blanking issue related to the undervolt issue that was fixed in Dec? i.e. is the SD usage LED update causing the blanking? I’m going to continue testing to see if it does eventually blank, but it’s looking promising that the issue might have vicariously been resolved (famous last words!)
EDIT: Spoke too soon, it blanked whilst I was manually copying some files.
EDIT2: When it blanks, the SD activity light stops working.

Jan 19, 2019 2:41pm Jeffrey Lee (213) 6048 posts	“I’m on my phone so can’t check the source, but is there a suitable data sync barrier before accessing the mailbox? Also, any IRQ or FIQ handler needs a data sync barrier at the beginning and end.” I believe suitable barriers should be present in all the right places.
“I started looking at its code last night and noticed it has some issues, such as none of the atomic functions are truly atomic. Shutting IRQ off around an LDR/STR for example is not guaranteed to be atomic, as IRQs may be queued; I believe you need many NOPs or a barrier in between. Strictly speaking they should be using LDREX / STREX.” The current atomic functions should be fine – VCHIQ is only running on a single core, and (unless I’m mistaken) they aren’t used on memory which the GPU is also updating. Calling RT_Yield with IRQs disabled is certainly dangerous, but that would result in a deadlock of the ARM core (as you’ve mentioned) – the GPU should continue running (or at least continue to output the correct video display).
“Surely lines 159/160 are redundant?” 159 marks the mutex as locked, 160 is for debug.
“Jeffrey – you need to update your RO source code as I’m not sure if the Pi3 undervolt issue has any knock on effect when testing your builds.” Sorry – hadn’t realised that it was significantly out of date. I’ve now submitted my changes to CVS, and uploaded a new ROM built from the latest sources.
“We need to rule the VCHIQ Module out. I need some code that will set the hardware pointer directly.” There are a couple of mailbox messages for a GPU-managed hardware pointer; it might be worth experimenting with that. I’d imagine that the GPU will create a dispmanx overlay for it, just like the dispmanx-based hardware pointer that RISC OS uses. n.b. I’m not sure what to make of the mistake in “Width and height should be >= 16 and (width * height) <= 64” – presumably it’s meant to read that the max width & height is 64.
“EDIT2: When it blanks, the SD activity light stops working.” Not surprising – on the Pi 3 the SD activity LED is handled by the GPU. The ARM writes a status value to a buffer which the GPU regularly polls to determine the desired LED state.

Jan 19, 2019 8:21pm Jon Abbott (1421) 2651 posts	“There are a couple of mailbox messages for a GPU-managed hardware pointer; it might be worth experimenting with that” With VCHIQ unplugged, the screen still blanks when enabling the cursor via mailbox tag &8011, which rules out VCHIQ. I’m currently testing with BCMSupport RMKilled once I’ve sent the mailbox message, which is taking longer to blank.
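
As a rough illustration of the kind of message being discussed, the sketch below fills in a single-tag property buffer for tag &8011. The field order (enable, x, y, flags) and the header layout follow my understanding of the publicly documented Raspberry Pi mailbox property interface and should be treated as an assumption to verify; `build_cursor_state_msg` is a hypothetical helper, and actually passing the buffer to the mailbox (which also needs a 16-byte aligned buffer) is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_SET_CURSOR_STATE 0x00008011u  /* the tag quoted above as &8011 */

/* Fill buf (at least 10 words) with a one-tag property message asking the
   GPU to show or hide its cursor at (x, y). Returns the number of words
   written. Nothing is sent to the hardware here. */
size_t build_cursor_state_msg(uint32_t *buf, uint32_t enable,
                              uint32_t x, uint32_t y)
{
    size_t i = 0;
    buf[i++] = 0;                    /* total size in bytes, patched below */
    buf[i++] = 0;                    /* 0 = this buffer is a request */
    buf[i++] = TAG_SET_CURSOR_STATE; /* tag identifier */
    buf[i++] = 16;                   /* size of the value buffer in bytes */
    buf[i++] = 0;                    /* tag request code */
    buf[i++] = enable;               /* assumed: 1 = visible, 0 = hidden */
    buf[i++] = x;
    buf[i++] = y;
    buf[i++] = 0;                    /* assumed: flags, 0 = display coords */
    buf[i++] = 0;                    /* end tag */
    buf[0] = (uint32_t)(i * sizeof(uint32_t));
    return i;
}
```

Keeping the message construction separate from the send makes it easy to inspect or log exactly what would be handed to the GPU during this sort of testing.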

Jan 19, 2019 10:36pm Jon Abbott (1421) 2651 posts	It didn’t blank after two hours with BCMSupport killed, but within seconds of RMReInit’ing it, it blanked. Given that I’ve previously not been able to reproduce the issue with BCMSupport active and VCHIQ killed, I have to conclude there’s some interplay between the mailbox messages RISCOS is sending and the hardware pointer being active. I then tried manually enabling gamma and noticed something is constantly changing it back. Why is RISCOS sending gamma messages virtually every vsync? I’m now testing with both BCMSupport and VCHIQ killed and the hardware pointer and gamma set via mailbox.
EDIT: I can’t get it to blank when enabling the hardware pointer and gamma via mailbox with BCMSupport and VCHIQ killed, so the root cause surely has to be something RISCOS is doing? Given there’s no issue with the hardware pointer off, is it possible the GPU is getting overloaded with updates? I suppose the other possibility is a timing issue.

Jan 20, 2019 4:02pm Jeffrey Lee (213) 6048 posts	“I then tried manually enabling gamma and noticed something is constantly changing it back. Why is RISCOS sending gamma messages virtually every vsync?” Looks like it was the flashing colour logic in the kernel (which meant that you could have theoretically had a flashing gamma table if you wanted!). I’ve fixed it now so that it won’t try flashing the gamma. http://www.phlamethrower.co.uk/misc2/bcm2835dev.zip

Jan 20, 2019 4:30pm Jon Abbott (1421) 2651 posts	“Why is RISCOS sending gamma messages virtually every vsync?” I’ve narrowed this down to WimpTask. If you run a task via WimpTask, every few ms a GraphicsV 11 R0=0 R3=256 is sent. I know I’ve raised this previously, but can gamma please be given its own R0 value (R0=3) to distinguish it from normal palette changes? I’ve tried hammering mailbox cursor and gamma changes with the pointer off and can’t get it to blank, implying it’s definitely something RISCOS is doing. This isn’t being caused by an issue in BCMVideo is it? Maybe I’m barking up the wrong tree trying to eliminate VCHIQ and BCMSupport.
EDIT: “I’ve fixed it now so that it won’t try flashing the gamma.” I’m still seeing them when running something via WimpTask.
EDIT2: I’m currently testing with GraphicsV 5 replaced with a mailbox equivalent, which should at least narrow it down to GraphicsV 5 in BCMVideo or VCHIQ – if I can’t get it to blank.

Jan 20, 2019 7:41pm Jon Abbott (1421) 2651 posts	“I’m currently testing with GraphicsV 5 replaced with a mailbox equivalent, which should at least narrow it down to GraphicsV 5 in BCMVideo or VCHIQ” With VCHIQ and BCMSupport active, it’s not blanking with GraphicsV 5 intercepted and switched to using mailbox messages. We already know killing VCHIQ resolves the issue, so I think the focus has to be on the pointer code in BCMVideo and relevant code in VCHIQ that it calls.
EDIT: Another way to force the issue is to continuously hit F12 then ENTER; it will eventually blank when it goes back to the desktop. For some reason, having a few StrongEd windows open seems to increase how often it blanks.

Jan 21, 2019 7:44am Jon Abbott (1421) 2651 posts	I’ve just realised that I think I’ve seen this issue dozens of times before. This may well be completely unrelated to this issue, but in hindsight it’s very similar. When debugging game crashes, I run the debug build of ADFFS which displays various info direct to the GPU buffer. Certain crashes cause the screen to blank, which I’ve never got to the bottom of. I know there’s no way the debug info could magically disappear from the screen, so for a long time I thought it was an issue in my GraphicsV driver. While debugging the issue, I noticed RTSupport also mysteriously stopped calling my VSync trigger, but that still doesn’t explain why the screen goes blank in the first place. Of course, I now can’t think of a game that did this, to go back and see if it is related.
Back to this issue… I tried constantly turning the pointer off/on last night, which didn’t seem to trigger the issue, but it can only be done every VSync so wasn’t exactly stressing anything. Leaving the pointer in one spot doesn’t seem to trigger it since the change to drop redundant updates – certainly not within five hours. In my case, I can trigger the issue in one of three ways:
- within a second of getting to the desktop
- while I’m moving the mouse
- whilst creating game packages with the pointer active (although I’ve not retested this since the change to drop redundant pointer updates)
We know the issue only occurs if both gamma and the pointer are active, but it doesn’t seem to occur if gamma is active and the pointer position is handled via mailbox – certainly over the three hour period I tested. Ignoring the fact the display is handled in hardware, the random nature of it sounds like either a timing issue, IRQ issue or reentrancy issue. VCHIQ seems to be at the centre of it, but I’m still puzzled why enabling gamma could trigger the issue other than it increases the time the frame takes to composite – that sort of implies a timing issue, possibly related to the time into the frame that VCHIQ is sending the update. Should we not be able to trigger it by constantly sending pointer updates via VCHIQ?
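
The "change to drop redundant updates" mentioned above is not shown anywhere in the thread, so the following is only a sketch of the general idea: remember the last pointer state that was sent and skip any update that matches it. `struct pointer_state` and `pointer_update_needed` are hypothetical names, not identifiers from BCMVideo.

```c
#include <stdbool.h>
#include <stdint.h>

/* Last pointer state handed to the GPU (hypothetical bookkeeping). */
struct pointer_state {
    int32_t x, y;
    bool visible;
    bool valid;   /* false until the first update has been recorded */
};

/* Returns true when the update differs from the last one sent, recording
   it as the new "last" state; returns false when it is redundant. */
bool pointer_update_needed(struct pointer_state *last,
                           int32_t x, int32_t y, bool visible)
{
    if (last->valid && last->x == x && last->y == y &&
        last->visible == visible)
        return false;

    last->x = x;
    last->y = y;
    last->visible = visible;
    last->valid = true;
    return true;
}
```

With a filter like this in front of the send path, a stationary pointer generates no traffic at all, which would be consistent with the observation that leaving the pointer in one spot no longer triggers the blanking.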

Jan 21, 2019 3:26pm Jon Abbott (1421) 2651 posts	“Should we not be able to trigger it by constantly sending pointer updates via VCHIQ?” I’ve tried constantly sending pointer changes, pointer palette changes and gamma changes to BCMVideo and can’t get it to blank without moving the mouse. I’ve noticed an issue though: a rogue pointer is left on screen every time you send a GraphicsV 5 with Y set to the screen height – you can end up with dozens of them.

Jan 21, 2019 10:57pm Jeffrey Lee (213) 6048 posts	“I’ve noticed an issue though: a rogue pointer is left on screen every time you send a GraphicsV 5 with Y set to the screen height – you can end up with dozens of them.” Yeah, I can recreate that here – although it seems to be timing specific for me (only seems to happen if I flood the system with GraphicsV 5 calls), so it looks like it’ll be a bit tedious to track down the cause.

Jan 22, 2019 6:10am Jon Abbott (1421) 2651 posts	“it seems to be timing specific for me (only seems to happen if I flood the system with GraphicsV 5 calls)” Is it ever likely to happen in daily use though, isn’t the Y position capped to height-1? Do you think it’s related? If it is a timing issue, it’s likely there are other issues that aren’t immediately obvious.
I noticed another issue as well, although couldn’t reproduce it. Whilst moving the mouse around while flooding requests, the GPU suddenly zoomed in on a small section of the task bar. This got me thinking, I wonder if it’s not actually blanking but displaying a section of memory it can’t access so shuts down. A reboot restarts the GPU, but I’m guessing it’s the bootstrap that’s doing that part, so we can’t actually reset the GPU manually?
Do you think it’s worth simplifying BCMVideo and moving the pointer to use the mailbox instead of VCHIQ? It’s possibly worth testing privately at least, as I couldn’t reproduce the issue when intercepting GraphicsV 5 and using the mailbox. If that does resolve the issue, it would at least confirm the root cause is VCHIQ related.

Jan 22, 2019 1:27pm Jeffrey Lee (213) 6048 posts	“‘it seems to be timing specific for me (only seems to happen if I flood the system with GraphicsV 5 calls)’ Is it ever likely to happen in daily use though, isn’t the Y position capped to height-1?” You can change the mouse rectangle so that it allows the pointer to travel off-screen (e.g. dragging a selection box in a filer window allows this)
“Do you think it’s related? If it is a timing issue, it’s likely there are other issues that aren’t immediately obvious.” Possibly. It’s hard to say without knowing where the bug is (I couldn’t see any logic bugs which would explain the behaviour, so some more debugging is needed to try and get a trace of the events that cause it to go wrong)
“I noticed another issue as well, although couldn’t reproduce it. Whilst moving the mouse around while flooding requests, the GPU suddenly zoomed in on a small section of the task bar. This got me thinking, I wonder if it’s not actually blanking but displaying a section of memory it can’t access so shuts down.” For me, if I continue to let the rogue pointers build up, eventually the desktop vanishes from the screen so that all that’s left are the pointer images. But this is distinctly different from the “shutting down-ish” behaviour (where the entire video signal cuts out and the GPU seems to die completely), so it’s hard to say whether the bug is the cause of both.
“A reboot restarts the GPU, but I’m guessing it’s the bootstrap that’s doing that part, so we can’t actually reset the GPU manually?” AFAIK there’s no way of resetting the GPU without also resetting the ARM.
“Do you think it’s worth simplifying BCMVideo and moving the pointer to use the mailbox instead of VCHIQ? It’s possibly worth testing privately at least, as I couldn’t reproduce the issue when intercepting GraphicsV 5 and using the mailbox. If that does resolve the issue, it would at least confirm the root cause is VCHIQ related.” Yeah, a version which uses the mailbox for pointer updates would be a good test. But probably only worth doing once this duplicate pointer bug has been fixed, just in case the two are related.

Jan 22, 2019 11:32pm jan de boer (472) 78 posts	“You can change the mouse rectangle so that it allows the pointer to travel off-screen (e.g. dragging a selection box in a filer window allows this).” IMHO, I doubt whether the mouse rectangle normally goes beyond the screen coordinates in a multitasking program (taking the mouse rectangle to mean ‘where the tip of the pointer can go’), as there is a feature in the Wimp manager that makes this less likely. It’s in the Wimp manager at offset &AC14, in the code that gives back the window and icon handles for a mouse position. If there is no handle beneath the pointer, the Wimp manager can crash; on an RPi, with high vectors and alignment errors on, it gives both a zero-page and an alignment error.
AC14 STMDB R13!,{R0-R7,R10}
AC18 SUB R14,R3,#1
AC1C LDR R14,[R14,#4] \unaligned load from zeropage address 3
AC20 CMN R14,#1
AC24 BNE &0000AC40
AC38 SWI XOS_ReadMonotonicTime
To elicit this issue, have a multitasking program make the pointer window unlimited (OS_Word 21); the Wimp will crash with the next program that is polled when you move the pointer outside the screen area. On old machines nothing happens; on new ones all you can do is Ctrl-Shift-Break. The source of this code is in apache.RiscOS.Sources.Desktop.Wimp.s.Wimp03, line 1356 etc. If R3 (AC18) is zero, because there is no window/icon handle under the pointer, R14 becomes -1 in line 1357. This logic has been there unchanged since at least RISC OS 3.11. Old machines don’t notice it, and on newer machines it never happens because the pointer window is supposedly limited to the screen area. The ‘body’ of the pointer can dive under the bottom and the right side of the screen, of course; but hopefully when it’s drawn everything outside the screen is clipped. If the mouse rectangle could go beyond the screen area, you could see this error more often, I suppose.
The above has nothing to do with the shutting-down problem so is best left alone; ‘one problem at a time’.
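
The address arithmetic behind that unaligned load can be checked in a few lines of C. This only mirrors the two quoted instructions (SUB R14,R3,#1 then LDR R14,[R14,#4]) using 32-bit wrap-around; `wimp_load_address` is a hypothetical name and nothing here is taken from the Wimp sources beyond those two instructions.

```c
#include <stdint.h>

/* Returns the address that LDR R14,[R14,#4] would read from, given the
   value of R3 before SUB R14,R3,#1. Unsigned arithmetic wraps modulo
   2^32, just as the ARM registers do. */
uint32_t wimp_load_address(uint32_t r3)
{
    uint32_t r14 = r3 - 1u;  /* becomes 0xFFFFFFFF (-1) when r3 is 0 */
    return r14 + 4u;         /* 0xFFFFFFFF + 4 wraps round to 3 */
}
```

So a zero in R3 yields address 3, which is both in zero page and not word aligned, matching the two errors described in the post.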

Jan 23, 2019 6:23am Jon Abbott (1421) 2651 posts	I’m going to change tack today. Instead of trying to figure out what’s triggering the issue, I’m going to see if I can figure out why my setup exhibits it so readily. One question that’s been nagging me: how is gamma handled in low bpp modes, given it’s impossible to distinguish a gamma change from a 256 entry palette change? Does RISCOS simply not broadcast gamma in low bpp modes? That would at least explain why the issue goes away when using a 256 colour desktop mode. I had to ignore 256 entry palette changes in my GraphicsV driver, to filter out the gamma broadcasts, but I may just have been seeing them because it’s in a 24bit screen mode when emulating low bpp modes.
“You can change the mouse rectangle so that it allows the pointer to travel off-screen (e.g. dragging a selection box in a filer window allows this)” Shouldn’t BCMVideo be capping the displayed pointer to the screen dimensions? VC does some really odd stuff if you ask it to display a pointer too far off screen – that’s what I observed when playing around with the mailbox method.
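
A minimal sketch of the kind of capping being asked about, assuming the position is simply limited to the visible range on each axis. `clamp_coord` is a hypothetical helper, not a BCMVideo function, and whether clamping is preferable to hiding the pointer once it leaves the screen is an open question the thread does not settle.

```c
#include <stdint.h>

/* Limit a pointer coordinate to the range 0 .. size-1, where size is the
   screen width (for X) or height (for Y) in pixels. A non-positive size
   yields 0. */
int32_t clamp_coord(int32_t value, int32_t size)
{
    if (size <= 0 || value < 0)
        return 0;
    if (value > size - 1)
        return size - 1;
    return value;
}
```

For a 1360×768 mode, a Y of 768 (the value that leaves rogue pointers behind in the tests above) would be passed to the GPU as 767.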