RPi 4B with RISC-OS 5.28 & 5.29 Lockups
Dave Higton (1515) 3526 posts |
You guys need to brush up on your 2’s complement arithmetic. Really. |
Rick Murray (539) 13840 posts |
Should work, as -20 plus 10 is -10, which will be the same as big hex number to slightly larger hex number (I’m writing this on my phone so can’t ask BASIC to That being said, the Wimp Poll/Idle source is, uh, interesting. A Byzantine maze of exception. So I’m not going to go looking to see if it’s a signed or unsigned comparison… |
Frank de Bruijn (160) 228 posts |
Already tested that, ages ago, when I still thought giving it a large unsigned integer (a.k.a. a negative signed one) would work. It didn’t (i.e. it returned immediately). |
Steve Fryatt (216) 2105 posts |
For the purposes of OS_ReadMonotonicTime and Wimp_PollIdle, where we’re just doing additions and subtractions and testing relative values, it shouldn’t matter. If you never compare two time values directly, but always subtract one from the other and compare the result to zero, things cancel out just fine. That’s why you get constructs like this one, paraphrased a bit from the PRM (page 3-185, in the description of Wimp_PollIdle):
Assuming, of course, that internally, both OS_ReadMonotonicTime and Wimp_PollIdle are trading in unsigned 32-bit ints. The PRM doesn’t explicitly say they are, but the OS_ReadMonotonicTime description implies it. |
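A minimal sketch of the subtract-then-compare idea, written in Python rather than BBC BASIC and assuming a 32-bit tick counter that wraps to zero. The helper names `to_signed32` and `has_passed` are made up for illustration and are not taken from the PRM.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret the low 32 bits of value as a two's complement signed integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value

def has_passed(now, target):
    """True once `now` has reached or passed `target`, even across a wrap."""
    return to_signed32(now - target) >= 0

# A target set 10 ticks ahead, 5 ticks before the counter wraps.
start = 0xFFFFFFFB
target = (start + 10) & MASK32          # wraps round to 5

naive_before = start >= target          # direct comparison: wrongly says "passed"
safe_before = has_passed(start, target)                  # subtract, test sign
safe_after = has_passed((start + 10) & MASK32, target)   # 10 ticks later
```

The direct comparison reports the target as already passed because the wrapped target is a small number, while the subtract-and-test form gives the intended answer on both sides of the wrap.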
Frank de Bruijn (160) 228 posts |
And Wimp_PollIdle doesn’t. Tested extensively, some time during the 2010s. |
Dave Higton (1515) 3526 posts |
The result is treated as signed. It doesn’t matter whether the numbers being compared are signed or unsigned, so long as (a) they are both of the same type, and (b) the difference is less than half the range of the numbers. Like I say, brush up on your 2’s complement arithmetic. IIRC there used to be a problem in the Wimp at the wrap-round of the monotonic timer, but it was fixed decades ago. |
Steve Fryatt (216) 2105 posts |
The relevant code seems to be
from hereabouts down to the That looks to be very similar to the
construct from the Wimp_PollIdle PRM entry? If “now” is greater than “return time” then return? PS. That’s got to be a contender for a “most idiotic comment” award, hasn’t it? I can see that you’re using |
nemo (145) 2546 posts |
Gentlemen, there’s a degree of confusion here.
1. The time parameter for Wimp_PollIdle is an absolute MonotonicTime. If you want to yield for one second, do
2. (Related point) Don’t do anything just because you tried it and it didn’t obviously catch fire.
3. If you want to be called back immediately, don’t use PollIdle – just use Wimp_Poll as the gods intended.
The code in question is this simple (my labels) – the pink bit:
Task_R2 is the MonotonicTime the task asked to idle until. It will therefore NOT receive a null event until MonotonicTime is greater than or equal to that number, by subtracting it and comparing with zero. But also note that null events are the lowest priority event, and your task will be called back regardless of the idle time if anything else at all happens.
For the record, the events are delivered in this priority order (i.e. first one that happens gets delivered):
HTH |
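A rough Python model of the check described in this post, for illustration only; it is not the Wimp's source. It assumes the monotonic time is a 32-bit centisecond counter, and the names `null_event_due` and `idle_until_after` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret the low 32 bits of value as a two's complement signed integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value

def null_event_due(monotonic_time, idle_until):
    """Model of the test: subtract the idle time and compare with zero."""
    return to_signed32(monotonic_time - idle_until) >= 0

def idle_until_after(now, centiseconds):
    """Build an ABSOLUTE idle time: the current time plus a delay."""
    return (now + centiseconds) & MASK32

now = 123456                       # pretend monotonic time reading
wake = idle_until_after(now, 100)  # one second from now

due_at_start = null_event_due(now, wake)
due_after_99 = null_event_due(now + 99, wake)
due_after_100 = null_event_due(now + 100, wake)

# Passing the bare delay (100) instead of an absolute time: for this reading
# it is already in the past, so a null would be due at once.
due_with_relative_value = null_event_due(now, 100)
```

In this model the null event becomes due exactly when the counter reaches the absolute idle time, and a bare delay value behaves like a time that expired long ago.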
nemo (145) 2546 posts |
In context, it gets credit for someone considering whether a signed comparison was required. Regrettably we have not always been so lucky in that regard – see VDU-12345678 in a TaskWindow for example. |
nemo (145) 2546 posts |
Incidentally, the above priority list is why it is a Very Bad Idea™ for a task to send a message to itself – Wimp_Poll will just immediately return without a context switch. Do not mistake the WindowManager for any kind of multitasking scheduler. There is no concept of time-slicing or time-starvation. It’s a very simple hack of the single-tasking Wimp and it’s a testament to legions (generations, even) of Wimp programmers that the desktop works as well as it does. |
Steve Fryatt (216) 2105 posts |
I wonder if the confusion here is coming from people not realising that there’s a limit to the time delay which can be applied to Null events by Wimp_PollIdle, due to the wrap-around of the arithmetic? The delay can only be half of the total timespan allowed by OS_ReadMonotonicTime: &FFFFFFFF centiseconds is just over 497 days, so we can only delay by 248-and-a-half days before the comparison wraps around from “in the future” to “in the past”. Given this, Paolo’s -1 will be “in the past” until the machine has been running for more than 248 days, and so for that time it will return immediately with a Null event (just as if Wimp_Poll were used). However, after 248 days, it will start to cause Wimp_PollIdle to block Null events until 497 days have elapsed. Then there will be another 248 days where it returns immediately, and so on. In a similar vein, a “large unsigned integer” is more likely to be “in the past” than “in the future” for the first few days after the system has booted. Testing the Wimp’s behaviour is tricky unless the monotonic timer is tampered with. |
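The figures above can be checked with a short Python calculation. This is a sketch that assumes a 32-bit centisecond counter starting at zero at boot and a signed subtract-and-compare test; the function name `blocks_nulls` is hypothetical.

```python
CENTISECONDS_PER_DAY = 100 * 60 * 60 * 24

def to_signed32(value):
    """Interpret the low 32 bits of value as a two's complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value

full_range_days = 2**32 / CENTISECONDS_PER_DAY   # just over 497 days
half_range_days = 2**31 / CENTISECONDS_PER_DAY   # roughly 248.5 days

def blocks_nulls(uptime_days, idle_until=0xFFFFFFFF):
    """True if a fixed idle time of -1 counts as 'in the future' at this uptime."""
    now = uptime_days * CENTISECONDS_PER_DAY
    return to_signed32(now - idle_until) < 0

# Uptime in days mapped to whether null events would be held back.
timeline = {days: blocks_nulls(days) for days in (1, 200, 300, 490, 500)}
```

Under these assumptions the fixed value is treated as expired for the first 248 or so days, as pending until about day 497, and as expired again after the counter wraps.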
Paolo Fabio Zaino (28) 1882 posts |
First of all, thank you everyone for checking this! :) @ Steve
Not quite, the BPL instruction uses the N flag (Negative), so if N is not set… hence your BBC BASIC code should be:
Also, I need to make an apology: I mentioned -1, but I do not use -1 as a magic number. Sorry, I was in a hurry due to other things going on at the same time and cut my content down far too much! For my test, I used:
Where user_delay% in my test was set to -1; normally it is set to 0 for a redraw and to 7 otherwise. In my understanding that should return immediately. As explained by nemo, I do not consider RO an OS with a real multitasking scheduler (as I have mentioned billions of times, for me it’s an Acorn MOS pumped with steroids), so using an already-expired time sounds like it should just return immediately to me. In other words, the difference between +0 and +(-1) should be along the lines of: with +0 there might be something else happening (like trying to switch to another task), while with +(-1) there is no chance of that and it just returns straight to my task and, in my case, executes the next chunk of work. I guess I am wrong then. Again thanks for checking it :) |
Stuart Swales (8827) 1357 posts |
Yes, one of the first clients that was updated to use Wimp_PollIdle (and incorrectly) was the internal MailMan at Acorn. After the requisite interval had elapsed on one of the manglement desktops, it no longer polled for mail. Ironically, this particular mangler had wangled one of the early A440 production systems so he could boot either RISC OS or RISCiX and was ‘going to be doing that all the time’. Clearly not for 248 days, they hadn’t… [Edit: as Nemo infers below, they did in fact barely use the system.] |
nemo (145) 2546 posts |
Steve garbled
Days. If you manage to 1. Use RISC OS without crashing or resetting for eight months; and 2. Write a program that does so very little that the idle null poll is the first thing it hears about; then you win a special prize from the RISC OS faeries. |
Daniel Garrod (9459) 34 posts |
Sorry guys, but I am lost: what have all these messages to each other got to do with my locking-up issue? Daniel. |
nemo (145) 2546 posts |
It moved off onto the orthogonal topic of tasks ceasing to perform their function (“locking up”) because of a failure to understand Wimp_PollIdle. |
Paolo Fabio Zaino (28) 1882 posts |
@ Daniel It started with people trying to understand why you are experiencing that locking up, but unfortunately it has moved off onto a side discussion because some people have made quite an enormous set of assumptions, which led to some confusing conclusions. In your case (and even in the others mentioned here) I don’t think that even a mistaken use of Wimp_PollIdle is causing the issue. Here is why:
1) Some people mentioned that using now% - 1 would cause problems at some point in the future. This is not quite right (it can happen only in an extremely remote condition). The monotonic timer will reset after a while: in the end it is a 32-bit number, so when it reaches its maximum it restarts from zero. This seems to be the sole element evaluated by those who think it will cause problems in the future, but there is more to consider:
a) The Wimp_PollIdle interval only decides which NULL events are sent to our task; it doesn’t preclude other events and messages. This is absolutely crucial to consider, because in the rare case of "locking up" by using -1, as soon as a new event comes in the interval will be reset to proper numbers and so everything will be back to normal. So I agree with Nemo, someone is definitely confused about how Wimp_PollIdle works and affects RO; hopefully this will help.
b) NULL events are the lowest priority AFAIR, so any other event will reach a task, and that will also trigger a reset of the now% - 1, which will then solve the problem.
c) To actually get into the case where the number produced is going to lock a task, we need to execute the now% - 1 just as the timer is resetting, and that task must be accepting only NULL events. That requires a very specific configuration, which I have never seen done tbh, also because a task should accept at the very minimum a signal to quit. But even in this case, on modern RO there is a chance to kill that specific task using [ALT]+[BREAK].
2) Someone suggested using Wimp_Poll in cases where we wish to receive ALL the NULL events, and that is true, but not quite the same. In fact using Wimp_Poll should equate to using Wimp_PollIdle with now% + 0 or something, not the same as now% - 1, which will literally return immediately to the original task; no process "swap" will happen at all.
In your case, if the system is locking up, it’s most likely something else causing the problem, and I would start by investigating which modules you have loaded etc. Hope this helps, |
nemo (145) 2546 posts |
Paolo monospaced
No. now%-1 is pointless (use Wimp_Poll) but harmless – wrap-around is never an issue in that case. The problem is in the theoretical case of using a fixed number, whether -1, 0 or RND, which in the worst-case might not return for eight months.
Yes.
No. There’s no difference. I posted the lines of code above – the Wimp will return to your task if MonotonicTime is >= your idle time. This will happen only if there is no other event to be delivered anywhere, but will happen[1] regardless of whether your idle time was now%, now%-1 or now%-2147483647. As Dave has pointed out, this is a simple matter of two’s-complement arithmetic.
[1] Your task will also only be called back immediately if there are no other tasks waiting for nulls – they’re delivered in round-robin manner to avoid one task monopolising things. |
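The "no difference" claim can be illustrated with a small Python sketch, assuming the same 32-bit subtract-and-compare test; `null_event_due` is a hypothetical name and the sample clock readings are arbitrary.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret the low 32 bits of value as a two's complement signed integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value

def null_event_due(monotonic_time, idle_until):
    """Model of the test: MonotonicTime minus idle time, compared with zero."""
    return to_signed32(monotonic_time - idle_until) >= 0

offsets = [0, -1, -2147483647]                 # now%, now%-1, now%-2147483647
sample_times = [0, 5, 123456, 0x7FFFFFFF, 0xFFFFFFFE]

# Every combination of clock reading and offset should already be due.
all_due = all(
    null_event_due(now, (now + offset) & MASK32)
    for now in sample_times
    for offset in offsets
)

# One tick ahead, or one step past the half-range limit, is NOT due.
future_due = null_event_due(123456, 123457)
beyond_half_range_due = null_event_due(123456, (123456 - 2147483648) & MASK32)
```

In this model all three idle times count as expired whatever the clock reads, and only a value at least one tick ahead (or more than half the range behind) holds the null event back.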
Paolo Fabio Zaino (28) 1882 posts |
That was, originally, my understanding as well, but I swear I have seen my Launchpad redrawing the icons faster than when using now% + 0. I’ll re-test again tonight (it may have been just some peculiar combination of things).
In my test, that was probably the case, as it was the only user executed task running. In any case, if I see again that visible difference I’ll make a video and post it somewhere. |
Daniel Garrod (9459) 34 posts |
Hi Paolo, You mentioned testing modules. What is the best way, and is there a copy of !Boot that has a minimum number of modules loaded? Thanks. Daniel. |
Paolo Fabio Zaino (28) 1882 posts |
Hey Daniel,
Here are a few ideas that may help you find the root cause of the problem you are experiencing:
You can quickly check all the loaded modules by rebooting your system, then opening a TaskWindow and typing: *modules
AFAIR, ROM modules should start at `FCxxxxxx`, while modules that get loaded during your boot sequence should have addresses not starting with `FC`. Take note of those. If you are using an editor like !StrongEd you can copy them into the clipboard and then paste them into a text file; it’s quick.
Then download a clean UniBoot from ROOL (if you are using RO 5). You can find one here I think: https://www.riscosopen.org/content/downloads/common (scroll down that page until you find “Disc based components” and download the HardDisc4 image; if you have a way to unzip it manually on your RISC OS system then you can download the zip file, if you don’t then download the self-extracting one).
Unzip the HardDisc4 image, then rename your original !Boot (something like XBoot will be fine) and copy the !Boot you’ll find in your unzipped HardDisc4 onto the root directory of your SD card (where your old !Boot was, basically). Reboot and run your software, and see what happens. If it works fine and the problem you had disappears, then the problem was caused by something in your previous boot sequence.
At this point, if you need to recover things from your old UniBoot, I would suggest proceeding with caution and copying one item at a time from your old !Boot to your new one. Every time you copy an item across I would retest everything: reboot, launch your app and make sure no problem happens. If no problem happens, add another item from the old !Boot to the new one and repeat.
If the problem also happens with the clean new !Boot, then I would look for bugs in the app you’re using.
RISC OS has a very, very vintage architecture (as mentioned above, it’s just an Acorn MOS – the original OS for the BBC Micro – improved, ported to ARM and with an added Desktop that was designed for machines running at 8MHz, that is all it is). So it offers no protection against modules that misbehave or have bugs, no protection against applications that have bugs, and it’s an OS in which both the app can access the kernel space and the kernel can access (directly) the app space, so a system freeze can happen for almost any issue (technically one can “freeze” a RISC OS system just by running an infinite loop in an app).
In recent years ROOL has put some effort into making the situation a bit better and they have created RO “kernels” that, for example, do not use page 0, so offer a bit more resilience against apps that may be trying to access memory on page 0 by mistake (this can happen for a number of reasons). So another suggestion is: if your app works fine with such kernels (and it should tbh), then use those; that will help gain a bit more resilience.
Another quick test you could do, to rule out any issues caused by the hardware for example, is just to plug your SD card into another Raspberry Pi and see if the problem also happens on the new hardware. If it does, then try the tricks described above; if it doesn’t, then it could be either something caused by the hardware of your previous Raspberry Pi OR by the Pi firmware version installed on your previous Raspberry Pi. Hope this helps, good luck! :) |
Rick Murray (539) 13840 posts |
If it’s a Pi we’re talking about, you’ll want to be VERY careful with the big “Loader” file within !Boot. This should never be copied, nor deleted. And I’m not sure I’d move it either. Why? It’s sort-of-not-quite a real file. It’s a bit of magic that covers the FAT boot partition (used by the bootloader) so the RISC OS filesystem doesn’t mess with that part of the disc. A potentially less problematic way to test boot stuff is:
When you reboot, the standard stuff will run, and any of your startup customisations won’t be done. Note that this will reset your monitor setup, as that’s part of PreDesk. Do NOT run the server during the test period. Let’s check the machine itself is working correctly first. If things do not work, then try renaming However, if things do work, then you can – as Paolo suggests – copy stuff back bit by bit. Start with the things in PreDesk, then add the Desktop file, and finally Tasks. Some things ought to be skippable. I’d suggest that you don’t need to worry about individually testing Fat32Fs, RPFS, or !!DeepKeys. Once the system is running and seems stable, you can then add in the server. How are you running ARMbbs and the telnet gateway? Are you doing it via Aemulor? |
nemo (145) 2546 posts |
Renaming your
“Start RISC OS in Safe Mode” perhaps? There was a time when Shift-Boot did that. Those were the days. Mind you there was also a time when holding Shift once the boot started would result in loads of files being loaded instead of run. Like I said, those were the days. |
Rick Murray (539) 13840 posts |
I wasn’t thinking so much the moving out of the way as the putting back… |
Rick Murray (539) 13840 posts |
As for a safe mode RISC OS, doesn’t spamming the Escape key at boot do that? |