Is there a known issue with OS_CallAfter ?
Pages: 1 2
Rick Murray (539) 13851 posts |
I have a problem where a module that is supposed to do something at periodic intervals…doesn’t. Has anybody come across something like this before? |
Jon Abbott (1421) 2651 posts |
You can’t rely on callbacks to work as they depend on many factors, such as (IIRC) the Supervisor stack being clear, IRQsema being clear, IRQs being enabled and the SWI handler exiting to User mode. I had no end of issues with them in ADFFS and had to write my own callback handler in the end. You could use TickerV, OS_CallEvery (I think this hangs off TickerV) or RTSupport instead. |
Rick Murray (539) 13851 posts |
CallAfter, not CallBack. This is the “happens on time but don’t mess with anything or the OS will blow up” version. |
Jeffrey Lee (213) 6048 posts |
Are you checking for any errors returned by OS_CallAfter? If you’re calling it from an interrupt context then it’s possible there isn’t enough free memory in the RMA (or is it the system heap?) for it to add the entry (and you can’t resize dynamic areas from IRQ handlers). If the SWI isn’t returning an error, then it’s possible you’re encountering this TickerV corruption bug, which I’m yet to get to the bottom of (I tried adding some debug code to check for it, but either my initial test case wasn’t very good or it’s a very timing-sensitive issue and the addition of the debug code has caused it to not happen). I’m also starting to wonder whether some of the issues people are reporting with NetTime – like your case of it being > 1 day since the last correction – are also a symptom of the TickerV corruption bug. If you have an easy-to-repro test case (i.e. one that happens after a few days rather than one that happens after a few weeks) then I’d love to hear about it! |
Rick Murray (539) 13851 posts |
My server module (started at boot) often fails to respond until it has had a “kick”, though sometimes the kick fails. Thing is, it is an entire application. I’ve not tried to narrow it down. In your other post, you said “It looks like somehow the kernel’s ticker event chain is becoming corrupted. […] so both it and everything that was located after it in the schedule weren’t being processed.” Do you have a tool to dump the contents of the ticker chain? If the thing stops responding (or NetTime), I could look at the chain to see if it made sense? If it does, then the problem lies elsewhere… |
Jeffrey Lee (213) 6048 posts |
DebugTools has a *Tickers command for that very purpose. However, the output wasn’t particularly human-readable (it was basically just a raw dump of the internal ticker chain format), so I’ve now tweaked it to make it a bit better. Grab a build of the latest version of the module here. If the ‘Repeat’ time for an entry is lower than the ‘In …’ heading then that’s a sign that you’ve been struck by the bug. |
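The rule of thumb above can be written as a simple check. This is a minimal sketch, assuming each ticker entry has been transcribed by hand into a `(name, due_in, repeat)` tuple with times in centiseconds. The tuple layout, the entry names and the sample values are hypothetical and are not the actual *Tickers output format; a repeat of 0 is assumed to mean a one-shot CallAfter entry.

```python
def suspicious_entries(entries):
    """Return names of repeating entries that are due later than their own repeat interval."""
    flagged = []
    for name, due_in, repeat in entries:
        # repeat == 0 is assumed to mean a one-shot (CallAfter) entry: skip it.
        if repeat and repeat < due_in:
            flagged.append(name)
    return flagged


sample = [
    ("ExampleA", 40, 100),          # due in 0.4 s, repeats every 1 s: plausible
    ("ExampleB", 4294551774, 100),  # due in roughly 497 days, repeats every 1 s: suspect
    ("ExampleC", 5000, 0),          # one-shot entry, nothing to compare against
]

print(suspicious_entries(sample))   # ['ExampleB']
```

A repeating entry should never be further away than its own interval, so anything this flags is worth a closer look in the real listing.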
Rick Murray (539) 13851 posts |
This is strange. My module isn’t listed. It is called with code like:
(wsp is a copy of the “pw” (privateword) passed during the module initialisation) and:
I have added some extra code to write to DADebug if the reply from CallAfter is non-NULL; though this is normally called in either SVC or USR mode – with the sole exception of the kick routine, everything else needs filesystem access; so the CallAfter schedules a CallBack from which it is safe to do stuff without the dreaded FileCore in use. So, interesting. Why isn’t my module showing up in the list? ;-) |
Martin Avison (27) 1494 posts |
Just tried that on Iyo with RO5.22, and the last entry was and I was wondering why on earth TerritoryManager would need a call after about 375 days ?! |
Jeffrey Lee (213) 6048 posts |
Daylight savings time. (And unless I’m mistaken, that value comes out as 75 days, not 375) |
Rick Murray (539) 13851 posts |
Well, that didn’t take long.
The non-existent server was also ‘dead’, kicking it revived it and appears to have revived part of CoolSwitch, although some of the earlier when-it-worked timings are a bit odd. This is sort of what it should look like, from around 9pm this evening:
|
Rick Murray (539) 13851 posts |
BTW, would this explain why sometimes it seems as if Wimp_PollIdle “gives up” and just behaves like regular Wimp_Poll? |
Rick Murray (539) 13851 posts |
Okay, I have “found” my module in the tickers list:
Interesting. Zap just apologised that it didn’t have enough memory to provide me with a copy of CoolSwitch’s workspace (I have about 180MiB free). |
Rick Murray (539) 13851 posts |
Right, I’ve removed CoolSwitch from my boot and my module is showing up in the tickers list. I’ll leave it running a while to see if the ticker chain times muck up. While I’m doing this, I might change the current behaviour (a CallAfter that schedules a CallBack; a CallBack that does the work and then schedules a new CallAfter at the end) to simply be a CallEvery that will schedule a CallBack if one isn’t already pending… BTW – there is a clash with the DDE and DebugTools. If the module is loaded at boot prior to the DDE being ‘seen’ by the filer, the DDE will fail to boot, syntax error re. *Canonical. |
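The “CallEvery that schedules a CallBack if one isn’t already pending” idea boils down to a single flag. The sketch below only models that flag logic in Python; the class and method names are hypothetical, the list stands in for the OS callback queue, and a real module would register its handlers through OS_CallEvery and OS_AddCallBack instead.

```python
class PeriodicWorker:
    """Toy model of a ticker handler that defers its work to a callback."""

    def __init__(self):
        self.callback_pending = False
        self.callback_queue = []   # stands in for the OS callback list
        self.runs = 0

    def ticker_handler(self):
        """Called on every tick; must stay short and only request a callback."""
        if not self.callback_pending:
            self.callback_pending = True
            self.callback_queue.append(self.callback_handler)

    def callback_handler(self):
        """Runs later, when it is safe to do the real work (e.g. file access)."""
        # Clear the flag first so a tick arriving during the work can re-arm it.
        self.callback_pending = False
        self.runs += 1

    def run_pending_callbacks(self):
        """Stand-in for the point where the OS would dispatch callbacks."""
        while self.callback_queue:
            self.callback_queue.pop(0)()


worker = PeriodicWorker()
for _ in range(3):                  # three ticks before any callback gets to run
    worker.ticker_handler()
print(len(worker.callback_queue))   # 1: only one callback was queued
worker.run_pending_callbacks()
print(worker.runs)                  # 1
```

The point of the flag is that a burst of ticks while callbacks are held off results in one deferred run rather than a pile of them.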
Jeffrey Lee (213) 6048 posts |
Yep, that looks pretty broken! Not sure why I didn’t spot these issues when I first looked at the kernel’s CallAfter/CallEvery code, but here are two I’ve spotted:
If my condition codes are correct then swapping the MOVNE pc, lr for MOVHI pc, lr should fix the unsigned wraparound problem. And the interrupt hole should be fixable by moving the “IRQ’s off again” at line 193 down to after the 10 label (for something like this it’s best for the kernel to play it safe and not assume the user’s routine restored interrupt state correctly). AFAIK the issues with SmartReflex/TickerV only started getting reported in 2014, after the unsigned time changes in 2013 – so rather than doing any more exhaustive testing myself, maybe it would be enough to make those changes, check them in and wait to see if people report any further issues. (The other alternative I can think of would be to make a testbed containing a copy of the kernel’s ticker code and manually call it recursively at various points to see what issues pop up – but if I’m fixing the only interrupt hole I can spot then I don’t really know where else I’d be putting the checks!)
Not sure – I think the Wimp maintains the time itself for Wimp_PollIdle. |
Jeffrey Lee (213) 6048 posts |
Interesting – perhaps the DDE provides its own version of that command? There’s also *Where, which is now present in both the Debugger and DebugTools (Debugger’s version is better). |
Rick Murray (539) 13851 posts |
;-) For the moment, it is behaving. For the moment. That said, stuff thinking CoolSwitch had an infinite workspace indicates that something was wrong!
Okay, hands up every programmer that has had a moment like that. Sometimes, the best way to deal with (potentially) problematic code is to leave it, do other stuff, and come back a while later.
Thanks. I’ll keep an eye on the CVS and do a diff to modify my kernel. I’m running a build from last October (as I never got CVS to work for me and it is really time consuming unpacking the gzip) because I depend upon CE(S)T in the UK territory, and CLib dealing with it correctly.
AcornC/C++.!SetPaths.Lib32.canonical (utility) |
Colin (478) 2433 posts |
I have a directory containing !cvs and the taskobey file ‘cvsfetch’ as shown below
The path after ‘co’ is the path of the directory you want to download. Just double click on cvsfetch and the path is downloaded to the same directory as cvsfetch. |
Jeffrey Lee (213) 6048 posts |
I think the recommendation is to use ‘cvs -z9’ to ensure the connection is compressed (to save ROOL on some bandwidth costs). With a bit of work it’s also possible to get ROOL’s Perl CVS scripts working under RISC OS: https://www.riscosopen.org/forum/forums/5/topics/313?page=3#posts-5842 |
Rick Murray (539) 13851 posts |
Thanks. I wrapped that in a TaskWindow call, and added “-z 9”, and it worked. ;-) But… A question and an observation. The question – can I download the Pi build, or does this mean checking out everything piece-by-piece-by-piece? The observation – remember when I said it was not entirely realistic to end the ZeroPain trial period at the end of the year? Well, “cvs” crashes unless alignment faults are disabled, so such old software as that is still kicking around. |
Jeffrey Lee (213) 6048 posts |
That’s what the Perl scripts are for – see here for some slightly crappy instructions on how to get them (this guide is perhaps a bit better, but it assumes you’re setting up for write access). If you have the Perl scripts you can just do “checkout BCM2835Dev” to grab all the components (although updating an existing source tree is a bit trickier). |
Colin (478) 2433 posts |
I think you need Jeffrey’s Perl scripts for that. I think they read the components file and module database that !builder uses to automate a build fetch. I just fetch the odd part via CVS. |
Rick Murray (539) 13851 posts |
Just had a quick look at the ticker handler code and I noticed that the handler is called using BLX – does this suggest that the code could be written in Thumb? |
Jeffrey Lee (213) 6048 posts |
Yes, although there are no guarantees that it will work, or which OS versions it will work on – the BLX was added there as a micro-optimisation to save one instruction, not with the goal of enabling Thumb support. Even without the BLX, Thumb could work on ARMv7 anyway due to all branches being interworking. |
Rick Murray (539) 13851 posts |
Happened again this morning, about half an hour after (yet another brief) power cut. The huge values – while BASIC and RISC OS itself very helpfully say “Number too big” when attempting to convert 4294551774 into hex, going the other way via OS_ConvertCardinal4 tells me that &FFFFFFFF is 4294967295 – so if they all have values “about like this” as their trigger time, it appears as if the times somehow got erroneously calculated away from zero, to give a massive number. I’m going to apply your changes to my kernel, I’ll get back to you on its behaviour. Fingers crossed! ;-) |
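The conversion that BASIC refused can be done by hand. The sketch below uses the one value quoted above; the “Number too big” error is presumably because the value does not fit in a signed 32-bit integer (an assumption about the conversion, not something confirmed here). Viewed as signed, the same bit pattern is a smallish negative number, which fits the idea of a time that was calculated past zero.

```python
value = 4294551774
max_u32 = 0xFFFFFFFF               # 4294967295, as reported by OS_ConvertCardinal4

print(hex(value))                  # 0xfff9a8de
print(max_u32 - value)             # 415521 below &FFFFFFFF

# Reinterpret the same 32 bits as a signed integer.
signed = value - (1 << 32) if value & 0x80000000 else value
print(signed)                      # -415522
```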
Colin (478) 2433 posts |
There’s no need with the changes. The high values were caused when the head of the list had a delta value of 0. After 1 was taken from it it became &FFFFFFFF – a valid high value. The seemingly random high numbers you are getting are the &FFFFFFFF value being counted down at cs intervals. Jeffrey’s changes will avoid this happening. |
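The failure described above can be reproduced with a toy delta list. This is a simplified Python model and not the kernel source: the function name and list layout are hypothetical, and the two branches only imitate the effect of an “NE” test versus a “HI” test after a 32-bit unsigned subtract of 1.

```python
MASK32 = 0xFFFFFFFF


def tick(deltas, fixed):
    """Advance a delta list by one centisecond and return how many entries fired.

    deltas[0] is the time until the first entry; each later value is the
    extra delay after the entry before it.
    """
    if not deltas:
        return 0
    borrow = deltas[0] == 0                 # 0 - 1 needs a borrow
    deltas[0] = (deltas[0] - 1) & MASK32    # 32-bit unsigned subtract
    if fixed:
        # "HI"-style test: keep waiting only if there was no borrow
        # and the result is non-zero.
        waiting = (not borrow) and deltas[0] != 0
    else:
        # "NE"-style test: keep waiting whenever the result is non-zero,
        # which wrongly includes the wrapped value &FFFFFFFF.
        waiting = deltas[0] != 0
    if waiting:
        return 0
    fired = 1
    deltas.pop(0)
    # Entries queued for the same instant have a delta of 0 and fire together.
    while deltas and deltas[0] == 0:
        deltas.pop(0)
        fired += 1
    return fired


stuck = [0, 5, 10]
tick(stuck, fixed=False)
print(hex(stuck[0]))        # 0xffffffff: the head wrapped, everything behind it stalls

ok = [0, 5, 10]
print(tick(ok, fixed=True)) # 1: the head entry is treated as due
print(ok)                   # [5, 10]
```

In the unfixed model the wrapped head then counts down by one per tick, which would match the huge but slowly decreasing values seen in the tickers list.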