147 / 148 modules crash
Pages: 1 2
Andrew Rawnsley (492) 1445 posts |
Hi folks, I’m trying to fix a machine, and seeing a very odd behaviour. It may be a faulty board/RAM (imx6) but it’d be a first if that’s the case. Essentially the machine works absolutely perfectly until it has 147 (OS 5.27 June 2020) or 148 (5.28) modules loaded. After that, anything attempting to load a module generates an abort in the kernel – &FC02CD54 which *where tells me is &0002CD5C in the Kernel. If alignment exceptions are disabled, instead of an abort, it stiffs the machine. With alignment exceptions on, the computer remains useable, and will load further applications provided they don’t try and load any modules. Aemulor is not loaded, and isn’t on the system. I have tried to ensure all apps are up to date, esp ones with lots of modules such as Zap, StrongEd, NetSurf and so on. Disc image updated with ROOL’s 5.28 and update script, just to be super-sure. Has anyone seen anything like this? |
Martin Avison (27) 1494 posts |
With alignment exceptions on, what does the Tasks display show for the size of the RMA, the amount free and the largest block? If the Module area can be dragged larger, can extra modules then be loaded? |
Julie Stamp (8365) 474 posts |
Are you able to make a debugger dump with annotated stack? |
Andrew Rawnsley (492) 1445 posts |
In the meantime, I loaded a fresh set of CMOS settings taken from my own system, and the problem seems to have been cured. It was very strange, yet completely reproducible. If it re-occurs (I’m keeping the machine in for testing for a couple more days, because I don’t know the link between the problem and the solution, and that makes me uncomfortable), I’ll try and provide more info. Are there specific *commands that would be helpful for dumping/debugger? Martin – I tried increasing RMA substantially, since that seemed a logical direction. Didn’t make any difference. It also seemed to only matter how many modules were loaded, not their size. Killing one and letting an app load another in its place would reproducibly work. It is just possible that the modules were all similarly sized, of course, although I did try to vary things as best I could. |
Chris Johnson (125) 825 posts |
Without wishing to muddy the waters, this sounds very much like symptoms I get with an IGEPv5 based machine, when I try to use a reasonably up to date OS. The main symptom I have seen is very similar to yours. The machine appears to boot ok, but once one tries to load further apps which load one or more modules there is a crash located in the kernel. Loading a self contained app which doesn’t need any modules is fine and everything works. Anything loading a module crashes immediately. I never did track down the cause – my workaround is to stick with a version of the OS which is now more than two years out of date. Maybe I will flash a recent nightly build and see what happens. It would be useful if one could softload a ROM on the IGEPv5 as you can for the Titanium (which has never misbehaved in this way). |
Doug Webb (190) 1180 posts |
I did once have a machine that used to have issues as programs were loaded and I cured this by setting, if I remember rightly, the Country and Territory to UK and it went away. I think both were unset initially but it was a few years back. |
Chris Johnson (125) 825 posts |
I have now run the IGEPv5 with the most recent nightly build. I forgot that the IGEP just has a loader partition, so no reflash needed. Same problem – standard apps fail to initialise if they load a module from their !Run file (or from in-app code). The error is at address 0xFC01CE44. *Where tells me it is at offset 0x0001CE4C in the kernel. Counting up the modules that are loaded – lo and behold it comes to 148! Looks like it is the same problem. As far as Doug’s comment: Country was set to ‘Default’. I changed it to UK before rebooting into the new ROM. |
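As a quick cross-check of the two reports so far, subtracting the *Where offset from the abort address gives the same value on both machines, even though the ROM builds differ. The snippet below is only arithmetic on the numbers quoted in the thread; the names `reports` and `implied_base` are made up for illustration, and what that base actually corresponds to is not stated anywhere in the thread.

```python
# Abort address and *Where offset as quoted in the posts above.
reports = {
    "i.MX6 (first post)": (0xFC02CD54, 0x0002CD5C),
    "IGEPv5 (nightly build)": (0xFC01CE44, 0x0001CE4C),
}


def implied_base(abort_address, where_offset):
    """Address that the *Where offset appears to be measured from."""
    return abort_address - where_offset


# Both entries come out as 0xfbfffff8.
bases = {name: hex(implied_base(addr, off)) for name, (addr, off) in reports.items()}

if __name__ == "__main__":
    for name, base in bases.items():
        print(name, base)
```

The offsets differ between the two builds, which is what you would expect from different ROM images, but the matching base suggests both aborts are being reported relative to the same place.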
Chris Johnson (125) 825 posts |
I found a much older CMOS file and tried rebooting with that in loader, but there was no improvement. |
Andrew Rawnsley (492) 1445 posts |
It might be worth writing all zeros to the cmos file, or otherwise corrupting it, to force the checksums to fail. This will cause the OS to restore defaults, in case there’s something subtly wrong. |
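For anyone who would rather script the "all zeros" step from another machine with the loader partition mounted, here is a minimal sketch. It assumes the CMOS settings live in an ordinary file you can reach; the function name `zero_fill` is hypothetical, and you supply the real path yourself. Whether the OS then falls back to defaults is as described in the post above, not something this code checks. Keep a copy of the original file first.

```python
import os


def zero_fill(path):
    """Overwrite the file at `path` with zero bytes, keeping its length.

    Returns the number of bytes written.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:  # open for update without truncating
        f.write(bytes(size))      # bytes(n) is n zero bytes
    return size
```

Keeping the length unchanged means the file still looks like a settings file of the expected size, so only its contents are invalidated.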
Julie Stamp (8365) 474 posts |
This should give some useful info I hope
Obviously set the filename to wherever is convenient. It’ll make a dump when the abort happens. |
Chris Johnson (125) 825 posts |
I’ll have another look at this tomorrow when I have more time. |
Chris Johnson (125) 825 posts |
Having said that, I have had a bit of a play. A copy of the debugger dump can be found here. https://www.chris-johnson.org.uk/bits/dump1.zip (edit – incorrect url first time) Maybe someone knowledgeable could have a look at it. It confirms it is failing at module load. An interesting name for a function – ClaimChocolateBlock. |
Chris Johnson (125) 825 posts |
I tried simply deleting the CMOS file but that had no effect. I must say – that was a good bit of lateral thinking by Andrew to see the connection with number of modules loaded. That was what I didn’t see when I was first investigating it. The fact that different apps gave or didn’t give the error depending on the order they were run after bootup confused me, but I never saw the connection. I just assumed that something was screwing up the kernel so that no more modules were being loaded. Anyway, the debugger dump may give a clue. |
Alan Adams (2486) 1149 posts |
I find myself wondering whether loading modules uses some resource, of which 108 entries are already in use, and it’s hit a count of 255? |
Julie Stamp (8365) 474 posts |
It definitely does :-) Can you do
for me, and zip it up like before [1]? This saves three chocolate block arrays containing information about modules in your system, the second of which is causing the trouble. Hopefully by seeing what the corruption is, we can tell who corrupted it.
There are 150 chocolate blocks available for loading modules at the moment, but it doesn’t matter if you run out because it will use the system heap after that. [1] If you don’t want to put system memory dumps on the internet for everybody to see, let me know and we can e-mail. |
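The "fixed pool first, heap afterwards" behaviour described above can be sketched roughly like this. It is a toy model only, not the kernel's code: the class `BlockPool`, its methods and the `heap_alloc` callback are all hypothetical names, and the only figure taken from the thread is the 150-slot limit.

```python
class BlockPool:
    """Toy model of a fixed array of equal-sized slots with a heap fallback."""

    def __init__(self, slots, heap_alloc):
        self._free = list(range(slots))  # indices of unused slots
        self._heap_alloc = heap_alloc    # called only when the pool is full

    def claim(self):
        """Return ("pool", slot_index) while slots remain, else ("heap", obj)."""
        if self._free:
            return ("pool", self._free.pop())
        return ("heap", self._heap_alloc())

    def release(self, handle):
        """Give a pool slot back; heap objects are left to their own allocator."""
        kind, ref = handle
        if kind == "pool":
            self._free.append(ref)


if __name__ == "__main__":
    pool = BlockPool(150, heap_alloc=dict)
    handles = [pool.claim() for _ in range(151)]
    print(handles[149][0], handles[150][0])  # pool heap
```

In a scheme like this, running out of slots is harmless by design, which is why a failure that appears only near the limit points at damaged bookkeeping rather than at exhaustion.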
Steve Pampling (1551) 8172 posts |
?? What does the fifth [1] essential food group have to do with modules?? [1] Some people believe the groups fat, fibre, protein and carbohydrate are the full list. More knowledgeable people know that chocolate should be included (possibly as the first and most essential). |
Chris Johnson (125) 825 posts |
OK. Rebooted – loaded some apps until errors started. Did the memory grab. The zip file is at: |
Chris Johnson (125) 825 posts |
Was the array dump of any help in working out what is going on? |
Julie Stamp (8365) 474 posts |
:-( There is not much to go on; I found only two corrupt words. I’ve sent you RMTest, which will make lots of modules. You could try disabling the boot sequence (e.g. shift-boot), in case something loaded in there is causing the problem, and then use RMTest to see if the abort happens still. |
Julie Stamp (8365) 474 posts |
Chris and I have now done some tests. It turns out something is writing a mode selector block into the middle of the chocolate block arrays that are meant for modules. Does anybody know what might be doing this? |
Martin Avison (27) 1494 posts |
What values are in the block? – may give some clues. |
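To read the values out of a saved dump, a small decoder along these lines may help. It assumes the classic mode selector layout as I understand it from the PRM: five little-endian words (flags with bit 0 set, x resolution, y resolution, pixel depth, frame rate) followed by pairs of mode variable index and value, terminated by -1. The function name `decode_mode_selector` is hypothetical, and newer format variants that encode more in the flags word are not handled.

```python
import struct


def decode_mode_selector(data, offset=0):
    """Decode a mode selector block found at `offset` in `data` (bytes)."""
    flags, xres, yres, depth, rate = struct.unpack_from("<IIIii", data, offset)
    if not flags & 1:
        raise ValueError("bit 0 of the flags word is clear: not a mode selector")

    variables = []
    pos = offset + 20  # first (index, value) pair follows the five header words
    while True:
        (index,) = struct.unpack_from("<i", data, pos)
        if index == -1:  # terminator word
            break
        (value,) = struct.unpack_from("<I", data, pos + 4)
        variables.append((index, value))
        pos += 8

    return {
        "flags": flags,
        "xres": xres,
        "yres": yres,
        "depth": depth,  # log2 of bits per pixel in the classic format
        "frame_rate": rate,
        "variables": variables,
        "length": pos + 4 - offset,  # bytes occupied, including the terminator
    }
```

The returned `length` shows how many bytes a stray copy of the block would overwrite, which can be compared against the extent of the corruption seen in the chocolate block arrays.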
Chris Johnson (125) 825 posts |
It is obviously a rather obscure occurrence, otherwise many more users would be afflicted. I have now found on the IGEPv5 that if the configure settings of Mode, WimpMode and MonitorType are set to Auto (which is what they always have been), the 147/148 module problem occurs. If the mode settings are configured to, say, Mode 32, and the MonitorType is set to EDID, the problem goes away. The change in behaviour is repeatable. |
Andrew Rawnsley (492) 1445 posts |
Chris – That might explain why my resetting CMOS fixed it. My default CMOS settings on the machine in question would be monitortype 4 and mode 32 etc. I never normally use Auto as historically I’ve not had much luck with it. However, it doesn’t explain what’s messing things up for you, other than it would appear to be consistent across several different ROM builds (i.e. I’ve seen it on i.MX6, you’ve seen it on IGEP and DavidS had it on a Pi v1). I wonder if it is somehow connected to the EDID interacting with auto “mode” setting. I want to say it is writing a mode block into somewhere where it’d historically do a numbered mode, but by the sound of things, it’s writing it to completely the wrong area of memory, so it is probably a very distinct bug… somewhere. |
Julie Stamp (8365) 474 posts |
Watchpoints would be useful for investigating this kind of thing. |
Julie Stamp (8365) 474 posts |
This is it:
It’s an exact copy of the block at &30202408 = the address returned by It looks consistent to me, going by the wiki. |