RISC OS Open: Forum: 147 / 148 modules crash

Dec 7, 2020 3:17pm

Andrew Rawnsley (492) 1445 posts

Hi folks,

I’m trying to fix a machine, and seeing a very odd behaviour. It may be a faulty board/RAM (imx6) but it’d be a first if that’s the case.

Essentially the machine works absolutely perfectly until it has 147 (OS 5.27 June 2020) or 148 (5.28) modules loaded. After that, anything attempting to load a module generates an abort in the kernel – &FC02CD54 which *where tells me is &0002CD5C in the Kernel.

If alignment exceptions are disabled, instead of an abort, it stiffs the machine.

With alignment exceptions on, the computer remains useable, and will load further applications provided they don’t try and load any modules.

Aemulor is not loaded, and isn’t on the system.

I have tried to ensure all apps are up to date, esp ones with lots of modules such as Zap, StrongEd, NetSurf and so on. Disc image updated with ROOL’s 5.28 and update script, just to be super-sure.

Has anyone seen anything like this?

Dec 7, 2020 3:47pm

Martin Avison (27) 1494 posts

With Align Except on, what does the Tasks display show for the size of the RMA, the amount free and the largest block?

If the Module area can be dragged larger, can extra modules then be loaded?

Dec 7, 2020 3:50pm

Julie Stamp (8365) 474 posts

Are you able to make a debugger dump with annotated stack?

Dec 7, 2020 4:12pm

Andrew Rawnsley (492) 1445 posts

In the meantime, I loaded a fresh set of cmos settings taken from my own system, and the problem seems to have been cured. It was very strange, yet completely reproducible.

If it re-occurs (I’m keeping the machine in for testing for a couple more days, because I don’t know the link between the problem and the solution, and that makes me uncomfortable), I’ll try and provide more info. Are there specific *commands that would be helpful for dumping/debugger?

Martin – I tried increasing RMA substantially, since that seemed a logical direction. Didn’t make any difference. It also seemed to only matter how many modules were loaded, not their size. Killing one and letting an app load another in its place would reproducibly work. It is just possible that the modules were all similarly sized, of course, although I did try to vary things as best I could.

Dec 7, 2020 4:34pm

Chris Johnson (125) 825 posts

Without wishing to muddy the waters, this sounds very much like symptoms I get with an IGEPv5 based machine, when I try to use a reasonably up to date OS. The main symptom I have seen is very similar to yours. The machine appears to boot ok, but once one tries to load further apps which load one or more modules there is a crash located in the kernel. Loading a self contained app which doesn’t need any modules is fine and everything works. Anything loading a module crashes immediately.

I never did track down the cause – my workaround is to stick with a version of the OS which is now more than two years out of date.

Maybe I will flash a recent nightly build and see what happens. It would be useful if one could softload a ROM on the IGEPv5 as you can for the Titanium (which has never misbehaved in this way).

Dec 7, 2020 4:53pm

Doug Webb (190) 1180 posts

I did once have a machine that used to have issues as programs were loaded and I cured this by setting, if I remember rightly, the Country and Territory to UK and it went away.

I think both were unset initially but it was a few years back.

Dec 7, 2020 8:54pm

Chris Johnson (125) 825 posts

I have now run the IGEPv5 with the most recent nightly build. I forgot that the IGEP just has a loader partition, so no reflash needed.

Same problem – standard apps fail to initialise if they load a module from their !Run file (or from in-app code). The error is at address 0xFC01CE44. *Where tells me it is at offset 0x0001CE4C in the kernel.

Counting up the modules that are loaded – lo and behold it comes to 148!

Looks like it is the same problem.

As for Doug’s comment: Country was set to ‘Default’. I changed it to UK before rebooting into the new ROM.

Dec 7, 2020 9:05pm

Chris Johnson (125) 825 posts

> In the meantime, I loaded a fresh set of cmos settings taken from my own system, and the problem seems to have been cured. It was very strange, yet completely reproducible.

I found a much older CMOS file and tried rebooting with that in loader, but there was no improvement.

Dec 7, 2020 9:54pm

Andrew Rawnsley (492) 1445 posts

It might be worth writing all zeros to the cmos file, or otherwise corrupting it, to force the checksums to fail. This will cause the OS to restore defaults, in case there’s something subtly wrong.
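
As an illustration of the "write all zeros" suggestion, here is a minimal Python sketch that overwrites a file with zero bytes while keeping its length. The function name is hypothetical, and the sketch assumes the CMOS settings live in an ordinary file that can be edited from another machine. Whether the OS then rejects the checksum and restores defaults is as described in the post, not something this sketch checks.

```python
import os
import tempfile


def zero_file_in_place(path):
    """Overwrite every byte of the file at `path` with zero, keeping its length.

    Returns the number of bytes written.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(bytes(size))  # bytes(n) is n zero bytes
    return size


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than a real CMOS file.
    fd, demo_path = tempfile.mkstemp()
    os.write(fd, b"\x12\x34\x56\x78")
    os.close(fd)
    print(zero_file_in_place(demo_path), "bytes zeroed")
    os.remove(demo_path)
```

Keeping a copy of the original file before trying this makes it easy to put the old settings back.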

Dec 7, 2020 10:34pm

Julie Stamp (8365) 474 posts

This should give some useful info I hope

Set Debugger$AnnotatedFile SDFS::RISCOSpi.$.Dump
Set Debugger$DumpOptions -file annotated

Obviously set the filename to wherever is convenient. It’ll make a dump when the abort happens.

Dec 7, 2020 11:40pm

Chris Johnson (125) 825 posts

I’ll have another look at this tomorrow when I have more time.

Dec 8, 2020 12:01am

Chris Johnson (125) 825 posts

Having said that, I have had a bit of a play.

A copy of the debugger dump can be found here.

https://www.chris-johnson.org.uk/bits/dump1.zip

(edit – incorrect url first time)

Maybe someone knowledgeable could have a look at it.

It confirms it is failing at module load. An interesting name for a function – ClaimChocolateBlock.

Dec 8, 2020 12:46pm

Chris Johnson (125) 825 posts

I tried simply deleting the CMOS file but that had no effect.

I must say – that was a good bit of lateral thinking by Andrew to see the connection with number of modules loaded. That was what I didn’t see when I was first investigating it. The fact that different apps gave or didn’t give the error depending on the order they were run after bootup confused me, but I never saw the connection. I just assumed that something was screwing up the kernel so that no more modules were being loaded.

Anyway, the debugger dump may give a clue.

Dec 8, 2020 1:25pm

Alan Adams (2486) 1149 posts

I find myself wondering whether loading modules uses some resource, of which 108 entries are already in use, and it’s hit a count of 255?

Dec 8, 2020 1:44pm

Julie Stamp (8365) 474 posts

> Anyway, the debugger dump may give a clue.

It definitely does :-) Can you do

*Save Arrays 30000D44 30004A60

for me, and zip it up like before¹? This saves three chocolate block arrays containing information about modules in your system, the second of which is causing the trouble. Hopefully by seeing what the corruption is, we can tell who corrupted it.

> I find myself wondering whether loading modules uses some resource, of which 108 entries are already in use, and it’s hit a count of 255?

There are 150 chocolate blocks available for loading modules at the moment, but it doesn’t matter if you run out because it will use the system heap after that.

¹ If you don’t want to put system memory dumps on the internet for everybody to see, let me know and we can e-mail.
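
The behaviour described here, a fixed set of preallocated blocks with a fallback to the heap once they are used up, can be sketched in Python as below. This is an illustrative model only, not the kernel's ClaimChocolateBlock code: the class name, method names and handle format are invented for the example, and the figure of 150 slots is the one quoted in the post.

```python
import itertools


class BlockPool:
    """Fixed number of preallocated slots, with a fallback once they run out."""

    def __init__(self, slots):
        self._free = list(range(slots))     # indices of unused preallocated slots
        self._heap_ids = itertools.count()  # unique ids for fallback allocations
        self.heap_in_use = 0                # how many fallback blocks are live

    def claim(self):
        """Return a handle, preferring a preallocated slot when one is free."""
        if self._free:
            return ("pool", self._free.pop())
        self.heap_in_use += 1
        return ("heap", next(self._heap_ids))

    def release(self, handle):
        """Give a handle back; pool slots become available for reuse."""
        kind, index = handle
        if kind == "pool":
            self._free.append(index)
        else:
            self.heap_in_use -= 1


if __name__ == "__main__":
    pool = BlockPool(150)
    handles = [pool.claim() for _ in range(152)]
    print(handles[149][0], handles[150][0], pool.heap_in_use)
```

In a model like this the 151st claim is served from the fallback path rather than failing, which is consistent with the post's point that running out of blocks should be harmless and that the abort comes from corrupted data instead.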

Dec 8, 2020 3:37pm

Steve Pampling (1551) 8170 posts

> chocolate blocks

?? What does the fifth¹ essential food group have to do with modules??

¹ Some people believe the groups fat, fibre, protein and carbohydrate are the full list. More knowledgeable people know that chocolate should be included (possibly as the first and most essential)

Dec 8, 2020 3:56pm

Chris Johnson (125) 825 posts

OK. Rebooted – loaded some apps until errors started. Did the memory grab.

The zip file is at:

https://www.chris-johnson.org.uk/bits/arrays.zip

Dec 11, 2020 6:24pm

Chris Johnson (125) 825 posts

> Can you do
> *Save Arrays 30000D44 30004A60
> for me, and zip it up

Was the array dump of any help in working out what is going on?

Dec 11, 2020 9:19pm

Julie Stamp (8365) 474 posts

:-(

There is not much to go on; I found only two corrupt words.

I’ve sent you RMTest, which will make lots of modules. You could try disabling the boot sequence (e.g. shift-boot), in case something loaded in there is causing the problem, and then use RMTest to see if the abort happens still.

Dec 14, 2020 9:43am

Julie Stamp (8365) 474 posts

Chris and I have now done some tests.

It turns out something is writing a mode selector block into the middle of the chocolate block arrays that are meant for modules.

Does anybody know what might be doing this?

Dec 14, 2020 11:52am

Martin Avison (27) 1494 posts

What values are in the block? – may give some clues.
Are they consistent?

Dec 14, 2020 12:08pm

Chris Johnson (125) 825 posts

It is obviously a rather obscure occurrence, otherwise many more users would be afflicted.

I have now found on the IGEPv5 that if the configure settings of Mode, Wimpmode, and Monitor type are set to Auto (which is what they always have been), the 147/148 module problem occurs. If the mode settings are configured to say Mode 32, and the Monitortype is set to EDID, the problem goes away. The change in behaviour is repeatable.

Dec 14, 2020 12:59pm

Andrew Rawnsley (492) 1445 posts

Chris – That might explain why my resetting cmos fixed it. My default cmos settings on the machine in question would be monitortype 4 and mode 32 etc. I never normally use Auto as historically I’ve not had much luck with it.

However, it doesn’t explain what’s messing things up for you, other than it would appear to be consistent across several different ROM builds (i.e. I’ve seen it on i.MX6, you’ve seen it on IGEP and DavidS had it on a Pi v1). I wonder if it is somehow connected to the EDID interacting with auto “mode” setting. I want to say it is writing a mode block into somewhere where it’d historically do a numbered mode, but by the sound of things, it’s writing it to completely the wrong area of memory, so it is probably a very distinct bug… somewhere.

Dec 14, 2020 1:15pm

Julie Stamp (8365) 474 posts

Watchpoints would be useful for investigating this kind of thing.

Dec 14, 2020 1:24pm

Julie Stamp (8365) 474 posts

> What values are in the block? – may give some clues.
> Are they consistent?

This is it:

DCD 1           ; Mode selector flags - always 1
DCD 640         ; x pixels
DCD 480         ; y pixels
DCD 4           ; bpp = 2^4 = 16
DCD 60          ; Frame rate 60Hz

; Mode variables list, number, then value

DCD 0           ; Assorted flags
DCD &4000       ; TRGB, wiki says "Example use: Iyonix DVI card at 16bpp"

DCD 3           ; Maximum logical colour
DCD &FFFF       ; Corresponds to 16bpp

DCD 4           ; XEigFactor
DCD 1           ; 

DCD 5           ; YEigFactor
DCD 1

DCD -1          ; End of variables list

It’s an exact copy of the block at &30202408 = the address returned by
SYS "OS_ScreenMode",1, except it has been written to &30003E04;
literally just that block – any words after that are fine.

It looks consistent to me, going by the wiki.
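
To make the layout concrete, here is a small Python sketch that packs the fourteen words listed above, decodes them again, and works out where &30003E04 falls inside the range given to *Save earlier in the thread. It assumes little-endian 32-bit words and that the saved length is the end address minus the start address. The function names and the zero-filled stand-in for the dump are made up for the example, and it does not read the real Arrays file.

```python
import struct

# The fourteen words from the listing above, in order.
MODE_BLOCK_WORDS = [1, 640, 480, 4, 60, 0, 0x4000, 3, 0xFFFF, 4, 1, 5, 1, -1]

DUMP_START = 0x30000D44    # first address given to *Save
DUMP_END = 0x30004A60      # second address given to *Save
CORRUPT_ADDR = 0x30003E04  # where the stray copy was reported


def pack_words(words):
    """Pack signed 32-bit words as little-endian bytes."""
    return struct.pack("<%di" % len(words), *words)


def decode_mode_selector(data, offset=0):
    """Decode a mode selector block: five header words, then (index, value)
    pairs terminated by a single -1 word."""
    flags, x, y, log2bpp, rate = struct.unpack_from("<5i", data, offset)
    variables = {}
    pos = offset + 20
    while True:
        (index,) = struct.unpack_from("<i", data, pos)
        if index == -1:
            break
        (value,) = struct.unpack_from("<i", data, pos + 4)
        variables[index] = value
        pos += 8
    return {
        "flags": flags, "x": x, "y": y,
        "bpp": 1 << log2bpp, "rate": rate,
        "variables": variables,
        "length": pos + 4 - offset,  # bytes, including the terminator
    }


block = pack_words(MODE_BLOCK_WORDS)
decoded = decode_mode_selector(block)

# Stand-in for the saved dump: zeros, with the block planted where it was found.
dump = bytearray(DUMP_END - DUMP_START)
planted_at = CORRUPT_ADDR - DUMP_START
dump[planted_at:planted_at + len(block)] = block
found_at = bytes(dump).find(block)

if __name__ == "__main__":
    print(decoded)
    print("offset in dump: &%X" % found_at)
```

With the real dump in place of the zero-filled buffer, the same `find` call would locate any exact copy of the block, which is the comparison described in the post.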

