RISC OS Open: Forum: kernel error with BASIC program

Oct 24, 2011 10:03am

Terry Swanborough (455) 53 posts

I have been reading bytes from the serial port using the SYS “OS_Hardware”
type commands with some success, but after running the code for a few days
I get a kernel error. Please see below for details.

The strange thing is that I have now REMed out the serial port commands (see
lines 260 and 280), so this function now only creates a
LOCAL variable and then returns -1. I ran this program over the weekend
and got the same error. This is running on a BeagleBoard-xM.

I have been downloading the latest ROMs as they become available and the
error has appeared on about the last 6 versions. I will try removing the
LOCAL variable to see if that changes the outcome. The trouble is that it takes
a few days of running to generate the error.

I have looked at the program and can’t see any errors, so I don’t think I
have done anything “silly”, but you never know :-)

I have had to use the # list command to get the program to list OK.
Any advice as to how I add program listings?

  10ON ERROR PROCset_baud(115200):PRINT ERL:PRINT REPORT$:END
  21DIM status% 1
  30Port%=0
  40HAL_UARTLineStatus=71
  50HAL_UARTReceiveByte=69
  60HAL_UARTRate=73
  70
  80PROCset_baud(300)
  90
  100char%=FNPROCReceiveSerial
  110PRINT char%
  120GOTO 100

  150DEFPROCset_baud(rate%)
  160REM set baud rate
  170SYS "OS_Hardware",Port%,rate%*16,,,,,,,0,HAL_UARTRate
  180REM -------------
  190ENDPROC

  240DEFFNPROCReceiveSerial
  250LOCAL linestatus
  260REM SYS "OS_Hardware",Port%,status%,,,,,,,0,HAL_UARTLineStatus TO linestatus
  270IF (linestatus AND 1) =1 THEN
  280REM SYS "OS_Hardware",Port%,status%,,,,,,,0,HAL_UARTReceiveByte TO c%
  290ELSE
  300c%=-1
  310ENDIF
  320=c%

*where
Address &FC026404 is in the Kernel
*showregs
Register dump (stored at &2000E8F0) is:
R0 = FC04B966 R1 = 00000006 R2 = FFFFFFFF R3 = 00000000
R4 = 00000000 R5 = FA200048 R6 = FA200250 R7 = 00000800
R8 = A0000113 R9 = 00000000 R10 = 00000013 R11 = 00000000
R12 = A0000113 R13 = FA200010 R14 = FC025C24 R15 = FC026404
Mode SVC32 flags set: NzcvqjggggeAift PSR = 80000113
*

Oct 24, 2011 11:03am

Jeffrey Lee (213) 6048 posts

I have had to use the # list command to get the program to list OK
any advice as to how I add program listings.

If you place the code in <pre> tags then it should work.

Were you running the program from the command line, a task window, or just double-clicking a BASIC file? I’ll start it running tonight so I can look into the bug.

Oct 24, 2011 11:11am

Terry Swanborough (455) 53 posts

I am running the code just by double-clicking the BASIC file from the desktop.

Oct 31, 2011 10:05pm

Jeffrey Lee (213) 6048 posts

How long does it usually take to crash? My board has been running that program for a week now and is still going strong.

Maybe I should be using one of the downloadable ROM images instead of one I’ve built myself.

Nov 1, 2011 8:55am

Terry Swanborough (455) 53 posts

I ran the code again last week at work and it ran all week, but it failed during the weekend with the same kernel error. Strangely, it always seems to crash over the weekend. I can’t see anything else hardware-wise causing the problem, as there is a lot less going on at work during the weekend. Also, the desktop always recovers after the crash. I am interested in using RISC OS for a commercial project at work, replacing our current range of radio nurse call systems with a more powerful version, so I am keen to track this problem down, as our products run 24/7 once they leave us.

I can’t believe that the clock has anything to do with this error, but I always set it to the correct time before I run the program. What I will do next is change the clock so that the weekend occurs during the week :-) You can see I am now grasping at straws. I think the ROM I am using was downloaded about a month ago. Is there a way of identifying the ROM?

Nov 1, 2011 9:51am

Terry Swanborough (455) 53 posts

I also have an original BeagleBoard (non-xM) and I can’t remember ever seeing
this error on it. I am going to buy another xM board from Farnell, as I am beginning
to suspect my current board. I will do the same tests and report back.

Nov 1, 2011 1:22pm

Jeffrey Lee (213) 6048 posts

I think the rom I am using I downloaded about a month ago is there a way of identifying the rom?

For recent ROMs (any built since the 2nd of August), *FX 0 will show the ROM build date. If you can let me know when your ROM was built then I can probably work out where in the kernel it crashed just by using the register dump that you’ve given. Hopefully that will be enough to allow us to work out what the problem is.

Nov 1, 2011 1:29pm

Terry Swanborough (455) 53 posts

*FX0 returns (29th sep 2011)

Nov 1, 2011 1:31pm

Jeffrey Lee (213) 6048 posts

OK, I’ll have a look tonight and let you know how I get on.

Nov 1, 2011 8:31pm

Jeffrey Lee (213) 6048 posts

I’m surprised I didn’t spot this earlier, but it looks like it’s crashing due to a supervisor stack overflow.

From looking at your register dump I can see that it’s crashing inside OS_ReadVarVal, trying to lookup FileSwitch$CurrentFilingSystem (pointed to by R0). But that doesn’t really help us work out what’s causing the stack overflow.

The next time it crashes, if you could save a copy of the supervisor stack using “*save svcstack fa200000 + 8000” and then upload it somewhere or email it to me then that should allow us to work out what the problem is. A copy of the output of *modules would be useful too.
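As a rough cross-check of the diagnosis, the status word and stack pointer from the register dump can be decoded by hand. The sketch below uses Python purely as a calculator. It assumes the standard ARM PSR bit layout and, going by the address range in the *save command, that the supervisor stack occupies &FA200000 to &FA208000 and grows downwards. The function and constant names are made up for illustration and are not RISC OS APIs.

```python
# Decode a PSR value and estimate supervisor stack headroom from a register dump.
# Assumptions: standard ARM PSR layout; SVC stack region taken from the
# "*save svcstack fa200000 + 8000" command, with the stack growing downwards.

MODES = {0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
         0x17: "ABT", 0x1B: "UND", 0x1F: "SYS"}

SVC_STACK_BASE = 0xFA200000   # lowest address of the saved region (assumed stack limit)
SVC_STACK_SIZE = 0x8000       # length passed to *save


def decode_psr(psr):
    """Split a PSR word into condition flags, mask bits and processor mode."""
    return {
        "N": bool(psr & (1 << 31)),
        "Z": bool(psr & (1 << 30)),
        "C": bool(psr & (1 << 29)),
        "V": bool(psr & (1 << 28)),
        "async_abort_disabled": bool(psr & (1 << 8)),
        "irq_disabled": bool(psr & (1 << 7)),
        "fiq_disabled": bool(psr & (1 << 6)),
        "thumb": bool(psr & (1 << 5)),
        "mode": MODES.get(psr & 0x1F, "unknown"),
    }


def stack_headroom(sp):
    """Bytes remaining between the stack pointer and the bottom of the region."""
    return sp - SVC_STACK_BASE


if __name__ == "__main__":
    print(decode_psr(0x80000113))                      # PSR from the dump
    print("bytes left below R13:", stack_headroom(0xFA200010))
    print("bytes in use:", SVC_STACK_SIZE - stack_headroom(0xFA200010))
```

Under those assumptions, an R13 only a few bytes above the bottom of the region is what an almost completely full supervisor stack would look like.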

Nov 1, 2011 8:34pm

Steve Revill (20) 1361 posts

Don’t forget the DebugTools module and its *Where command… (assuming that works on ARMv7)

Nov 2, 2011 3:27pm

Terry Swanborough (455) 53 posts

Next time I get it to crash I will collect as much
information as possible

Nov 7, 2011 9:56am

Terry Swanborough (455) 53 posts

Just a quick update: it did crash again.
I have emailed the details to Jeffrey.

Nov 21, 2011 10:01am

Terry Swanborough (455) 53 posts

Another quick update: I have tried a new xM board
running the ROM dated 14/11/11 and it still crashes
after a few days of running, with the error below.
I have also saved the supervisor stack as per
Jeffrey’s advice, in case it’s needed.

*showregs
Register dump (stored at &2000E8F0) is:
R0 = FC04B966 R1 = 00000006 R2 = FFFFFFFF R3 = 00000000
R4 = 00000000 R5 = FA200048 R6 = FA200250 R7 = 00000800
R8 = A0000113 R9 = 00000000 R10 = 00000013 R11 = 00000000
R12 = A0000113 R13 = FA200010 R14 = FC025C24 R15 = FC026404
Mode SVC32 flags set: NzcvqjggggeAift PSR = 80000113
*where
Address &FC026404 is in the Kernel

Nov 21, 2011 1:47pm

Jeffrey Lee (213) 6048 posts

I completely forgot about this :-(

I’ll look into it tonight for you.

Nov 22, 2011 1:37am

Jeffrey Lee (213) 6048 posts

I’m still working on unwinding the stack, but it looks like the stack overflow is caused by the Internet & DHCP modules getting into a state where they keep sending service calls to one another. Specifically, it looks like the modules are trying to cope with an IP address change, which suggests that the problem could be triggered by the machine’s DHCP lease expiring. Do you know how long your DHCP server leases IP addresses for? (*DHCPInfo should tell you when the lease was obtained, and when it’s due to expire). Or is there anything which could be screwing with your network over the weekend?

When I was doing the testing I was running with a static IP setup, but I’ve switched to using DHCP now to see if that triggers the bug. My router seems to use a 24 hour lease time, so sometime tomorrow I should know whether the DHCP lease expiring is enough to trigger the bug.

Nov 22, 2011 8:10am

Terry Swanborough (455) 53 posts

I think you are on the right track. Over the weekend our Internet connection is
switched off (it’s on a separate power ring), so the BeagleBoard would have made a connection via DHCP first thing in the morning, and then we switch off the box but leave the BeagleBoard running. The use I have in mind for the BeagleBoard will not involve the Internet, so this probably solves my problem, but it is in everybody’s interest for this bug to be tracked down. Many thanks for your detective work.

Nov 22, 2011 1:39pm

Steffen Huber (91) 1953 posts

I am no DHCP expert – what does the DHCP protocol demand if a client wants to renew its lease and the original address provider (DHCP server) is no longer available? I could imagine that this scenario was possibly not tested properly with the RISC OS Internet and DHCP module.

Nov 23, 2011 4:24pm

Mark Scholes (148) 2 posts

According to the spec:

http://tools.ietf.org/html/rfc2131#section-4.4.5

“If the lease expires before the client receives a DHCPACK, the client
moves to INIT state, MUST immediately stop any other network
processing and requests network initialization parameters as if the
client were uninitialized.”

It first tries to renew with the current DHCP server, then tries to rebind with any DHCP server, and then should go back to whatever it does with no DHCP server.
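The renew, rebind and expire sequence can be sketched as a function of time since the lease was granted. This is a minimal Python sketch, assuming the default timer fractions from RFC 2131 (T1 at half the lease, T2 at 0.875 of it) and a server that never replies, which is the weekend situation described above. The function name and state labels are illustrative only; this is not the RISC OS DHCP module's code.

```python
# Which state an RFC 2131 client should be in, if no server ever answers.
# T1 and T2 default to 0.5 and 0.875 of the lease duration.

def dhcp_state(elapsed, lease, t1_fraction=0.5, t2_fraction=0.875):
    """Return the client state 'elapsed' seconds into a lease of 'lease' seconds."""
    if elapsed < lease * t1_fraction:
        return "BOUND"       # lease is fresh, nothing to do
    if elapsed < lease * t2_fraction:
        return "RENEWING"    # unicast requests to the server that granted the lease
    if elapsed < lease:
        return "REBINDING"   # broadcast requests to any server
    return "INIT"            # lease expired: stop using the address, start again


if __name__ == "__main__":
    lease = 24 * 60 * 60     # the 24 hour lease mentioned earlier in the thread
    for hours in (1, 12, 21, 24):
        print(hours, "h:", dhcp_state(hours * 3600, lease))
```

With the server switched off for a whole weekend, a client following the spec would pass through all four states at least once.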

Nov 24, 2011 11:55pm

Jeffrey Lee (213) 6048 posts

I’m still not sure how it gets stuck in the loop, but I’ve now got a fairly good idea of why it can’t get out of it:

1. The DHCP module receives an “IP address changed” service call
2. The DHCP module checks the list of interfaces that it manages and spots one that isn’t using the IP address it’s expected to be using
3. It assumes the interface has been manually reconfigured, so attempts to send a packet to the DHCP server to release the lease on the old address
4. Sending the packet fails
5. The DHCP module responds to this by releasing and recreating the socket that it uses to communicate with the Internet module (the request to send a packet to the DHCP server was actually sent via this socket – so the module assumes it’s this socket which is at fault)
6. But as part of recreating the socket, it also creates the loopback interface, by setting the interface IP address to 127.0.0.1
7. The Internet module detects this request to change an interface IP address, acts on it, and then sends an “IP address changed” service call
8. Go to 1
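The loop above can be modelled as a set of functions that call each other and never return. This is a toy Python model, not the real module code: the function names, the 256 byte cost per nested service call and the 0x8000 byte stack size are all assumptions chosen for illustration.

```python
# Toy model of the service call loop: each pass through steps 1-7 nests one
# level deeper, so the (simulated) supervisor stack eventually runs out.

FRAME_BYTES = 256      # hypothetical stack cost of one nested service call
STACK_BYTES = 0x8000   # assumed supervisor stack size


class StackOverflow(Exception):
    """Raised when the simulated stack is exhausted."""
    def __init__(self, depth):
        super().__init__(f"stack exhausted after {depth} nested service calls")
        self.depth = depth


def dhcp_on_address_changed(depth):
    # Step 1: the DHCP module receives the "IP address changed" service call.
    if depth * FRAME_BYTES >= STACK_BYTES:
        raise StackOverflow(depth)
    # Steps 2-3: an interface looks reconfigured, so try to release its lease.
    if not send_release_packet():
        # Step 5: blame the socket and recreate it.
        recreate_socket(depth)


def send_release_packet():
    # Step 4: in the failing scenario the send never succeeds.
    return False


def recreate_socket(depth):
    # Step 6: recreating the socket also sets the loopback address.
    internet_set_address(depth)


def internet_set_address(depth):
    # Step 7: the Internet module announces the change, and we are back at step 1.
    dhcp_on_address_changed(depth + 1)


def nested_calls_until_overflow():
    try:
        dhcp_on_address_changed(0)
    except StackOverflow as overflow:
        return overflow.depth
    return None


if __name__ == "__main__":
    print("overflow after", nested_calls_until_overflow(), "nested service calls")
```

Nothing in the model ever returns to its caller, which is why the problem shows up as a stack overflow and not as a hang.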

So, there are a few points of failure here:

1. The “IP address changed” service call doesn’t indicate which interface was changed. If this information was present then the DHCP module could easily ignore any changes that were made to the loopback interface.
2. After the DHCP module sends the packet to the DHCP server, it removes the interface from the list of interfaces that it manages. If the two operations were reordered, it would ensure the module never tries to recursively release the same interface.
3. The DHCP module probably doesn’t need to reset the loopback interface each time it recreates the socket it uses to communicate with the Internet module.
4. For some reason, sending the DHCP release packet always fails!

To fix the crash, I’ll probably go with fixing point 2 in the list above. But instead of crashing, you’ll now just end up with no DHCP. So we also need to work out why the DHCP module thinks the interface needs to be released. Plus there’s the question of why sending the DHCP release packet always fails. I’ll have another go at recreating the crash here, but if that fails I think I might have to send you a debug version of the DHCP module so we can see what the root cause of the problem is.
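The effect of reordering the two operations in point 2 can be shown with a small Python model. It is a sketch under stated assumptions: the interface name "eth0", the depth limit and every function name are hypothetical, and the failed send is collapsed straight into the service call it ends up triggering.

```python
# Compare "send release, then remove from list" with "remove, then send release".
# In both cases the send fails and leads to another "address changed" call.

class TooDeep(Exception):
    """Stands in for the supervisor stack overflow."""


def simulate(remove_before_send, max_depth=50):
    managed = ["eth0"]   # hypothetical interface managed by the DHCP module
    calls = 0

    def on_address_changed(depth):
        nonlocal calls
        calls += 1
        if depth >= max_depth:
            raise TooDeep()
        for iface in list(managed):
            if remove_before_send:
                managed.remove(iface)        # proposed order
                send_release(depth)
            else:
                send_release(depth)          # original order
                managed.remove(iface)        # never reached while recursing

    def send_release(depth):
        # The send fails, the socket is recreated, the loopback address is set,
        # and the Internet module issues the service call again.
        on_address_changed(depth + 1)

    try:
        on_address_changed(0)
    except TooDeep:
        return ("overflow", calls)
    return ("ok", calls)


if __name__ == "__main__":
    print("original order: ", simulate(remove_before_send=False))
    print("reordered:      ", simulate(remove_before_send=True))
```

With the removal done first, the nested call finds an empty list and returns immediately. As the post notes, that only stops the recursion; the interface is still left without a DHCP lease.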

Nov 28, 2011 11:07am

Terry Swanborough (455) 53 posts

Just for confirmation I ran the beagle board over the weekend with the internet TCP/IP protocol switched off and everything was still running OK on Monday morning.
