socket_select crash
Alan Adams (2486) 1149 posts |
Part of my rather complex system (written in BASIC) crashes in very specific circumstances. The function concerned has been called many times before the crash, and it crashes while calling an SWI, Socket_Select. It’s been pointed out that within the disassembly around the crash address, R0 is not word aligned. I have no way to find out what causes this, as all the parameters I pass to the SWi are word aligned. The calling line is the FNselect below: IF DEBUG%AND2048 THEN *report SCL: caller$: before FD_SETSIZE% ~readmask% ~writemask% ~!readmask% ~!writemask% ~!(readmask%4) ~!(writemask%4) broadcast_s% server_s% poll_delay% The called function is: DEF FNselect(ndfs%, readfds%, writefds%, exceptfds%, timeval%) The crash occurs while executing the SYS call above: The debug output is: Trace @ 594 Reporter shows the following at a WHERE command: Memory: Prog=14,113 Vars=328,360 Free=124,208 Stack=516 Undefined=0 Slot=460K Reporter 2.71s (21 Nov 2018) Listed 17180 lines disassembly of the area shows: *memoryi fc19d260 fc19d300FC19D260 : .04à : E034300B : EORS R3,R4,R11 |
Martin Avison (27) 1494 posts |
I have just replied to your duplicate post in csap… You give no clues as to what version of RISC OS, BASIC or Internet modules are being used, which might help. Reporter should have shown the BASIC line number of the abort … although no line numbers are given for the code. I am not sure if the Socket C code is relevant, as the abort seems to be in the BASIC module, just after LOOKU1 label. This is involved in looping round the BASIC variable chains. r0 & r1 seem to contain “%MI:” which looks wrong, as it is being used as an address that should be the next variable in the chain. It may be followed in storage by r5, which is “SE00%” and should be characters 2-5 of the variable name (I think). My suspicion is that something in Alan’s program has overwritten storage, which has corrupted one of the variable chains. |
Alan Adams (2486) 1149 posts |
OK. It happens in very specific circumstances. It happens the first time the server is started after a restart, and on ARMX6 running RO5.25 25-Apr-2018, supplied by R-Comp. BASIC is BASIC V 1.75 03 Mar 2018 I could fiddle around with memory allocation, changing END= etc, but I don’t want to make it go away without knowing why, otherwise Murphy’s law says it will return, just as we go live at the start of a major event. |
Jeffrey Lee (213) 6048 posts |
Not sure if it’ll help much, but %MI: and SE look suspiciously like they’re part of a time format string: https://www.riscosopen.org/wiki/documentation/show/Date%20Format%20String%20Tokens Perhaps a symptom of reporter going wrong, since it includes timestamps in its log output. |
Alan Adams (2486) 1149 posts |
Thank you – the function at the top of the sequence produces a formatted time and sends it. It’s the sending code that triggers the crash. |
Martin Avison (27) 1494 posts |
That (obviously) should not produce an error … but it seems it is :-(( If the ON ERROR is first in the error proc, is the error line still the first line of the error proc? Incidentally, the report command which includes |
Alan Adams (2486) 1149 posts |
The original calling code was: DEF PROCtimerequest_exec(client%) LOCAL A$, J%, K% J%=FNscanchar(client%*2,"#") K%=datastart%(client%*2) A$=FNreadbuf(client%*2,"~","#",K%) datastart%(client%*2)=K ?env%=3 SYS "OS_Word",14,env% $format%="%W3, %DY-%M3-%CE%YR.%24:%MI:%SE"+CHR$(0) SYS "Territory_ConvertDateAndTime",-1,env%,names%,256,format% A$=FNreadstr(names%,0) A$ = "timeis~time=" + A$ + "#" IF DEBUG%AND1 THEN *report \R SVR: timerequest_exec: A$ PROCsendmessage(A$,client%) ENDPROC It uses several general-purpose blocks that are reused elsewhere. The calling code is now:
Now I’ve defined timebuffer, timeformat and timeblock separately as 256, 256 and 10 bytes respectively, and it no longer crashes. If I reinstate using format% instead of timeformat% it crashes. However this isn’t used anywhere else as far as I can tell. I’ve defined guard blocks of 256 bytes before and after format%, which made no difference. So I’ve fixed it, but I don’t know why it fixed it. I know menu blocks have to be preserved while the menu is open. I assumed you could change them as soon as a selection is made. I wonder whether there is some clearing up done in the background, and I’m changing the block too soon? Having said that, the buffer containing “%MI:%SE” isn’t the menu buffer. The menu buffer is the one receiving the converted string. |
Rick Murray (539) 13850 posts |
Please don’t post code by simply copying it to the forum. Textile will make a mess of it. Because Textile… To show code, leave a blank line and then write <pre>, then place your code, register dump, whatever after that. End it with a </pre> and you’re good to go. You can do this multiple times …if… you have a number of things to show. When you do that, we’ll all get to see more or less what you intended (stupid Textile messes up URLs) rather than the gibberish that it would be without. (example – look at the code you posted a couple of messages above this one. See all the bits in bold? That’s where Textile saw a ‘*’ pair and decided that you meant bold, and your multiply by two just got eaten…) |
Alan Adams (2486) 1149 posts |
No, it is the *report line which is reported. From my immediately previous reply, changing the buffer from format% to timeformat% stopped the problem appearing. This suggests the variable chain with the problem included format%. timeformat% is a newly created variable, DIMmed after format% in the program. |
Alan Adams (2486) 1149 posts |
Thanks. I thought there was something to do that, but when I opened the textile reference I missed it. I see it’s done weird things to the quoted sample too. |
Alan Adams (2486) 1149 posts |
In case textile has mssed it up, here is how it should have been: IF DEBUG%AND2048 THEN *report SCL: caller$: before FD_SETSIZE% ~readmask% ~writemask% ~!readmask% ~!writemask% ~!(readmask%+4) ~!(writemask%+4) broadcast_s% server_s% poll_delay% |
Martin Avison (27) 1494 posts |
Are you sure it is not used anywhere else? You can use *ReportMon to monitor a variable and report if/when it changes – presumably format% should never ever change value. |
Alan Adams (2486) 1149 posts |
I added that. The output now contains 15:16:58.00 ** Mon ** format%=196932 some more output then Value @ 43 format%=571996 Does this mean it’s changed at line 43 in one of the libraries? Actually, yes it is, but I don’t know whether that routine is called. It’s declared LOCAL and used within a routine. Defining it as local shouldn’t change the outer value though, should it? |
Martin Avison (27) 1494 posts |
The single value change you see is what I would expect if it was not declared LOCAL. For example Should produce
|
Alan Adams (2486) 1149 posts |
I’m a bit puzzled then. I get a single reference to line 43. However there isn’t a mention of format% in line 43 of either the main program of any of the libraries it uses.
which is interesting. The first line number is one after the LOCAL declaration. The second is one after format% is assigned a value, the third is one after the function call that contains the first two. They are all in the same library. I still can’t find a reference to format% near a line 43 or 42 in the main program or any of the libraries. 1266DEF PROCcommonwritelog(logname$,message$) 1267IF NOT(logging%) THEN ENDPROC 1268LOCAL log$,logsize%,freespace% 1269IF LEN(message$) < (254-13) THEN message$ = FNclocktime2("%24:%MI:%SE.%CS") + ": " + message$ 1302DEF FNclocktime2(format$) 1303LOCAL rsize%,localtime%,format%,result% 1304IF LEN(format$) > 29 THEN format$=LEFT$(format$,29) 1305localtime%=wlib_block% 1306format%=wlib_block%+30 1307result%=wlib_block%+60 1308rsize%=30 1309$format%=format$ + CHR$(0) 1310?localtime%=3 1311SYS "OS_Word",14,localtime% 1312SYS "OS_ConvertDateAndTime",localtime%,result%,rsize%,format% TO ,rsize% 1313?rsize%=13 1314=$result% 1315 BTW Textile is still messing this up. It was missing various things in brackets, but after adding this comment, it’s stopped doing that. It is still double-spacing some of it. |
Alan Adams (2486) 1149 posts |
and a final thought – I was using format% as a buffer to hold a date/time format string to pass into Territory_ConvertDateAndTime. What I hadn’t realised when I wrote the code, some time ago, is that SYS does that for me. It turns out that passing either a string variable or a string literal in works just as well, so I’ve now cleaned up several similar routines. It’s a result of reading the PRM where parameters are often referred to as pointers to memory. Unless you read the BASIC manual on SYS quite carefully, it isn’t obvious that SYS does it for you. |
Martin Avison (27) 1494 posts |
The changes to avoid using SYS abilities rather than format% within the time function(s) are a good thing … but I suspect that while it has avoided the immediate, obvious, bug, it it has not removed the bug. Looking at the Reporter output with logging enabled, it clearly shows that format% was changed in value, at a mysterious line 43. Depending on whether it was changed as a result of an assignment, or a DIM, any uses of format% after that point could (and at some time will) cause problems of some sort. I suggest that it will save future grief if that mysterious line 43 change is tracked down. Personally I would use one of the following: It may also be interesting to look at the values that format% is set to. If Initial value 196932 = &30144 fine – in variables [Note that *ReportMon ~format% will give values in hex] |
Alan Adams (2486) 1149 posts |
I suppose one question is what exactly does ReportMon do? If it monitors the contents of an address, pointed to by the variable name, then something overwriting that address would show as a change to the variable. I think I’ve eliminated future problems, because there is now only one use of format%, as a local variable assigned to a space in a large buffer, and only in one function. I might change the name of the variable there too. However following your suggestion I tried Textseek (which I’d forgotten I had) and found that format% was being DIMmed at line 42 in a library, as well as being DIMmed in the main program. I’ve removed it. Fortunately only one program uses that library, so it won’t affect anything else. This also I think explains the corruption. The main program DIMmed format% as 1024 bytes. The library then reDIMmed it as 20 bytes. Later it put a 30 byte string into the buffer, thereby overwriting the end. What this overwrote then depended on the memory changes that occurred between these two events, which depended on the data being processed at the time. The original error sequence I now see is that SYS “XSocket_select” returned an error, not from the SWI but from the BASSIC interpreter preparing to call it. Reporter then not surprisingly produced the abort when it tried to check the variables and found a corrupt chain. Without the ON ERROR OFF, it then looped in the abort. Thank you for all the help. |
Martin Avison (27) 1494 posts |
ReportMon monitors the value of a variable, whatever that is. It checks at the start of each line if it has changed, and if so reports the new valeu. I am sure that the extra DIM explains the nasty corruption you found, which affected anything looking for a variable (including Reporter). |