LanManFS losing directory contents after a while
Martin Avison (27) 1494 posts |
I have failed to reproduce the missing files problem using LMcheck here with all the variations I can think of. It would be very interesting if anyone who does see the problem could report whether LMcheck also sees it or not, as this could be a useful clue. Timing issues have been suggested as a possible cause, and I did wonder if the speed of the network could be a factor? Mine is a gigabit network & NAS … although neither my Iyonix nor my RPi3 can do justice to that. |
Will Ling (519) 98 posts |
I’ve had no luck repeating it in the last few days. I connect via a wireless bridge (TP-WR710N). I’ve just been checking back every few hours, as that seems to work for Jeff. Perhaps trying too hard prevents it. Maybe the NAS needs to sleep. Maybe something else on the network needs to change, or maybe it’s an interaction with other software. |
Colin (478) 2433 posts |
Would anyone who can repeat the truncated directory problem like to try out this version of the module, LanMan_timeout? The anti idleout interval seemed to be very long in the original version, so I’ve shortened it a lot; hopefully that will fix the problem. |
Martin Avison (27) 1494 posts |
Timeout version seems to work just the same for me … with no problems! Any clues what circumstances provoked the long anti idleout to cause problems? |
Colin (478) 2433 posts |
First impressions from the debug code suggest that the anti idleout code transmits quite often, but on a little more digging I discovered that it only actually transmits a packet every 18 minutes when idle. That seems a very long time to me, so I’ve modified it to transmit every 2 minutes to see if that is the problem. I can’t reproduce the problem here so I’m only guessing. It may be that timeouts are specific to the server software, which may be why it’s a problem for some people and not others. It may also explain why the directory enumeration initially works but then fails. In any case, if someone can reproduce the problem with this new version then it’s unlikely to be a timeout issue. |
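To illustrate the reasoning above, here is a minimal sketch in Python. It is not LanManFS code, and the server idle timeout of 15 minutes is a purely hypothetical figure chosen for the example; real servers may use quite different values. It only shows why an 18 minute keepalive interval could let a server drop an idle session while a 2 minute interval would not.

```python
def longest_silence_s(idle_period_s: float, keepalive_interval_s: float) -> float:
    """Longest gap without traffic that the server sees during an idle period."""
    return min(idle_period_s, keepalive_interval_s)


def session_survives(idle_period_s: float,
                     keepalive_interval_s: float,
                     server_timeout_s: float) -> bool:
    """True if the server never sees a silence as long as its idle timeout."""
    return longest_silence_s(idle_period_s, keepalive_interval_s) < server_timeout_s


# ASSUMPTION: the server's idle timeout is invented for illustration only.
HYPOTHETICAL_SERVER_TIMEOUT_S = 15 * 60
IDLE_PERIOD_S = 3 * 60 * 60  # machine left untouched for three hours

old_ok = session_survives(IDLE_PERIOD_S, 18 * 60, HYPOTHETICAL_SERVER_TIMEOUT_S)
new_ok = session_survives(IDLE_PERIOD_S, 2 * 60, HYPOTHETICAL_SERVER_TIMEOUT_S)
print(old_ok, new_ok)
```

With these assumed numbers the 18 minute interval loses the session and the 2 minute interval keeps it, which would also fit the problem only showing up on some servers.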
Colin (478) 2433 posts |
Mounting 2 shares to the same server showed that the Anti IdleOut function was only happening on one of the shares. I’ve updated LanMan_timeout/zip to fix this. Thanks Martin for spotting that. |
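As a rough illustration of the fix described above (again a Python sketch, not the module's actual code), the keepalive check needs to look at every mount rather than one per server. The `Mount` type, its field names and the 120 second default are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Mount:
    # Hypothetical record of one mounted share; not a LanManFS structure.
    server: str
    share: str
    last_traffic_s: float


def mounts_needing_ping(mounts, now_s, interval_s=120.0):
    """Return every mount that has been silent for at least interval_s."""
    return [m for m in mounts if now_s - m.last_traffic_s >= interval_s]


mounts = [
    Mount("nas", "public", last_traffic_s=0.0),
    Mount("nas", "backup", last_traffic_s=0.0),  # second share on the same server
]
due = mounts_needing_ping(mounts, now_s=300.0)
print([m.share for m in due])
```

Both shares are reported as due, even though they live on the same server.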
Colin (478) 2433 posts |
Just in case anyone would like to try out a version with Anti IdleOut disabled to see if it makes the problem appear more easily I’ve made LanMan_notimeout/zip available. |
Martin Avison (27) 1494 posts |
If you have seen the missing files problem, can you remember if at the time there was more than one LanMan mount? |
Andrew Rawnsley (492) 1445 posts |
Hi guys, sorry I haven’t been back on this topic. Have been down with a nasty fever until today, so am now trying to catch up on the various things that didn’t happen last week :( Thanks to Colin and Martin and everyone for looking into this. |
Colin (478) 2433 posts |
Just wondering if anyone who has seen the truncated directory problem has done any testing of the modified module? |
Jeff Blyther (1856) 47 posts |
Colin, I have been testing the notimeout module. I’ve just sent you an email with some Reporter reports. With this module I find that it goes wrong a bit more regularly, usually within a day (tested 5 times), compared with a time to failure of between 3 hours and 5 days for the standard module (tested loads of times). I was now going to move on to the new timeout module to see if this fails more or less regularly. |
Will Ling (519) 98 posts |
No, just one mount the last time I had the missing file issue. Currently, I’m running my Pi against the NAS I previously had the issue on, using Colin’s debug LanManFS, and the Iyonix against a share from Raspbian with the notimeout version. Neither has shown the issue in the last few days, but I have had a few aborts. I had been ignoring them, not wanting to be distracted from the LanMan issue, and I think a couple might have been NetSurf, but when I checked in this evening, both machines had ‘abort on data transfer’ on screen, so I thought I’d look into it further. Both were in the same place in LanManFS. Oh dear. Now, given no one else seems to have an issue with this, perhaps it’s something common I’m running on both machines that’s causing it. But I’ve included the info below in case it’s useful. Iyonix abort
Pi abort
I should add that on just clicking ‘Cancel’ (‘Quit’ on the Pi, why is that?), nothing bad happens. |
Jeff Blyther (1856) 47 posts |
Will, I’ve had two Pis connected to my NAS, both running the notimeout LanMan module to see if the missing files problem appears more easily. The only problems I’ve had are that sometimes I get a ‘LanMan timeout’ error box (I assume caused by the anti idleout being disabled) and sometimes a very long hourglass wait while it fetches info (I guess that another module trying to access the NAS during this hourglass pause could cause trouble). Anyway, I’ve satisfied myself that the missing file problem is made slightly worse using this module (which was the aim of this module), and have now got both Pis running the LanMan timeout module. This time I will report any odd behaviour. |
Colin (478) 2433 posts |
Will, any idea what the abort was? When you say that nothing bad happens when you press Cancel, is LanMan still running? The address is in the debug code (DumpBuffer), and it looks like that is called when processing the response to LanMan requesting the status of a server name. If it happens again, can you send any Reporter output if you are able to access it? I think the ‘notimeout’ version is causing confusion. It was only meant to see if it produced the problem any quicker. I’m more interested in the ‘timeout’ version. If anyone can reproduce the problem with that version just once then that will prove that the server timing out isn’t the problem. |
Will Ling (519) 98 posts |
‘Nothing bad’ was perhaps misleading. I meant nothing that I noticed happening to LanMan. I noticed this morning though that the access icon was missing from the icon bar, so something knocked that out. I think I’ll set up a clean boot on my Iyonix and go from there, with the debug LanManFS, and get you more info if it happens again. I don’t want to be distracting with possibly unrelated issues. |
Will Ling (519) 98 posts |
This evening, I checked my Iyonix, with the intent of downloading the latest disc image to set up a clean boot, and found: Error: Internal error: abort on data transfer at &2034D3B0. Again, I clicked Cancel, and the Apps icon had now gone too. I’ve captured a report log of *cat, opening the filer, and running LMcheck (without recursion, just on the one folder). I’ve got a screenshot of the filer window, and the *cat output. |
Martin Avison (27) 1494 posts |
LMcheck before v0.03 used a GBPB buffer of 8000 bytes, but it now uses 512 bytes which I think is the same as the Filer, after I realised it could be a vital difference. Why it is so rare is still a mystery to me! I hope your latest log files will give some more clues to Colin. |
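To show why the buffer size might matter, here is a simplified Python model of a buffered directory read with a continuation offset. It is loosely patterned on the way a GBPB-style call hands back as many names as fit in the caller's buffer plus an offset for the next call (with -1 meaning finished); the function names, the one-byte terminator per name and the file names are assumptions for illustration, not the real call's layout.

```python
def read_dir_chunk(names, offset, buffer_size):
    """Return (entries, next_offset); next_offset is -1 when the listing is done."""
    used = 0
    entries = []
    i = offset
    while i < len(names):
        need = len(names[i].encode("latin-1")) + 1  # name plus a terminator byte
        if used + need > buffer_size:
            break
        entries.append(names[i])
        used += need
        i += 1
    if not entries and i < len(names):
        raise ValueError("buffer too small for a single entry")
    return entries, (-1 if i >= len(names) else i)


def list_directory(names, buffer_size):
    """Read the whole directory, counting how many calls were needed."""
    entries, calls, offset = [], 0, 0
    while offset != -1:
        chunk, offset = read_dir_chunk(names, offset, buffer_size)
        entries.extend(chunk)
        calls += 1
    return entries, calls


# Hypothetical directory of 981 files with made-up names.
names = [f"file{i:04d}" for i in range(981)]
small_entries, small_calls = list_directory(names, 512)
big_entries, big_calls = list_directory(names, 8000)
print(small_calls, big_calls)
```

Both buffer sizes return the same names in this model, but the 512 byte buffer needs many more continuation calls, so it exercises the continuation path far more often than the 8000 byte buffer does.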
Will Ling (519) 98 posts |
Task manager has died now, not sure how long I’ll be able to get stuff out of it. |
Colin (478) 2433 posts |
Ok I’ve made up LanManFS_01.zip which is the same as the ‘timeout’ version without the debug code – if you want to confirm that you are using it
The ‘notimeout’ version disables all ping code that keeps the mount alive, so if you are finding that it makes the bug appear more often then that is good. It may be the debug code that is causing the aborts for some reason, or it may be that the timeouts are causing more problems than just truncated directories. Either way there is a possibility that the ‘timeout’ version fixes the problem, as I have had no reports of the directory truncation problem with this version of the module and Jeff – who seems to be able to produce a truncated directory relatively frequently – reports no problems in 24 hours so far. All the directory listing failure reports I have – including yours – show that the server is replying to a request for more filenames with an empty list and an end of list marker, so it’s not a case of the server sending the file list and LanManFS going wrong. I don’t think that there is any need for more reports at the moment; I’m more interested in any truncated directory listing with either LanManFS_01 linked above or the ‘timeout’ version – which is the same with debug code and Reporter output. |
Jeffrey Lee (213) 6048 posts |
If you’re experiencing crashes, don’t forget that modern versions of RISC OS 5 can generate stack traces which can be useful for pinpointing the cause. https://www.riscosopen.org/wiki/documentation/show/Debugger%20Exception%20Dumps |
Will Ling (519) 98 posts |
Thanks Colin, I’m now running test-01 on a clean !Boot on my Iyonix.
The current session was too far gone, so had to reboot. Noted for next time though, thanks. |
Martin Avison (27) 1494 posts |
I have now been running test-01 for 6 days continuously on my RPi3, without any problems. This was with two copies of LMtest v0.03, one listing a single directory with 981 files, the other listing 319 directories with 5935 files, at random intervals between 30 and 90 minutes. Not sure if this proves anything, as I have never seen the problem here! |
Will Ling (519) 98 posts |
I’ve been running the Iyonix all week, checking daily, with no issues. test-01 has certainly eliminated the normal 30 second pause when coming back after it’s been idle for a while, so that’s good. However, yesterday I found a quick, simple way to force the problem to appear. |
Colin (478) 2433 posts |
OK, I got Linux, Samba and Wireshark up and running and can confirm that the process you describe does result in a truncated directory – yay. Initial findings show that the SMB server is sending an End Of Search indicator early in response to a continuation request when enumerating the directory, as the LanMan debug output also shows. At least I can repeat the problem here now. |
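The failure mode described above can be modelled in a few lines of Python. This is only a sketch: the responses are hand-written tuples of (names, end-of-search flag), not real SMB packets, and the function names are invented for the example. It shows why a client that trusts the end-of-search flag cannot tell an early marker from a genuine end of the directory unless it has something else to compare against.

```python
def enumerate_directory(responses):
    """Collect names until the server flags end of search."""
    names = []
    for chunk, end_of_search in responses:
        names.extend(chunk)
        if end_of_search:
            break
    return names


def looks_truncated(names, expected_count):
    """Crude check against a previously known-good file count."""
    return len(names) < expected_count


# Hypothetical responses: a healthy listing, then one where the
# continuation reply is an empty list with the end-of-search flag set.
healthy = [(["a", "b"], False), (["c"], True)]
faulty = [(["a", "b"], False), ([], True)]

good = enumerate_directory(healthy)
bad = enumerate_directory(faulty)
print(good, bad, looks_truncated(bad, expected_count=len(good)))
```

In the faulty case the client ends up with a shorter listing and no error, which matches the symptom of files silently missing from the directory.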
Dave Higton (1515) 3526 posts |
But this makes it look as if the server is at fault. |