File handles are rubbish
Pages: 1 2
nemo (145) 2571 posts |
I tried to open a zip file yesterday, but SparkFS said it was corrupt. I then did a search through the file system, and a few zips reported corruption. Very weird. Then I checked the dates – all changed the same day in April. Loaded one into Zap, and the end of the file is full of ASCII – debugging info from something.

So, four months ago I was running something that was sending debugging info to a file. Somehow, its file handle got closed. Thereafter, whenever SparkFS opened a zip, the program (whatever it was) spaffed debugging info onto and into various zips. Thanks, mystery program. Fortunately I use VRPC on top of Google Drive, and Drive does automatic versioning, so I was able to revert the ten zips that got killed (or I’d still be Quite Cross Indeed).

File handles should not be reused. I’m thinking of writing something that will add a cycle number to FileCore’s apparently 8-bit file handles to stop this kind of stupidity happening again. Some things treat file handles as bytes… in which case I can allow them through (optionally), fire warnings, whatever. But all those programs that correctly treat them as words would then be protected from overwriting the wrong file.

(Just for the record, file handles are rubbish on all systems, and used to cause me lots of trouble inside printers too) |
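The cycle-number idea can be sketched roughly like this – a minimal C illustration, not nemo's actual design (all names here are invented): the 8-bit handle becomes the low byte of a 32-bit word, and a per-slot reuse counter fills the upper bits, so a stale wide handle no longer validates once the slot has been reallocated. How many reuses it takes before a false match becomes possible depends on how many bits are spent on the cycle.

```c
#include <stdint.h>

#define SLOT_BITS 8u
#define SLOT_MASK 0xFFu

static uint32_t slot_cycle[256];   /* per-slot reuse counter (hypothetical) */

/* Called when the underlying 8-bit handle is (re)allocated:
 * advance this slot's generation and return the widened handle. */
uint32_t widen_handle(uint8_t slot)
{
    slot_cycle[slot]++;
    return (slot_cycle[slot] << SLOT_BITS) | slot;
}

/* Returns the 8-bit handle if still current, or -1 if the slot
 * has been reused since this wide handle was issued. */
int narrow_handle(uint32_t wide)
{
    uint8_t slot = (uint8_t)(wide & SLOT_MASK);
    if ((wide >> SLOT_BITS) != slot_cycle[slot])
        return -1;
    return slot;
}
```

A program holding a stale wide handle then gets an error instead of silently writing into somebody else's newly opened file.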
Matthew Phillips (473) 721 posts |
That sounds a very worthwhile enhancement. |
Rick Murray (539) 13862 posts |
Why so complicated? The stupidity here is mostly in the logic where the “let’s find a free file handle” search starts at 255 and counts downwards. It’s a lame-ass saving in that if the counter reaches zero then we’ve run out of handles, instead of bothering to actually count them.

All it needs to fix a lot of these issues is to simply patch FileSwitch (not FileCore!) to decrement the stream count for each request, wrapping when it reaches zero. This will mean separately keeping track of how many handles were allocated as well as the number to start from the next time around. That way, much much less possibility of one program trampling on the handles of another.

The code is in …Sources.FileSys.FileSwitch.s.StreamBits in the function AllocateStream:

        MOV   r1, #MaxHandle                  ; Allocate handles from MaxHandle..1
        ADR   r2, UnallocatedStream
        ADR   r4, streamtable
05      LDR   r14, [r4, r1, LSL #2]           ; Empty slot found?
        TEQ   r14, r2
        BEQ   %FT20
        SUBS  r1, r1, #1
        BNE   %BT05                           ; If reached handle 0, naff up
10      addr  r0, ErrorBlock_TooManyOpenFiles
        BL    CopyError
        EXIT |
Had that back in the Archimedes days. Exact same problem, exact same reason. That’s why I prefer a non-file-based debugging method. It also means the system isn’t trying to write stuff to a file if it crashes. DADebug is my friend. There’s also a tweaked version that periodically dumps the info to the debug serial port, where it can be looked at on another machine if necessary. |
nemo (145) 2571 posts |
Your suggestion, if I understand correctly, simply cycles through 8-bit file handles. So corruption can still happen after only 240-ish OS_Finds (I think you’d be surprised how often file handles get allocated – try a grep). My suggestion means a single file handle would have to be reused 8 million times before corruption could occur. I prefer mine. |
David Feugey (2125) 2709 posts |
Hum, it’s probably SparkFS itself. It corrupts most of my zip files when I try to modify them. All works OK… but after a reboot, it’s not the same. |
Steffen Huber (91) 1958 posts |
You seem to have a deeply-embedded-somewhere filing system problem on your RISC OS systems. SparkFS never ever corrupted an archive for me, not a Sparkive, not a ZIP, not a TAR. And I have used it very very extensively. I remember you also had problems with Sunfish. I never had a data corruption problem there either, and I’m shifting multi GiB of data around. |
Rick Murray (539) 13862 posts |
It’s probably something messing with the file handles. Do you have anything that may accidentally close all of the handles?
True, and a valid point. However, we are looking at the problem of: file is closed unknown to an application, different file is opened and allocated the same handle, first application continues to access the file unaware that it is something else. There are two factors at play – the first is that an application really shouldn’t keep handles open across a poll and expect it to be the same thing on return. That’s not to say your eight million is a bad idea, it isn’t. I just think my blu-tack fix is a much quicker way to reduce (not eliminate) the problem.
I wonder how many things would stop working correctly due to the assumption of 255→1? Certainly every “show open files” program I have seen (including the two I wrote myself) work on the basis of starting the enumeration at 255, because that’s how it is defined and there is no API to say otherwise.
That implies something else is corrupting the files. It seems okay because SparkFS’ cached copies of the data are intact. After a reboot, the file is damaged and SparkFS has nothing cached.

When I said above my files were corrupted, it was often archives and they contained bits of ARCbbs log files. But, importantly, this corruption also nuked some other files – there was a database, I forget which. What they all had in common (after some digging) was that they kept files open over Wimp polls. It was “fixed” by trapping CLOSE#0. I don’t know what was doing that – I suspect that something might have closed a handle, set its variable to zero, then closed the handle again (without checking whether it was already zero).

There is no way for an application to tell that all of the file handles have been closed. Ideally the Wimp should broadcast a notification message (because hooking into events and such is not something sensibly done from application code). But the obvious response is that ideally apps shouldn’t keep files open across times when they aren’t in control of the machine. And thus we end up at the current situation where a bunch of dumb things equals a problem. |
Rick Murray (539) 13862 posts |
Just out of interest, David, have you tried DiscKnight? Is your logical structure intact? If your maps are awry, all sorts of chaos will ensue, progressively getting worse. Trust me, DiscKnight is a tenner well spent – your disc structure has to be good (there’s no maybe, either it is good or it is broken and will corrupt data) and RISC OS is too clueless to be able to fix itself. Hell, I don’t think *CheckMap has worked for anything other than a simple yes/no since the days when we all had floppy discs. https://armclub.org.uk/products/discknight/ |
Frank de Bruijn (160) 228 posts |
Not my experience. https://www.freelists.org/post/davidpilling/SparkFS-and-application-directories-in-tars |
nemo (145) 2571 posts |
David wondered
No. The unwanted content was a load of hex octets that looked like exactly the sort of thing I might have been spraying into a debugging file. The interesting thing to ponder is whether the guilty code was using XOS_BPut (or equivalent), or whether it just happened to not output anything while the handle was closed. Rick declared
The only restriction about keeping files open during Wimp_Poll relates to Wimp_OpenTemplates. Quite apart from the novel restriction on Wimp programs, how’s “Rick’s Rule” going to work with command line utilities in TaskWindows? There is an argument for protecting a file handle opened in Task A from being touched by Task B, but that’s a more complicated thing to enforce than the cycle number wheeze.
…Apart from OS_FSControl,58 you mean.
Yes, one would have hoped that OS_FSControl,59 would have obliged, but it isn’t even filing system specific. And yes, I have written a program that had to write two files to a directory in order to detect that it would instead have to delete one and create a continuation directory instead.
I’m sure a |
Rick Murray (539) 13862 posts |
Haven’t TaskWindows always been something of a special case?
Because the Wimp, itself, only uses one file handle…and doesn’t give a handle to the caller (left over Arthur legacy or something?).
I find it useful to make no assumptions about what happened in the duration that the app yielded to Wimp Poll – to the point where something that might have checked/used the time should read it again. While it is “unlikely”, there is technically nothing at all that states that the poll didn’t return from the command line three days later with the alphabet set to Eastern European, the keyboard set to Dvorak, and an entirely different SD card mounted.
Brilliant stuff! Thanks, I missed that one, was still using OS_Args… |
David Feugey (2125) 2709 posts |
With SparkFS, the corruption happens mostly when I reboot the OS or when I leave the SparkFS filesystem. Note: it happens with big files. By big file, I mean 300 MB or more. I move/copy/modify about 60 GB of data every month on my main setup. And several thousand small files every day.
When you reboot? True: RISC OS closes the handles :) It’s as if SparkFS didn’t (have the time to) apply the changes to the file. Or forgets to do it. Note: I use a big SparkFS cache: 32 MB. And if you want to laugh, add a file to a zip stored inside a zip, then leave the SparkFS filesystem, and check your zip.
Yes: no problem. The problem occurs with my very big zip files (made initially with 7Zip on a PC). For Sunfish it was on a clean install of RISC OS 5.22 on a Pandaboard. One of the problems was when I edited network files with StrongED: sometimes, additional data was added at the end of the file. I had the same problem with RISC OS 5.20 to 5.26 on various configurations (Pi 1, 2, 3, clean installs or complex ones). I can create or read archives without problems, but it’s always tricky to modify something in an existing zip. Most of the time it’s just data added at the end of the zip (7Zip will complain about it when checking the file). And sometimes it’s just data corruption on the new files added to the archive. Sometimes, it’s the archive itself that is corrupted. |
nemo (145) 2571 posts |
Will you tell the file corruption or shall I? The point is that as far as the desktop is concerned, files shall be left open during polling. If anything it’s

Note that FontManager keeps its file handles open (lots), but silently retries if it finds they’ve closed. I haven’t investigated what happens if one of FontManager’s handles gets closed and then a FindFont gets the same handle again. Unpleasantness, I expect.
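The "silently retries" behaviour described for FontManager can be sketched as a check-then-reopen wrapper – a hypothetical C illustration, not FontManager's actual code; `open_fn` and `valid_fn` are invented stand-ins for whatever the real module does:

```c
typedef struct {
    int handle;                 /* 0 = not currently open */
} cached_file;

typedef int (*open_fn)(void);   /* returns a new handle */
typedef int (*valid_fn)(int);   /* nonzero if the handle is still ours */

/* Validate the cached handle before every use; if something closed it
 * behind our back, silently reopen instead of failing. */
int ensure_open(cached_file *f, open_fn reopen, valid_fn still_valid)
{
    if (f->handle == 0 || !still_valid(f->handle))
        f->handle = reopen();
    return f->handle;
}

/* --- tiny simulation, purely for illustration --- */
static int sim_opens;
static int sim_open(void)   { return ++sim_opens + 100; }
static int sim_valid(int h) { return h == 101; }  /* pretend only handle 101 survives */
```

Of course, as nemo points out, the validity check itself is the weak spot: if the old handle number has meanwhile been reissued to someone else, the check can pass while pointing at the wrong file.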
The (internal) template file handle could have been in per-task storage instead of global… but it’s difficult to argue a case for wanting to keep your Templates open. But as with most things in RO, it isn’t even policed properly – Wimp_Poll ought to return an error if the template file is still open. As is rather common, it works by coincidence.
My |
Martin Avison (27) 1498 posts |
There are probably other solutions around, but if anyone is interested I have a small module StopClose0, which is designed to stop all files and devices being closed at once by program or by accidental command, which can cause some programs problems as discussed above.

It will trap instances of the command *Close, and BASIC CLOSE#0 – indeed anything that results in a SWI OS_Find with r0 and r1=0, which indicates all files on the current filing system are to be closed. When OS_Find,0,0 is detected, a prompt will be issued to Allow or Fail: if the Desktop is running, this will be in an error box; if single-tasking it is a simple line prompt. If Allowed, all open files will be closed as ‘normal’.

On System CloseDown, the module is deactivated in case the closedown process does a Close All as a last resort, although I am not sure this is required. If/When there is agreement that failing these requests does not cause any problems, the prompts could be removed and they would always fail.

Note that StopClose0 does not trap Shut and Shutdown commands – although they also close all files, they are closed individually, not by OS_Find,0,0.

It can probably be improved … but only if I receive feedback. The module is available from my website |
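The core of the trap can be illustrated in a few lines of C – a sketch of the idea only, not StopClose0's real source: sit on the OS_Find call path and single out the form where both the reason code (r0) and the handle (r1) are zero, which means "close every open file". `prompt_fn` stands in for the Allow/Fail dialogue.

```c
typedef int (*prompt_fn)(void);  /* returns nonzero to allow */

/* Returns nonzero to pass the call through to the real OS_Find,
 * zero to fail the request and leave all files open. */
int filter_os_find(int reason, int handle, prompt_fn prompt_user)
{
    if (reason == 0 && handle == 0) {      /* the "close ALL files" form */
        if (!prompt_user())
            return 0;
    }
    return 1;                              /* anything else is untouched */
}

/* stand-in prompts, purely for illustration */
static int always_allow(void) { return 1; }
static int always_fail(void)  { return 0; }
```

Closing a single named handle (r0=0, r1=handle) sails through unmolested; only the wholesale CLOSE#0 gets questioned.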
Martin Avison (27) 1498 posts |
nemo wrote:
From my notes *Shut does not use OS_Find,0,0 but loops round closing individual file handles. But I agree there is a case for suppressing it! It seems a very dangerous command today. |
Dave Higton (1515) 3543 posts |
Sorry, Rick, that’s rubbish. I wrote a pair of backup and restore apps that concatenate lots of source files into a small number of large files. The result is very fast, and multi-tasks with silky smoothness. A backup takes a significant time – a couple of hours is an entirely reasonable time if there are thousands of source files totalling some gigabytes.

The only reasonable way is to keep a file handle open until its maximum size (according to the app’s rules) is reached. To close it, reopen it, and reposition to the full extent would be crazy – it would take many times as long, and would have a good go at wearing out any flash destination medium.

If some app closes one or more file handles that it isn’t entitled to, that’s clearly a bug and should be fixed. Programming apps defensively around such bugs would be ludicrous. I would support attempts to prevent *Shut or similar from working. |
Grahame Parish (436) 481 posts |
Is there a way to implement a scheme where each task has its own ‘local’ file handles that are not seen or shared outside of the task? So if a task does shut all open files it will only be for the handles it owns. |
nemo (145) 2571 posts |
Martin delivered
Thanks!
Yeah. Perhaps your module should trap that just like OS_Find,0? :-D
Well, in my case it’s usually “Something has just exploded leaving multiple files open…” I used to have a library obey file that closed all files, reinitialised all filing system modules in the correct order and basically rebooted the whole filing system, for when I’d broken something at a deep level. It was only one letter different from

Grahame asked
I’d touched upon that earlier. It’s difficult to police because it’s very difficult to clearly define ownership:

Task calls OS_Find – this is straightforward: the task owns the handle. I’m not aware of any protocols that require Task A to pass a file handle to Task B, but if there were one, a Filter could detect the explicit sharing of ownership.

Task calls a module which calls OS_Find (eg FontManager) – this is much more problematic: whether the handle belongs only to this one task is implementation defined. In the case of the FontManager it’s definitely shared. In the case of SysLog it isn’t.

So, as I said, as the problem in my case was caused by illiberal use of
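The two ownership cases above could be modelled with a per-handle owner table – a hypothetical C sketch of the policing problem, not a real RISC OS mechanism: a handle opened directly by a task records that task as owner, while a module-owned handle (the FontManager case) is marked shared and escapes the check entirely, which is exactly why the scheme is hard to make watertight.

```c
#define NO_OWNER 0   /* task id 0 = unowned/shared */

static int owner[256];   /* indexed by 8-bit file handle */

void note_open(int handle, int task) { owner[handle] = task; }
void note_shared(int handle)         { owner[handle] = NO_OWNER; }

/* Nonzero if this task may close this handle: either it owns it,
 * or nobody does (module/shared handles cannot be protected). */
int may_close(int handle, int task)
{
    return owner[handle] == NO_OWNER || owner[handle] == task;
}
```

A close-all from Task B would then skip every handle for which `may_close(h, B)` is false – but any handle it opened via a module remains fair game, which is the implementation-defined hole described above.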
This is likely to be via OS_Find,0 and may well be unintentional (trying to close a file that wasn’t actually opened). Martin’s module protects against that. |
Stuart Swales (1481) 351 posts |
*Phut ? ;-) |
nemo (145) 2571 posts |
Martin, I’ve just had a look at your module. Task names returned from TaskManager are only guaranteed to be ctrl-terminated, not null-terminated. TaskManager currently truncates task names to 32 bytes including terminator… but that’s an implementation detail, not a contract. I’d expect that to increase significantly once Unicode is supported properly (because worst-case that’s only a few characters!). Note that the German Word of the Year, 1999, is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz, which is 65 bytes long in UTF-8. And whereas the capital of England in English is six bytes long, Ulaanbaatar in Mongolian is 39 bytes: ᠤᠯᠠᠭᠠᠨᠪᠠᠭᠠᠲᠤᠷ.

Where was I? Oh yes, task names are restricted by the 400C2 TaskInitialise message that the Wimp broadcasts. So they must be less than 228 bytes long. Naturally there’s no checking. Of course there isn’t. This is what RO4 does if you try to use a 230 byte long task name, for example: So don’t do the equivalent of a strcpy unless your buffer is at least 228 bytes, or you’ll get similarly disappointing results.

You’re also using a lot of non-X SWIs in a vector handler, which is… brave, especially when using a 12 byte buffer to read the current filing system name – there’s no such restriction I know of (I just created an FS with a 20 chr name, for example). Also OS_Confirm would like to remind readers that reports of its death are exaggerated.

But bravo. This could easily handle FSControl,22 too. |
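The byte counts quoted above are easy to check directly (assuming UTF-8 source encoding): each u-umlaut is 2 bytes and each Mongolian letter is 3, which is why "characters" is the wrong unit for sizing a task-name buffer.

```c
#include <string.h>

/* 63 characters, two of them 2-byte umlauts -> 65 bytes */
static const char *german    = "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz";
/* 13 characters, each 3 bytes in UTF-8 -> 39 bytes */
static const char *mongolian = "ᠤᠯᠠᠭᠠᠨᠪᠠᠭᠠᠲᠤᠷ";

/* strlen counts bytes, not characters, which is exactly the point */
size_t utf8_bytes(const char *s) { return strlen(s); }
```

So a buffer sized in "characters" can be short by nearly a factor of three for scripts like Mongolian, and the 228-byte TaskInitialise limit leaves room for surprisingly few of them.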
David Feugey (2125) 2709 posts |
Note: I never had these problems under the 26-bit versions of RISC OS. On the other hand, I did not manipulate files of these sizes, and the uptime of my RISC OS 4 system was in hours and not days. |
nemo (145) 2571 posts |
Using VRPC, my uptime can be weeks at a time. I often forget it’s running. |
David Feugey (2125) 2709 posts |
I mean the uptime while using it with many different and sometimes buggy apps. RPCEmu is quite stable with RO4, if you don’t launch unknown apps. |
Martin Avison (27) 1498 posts |
I am happy for you! Do you ever hibernate your machine? re StopClose0 – Thanks to nemo for constructive feedback, which I will look into. |
nemo (145) 2571 posts |
No, it’s a desktop. But when I used to use a laptop I think it was fine… but don’t quote me.
<grits teeth> it is… fortunate… that I tend to use it full screen anyway. BUT, its windowed pointer handling is beyond irritating. BY FAR the worst sin is that if the VRPC window gets the focus (because some other Windows window has closed) the RO pointer starts moving (with the mouse, but separate from the Windows ptr) and then, if you click in some other Windows window it remembers that you pressed the mouse button even though you didn’t click in VRPC and it has lost the focus. THEN when (perhaps days later) you deliberately change back to VRPC it then processes the focus gain as if you’ve clicked where the RO pointer happens to be… closing a window, or clicking on “Yes, do delete the contents of my disc” or… you get the picture. THAT is shoddy. RPCEmu has far better mouse handling. |