SCSISoftUSB issues when performing large writes
Jeffrey Lee (213) 6048 posts |
It’s been mentioned on the forums before that there are some problems with performing large writes to disc using SCSISoftUSB. As mentioned in the post linked above, originally I’d only seen the problem with a FAT-formatted USB stick, which would fail with a “Target error – no sense” error when attempting to save a large (64MB) sprite file. Reformatting the stick under Windows seemed to make the problem go away.

But now that I’m messing around with USB HDDs, I’ve been able to reproduce the issue there as well. Using the same hard disc in two different enclosures, and with the disc Filecore formatted, one will fail with “Target error – no sense” when saving the 64MB sprite file, while the other will get stuck and never complete the save operation (hourglass on screen, computer still responding to interrupts, etc., but after leaving it for over half an hour it was still busy, with the activity light on the HDD flashing). Disconnecting the USB cable did bring up an error message, but there didn’t seem to be any way to get back to a functional desktop afterwards.

So there’s definitely something fishy going on with SCSISoftUSB, which I guess I’ll have to try getting to the bottom of over the next few weeks. Hopefully the “infinite transfer” is just because of a bug in SCSISoftUSB and not a bug in the USB->SATA bridge in the drive enclosure (the Aegis NetDock thing). |
Jeffrey Lee (213) 6048 posts |
Alas, it looks like these problems could be caused by buggy USB devices. For the device which returns the “no sense” error, this is quite obvious – SCSISoftUSB sends a write command for &4000200 bytes (64MB + 1 sector, since the sprite is slightly larger than 64MB), which is accepted by the device. SCSISoftUSB then starts writing data, but the transfer fails with a STALL error when transferring the first packet (i.e. 512 bytes). And as the “no sense” error implies, it doesn’t report that anything’s wrong when a request sense command is sent. Using a slightly smaller sprite (&3FFC200 bytes) still causes it to fail straight away, so it doesn’t look like it’s anything silly like the device only having N bits for storing the transfer length.

Although the SCSI spec does allow devices to state their maximum supported transfer length (the Block Limits VPD page, page code B0), after connecting both the devices to my Linux PC and using sg3_utils to dump the MODE SENSE responses it doesn’t look like either of them implements any of the VPD pages. And the spec is quite clear that if a device implements the Block Limits VPD page, and the transfer size is exceeded, then it should report an INVALID FIELD IN CDB error.

Looking at how different OSes have handled this issue, the first thing I found was that NetBSD uses a hard limit of MAXPHYS for all umass transfers, and it looks like MAXPHYS is a mere 64KB on all platforms. As such I don’t think they’ve even encountered the issue (there don’t appear to be any quirks to deal with it). For Linux things are a little different – I’m not sure what the maximum transfer size is, but there are two quirks to deal with devices which don’t like big transfers: US_FL_MAX_SECTORS_64, which limits the maximum transfer to 64 sectors (32K for most devices), and US_FL_MAX_SECTORS_MIN, which limits the maximum to PAGE_CACHE_SIZE>>9 sectors (whatever that is).

So the problem now becomes working out how to implement something similar in RISC OS. Should SCSIDriver/SCSISoftUSB silently split large transfers into smaller parts? I don’t know enough about SCSI to know if this is sensible/possible or not. Or should it be left to higher-level parts of the system (e.g. SCSIFS, any SCSI-based CDFS that may appear, etc.)? And if it’s left to the higher-level bits, how should the lower-level bits indicate that a maximum transfer size applies? (Presumably via a new SCSI_Control reason code, rather than faking Block Limits VPD pages for the devices that lack them.) |
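To make the question concrete, here is a minimal C sketch of how a driver layer could discover a device’s limit and fall back to a conservative clamp when the Block Limits page isn’t implemented. The do_inquiry_vpd() helper and the 128-block fallback are assumptions for illustration only; nothing like them exists in SCSIDriver or SCSISoftUSB.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical transport hook: issues an INQUIRY with EVPD=1 for the given
 * VPD page and copies the response into buf. Returns 0 on success. */
int do_inquiry_vpd(int dev, uint8_t page, uint8_t *buf, size_t len);

#define FALLBACK_MAX_BLOCKS 128u   /* 64 KiB at 512 bytes per sector */

/* Work out the largest transfer (in blocks) we should hand the device.
 * Uses the Block Limits VPD page (0xB0) when the device implements it,
 * otherwise falls back to a conservative clamp. */
uint32_t max_transfer_blocks(int dev)
{
    uint8_t vpd[64];

    if (do_inquiry_vpd(dev, 0xB0, vpd, sizeof vpd) != 0)
        return FALLBACK_MAX_BLOCKS;        /* page not implemented */

    /* MAXIMUM TRANSFER LENGTH: bytes 8..11, big-endian, in logical blocks.
     * Zero means the device reports no particular limit. */
    uint32_t max = ((uint32_t)vpd[8] << 24) | ((uint32_t)vpd[9] << 16) |
                   ((uint32_t)vpd[10] << 8) | vpd[11];

    return max ? max : FALLBACK_MAX_BLOCKS;
}
```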
Steffen Huber (91) 1955 posts |
IIRC, the 64 KB “limit” is also used in the Windows XP (and later) Mass Storage driver. I have read the recommendation to restrict writes to 64 KB quite often. Actually, I think the 64 KB limit might be of ancient heritage – the ASPI interface also defines a 64 KB limit for transfers. I don’t think it is a problem if you silently split up transfers to 64 KB portions. Of course you cannot guarantee an atomic write operation, but IIRC this is not guaranteed at SCSI level anyway (i.e. a drive is free to overwrite only parts of the requested blocks when encountering an error). It would be interesting to see if it really makes a difference wrt performance when restricting writing to 64 KB max at a time. BTW, the driver file in CDVDBurn also has a “write blocks at once” configuration option because I have found various IDE drives (and firmware) that were not happy with large writes. |
Dave Higton (281) 668 posts |
I was subscribed to the Linux USB developers mailing list for a long time. It became clear that there are huge numbers of buggy USB devices out there. It even seems possible that the majority of device designs are buggy. I’m sure that Linux breaks transfers into blocks no bigger than 64 kiB. You may find that Windows does too. One consequence is that transfers above 64 kiB never get tested – after all, what can anyone test them with? And even if they’re tested, what can anyone use them with? Two of the favourite bugs for USB mass storage devices are:
It’s frightening to read of some of the crepe out there. |
Jeffrey Lee (213) 6048 posts |
I’ve found some ‘interesting’ things in SCSIFS. Firstly, there’s already some (disabled) code in there which will limit the maximum number of blocks per transfer to 255 (look for XferLenMax255 in s.ScsiFs15). This code was introduced by Castle after they found that some devices (presumably USB ones?) wouldn’t work reliably with transfers more than 256 blocks in length. Secondly, there’s also some much older code (i.e. it’s as old as CVS) which will limit the maximum number of blocks per transfer to 65535, since 10-byte CDBs only have two bytes of space for storing the transfer length.

But the thing to realise about both of these pieces of code is that they’re both broken. Although they do correct the block count that’s placed in the CDB, they don’t correct the transfer length that’s handed to SCSIDriver. And it’s that uncorrected transfer length which gets inserted into the USB mass storage ‘command block wrapper’. So whenever you try transferring more than 65535 blocks of data, the USB device will receive conflicting information – the command block wrapper will give the original length but the SCSI CDB will give the adjusted length. It wouldn’t surprise me if this was the cause of some of our problems.

So I’m going to have a go at fixing SCSIFS and see if that fixes the issues. If it doesn’t, then I might just enable the XferLenMax255 code (or some variant) – not an ideal solution, since it will affect non-USB devices, but it looks to be a hell of a lot easier to just enable that piece of code than to add an extra layer of code to SCSISoftUSB. |
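For reference, the Bulk-Only Transport command block wrapper looks roughly like the sketch below (illustrative C, not the actual SCSISoftUSB source; a real implementation also has to deal with struct packing and little-endian wire order). The point of the bug is that dCBWDataTransferLength and the transfer length encoded in the CDB must describe the same amount of data, so both need to be derived from the same, possibly clamped, block count.

```c
#include <stdint.h>
#include <string.h>

/* USB mass storage Bulk-Only Transport Command Block Wrapper (31 bytes on
 * the wire, all multi-byte fields little-endian; padding ignored here). */
struct cbw {
    uint32_t dCBWSignature;          /* 0x43425355 ("USBC") */
    uint32_t dCBWTag;
    uint32_t dCBWDataTransferLength; /* bytes the host expects to move */
    uint8_t  bmCBWFlags;             /* 0x00 = OUT (write), 0x80 = IN */
    uint8_t  bCBWLUN;
    uint8_t  bCBWCBLength;           /* valid bytes in CBWCB */
    uint8_t  CBWCB[16];              /* the SCSI CDB itself */
};

/* Illustrative builder for a WRITE(10). The crucial point: the byte count
 * in the wrapper must be blocks * block_size, i.e. derived from the same
 * (possibly clamped) block count that goes into the CDB, not from the
 * original unclamped request length. */
void build_write10_cbw(struct cbw *c, uint32_t lba,
                       uint16_t blocks, uint32_t block_size)
{
    memset(c, 0, sizeof *c);
    c->dCBWSignature          = 0x43425355;
    c->dCBWDataTransferLength = (uint32_t)blocks * block_size;
    c->bmCBWFlags             = 0x00;        /* host-to-device */
    c->bCBWCBLength           = 10;

    c->CBWCB[0] = 0x2A;                      /* WRITE(10) opcode */
    c->CBWCB[2] = (uint8_t)(lba >> 24);
    c->CBWCB[3] = (uint8_t)(lba >> 16);
    c->CBWCB[4] = (uint8_t)(lba >> 8);
    c->CBWCB[5] = (uint8_t)lba;
    c->CBWCB[7] = (uint8_t)(blocks >> 8);    /* 16-bit transfer length */
    c->CBWCB[8] = (uint8_t)blocks;
}
```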
Dave Higton (281) 668 posts |
Jeffrey, my advice would be to limit the maximum length to whatever number of blocks makes 64 kiB. That’s based on what I read on the Linux USB developers mailing list. Sorry, I can’t provide a reference – but it’s one of the things that has stuck in my memory. Even that limit is unlikely to have much of an impact on performance of devices, I would have thought. |
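The arithmetic for that clamp is trivial; a hypothetical helper might look like this (128 blocks for the usual 512-byte sectors, 32 for 2048-byte media such as optical discs):

```c
#include <stdint.h>

/* Convert a 64 KiB byte cap into a block count for the device's sector
 * size. Purely illustrative; block_size is assumed to be non-zero. */
static uint32_t clamp_blocks_for_64k(uint32_t block_size)
{
    return (64u * 1024u) / block_size;
}
```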
Jeffrey Lee (213) 6048 posts |
I think I’ve now got a version of SCSIFS working that correctly limits the transfer length to 64K. If it survives my test regime then I’ll probably check it in tomorrow night. |
Jeffrey Lee (213) 6048 posts |
This fix is now checked in! |
Tank (53) 375 posts |
Jeffrey, I seem to be having a lot of problems with this new version of SCSIFS. My external HDD has been very stable, without any disc errors on both the Devkit and the xM boards. Using the new SCSIFS module (built into a ROM on the xM) quite a few errors started to appear. Reverting back to the previous version and all is well again. With the new module, running !DiscKnight gives 6840 errors on a disc that checks out OK with the old module and also on an Iyonix. This is with ROMs built with the only difference being the SCSIFS module. Both tests on a freshly booted machine with !Zap, !DiscKnight and !Organizer the only apps running (alignment exceptions on). I have tried this a few times, always with the same result. |
Jeffrey Lee (213) 6048 posts |
With the latest code, what happens if you try disabling the XferLenMax64K option in Sources.FileSys.SCSIFS.SCSIFS.s.ScsiFs00? I’ll have a play around with DiscKnight and see if I can find anything wrong. Hopefully it’s just a bug in my code and not another dodgy SCSI controller. Has anyone else run into any problems with this new version of SCSIFS? |
Chris Gransden (337) 1207 posts |
Yes. Lots of disc errors using an external HDD. Similar problems with DiscKnight. Back to normal using a previous version. |
Tank (53) 375 posts |
Changing the option to “F” and building into a ROM works fine when checking with !DiscKnight. |
Jeffrey Lee (213) 6048 posts |
Just to confirm – is it only DiscKnight that people are having trouble with, or are there problems elsewhere too? (i.e. genuine disc corruption that doesn’t go away when reverting to the old module, or programs not loading/saving data properly)

The reason DiscKnight fails is that the code in SCSIFS which limits the transfer length simply transfers 64K of data and then returns to the caller, under the assumption that the caller will spot that not all the data has been transferred and will call the SWI again. And according to the comment next to the original code, this is the behaviour that FileCore takes. However, it looks like FileCore only behaves in this manner when reading/writing files – so if something (like DiscKnight) calls a DiscOp SWI directly then the remainder of the data will be left untransferred.

Looking at FileCore’s source, it looks like the behaviour which SCSIFS relies upon is simply a side-effect of the way FileCore has been written, and isn’t behaviour which other modules should be relying upon. I can’t find any mention of FileCore’s behaviour in the PRMs either. There’s also no mention of what should be done if a transfer completes without an error but with some of the data left untransferred. So I’m going to go with the common-sense approach that read/write calls should either (a) exit with an error or (b) exit with no error and all data transferred. I.e. I’ll add extra code to SCSIFS so that whenever it clamps the transfer length it will instead split the SCSI op into several smaller ones, to ensure that all the data does eventually get transferred. |
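As a rough illustration of that approach (plain C rather than the actual ARM code in SCSIFS, with scsi_rw() standing in as a hypothetical name for the underlying SCSIDriver call): keep issuing capped operations until the whole request has been satisfied, so a direct DiscOp caller never sees a silent short transfer.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the single-shot SCSI read/write handed to SCSIDriver.
 * Transfers 'blocks' blocks starting at 'lba' and returns 0 on success. */
int scsi_rw(int dev, int write, uint32_t lba, uint32_t blocks, uint8_t *buf);

/* Split one logical request into operations no larger than max_blocks, so
 * that on success the caller always gets all of its data, and on failure
 * it gets an error - never a quiet partial transfer. */
int scsi_rw_split(int dev, int write, uint32_t lba, uint32_t blocks,
                  uint8_t *buf, uint32_t block_size, uint32_t max_blocks)
{
    while (blocks) {
        uint32_t chunk = blocks > max_blocks ? max_blocks : blocks;
        int err = scsi_rw(dev, write, lba, chunk, buf);
        if (err)
            return err;                 /* propagate the error immediately */
        lba    += chunk;
        blocks -= chunk;
        buf    += (size_t)chunk * block_size;
    }
    return 0;
}
```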
Jeffrey Lee (213) 6048 posts |
This fix is now checked in – let me know if there are any other problems! |