USB is slow
Dave Higton (281) 668 posts |
Yes, that’s a bit of a generalisation, I know… Yesterday evening I copied my BeagleBoard’s HardDisc0 image (less than 500 MB) onto another stick. The stick was in the Iyonix, and I transferred everything over ShareFS from the stick in the Beagle. It took all evening (roughly 3 hours). I suspect that the USB system has a poor algorithm for allocating transfers into frames and microframes. Bulk transfers, which is what we’re talking about here, should be able to take up all of the bandwidth that remains after periodic transfers (isochronous and interrupt) have been allocated. I’ve attempted to look at the USB code before, but I couldn’t make head or tail of it. Can anyone point me at which source files are responsible for this allocation? |
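To put the figures above in perspective, here is a minimal sketch of the arithmetic in C. The function names are made up for illustration. The 500 MB and 3 hour figures are the approximate ones quoted in the post, and the bus rates are the raw USB signalling rates before any protocol overhead. The thread does not say which bus speed applies to which machine, so treat the comparison as a rough guide only.

```c
#include <assert.h>
#include <stdint.h>

/* Raw USB signalling rates in bits per second, before protocol overhead. */
#define USB_FULL_SPEED_BITS_PER_SEC  12000000.0
#define USB_HIGH_SPEED_BITS_PER_SEC 480000000.0

/* Average throughput in bytes per second for a copy of `bytes` bytes
   that took `seconds` seconds. */
double average_throughput(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds;
}

/* Fraction (0.0 to 1.0) of a bus's raw signalling rate that a given
   byte rate would occupy. */
double fraction_of_bus(double bytes_per_sec, double bus_bits_per_sec)
{
    assert(bus_bits_per_sec > 0.0);
    return (bytes_per_sec * 8.0) / bus_bits_per_sec;
}
```

Calling `average_throughput(500000000ULL, 10800.0)` gives a little over 46 kB/s, which is only about 3% of even the full-speed signalling rate, so the bus itself is unlikely to be the limiting factor.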
Jeffrey Lee (213) 6048 posts |
I’d be more tempted to blame ShareFS, or the internet stack, or perhaps EtherUSB (although I think other people have said in the past that they haven’t had any performance problems with it).
Remember that a few months ago I fixed a design flaw in SCSISoftUSB that limited the transfer rate to one frame every centisecond (or around 3MB/s transfer rate for bulk transfers, and much much worse performance for small files). With the new driver people were reporting speeds of up to 20MB/s with USB hard drives. So one thing to do is to make sure that you’ve got the new drivers installed on your Iyonix (SCSISoftUSB 0.11 or above, and RTSupport – see here for an installation guide).
With the old drivers I also found that the code in the filer to perform interactive file copies was getting confused and limiting the transfer rate more than it should have been, resulting in significantly worse performance than singletasking transfers. So that might also be a factor (if either you’re using an old version of SCSISoftUSB, or something is causing ShareFS/the network stack to perform poorly).
If you still want to take a look at the USB stack, I wrote a quick guide to the source code here.
If you want to look into this problem in any depth then you’ll also find fiqprof invaluable, as it’s the only way you’ll get a global view of what the system is spending all its time doing. It can also allow you to easily spot limitations that may otherwise be hidden in code (e.g. that centisecond timer that SCSISoftUSB was using stood out like a sore thumb). However the current version doesn’t deal with WIMP tasks, so it’s best to only profile single-tasking file transfers (although it shouldn’t be too hard to add WIMP support by tracking task swaps using the filter manager). You’d also probably want to compile your own ROM image(s) so that you can generate GPA files and track down the exact source lines being executed at each point in time.
Unfortunately ShareFS is still closed-source, so if the problem lies there then I don’t think there’s much we’ll be able to do about it :( |
Dave Higton (281) 668 posts |
EtherUSB is a good candidate, because the slow speed seems to apply whether using ShareFS, LanMan98 or NFS via Moonfish/Sunfish. LanMan98 is between the BB and an NSLU2, which is definitely faster when the Iyonix is writing to it. So that’s three protocols and two devices, both of which I know to work faster than the BB. |
Steffen Huber (91) 1953 posts |
Hi Dave, did you have a chance to play around with the prototype code I sent to you? I would be interested to hear what kind of speed you get when reading from a CD/DVD device, this would be a good indicator of USB performance on the BB. |
Dave Higton (281) 668 posts |
I had the briefest of plays late on Sunday (I think) with the reader, only to discover that I need some logging software whose name escapes me. (So many names escape me these days… I hate that one aspect of growing old!) I haven’t had a chance since. The writer arrived yesterday, along with some USB sticks. Bearing in mind the trouble I’ve had with the one I’ve been using, that was top priority. I will play with it, definitely before the end of the weekend, and I’ll let you know what happens. Btw I know the logging software is freely available, I just haven’t installed it on the BB yet. I may well play on the Iyonix before moving it to the BB. |
Steffen Huber (91) 1953 posts |
Hi Dave, the code as I sent it to you needs !Reporter for its tracing/logging, but you can just use a different debug library and it works without !Reporter but logs to a file instead. I think I described that “Feature” in the !ReadMe. I would be especially interested to get comparative performance figures when using different USB buffer sizes and playing around with the “how many blocks at once” settings. And maybe using something different than the standard OS_GBPB to read/write from/to the USB device. I have finally built USB support into CDVDBurn, so I will soon have some performance info for CD/DVD (and hopefully blu-ray, if my BD writer works with the USB-IDE adapter) writing. |
James Peacock (318) 129 posts |
I wouldn’t rule out EtherUSB… Last time I looked into this, EtherUSB’s performance was quite asymmetric: HTTP downloads were much faster than uploads. My memory is hazy now but I believe the problem was due to not getting prompt notification of the DeviceFS output buffer emptying. This led to the addition of an output packet buffer and a callback to prod writes in case they stalled.
The above is a fallback mechanism; the primary notification mechanism avoids timers and callbacks by using UpCalls as follows:
For output streams, EtherUSB hooks onto UpCall 9 (Buffer emptying), sets the buffer threshold to the amount of free space in the buffer when empty and sets the ‘UpCall on threshold’ flag. I suspect this doesn’t always work for some reason. At the time (about 5 years ago!) I remember trying all sorts of variants without much success.
For input streams, EtherUSB hooks onto UpCall 15 (device receive data present) and directly calls the backend to read the newly arrived packet. AFAIK this works well.
See EtherUSB’s usb.c. It is a little messy due to much experimentation. |
Dave Higton (281) 668 posts |
How much data do you put into the send buffer at once? Could it simply be that you’re not supplying enough at a time? |
James Peacock (318) 129 posts |
Everything happens one ethernet packet at a time, so there is at most one ethernet packet in the DeviceFS buffers associated with the Tx and Rx bulk endpoints. All of the devices I’ve seen expect a packet to be sent as a sequence of full USB bulk transfers terminated by a short or empty one. USBDriver module takes care of that, or so it appears. As soon as EtherUSB sees that the Tx buffer is empty it fills it with the next queued packet. Going down this line there are two things to consider:
1) Can an individual USB ethernet device accept multiple packets in one sequence of BULK transfers? Some of them could in theory support this as they prefix the actual packet data with a header indicating the packet length.
2) Is there a better way to send the packets to the USBDriver so it can send them more quickly? |
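As an illustration of the "full transfers terminated by a short or empty one" rule described above, here is a minimal C sketch of how one ethernet packet divides into USB packets. It is not taken from EtherUSB or USBDriver, and the struct and function names are hypothetical. The maximum packet size is a parameter: 512 bytes for a high-speed bulk endpoint, up to 64 bytes at full speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How one transfer of `len` bytes divides into USB packets. */
struct bulk_split {
    size_t full_packets;   /* packets of exactly max_packet bytes */
    size_t short_bytes;    /* size of the final short packet, 0 if none */
    bool   needs_zlp;      /* true if a zero-length packet must terminate */
};

/* Split `len` bytes for an endpoint whose maximum packet size is
   `max_packet`. When `len` is an exact multiple of `max_packet` there is
   no short packet to mark the end, so an empty one is needed instead. */
struct bulk_split split_bulk_transfer(size_t len, size_t max_packet)
{
    struct bulk_split s;
    assert(max_packet > 0);
    s.full_packets = len / max_packet;
    s.short_bytes  = len % max_packet;
    s.needs_zlp    = (s.short_bytes == 0);
    return s;
}
```

For example, a 1514 byte ethernet frame on a 512 byte endpoint becomes two full packets plus a 490 byte short packet, whereas a 1024 byte frame needs an empty terminating packet.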
James Peacock (318) 129 posts |
Dave, have you tried just downloading some big files from a webserver with NetSurf or wget, say, to see what the throughput for a plain TCP stream is? I have noticed (though not properly measured) that this was a much faster way to transfer a single large tarball to a BeagleBoard than using NFS or ShareFS. |
Jeffrey Lee (213) 6048 posts |
James – I sent you an email this morning with some quick optimisations (for some reason the ROOL forums weren’t working for me at the time). Basically the performance problems are because you’re only adding data to the transmit buffer when the buffer is fully empty, and the buffer sizes are too small/not specified (or at least they were with the Pegasus backend – and I must have copied that code from one of the other backends). By increasing the transmit buffer size and fixing the code to try and keep the buffer full of data, I was able to get around a fivefold performance increase when copying files to a PC using Sunfish (UDP NFS protocol). The performance seemed to be about half that of my Iyonix’s onboard NIC.
My memory is a little hazy, but yes, I believe that behaviour is as defined in the USB spec. Applications should treat bulk transfers as just a raw stream of data, like a TCP stream. It’s only the lower levels of the USB stack that need to have any knowledge of the way that the stream gets split into packets in order to be sent across the packetised USB network. This is different to control and interrupt endpoints, where applications do need to be aware of the packetisation.
For NICs that prefix the packet with the packet length, yes, you should be able to store multiple packets in transmit/receive buffers. Do you have any examples of NICs that don’t add headers to the packets? Are you sure they don’t use some other method to indicate the packet length? (e.g. they send interrupt packets to indicate the length of the packet(s) in the internal RX buffer) Also, surely the packet itself must contain the length? (my knowledge of the lower levels of TCP/IP admittedly isn’t that great)
Not that I know of. Just make sure that you use reasonably sized buffers, keep your transmit buffer full, and your receive buffer empty! |
Dave Higton (281) 668 posts |
I don’t have any direct knowledge of USB Ethernet, so please prefix this with warnings. However… I would expect that they can. Surely it would not take you long to try it in a modification of your code? Change the transmit criterion from “Tx buffer empty” to “enough space in the Tx buffer”. Either it will run at much increased speed, or it will fail horribly. |
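The change of transmit criterion suggested above can be sketched as two predicates in C. This is only an illustration with hypothetical names, not EtherUSB code. It also says nothing about whether a given chip will cope with two packets sitting in the buffer at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the transmit buffer's fill level. */
struct tx_buffer_state {
    size_t capacity;  /* total size of the transmit buffer in bytes */
    size_t used;      /* bytes currently waiting to be sent */
};

/* "Tx buffer empty": only refill once everything has been sent. */
bool can_send_when_empty(const struct tx_buffer_state *b, size_t packet_len)
{
    return b->used == 0 && packet_len <= b->capacity;
}

/* "Enough space in the Tx buffer": refill whenever the next packet fits. */
bool can_send_when_space(const struct tx_buffer_state *b, size_t packet_len)
{
    assert(b->used <= b->capacity);
    return packet_len <= b->capacity - b->used;
}
```

With a 4096 byte buffer already holding one 1514 byte packet, the first test refuses a second packet and the second test accepts it, which is the difference being proposed.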
James Peacock (318) 129 posts |
Jeffrey – thanks for the mail. I’ll try to have a look at it this weekend.
IIRC, at the ethernet level, there is a 2 byte length field in the packet, but that generally gets used to hold the packet type and so can’t be used. The AX88172 chip (the first backend implemented) doesn’t have a packet length header, its datasheet at least doesn’t suggest any other way of passing this. Though to be honest it isn’t the most complete document, I’ll see if I can dig out the NetBSD driver again. I did remember trying larger rx buffer sizes with that chip, though presumably without much success given that the buffer is only big enough for one full sized ethernet packet. The later Asix parts do have a header, but looking at the code the MCS chip doesn’t. Even so, it would probably benefit from refilling the tx buffer as soon as it begins to empty the one packet it contains as long as the USBDriver can cope with it and not merge the two packets into one.
How does the USBDriver determine the length of the packet (as in cases like this where it is a sequence of packets rather than a continuous stream of bytes)? My assumption at the time, given the lack of documentation, was that it was the amount of data in the buffer and hence the present implementation. If it could track the multiple packets in its buffer then I could keep it as full as possible for chips which don’t support packing multiple packets into one transfer.
When it was all written much of this was guess work. I had (and still have) little idea of what is and isn’t allowed when driving the USBDriver at this level which was the other reason it was so conservative.
As an aside, is there a high resolution monotonic timer I can get access to easily? I’d like to knock up a module to provide a light weight means of recording a sequence of specific events, say an array of (time-of-event, event-code), in a DA. |
Jeffrey Lee (213) 6048 posts |
I’ll have a read through the USB docs & source code tonight and see if I can come up with an answer for this, since as I’ve said I’m not 100% sure myself.
Yes, you can use one of the HAL timers (specifically HAL timer 1, since the Iyonix only has two timers and timer 0 is used by RISC OS). However rather than using the timer directly it’s probably a better idea to use Rik Griffin’s HALTimer module since RISC OS doesn’t enforce any kind of control over who/what uses HAL resources. |
James Peacock (318) 129 posts |
... or it will partially work depending on the tx queue, merging some packets and not others, but since TCP is quite resilient will probably appear to work fine. I need to faff about with the network and get ethereal (or whatever it is called these days) working again. |
Dave Higton (281) 668 posts |
Wireshark. One of the best programmes I’ve ever used. It’s active on my desktop now. And don’t forget Wiresalmon for use on the RISC OS platform. You can capture using Wiresalmon and display using Wireshark. |
Jeffrey Lee (213) 6048 posts |
After reading through the USB docs and the USBDriver source, this is the situation:
There are still a few bits left for me to investigate with regards to read pipes, but hopefully the above will prove of some use. And if none of the above makes any sense, it’s probably because I had to rewrite most of it after realising that the DeviceFS interface won’t allow you to have multiple outstanding IRPs on the same pipe (since there’s only one usbd_xfer per pipe) |
James Peacock (318) 129 posts |
So, to write a packet: I need to find the buffer’s insertion address and check that the packet will fit, if so I copy the packet into the buffer, add its size to a list of buffered packets and reset the threshold so an Upcall gets triggered once the first packet in the buffer has been removed. The Upcall must look at the buffer’s used space to determine which packets have gone and remove those from the list. It then needs to reset the threshold for the next packet and attempt to fill the buffer with any waiting ones. In either case I need to trigger the write by faking a buffer fill of the correct size. Can I call the device WakeUpForTx call instead of doing a fake buffer fill?
My knowledge of the depths of USB isn’t that great, however it may suggest why I had so much trouble trying to get libptp2 to work: is it actually possible to read a large, camera JPEG sized, chunk of data if the request gets truncated to at most the buffer size? Or will it work if the buffer is continuously emptied quickly enough?
I haven’t been able to get the buffer filling fix to work with any of my devices as of yet, though the Internet module assuming that ‘ej0’ always has the same MAC even if I deregister it and reregister it with a different MAC didn’t help (nor did it getting into a strop due to excessive use of route and ifconfig, it turns out RMReinit Internet is the way to go).
Anyway, I’ve changed tack and have been working on a module to aid getting good enough logging to see what is going on as SysLog isn’t up to it. The module just creates a DA into which it puts a sequence of events with a highish resolution timestamp (using the TimerMod module). I’ve recompiled EtherUSB with profiling enabled and hooked the SCL stubs to redirect the calls to _count1 that the compiler generates at the start of each function to a little routine which logs the PC of the caller to the event log. 
There is a work in progress python script to convert this binary log into something more readable and filter out the uninteresting bits. Once I’ve added a few explicit log events for buffer sizes and anything interesting happening, hopefully I’ll be able to see what is going on. |
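For readers wondering what such an event log might look like, here is a minimal C sketch of an array of (time, code) records. It is not the module described above. The names, the fixed capacity and the choice to overwrite the oldest record when full are all assumptions made for illustration, and the timestamp is simply passed in by the caller rather than read from any particular timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EVENT_LOG_CAPACITY 1024u   /* hypothetical size; must be > 0 */

struct event_record {
    uint32_t time;   /* timestamp in whatever units the timer provides */
    uint32_t code;   /* event code, e.g. a caller PC or an explicit tag */
};

struct event_log {
    struct event_record records[EVENT_LOG_CAPACITY];
    size_t next;     /* index of the slot the next event is written to */
    size_t count;    /* number of valid records, at most the capacity */
};

/* Reset the log to empty. */
void event_log_init(struct event_log *el)
{
    el->next = 0;
    el->count = 0;
}

/* Append one event, overwriting the oldest once the log is full. */
void event_log_add(struct event_log *el, uint32_t time, uint32_t code)
{
    el->records[el->next].time = time;
    el->records[el->next].code = code;
    el->next = (el->next + 1) % EVENT_LOG_CAPACITY;
    if (el->count < EVENT_LOG_CAPACITY)
        el->count++;
}

/* Return the i-th oldest record still held (0 = oldest). */
const struct event_record *event_log_get(const struct event_log *el, size_t i)
{
    size_t start;
    assert(i < el->count);
    start = (el->next + EVENT_LOG_CAPACITY - el->count) % EVENT_LOG_CAPACITY;
    return &el->records[(start + i) % EVENT_LOG_CAPACITY];
}
```

Keeping each record to two fixed-size words keeps the logging cost small, which matters when the log call sits at the start of every function.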
Jeffrey Lee (213) 6048 posts |
Not quite – although you’ve written multiple packets to the buffer, the buffer manager is only ever aware of one packet at a time. I.e. you write as many packets as possible to the buffer, then call the buffer insert routine with the right length for just the first packet to be ‘inserted’. Then when you get the buffer emptying upcall you’ll know that the packet has been removed, so you just make another insert call for the next packet (and/or fill the buffer with more pending packets).
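The bookkeeping for that scheme might look something like the following C sketch. It only tracks lengths. The actual copy into the buffer, the buffer manager insert call and the upcall hook are not shown, and every name here is hypothetical rather than part of any RISC OS API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STAGED 8   /* hypothetical limit on packets held in the buffer */

struct tx_stage {
    size_t capacity;             /* buffer size in bytes */
    size_t staged_bytes;         /* bytes copied in but not yet sent */
    size_t lengths[MAX_STAGED];  /* lengths of staged packets, oldest first */
    size_t count;                /* number of staged packets */
    size_t announced;            /* length told to the buffer manager, 0 if none */
};

/* Record a packet as copied into the buffer if it fits. Returns 1 on
   success, 0 if the caller must keep the packet queued elsewhere. */
int stage_packet(struct tx_stage *t, size_t len)
{
    assert(len > 0);
    if (t->count == MAX_STAGED || len > t->capacity - t->staged_bytes)
        return 0;
    t->lengths[t->count++] = len;
    t->staged_bytes += len;
    return 1;
}

/* Pick the oldest staged packet if nothing is in flight. Returns the
   length to pass to the (not shown) buffer insert call, or 0 if nothing
   should be inserted now. */
size_t announce_next(struct tx_stage *t)
{
    if (t->announced != 0 || t->count == 0)
        return 0;
    t->announced = t->lengths[0];
    return t->announced;
}

/* Called from the buffer-emptying upcall: the announced packet has gone,
   so drop it from the list and free its space. */
void on_buffer_emptied(struct tx_stage *t)
{
    size_t i;
    if (t->announced == 0)
        return;
    t->staged_bytes -= t->announced;
    for (i = 1; i < t->count; i++)
        t->lengths[i - 1] = t->lengths[i];
    t->count--;
    t->announced = 0;
}
```

The point of the split is that the buffer manager only ever sees one packet, while the driver can have several ready to go the moment the upcall arrives.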
No, because WakeUpForTx relies on the used space as reported by the buffer manager.
It should work fine, since the way the code is written means that the USB device isn’t aware of the fact that the transfer has been split in two.
I’ve just been having a look at the NetBSD AX88172 sources, and it looks like it does something which the DeviceFS interface currently doesn’t allow – namely, it uses the USBD_FORCE_SHORT_XFER flag when writing packets. This flag causes the driver to send a zero-length packet at the end of the IRP if the IRP length is an exact multiple of the max packet size. And to quote the EHCI driver source, “needed for ethernet devices”. (This also points to a bug in my MUSBDriver, since I don’t handle that flag at the moment.) So it’s safe to assume that the AX chips detect the end of the ethernet packet by looking for a non-full USB packet. The same method is also likely to be used for received packets, i.e. the AX chip will always send a non-full packet at the end of the transfer. AFAIK the current DeviceFS interface will handle this correctly (unless you explicitly start another transfer by calling RxWakeUp).
I’m tempted to suggest that we try switching to the API used by the Simtec USB stack, since it seems to be designed a lot better than Castle’s. However I’m not sure if the API is documented to the level that we’d require, and I’m not sure how easy it would be to get the two APIs to coexist so that existing drivers can continue to function. It would also obviously take a few weeks/months to get it up and running. Alternatively we could also try exposing the usbd_transfer function – this would certainly make things easier for ports of BSD code (Simtec’s C API seems to be a bespoke API as far as I can tell, so isn’t an exact match for BSD). If we exposed some/all of the BSD functions then it could be used as a stepping stone to a wrapper layer that implements the Simtec API. Thoughts?
Also, a correction/addition to my above text:
A STALL packet is interpreted as an error condition that must be cleared by software (hence the ‘ClearStall’ USBDriver DeviceFS call; see also page 207 of the USB spec). So ordinarily when receiving data a client must know how much data they want, or have the code written in such a way that it automatically stops trying to receive once a non-full packet is received (like some of the DeviceFS interface seems to be constructed). |
Jeffrey Lee (213) 6048 posts |
It also looks like there’s a bug in your RX code – usb_start_read doesn’t pass the (undocumented) read size in R3 to RxWakeUp. So at the moment it’s pure luck that it works at all. You’ll also want to start using Thomas Milius’s extended TransferInfo USB DeviceFS call in order to read the number of padding bytes that were inserted by the crappy DeviceFS interface (see the read_cb function. And that code could even crash the system, if both the free space in the buffer and ‘ Have I mentioned how much of a horrible hack I now think this DeviceFS API is? ;) |
Jeffrey Lee (213) 6048 posts |
And an addendum to the addendum: Apart from a STALL response and a data response, a device can also give a NAK response to indicate that the device can’t service the request at the moment. E.g. if you send a mass storage device a read command and then try reading from the pipe that’s used to transmit the read data then the device will send NAK responses until there’s enough data in its internal buffer for it to send something useful over the USB connection. The NAKs are usually hidden from the client’s view, but sometimes endpoint descriptors have NAK timeouts specified so that an error condition can be raised by the host controller/USB stack if an endpoint returns too many NAKs in a row. |
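The three responses discussed above (data, NAK and STALL) can be illustrated with a small C simulation. It is a sketch over a recorded list of responses, not how any host controller driver is actually written. The names and the "too many NAKs in a row" limit are assumptions for illustration.

```c
#include <assert.h>

enum usb_response { RESP_DATA, RESP_NAK, RESP_STALL };

enum xfer_status { XFER_OK, XFER_NAK_TIMEOUT, XFER_STALLED };

/* Walk a recorded sequence of device responses to one IN request.
   NAKs are retried silently until more than nak_limit arrive in a row;
   a STALL ends the transfer with an error that software must clear.
   Running out of responses without data is also reported as a timeout. */
enum xfer_status poll_endpoint(const enum usb_response *responses,
                               int n, int nak_limit)
{
    int naks = 0;
    int i;
    assert(n >= 0 && nak_limit >= 0);
    for (i = 0; i < n; i++) {
        switch (responses[i]) {
        case RESP_DATA:
            return XFER_OK;          /* device had something to send */
        case RESP_STALL:
            return XFER_STALLED;     /* needs an explicit clear by software */
        case RESP_NAK:
            if (++naks > nak_limit)
                return XFER_NAK_TIMEOUT;
            break;                   /* not ready yet; ask again */
        }
    }
    return XFER_NAK_TIMEOUT;
}
```

A mass storage read that answers NAK, NAK, data would therefore look like a single successful read to the client, with the NAKs hidden as described.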
Dave Higton (281) 668 posts |
Let’s give Castle a little more credit; I suspect the API may have been designed from a viewpoint that didn’t have enough experience to foresee its shortcomings. USB is in many respects fearsomely complex. There are lots of engineers out there who write code without realising how bad it is, or who design products that are difficult to drive :-( This has the effect of making USB programming even harder in order to avoid malfunctions caused by their bad decisions.
I have struggled to understand the documentation, so I don’t think it is documented well enough. But far worse than that: the interface is at such a low level, it makes it hard work to do the simplest of things.
Peter Naulls made a suggestion a while back: that RISC OS should adopt the libusb API. Now I haven’t looked at all at what would be involved, but if it were achievable, it would make it easy to port device drivers from Linux. This could have the benefit of making a hugely wider range of drivers available to RISC OS, but has a danger of poisoning the Shared Source licenced code of RISC OS with code of a fundamentally incompatible licence, if we’re not careful about where the new code comes from. |
Jeffrey Lee (213) 6048 posts |
You mean I can blame them for some other things as well? ;) I can understand that whoever thought DeviceFS was a good idea wasn’t fully aware of the problems it would cause. I think the main reason that I’m annoyed is that it’s been left to us to fix it. And since sufficiently skilled and motivated programmers are in short supply, the act of fixing it could significantly slow down development in other areas.
Yes, the one advantage that the DeviceFS interface has is that it’s easy to use. But as soon as you become interested in performance (which is really what every device driver of any kind should be interested in) you’ll find it wholly inadequate and have to start jumping through numerous hoops in order to try and get the best performance. The complexity of Simtec’s API is because it exposes the high-performance features of the USB stack that Castle’s interface does its best to hide.
The libusb API could probably be added as a wrapper around the BSD API, much like the Simtec API could be. I think our first step should be to try exposing the BSD API, since that’s what we’re currently using internally and where most of our device drivers are coming from. |
James Peacock (318) 129 posts |
I think lack of documentation is certainly a problem, as are the hoops you have to jump through to use it. Searching csap or these forums brings up things which people have had trouble with. The weird buffer padding behaviour is both undocumented and annoying. See the following thread (and for the reason for that ‘feature’): |
Stephen Leary (372) 272 posts |
Just like to add that I’ve been through EtherUSB with a fine tooth comb and it’s pretty clean/good. EtherGEP is based on the same code (except the direct USB bits) and goes like a bat out of hell. So my conclusion would tend to be that the USB interface is the issue. My own USB drive performance stinks. It’s far faster for me to use Sunfish and mount a share after I boot and use that as my working disk. |