Network & USB inefficiencies
Jeffrey Lee (213) 6048 posts |
After many years of hiding from it in fear, I’ve finally taken another look at ticket #324. I haven’t observed any crashes yet, but I am seeing some behaviour that indicates the system is close to crashing (e.g. excessive CPU usage), and some undesirable side-effects which are perhaps related to that (e.g. excessive packet loss).

Profiling on my BB-xM showed that when subjected to a packet flood, about 30% of the CPU time was spent in BufferManager, copying the received USB data into the DeviceFS buffer. Apart from the slow performance of the copy routine, the copy was also being performed with interrupts disabled, and seemed to be taking long enough to break the age-old RISC OS 3 PRM rule of not spending more than 100µs with interrupts disabled (although the overhead of the profiling may have contributed to that). After that, the data would be copied out of the DeviceFS buffer and into mbufs (again with IRQs disabled, albeit much quicker this time due to being in cacheable memory) for processing by the Internet module (which, thankfully, is performed with IRQs enabled).

Presumably the high packet loss when the system is running at a lower clock speed is a symptom of the interrupt-driven USB → DeviceFS copy routines taking far too much time compared to the callback-driven Internet code which is processing the packets (with very little time, if any, left for foreground programs to run and actually process those packets). This suggests we have several areas for improvement.
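To make the data flow concrete, here is a rough sketch of the receive path described above, in C. All names and buffer sizes are hypothetical – this is an illustration, not the actual EtherUSB/BufferManager/Internet code:

  #include <stddef.h>
  #include <string.h>

  /* Hypothetical buffers standing in for the real ones. */
  static unsigned char usb_data[2048];      /* data arriving over USB      */
  static unsigned char devicefs_buf[2048];  /* uncacheable DeviceFS buffer */
  static unsigned char mbuf_store[2048];    /* cacheable mbuf storage      */

  /* Copy 1 - performed in interrupt context with IRQs disabled: the
     slow copy into the uncacheable DeviceFS buffer, where ~30% of the
     CPU time was measured. */
  static void usb_to_devicefs(size_t n)
  {
      memcpy(devicefs_buf, usb_data, n);
  }

  /* Copy 2 - also with IRQs disabled, but quicker because the
     destination is cacheable: DeviceFS buffer into mbufs. */
  static void devicefs_to_mbufs(size_t n)
  {
      memcpy(mbuf_store, devicefs_buf, n);
  }

  /* Finally the Internet module processes the mbufs from a callback,
     with IRQs enabled. */
  static void internet_process(size_t n)
  {
      (void)n;  /* header parsing, socket queueing, etc. */
  }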
I’m yet to do any profiling on my Iyonix (hopefully tonight) – it’ll be interesting to see what the bottleneck is there, since there’s no USB/BufferManager involved. Perhaps it’ll be a similar problem, i.e. EtherK might be copying data into mbufs from within its IRQ handler. |
Rick Murray (539) 13850 posts |
Aside: A few lines down in the source linked, it says “this is meant to be called from a kernel debugger”. |
Jeffrey Lee (213) 6048 posts |
I suspect that comment hails from BSD-land, where good debugging tools grow on trees. |
Steve Pampling (1551) 8172 posts |
The note by Sprow suggests that the problem is buffer-related – pinging with a packet size bigger than the MTU will involve the packet being split and re-assembled at the destination, and the re-assembly is either failing or causing the buffer to fill while the packet re-assembly delay is active. Interesting from my viewpoint, as I regularly test for underlying problems by increasing the packet size for the ping test¹, where duplex mis-match issues and other bandwidth-affecting problems tend to show up. In the declared scenario I wonder what happens if you set -f on the ping sent from an interface with a larger MTU. If it’s a buffer issue the problem shouldn’t arise, as the interface won’t (well, shouldn’t) accept the larger packet.

¹ On a PC, ping 8.8.8.8 -l 2048 gives a packet size of 2048 bytes, for those interested. |
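For reference, the two variants of the test Steve describes (Windows syntax; the annotations are mine, not his):

  ping 8.8.8.8 -l 2048        (2048-byte payload, fragmentation permitted)
  ping 8.8.8.8 -l 2048 -f     (Don't Fragment set: if the payload exceeds
                               the path MTU this typically reports "Packet
                               needs to be fragmented but DF set" rather
                               than splitting the packet)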
Jeffrey Lee (213) 6048 posts |
I’m fairly certain my original testing was done with packets which were smaller than the MTU. In any case, my testing over the past few days hasn’t tried going above the MTU. |
Steve Pampling (1551) 8172 posts |
Regular use case but hardly likely to trigger errors |
Jeffrey Lee (213) 6048 posts |
The Iyonix profile results are a lot harder to read – lots of jumping around between MManager, EtherK and Internet. After hastily adding a histogram option to profanal, it looks like 43% of the time is spent in MManager, 29% in Internet, 14% in EtherK, and 5% in the SCL – but it’s hard to say exactly which functions are taking up all the time, since I don’t have a convenient way of getting the addresses of C ‘static’ functions. And of course MManager is closed source (but if the BB results were anything to go by, I’d guess that most of the time in there is spent copying from the NIC receive buffer). |
Rick Murray (539) 13850 posts |
Management? Hmm, on second thoughts, naaaaaahhh…. |
Rick Murray (539) 13850 posts |
Seriously, though, Steve is right. If you design something to be capable of “x”, you shouldn’t really be testing it with values less than “x”. Preferably more, but that depends on what “x” is – there’s a big difference between packet sizes and volts… Is there no way of getting hold of the MManager source code? It’s a little worrying that nearly half the time is spent within it. Heck, for all we know it could be doing a dumb single-register LDRB-STRB copy for everything…? |
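For illustration of why that would matter – a sketch of mine, not the actual copy routine, which we can’t see – compare a byte-at-a-time loop with a word-aligned copy:

  #include <stddef.h>
  #include <stdint.h>

  /* Byte-at-a-time: compiles to a LDRB/STRB pair per byte, so every
     byte pays the full memory round trip - especially painful when the
     source or destination is uncacheable. */
  void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
  {
      while (n--)
          *dst++ = *src++;
  }

  /* Word-at-a-time for 4-byte-aligned buffers: four words per
     iteration gives the compiler the chance to emit LDM/STM, moving
     16 bytes per loop instead of 1. */
  void copy_words(uint32_t *dst, const uint32_t *src, size_t words)
  {
      while (words >= 4) {
          dst[0] = src[0]; dst[1] = src[1];
          dst[2] = src[2]; dst[3] = src[3];
          dst += 4; src += 4; words -= 4;
      }
      while (words--)
          *dst++ = *src++;
  }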
Steve Pampling (1551) 8172 posts |
Typical manufacturer testing of items like telephone switching systems¹ will usually involve running the tested item at PSU output levels a percentage above the rated value. The hope is to test the limits of any particular batch.

¹ Personal experience, years ago. |
Colin (478) 2433 posts |
As I see it there are 2 problems:

1) High CPU usage due to buffer bouncing in EtherUSB
2) Packet loss once the supply of mbufs is exhausted

For me the best solution for 1 is to use the NetBSD interface directly. I have submitted a demonstration of how to do it to ROOL (I’ve moved the keyboard and mouse – which use the NetBSD interface – into separate modules without any changes to the NetBSD code) for them to evaluate.

2 is a bit trickier. I’ll ignore the fact that GB Ethernet can supply data quicker than USB or some Ethernet PHYs can handle (if you are using a GB adapter/PHY). The incoming packets are read from the device’s DMA buffer and converted into mbufs. This happens with interrupts disabled. Large transfers continually create mbufs until they are exhausted, and mbufs can never be consumed while more data is arriving, because mbufs are consumed/released in callbacks (I think). So after mbufs are exhausted, packets are dropped.

If you delay the reading of the device – by, for example, reading the device in a callback, effectively using the interrupt to trigger a new callback if there isn’t one already in progress (see the sketch below) – the delay from interrupt to callback can be 0.25 sec. This may result in the device missing packets, as it isn’t being emptied quickly enough. Once a device misses a packet, a transfer can be subject to long delays waiting for missed replies – which I think is what happens in ShareFS. Unfortunately a lack of PMT means waiting doesn’t multitask. |
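A minimal sketch of the interrupt-to-callback pattern described above, assuming a CMHG-generated veneer (rx_callback_entry) for the callback handler and hypothetical device hooks – an illustration of the idea, not EtherUSB code:

  #include "kernel.h"
  #include "swis.h"

  /* Hypothetical hardware/driver hooks, for illustration only. */
  extern void device_mask_rx_irq(void);
  extern void device_unmask_rx_irq(void);
  extern int  device_rx_ready(void);
  extern void convert_packet_to_mbufs(void);

  extern void rx_callback_entry(void);   /* CMHG-generated veneer */

  static volatile int callback_pending = 0;

  /* IRQ handler: do no copying here - mask the device's receive
     interrupt and schedule a transient callback if one isn't already
     pending. */
  void rx_irq_handler(void *pw)
  {
      device_mask_rx_irq();
      if (!callback_pending) {
          callback_pending = 1;
          /* OS_AddCallBack: R0 = handler address, R1 = R12 value */
          _swix(OS_AddCallBack, _INR(0,1), rx_callback_entry, pw);
      }
  }

  /* Callback handler: runs with IRQs enabled, so draining the DMA
     buffer into mbufs no longer starves the rest of the system. */
  void rx_callback_handler(void *pw)
  {
      (void)pw;
      while (device_rx_ready())
          convert_packet_to_mbufs();
      callback_pending = 0;
      device_unmask_rx_irq();
  }

As Colin notes, the catch is the latency: the gap between the interrupt and the callback can be long enough for the device’s own buffer to overflow instead.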
Colin (478) 2433 posts |
Buffer insertion is done in USBDriver so there is no need to modify each backend. |
Jeffrey Lee (213) 6048 posts |
Yes, using the BSD interface directly would certainly help. Hopefully your changes get accepted! |
Steve Pampling (1551) 8172 posts |
And there people were wondering what on earth RO could usefully do with an unused core on a multicore system. Note that on PeeCee systems there is typically a smallish processor in the NIC that you can offload stuff to in order to achieve better throughput. Not sure whether the on board NIC of the Iyonix has that but the equivalent chipset on a PCI card does. |
David Feugey (2125) 2709 posts |
Yep. And on Intel cards it was probably sometimes an XScale :) |
Jeffrey Lee (213) 6048 posts |
The irony being that the Iyonix uses an XScale and currently can’t deal with 100Mb/s of traffic, let alone 1Gb/s? ;-) |
Steffen Huber (91) 1953 posts |
Can someone remind me why MBufManager is such a “magic piece” of code that no one has ever dared to replace it with something better (and with source available)? |
Colin (478) 2433 posts |
MBufManager isn’t a problem – it’s just a means of buffering incoming packets which is optimised for the Internet module – though I do think it might be better if rx and tx had their own mbuf pools. It’s a lot easier to move packets around and add headers to packets with MBufManager than using a buffer like USB has, for example. The problem is the buffer size isn’t infinite. There are 2 situations:

1) Packets come in small bursts smaller than the total remaining mbufs, or mbuf creation is quicker than the arriving packets.
2) Packets arrive in large chunks greater than the remaining mbufs.

For 1, packets are converted into mbufs and linked together – so you get a list of mbufs – until the Ethernet device has no more packets to convert, at which point the interrupt returns and the mbufs get a chance to be consumed/released.

For 2, the interrupt doesn’t return until all mbufs are used. No replies can be made until a received mbuf is consumed and released, but even then any reply may not get the chance, as you may still be receiving interrupts which may consume the released mbuf. You need to: 1) switch the network interrupt off after the first packet arrives |
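To illustrate the chaining and pool exhaustion Colin describes, a simplified sketch – the field names are illustrative, not the real DCI4 mbuf layout, which has more fields:

  #include <stddef.h>
  #include <string.h>

  struct mbuf {
      struct mbuf  *m_next;      /* next mbuf in this packet's chain */
      size_t        m_len;       /* bytes of data held in this mbuf  */
      unsigned char m_data[128]; /* payload storage                  */
  };

  /* A fixed pool, as MbufManager effectively has: once it's empty and
     more data arrives under interrupt, packets have to be dropped. */
  static struct mbuf  pool[32];
  static struct mbuf *free_list;

  static void pool_init(void)
  {
      for (int i = 0; i < 31; i++)
          pool[i].m_next = &pool[i + 1];
      pool[31].m_next = NULL;
      free_list = &pool[0];
  }

  static struct mbuf *mbuf_alloc(void)
  {
      struct mbuf *m = free_list;
      if (m) { free_list = m->m_next; m->m_next = NULL; }
      return m;                  /* NULL = pool exhausted */
  }

  static void mbuf_free(struct mbuf *m)
  {
      m->m_next = free_list;
      free_list = m;
  }

  /* Copy one received frame into a chain of mbufs; if the pool runs
     dry part-way, the whole frame is dropped (returns NULL). */
  static struct mbuf *frame_to_chain(const unsigned char *p, size_t n)
  {
      struct mbuf *head = NULL, **tail = &head;
      while (n) {
          struct mbuf *m = mbuf_alloc();
          if (!m) {
              while (head) {          /* drop: release partial chain */
                  struct mbuf *dead = head;
                  head = head->m_next;
                  mbuf_free(dead);
              }
              return NULL;
          }
          m->m_len = (n < sizeof m->m_data) ? n : sizeof m->m_data;
          memcpy(m->m_data, p, m->m_len);
          p += m->m_len;
          n -= m->m_len;
          *tail = m;
          tail = &m->m_next;
      }
      return head;
  }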
Jeffrey Lee (213) 6048 posts |
Plus if there was a zero-copy option. |
Colin (478) 2433 posts |
Isn’t that what the ‘unsafe’ mbuf is for? I’m not too sure – my memory of the details is sketchy. |
Colin (478) 2433 posts |
Regarding zero-copy options: does non-cached memory come from the same pool as cached memory? I’m just wondering if it should be used sparingly. Take audio, for example. If you read a file from a callback created from a 2cs CallEvery event, you need to queue about 0.25 sec of audio – callbacks can take a while to happen – which would take a large chunk of uncached memory at high resolutions (at 192kHz 32-bit stereo, for instance, 0.25 sec works out at roughly 384KB). Is uncached memory scarce? Should I minimise its use? |
Jeffrey Lee (213) 6048 posts |
I don’t believe so – or at least, I’m fairly certain I’ve seen a comment somewhere in the OS sources saying that zero-copy mbufs are still on the wishlist.
Fundamentally, yes, cached & non-cached memory come from the same pool. Pages which are in the free pool DA can be allocated to either cached or non-cached DAs. So apart from some known issues with how physically contiguous memory is handled, there aren’t any limits on non-cached memory usage beyond those which would also apply for cached memory.

The main reason you’d want to avoid making heavy use of non-cached memory is because it’s very slow for CPU access, especially read operations (writes can usually be buffered to an acceptable degree). DMAManager can work around that by allowing you to use cacheable memory directly (with DMAManager doing the appropriate cache/TLB management to make this usage pattern safe). But not everything makes use of DMAManager, partly because not every DMA controller is suited to the way DMAManager does things. And more often than not, the stuff that doesn’t use DMAManager simply uses regular uncacheable memory allocated via PCI_RAMAlloc. |
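For anyone wanting to experiment with a non-cached DA from C, a hedged sketch – the register usage follows the OS_DynamicArea 0 (create) documentation, but the flag bits here are from memory (PRM: bit 4 = not bufferable, bit 5 = not cacheable) and should be checked before relying on them; the area name is made up:

  #include "kernel.h"
  #include "swis.h"

  /* Create a non-cacheable dynamic area; returns the area number, or
     -1 on error. The base address is written to *base_out. */
  int create_uncached_da(int size, void **base_out)
  {
      int number;
      void *base;
      _kernel_oserror *e = _swix(OS_DynamicArea,
          _INR(0,8) | _OUT(1) | _OUT(3),
          0,               /* reason code 0: create                  */
          -1,              /* area number: let the OS choose         */
          size,            /* initial size                           */
          -1,              /* base address: let the OS choose        */
          (1<<4) | (1<<5), /* flags: not bufferable, not cacheable
                              (assumed bit positions - check the PRM) */
          size,            /* maximum size                           */
          0,               /* no handler routine                     */
          0,               /* no handler workspace                   */
          "UncachedDA",    /* area name (hypothetical)               */
          &number, &base);
      if (e != NULL)
          return -1;
      *base_out = base;
      return number;
  }

As Jeffrey says, CPU reads from an area like this will be slow, so it’s best kept for DMA buffers that the CPU mostly writes to or leaves alone.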