Building a better memcpy()
Adrian Lees (1349) 122 posts |
FWIW, there are XScale-optimised implementations of memcpy/memmove (and memset, strcpy and strlen). They were written for Aemulor’s internal use (it cannot use the SharedCLibrary, fairly obviously) and some/all were later incorporated into UnixLib. There’s also a test rig for checking correctness and measuring performance. I can make all this available if it’s helpful. These routines are likely inappropriate for ARM’s own CPU implementations, being built for the long pipeline of the XScale, its LDM/STM decoding overhead and its high memory latency.

I think the addresses of the appropriate routines should be returned via an OS SWI call. Obviously the SharedCLibrary can remember these addresses when it starts up and avoid extra indirection by directly patching the stubs in calling applications. The OS should probably get some/all of these addresses from the HAL, since I’m reminded that one issue in the boot time of RISC OS is the cost of zero-initialising the memory.

Overlapping does make a difference to DMA implementations, more so than to software implementations, because the DMA engine will usually have a larger internal buffer and may cope only with transfers to/from ascending addresses.

Also consider reentrancy. Interrupt handlers should not be performing large transfers, but callback handlers may; I don’t know the internal details of ShareFS but that must surely be running in callbacks, and of course TaskWindow could preempt an operation that is running in USR mode.

A comment regarding the use of hardware acceleration: it would be good to have available some fast ‘safe’ routines that can be relied upon not to optionally exploit accelerators, not just ones that automatically employ the fastest method. By ‘safe’ I mean ones that use only the stack as temporary storage, are software only, and will not potentially run into issues with caching/interrupts/reentrancy etc. memcpy/memmove/memset etc. cannot generally know the conditions under which they are being called.

As an aside, I did have a prototype RAMFS implementation that used the XScale DMA hardware to perform memory-memory transfers, and it was radically faster. Geminus uses that hardware to render large sprites to video memory, and to transfer data between graphics cards.

Regarding testing, I have all targets…well, except IGEPv5 :)
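For illustration, a minimal sketch of the SWI idea from a client’s point of view. The SWI name, number and register interface below are purely hypothetical – nothing like it exists yet – the point is just that a client asks the OS for the best routine for the current CPU and falls back to the CLib version otherwise:

    /* Purely hypothetical sketch: the SWI name/number and its register
       interface are invented for illustration only. */
    #include <stddef.h>
    #include <string.h>
    #include "kernel.h"
    #include "swis.h"

    #define XOS_ReadCopyRoutine 0    /* placeholder; no such SWI exists yet */

    typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

    static memcpy_fn fast_memcpy = memcpy;   /* safe default: CLib memcpy */

    static void init_fast_memcpy(void)
    {
        void *addr = NULL;
        /* In: R0 = routine index (0 = memcpy, say).  Out: R0 = address. */
        if (_swix(XOS_ReadCopyRoutine, _IN(0)|_OUT(0), 0, &addr) == NULL && addr)
            fast_memcpy = (memcpy_fn)addr;
    }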
Jeffrey Lee (213) 6048 posts |
Yes please!
The RAM clear is generally either done by the HAL using DMA, or by the kernel using the CPU (e.g. if the HAL has no DMA or the DMA isn’t very fast). So the HAL doesn’t need a fast memset for clearing memory, as it can just let the kernel do it. However it might be useful if the HAL had a fast memcpy for dealing with relocating the ROM image, so having the code available to the HAL is probably still a good idea.

I’m planning on having the routines built into a library, mainly to allow them to be built into both the ROM and the testbed. That could also be a convenient way of allowing the HAL to get hold of any routines it desires – so by default they’ll all be in the kernel, but if a HAL wants one of them then it can also link to the lib and use any of the HAL-safe routines that it desires.
Yes, for the moment I’m mainly interested in seeing ‘safe’ versions, so that they can be used as drop-in replacements for all the existing routines that the OS uses.
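By way of illustration only – these names are invented, not an actual ROOL interface – such a library’s header might look something like this, with the software-only ‘safe’ routines callable from the HAL and the cleverer kernel-side selectors kept separate:

    #include <stddef.h>

    /* Software-only routines: no OS services, stack-only workspace, so they
       can be linked into and called from the HAL (or interrupt handlers). */
    void *safe_memcpy (void *dst, const void *src, size_t n);
    void *safe_memmove(void *dst, const void *src, size_t n);
    void *safe_memset (void *s, int c, size_t n);

    /* Kernel-side selectors, free to pick NEON/DMA/etc. variants at runtime. */
    void *fast_memcpy (void *dst, const void *src, size_t n);
    void *fast_memset (void *s, int c, size_t n);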
Jeffrey Lee (213) 6048 posts |
I now have Ben’s code converted to objasm format and running under RISC OS (although I’ve yet to tweak the unaligned load/store handling). Unsurprisingly, both Ben’s and Adrian’s code trounce the CLib memcpy/etc. implementations in quite a few situations.

Also, I’m particularly impressed that this little trick works on RISC OS without things exploding: https://github.com/bavison/arm-mem/blob/master/memcmp.S#L214 (or at least nothing’s exploded yet… it’s entirely possible there’ll be some code somewhere which messes with the PSR and doesn’t restore the data endianness flag correctly)

Next step will be to put together my own testbed and get some initial figures for sharing with everyone here.
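The trick in question appears to be flipping the PSR’s data-endianness (E) bit so that word loads arrive big-endian, at which point a plain numeric compare of two differing words matches memcmp’s byte-by-byte ordering. A rough C equivalent of the idea (not the arm-mem code), using GCC’s byte-swap builtin instead of the E bit:

    #include <stdint.h>

    /* memcmp() ordering is decided by the first differing byte in memory.
       On a little-endian CPU that byte is the *least* significant byte of a
       loaded word, so a straight word compare gives the wrong answer;
       comparing the byte-reversed (big-endian) values gives the right one. */
    static int wordwise_cmp(uint32_t a, uint32_t b)
    {
        if (a == b)
            return 0;
        uint32_t ab = __builtin_bswap32(a);
        uint32_t bb = __builtin_bswap32(b);
        return ab < bb ? -1 : 1;
    }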
Colin (478) 2433 posts |
Does copying out of or into uncached memory require different optimisations? |
Jeffrey Lee (213) 6048 posts |
Yes. E.g. when copying out of uncached memory there’s no point using PLD (dropping it might save a cycle or two), and you’d want to keep the number of loads to a minimum (e.g. if the last word contains 3 bytes, you’d probably want to LDR it rather than issue three LDRBs). Aligning the loads to the bus burst size will also help.

For writes to uncached memory the same basic rules apply, although if the memory is bufferable (and you’re writing in increasing address order) then the write buffer will help a lot and you may find the code for writing to cacheable memory is already good enough.

That’s one of the things my testbed will be aiming to help with – try each routine on all the different memory types so we can identify which areas might be in need of improvement.
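To illustrate the last-word point, a sketch (names illustrative; it assumes the source is word-aligned and the whole containing word is safely readable):

    #include <stdint.h>
    #include <string.h>

    /* Fetch a 3-byte tail with a single word-sized bus access rather than
       three byte loads, then store the bytes from the register copy. */
    static void copy_3byte_tail(uint8_t *dst, const uint32_t *src_last_word)
    {
        uint32_t w = *src_last_word;     /* one read from uncached memory */
        memcpy(dst, &w, 3);              /* first three bytes of that word */
    }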
Alan Robertson (52) 420 posts |
Hi guys,

It sounds great that we would have targeted routines for different scenarios. I’m just curious to know what sort of benefit we’re likely to see.

As ever, great work guys.
Jeffrey Lee (213) 6048 posts |
I’m expecting to see improvements anywhere where large amounts of data are being moved in/out of buffers by the OS. Filesystem, USB, networking, some graphics/sprite ops, etc. Whether this will result in noticeable improvements is yet to be seen, but I’m hoping to see at least some kind of improvement to anything USB-related, simply because everything has to be transferred between DeviceFS buffers and the uncacheable IO buffers used by the hardware. Plus there might be another buffer which is feeding the DeviceFS buffer (e.g. filesystem data might pass through internal buffers held by FileCore/FileSwitch, network data will pass through buffers held by MbufManager, etc.)
Colin (478) 2433 posts |
Better still, bypass DeviceFS.
Chris Evans (457) 1614 posts |
Probably of no relevance to a better memcpy() but do any of the systems running RISC OS have a ‘Blitter’ or could DMA be used in some way? |
Jeffrey Lee (213) 6048 posts |
The major problem with relying on a GPU for blitting/memory transfers is that we generally don’t have any documentation for them. At the moment the GraphicsV acceleration that’s done on OMAP and Pi is done entirely using the system’s general-purpose DMA controller.

Having said that, using the system DMA controllers is a viable approach for some memory operations. In particular, the Iyonix has its AAU (application accelerator unit), which is capable of giving much better performance than the CPU.
Jeffrey Lee (213) 6048 posts |
Brief progress update to say that I now have Ben’s and Adrian’s code (and the CLib for reference) running from within my own testbed app. There’s some more work I need to do on the memcpy performance tests (mainly to stop them taking three and a half hours to run!), but once that’s done I should be able to get some initial performance stats and upload the code somewhere for other people to have a play around with.
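Not Jeffrey’s actual testbed, obviously – just a minimal sketch of the kind of timing loop such an app is built around, using the CLib clock() for simplicity:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    typedef void *(*copy_fn)(void *, const void *, size_t);

    /* Time 'iterations' copies of 'len' bytes and return throughput in MB/s. */
    static double mbytes_per_sec(copy_fn fn, void *dst, const void *src,
                                 size_t len, int iterations)
    {
        clock_t start = clock();
        for (int i = 0; i < iterations; i++)
            fn(dst, src, len);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        return ((double)len * iterations) / (secs * 1024.0 * 1024.0);
    }

    int main(void)
    {
        size_t len = 1024 * 1024;
        void *src = malloc(len), *dst = malloc(len);
        if (!src || !dst) return 1;
        memset(src, 0x55, len);
        printf("memcpy: %.1f MB/s\n", mbytes_per_sec(memcpy, dst, src, len, 200));
        free(src); free(dst);
        return 0;
    }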
Rick Murray (539) 13840 posts |
http://homepage.ntlworld.com/rik.griffin/appacc.html ← I thought he had a real domain? All Google could see was ntlworld.

I suppose a good implementation would use all of the features available on a given platform, but I can imagine the code would start to become hellish.

To look at it from another perspective… what does the Linux (NetBSD, etc.) memcpy look like?
Jeffrey Lee (213) 6048 posts |
Not that much, it would seem. The Linux kernel doesn’t appear to have much – apart from Ben’s Pi-optimised code it looks like there’s just one generic ARM implementation which is used for all ARM targets (for memcpy, at least – I haven’t bothered checking everything). glibc doesn’t seem much better either (although I can’t quite see where some of their macros are coming from).

NetBSD seems to be better, with about three different optimisations of each routine – and considering the licensing it’s probably worth investigating reusing some of their code (if I can stand converting more code from gas syntax to objasm!). Taking just the NEON code might be enough, as that would give us coverage for everything except IOMD (and it’s doubtful that any BSD maintainers care enough about IOMD to have tried to make sure their code is optimal there).
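For reference (a sketch, not the NetBSD code), the inner loop of a NEON copy routine is broadly this shape when written with GCC/Clang intrinsics – real routines add alignment handling, preloads and tail code; this assumes len is a multiple of 64 and the buffers don’t overlap:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    static void neon_copy_blocks(uint8_t *dst, const uint8_t *src, size_t len)
    {
        /* Four 128-bit loads and stores per iteration: 64 bytes at a time. */
        for (size_t i = 0; i < len; i += 64) {
            uint8x16_t q0 = vld1q_u8(src + i);
            uint8x16_t q1 = vld1q_u8(src + i + 16);
            uint8x16_t q2 = vld1q_u8(src + i + 32);
            uint8x16_t q3 = vld1q_u8(src + i + 48);
            vst1q_u8(dst + i,      q0);
            vst1q_u8(dst + i + 16, q1);
            vst1q_u8(dst + i + 32, q2);
            vst1q_u8(dst + i + 48, q3);
        }
    }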
Alan Robertson (52) 420 posts |
@Jeffrey Lee |
Jeffrey Lee (213) 6048 posts |
No other offers yet, I’m afraid. Although they may be waiting for me to actually release the testbed app.
Fred Graute (114) 645 posts |
Please keep in mind that there are a number of memcpy implementations already in various applications: ZapRedraw, StrongED, modules providing sliding heaps. I doubt any of these have been optimised for any particular architecture, but they might be useful as starting points and to compare results.

That too. The StrongED routines were written by Guttorm Vik ages ago and may not be the most optimal for current ARM CPUs; it’d be interesting to see how well they do.
Kuemmel (439) 384 posts |
@EDIT: Oops, posted that one in the wrong thread (“Wait for one bus…”)…can somebody delete it? Now here again in the right one…

…I wonder what the upper theoretical limit of throughput of a memcpy loop would be? When I wrote a small basic assembler test (256 bytes in one loop (ARM or NEON), source and destination aligned to a 256-byte boundary), it peaked while being within the 1st level cache at around 4.8 GByte/s (ARM) and 4.5 GByte/s (NEON) with my Panda (1500 MHz), and at 5.6 GByte/s (ARM) and 7.4 GByte/s (NEON) on the IGEPv5 (1500 MHz).

While googling around I didn’t find much data for an OMAP5432, but some guys tested an Exynos 5420 here. The Exynos seems quite fast at 12.5 GByte/s. Though when one looks at Wikipedia it makes total sense: it has a much faster memory interface, 14.9 GByte/s is listed (32-bit dual channel 933 MHz). So checking, for the Panda we’ve got ‘only’ 400 MHz and for an IGEP we have 533 MHz of dual-channel DDR. This would translate into 6.4 GByte/s and 8.5 GByte/s, as far as I get it.
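Spelling that arithmetic out (theoretical peaks only, using the figures quoted above – data moves on both clock edges, hence the factor of 2):

    /* peak bandwidth = 2 x DDR clock x bus width x channels */
    static double peak_gbyte_per_s(double ddr_clock_mhz, int bus_bits, int channels)
    {
        return 2.0 * ddr_clock_mhz * 1e6 * (bus_bits / 8.0) * channels / 1e9;
    }
    /* peak_gbyte_per_s(400, 32, 2) ~  6.4  (Pandaboard)
       peak_gbyte_per_s(533, 32, 2) ~  8.5  (IGEPv5 / OMAP5432)
       peak_gbyte_per_s(933, 32, 2) ~ 14.9  (Exynos 5420)  */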
Chris Hall (132) 3554 posts |
The upper limit of a memory loop (other than from the faster cache) is as follows:
But it is not as simple as that. In earlier designs there was a hardware MMU which interposed between processor and memory to do the logical to physical memory address translation. Application space was always from &8000 upwards but was mapped at different physical positions to keep each one separate and to avoid lots of memory copying as each task was paged in. Not sure how this is done in a HAL model – it seems to be more or less efficient on different platforms, making a large difference to the perceived speed of memory access…
Rick Murray (539) 13840 posts |
> …it peaked while being within the 1st level cache at around 4.8 GByte/s (ARM) … with my Panda (1500 MHz), and … 7.4 GByte/s (NEON) on the IGEPv5 (1500 MHz).

No Pi for comparison? ;-)

> In earlier designs there was a hardware MMU which interposed between processor and memory to do the logical to physical memory address translation.

There still is, that’s a feature of the ARM.

> Application space was always from &8000 upwards but was mapped at different physical positions…

It still does, that’s a feature of RISC OS.

> …to keep each one separate and to avoid lots of memory copying as each task was paged in.

Messing around with the paging was still “expensive”, that’s partly what lazy task swapping was about.