Building a better memcpy()
Adrian Lees (1349) 122 posts |
FWIW, there are XScale-optimised implementations of memcpy/memmove (and memset, strcpy and strlen). They were written for Aemulor’s internal use (it cannot use the SharedCLibrary, fairly obviously) and some/all were later incorporated into UnixLib. There’s also a test rig for checking correctness and measuring performance. I can make all this available if it’s helpful. These routines are likely inappropriate for ARM’s own CPU implementations, being built for the long pipeline of the XScale, its LDM/STM decoding overhead and its high memory latency.

I think the addresses of the appropriate routines should be returned via an OS SWI call. Obviously the SharedCLibrary can remember these addresses when it starts up and avoid extra indirection by directly patching the stubs in calling applications. The OS should probably get some/all of these addresses from the HAL, since I’m reminded that one issue in the boot time of RISC OS is the cost of zero-initialising the memory.

Overlapping does make a difference to DMA implementations, more so than to software implementations, because the DMA engine will usually have a larger internal buffer and may cope only with transfers to/from ascending addresses.

Also consider reentrancy. Interrupt handlers should not be performing large transfers, but callback handlers may; I don’t know the internal details of ShareFS but that must surely be running in callbacks, and of course TaskWindow could preempt an operation that is running in USR mode.

A comment regarding the use of hardware acceleration: it would be good to have available some fast ‘safe’ routines that can be relied upon not to optionally exploit accelerators, not just ones that automatically employ the fastest method. By ‘safe’ I mean ones that use only the stack as temporary storage, are software only, and will not potentially run into issues with caching/interrupts/reentrancy etc. memcpy/memmove/memset etc. cannot generally know the conditions under which they are being called.

As an aside, I did have a prototype RAMFS implementation that used the XScale DMA hardware to perform memory-memory transfers, and it was radically faster. Geminus uses that hardware to render large sprites to video memory, and to transfer data between graphics cards.

Regarding testing, I have all targets…well, except IGEPv5 :)
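For illustration, a minimal sketch of the SWI idea from a client’s point of view. The SWI name, number and register interface below are purely hypothetical – nothing like it exists yet – the point is just that a client asks the OS for the best routine for the current CPU and falls back to the CLib version otherwise:

    /* Purely hypothetical sketch: the SWI name/number and its register
       interface are invented for illustration only. */
    #include <stddef.h>
    #include <string.h>
    #include "kernel.h"
    #include "swis.h"

    #define XOS_ReadCopyRoutine 0    /* placeholder; no such SWI exists yet */

    typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

    static memcpy_fn fast_memcpy = memcpy;   /* safe default: CLib memcpy */

    static void init_fast_memcpy(void)
    {
        void *addr = NULL;
        /* In: R0 = routine index (0 = memcpy, say).  Out: R0 = address. */
        if (_swix(XOS_ReadCopyRoutine, _IN(0)|_OUT(0), 0, &addr) == NULL && addr)
            fast_memcpy = (memcpy_fn)addr;
    }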
Jeffrey Lee (213) 6048 posts |
Yes please!
The RAM clear is generally either done by the HAL using DMA, or by the kernel using the CPU (e.g. if the HAL has no DMA or the DMA isn’t very fast). So the HAL doesn’t need a fast memset for clearing memory, as it can just let the kernel do it. However it might be useful if the HAL had a fast memcpy for dealing with relocating the ROM image, so having the code available to the HAL is probably still a good idea.

I’m planning on having the routines built into a library, mainly to allow them to be built into both the ROM and the testbed. That could also be a convenient way of allowing the HAL to get hold of any routines it desires – so by default they’ll all be in the kernel, but if a HAL wants one of them then it can also link to the lib and use any of the HAL-safe routines that it desires.
Yes, for the moment I’m mainly interested in seeing ‘safe’ versions, so that they can be used as drop-in replacements for all the existing routines that the OS uses.
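By way of illustration only – these names are invented, not an actual ROOL interface – such a library’s header might look something like this, with the software-only ‘safe’ routines callable from the HAL and the cleverer kernel-side selectors kept separate:

    #include <stddef.h>

    /* Software-only routines: no OS services, stack-only workspace, so they
       can be linked into and called from the HAL (or interrupt handlers). */
    void *safe_memcpy (void *dst, const void *src, size_t n);
    void *safe_memmove(void *dst, const void *src, size_t n);
    void *safe_memset (void *s, int c, size_t n);

    /* Kernel-side selectors, free to pick NEON/DMA/etc. variants at runtime. */
    void *fast_memcpy (void *dst, const void *src, size_t n);
    void *fast_memset (void *s, int c, size_t n);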
Jeffrey Lee (213) 6048 posts |
I now have Ben’s code converted to objasm format and running under RISC OS (although I’ve yet to tweak the unaligned load/store handling). Unsurprisingly, both Ben’s and Adrian’s code trounce the CLib memcpy/etc. implementations in quite a few situations.

Also, I’m particularly impressed that this little trick works on RISC OS without things exploding: https://github.com/bavison/arm-mem/blob/master/memcmp.S#L214 (or at least nothing’s exploded yet… it’s entirely possible there’ll be some code somewhere which messes with the PSR and doesn’t restore the data endianness flag correctly)

Next step will be to put together my own testbed and get some initial figures for sharing with everyone here.
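The trick in question appears to be flipping the PSR’s data-endianness (E) bit so that word loads arrive big-endian, at which point a plain numeric compare of two differing words matches memcmp’s byte-by-byte ordering. A rough C equivalent of the idea (not the arm-mem code), using GCC’s byte-swap builtin instead of the E bit:

    #include <stdint.h>

    /* memcmp() ordering is decided by the first differing byte in memory.
       On a little-endian CPU that byte is the *least* significant byte of a
       loaded word, so a straight word compare gives the wrong answer;
       comparing the byte-reversed (big-endian) values gives the right one. */
    static int wordwise_cmp(uint32_t a, uint32_t b)
    {
        if (a == b)
            return 0;
        uint32_t ab = __builtin_bswap32(a);
        uint32_t bb = __builtin_bswap32(b);
        return ab < bb ? -1 : 1;
    }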
Colin (478) 2433 posts |
Does copying out of or into uncached memory require different optimisations? |
Jeffrey Lee (213) 6048 posts |
Yes. E.g. when copying out of uncached memory there’s no point using PLD (dropping it might save a cycle or two), and you’d want to keep the number of loads to a minimum (e.g. if the last word contains 3 bytes, you’d probably want to LDR it rather than issue three LDRBs). Aligning the loads to the bus burst size will also help.

For writes to uncached memory the same basic rules apply, although if the memory is bufferable (and you’re writing in increasing address order) then the write buffer will help a lot and you may find the code for writing to cacheable memory is already good enough.

That’s one of the things my testbed will be aiming to help with – try each routine on all the different memory types so we can identify which areas might be in need of improvement.
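To illustrate the last-word point, a sketch (names illustrative; it assumes the source is word-aligned and the whole containing word is safely readable):

    #include <stdint.h>
    #include <string.h>

    /* Fetch a 3-byte tail with a single word-sized bus access rather than
       three byte loads, then store the bytes from the register copy. */
    static void copy_3byte_tail(uint8_t *dst, const uint32_t *src_last_word)
    {
        uint32_t w = *src_last_word;     /* one read from uncached memory */
        memcpy(dst, &w, 3);              /* first three bytes of that word */
    }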
Alan Robertson (52) 420 posts |
Hi guys,

It sounds great that we would have targeted routines for different scenarios. I’m just curious to know what sort of benefit we’re likely to see.

As ever, great work guys.
Jeffrey Lee (213) 6048 posts |
I’m expecting to see improvements anywhere where large amounts of data are being moved in/out of buffers by the OS. Filesystem, USB, networking, some graphics/sprite ops, etc. Whether this will result in noticeable improvements is yet to be seen, but I’m hoping to see at least some kind of improvement to anything USB-related, simply because everything has to be transferred between DeviceFS buffers and the uncacheable IO buffers used by the hardware. Plus there might be another buffer which is feeding the DeviceFS buffer (e.g. filesystem data might pass through internal buffers held by FileCore/FileSwitch, network data will pass through buffers held by MbufManager, etc.)
Colin (478) 2433 posts |
Better still, bypass DeviceFS.
Chris Evans (457) 1614 posts |
Probably of no relevance to a better memcpy() but do any of the systems running RISC OS have a ‘Blitter’ or could DMA be used in some way? |
Jeffrey Lee (213) 6048 posts |
The major problem with relying on a GPU for blitting/memory transfers is that we generally don’t have any documentation for them. At the moment the GraphicsV acceleration that’s done on OMAP and Pi is done entirely using the system’s general-purpose DMA controller.

Having said that, using the system DMA controllers is a viable approach for some memory operations. In particular, the Iyonix has its AAU (application accelerator unit), which is capable of giving much better performance than the CPU.
Jeffrey Lee (213) 6048 posts |
Brief progress update to say that I now have Ben’s and Adrian’s code (and the CLib for reference) running from within my own testbed app. There’s some more work I need to do on the memcpy performance tests (mainly to stop them taking three and a half hours to run!), but once that’s done I should be able to get some initial performance stats and upload the code somewhere for other people to have a play around with.
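Not Jeffrey’s actual testbed, obviously – just a minimal sketch of the kind of timing loop such an app is built around, using the CLib clock() for simplicity:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    typedef void *(*copy_fn)(void *, const void *, size_t);

    /* Time 'iterations' copies of 'len' bytes and return throughput in MB/s. */
    static double mbytes_per_sec(copy_fn fn, void *dst, const void *src,
                                 size_t len, int iterations)
    {
        clock_t start = clock();
        for (int i = 0; i < iterations; i++)
            fn(dst, src, len);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        return ((double)len * iterations) / (secs * 1024.0 * 1024.0);
    }

    int main(void)
    {
        size_t len = 1024 * 1024;
        void *src = malloc(len), *dst = malloc(len);
        if (!src || !dst) return 1;
        memset(src, 0x55, len);
        printf("memcpy: %.1f MB/s\n", mbytes_per_sec(memcpy, dst, src, len, 200));
        free(src); free(dst);
        return 0;
    }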
Rick Murray (539) 13840 posts |
http://homepage.ntlworld.com/rik.griffin/appacc.html ← I thought he had a real domain? All Google could see was ntlworld.

I suppose a good implementation would use all of the features available on a given platform, but I can imagine the code would start to become hellish.

To look at it from another perspective… what does the Linux (NetBSD, etc.) memcpy look like?
Jeffrey Lee (213) 6048 posts |
Not that much, it would seem. The Linux kernel doesn’t appear to have much – apart from Ben’s Pi-optimised code it looks like there’s just one generic ARM implementation which is used for all ARM targets (for memcpy, at least – I haven’t bothered checking everything). glibc doesn’t seem much better either (although I can’t quite see where some of their macros are coming from).

NetBSD seems to be better, with about three different optimisations of each routine – and considering the licensing it’s probably worth investigating reusing some of their code (if I can stand converting more code from gas syntax to objasm!). Taking just the NEON code might be enough, as that would give us coverage for everything except IOMD (and it’s doubtful that any BSD maintainers care enough about IOMD to have tried to make sure their code is optimal there).
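For reference (a sketch, not the NetBSD code), the inner loop of a NEON copy routine is broadly this shape when written with GCC/Clang intrinsics – real routines add alignment handling, preloads and tail code; this assumes len is a multiple of 64 and the buffers don’t overlap:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    static void neon_copy_blocks(uint8_t *dst, const uint8_t *src, size_t len)
    {
        /* Four 128-bit loads and stores per iteration: 64 bytes at a time. */
        for (size_t i = 0; i < len; i += 64) {
            uint8x16_t q0 = vld1q_u8(src + i);
            uint8x16_t q1 = vld1q_u8(src + i + 16);
            uint8x16_t q2 = vld1q_u8(src + i + 32);
            uint8x16_t q3 = vld1q_u8(src + i + 48);
            vst1q_u8(dst + i,      q0);
            vst1q_u8(dst + i + 16, q1);
            vst1q_u8(dst + i + 32, q2);
            vst1q_u8(dst + i + 48, q3);
        }
    }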
Alan Robertson (52) 420 posts |
@Jeffrey Lee |
Jeffrey Lee (213) 6048 posts |
No other offers yet, I’m afraid. Although they may be waiting for me to actually release the testbed app.
Fred Graute (114) 645 posts |
Please keep in mind that there are a number of memcpy implementations already in various applications: ZapRedraw, StrongED, modules providing sliding heaps. I doubt any of these have been optimised for any particular architecture, but they might be useful as starting points and to compare results.

That too. The StrongED routines were written by Guttorm Vik ages ago and may not be the most optimal for current ARM CPUs; it’d be interesting to see how well they do.
Kuemmel (439) 384 posts |
@EDIT: Oops, posted that one in the wrong thread (“Wait for one bus…”)…can somebody delete it? Now here again in the right one…

…I wonder what the upper theoretical limit of throughput of a memcpy loop would be? When I wrote a small basic assembler test (256 bytes in one loop (ARM or NEON), source and destination aligned to a 256-byte boundary), it peaked while being within the 1st level cache at around 4.8 GByte/s (ARM) and 4.5 GByte/s (NEON) with my Panda (1500 MHz), and at 5.6 GByte/s (ARM) and 7.4 GByte/s (NEON) on the IGEPv5 (1500 MHz).

While googling around I didn’t find much data for an OMAP5432, but some guys tested an Exynos 5420 here. The Exynos seems quite fast at 12.5 GByte/s. Though when one looks at Wikipedia it makes total sense: it has a much faster memory interface, 14.9 GByte/s is listed (32-bit dual channel 933 MHz). So checking, for the Panda we’ve got ‘only’ 400 MHz and for an IGEP we have 533 MHz of dual-channel DDR. This would translate into 6.4 GByte/s and 8.5 GByte/s, as far as I get it.
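Spelling that arithmetic out (theoretical peaks only, using the figures quoted above – data moves on both clock edges, hence the factor of 2):

    /* peak bandwidth = 2 x DDR clock x bus width x channels */
    static double peak_gbyte_per_s(double ddr_clock_mhz, int bus_bits, int channels)
    {
        return 2.0 * ddr_clock_mhz * 1e6 * (bus_bits / 8.0) * channels / 1e9;
    }
    /* peak_gbyte_per_s(400, 32, 2) ~  6.4  (Pandaboard)
       peak_gbyte_per_s(533, 32, 2) ~  8.5  (IGEPv5 / OMAP5432)
       peak_gbyte_per_s(933, 32, 2) ~ 14.9  (Exynos 5420)  */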
Chris Hall (132) 3554 posts |
The upper limit of a memory loop (other than from the faster cache) is as follows:
But it is not as simple as that. In earlier designs there was a hardware MMU which interposed between processor and memory to do the logical to physical memory address translation. Application space was always from &8000 upwards but was mapped at different physical positions to keep each one separate and to avoid lots of memory copying as each task was paged in. Not sure how this is done in a HAL model – it seems to be more or less efficient on different platforms, making a large difference to the perceived speed of memory access…
Rick Murray (539) 13840 posts |
> …it peaked while being within the 1st level cache at around 4.8 GByte/s (ARM) … with my Panda (1500 MHz), and … 7.4 GByte/s (NEON) on the IGEPv5 (1500 MHz).

No Pi for comparison? ;-)

> In earlier designs there was a hardware MMU which interposed between processor and memory to do the logical to physical memory address translation.

There still is, that’s a feature of the ARM.

> Application space was always from &8000 upwards but was mapped at different physical positions…

It still does, that’s a feature of RISC OS.

> …to keep each one separate and to avoid lots of memory copying as each task was paged in.

Messing around with the paging was still “expensive”, that’s partly what lazy task swapping was about.