

Findings on Memory Speed on the BB

10 posts, 3 voices

Mar 20, 2011 2:42pm Kuemmel (439) 384 posts	Hi there, while playing around with ARM NEON on the BB I thought it might be worthwhile to find out whether that part of the CPU could also improve memory transfer routines, as it offers VLDM/VSTM instructions using up to 16 double word registers (Dx). One could also use quad word registers, but according to the Cortex-A8 TRM those would be assembled to Dx anyway. So I wrote a benchmark that transfers memory blocks of different sizes for the following sources and destinations:

    VRAM → VRAM
    VRAM → RAM
    RAM → VRAM
    RAM → RAM

As I had already seen with my Firebench, the BB is much faster than the StrongARM at memory transfer. So I wanted to see what transfer rates can be reached using old school STM/LDM, and compare that to the ARM NEON commands. Therefore I wrote !MemSpeed to measure the transfer speed in Mbyte/second. More details about the code are in the ReadMe file.

I compiled my findings in the following two graphs. Please note that in the first graph there’s a primary axis for the BB and a secondary axis for the StrongARM, as the StrongARM is (depending on the size of the memory block moved) about 10 to 32 times (!) slower than the BB when it comes to RAM → VRAM or RAM → RAM transfer. On the other hand, one can see that basically all transfers from VRAM → RAM/VRAM are more or less equally slow. But in the real world those transfers are not used very much anyway.

Regarding the effect of the NEON unit, it can be seen that it can be beneficial. RAM → VRAM transfer can be up to 40% faster and RAM → RAM can be up to 20% faster. Even though I couldn’t exactly reproduce the findings of similar tests on the BB from here (Link), I think it’s really worth experimenting with NEON routines in all memory-copy-intensive applications, maybe also things like a web browser.

One more hint is to use the PLD (PreLoadData) instruction. I didn’t find it beneficial in all circumstances, but from the former link and partly from my own experiments it looks like it brings an additional benefit. The parameters for PLD look like a matter of “trial and error” to me, though. As usual, any comments welcome…
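For readers without the ReadMe to hand, the kind of measuring loop described in the post above can be sketched in portable C roughly as follows. This is not the !MemSpeed code: memcpy only stands in for the LDM/STM and VLDM/VSTM loops, the names time_copy and mbytes_per_second are made up for illustration, and 1 Mbyte is assumed to mean 1048576 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Transfer rate in Mbyte/s, assuming 1 Mbyte = 1048576 bytes. */
double mbytes_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* Copy `size` bytes from src to dst `repeats` times and return the
   elapsed CPU time in seconds, as reported by clock().
   memcpy is only a stand-in for the copy routine under test. */
double time_copy(void *dst, const void *src, size_t size, unsigned repeats)
{
    clock_t start = clock();
    for (unsigned i = 0; i < repeats; i++) {
        memcpy(dst, src, size);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy for a range of block sizes and passing size × repeats together with the elapsed time to mbytes_per_second would give one curve of a graph. The repeat count has to be large enough for the elapsed time to be well above the resolution of clock(), otherwise the measured time can come out as zero.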

Mar 20, 2011 8:20pm W P Blatchley (147) 247 posts	Very interesting! Thanks for this research. The first thing that springs to my mind is screen block copies. Could they benefit from this?

Mar 20, 2011 9:22pm Kuemmel (439) 384 posts	…I think so. It’s just that, as far as I know, NEON support isn’t fully there yet on RISC OS/BB, nor in GCC (though Terje did some work to allow !ExtASM files to be included in GCC), which I suppose is what most people use for software development. But I hope this encourages people to work on NEON support, because there really seems to be a lot of benefit, and of course not mainly for memory transfer :-)

Mar 21, 2011 12:11am Jeffrey Lee (213) 6048 posts	Interesting how the NEON VRAM → VRAM copies are so much faster. Maybe it’s because each transfer is bigger, so there’s less of an impact from setting up the SDRAM accesses.

“The first thing that springs to my mind is screen block copies. Could they benefit from this?”

Screen block copies (assuming they go via GraphicsV) already use the OMAP’s DMA controller. According to some old notes of mine that gives them a transfer speed of about 120MB/s… apart from left-to-right copies, which can be as slow as 3.5MB/s! Plus the smallest transfer unit the DMA controller supports is 1 byte, so 1/2/4bpp transfers will often fall back to a software copy routine (which would get around 40MB/s).

So there are a few things that could do with happening:

- Fix the left-to-right copies. There are a couple of ways of doing this, but the easiest would obviously be to make the OMAPVideo module refuse to handle them so that the kernel resorts to doing a software copy instead.
- Modify the kernel’s software copy/fill routines to use NEON.
- Add RAM-to-RAM DMA transfer support to DMAManager, so that programs can opt to use that for transferring large amounts of data.
- Do more performance tests to work out how big a transfer has to be for it to be worth setting up a DMA transfer. These would also have to take into account whether any of the pages are cached, because cache maintenance could add quite an overhead.
- Experiment with making screen memory cached.

Any volunteers? ;)
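The question of how big a transfer has to be before DMA pays off can be illustrated with a simple model: software copy starts immediately but runs at the lower rate, while DMA pays a fixed setup cost first. The sketch below is only that model in C. The function names are hypothetical, the setup cost is a parameter that would have to be measured (no figure for it is given in this thread), 1 MB is taken as 1000000 bytes, and cache maintenance is ignored.

```c
#include <assert.h>

/* Seconds needed to move `bytes` at `mb_per_s` after a fixed setup cost.
   1 MB is taken as 1e6 bytes in this sketch. */
double transfer_seconds(double bytes, double mb_per_s, double setup_seconds)
{
    assert(mb_per_s > 0.0);
    return setup_seconds + bytes / (mb_per_s * 1.0e6);
}

/* Size in bytes at which a DMA transfer (with setup cost) takes exactly
   as long as a software copy (no setup cost). Above this size DMA wins.
   Returns a negative value if DMA is not faster than software at all. */
double break_even_bytes(double dma_mb_per_s, double sw_mb_per_s,
                        double dma_setup_seconds)
{
    assert(dma_mb_per_s > 0.0 && sw_mb_per_s > 0.0);
    if (dma_mb_per_s <= sw_mb_per_s) {
        return -1.0;
    }
    /* bytes/sw = setup + bytes/dma, solved for bytes */
    return dma_setup_seconds * 1.0e6 * dma_mb_per_s * sw_mb_per_s
           / (dma_mb_per_s - sw_mb_per_s);
}
```

With the 120MB/s and 40MB/s figures quoted above and a purely hypothetical setup cost of 10 microseconds, the model puts the break-even point at 600 bytes. With the 3.5MB/s left-to-right figure the function returns a negative value, which matches the suggestion that those copies are better left to software.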

Mar 22, 2011 5:54pm Kuemmel (439) 384 posts	I would like to help, it’s just that I don’t have a clue about OS programming… but if someone gives me a piece of assembler to convert/optimise with NEON, I’m glad to give it a try… is the kernel software in C or ASM? About the DMA for screen block copy: how can this be used, is there a software example somewhere?

Mar 22, 2011 11:57pm Jeffrey Lee (213) 6048 posts	“are the kernel software in C or ASM ?”

The kernel is assembler.

“About the DMA for screen block copy, how can this be used, is there any software example somewhere ?”

There’s some basic documentation for GraphicsV here. You’re interested in reason code 13 (render). Here’s a quick example:

    DIM block% 24
    block%!0 = src_left%
    block%!4 = src_bottom%
    block%!8 = dest_left%
    block%!12 = dest_bottom%
    block%!16 = width%-1
    block%!20 = height%-1
    SYS "OS_CallAVector",1,1,block%,,13,,,,,&2A

Note that if hardware acceleration isn’t available, nothing will happen; it’s down to the caller to implement their own fallback routine.

For kernel code which makes use of GraphicsV 13, take a look at:

- The RectangleFill routine in Kernel.s.vdu.grafa (although that obviously performs a fill, not a copy!)
- BlockCopyMove in Kernel.s.vdu.vdugrafd
- FastCLS in s.vdu.vduwrch
- TryCopyScreenUp in s.vdu.vduwrch

RectangleFill and BlockCopyMove are probably the only two worth looking at – unless you’re redirecting to sprite it would be highly unlikely for FastCLS or TryCopyScreenUp to fail to use DMA.

Good luck! :)

Mar 23, 2011 12:02am Jeffrey Lee (213) 6048 posts	Also note that when calling GraphicsV, 0,0 is the bottom-left of the screen, the units are in pixels (not OS units), and it’s expected that the rectangle(s) have been properly clipped to the screen size.
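Since the caller is expected to clip the rectangles and work in pixels from the bottom-left, a helper that prepares the six-word block from the BASIC example might look like the C sketch below. The struct and function names are hypothetical, and the clipping policy (trim the copy until both the source and the destination rectangle lie on screen) is an assumption, not something taken from the kernel sources. The sketch only fills in the block; it does not call GraphicsV.

```c
#include <assert.h>
#include <stdint.h>

/* The six words of the parameter block, in the order used by the
   BASIC example. All values are in pixels, origin at bottom-left. */
typedef struct {
    int32_t src_left;
    int32_t src_bottom;
    int32_t dest_left;
    int32_t dest_bottom;
    int32_t width_minus_1;
    int32_t height_minus_1;
} copy_block;

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Clip a copy so that source and destination both lie inside a screen
   of screen_w x screen_h pixels, then fill in *out.
   Returns 1 if something is left to copy, 0 if the copy is empty. */
int make_copy_block(int src_left, int src_bottom,
                    int dest_left, int dest_bottom,
                    int width, int height,
                    int screen_w, int screen_h, copy_block *out)
{
    assert(out != 0 && screen_w > 0 && screen_h > 0);

    /* Trim whatever hangs over the left and bottom edges. */
    int cut_x = max3(0, -src_left, -dest_left);
    int cut_y = max3(0, -src_bottom, -dest_bottom);
    src_left += cut_x;   dest_left += cut_x;   width -= cut_x;
    src_bottom += cut_y; dest_bottom += cut_y; height -= cut_y;

    /* Trim whatever hangs over the right and top edges. */
    width = min3(width, screen_w - src_left, screen_w - dest_left);
    height = min3(height, screen_h - src_bottom, screen_h - dest_bottom);
    if (width <= 0 || height <= 0) {
        return 0;
    }

    out->src_left = src_left;
    out->src_bottom = src_bottom;
    out->dest_left = dest_left;
    out->dest_bottom = dest_bottom;
    out->width_minus_1 = width - 1;   /* the block holds width-1 */
    out->height_minus_1 = height - 1; /* and height-1 */
    return 1;
}
```

A return value of 0 would mean there is nothing to pass on to GraphicsV at all; a return value of 1 leaves a 24-byte block laid out like block% in the example.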

Mar 27, 2011 9:40pm Kuemmel (439) 384 posts	Thanks for all the insight! I finally looked a bit into ‘vdugrafd’ and ‘grafa’. Regarding BlockCopyMove, I found it strange that it seems to use either a 7 word copy or a 1 word copy routine as its workhorse. I always thought a 4 or 8 word copy is faster. Also there’s a command like “ShiftR R6,R7,R3,R4”… did I miss something, or what kind of instruction is that?

Anyway, if (you ;-) (?)) want to give it a try, it should be straightforward to change the 7 word copy routines to an 8 word VLDMIA Rx!,{D0-D3}/VSTMIA Rx!,{D0-D3} and adjust the counter from 7 to 8… but I guess it’s still a problem to find an assembler/compiler to assemble the kernel with NEON, as only !ExtASM can do it!?

Of course I don’t see much of a point if GraphicsV is used anyway. The question for me is whether, for example, any web browser or graphics software uses any of these routines, or whether they copy lots of memory ‘by hand’, ähh, ‘arm’ ;-) with their own code… in that case the benefit could only be had by adjusting the application code itself for NEON or GraphicsV.

Mar 27, 2011 10:23pm Jeffrey Lee (213) 6048 posts	“Regarding BlockCopyMove I found it strange that it seems to use either a 7 word copy or a 1 word copy routine as a workhorse. I always thought an 4 or 8 word thing is faster.”

Yeah, that does seem a bit odd. Maybe they just never put much effort into optimising it.

“Also there’s a command called like “ShiftR R6,R7,R3,R4”…did I miss something, or what is that kind of instruction ?”

That’s a macro – see Kernel.s.vdu.vdudecl for the definition.

“Anyway if (you ;-) (?)) want to give it a try it should be straight forward to change the 7 word copy routines to a 8 word VLDMIA Rx!,{D0-D3}/VSTMIA Rx!,{D0-D3} and adjusting the counter from 7 to 8….but I guess it’s still a problem finding an Assembler/Compiler to assemble the kernel with NEON as there’s only !ExtASM doing it !?”

I think the best way to test out any changes would be to create a testbed program containing a copy of the kernel code and a copy of the new NEON-enhanced code. That way the interested people (i.e. you ;-)) can easily debug and optimise the code without worrying about objasm’s lack of NEON support, or having to build new ROM images all the time.

“Of course I see not so much of a point if GraphicsV is used anyway. The question for me is for example if any web browser or graphics software uses any of these routines or they copy lots of memory ‘by hand’ ähh ‘arm’ ;-) with their own code…then the benefit could be only made by adjusting the application code itself for NEON or GraphicsV.”

I suspect that most desktop software (or at least all the well-written stuff) simply uses Wimp_BlockCopy when copying/moving/scrolling/etc. Since Wimp_BlockCopy relies on OS_Plot 190 to do all the copying/moving, this means it should end up using GraphicsV whenever possible.
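The testbed idea could be structured along the lines of the C sketch below: a reference routine, a candidate routine, and a harness that runs both over many lengths and compares the results. Everything here is a stand-in. The two copy functions are plain C, not the kernel’s assembler or any NEON code, and all names are hypothetical; the point is only the shape of the harness, including a guard area after the copied data so that overruns show up as mismatches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*copy_fn)(uint32_t *dst, const uint32_t *src, size_t words);

/* Reference routine: one word per iteration. */
void copy_1word(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words--) {
        *dst++ = *src++;
    }
}

/* Candidate routine: eight words per iteration (the amount that
   D0-D3 would hold), then a one-word loop for the remainder. */
void copy_8word(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words >= 8) {
        memcpy(dst, src, 8 * sizeof *dst);
        dst += 8;
        src += 8;
        words -= 8;
    }
    while (words--) {
        *dst++ = *src++;
    }
}

/* Run both routines for every length from 0 to max_words and return
   the number of lengths for which the destination buffers differ.
   The buffers are pre-filled with a marker and compared in full, so
   writing past the requested length also counts as a mismatch. */
size_t count_mismatches(copy_fn reference, copy_fn candidate,
                        size_t max_words)
{
    enum { LIMIT = 256 };
    static uint32_t src[LIMIT], want[LIMIT + 1], got[LIMIT + 1];
    size_t bad = 0;

    assert(max_words <= LIMIT);
    for (size_t i = 0; i < LIMIT; i++) {
        src[i] = (uint32_t)(i * 2654435761u); /* arbitrary test pattern */
    }
    for (size_t n = 0; n <= max_words; n++) {
        memset(want, 0xAA, sizeof want);
        memset(got, 0xAA, sizeof got);
        reference(want, src, n);
        candidate(got, src, n);
        if (memcmp(want, got, sizeof want) != 0) {
            bad++;
        }
    }
    return bad;
}
```

A candidate is acceptable when count_mismatches returns 0 for it against the reference; timing the two routines would then be a separate step once correctness is established.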

Mar 28, 2011 8:42pm Kuemmel (439) 384 posts	Hm, if most software uses Wimp_BlockCopy it’s a little discouraging… as that seems to be hardware accelerated by GraphicsV anyway. It’s just that I thought RAM to VRAM transfer is also used a lot by desktop software like browsers and image processing/drawing software, and that would benefit quite well from NEON. Or is there also some hardware accelerated routine for this kind of memory transfer? In other words, is there a DMA copy routine from RAM to VRAM?
