Fun with NEON
Jeffrey Lee (213) 6048 posts |
Here are a few NEON routines I’ve been playing around with today. I haven’t tested any of them yet (that’s where the fun part is ;-)), but I figured they’d be of interest to the wider community. I wrote them with a view to using them in ArcEm, but it wouldn’t surprise me if some of them (particularly the sound code) could be reused in RISC OS.

8bpp screen blitting

Code to convert pixels from 8bpp palettised to 16bpp, using the VIDC1/2 8bpp pixel format. The naive way of handling this would be to just use a 256-entry lookup table to directly convert each source pixel to a 16 bit colour. But by using the VTBL instruction to index two 16-entry lookup tables held in registers we can come up with a smaller and faster solution that won’t strain the cache/memory bus:

    ; Note '...' is used to imply a repeated pattern,
    ; i.e. VAND with &0F in each byte of the immediate constant
    VLDMIA    in!,{D4-D5}
    VSHR.U8   Q0,Q2,#4
    VAND      Q2,Q2,#&0F....      ; get pal indices
    VAND      Q0,Q0,#&0F....      ; get rgb bits
    VTBL.8    D6,{D24-D25},D4     ; high palette
    VTBL.8    D7,{D24-D25},D5
    VTBL.8    D2,{D26-D27},D0     ; high rgb bits
    VTBL.8    D3,{D26-D27},D1
    VTBL.8    D4,{D28-D29},D4     ; low palette
    VTBL.8    D5,{D28-D29},D5
    VTBL.8    D0,{D30-D31},D0     ; low rgb bits
    VTBL.8    D1,{D30-D31},D1
    VZIP.8    Q2,Q3               ; Q2 = pixels 0-7, Q3 = 8-15
    VZIP.8    Q0,Q1
    VORR      Q0,Q0,Q2
    VORR      Q1,Q1,Q3
    VSTMIA    out!,{D0-D3}

That’s 16 pixels converted in 17 instructions, with plenty of registers spare to allow several instances of the code to be interleaved (to hide any pipeline stalls). Determining the exact format of the lookup tables (D24-D31) is left as an exercise to the reader. Unfortunately this approach is only useful for VIDC1/2 8bpp modes; for fully palettised ones like VIDC20 uses, code would still have to use a palette/lookup table held in memory :-(

4bpp screen blitting

A basic variation of the above code; 16 input bytes are converted to 32 output pixels (64 bytes @ 16bpp). This one is also 17 instructions long.
    VLDMIA    in!,{D0-D1}
    VSHR.U8   Q1,Q0,#4
    VAND      Q0,Q0,#&0F....      ; even pixels
    VAND      Q1,Q1,#&0F....      ; odd pixels
    VTBL.8    D4,{D28-D29},D0     ; high even
    VTBL.8    D5,{D28-D29},D1
    VTBL.8    D6,{D28-D29},D2     ; high odd
    VTBL.8    D7,{D28-D29},D3
    VTBL.8    D0,{D30-D31},D0     ; low even
    VTBL.8    D1,{D30-D31},D1
    VTBL.8    D2,{D30-D31},D2     ; low odd
    VTBL.8    D3,{D30-D31},D3
    VZIP.8    Q0,Q2               ; Q0 = 0,2,4,6,8,10,12,14, Q2 = 16,18,20,22,24,26,28,30
    VZIP.8    Q1,Q3               ; Q1 = 1,3,5,7,9,11,13,15, Q3 = 17,19,21,23,25,27,29,31
    VZIP.16   Q0,Q1               ; Q0 = 0-7, Q1 = 8-15
    VZIP.16   Q2,Q3               ; Q2 = 16-23, Q3 = 24-31
    VSTMIA    out!,{D0-D7}

2bpp screen blitting

Although you could load and convert 16 bytes of data at once, you’d end up needing 16 doubleword registers to store the results. This would cause difficulty with interleaving multiple instances of the code, since some registers are also needed to store the lookup tables. So instead, this routine loads 8 bytes at once:

    VLDMIA    in!,{D0}
    VSHR.U8   D2,D0,#4
    VSHR.U8   D4,D0,#2
    VSHR.U8   D6,D0,#6
    VAND      D0,D0,#&03...
    VAND      D2,D2,#&03...
    VAND      D4,D4,#&03...
    VAND      D6,D6,#&03...
    VTBL.8    D1,{D30},D0         ; high 00
    VTBL.8    D3,{D30},D2         ; high 10
    VTBL.8    D5,{D30},D4         ; high 01
    VTBL.8    D7,{D30},D6         ; high 11
    VTBL.8    D0,{D31},D0         ; low 00
    VTBL.8    D2,{D31},D2         ; low 10
    VTBL.8    D4,{D31},D4         ; low 01
    VTBL.8    D6,{D31},D6         ; low 11
    VZIP.8    D0,D1               ; Q0 = 0,4,8,12,16,20,24,28
    VZIP.8    D2,D3               ; Q1 = 2,6,10,14,18,22,26,30
    VZIP.8    D4,D5               ; Q2 = 1,5,9,13,17,21,25,29
    VZIP.8    D6,D7               ; Q3 = 3,7,11,15,19,23,27,31
    VZIP.16   Q0,Q2               ; Q0 = 0,1,4,5,8,9,12,13, Q2 = 16,17,20,21,24,25,28,29
    VZIP.16   Q1,Q3               ; Q1 = 2,3,6,7,10,11,14,15, Q3 = 18,19,22,23,26,27,30,31
    VZIP.32   Q0,Q1               ; Q0 = 0-7, Q1 = 8-15
    VZIP.32   Q2,Q3               ; Q2 = 16-23, Q3 = 24-31
    VSTMIA    out!,{D0-D7}

That’s 25 instructions per 8 input bytes (32 pixels).

1bpp screen blitting

Despite this one’s size, it looks like it’s still a lot quicker than an ARM equivalent. Since it loads 8 bytes at once it needs 16 doublewords to store the output, so you might have some trouble interleaving it.
    VLDMIA    in!,{D0}
    VSHR.U8   D1,D0,#1
    VSHR.U8   Q1,Q0,#2
    VSHR.U8   Q2,Q0,#4
    VSHR.U8   Q3,Q0,#6
    VAND      Q0,Q0,#&01...
    VAND      Q1,Q1,#&01...
    VAND      Q2,Q2,#&01...
    VAND      Q3,Q3,#&01...
    VTBL.8    D8,{D30},D0         ; high bytes 000
    VTBL.8    D9,{D30},D1         ; 001
    VTBL.8    D10,{D30},D2        ; 010
    VTBL.8    D11,{D30},D3        ; 011
    VTBL.8    D12,{D30},D4        ; 100
    VTBL.8    D13,{D30},D5        ; 101
    VTBL.8    D14,{D30},D6        ; 110
    VTBL.8    D15,{D30},D7        ; 111
    VTBL.8    D0,{D31},D0         ; low bytes
    VTBL.8    D1,{D31},D1
    VTBL.8    D2,{D31},D2
    VTBL.8    D3,{D31},D3
    VTBL.8    D4,{D31},D4
    VTBL.8    D5,{D31},D5
    VTBL.8    D6,{D31},D6
    VTBL.8    D7,{D31},D7
    VZIP.8    Q0,Q4               ; Q0 = 0,8,16,24,32,40,48,56, Q4 = 1,9,17,25,33,41,49,57
    VZIP.8    Q1,Q5               ; Q1 = 2,10,18,26,34,42,50,58, Q5 = 3,11,19,27,35,43,51,59
    VZIP.8    Q2,Q6               ; Q2 = 4,12,20,28,36,44,52,60, Q6 = 5,13,21,29,37,45,53,61
    VZIP.8    Q3,Q7               ; Q3 = 6,14,22,30,38,46,54,62, Q7 = 7,15,23,31,39,47,55,63
    VZIP.16   Q0,Q4               ; Q0 = 0,1,8,9,16,17,24,25, Q4 = 32,33,40,41,48,49,56,57
    VZIP.16   Q1,Q5               ; Q1 = 2,3,10,11,18,19,26,27, Q5 = 34,35,42,43,50,51,58,59
    VZIP.16   Q2,Q6               ; Q2 = 4,5,12,13,20,21,28,29, Q6 = 36,37,44,45,52,53,60,61
    VZIP.16   Q3,Q7               ; Q3 = 6,7,14,15,22,23,30,31, Q7 = 38,39,46,47,54,55,62,63
    VZIP.32   Q0,Q1               ; Q0 = 0-3,8-11, Q1 = 16-19,24-27
    VZIP.32   Q4,Q5               ; Q4 = 32-35,40-43, Q5 = 48-51,56-59
    VZIP.32   Q2,Q3               ; Q2 = 4-7,12-15, Q3 = 20-23,28-31
    VZIP.32   Q6,Q7               ; Q6 = 36-39,44-47, Q7 = 52-55,60-63
    VSWP      D1,D4               ; Q0 = 0-7, Q2 = 8-15
    VSWP      D3,D6               ; Q1 = 16-23, Q3 = 24-31
    VSWP      D9,D12              ; Q4 = 32-39, Q6 = 40-47
    VSWP      D11,D14             ; Q5 = 48-55, Q7 = 56-63
    VSWP      Q1,Q2
    VSWP      Q5,Q6
    VSTMIA    out!,{D0-D15}

That’s 8 input bytes (64 pixels) in 44 instructions.

VIDC1/2 8-bit audio

Some snippets of code that can (hopefully) be used to convert the contents of the 8-bit sound buffer into a standard 16-bit stereo buffer. Log-to-lin conversion is done using arithmetic rather than lookup tables. However, for simplicity, a lookup table/vector is used to scale the resulting values by the stereo offset and channel count scale factor (and any other output scale factor you desire). This code assumes that:
Here’s the main chunk of the code, to load 16 bytes and convert them to 16 pairs of scaled 16bit samples. Note the conditional at the start to select between VIDC1 & VIDC2 format data.

    [ VIDC1
    VLDMIA    in!,{D0-D1}         ; 16 bytes
    | ; VIDC2
    VLDMIA    in!,{D2-D3}         ; 16 bytes
    VSHR.U8   Q0,Q1,#1
    VSLI.8    Q0,Q1,#7
    ]
    VMOVL.S8  Q1,D1               ; 8->16
    VMOVL.S8  Q0,D0
    VSHR.U16  Q3,Q1,#4
    VAND      Q3,Q3,#&0007...     ; get chords
    VSHR.U16  Q2,Q0,#4
    VAND      Q2,Q2,#&0007...
    VSHR.S16  Q5,Q1,#7            ; get sign
    VSHR.S16  Q4,Q0,#7
    VAND      Q0,Q0,#&000F...     ; get points
    VAND      Q1,Q1,#&000F...
    VSHL.16   Q2,Q15,Q2           ; 16<<chord
    VSHL.16   Q3,Q15,Q3
    VSUB.I16  Q6,Q2,Q15           ; chord base(*16)
    VSUB.I16  Q7,Q3,Q15
    VSHL.16   Q6,Q6,#4            ; base*16(*16)
    VSHL.16   Q7,Q7,#4
    VMLA.I16  Q6,Q0,Q2            ; base+point*step
    VMLA.I16  Q7,Q1,Q3
    VNEG.S16  Q2,Q6               ; negate ready for sign application
    VNEG.S16  Q3,Q7
    VBSL      Q4,Q2,Q6            ; select pos/neg version
    VBSL      Q5,Q3,Q7
    ; results are 16-bit signed, full range
    ; apply scaling:
    VQDMULH.S16 Q0,Q4,Q13         ; left 0-7
    VQDMULH.S16 Q1,Q4,Q14         ; right 0-7
    VQDMULH.S16 Q2,Q5,Q13         ; left 8-15
    VQDMULH.S16 Q3,Q5,Q14         ; right 8-15

That’s 27 instructions for the VIDC1 version and 29 for the VIDC2 version. Now we need to mix together the channels to produce the final stereo data.
For 1 channel input:

    VZIP.16   Q0,Q1               ; 0-7
    VZIP.16   Q2,Q3               ; 8-15
    VSTMIA    out!,{D0-D7}

2 channel:

    VPADD.I16 D0,D0,D1            ; l0-7 -> l0-3
    VPADD.I16 D2,D2,D3            ; r0-7 -> r0-3
    VPADD.I16 D1,D4,D5            ; l8-15 -> l4-7
    VPADD.I16 D3,D6,D7            ; r8-15 -> r4-7
    VZIP.16   Q0,Q1               ; 0-7
    VSTMIA    out!,{D0-D3}

4 channel:

    VPADD.I16 D0,D0,D1            ; l0-7 -> l0-3
    VPADD.I16 D2,D2,D3            ; r0-7 -> r0-3
    VPADD.I16 D1,D4,D5            ; l8-15 -> l4-7
    VPADD.I16 D3,D6,D7            ; r8-15 -> r4-7
    VPADD.I16 D0,D0,D1            ; l0-7 -> l0-3
    VPADD.I16 D1,D2,D3            ; r0-7 -> r0-3
    VZIP.16   D0,D1               ; 0-3
    VSTMIA    out!,{D0-D1}

8 channel:

    VPADD.I16 D0,D0,D1            ; l0-7 -> l0-3
    VPADD.I16 D2,D2,D3            ; r0-7 -> r0-3
    VPADD.I16 D1,D4,D5            ; l8-15 -> l4-7
    VPADD.I16 D3,D6,D7            ; r8-15 -> r4-7
    VPADD.I16 D0,D0,D1            ; l0-7 -> l0-3
    VPADD.I16 D1,D2,D3            ; r0-7 -> r0-3
    VPADD.I16 D0,D0,D0            ; l0-3 -> l0-1, junk
    VPADD.I16 D1,D1,D1            ; r0-3 -> r0-1, junk
    VZIP.16   D0,D1               ; 0-1, junk
    VSTMIA    out!,{D0}

Worst-case (8 channel VIDC2), that’s 39 instructions to process 16 input bytes (producing two stereo pairs). By way of comparison, the code generator that SoundDMA uses seems to have a worst-case of 63 instructions to produce one stereo pair, so the NEON version (assuming it works!) is around 3.2 times faster. And SoundDMA makes use of a memory-based lookup table for the log-to-lin conversion, so will be subject to more cache thrashing/memory stalls.

The end

Of course the only downside to all the above routines (apart from them being completely untested!) is that they’re designed to work on fixed size amounts of input data, producing varying amounts of output data. In reality the opposite is probably needed – routines that take varying amounts of input data and produce a fixed amount of output data. This will allow them to adapt to any buffer size/alignment constraints that the host places on them. Although some of the routines can easily be updated to do less work, a lot of their efficiency comes from doing things in parallel, so below a certain amount of work per iteration the code will start becoming less efficient.
So the easiest solution might be to just use two or more routines, switching between them depending on how much input/output space is available. And in some cases the fast routines can also be made even faster, by swapping the VLDM/VSTM instructions for VLD1/VST1 with alignment specifiers (assuming the buffers are appropriately aligned).

Anyone have any fun routines of their own to share? Or can they spot any bugs or improvements to the above?
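To make the "two or more routines" idea concrete, here is a minimal scalar sketch in Python (not NEON, and just as untested against real hardware as the listings above). It models the 8bpp conversion as a 16-pixel block routine built on two 16-entry tables, plus a per-pixel fallback for whatever is left over. The names `vtbl`, `convert_block16`, `convert_tail` and `convert`, and the contents of `EXAMPLE_PAL` and `EXAMPLE_RGB`, are all made up for illustration; in particular the table values are not the real VIDC1/2 layout, which the post leaves as an exercise.

```python
# Hypothetical table contents: chosen only so that the two tables use
# disjoint bits. They are NOT the real VIDC1/2 palette layout.
EXAMPLE_PAL = [(i * 0x1111) & 0x7777 for i in range(16)]   # indexed by low nibble
EXAMPLE_RGB = [((n & 1) << 3) | ((n & 2) << 6) | ((n & 4) << 9) | ((n & 8) << 12)
               for n in range(16)]                          # indexed by high nibble


def vtbl(table, indices):
    """Scalar model of VTBL.8: out-of-range indices read as zero."""
    return [table[i] if i < len(table) else 0 for i in indices]


def convert_block16(block, pal, rgb):
    """Fast path: exactly 16 pixels, high and low output bytes looked up separately."""
    assert len(block) == 16
    pal_idx = [b & 0x0F for b in block]          # palette index (low nibble)
    rgb_idx = [b >> 4 for b in block]            # direct rgb bits (high nibble)
    high = [p | r for p, r in zip(vtbl([v >> 8 for v in pal], pal_idx),
                                  vtbl([v >> 8 for v in rgb], rgb_idx))]
    low = [p | r for p, r in zip(vtbl([v & 0xFF for v in pal], pal_idx),
                                 vtbl([v & 0xFF for v in rgb], rgb_idx))]
    # Equivalent of the VZIP.8 step: pair each low byte with its high byte.
    return [(h << 8) | l for h, l in zip(high, low)]


def convert_tail(pixels, pal, rgb):
    """Slow path: one pixel at a time, for the part that doesn't fill a block."""
    return [pal[b & 0x0F] | rgb[b >> 4] for b in pixels]


def convert(src, pal, rgb):
    """Use the block routine while 16 pixels are available, then the tail routine."""
    out = []
    full = len(src) - len(src) % 16
    for i in range(0, full, 16):
        out += convert_block16(src[i:i + 16], pal, rgb)
    out += convert_tail(src[full:], pal, rgb)
    return out
```

The split only works because each output pixel is the OR of a palette-derived part and a part derived directly from the top four bits, which is why the NEON version can get away with register-held tables instead of a 256-entry one.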
Jeffrey Lee (213) 6048 posts |
I still haven’t tested the code, but I did recently spot that the immediate constants I use in the audio code aren’t available, and using chord 7 would have resulted in an overflow. So here’s some new code, for VIDC2 format only (since nothing really uses VIDC1). This time, Q15 needs to be loaded with &0004…

    VLDMIA    in!,{D0-D1}         ; 16 bytes
    VMOVL.U8  Q1,D1               ; 8->16
    VMOVL.U8  Q0,D0
    VSHR.U16  Q3,Q1,#5            ; get chords
    VSHR.U16  Q2,Q0,#5
    VSHL.U16  Q5,Q1,#15           ; get sign
    VSHL.U16  Q4,Q0,#15
    VSHR.S16  Q5,Q5,#15           ; extend sign
    VSHR.S16  Q4,Q4,#15
    VBIC      Q0,Q0,#&00E1...     ; get points(*2)
    VBIC      Q1,Q1,#&00E1...
    VSHL.U16  Q2,Q15,Q2           ; 4<<chord = half the step value
    VSHL.U16  Q3,Q15,Q3
    VSUB.I16  Q6,Q2,Q15
    VSUB.I16  Q7,Q3,Q15
    VSHL.16   Q6,Q6,#5            ; chord base, max 16256
    VSHL.16   Q7,Q7,#5
    VMLA.I16  Q6,Q0,Q2            ; base+(point*2)*halfstep
    VMLA.I16  Q7,Q1,Q3
    VNEG.S16  Q2,Q6               ; negate ready for sign application
    VNEG.S16  Q3,Q7
    VBSL      Q4,Q2,Q6            ; select pos/neg version
    VBSL      Q5,Q3,Q7
    ; results are 16-bit signed, full range
    ; apply scaling:
    VQDMULH.S16 Q0,Q4,Q13         ; left 0-7
    VQDMULH.S16 Q1,Q4,Q14         ; right 0-7
    VQDMULH.S16 Q2,Q5,Q13         ; left 8-15
    VQDMULH.S16 Q3,Q5,Q14         ; right 8-15

Now that objasm 4 is out, I think one of the things I’ll try doing over the next few weeks is putting this code into SoundDMA. It’ll be interesting to see how much faster it is in practice (although I doubt anyone will be able to notice the difference!)
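For anyone who wants to check the arithmetic without a NEON machine to hand, the per-sample maths of the listing above can be written as a scalar Python sketch. The bit layout (sign in bit 0, point in bits 1-4, chord in bits 5-7) is simply what the shifts and masks in the listing imply. The function names are invented here, and `first_version_magnitude` assumes Q15 held 16 in the first version, going by its "16<<chord" comment.

```python
def vidc2_to_linear(byte):
    """Scalar model of the revised log-to-linear code, before the VQDMULH scaling."""
    chord = byte >> 5                 # VSHR.U16 #5
    negative = byte & 1               # sign lives in bit 0
    point_x2 = byte & 0x1E            # VBIC #&00E1: the point value, already doubled
    half_step = 4 << chord            # Q15 holds &0004
    base = (half_step - 4) << 5       # chord base, max 16256
    magnitude = base + point_x2 * half_step
    return -magnitude if negative else magnitude


def first_version_magnitude(chord, point):
    """Magnitude as computed by the first version of the code (assumes Q15 = 16)."""
    step = 16 << chord
    return ((step - 16) << 4) + point * step
```

The largest magnitude the revised code produces is 16256 + 30*512 = 31616, which fits in a signed 16-bit lane. The first version reaches 63232 for chord 7, point 15, which is the overflow mentioned above.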
Steve Revill (20) 1361 posts |
I think you might want a few of those VSTIMA instructions to say VSTMIA. I know nothing of the ARMv7 instruction set yet (embarrassingly) so I’ll pass on making any useful observations. |
Jeffrey Lee (213) 6048 posts |
Well spotted!
Although I haven’t had much chance to test the code I’ve been writing, I think I’m now pretty familiar with the NEON instruction set. It looks like it’s pretty powerful, although there are a couple of caveats to be aware of – e.g. different instructions have different sets of immediate constants available, there’s a 20 or so cycle delay/stall to transfer NEON registers to ARM registers, you need to be careful not to mix NEON instructions with VFP instructions because it’ll cause the NEON/VFP pipeline to flush, etc. About the only thing I can think of that’s missing is a version of the VEXT instruction which uses a register (scalar) index/shift amount instead of an immediate constant. This would be very useful for situations where you need to consume a stream of source data at a variable rate. E.g. I’ve been working on some NEON optimised sound code for ArcEm. Although the log → linear code will be pretty much as above, the mixing of the stereo channels needs to be completely different, since there’s no surefire way for the emulator to determine how many channels are actually in use. So I’m ending up modeling something which is a bit closer to how the actual hardware worked. Plus it needs to perform samplerate conversion to whatever arbitrary sample rate the host is running at. The end result is that each destination sample could be formed from anywhere between 8 and 16 source samples (assuming a maximum of 8x downsampling), and it consumes the source samples at a variable rate. Not being able to use VEXT to step through the source data means that I’m unable to perform the log → linear conversion on-the-fly. Instead there needs to be one pass for log→linear conversion, which writes the data to a temporary buffer, and then the second pass for the mixing and samplerate conversion – which will have to load 16*2 samples of source data for each iteration of the loop! 
Admittedly 99% of the data will be in the cache, and the mixing/sample rate conversion loop is probably long enough for me to hide the load delay, but it’s not the most elegant solution compared to just using VEXT (or perhaps VSLI) to shift in a couple of quadwords of new data every so often. Although I’m not even sure if there are enough registers available for a combined loop. So in summary, I’m enjoying NEON quite a lot :-) |
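As a rough illustration of why a register-indexed VEXT would help, here is a small Python model. `vext` mirrors what the real instruction does with its immediate operand; `sliding_windows` is a hypothetical consumer that steps through a stream by a varying amount each time, which is the access pattern described above. On real NEON the offset passed to `vext` would have to be an assemble-time constant, so this loop has no direct equivalent.

```python
def vext(n, m, imm):
    """Model of VEXT: len(n) elements of the concatenation n:m, starting at element imm."""
    assert len(n) == len(m) and 0 <= imm < len(n)
    return (n + m)[imm:imm + len(n)]


def sliding_windows(stream, width, steps):
    """Return one width-element window per step, advancing by a run-time amount each time.

    Each window is built from two aligned 'registers' of the stream, so the
    caller must supply at least one full register beyond the last window start.
    """
    windows = []
    pos = 0
    for step in steps:
        base = pos - pos % width                      # start of the aligned register pair
        lo = stream[base:base + width]
        hi = stream[base + width:base + 2 * width]
        windows.append(vext(lo, hi, pos % width))     # offset only known at run time
        pos += step
    return windows
```

For example, `sliding_windows(list(range(64)), 8, [3, 5, 1, 7])` produces windows starting at elements 0, 3, 8 and 9.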
Jeffrey Lee (213) 6048 posts |
Over the past few days I’ve been working on adding the NEON-optimised log-to-linear sound code to SoundDMA. I think it’s now at the state where it’s fairly optimal (or at least as optimal as I can be bothered with for now), so it’s time for some stats: Stats for an idle sound system (i.e. no sounds being played):
You’ll notice how the original ARM routine takes about the same amount of time, no matter how many voices are active. I think this all comes down to how many memory writes are occurring; the ARM routine only writes out one word at a time, while the NEON routine writes out between 2 and 16, depending on how many voices are active. Now some figures for when the sound system is under load (playing music through Maestro, so 8 sound channels in use):
You’ll notice that the ARM CPU usage is significantly more than it was before, while the NEON usage is about the same (technically it should be identical, but I guess there were measurement errors). This is because the ARM routine contains a shortcut for skipping silent input data, while the NEON one doesn’t. I did try adding such a shortcut to the NEON version, but it only made the code slower; perhaps someone else will try their hand at it once the code is checked in.

Note that all timings were made on a BB-xM downclocked to 300MHz (both to increase the resolution/accuracy of the timing results, and to get a better picture of how the changes affect an idle machine), and with the default sample rate of 22kHz. Also there’ll be some error in the results since I didn’t go to the effort of making the code run with IRQs disabled.

The code is pretty much ready to go, so expect to see it soon (not that I expect many people will notice an extra 0.5% of performance!)
Jeffrey Lee (213) 6048 posts |
The NEON sound code got checked in over the weekend, as well as a few other important changes:
It’s worth pointing out that those changes are only for the HAL version of the module – the separate versions of SoundDMA used by the Iyonix and IOMD ROMs remain as they are, and didn’t suffer from any of the bugs that were in the HAL one. The NEON code is here, if anyone’s interested in how that turned out. In the end it turned out pretty different to the code I posted to this thread, since it needs to cope with the SoundGain setting, and the mulaw decoding is a bit more optimal too (no longer using VMLA to multiply by a power of two!)
Jeffrey Lee (213) 6048 posts |
Here’s a little optimisation tip that I’ve just discovered: Using VAND instead of VMOV can result in better performance (on Cortex-A8, at least) because it can dual-issue with other instructions. I’d been seeing this kind of thing in generated code for a while now, but have only just found an explanation for it. https://git.xiph.org/?p=opus.git;a=blob;f=celt/arm/celt_pitch_xcorr_arm.s;h=f96e0a88bbe609ed638b1a44b67e9038a3ed3447;hb=HEAD#l80 |
Jeffrey Lee (213) 6048 posts |
Some idle thoughts on how we could NEON optimise sprite plotting in SpriteExtend. Actually, this is based on some thoughts I was having for optimising the pixel format conversion in vncserv, so it’s more about pixel format conversion than any of the more advanced sprite plotting operations (e.g. alpha blending).

Looking at the NEON instruction set, it seems that the key instructions would be VSHL/VSHR and VBIT/VBIF/VBSL. If you use them in pairs then you can transfer arbitrary bits from one position in one vector to another position in another vector – which is exactly what you need for pixel format conversion, e.g.:

    VSHL      Qtemp, QIn, #shift
    VBIT      Qout, Qtemp, Qmask

Is roughly equivalent to:

    AND       Rtemp, Rmask, Rin, LSL #shift
    ORR       Rout, Rout, Rtemp   ; (Assume Rout starts as 0, otherwise BIC Rout, Rout, Rmask might be needed beforehand)

If you have a routine which is able to take two pixel format descriptors as an input, and returns an array as a result which indicates which source bit each destination bit maps to, then it becomes fairly trivial to use that array to generate a list of shift and mask values that are required for generating either NEON or ARM code for the pixel format conversion (NEON is SIMD with 128bit or 64bit registers; treat ARM as being SIMD with 32bit registers).

Then you can pass that list of shift and mask values into a code generator which either spits out the code directly (naive approach) or is able to do some extra optimisation (e.g. it could build a simple DAG and interleave instructions to improve parallelism, or it could try and find shorter code sequences by using a wider range of instructions or by making use of intermediate results). But the key thing is that you would have a fairly generic code generator which will scale well depending on the number of available registers, and can easily be extended to support any new pixel formats, or any other kind of bit remapping operation.
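The first two stages of that pipeline (bit mapping array, then shift/mask list) are small enough to sketch in Python. Everything here is an assumption made for illustration: the descriptor format (channel name mapped to lowest bit position and width), the function names, the two example formats, and the choice to fill the low bits of a widened channel by repeating the top source bits. `apply_ops` stands in for the generated code, with one shift-and-masked-insert per list entry; a negative shift means a right shift (VSHR or LSR rather than VSHL or LSL).

```python
def bit_sources(src_fmt, dst_fmt, dst_bits):
    """For each destination bit, the source bit that feeds it (None if unused).

    Channels are aligned at their top bits. When a channel gets wider, the
    extra low bits are filled by repeating the top source bits.
    """
    sources = [None] * dst_bits
    for chan, (dpos, dwidth) in dst_fmt.items():
        spos, swidth = src_fmt[chan]
        for i in range(dwidth):
            s = swidth - dwidth + i      # source bit within the channel
            while s < 0:                 # widening: wrap round to the top bits
                s += swidth
            sources[dpos + i] = spos + s
    return sources


def shift_mask_list(sources):
    """Group destination bits by shift distance, giving sorted (shift, mask) pairs."""
    ops = {}
    for d, s in enumerate(sources):
        if s is not None:
            ops[d - s] = ops.get(d - s, 0) | (1 << d)
    return sorted(ops.items())


def apply_ops(pixel, ops):
    """What the generated code would do: shift, mask, and merge for each entry."""
    out = 0
    for shift, mask in ops:
        moved = pixel << shift if shift >= 0 else pixel >> -shift
        out |= moved & mask
    return out


# Hypothetical descriptors: 16bpp 5:6:5 with red at the top, to 24 bits with red at the bottom.
RGB565 = {"r": (11, 5), "g": (5, 6), "b": (0, 5)}
BGR888 = {"r": (0, 8), "g": (8, 8), "b": (16, 8)}
OPS = shift_mask_list(bit_sources(RGB565, BGR888, 32))
```

With these two formats the list comes out at six entries, two per channel: one for the bits that move across directly and one for the replicated low bits. A real generator would emit one shift plus one VBIT (or AND/ORR pair) for each.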