Fun with NEON

8 posts, 2 voices

Jul 30, 2011 4:36pm

Jeffrey Lee (213) 6048 posts

Here are a few NEON routines I’ve been playing around with today. I haven’t tested any of them yet (that’s where the fun part is ;-)), but I figured they’d be of interest to the wider community. I wrote them with the view of using them in arcem, but it wouldn’t surprise me if some of them (particularly the sound code) could be reused in RISC OS.

8bpp screen blitting

Code to convert pixels from 8bpp palettised to 16bpp, using the VIDC1/2 8bpp pixel format. The naive way of handling this would be to just use a 256-entry lookup table to directly convert each source pixel to a 16 bit colour. But by using the VTBL instruction to index two 16-entry lookup tables held in registers we can come up with a smaller and faster solution that won’t strain the cache/memory bus:

VLDMIA in!,{D4-D5}
VSHR.U8 Q0,Q2,#4
VAND Q2,Q2,#&0F.... ; get pal indices. Note '...' is used to imply a repeated pattern, i.e. VAND with &0F in each byte of the immediate constant
VAND Q0,Q0,#&0F.... ; get rgb bits
VTBL.8 D6,{D24-D25},D4 ; high palette
VTBL.8 D7,{D24-D25},D5
VTBL.8 D2,{D26-D27},D0 ; high rgb bits
VTBL.8 D3,{D26-D27},D1
VTBL.8 D4,{D28-D29},D4 ; low palette
VTBL.8 D5,{D28-D29},D5
VTBL.8 D0,{D30-D31},D0 ; low rgb bits
VTBL.8 D1,{D30-D31},D1
VZIP.8 Q2,Q3 ; Q2 = pixels 0-7, Q3 = 8-15
VZIP.8 Q0,Q1
VORR Q0,Q0,Q2
VORR Q1,Q1,Q3
VSTMIA out!,{D0-D3}

That’s 16 pixels converted in 17 instructions, with plenty of registers spare to allow several instances of the code to be interleaved (to hide any pipeline stalls). Determining the exact format of the lookup tables (D24-D31) is left as an exercise to the reader. Unfortunately this approach is only useful for VIDC1/2 8bpp modes; for fully palettised ones like VIDC20 uses, code would still have to use a palette/lookup table held in memory :-(
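As a rough illustration of why the two-table approach works, here is a scalar Python sketch. It assumes each output colour is the OR of a contribution from the low nibble (palette index) and one from the high nibble (direct RGB bits), which is what the listing above relies on. The table contents (`PAL16`, `RGB16`) and all the function names are made-up placeholders, not the real VIDC1/2 mapping.

```python
def vtbl(table, indices):
    """Model of VTBL.8: byte-wise lookup, out-of-range indices give 0."""
    return [table[i] if i < len(table) else 0 for i in indices]

def split(table16):
    """Split a table of 16-bit values into (low byte table, high byte table)."""
    return [v & 0xFF for v in table16], [v >> 8 for v in table16]

# Placeholder 16-entry tables of 16-bit colour contributions (NOT real VIDC data)
PAL16 = [i * 0x0101 for i in range(16)]   # contribution of the low nibble
RGB16 = [i * 0x1010 for i in range(16)]   # contribution of the high nibble

def convert_8bpp(pixels):
    """Convert 8bpp pixels to 16bpp using four 16-entry byte tables."""
    pal_lo, pal_hi = split(PAL16)          # D28-D29 / D24-D25 in the listing
    rgb_lo, rgb_hi = split(RGB16)          # D30-D31 / D26-D27 in the listing
    pal_idx = [p & 0x0F for p in pixels]   # VAND: palette indices
    rgb_idx = [p >> 4 for p in pixels]     # VSHR: rgb bits
    lo = [a | b for a, b in zip(vtbl(pal_lo, pal_idx), vtbl(rgb_lo, rgb_idx))]
    hi = [a | b for a, b in zip(vtbl(pal_hi, pal_idx), vtbl(rgb_hi, rgb_idx))]
    # VZIP.8 pairs each low byte with its high byte to form a 16-bit pixel
    return [l | (h << 8) for l, h in zip(lo, hi)]
```

The result matches what a single 256-entry table built as `PAL16[p & 15] | RGB16[p >> 4]` would give, which is the property that lets the tables live in registers.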

4bpp screen blitting

A basic variation of the above code; 16 input bytes are converted to 32 output pixels (64 bytes @ 16bpp). This one is also 17 instructions long.

VLDMIA in!,{D0-D1}
VSHR.U8 Q1,Q0,#4
VAND Q0,Q0,#&0F.... ; even pixels
VAND Q1,Q1,#&0F.... ; odd pixels
VTBL.8 D4,{D28-D29},D0 ; high even
VTBL.8 D5,{D28-D29},D1
VTBL.8 D6,{D28-D29},D2 ; high odd
VTBL.8 D7,{D28-D29},D3
VTBL.8 D0,{D30-D31},D0 ; low even
VTBL.8 D1,{D30-D31},D1
VTBL.8 D2,{D30-D31},D2 ; low odd
VTBL.8 D3,{D30-D31},D3
VZIP.8 Q0,Q2 ; Q0 = 0,2,4,6,8,10,12,14, Q2 = 16,18,20,22,24,26,28,30
VZIP.8 Q1,Q3 ; Q1 = 1,3,5,7,9,11,13,15, Q3 = 17,19,21,23,25,27,29,31
VZIP.16 Q0,Q1 ; Q0 = 0-7, Q1 = 8-15
VZIP.16 Q2,Q3 ; Q2 = 16-23, Q3 = 24-31
VSTMIA out!,{D0-D7}
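To sanity-check the pixel ordering claimed in the comments, the VZIP steps can be modelled in Python on plain byte lists. This is only a sketch: `vzip` and `blit_4bpp` are names invented here, and the tables passed in are whatever low-byte/high-byte palette tables you choose to supply.

```python
def vzip(a, b, lane_bytes):
    """Model of VZIP.<8*lane_bytes> a,b on two equal-length byte lists.
    Returns the new contents of (a, b)."""
    mixed = []
    for i in range(0, len(a), lane_bytes):
        mixed += a[i:i + lane_bytes] + b[i:i + lane_bytes]
    return mixed[:len(a)], mixed[len(a):]

def blit_4bpp(src, tab_lo, tab_hi):
    """16 input bytes -> 64 output bytes (32 little-endian 16bpp pixels)."""
    assert len(src) == 16
    even = [s & 0x0F for s in src]     # Q0: even pixels
    odd = [s >> 4 for s in src]        # Q1: odd pixels
    q2 = [tab_hi[i] for i in even]     # high bytes, even pixels
    q3 = [tab_hi[i] for i in odd]      # high bytes, odd pixels
    q0 = [tab_lo[i] for i in even]     # low bytes, even pixels
    q1 = [tab_lo[i] for i in odd]      # low bytes, odd pixels
    q0, q2 = vzip(q0, q2, 1)           # VZIP.8 Q0,Q2
    q1, q3 = vzip(q1, q3, 1)           # VZIP.8 Q1,Q3
    q0, q1 = vzip(q0, q1, 2)           # VZIP.16 Q0,Q1
    q2, q3 = vzip(q2, q3, 2)           # VZIP.16 Q2,Q3
    return q0 + q1 + q2 + q3           # VSTMIA out!,{D0-D7}
```

Reading the 64 output bytes back as 16-bit values should give the pixels in source order: low nibble of byte 0, high nibble of byte 0, low nibble of byte 1, and so on.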

2bpp screen blitting

Although you could load and convert 16 bytes of data at once, you’d end up needing 16 doubleword registers to store the results. This would cause difficulty with interleaving multiple instances of the code, since some registers are also needed to store the lookup tables. So instead, this routine loads 8 bytes at once:

VLDMIA in!,{D0}
VSHR.U8 D2,D0,#4
VSHR.U8 D4,D0,#2
VSHR.U8 D6,D0,#6
VAND D0,D0,#&03...
VAND D2,D2,#&03...
VAND D4,D4,#&03...
VAND D6,D6,#&03...
VTBL.8 D1,{D30},D0 ; high 00
VTBL.8 D3,{D30},D2 ; high 10
VTBL.8 D5,{D30},D4 ; high 01
VTBL.8 D7,{D30},D6 ; high 11
VTBL.8 D0,{D31},D0 ; low 00
VTBL.8 D2,{D31},D2 ; low 10
VTBL.8 D4,{D31},D4 ; low 01
VTBL.8 D6,{D31},D6 ; low 11
VZIP.8 D0,D1 ; Q0 = 0,4,8,12,16,20,24,28
VZIP.8 D2,D3 ; Q1 = 2,6,10,14,18,22,26,30
VZIP.8 D4,D5 ; Q2 = 1,5,9,13,17,21,25,29
VZIP.8 D6,D7 ; Q3 = 3,7,11,15,19,23,27,31
VZIP.16 Q0,Q2 ; Q0 = 0,1,4,5,8,9,12,13, Q2 = 16,17,20,21,24,25,28,29
VZIP.16 Q1,Q3 ; Q1 = 2,3,6,7,10,11,14,15, Q3 = 18,19,22,23,26,27,30,31
VZIP.32 Q0,Q1 ; Q0 = 0-7, Q1 = 8-15
VZIP.32 Q2,Q3 ; Q2 = 16-23, Q3 = 24-31
VSTMIA out!,{D0-D7}

That’s 25 instructions per 8 input bytes (32 pixels).
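The register budget behind that choice is simple arithmetic, sketched below in Python. The function name is invented here; it just counts how many 64-bit D registers the 16bpp output occupies for a given load size.

```python
def output_dregs(input_bytes, bpp, out_bpp=16):
    """Number of 64-bit D registers needed to hold the converted pixels."""
    pixels = input_bytes * 8 // bpp
    return pixels * out_bpp // 64

# Loading 16 bytes of 2bpp data would need 16 D registers for the output,
# whereas loading 8 bytes needs only 8 (D0-D7 in the listing above).
print(output_dregs(16, 2), output_dregs(8, 2))
```

The same sum gives 4 registers for the 8bpp routine, 8 for the 4bpp routine and 16 for an 8-byte load at 1bpp, which agrees with the register lists in the VSTMIA instructions.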

1bpp screen blitting

Despite this one’s size, it looks like it’s still a lot quicker than an ARM equivalent. Since it loads 8 bytes at once it needs 16 doublewords to store the output, so you might have some trouble interleaving it.

VLDMIA in!,{D0}
VSHR.U8 D1,D0,#1
VSHR.U8 Q1,Q0,#2
VSHR.U8 Q2,Q0,#4
VSHR.U8 Q3,Q0,#6
VAND Q0,Q0,#&01...
VAND Q1,Q1,#&01...
VAND Q2,Q2,#&01...
VAND Q3,Q3,#&01...
VTBL.8 D8,{D30},D0 ; high bytes 000
VTBL.8 D9,{D30},D1 ; 001
VTBL.8 D10,{D30},D2 ; 010
VTBL.8 D11,{D30},D3 ; 011
VTBL.8 D12,{D30},D4 ; 100
VTBL.8 D13,{D30},D5 ; 101
VTBL.8 D14,{D30},D6 ; 110
VTBL.8 D15,{D30},D7 ; 111
VTBL.8 D0,{D31},D0 ; low bytes
VTBL.8 D1,{D31},D1
VTBL.8 D2,{D31},D2
VTBL.8 D3,{D31},D3
VTBL.8 D4,{D31},D4
VTBL.8 D5,{D31},D5
VTBL.8 D6,{D31},D6
VTBL.8 D7,{D31},D7
VZIP.8 Q0,Q4 ; Q0 = 0,8,16,24,32,40,48,56, Q4 = 1,9,17,25,33,41,49,57
VZIP.8 Q1,Q5 ; Q1 = 2,10,18,26,34,42,50,58, Q5 = 3,11,19,27,35,43,51,59
VZIP.8 Q2,Q6 ; Q2 = 4,12,20,28,36,44,52,60, Q6 = 5,13,21,29,37,45,53,61
VZIP.8 Q3,Q7 ; Q3 = 6,14,22,30,38,46,54,62, Q7 = 7,15,23,31,39,47,55,63
VZIP.16 Q0,Q4 ; Q0 = 0,1,8,9,16,17,24,25, Q4 = 32,33,40,41,48,49,56,57
VZIP.16 Q1,Q5 ; Q1 = 2,3,10,11,18,19,26,27, Q5 = 34,35,42,43,50,51,58,59
VZIP.16 Q2,Q6 ; Q2 = 4,5,12,13,20,21,28,29, Q6 = 36,37,44,45,52,53,60,61
VZIP.16 Q3,Q7 ; Q3 = 6,7,14,15,22,23,30,31, Q7 = 38,39,46,47,54,55,62,63
VZIP.32 Q0,Q1 ; Q0 = 0-3,8-11, Q1 = 16-19,24-27
VZIP.32 Q4,Q5 ; Q4 = 32-35,40-43, Q5 = 48-51,56-59
VZIP.32 Q2,Q3 ; Q2 = 4-7,12-15, Q3 = 20-23,28-31
VZIP.32 Q6,Q7 ; Q6 = 36-39,44-47, Q7 = 52-55,60-63
VSWP D1,D4 ; Q0 = 0-7, Q2 = 8-15
VSWP D3,D6 ; Q1 = 16-23, Q3 = 24-31
VSWP D9,D12 ; Q4 = 32-39, Q6 = 40-47
VSWP D11,D14 ; Q5 = 48-55, Q7 = 56-63
VSWP Q1,Q2
VSWP Q5,Q6
VSTMIA out!,{D0-D15}

That’s 8 input bytes (64 pixels) in 44 instructions.
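Since the shuffle network is the easiest part to get wrong, here is a Python sketch that tracks pixel numbers through it. It starts from the state described by the comments on the four VZIP.8 instructions, with each Q register modelled as eight 16-bit lanes holding a pixel number. The helper names are invented for this sketch.

```python
def vzip_lanes(a, b, group):
    """VZIP on lists of 16-bit lanes; group=1 models VZIP.16, group=2 VZIP.32."""
    mixed = []
    for i in range(0, len(a), group):
        mixed += a[i:i + group] + b[i:i + group]
    return mixed[:len(a)], mixed[len(a):]

def shuffle_1bpp():
    """Return the pixel numbers in the order the final VSTMIA writes them."""
    # State after VZIP.8: Q0=0,8,.. Q4=1,9,.. Q1=2,10,.. Q5=3,11,.. etc.
    q = [None] * 8
    for n, k in enumerate([0, 2, 4, 6]):
        q[n] = list(range(k, 64, 8))
        q[n + 4] = list(range(k + 1, 64, 8))
    for n in range(4):                                # VZIP.16 Qn,Qn+4
        q[n], q[n + 4] = vzip_lanes(q[n], q[n + 4], 1)
    for a, b in [(0, 1), (4, 5), (2, 3), (6, 7)]:     # VZIP.32 Qa,Qb
        q[a], q[b] = vzip_lanes(q[a], q[b], 2)
    d = []                                            # view as D0..D15
    for reg in q:
        d += [reg[:4], reg[4:]]
    for x, y in [(1, 4), (3, 6), (9, 12), (11, 14)]:  # VSWP Dx,Dy
        d[x], d[y] = d[y], d[x]
    for x, y in [(1, 2), (5, 6)]:                     # VSWP Qx,Qy
        d[2 * x], d[2 * y] = d[2 * y], d[2 * x]
        d[2 * x + 1], d[2 * y + 1] = d[2 * y + 1], d[2 * x + 1]
    return [lane for reg in d for lane in reg]
```

If the network is right, the returned list is simply 0 to 63 in order.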

VIDC1/2 8-bit audio

Some snippets of code that can (hopefully) be used to convert the contents of the 8-bit sound buffer into a standard 16-bit stereo buffer. Log-to-lin conversion is done using arithmetic rather than lookup tables. However, for simplicity, a lookup table/vector is used to scale the resulting values by the stereo offset and channel count scale factor (and any other output scale factor you desire). This code assumes that:

Q15 is loaded with #&0010….
Q14 is a vector of 8 unsigned 16-bit scale factors for the right channel, used for stereo conversion & output scaling. Element 0 is for channel 0, element 1 for channel 1, etc. If fewer than 8 channels are in use then you duplicate the scale factors into the remaining elements (e.g. 0,1,2,3,0,1,2,3 for 4-channel input)
Q13 is the equivalent scale factor vector used for the left channel
Scale factors are fixed point, with 1 integer bit and 15 fraction bits.
Scale factors must be chosen to avoid any overflows; so for 1-channel input the maximum factor is &8000, for 8-channel input it’s &1000
For the output data, it’s assumed that the left channel is in the low halfword and the right channel is in the high halfword. If they need to be the other way round, just swap the contents of Q13 & Q14.
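A small Python sketch of how the scale vectors might be prepared. The 1-channel and 8-channel limits come from the list above; dividing &8000 by the channel count for the 2 and 4 channel cases is an assumption made here, and both function names are invented.

```python
def max_scale_factor(channels):
    """Largest 1.15 fixed-point factor that cannot overflow when 'channels'
    full-range samples are summed (assumed to be &8000 / channels)."""
    assert channels in (1, 2, 4, 8)
    return 0x8000 // channels

def scale_vector(per_channel):
    """Duplicate the per-channel factors to fill all 8 lanes of Q13 or Q14."""
    assert len(per_channel) in (1, 2, 4, 8)
    return list(per_channel) * (8 // len(per_channel))

# 4-channel input: channel n's factor lands in lanes n and n+4
right = scale_vector([max_scale_factor(4)] * 4)
```

Any stereo positioning or extra output gain would then be folded into the individual lane values, as long as each stays at or below the limit.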

Here’s the main chunk of the code, to load 16 bytes and convert them to 16 pairs of scaled 16bit samples. Note the conditional at the start to select between VIDC1 & VIDC2 format data.

 [ VIDC1
VLDMIA	in!,{D0-D1} ; 16 bytes
 | ; VIDC2
VLDMIA	in!,{D2-D3} ; 16 bytes
VSHR.U8	Q0,Q1,#1
VSLI.8	Q0,Q1,#7
 ]
VMOVL.S8 Q1,D1 ; 8->16
VMOVL.S8 Q0,D0
VSHR.U16 Q3,Q1,#4
VAND	Q3,Q3,#&0007... ; get chords
VSHR.U16 Q2,Q0,#4
VAND	Q2,Q2,#&0007...
VSHR.S16 Q5,Q1,#7 ; get sign
VSHR.S16 Q4,Q0,#7
VAND	Q0,Q0,#&000F... ; get points
VAND	Q1,Q1,#&000F...
VSHL.16	Q2,Q15,Q2 ; 16<<chord
VSHL.16 Q3,Q15,Q3
VSUB.I16 Q6,Q2,Q15 ; chord base(*16)
VSUB.I16 Q7,Q3,Q15
VSHL.16 Q6,Q6,#4 ; base*16(*16)
VSHL.16 Q7,Q7,#4
VMLA.I16 Q6,Q0,Q2 ; base+point*step
VMLA.I16 Q7,Q1,Q3
VNEG.S16 Q2,Q6 ; negate ready for sign application
VNEG.S16 Q3,Q7
VBSL	Q4,Q2,Q6 ; select pos/neg version
VBSL	Q5,Q3,Q7
; results are 16-bit signed, full range
; apply scaling:
VQDMULH.S16 Q0,Q4,Q13 ; left 0-7
VQDMULH.S16 Q1,Q4,Q14 ; right 0-7
VQDMULH.S16 Q2,Q5,Q13 ; left 8-15
VQDMULH.S16 Q3,Q5,Q14 ; right 8-15

That’s 27 instructions for the VIDC1 version and 29 for the VIDC2 version. Now we need to mix together the channels to produce the final stereo data. For 1 channel input:

VZIP.16 Q0,Q1 ; 0-7
VZIP.16 Q2,Q3 ; 8-15
VSTMIA out!,{D0-D7}

2 channel:

VPADD.I16 D0,D0,D1 ; l0-7 -> l0-3
VPADD.I16 D2,D2,D3 ; r0-7 -> r0-3
VPADD.I16 D1,D4,D5 ; l8-15 -> l4-7
VPADD.I16 D3,D6,D7 ; r8-15 -> r4-7
VZIP.16 Q0,Q1 ; 0-7
VSTMIA out!,{D0-D3}

4 channel:

VPADD.I16 D0,D0,D1 ; l0-7 -> l0-3
VPADD.I16 D2,D2,D3 ; r0-7 -> r0-3
VPADD.I16 D1,D4,D5 ; l8-15 -> l4-7
VPADD.I16 D3,D6,D7 ; r8-15 -> r4-7
VPADD.I16 D0,D0,D1 ; l0-7 -> l0-3
VPADD.I16 D1,D2,D3 ; r0-7 -> r0-3
VZIP.16 D0,D1 ; 0-3
VSTMIA out!,{D0-D1}

8 channel:

VPADD.I16 D0,D0,D1 ; l0-7 -> l0-3
VPADD.I16 D2,D2,D3 ; r0-7 -> r0-3
VPADD.I16 D1,D4,D5 ; l8-15 -> l4-7
VPADD.I16 D3,D6,D7 ; r8-15 -> r4-7
VPADD.I16 D0,D0,D1 ; l0-7 -> l0-3
VPADD.I16 D1,D2,D3 ; r0-7 -> r0-3
VPADD.I16 D0,D0,D0 ; l0-3 -> l0-1, junk
VPADD.I16 D1,D1,D1 ; r0-3 -> r0-1, junk
VZIP.16 D0,D1 ; 0-1, junk
VSTMIA out!,{D0}

Worst-case (8 channel VIDC2), that’s 39 instructions to process 16 input bytes (producing two stereo pairs). By way of comparison, the code generator that SoundDMA uses seems to have a worst-case of 63 instructions to produce one stereo pair, so the NEON version (assuming it works!) is around 3.2 times faster. And SoundDMA makes use of a memory-based lookup table for the log-to-lin conversion, so will be subject to more cache thrashing/memory stalls.
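The channel mixing is just repeated pairwise addition, which can be modelled in Python as below. This is a sketch with invented names: `left` and `right` stand for the 16 scaled samples in Q0/Q2 and Q1/Q3, and the junk lanes produced by the 8-channel listing are simply not modelled.

```python
def pairwise_add(v):
    """One round of VPADD.I16: add adjacent lanes, wrapping at 16 bits."""
    return [(v[i] + v[i + 1]) & 0xFFFF for i in range(0, len(v), 2)]

def mix(left, right, channels):
    """Mix 16 scaled samples per side down to interleaved [L, R, L, R, ...]."""
    assert len(left) == len(right) == 16 and channels in (1, 2, 4, 8)
    for _ in range(channels.bit_length() - 1):   # log2(channels) VPADD rounds
        left = pairwise_add(left)
        right = pairwise_add(right)
    out = []
    for l, r in zip(left, right):                # VZIP.16
        out += [l, r]
    return out
```

With 8 channels, 16 input bytes collapse to two stereo pairs, which is the case used for the instruction count comparison above.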

The end

Of course the main downside to all the above routines (apart from them being completely untested!) is that they’re designed to work on fixed-size amounts of input data, producing varying amounts of output data. In reality the opposite is probably needed – routines that take varying amounts of input data and produce a fixed amount of output data. This will allow them to adapt to any buffer size/alignment constraints that the host places on them. Although some of the routines can easily be updated to do less work, a lot of their efficiency comes from doing things in parallel, so below a certain size the code will start becoming less efficient. So the easiest solution might be to just use two or more routines, switching between them depending on how much input/output space is available. And in some cases the fast routines can be made even faster by swapping the VLDM/VSTM instructions for VLD1/VST1 with alignment specifiers (assuming the buffers are appropriately aligned).

Anyone have any fun routines of their own to share? Or can they spot any bugs or improvements to the above?

Sep 20, 2011 1:15pm

Jeffrey Lee (213) 6048 posts

Or can they spot any bugs or improvements to the above?

I still haven’t tested the code, but I did recently spot that the immediate constants I use in the audio code aren’t available, and using chord 7 would have resulted in an overflow. So here’s some new code, for VIDC2 format only (since nothing really uses VIDC1). This time, Q15 needs to be loaded with &0004…

VLDMIA	in!,{D0-D1} ; 16 bytes
VMOVL.U8 Q1,D1 ; 8->16
VMOVL.U8 Q0,D0
VSHR.U16 Q3,Q1,#5 ; get chords
VSHR.U16 Q2,Q0,#5
VSHL.U16 Q5,Q1,#15 ; get sign
VSHL.U16 Q4,Q0,#15
VSHR.S16 Q5,Q5,#15 ; extend sign
VSHR.S16 Q4,Q4,#15
VBIC	Q0,Q0,#&00E1... ; get points(*2)
VBIC	Q1,Q1,#&00E1...
VSHL.U16 Q2,Q15,Q2 ; 4<<chord = half the step value
VSHL.U16 Q3,Q15,Q3
VSUB.I16 Q6,Q2,Q15
VSUB.I16 Q7,Q3,Q15
VSHL.16 Q6,Q6,#5 ; chord base, max 16256
VSHL.16 Q7,Q7,#5
VMLA.I16 Q6,Q0,Q2 ; base+(point*2)*halfstep
VMLA.I16 Q7,Q1,Q3
VNEG.S16 Q2,Q6 ; negate ready for sign application
VNEG.S16 Q3,Q7
VBSL	Q4,Q2,Q6 ; select pos/neg version
VBSL	Q5,Q3,Q7
; results are 16-bit signed, full range
; apply scaling:
VQDMULH.S16 Q0,Q4,Q13 ; left 0-7
VQDMULH.S16 Q1,Q4,Q14 ; right 0-7
VQDMULH.S16 Q2,Q5,Q13 ; left 8-15
VQDMULH.S16 Q3,Q5,Q14 ; right 8-15
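For anyone wanting to check the arithmetic, here is a scalar Python model of the revised conversion, with the bit layout read straight off the listing above (bit 0 = sign, bits 1-4 = point, bits 5-7 = chord). The second function reproduces the magnitude calculation from the first version of the code, to show the chord 7 overflow. Function names are invented for this sketch.

```python
def vidc2_to_linear(b):
    """Model of the revised listing for one VIDC2-format byte (Q15 = &0004)."""
    chord = b >> 5                    # VSHR.U16 #5
    point2 = b & 0x1E                 # VBIC #&00E1: point * 2
    half_step = 4 << chord            # VSHL Q15 by chord
    base = (half_step - 4) << 5       # chord base, max 16256
    magnitude = base + point2 * half_step
    return -magnitude if b & 1 else magnitude

def original_magnitude(chord, point):
    """Magnitude computed by the first version (Q15 = &0010), before the sign."""
    step = 16 << chord
    return ((step - 16) << 4) + point * step

print(vidc2_to_linear(0xFE))          # largest positive value
print(original_magnitude(7, 15))      # does not fit in a signed 16-bit lane
```

In the revised version the largest magnitude is 31616, so every result fits in a signed 16-bit lane, and the curve is continuous across chord boundaries.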

Now that objasm 4 is out, I think one of the things I’ll try doing over the next few weeks is putting this code into SoundDMA. It’ll be interesting to see how much faster it is in practice (although I doubt anyone will be able to notice the difference!)

Sep 22, 2011 12:03am

Steve Revill (20) 1393 posts

I think you might want a few of those VSTIMA instructions to say VSTMIA. I know nothing of the ARMv7 instruction set yet (embarrassingly) so I’ll pass on making any useful observations.

Sep 22, 2011 12:38pm

Jeffrey Lee (213) 6048 posts

I think you might want a few of those VSTIMA instructions to say VSTMIA

Well spotted!

I know nothing of the ARMv7 instruction set yet (embarrassingly) so I’ll pass on making any useful observations.

Although I haven’t had much chance to test the code I’ve been writing, I think I’m now pretty familiar with the NEON instruction set. It looks like it’s pretty powerful, although there are a couple of caveats to be aware of – e.g. different instructions have different sets of immediate constants available, there’s a 20 or so cycle delay/stall to transfer NEON registers to ARM registers, you need to be careful not to mix NEON instructions with VFP instructions because it’ll cause the NEON/VFP pipeline to flush, etc.

About the only thing I can think of that’s missing is a version of the VEXT instruction which uses a register (scalar) index/shift amount instead of an immediate constant. This would be very useful for situations where you need to consume a stream of source data at a variable rate. E.g. I’ve been working on some NEON optimised sound code for ArcEm. Although the log → linear code will be pretty much as above, the mixing of the stereo channels needs to be completely different, since there’s no surefire way for the emulator to determine how many channels are actually in use. So I’m ending up modeling something which is a bit closer to how the actual hardware worked. Plus it needs to perform samplerate conversion to whatever arbitrary sample rate the host is running at. The end result is that each destination sample could be formed from anywhere between 8 and 16 source samples (assuming a maximum of 8x downsampling), and it consumes the source samples at a variable rate. Not being able to use VEXT to step through the source data means that I’m unable to perform the log → linear conversion on-the-fly. Instead there needs to be one pass for log→linear conversion, which writes the data to a temporary buffer, and then the second pass for the mixing and samplerate conversion – which will have to load 16*2 samples of source data for each iteration of the loop! Admittedly 99% of the data will be in the cache, and the mixing/sample rate conversion loop is probably long enough for me to hide the load delay, but it’s not the most elegant solution compared to just using VEXT (or perhaps VSLI) to shift in a couple of quadwords of new data every so often. Although I’m not even sure if there are enough registers available for a combined loop.

So in summary, I’m enjoying NEON quite a lot :-)

Feb 22, 2012 12:22am

Jeffrey Lee (213) 6048 posts

Over the past few days I’ve been working on adding the NEON-optimised log-to-linear sound code to SoundDMA. I think it’s now at the state where it’s fairly optimal (or at least as optimal as I can be bothered with for now), so it’s time for some stats:

Stats for an idle sound system (i.e. no sounds being played):

Voices	ARM CPU usage	NEON CPU usage	Performance improvement
1	0.6903%	0.1105%	6.24x
2	0.6885%	0.1701%	4.04x
4	0.6874%	0.3028%	2.27x
8	0.6887%	0.5621%	1.22x

You’ll notice how the original ARM routine takes about the same amount of time, no matter how many voices are active. I think this all comes down to how many memory writes are occurring; the ARM routine only writes out one word at a time, while the NEON routine writes out between 16 and 2, depending on how many voices are active.

Now some figures for when the sound system is under load (playing music through Maestro, so 8 sound channels in use):

ARM CPU usage	NEON CPU usage	Performance improvement
1.07993%	0.5766%	1.87x

You’ll notice that the ARM CPU usage is significantly more than it was before, while the NEON usage is about the same (technically it should be identical, but I guess there were measurement errors). This is because the ARM routine contains a shortcut for skipping silent input data, while the NEON one doesn’t. I did try adding such a shortcut to the NEON version, but it only made the code slower; perhaps someone else will try their hand at it once the code is checked in.

Note that all timings were made on a BB-xM downclocked to 300MHz (both to increase the resolution/accuracy of the timing results, and to get a better picture of how the changes affect an idle machine), and with the default sample rate of 22kHz. Also there’ll be some error in the results since I didn’t go to the effort of making the code run with IRQs disabled.

The code is pretty much ready to go, so expect to see it soon (not that I expect many people to notice an extra 0.5% of performance!)

Mar 26, 2012 1:07pm

Jeffrey Lee (213) 6048 posts

The NEON sound code got checked in over the weekend, as well as a few other important changes:

Added support for oversampling (including a NEON version of the code – I didn’t spend any time trying to optimise it, but profiling showed it’s about twice as fast as the ARM version)
Other bits of buffer manipulation code (mono conversion & stereo swapping) have NEON variants too
Fixed a bug where setting the sample rate via Sound_SampleRate would set one of the internal variables to the wrong value, causing the sample rate to be set to something completely different the next time Sound_Configure was called (even if you called with all 0 parameters)
Fixed module finalisation not releasing the DMA channel
Changed buffer fills to occur in an RTSupport routine instead of in the DMASync callback (the old code was enabling IRQs during the callback, which the DMAManager docs say you shouldn’t do. This was causing buffer filling to go out of sync with DMA if the IRQ timing was right)
Various internal bits reworked so that if one of the required modules dies (DMAManager, RTSupport, VFPSupport) then sound will stop and then automatically restart once the module(s) are available again.

It’s worth pointing out that those changes are only for the HAL version of the module – the separate versions of SoundDMA used by the Iyonix and IOMD ROMs remain as they are, and didn’t suffer from any of the bugs that were in the HAL one.

The NEON code is here, if anyone’s interested in how that turned out. In the end it turned out pretty different to the code I posted to this thread, since it needs to cope with the SoundGain setting, and the mulaw decoding is a bit more optimal too (No longer using VMLA to multiply by a power of two!)

Jan 26, 2016 3:23pm

Jeffrey Lee (213) 6048 posts

Here’s a little optimisation tip that I’ve just discovered: Using VAND instead of VMOV can result in better performance (on Cortex-A8, at least) because it can dual-issue with other instructions. I’d been seeing this kind of thing in generated code for a while now, but have only just found an explanation for it.

https://git.xiph.org/?p=opus.git;a=blob;f=celt/arm/celt_pitch_xcorr_arm.s;h=f96e0a88bbe609ed638b1a44b67e9038a3ed3447;hb=HEAD#l80

Apr 13, 2016 1:24pm

Jeffrey Lee (213) 6048 posts

Some idle thoughts on how we could NEON optimise sprite plotting in SpriteExtend. Actually, this is based on some thoughts I was having for optimising the pixel format conversion in vncserv, so it’s more about pixel format conversion than any of the more advanced sprite plotting operations (e.g. alpha blending).

Looking at the NEON instruction set, it seems that the key instructions would be VSHL/VSHR and VBIT/VBIF/VBSL. If you use them in pairs then you can transfer arbitrary bits from one position in one vector to another position in another vector – which is exactly what you need for pixel format conversion, e.g.:

VSHL Qtemp, QIn, #shift
VBIT Qout, Qtemp, Qmask

Is roughly equivalent to:

AND Rtemp, Rmask, Rin, LSL #shift
ORR Rout, Rout, Rtemp ; (Assume Rout starts as 0, otherwise BIC Rout, Rout, Rmask might be needed beforehand)

If you have a routine which is able to take two pixel format descriptors as an input, and returns an array as a result which indicates which source bit each destination bit maps to, then it becomes fairly trivial to use that array to generate a list of shift and mask values that are required for generating either NEON or ARM code for the pixel format conversion (NEON is SIMD with 128bit or 64bit registers; treat ARM as being SIMD with 32bit registers). Then you can pass that list of shift and mask values into a code generator which either spits out the code directly (naive approach) or is able to do some extra optimisation (e.g. it could build a simple DAG and interleave instructions to improve parallelism, or it could try and find shorter code sequences by using a wider range of instructions or by making use of intermediate results). But the key thing is that you would have a fairly generic code generator which will scale well depending on the number of available registers, and can easily be extended to support any new pixel formats, or any other kind of bit remapping operation.
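A minimal Python sketch of the first half of that idea: turning a destination-bit to source-bit mapping into a list of shift and mask pairs, then applying them the way the AND/ORR (or VSHL/VBIT) pair would. The function names and the example 5:5:5 layouts are hypothetical; a real version would derive the mapping from the two pixel format descriptors.

```python
def shift_mask_list(src_bit_for_dst):
    """src_bit_for_dst[d] is the source bit feeding destination bit d, or None.
    Returns sorted (shift, mask) pairs; positive shift means shift left."""
    masks = {}
    for d, s in enumerate(src_bit_for_dst):
        if s is not None:
            shift = d - s
            masks[shift] = masks.get(shift, 0) | (1 << d)
    return sorted(masks.items())

def convert(pixel, ops):
    """Apply the shift/mask pairs to one pixel."""
    out = 0
    for shift, mask in ops:
        moved = pixel << shift if shift >= 0 else pixel >> -shift
        out |= moved & mask       # AND Rtemp,Rmask,Rin,LSL #shift : ORR Rout
    return out

# Hypothetical example: swap the red and blue fields of a 15-bit 5:5:5 pixel
mapping = [d + 10 for d in range(5)] + list(range(5, 10)) + [d - 10 for d in range(10, 15)]
ops = shift_mask_list(mapping)
```

Bits that share the same shift distance collapse into one pair, so the length of the list is the number of shift/insert steps the generated code would need per vector.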
