VFP advice/tutorial

137 posts, 11 voices


Nov 4, 2013 2:47pm Chris Gransden (337) 1207 posts	That’s much better. I now get 80fps at 640 × 480. I tried 1920×1200 but got a ‘Memory cannot be moved’ error.

Nov 4, 2013 2:55pm Jan Rinze (235) 368 posts	Triple buffering with 1920×1200 would cost about 27 MB. Maybe that is a bit too much. Also the memory bus of the Panda-ES might be a lot faster. I only get 42 fps at 640×480.
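For anyone wanting to check the 27 MB figure, here is a minimal sketch of the arithmetic. It assumes 32 bits per pixel and no padding between rows; the post does not state the colour depth, so that is a guess, and `framebuffer_bytes` is just an illustrative name.

```python
def framebuffer_bytes(width, height, bytes_per_pixel=4, buffers=3):
    """Total bytes needed for `buffers` screen banks.

    bytes_per_pixel=4 (32 bpp) is an assumption, not something the post states.
    """
    return width * height * bytes_per_pixel * buffers


total = framebuffer_bytes(1920, 1200)
print(total, "bytes, about", round(total / 1e6, 1), "MB")
```

Under those assumptions three banks at 1920×1200 come to 27,648,000 bytes, which matches the "about 27 MB" estimate.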

Nov 4, 2013 5:16pm Steve Drain (222) 1620 posts	A new disassembler is in the works, but isn’t quite ready yet. However since everyone seems to have come down with a case of VFP fever, I could prioritise it and probably get VFP/NEON disassembly in and working sometime in this coming week? Thanks. I think it is just Rick and myself that have the fever, but it would be good to have the disassembly in StrongED the same as the assembler we write. ;-) I have now found some example code with explanatory notes. It comes with the original TBA BASIC module that was released two years back: http://www.tbasoftware.co.uk/p/downloads.html You need to use TBAFS32 to open the archive. This does not provide the guide I really want, but I have got a simple calculation working now. Some of that guide would point out the care needed when interacting between normal and VFP registers and the importance of using true floating point format values. Part of my problem was imagining that VMUL etc would accept any old values in its parameters, whereas it seems to throw a wobbly if it does not. ;-)
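One plausible way to trip over the "true floating point format" point is to move an integer bit pattern into a VFP register and then do arithmetic on it without converting it first. The sketch below only imitates that on the desktop with Python's `struct` module; it is not VFP code, and it assumes the problem in question is an unconverted integer being read as an IEEE 754 single.

```python
import struct


def bits_as_float32(word):
    """Reinterpret a 32-bit integer bit pattern as an IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<I", word & 0xFFFFFFFF))[0]


def float32_bits(value):
    """Return the 32-bit pattern that encodes `value` as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


raw = bits_as_float32(3)    # integer 3 read as a float: a tiny denormal, not 3.0
good = float32_bits(3.0)    # the pattern a real integer-to-float conversion gives
print(raw, hex(good))
```

The integer 3 read as a single is 3 × 2^-149, whereas the float 3.0 is encoded as &40400000, so a raw transfer and a conversion are very different things.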

Nov 10, 2013 4:21pm Steve Drain (222) 1620 posts	When this was discussed a year or so ago, I produced a trial BasicHelp module with space to include VFP and NEON help. I feel now that I could probably fill in some of those spaces myself, but would be happier if a more knowledgeable person did it. I have updated the module with VFP help taken from the ARM Quick Reference Card, but it is not registered yet: http://kappa.me.uk/Modules/swBasicHelp030.zip Is this useful? Can it be improved? Should I go on and add NEON?

Nov 10, 2013 10:33pm Kuemmel (439) 384 posts	Hi Steve, in case you didn’t find it by now…I wrote lots of VFP/NEON Basic code, that should still work fine on BB/Panda. Here’s the Link: http://www.mikusite.de/pages/riscos.htm Long time ago I used ExtASM, but almost all was converted into the Basic Assembler later …and of course the help module is very welcome to me, NEON adding would be great…it’s such a hassle always flip the papers of the reference manuals from ARM ;-)

Nov 11, 2013 11:14am Steve Drain (222) 1620 posts	Yes, I have looked at your programs. I remembered that you said a couple of years back that you had used ExtASM, but I did think the ones I downloaded looked like BASIC. ;-) My problem in using them as examples is that they seem to be almost all about vector manipulation, which can scramble the mind, and I am really interested in VFP scalar. I am only using the Pi, but I think I have a reasonable grasp for the moment. the help module is very welcome to me, NEON adding would be great…it’s such a hassle always flip the papers of the reference manuals from ARM I have printed off the Quick Reference Guide for VFP, but I cannot find a similar beast for NEON. If I can get a reasonable handle on how it works, I will certainly add that. You could look at the source to the module, in the Resources directory, where there is a blank MessageTrans file for NEON. If you can work out the protocol I have used to write the VFP one, you might be tempted to add it yourself. ;-)

Nov 11, 2013 11:44am Steve Drain (222) 1620 posts	In my original post I wrote: VFP does not have the transcendental functions of the FPE. I have been thinking again about this, and of these lines in the Roadmap: VFP support: Support in FPEmulator so that FPA code maps to VFP hardware; Use of VFP for floating point in BASIC64. It is clear that VFP on its own cannot provide all the functions that BASIC VI gets from the FPE. A solution to this arises automatically from the implementation of the first item, if the transcendental functions are also re-written to take advantage of VFP. Is anyone looking at this, or is there the prospect of it in the near future? I am afraid that it is just too far beyond my ability, but for my own purposes I have been contemplating a module to provide SWIs to do some of it. I would not want to waste my effort if there is a new FPE in the offing. ;-)

Nov 11, 2013 1:42pm Jeffrey Lee (213) 6048 posts	Is anyone looking at this, or is there the prospect of it in the near future? I am afraid that it is just too far beyond my ability, but for my own purposes I have been contemplating a module to provide SWIs to do some of it. I would not want to waste my effort if there is a new FPE in the offing. ;-) I’m not looking at it, and aren’t likely to look at it any time soon. I’ll try and take a look at your BASIC help module tonight, and see if I can come up with a full set of help text for the VFP/NEON assembler.

Nov 11, 2013 3:54pm Steve Drain (222) 1620 posts	I’ll try and take a look at your BASIC help module tonight Thanks. If you approve the way I have done it, I can add the NEON easily once I have a list of what instructions need to be there. I have to make a summary of them all from the ARM Reference site. Not difficult, but a lot of pages to look at.

Nov 11, 2013 3:58pm Steve Drain (222) 1620 posts	I’m not looking at it, and aren’t likely to look at it any time soon. OK. I will have a go at my module, which will be fun. I am already drowning in Taylor series and Chebyshev constants, which I have not needed for nearly half a century, and CORDIC algorithms were hardly heard of then. ;-)

Nov 11, 2013 4:20pm Kuemmel (439) 384 posts	Dear Steve, regarding the transcendental functions you might have seen my examples also. I’m always using this library that some brave guy did https://code.google.com/p/math-neon/source/browse/#svn%2Ftrunk Yes, it’s based on NEON but often just uses scalar math and is well documented so that one with knowledge of VFP could always recode it. Somebody could make a library out of this…it’s not the same accuracy of the FPEMu…but that depends in the end on your purpose.

Nov 11, 2013 4:23pm Jeffrey Lee (213) 6048 posts	If you approve the way I have done it, I can add the NEON easily once I have a list of what instructions need to be there. I have to make a summary of them all from the ARM Reference site. Not difficult, but a lot of pages to look at. I should hopefully be able to produce the list of instructions for you. There are a few gotchas to keep in mind with the syntax, so it’s probably easiest for me to do it than you.

Nov 11, 2013 5:05pm Steve Drain (222) 1620 posts	There are a few gotchas to keep in mind with the syntax VDUP?

Nov 11, 2013 5:40pm Steve Drain (222) 1620 posts	@Kuemmel I noticed that you had sine in NEON code. That library is useful, but it is quite specifically in NEON code and would not translate simply to VFP double precision. With a cursory look at sine, it appears to use a Taylor series to the 7th power. This would be relatively easy to code from new, but would not provide the necessary precision for an FPE substitute, nor is it a very efficient algorithm. With Chebyshev coefficients I think you could get the precision with one fewer term, but they have to be calculated for each exact situation. I sort of understand the theory, but the execution is beyond me at the moment.
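To get a feel for the Taylor versus Chebyshev comparison, here is a rough sketch in Python. It is not the math-neon code and not how the FPE does it. The interval [-π/2, π/2] is an assumption (the post does not say what range reduction is used), and "Chebyshev" here means plain interpolation at Chebyshev nodes rather than a true minimax fit. All function names are made up for the illustration.

```python
import math

HALF_PI = math.pi / 2  # assumed working interval is [-pi/2, pi/2]


def taylor_sin7(x):
    """Taylor polynomial for sine up to the x^7 term, in Horner form."""
    x2 = x * x
    return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)))


def chebyshev_coeffs(f, n, half_width):
    """Coefficients of the degree n-1 interpolant of f at n Chebyshev nodes
    on [-half_width, half_width]."""
    coeffs = []
    for j in range(n):
        total = 0.0
        for k in range(n):
            theta = math.pi * (k + 0.5) / n
            total += f(half_width * math.cos(theta)) * math.cos(j * theta)
        coeffs.append(2.0 * total / n)
    return coeffs


def chebyshev_eval(coeffs, x, half_width):
    """Evaluate sum(c_j * T_j(t)) - c_0/2 with t = x / half_width."""
    t = max(-1.0, min(1.0, x / half_width))
    theta = math.acos(t)
    return sum(c * math.cos(j * theta) for j, c in enumerate(coeffs)) - coeffs[0] / 2


def max_error(approx, samples=2001):
    """Largest absolute error against math.sin on an even grid."""
    xs = [-HALF_PI + math.pi * i / (samples - 1) for i in range(samples)]
    return max(abs(approx(x) - math.sin(x)) for x in xs)


CHEB8 = chebyshev_coeffs(math.sin, 8, HALF_PI)  # 8 nodes gives degree 7
taylor_err = max_error(taylor_sin7)
cheb_err = max_error(lambda x: chebyshev_eval(CHEB8, x, HALF_PI))
print("Taylor degree 7:", taylor_err, " Chebyshev degree 7:", cheb_err)
```

On this interval the Taylor error peaks at the ends, at roughly 1.6e-4, while the Chebyshev-node fit of the same degree spreads its much smaller error across the range. Changing the 8 to 6 gives a degree 5 fit to compare against. In real VFP code the fitted polynomial would be converted to ordinary power-series coefficients and evaluated by Horner's rule with VMLA or similar; the cosine form above is only for checking the numbers.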

Nov 11, 2013 8:05pm Kuemmel (439) 384 posts	@Steve: I’ll check that Chebyshev thingy out. Found a nice page on it that generates a C-Code from your input → http://metamerist.com/cheby/example38.htm From there should be easy to implement in VFP.

Nov 12, 2013 12:15am Jeffrey Lee (213) 6048 posts	There are a few gotchas to keep in mind with the syntax VDUP? Amongst others, yes. E.g. optional suffixes being supported (or not), VLDM/VSTM not supporting Q registers, using strings for 64bit constants, etc. It looks like you’re on the right track with your help module, although I’d say the main missing feature is explanations for what values the different registers/fields can take (e.g. what <Fd> is, or what the valid data type suffixes are). I was initially thinking it could automatically pick which registers to explain, although you’re welcome to try a different approach if you wish! A few extra notes: VFP is using {<cond>} where it should be [<cond>] VMOV output is cut off It would be nice to allow (e.g.) “[ VMOV” to be used instead of “[VFP VMOV”, to save on typing. Or is this a limitation to the way the ‘[’ command is implemented? (Until now I didn’t even realise you could have a * command called ‘[’!). Long-term of course we’d want this integrated with BASIC. I’m still in the middle of producing a NEON instruction listing (there’s a lot of them!), but here’s a VFP listing I produced – it looks like you were missing a few instructions. I’ve also arranged them by required VFP version, including ‘common’ for instructions that are available to both VFP-only and NEON-only machines. Perhaps it would be worth making the output of “[VFP” group the instructions by VFP version? Or do we think just listing the version in the instruction details will be enough? 
Common
------
VLDM<IA|DB>[<cond>][.32] <reg>[!],{<sregs>}
VLDM<IA|DB>[<cond>][.64] <reg>[!],{<dregs>}
  Load sequential list of VFP registers from <reg>

VSTM<IA|DB>[<cond>][.32] <reg>[!],{<sregs>}
VSTM<IA|DB>[<cond>][.64] <reg>[!],{<dregs>}
  Store sequential list of VFP registers to <reg>

VPOP[<cond>] {<sregs>}
VPOP[<cond>] {<dregs>}
  Load sequential list of VFP registers from stack

VPUSH[<cond>] {<sregs>}
VPUSH[<cond>] {<dregs>}
  Store sequential list of VFP registers to stack

VLDR[<cond>] <Dreg>, '[ <reg>[,#<expr>] ']
VLDR[<cond>] <Sreg>, '[ <reg>[,#<expr>] ']
  Load single VFP register from <reg>+<expr>

VLDR[<cond>] <Dreg>,<label>
VLDR[<cond>] <Sreg>,<label>
  Load single VFP register from label

VSTR[<cond>] <Dreg>, '[ <reg>[,#<expr>] ']
VSTR[<cond>] <Sreg>, '[ <reg>[,#<expr>] ']
  Store single VFP register at <reg>+<expr>

VSTR[<cond>] <Dreg>,<label>
VSTR[<cond>] <Sreg>,<label>
  Store single VFP register at label

VMOV[<cond>] <Dreg>,<reglo>,<reghi>
  Transfer 64bit value from ARM to VFP

VMOV[<cond>] <reglo>,<reghi>,<Dreg>
  Transfer 64bit value from VFP to ARM

VMOV[<cond>].32 <Dreg[x]>,<reg>
  Transfer 32bit value from ARM to VFP

VMOV[<cond>].32 <reg>,<Dreg[x]>
  Transfer 32bit value from VFP to ARM

VMSR[<cond>] FPSID|FPSCR|FPEXC,<reg>|APSR_nzcv|APSR_f
  Transfer <reg> or APSR_nzcv or APSR_f to VFP special reg

VMRS[<cond>] <reg>|APSR_nzcv|APSR_f,FPSID|FPSCR|FPEXC|MVFR0|MVFR1
  Transfer VFP special reg to <reg> or APSR_nzcv or APSR_f

VFPv2
-----
VABS[<cond>].F32 <Sreg>,<Sreg>
VABS[<cond>].F64 <Dreg>,<Dreg>
  Calculate absolute value of floating point vector

VADD[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VADD[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Add floating point vectors

VCMP[E][<cond>].F32 <Sreg>,<Sreg>|#0
VCMP[E][<cond>].F64 <Dreg>,<Dreg>|#0
  Compare floating point reg with register or zero, optionally raising exceptions

VCVT[R][<cond>].<S|U>32.F32 <Sreg>,<Sreg>
VCVT[R][<cond>].<S|U>32.F64 <Sreg>,<Dreg>
  Convert single or double precision float to signed/unsigned integer, rounding to zero (VCVT) or as specified by FPSCR (VCVTR)

VCVT[<cond>].F32.<S|U>32 <Sreg>,<Sreg>
VCVT[<cond>].F64.<S|U>32 <Dreg>,<Sreg>
  Convert signed/unsigned integer to single or double precision float

VCVT[<cond>].F64.F32 <Dreg>,<Sreg>
VCVT[<cond>].F32.F64 <Sreg>,<Dreg>
  Convert single precision float to/from double precision float

VMUL[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VMUL[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Multiply floating point vector

VMLA[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VMLA[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Multiply and accumulate floating point vector

VMLS[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VMLS[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Multiply and subtract floating point vector

VDIV[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VDIV[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Divide floating point vector

VSQRT[<cond>].F32 <Sreg>,<Sreg>
VSQRT[<cond>].F64 <Dreg>,<Dreg>
  Calculate square root of floating point vector

VMOV[<cond>] <reg>,<Sreg>
VMOV[<cond>] <Sreg>,<reg>
  Transfer VFP single precision register to/from ARM register

VMOV[<cond>] <reg>,<reg>,<Sreg>,<Sreg>
VMOV[<cond>] <Sreg>,<Sreg>,<reg>,<reg>
  Transfer two adjacent VFP single precision registers to/from ARM registers

VMOV[<cond>] <Sreg>,<Sreg>
VMOV[<cond>] <Dreg>,<Dreg>
  Transfer VFP vector between VFP registers

VNEG[<cond>] <Sreg>,<Sreg>
VNEG[<cond>] <Dreg>,<Dreg>
  Negate floating point vector

VSUB[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VSUB[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Subtract floating point vectors

VFPv3
-----
VCVT[<cond>].F32.<S|U><16|32> <Sreg>,#<expr>
VCVT[<cond>].F64.<S|U><16|32> <Dreg>,#<expr>
  Convert signed/unsigned 16/32bit fixed point scalar with <expr> fractional bits to floating point

VCVT[<cond>].<S|U><16|32>.F32 <Sreg>,#<expr>
VCVT[<cond>].<S|U><16|32>.F64 <Dreg>,#<expr>
  Convert floating point scalar to signed/unsigned 16/32bit fixed point value with <expr> fractional bits

VMOV[<cond>].F32 <Sreg>,#<expr>
VMOV[<cond>].F64 <Dreg>,#<expr>
  Set floating point register to constant floating point value

VFPv4
-----
VF[N]MA[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VF[N]MA[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Fused multiply and accumulate of floating point scalar, optionally negating the output register before the addition

VF[N]MS[<cond>].F32 <Sreg>,<Sreg>,<Sreg>
VF[N]MS[<cond>].F64 <Dreg>,<Dreg>,<Dreg>
  Fused multiply and subtract of floating point scalar, optionally negating the output register before the subtraction
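Two of the entries above are easier to picture with numbers: the VFPv3 fixed point VCVT forms, and the 64bit VMOV that splits a double across <reglo> and <reghi>. The Python below is only a model of the arithmetic, not assembler. The function names are invented for the illustration, and the truncate-towards-zero and saturating behaviour in `float_to_fixed` is an assumption that should be checked against the ARM ARM.

```python
import struct


def fixed_to_float(raw, frac_bits, signed=True, width=32):
    """Interpret `raw` as a fixed point value with `frac_bits` fractional bits."""
    raw &= (1 << width) - 1
    if signed and raw >> (width - 1):
        raw -= 1 << width          # undo two's complement
    return raw / (1 << frac_bits)


def float_to_fixed(value, frac_bits, signed=True, width=32):
    """Scale to fixed point; truncation and saturation are assumed behaviour."""
    scaled = int(value * (1 << frac_bits))   # int() truncates towards zero
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    return max(lo, min(hi, scaled))


def split_double(value):
    """Return (low word, high word) of an IEEE 754 double."""
    lo, hi = struct.unpack("<II", struct.pack("<d", value))
    return lo, hi


print(fixed_to_float(0x00018000, 16))          # 16.16 fixed point
print(hex(float_to_fixed(1.5, 16)))
print([hex(w) for w in split_double(1.0)])
```

With 16 fractional bits, &00018000 is 1.5 and &FFFF8000 is -0.5. The double 1.0 splits into a low word of 0 and a high word of &3FF00000.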

Nov 12, 2013 9:36am Steve Drain (222) 1620 posts	Found a nice page on it that generates a C-Code from your input Fantastic. I knew there would be a site doing that sort of thing, but I could not find one myself in the last week or so. From there should be easy to implement in VFP. !!! Not so fast. My head hurts already, and now the suitable parameters have to be defined. I thought I might look in the FPE source to see how it calculates the existential functions and maybe even find the constants. It appears that the source is embargoed – or am I missing something?

Nov 12, 2013 10:35am Steve Drain (222) 1620 posts	@ Jeffrey B***** this beast. I had just composed a lengthy reply and the page has redrawn, removing everything. ;-( I’m off for a coffee and crossword to regain my composure. I’ll do it again in an editor later.

Nov 12, 2013 1:00pm Steve Drain (222) 1620 posts	@Jeffrey the main missing feature is explanations for what values the different registers/fields can take (e.g. what <Fd> is, or what the valid data type suffixes are). That is in the ‘parameters’ topic, eg [par Fd . It has been quite tricky designing this as it provides for other main topics, such as ARM, as well. There is scope for discussion here. You might also have missed the ‘directives’ topic, which has, eg: [dir ALIGN. This was something you brought up a year ago. VFP is using {<cond>} where it should be [<cond>] I used the QRC as a guide, where it is {C}, but to be consistent with BASIC’s HELP it should be as you suggest and I will do that. VMOV output is cut off Yes, I had noticed that, but not worked on it. It is a buffer problem. There are so many VMOV variants that they exceed the 512 bytes I had allocated. Easily fixed. It would be nice to allow (e.g.) “[ VMOV” to be used instead of “[VFP VMOV”, to save on typing. Or is this a limitation to the way the ‘[’ command is implemented? The code is pretty simple and exploits the facilities provided by MessageTrans. I have divided up the topics into separate files, which makes lookup easier and probably quicker. In an early version I had everything in one file and used a prefix to discriminate. To just have a single command, [ , I would probably have to go back to that, although I have an idea that might work to give both methods. The more discrimination involved, the more complex the code, but that is ok if there is a well defined specification. Long-term of course we’d want this integrated with BASIC. You mean both BASICs. I suggested BASICTrans a year ago, but Sprow is in the process of removing that, which I vehemently oppose. I’m still in the middle of producing a NEON instruction listing (there’s a lot of them!), but here’s a VFP listing I produced – it looks like you were missing a few instructions. There is a lot there; I thought there might be. It will bear some study. Everything in the QRC I found is in my module, but it may be out of date. Certainly it does not mention VFP4, just VFP3H. I’ve also arranged them by required VFP version, including ‘common’ for instructions that are available to both VFP-only and NEON-only machines. I realise that some of this is going to be tricky to design. It is easy enough to throw at the user in one chunk, like HELP [ , but organising the data sensibly is not. Perhaps it would be worth making the output of “[VFP” group the instructions by VFP version? Or do we think just listing the version in the instruction details will be enough? As I have done so far, my preference is to label the instructions with the version when they are not common. It might be possible to do [VFP3 * to list just the relevant instructions without much change to the code, just the files. MessageTrans is pretty flexible.

Nov 12, 2013 1:39pm Jeffrey Lee (213) 6048 posts	I thought I might look in the FPE source to see how it calculates the existential functions and maybe even find the constants. It appears that the source is embargoed – or am I missing something? A while ago ARM released the sources under a BSD license. I’m not sure where each instruction is handled, but all the relevant sources should be here. Long-term of course we’d want this integrated with BASIC. You mean both BASICs. Yes. I suggested BASICTrans a year ago, but Sprow is in the process of removing that, which I vehemently oppose. Perhaps there’s a compromise somewhere, e.g. including BASICTrans in the disc image. I’m still in the middle of producing a NEON instruction listing (there’s a lot of them!), but here’s a VFP listing I produced – it looks like you were missing a few instructions. There is a lot there; I thought there might be. It will bear some study. Everything in the QRC I found is in my module, but it may be out of date. Certainly it does not mention VFP4, just VFP3H. One of the instructions I think you were missing was VABS – or maybe I just wasn’t looking closely enough? Either way, the list I produced is based around the assembler sources, so it should be an accurate representation of what the code can handle. This file contains all the instruction patterns that the VFP/NEON assembler accepts. I did consider just modifying it to allow it to produce a help text file, but unfortunately the way the instructions/encodings are listed isn’t quite in the right format to make a full automatic listing possible. Plus while cross-referencing things against the ARM ARM I’ve spotted a couple of bugs in the assembler which will need fixing, which could have been missed if we just generated the text straight from the assembler rules.

Nov 12, 2013 2:23pm Steve Drain (222) 1620 posts	You can tell I am not really familiar with CVS, but I have identified the file in the CVS tarball now. At a glance I am going to need to do a lot of work to get the methods. Something for later. VABS is there. I had a copy of the TBA file, but did not use it as a source. I will keep it in mind.

Nov 13, 2013 5:08pm Steve Drain (222) 1620 posts	I have an idea that might work to give both methods. That is now implemented. You can do, eg: [ VMOV , and all instances of that keyword in all topics are shown, or you can do [VFP VMOV to just show those in the VFP common topic. This also goes for wildcard searches, where the results are displayed by topic. VFP 3 and 4 topics are VF3 and VF4. Since this now seems like a runner, I have done a lot to tidy up the definitions and their layout, and made some small improvements to the code. I have cross-checked the instructions with Jeffrey’s and made some amendments immediately, but there are a few I want to confirm, either from the TBA source or with J. It is not vital to get all the data right now and I am awaiting the NEON ones. The latest version is at: http://kappa.me.uk/Modules/swBasicHelp034.zip
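The two lookup styles described here can be pictured with a small Python model. It says nothing about how the module or MessageTrans actually does it. `TOPICS` and `lookup` are hypothetical, the help strings are borrowed from Jeffrey's listing, and which instruction sits in which topic is only illustrative.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the per-topic MessageTrans files.
TOPICS = {
    "VFP": {
        "VMOV": "Transfer 64bit value from ARM to VFP",
        "VADD": "Add floating point vectors",
    },
    "VF3": {
        "VMOV": "Set floating point register to constant floating point value",
    },
    "VF4": {
        "VFMA": "Fused multiply and accumulate of floating point scalar",
    },
}


def lookup(pattern, topic=None):
    """Return {topic: {keyword: help}} for keywords matching `pattern`.

    With no topic every table is searched, as with "[ VMOV"; with a topic
    only that table is searched, as with "[VFP VMOV". Wildcards use
    fnmatch rules, which is an assumption about the module's syntax.
    """
    pattern = pattern.upper()
    names = [topic.upper()] if topic else list(TOPICS)
    results = {}
    for name in names:
        hits = {key: text for key, text in TOPICS.get(name, {}).items()
                if fnmatchcase(key, pattern)}
        if hits:
            results[name] = hits
    return results


for name, hits in lookup("VMOV").items():
    for key, text in hits.items():
        print(name, key, "-", text)
```

Here `lookup("VMOV")` returns results grouped under both VFP and VF3, while `lookup("VMOV", "VFP")` returns only the first group, mirroring the by-topic display described in the post.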

Nov 13, 2013 10:49pm Jeffrey Lee (213) 6048 posts	NEON instructions below. Only a couple of them allow a condition code to be used, in case you were thinking that’s a mistake!

NEON
----
VABA.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VABA.<S|U><8|16|32> <Qreg>,<Qreg>,<Qreg>
  Calculate absolute difference of NEON vectors of signed/unsigned 8/16/32 bit integers and accumulate

VABAL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg>
  Calculate absolute difference of NEON vectors of signed/unsigned 8/16/32 bit integers and accumulate into double width output

VABD.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VABD.<S|U><8|16|32> <Qreg>,<Qreg>,<Qreg>
  Calculate absolute difference of NEON vectors of signed/unsigned 8/16/32 bit integers

VABDL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg>
  Calculate absolute difference of NEON vectors of signed/unsigned 8/16/32 bit integers into double width output

VABD.F32 <Dreg>,<Dreg>,<Dreg>
VABD.F32 <Qreg>,<Qreg>,<Qreg>
  Calculate absolute difference of NEON vectors of single precision floats

VABS.<S8|S16|S32|F32> <Dreg>,<Dreg>
VABS.<S8|S16|S32|F32> <Qreg>,<Qreg>
  Calculate absolute value of NEON vector of integers or single precision floats

VADD.<I8|I16|I32|I64|F32> <Dreg>,<Dreg>,<Dreg>
VADD.<I8|I16|I32|I64|F32> <Qreg>,<Qreg>,<Qreg>
  Add NEON vectors of integers or single precision floats

V[R]ADDHN.I<16|32|64> <Dreg>,<Qreg>,<Qreg>
  Add NEON vectors of 16/32/64 bit integers and store high half of results in destination register, truncating (VADDHN) or rounding (VRADDHN)

VADDL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg>
  Add NEON vectors of signed/unsigned 8/16/32 bit integers and zero or sign extend to double width destination register

VADDW.<S|U><8|16|32> <Qreg>,<Qreg>,<Dreg>
  Take NEON vector <Dreg> of signed/unsigned 8/16/32 bit integers and zero or sign extend to double width, add to double-width <Qreg> to produce double-width result

VAC<GE|GT|LE|LT>.F32 <Dreg>,<Dreg>,<Dreg>
VAC<GE|GT|LE|LT>.F32 <Qreg>,<Qreg>,<Qreg>
  Compare absolute values of the elements of two floating point NEON vectors, setting elements of destination vector to all ones or all zeroes

VBIC.<I8|I16|I32|I64> <Dreg>,#<expr>
VBIC.<I8|I16|I32|I64> <Qreg>,#<expr>
  Bit clear of elements of a NEON vector against a constant. For I64, <expr> can be a string containing the desired 64bit value

VBIC.F32 <Dreg>,#<expr>
VBIC.F32 <Qreg>,#<expr>
  Bit clear of elements of a NEON vector against a floating point constant

VBIC[.<dt>] <Dreg>,<Dreg>,<Dreg>
VBIC[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Bit clear of elements of a NEON vector against another vector. <dt> ignored.

VBIF[.<dt>] <Dreg>,<Dreg>,<Dreg>
VBIF[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Bitwise insert if false. <dt> ignored.

VBIT[.<dt>] <Dreg>,<Dreg>,<Dreg>
VBIT[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Bitwise insert if true. <dt> ignored.

VBSL[.<dt>] <Dreg>,<Dreg>,<Dreg>
VBSL[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Bitwise select. <dt> ignored.

VCEQ.<I8|I16|I32|F32> <Dreg>,<Dreg>,[<Dreg>|#0]
VCEQ.<I8|I16|I32|F32> <Qreg>,<Qreg>,[<Qreg>|#0]
VC<GE|GT|LE|LT>.<S8|S16|S32|U8|U16|U32|F32> <Dreg>,<Dreg>,[<Dreg>|#0]
VC<GE|GT|LE|LT>.<S8|S16|S32|U8|U16|U32|F32> <Qreg>,<Qreg>,[<Qreg>|#0]
  Vector compare of signed/unsigned integer or float NEON vector against another vector or zero, setting destination elements to all ones or all zeroes

VCLS.S<8|16|32> <Dreg>,<Dreg>
VCLS.S<8|16|32> <Qreg>,<Qreg>
  Count leading sign bits of elements of integer NEON vector

VCLZ.I<8|16|32> <Dreg>,<Dreg>
VCLZ.I<8|16|32> <Qreg>,<Qreg>
  Count leading zero bits of elements of integer NEON vector

VCNT.8 <Dreg>,<Dreg>
VCNT.8 <Qreg>,<Qreg>
  Count one bits of elements of integer NEON vector

VCVT.<S|U>32.F32 <Dreg>,<Dreg>[,#<expr>]
VCVT.<S|U>32.F32 <Qreg>,<Qreg>[,#<expr>]
  Convert NEON floating point vector to signed/unsigned integer (with <expr> fractional bits)

VCVT.F32.<S|U>32 <Dreg>,<Dreg>[,#<expr>]
VCVT.F32.<S|U>32 <Qreg>,<Qreg>[,#<expr>]
  Convert NEON signed/unsigned integer vector (with <expr> fractional bits) to floating point

VDUP.8|16|32 <Dreg>,<Dreg[x]>|<reg>
VDUP.8|16|32 <Qreg>,<Dreg[x]>|<reg>
  Duplicate element of NEON register or low bits of ARM register into all elements of destination vector

VEOR[.<dt>] <Dreg>,<Dreg>,<Dreg>
VEOR[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Exclusive or of NEON vectors. <dt> ignored.

VEXT.8|16|32|64 <Dreg>,<Dreg>,<Dreg>,#<expr>
VEXT.8|16|32|64 <Qreg>,<Qreg>,<Qreg>,#<expr>
  Extract elements from source registers (treating as one extra-large register) starting at element <expr>

V[R]HADD.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
V[R]HADD.<S|U><8|16|32> <Qreg>,<Qreg>,<Qreg>
  Add two signed/unsigned integer NEON vectors, halve the results (truncating or rounding) and assign to destination

VHSUB.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VHSUB.<S|U><8|16|32> <Qreg>,<Qreg>,<Qreg>
  Subtract two signed/unsigned integer NEON vectors, halve the results (truncating), and assign to destination

VLD1.<8|16|32|64> {<Dreg>[,<Dreg+1>[,<Dreg+2>[,<Dreg+3>]]]}, '[ <reg>[@<64|128|256>] '] [!|<reg>]
  Load 8, 16, 32, or 64 bit values into all elements of up to 4 D registers, with memory access optionally optimised for a guaranteed minimum 64, 128 or 256 bit alignment, and with optional writeback of base register by transfer size or register value

VLD2.<8|16|32> {<Dreg>,<Dreg+1>[,<Dreg+2>,<Dreg+3>]}, '[ <reg>[@<64|128|256>] '] [!|<reg>]
VLD2.<8|16|32> {<Dreg>,<Dreg+2>}, '[ <reg>[@<64|128>] '] [!|<reg>]
  Load and de-interleave 2x8, 2x16, or 2x32 bit values into all elements of 2 or 4 D registers, with memory access optionally optimised for a guaranteed minimum 64, 128 or 256 bit alignment, and with optional writeback of base register by transfer size or register value

VLD3.<8|16|32> {<Dreg>,<Dreg+1>,<Dreg+2>}, '[ <reg>[@64] '] [!|<reg>]
VLD3.<8|16|32> {<Dreg>,<Dreg+2>,<Dreg+4>}, '[ <reg>[@64] '] [!|<reg>]
  Load and de-interleave 3x8, 3x16, or 3x32 bit values into all elements of 3 D registers, with memory access optionally optimised for a guaranteed minimum 64 bit alignment, and with optional writeback of base register by transfer size or register value

VLD4.<8|16|32> {<Dreg>,<Dreg+1>,<Dreg+2>,<Dreg+3>}, '[ <reg>[@<64|128|256>] '] [!|<reg>]
VLD4.<8|16|32> {<Dreg>,<Dreg+2>,<Dreg+4>,<Dreg+6>}, '[ <reg>[@<64|128|256>] '] [!|<reg>]
  Load and de-interleave 4x8, 4x16, or 4x32 bit values into all elements of 4 D registers, with memory access optionally optimised for a guaranteed minimum 64, 128 or 256 bit alignment, and with optional writeback of base register by transfer size or register value

VLD1.<8|16|32> {<Dreg[x]>} '[ <reg>[@<16|32>] '] [!|<reg>]
  Load an 8, 16 or 32 bit value into one element of a D register, with memory access optionally optimised for a guaranteed minimum of 16 or 32 bit alignment, and with optional writeback of base register by transfer size or register value

VLD2.<8|16|32> {<Dreg[x]>,<Dreg+1[x]>} '[ <reg>[@<16|32|64>] '] [!|<reg>]
VLD2.<8|16|32> {<Dreg[x]>,<Dreg+2[x]>} '[ <reg>[@<16|32|64>] '] [!|<reg>]
  Load 2x8, 2x16 or 2x32 bit values into one element of two D registers, with memory access optionally optimised for a guaranteed minimum of 16, 32 or 64 bit alignment, and with optional writeback of base register by transfer size or register value

VLD3.<8|16|32> {<Dreg[x]>,<Dreg+1[x]>,<Dreg+2[x]>} '[ <reg> '] [!|<reg>]
VLD3.<8|16|32> {<Dreg[x]>,<Dreg+2[x]>,<Dreg+4[x]>} '[ <reg> '] [!|<reg>]
  Load 3x8, 3x16 or 3x32 bit values into one element of three D registers, with optional writeback of base register by transfer size or register value

VLD4.<8|16|32> {<Dreg[x]>,<Dreg+1[x]>,<Dreg+2[x]>,<Dreg+3[x]>} '[ <reg>[@<32|64|128>] '] [!|<reg>]
VLD4.<8|16|32> {<Dreg[x]>,<Dreg+2[x]>,<Dreg+4[x]>,<Dreg+6[x]>} '[ <reg>[@<32|64|128>] '] [!|<reg>]
  Load 4x8, 4x16 or 4x32 bit values into one element of four D registers, with memory access optionally optimised for a guaranteed minimum of 32, 64 or 128 bit alignment, and with optional writeback of base register by transfer size or register value

VLD1.<8|16|32> {<Dreg[]>[,<Dreg+1[]>]} '[ <reg>[@<16|32>] '] [!|<reg>]
  Load and replicate an 8, 16 or 32 bit value into all elements of up to two D registers, with memory access optionally optimised for a guaranteed minimum of 16 or 32 bit alignment, and with optional writeback of base register by transfer size or register value

VLD2.<8|16|32> {<Dreg[]>,<Dreg+1[]>} '[ <reg>[@<16|32|64>] '] [!|<reg>]
VLD2.<8|16|32> {<Dreg[]>,<Dreg+2[]>} '[ <reg>[@<16|32|64>] '] [!|<reg>]
  Load, de-interleave and replicate 2x8, 2x16 or 2x32 bit values into all elements of two D registers, with memory access optionally optimised for a guaranteed minimum of 16, 32 or 64 bit alignment, and with optional writeback of base register by transfer size or register value

VLD3.<8|16|32> {<Dreg[]>,<Dreg+1[]>,<Dreg+2[]>} '[ <reg> '] [!|<reg>]
VLD3.<8|16|32> {<Dreg[]>,<Dreg+2[]>,<Dreg+4[]>} '[ <reg> '] [!|<reg>]
  Load, de-interleave and replicate 3x8, 3x16 or 3x32 bit values into all elements of three D registers, with optional writeback of base register by transfer size or register value

VLD4.<8|16|32> {<Dreg[]>,<Dreg+1[]>,<Dreg+2[]>,<Dreg+3[]>} '[ <reg>[@<32|64|128>] '] [!|<reg>]
VLD4.<8|16|32> {<Dreg[]>,<Dreg+2[]>,<Dreg+4[]>,<Dreg+6[]>} '[ <reg>[@<32|64|128>] '] [!|<reg>]
  Load, de-interleave and replicate 4x8, 4x16 or 4x32 bit values into all elements of four D registers, with memory access optionally optimised for a guaranteed minimum of 32, 64 or 128 bit alignment, and with optional writeback of base register by transfer size or register value

VMAX.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VMAX.<S|U><8|16|32> <Qreg>,<Qreg>,<Qreg>
  Compute maximum values of two integer NEON vectors

VMIN.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VMIN.<S|U><8|16|32> <Qreg>,<Qreg>,<Qreg>
  Compute minimum values of two integer NEON vectors

VMAX.F32 <Dreg>,<Dreg>,<Dreg>
VMAX.F32 <Qreg>,<Qreg>,<Qreg>
  Compute maximum values of two floating point NEON vectors

VMIN.F32 <Dreg>,<Dreg>,<Dreg>
VMIN.F32 <Qreg>,<Qreg>,<Qreg>
  Compute minimum values of two floating point NEON vectors

VMLA.F32 <Dreg>,<Dreg>,<Dreg>
VMLA.F32 <Qreg>,<Qreg>,<Qreg>
VMLA.I<8|16|32> <Dreg>,<Dreg>,<Dreg>
VMLA.I<8|16|32> <Qreg>,<Qreg>,<Qreg>
  Multiply and accumulate two NEON vectors

VMLS.F32 <Dreg>,<Dreg>,<Dreg>
VMLS.F32 <Qreg>,<Qreg>,<Qreg>
VMLS.I<8|16|32> <Dreg>,<Dreg>,<Dreg>
VMLS.I<8|16|32> <Qreg>,<Qreg>,<Qreg>
  Multiply and subtract two NEON vectors

VMLAL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg>
  Multiply and accumulate two integer NEON vectors into double-width result

VMLSL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg>
  Multiply and subtract two integer NEON vectors into double-width result

VMLA.F32 <Dreg>,<Dreg>,<Dreg[x]>
VMLA.F32 <Qreg>,<Qreg>,<Dreg[x]>
VMLA.I<8|16|32> <Dreg>,<Dreg>,<Dreg[x]>
VMLA.I<8|16|32> <Qreg>,<Qreg>,<Dreg[x]>
  Multiply and accumulate NEON vector by scalar

VMLS.F32 <Dreg>,<Dreg>,<Dreg[x]>
VMLS.F32 <Qreg>,<Qreg>,<Dreg[x]>
VMLS.I<8|16|32> <Dreg>,<Dreg>,<Dreg[x]>
VMLS.I<8|16|32> <Qreg>,<Qreg>,<Dreg[x]>
  Multiply and subtract NEON vector by scalar

VMLAL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg[x]>
  Multiply and accumulate NEON vector by scalar into double-width result

VMLSL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg[x]>
  Multiply and subtract NEON vector by scalar into double-width result

VMOV.I<8|16|32|64> <Dreg>,#<expr>
VMOV.I<8|16|32|64> <Qreg>,#<expr>
  Set all elements of NEON vector to constant integer value

VMOV.F32 <Dreg>,#<expr>
VMOV.F32 <Qreg>,#<expr>
  Set all elements of NEON vector to constant floating point value

VMOV[.<dt>] <Dreg>,<Dreg>
VMOV[.<dt>] <Qreg>,<Qreg>
  Transfer NEON registers. <dt> ignored.

VMOV[<cond>].<8|16> <Dreg[x]>,<Rt>
  Transfer 8 or 16 bits from low bits of ARM register to NEON register

VMOV[<cond>].<S|U><8|16> <Rt>,<Dreg[x]>
  Transfer 8 or 16 bit integer from NEON register to ARM register, sign or zero extending

VMOVL.<S|U><8|16|32> <Qreg>,<Dreg>
  Transfer elements of integer NEON vector to double-width destination, sign or zero-extending values

VMOVN.I<16|32|64> <Dreg>,<Qreg>
  Transfer low bits of elements of integer NEON vector to half-width destination

VMUL.F32 <Dreg>,<Dreg>,<Dreg>
VMUL.F32 <Qreg>,<Qreg>,<Qreg>
  Multiply floating point NEON vectors

VMUL.I<8|16|32> <Dreg>,<Dreg>,<Dreg>
VMUL.I<8|16|32> <Qreg>,<Qreg>,<Qreg>
  Multiply integer NEON vectors

VMUL.P8 <Dreg>,<Dreg>,<Dreg>
VMUL.P8 <Qreg>,<Qreg>,<Qreg>
  Multiply polynomial NEON vectors

VMULL.<S|U><8|16|32> <Qreg>,<Dreg>,<Dreg>
  Multiply integer NEON vectors to double-width destination

VMULL.P8 <Qreg>,<Dreg>,<Dreg>
  Multiply polynomial NEON vectors to double-width destination

VMUL.F32 <Dreg>,<Dreg>,<Dreg[x]>
VMUL.F32 <Qreg>,<Qreg>,<Dreg[x]>
  Multiply floating point NEON vector by scalar

VMUL.I<16|32> <Dreg>,<Dreg>,<Dreg[x]>
VMUL.I<16|32> <Qreg>,<Qreg>,<Dreg[x]>
  Multiply integer NEON vector by scalar

VMULL.<S|U><16|32> <Qreg>,<Dreg>,<Dreg[x]>
  Multiply integer NEON vector by scalar to double-width destination

VMVN.I<16|32> <Dreg>,#<expr>
VMVN.I<16|32> <Qreg>,#<expr>
  Set all elements of NEON vector to bitwise inverse of constant integer value

VMVN[.<dt>] <Dreg>,<Dreg>
VMVN[.<dt>] <Qreg>,<Qreg>
  Bitwise inverse of NEON vector. <dt> ignored.

VNEG.S<8|16|32> <Dreg>,<Dreg>
VNEG.S<8|16|32> <Qreg>,<Qreg>
  Negate integer NEON vector

VNEG.F32 <Dreg>,<Dreg>
VNEG.F32 <Qreg>,<Qreg>
  Negate floating point NEON vector

VORN.I<16|32> <Dreg>,#<expr>
VORN.I<16|32> <Qreg>,#<expr>
  Bitwise OR NOT of NEON vector with constant integer value

VORN[.<dt>] <Dreg>,<Dreg>,<Dreg>
VORN[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Bitwise OR NOT of NEON vector with vector. <dt> ignored.

VORR.I<16|32> <Dreg>,#<expr>
VORR.I<16|32> <Qreg>,#<expr>
  Bitwise OR of NEON vector with constant integer value

VORR[.<dt>] <Dreg>,<Dreg>,<Dreg>
VORR[.<dt>] <Qreg>,<Qreg>,<Qreg>
  Bitwise OR of NEON vector with vector. <dt> ignored.

VPADAL.<S|U><8|16|32> <Dreg>,<Dreg>
VPADAL.<S|U><8|16|32> <Qreg>,<Qreg>
  Pairwise addition and accumulation of integer NEON vector into double-width destination

VPADD.I<8|16|32> <Dreg>,<Dreg>,<Dreg>
VPADD.F32 <Dreg>,<Dreg>,<Dreg>
  Pairwise addition of two integer or floating point NEON vectors into destination

VPADDL.<S|U><8|16|32> <Dreg>,<Dreg>
VPADDL.<S|U><8|16|32> <Qreg>,<Qreg>
  Pairwise addition of integer NEON vector into double-width destination

VPMAX.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VPMAX.F32 <Dreg>,<Dreg>,<Dreg>
  Pairwise maximum of two integer or floating point NEON vectors

VPMIN.<S|U><8|16|32> <Dreg>,<Dreg>,<Dreg>
VPMIN.F32 <Dreg>,<Dreg>,<Dreg>
  Pairwise minimum of two integer or floating point NEON vectors

VQABS.S<8|16|32> <Dreg>,<Dreg>
VQABS.S<8|16|32> <Qreg>,<Qreg>
  Absolute value of integer NEON vector, with saturation

VQADD.<S|U><8|16|32|64> <Dreg>,<Dreg>,<Dreg>
VQADD.<S|U><8|16|32|64> <Qreg>,<Qreg>,<Qreg>
  Add integer NEON vectors, with saturation

VQDMLAL.S<16|32> <Qreg>,<Dreg>,<Dreg>
VQDMLAL.S<16|32> <Qreg>,<Dreg>,<Dreg[x]>
  Multiply and double integer NEON vector with vector or scalar, accumulating into double-width result, with saturation

VQDMLSL.S<16|32> <Qreg>,<Dreg>,<Dreg>
VQDMLSL.S<16|32> <Qreg>,<Dreg>,<Dreg[x]>
  Multiply and double integer NEON vector with vector or scalar, subtracting into double-width result, with saturation

VQDMULH.S<16|32> <Dreg>,<Dreg>,<Dreg>
VQDMULH.S<16|32> <Qreg>,<Qreg>,<Qreg>
  Multiply and double integer NEON vectors, storing high half of result, with saturation

VQDMULH.S<16|32> <Dreg>,<Dreg>,<Dreg[x]>
VQDMULH.S<16|32> <Qreg>,<Qreg>,<Dreg[x]>
  Multiply and double integer NEON vector with scalar, storing high half of result, with saturation
VQDMULL.S<16\|32> <Qreg>,<Dreg>,<Dreg> Multiply and double integer NEON vectors, storing double-width result, with saturation VQDMULL.S<16\|32> <Qreg>,<Dreg>,<Dreg[x]> Multiply and double integer NEON vector with scalar, storing double-width result, with saturation VQMOVN.<S\|U><16\|32\|64> <Dreg>,<Qreg> Transfer integer NEON vector to half-width destination, saturating VQMOVUN.S<16\|32\|64> <Dreg>,<Qreg> Transfer integer NEON vector to unsigned half-width destination, saturating VQNEG.S<8\|16\|32> <Dreg>,<Dreg> VQNEG.S<8\|16\|32> <Qreg>,<Qreg> Negate integer NEON vector, saturating VQRDMULH.S<16\|32> <Dreg>,<Dreg>,<Dreg> VQRDMULH.S<16\|32> <Qreg>,<Qreg>,<Qreg> Multiply and double integer NEON vectors, storing rounded high half of result, with saturation VQRDMULH.S<16\|32> <Dreg>,<Dreg>,<Dreg[x]> VQRDMULH.S<16\|32> <Qreg>,<Qreg>,<Dreg[x]> Multiply and double integer NEON vector with scalar, storing rounded high half of result, with saturation VQRSHL.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,<Dreg> VQRSHL.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,<Qreg> Saturating, rounding left or right shift of integer NEON vector by vector VQRSHRN.<S\|U><16\|32\|64> <Dreg>,<Qreg>,#<expr> Saturating, rounding shift right of integer NEON vector by constant, with half-width destination VQRSHRUN.S<16\|32\|64> <Dreg>,<Qreg>,#<expr> Saturating, rounding shift right of integer NEON vector by constant, with unsigned half-width destination VQSHL.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,<Dreg> VQSHL.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,<Qreg> Saturating left or right shift of integer NEON vector by vector VQSHL.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VQSHL.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Saturating left shift of integer NEON vector by constant VQSHLU.S<8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VQSHLU.S<8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Saturating left shift of integer NEON vector by constant, with unsigned
destination VQSHRN.<S\|U><16\|32\|64> <Dreg>,<Qreg>,#<expr> Saturating shift right of integer NEON vector by constant, with half-width destination VQSHRUN.S<16\|32\|64> <Dreg>,<Qreg>,#<expr> Saturating shift right of integer NEON vector by constant, with unsigned half-width destination VQSUB.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,<Dreg> VQSUB.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,<Qreg> Subtract integer NEON vectors, with saturation V[R]SUBHN.I<16\|32\|64> <Dreg>,<Qreg>,<Qreg> Subtract integer NEON vectors and store high half of results in destination register, truncating (VSUBHN) or rounding (VRSUBHN) VRECPE.<U32\|F32> <Dreg>,<Dreg> VRECPE.<U32\|F32> <Qreg>,<Qreg> Estimate reciprocal of NEON vector VRECPS.F32 <Dreg>,<Dreg>,<Dreg> VRECPS.F32 <Qreg>,<Qreg>,<Qreg> Step reciprocal of NEON vector to increase accuracy VREV<16\|32\|64>.<8\|16\|32> <Dreg>,<Dreg> VREV<16\|32\|64>.<8\|16\|32> <Qreg>,<Qreg> Reverse order of 8, 16 or 32 bit elements within 16, 32 or 64 bit structures VRSHL.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,<Dreg> VRSHL.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,<Qreg> Rounding shift left or right of integer NEON vector by vector VRSHR.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VRSHR.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Rounding shift right of integer NEON vector by constant VRSHRN.I<16\|32\|64> <Dreg>,<Qreg>,#<expr> Rounding shift right of integer NEON vector by constant, with half-width result VRSQRTE.<U32\|F32> <Dreg>,<Dreg> VRSQRTE.<U32\|F32> <Qreg>,<Qreg> Estimate reciprocal square root of NEON vector VRSQRTS.F32 <Dreg>,<Dreg>,<Dreg> VRSQRTS.F32 <Qreg>,<Qreg>,<Qreg> Step reciprocal square root of NEON vector to increase accuracy VRSRA.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VRSRA.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Rounding shift right and accumulate of integer NEON vector by constant VSHL.I<8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VSHL.I<8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Shift left of integer NEON vector by constant
VSHL.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,<Dreg> VSHL.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,<Qreg> Shift left or right of integer NEON vector by vector VSHLL.<S\|U><8\|16\|32> <Qreg>,<Dreg>,#<expr> Shift left of integer NEON vector by constant, to double-width destination VSHLL.I<8\|16\|32> <Qreg>,<Dreg>,#<8\|16\|32> Shift left of integer NEON vector by constant, to double-width destination (source width equals shift amount) VSHR.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VSHR.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Shift right of integer NEON vector by constant VSHRN.I<16\|32\|64> <Dreg>,<Qreg>,#<expr> Shift right of integer NEON vector by constant, to half-width destination VSLI.<8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VSLI.<8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Shift elements of NEON vector left by constant, and insert in destination VSRA.<S\|U><8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VSRA.<S\|U><8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Shift right and accumulate of integer NEON vector by constant VSRI.<8\|16\|32\|64> <Dreg>,<Dreg>,#<expr> VSRI.<8\|16\|32\|64> <Qreg>,<Qreg>,#<expr> Shift elements of NEON vector right by constant, and insert in destination VST1.<8\|16\|32\|64> {<Dreg>[,<Dreg+1>[,<Dreg+2>[,<Dreg+3>]]]}, '[ <reg>[@<64\|128\|256>] '] [!\|<reg>] Store 8, 16, 32, or 64 bit values from all elements of up to 4 D registers, with memory access optionally optimised for a guaranteed minimum 64, 128 or 256 bit alignment, and with optional writeback of base register by transfer size or register value VST2.<8\|16\|32> {<Dreg>,<Dreg+1>[,<Dreg+2>,<Dreg+3>]}, '[ <reg>[@<64\|128\|256>] '] [!\|<reg>] VST2.<8\|16\|32> {<Dreg>,<Dreg+2>}, '[ <reg>[@<64\|128>] '] [!\|<reg>] Interleave and store 2x8, 2x16, or 2x32 bit values from all elements of 2 or 4 D registers, with memory access optionally optimised for a guaranteed minimum 64, 128 or 256 bit alignment, and with
optional writeback of base register by transfer size or register value VST3.<8\|16\|32> {<Dreg>,<Dreg+1>,<Dreg+2>}, '[ <reg>[@64] '] [!\|<reg>] VST3.<8\|16\|32> {<Dreg>,<Dreg+2>,<Dreg+4>}, '[ <reg>[@64] '] [!\|<reg>] Interleave and store 3x8, 3x16, or 3x32 bit values from all elements of 3 D registers, with memory access optionally optimised for a guaranteed minimum 64 bit alignment, and with optional writeback of base register by transfer size or register value VST4.<8\|16\|32> {<Dreg>,<Dreg+1>,<Dreg+2>,<Dreg+3>}, '[ <reg>[@<64\|128\|256>] '] [!\|<reg>] VST4.<8\|16\|32> {<Dreg>,<Dreg+2>,<Dreg+4>,<Dreg+6>}, '[ <reg>[@<64\|128\|256>] '] [!\|<reg>] Interleave and store 4x8, 4x16, or 4x32 bit values from all elements of 4 D registers, with memory access optionally optimised for a guaranteed minimum 64, 128 or 256 bit alignment, and with optional writeback of base register by transfer size or register value VST1.<8\|16\|32> {<Dreg[x]>} '[ <reg>[@<16\|32>] '] [!\|<reg>] Store an 8, 16 or 32 bit value from one element of a D register, with memory access optionally optimised for a guaranteed minimum of 16 or 32 bit alignment, and with optional writeback of base register by transfer size or register value VST2.<8\|16\|32> {<Dreg[x]>,<Dreg+1[x]>} '[ <reg>[@<16\|32\|64>] '] [!\|<reg>] VST2.<8\|16\|32> {<Dreg[x]>,<Dreg+2[x]>} '[ <reg>[@<16\|32\|64>] '] [!\|<reg>] Store 2x8, 2x16 or 2x32 bit values from one element of two D registers, with memory access optionally optimised for a guaranteed minimum of 16, 32 or 64 bit alignment, and with optional writeback of base register by transfer size or register value VST3.<8\|16\|32> {<Dreg[x]>,<Dreg+1[x]>,<Dreg+2[x]>} '[ <reg> '] [!\|<reg>] VST3.<8\|16\|32> {<Dreg[x]>,<Dreg+2[x]>,<Dreg+4[x]>} '[ <reg> '] [!\|<reg>] Store 3x8, 3x16 or 3x32 bit values from one element of three D registers, with optional writeback of base register by transfer size or register value VST4.<8\|16\|32> {<Dreg[x]>,<Dreg+1[x]>,<Dreg+2[x]>,<Dreg+3[x]>} '[ 
<reg>[@<32\|64\|128>] '] [!\|<reg>] VST4.<8\|16\|32> {<Dreg[x]>,<Dreg+2[x]>,<Dreg+4[x]>,<Dreg+6[x]>} '[ <reg>[@<32\|64\|128>] '] [!\|<reg>] Store 4x8, 4x16 or 4x32 bit values from one element of four D registers, with memory access optionally optimised for a guaranteed minimum of 32, 64 or 128 bit alignment, and with optional writeback of base register by transfer size or register value VSUB.<I8\|I16\|I32\|I64\|F32> <Dreg>,<Dreg>,<Dreg> VSUB.<I8\|I16\|I32\|I64\|F32> <Qreg>,<Qreg>,<Qreg> Subtract NEON vectors of integers or single precision floats VSUBL.<S\|U><8\|16\|32> <Qreg>,<Dreg>,<Dreg> Subtract NEON vectors of signed/unsigned 8/16/32 bit integers and zero or sign extend to double width destination register VSUBW.<S\|U><8\|16\|32> <Qreg>,<Qreg>,<Dreg> Take NEON vector <Dreg> of signed/unsigned 8/16/32 bit integers and zero or sign extend to double width, subtract from double-width <Qreg> to produce double-width result VSWP[.<dt>] <Dreg>,<Dreg> VSWP[.<dt>] <Qreg>,<Qreg> Swap contents of two NEON registers. <dt> ignored. VTBL.8 <Dreg>,{<Dreg>[,<Dreg+1>[,<Dreg+2>[,<Dreg+3>]]]},<Dreg> Use up to four sequential D registers as a lookup table to translate NEON vector of 8 bit values, with out of range values mapped to zero VTBX.8 <Dreg>,{<Dreg>[,<Dreg+1>[,<Dreg+2>[,<Dreg+3>]]]},<Dreg> Use up to four sequential D registers as a lookup table to translate NEON vector of 8 bit values, with out of range values left unchanged VTRN.<8\|16\|32> <Dreg>,<Dreg> VTRN.<8\|16\|32> <Qreg>,<Qreg> Treat two registers as a series of 2x2 matrices containing 8, 16 or 32 bit values and transpose them VTST.<8\|16\|32> <Dreg>,<Dreg>,<Dreg> VTST.<8\|16\|32> <Qreg>,<Qreg>,<Qreg> Bitwise AND two NEON vectors and set destination elements to all ones or all zeroes depending on if result was non-zero or not. 
VUZP.<8\|16\|32> <Dreg>,<Dreg> VUZP.<8\|16\|32> <Qreg>,<Qreg> Treat two NEON vectors as one data stream and de-interleave their elements VZIP.<8\|16\|32> <Dreg>,<Dreg> VZIP.<8\|16\|32> <Qreg>,<Qreg> Interleave the elements of two NEON vectors to produce one data stream

NEONv2
------

VFMA.F32 <Dreg>,<Dreg>,<Dreg> VFMA.F32 <Qreg>,<Qreg>,<Qreg> Fused multiply and accumulate of floating point NEON vector VFMS.F32 <Dreg>,<Dreg>,<Dreg> VFMS.F32 <Qreg>,<Qreg>,<Qreg> Fused multiply and subtraction of floating point NEON vector

I’ve given the new version of the module a quick go and it’s looking good. However it would be great if it could sort the instructions alphabetically when you display the summary list (e.g. for ‘*[VFP’). That will reduce the chances of me foolishly claiming that VABS is missing ;) If you have any questions about instructions, syntax, etc. then feel free to fire away.
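
For anyone puzzling over the difference between VTBL and VTBX in the list above, here is a rough Python model of the two descriptions. It is only a sketch of the stated behaviour, not of any real assembler or emulator: the function names `vtbl` and `vtbx` and the use of plain lists for registers are my own assumptions. The table is the bytes of one to four D registers laid end to end (8 to 32 bytes), and each index is an 8 bit value.

```python
def vtbl(table, indices):
    """Model of VTBL.8: look each index up in the table.

    'table' is the concatenated bytes of 1 to 4 D registers
    (hypothetical representation). Out of range indices give zero.
    """
    return [table[i] if i < len(table) else 0 for i in indices]


def vtbx(dest, table, indices):
    """Model of VTBX.8: as VTBL, but out of range indices leave the
    corresponding destination element unchanged."""
    return [table[i] if i < len(table) else d
            for d, i in zip(dest, indices)]


# Two D registers used as a 16 byte table holding 100..115.
table = list(range(100, 116))
indices = [0, 15, 16, 255, 3, 3, 8, 200]
old_dest = [1, 2, 3, 4, 5, 6, 7, 8]

looked_up = vtbl(table, indices)
extended = vtbx(old_dest, table, indices)
```

With the values above, the elements whose index is 16, 255 or 200 fall outside the 16 byte table, so `vtbl` returns zero in those positions while `vtbx` keeps the old destination bytes.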

Nov 14, 2013 9:58am Steve Drain (222) 1620 posts	Wow! That is a lot of data. I may just take that file, do a minimum of formatting for MessageTrans and leave it at that, relying on your authority for its accuracy. Similarly for the VFP. “it would be great if it could sort the instructions alphabetically when you display the summary list” When I wrote the first version I exploited MessageTrans multi-token lookup to keep the data as small as possible, rather neatly I thought. That made alphabetic sorting impossible. The mass of the VFP and NEON data cannot be done this way, so the saving is now a small fraction of the total. It will be easy to reformat the data to single tokens in alphabetical order, but I will regret the more sledgehammer approach. ;-)
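
To illustrate the reformatting described in that post, here is a small Python sketch that expands shared-token entries into single tokens and sorts them alphabetically. It assumes the Messages file form in which several tokens share one value as `TOKEN1/TOKEN2:text`, with `#` starting a comment line. The function name `flatten_and_sort` and the sample help strings are made up for the example and are not taken from the BasicHelp module.

```python
def flatten_and_sort(lines):
    """Expand 'A/B:text' entries into one entry per token and
    return them sorted alphabetically by token."""
    entries = []
    for line in lines:
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so the help text may
        # itself contain colons.
        tokens, _, value = line.partition(":")
        for token in tokens.split("/"):
            entries.append((token, value))
    entries.sort(key=lambda entry: entry[0])
    return [f"{token}:{value}" for token, value in entries]


# Hypothetical input: two mnemonics sharing one help string.
sample = [
    "# illustrative data only",
    "VMUL/VADD:Arithmetic on NEON vectors",
    "VABS:Absolute value of NEON vector",
]
flat = flatten_and_sort(sample)
```

The cost is the one mentioned in the post: every shared help string is now stored once per token, so the file grows in exchange for a list that can be shown in order.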

Nov 15, 2013 5:50pm Steve Drain (222) 1620 posts	I have done as I said, and taken your files, reformatting them to suit MessageTrans, but otherwise not changing the text. I have not yet made an attempt to analyse what new parameters should go in the PAR topic. The list is also alphabetical now. ;-) I have also made some changes in the code to improve the way tokens are handled and provide sub-headings for the topics so that you can see where help applies when a keyword is found in more than one topic. This version is at: http://kappa.me.uk/Modules/swBasicHelp037.zip I have one question. If you request help from VFP 4, then help is also displayed from VFP common and 2 and 3. What should be displayed of the VFP topics for NEON? I have just VFP common for now.
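
The topic behaviour described in that post can be pictured with a short Python sketch. The topic names and the `topics_for` function are invented for illustration; the real module's topic identifiers are not given in the thread. The NEON case simply encodes the "just VFP common for now" choice and leaves the open question unanswered.

```python
# Hypothetical topic names, in order of increasing VFP version.
VFP_TOPICS = ["VFP common", "VFP 2", "VFP 3", "VFP 4"]


def topics_for(request):
    """Return the topics whose help is shown for a requested topic."""
    if request == "NEON":
        # Current choice in the post: NEON pulls in only VFP common.
        return ["NEON", "VFP common"]
    # A VFP version pulls in common plus every earlier version.
    index = VFP_TOPICS.index(request)
    return VFP_TOPICS[:index + 1]


shown_for_vfp4 = topics_for("VFP 4")
shown_for_neon = topics_for("NEON")
```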
