RISC OS Open: Forum: VFP advice/tutorial

Nov 18, 2013 1:38pm

Jeffrey Lee (213) 6048 posts

I have one question. If you request help from VFP 4, then help is also displayed from VFP common and 2 and 3. What should be displayed of the VFP topics for NEON? I have just VFP common for now.

Yes, VFP common is correct.

I do have one small correction to the NEON instruction list that I produced. VDUP needs to be two entries, because one can accept a condition code and the other can’t:

VDUP.8|16|32 <Dreg>,<Dreg[x]>
VDUP.8|16|32 <Qreg>,<Dreg[x]>

Duplicate element of NEON register into all elements of destination vector

VDUP[<cond>].8|16|32 <Dreg>,<reg>
VDUP[<cond>].8|16|32 <Qreg>,<reg>

Duplicate low bits of ARM register into all elements of destination vector

Nov 18, 2013 2:44pm

Steve Drain (222) 1620 posts

Thanks. I have made those corrections. If all are agreed that this provides the help required and the command names and format are suitable I will register the module and formally release it.

Nov 22, 2013 3:29pm

Steve Drain (222) 1620 posts

I need some help with VFP double precision. I think all the examples I have seen use single precision. What I have done with single so far seems to work, but if I just change to doing the same with double I hit problems.

A small example with a Raspberry Pi and RO5.19, using the BASIC assembler:


STMFD r13!,{r14}
VLDR s0,n0
VLDR s1,n1
VADD.F32 s0,s0,s1
VSTR s0,result
LDMFD r13!,{pc}
.n0:EQUFS 100
.n1:EQUFS 200
.result:EQUFS 0

This works fine, but:


STMFD r13!,{r14}
VLDR d0,n0
VLDR d1,n1
VADD.F64 d0,d0,d1;*
VSTR d0,result
LDMFD r13!,{pc};+
.n0:EQUFD 100
.n1:EQUFD 200
.result:EQUFD 0

This produces ‘Undefined instruction’ at the line marked +. If I remove the line marked * it works.

I have checked the disassembly in StrongED and the assembled code appears to be just what I would expect. What am I missing?

Nov 22, 2013 7:45pm

Kuemmel (439) 384 posts

Hm… I just used the instruction in some code on my PandaBoard (though with quite an old ROM from a year or so ago) and it works correctly… Is EQUFD correct? Wasn’t there something like DCFD.vfp? I remember the thread where I complained about DCFD not working.

Nov 22, 2013 9:04pm

Martin Bazley (331) 379 posts

I have checked the disassembly in StrongED and the assembled code appears to be just what I would expect. What am I missing?

If it works on the PandaBoard but not on the Pi, that strongly suggests to me that the Pi doesn’t support double-precision.

Remember: the Pi is ARMv6, the dev boards are ARMv7. ARMv7 can do quite a few floating point-type things that ARMv6 can’t. While I know next to nothing about the subject myself, it’s always worth bearing in mind the possibility that you’ve run into one of the very few RISC OS-relevant differences between the two architectures.

Nov 22, 2013 9:29pm

Kuemmel (439) 384 posts

Hm, googling suggests the Pi supports double precision, e.g.

http://mindplusplus.wordpress.com/2013/06/25/arm-vfp-vector-programming-part-1-introduction/
https://github.com/PeterLemon/RaspberryPi/blob/master/VFP/Fractal/Mandelbrot/kernel.asm

On the other hand, it could still be a special VFP-activation issue on the Pi? Jeffrey?

Nov 23, 2013 1:05pm

Steve Drain (222) 1620 posts

is EQUFD correct?

I assumed that it was. It is there in the FPA directives to reserve an IEEE double float, which should be the same as VFP requires.

Nov 23, 2013 3:34pm

Jeffrey Lee (213) 6048 posts

The Pi does support double precision floats, but when they’re saved to memory the two words are saved in a different order compared to FPA. FPA saves them in big-endian word order, VFP saves them according to the CPU endianness (i.e. little-endian for all RISC OS machines). But since EQUFD was created back in the days when only FPA was available, it will by default be using the wrong word order for VFP.

To work around this I added support for .vfp and .fpa suffixes on EQUFD & DCFD. So if you use EQUFD.vfp then BASIC will initialise the double using little-endian word order, ready for VFP.
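As an illustrative aside, the two memory layouts described above can be sketched in Python. The function names are hypothetical and the sketch only models the byte order; it is not how BASIC implements EQUFD.

```python
import struct

def pack_vfp(x: float) -> bytes:
    """Little-endian double: low word first, as VFP on a little-endian CPU reads it."""
    return struct.pack("<d", x)

def pack_fpa(x: float) -> bytes:
    """FPA layout: the two 32-bit words swapped (high word at the lower address).
    Each word is itself still stored little-endian."""
    raw = struct.pack("<d", x)
    return raw[4:] + raw[:4]

def read_as_vfp(raw: bytes) -> float:
    """Interpret 8 bytes the way a VLDR of a D register would."""
    return struct.unpack("<d", raw)[0]

if __name__ == "__main__":
    print(pack_vfp(100.0).hex())           # layout modelled for EQUFD.vfp
    print(pack_fpa(100.0).hex())           # layout modelled for plain EQUFD
    print(read_as_vfp(pack_fpa(100.0)))    # not 100.0: the words are in the wrong order
```

Reading the FPA-ordered bytes as a VFP double gives a value that is nowhere near 100, which is the mismatch the .vfp suffix avoids.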

The reason you’re getting an undefined instruction abort is probably because the word-swapped values being loaded were either being interpreted as NaNs or denormalised numbers – which the VFP unit in the Pi can’t handle properly and so raises the exception so that support code on the CPU can handle it (except we haven’t implemented any support code for that yet). The exceptions can also be avoided by using “RunFast” mode (enable flush-to-zero and default NaN modes, by setting bits 24 & 25 of the FPSCR).
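A small Python sketch, under the assumption that the constants in the earlier example (100 and 200) were stored in FPA word order, shows why the VFP unit would see denormalised numbers. It also computes the FPSCR mask for bits 24 and 25. The helper names are hypothetical; on real hardware the register itself would be read and written with VMRS/VMSR.

```python
import struct

def double_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of x."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def swap_words(bits: int) -> int:
    """Exchange the high and low 32-bit words of a 64-bit pattern."""
    return ((bits & 0xFFFFFFFF) << 32) | (bits >> 32)

def classify(bits: int) -> str:
    """Classify a double by its exponent and fraction fields."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        return "nan" if fraction else "infinity"
    if exponent == 0:
        return "denormal" if fraction else "zero"
    return "normal"

# FPSCR bits named in the post: flush-to-zero (24) and default NaN (25).
FPSCR_FZ = 1 << 24
FPSCR_DN = 1 << 25
RUNFAST_MASK = FPSCR_FZ | FPSCR_DN

def enable_runfast(fpscr: int) -> int:
    """Return the FPSCR value with both bits set, leaving other bits alone."""
    return fpscr | RUNFAST_MASK

if __name__ == "__main__":
    for value in (100.0, 200.0):
        print(value, classify(swap_words(double_bits(value))))
    print(hex(RUNFAST_MASK))
```

For 100.0 the pattern is &4059000000000000; with the words swapped the exponent field becomes zero while the fraction is non-zero, which is exactly a denormalised number.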

Nov 23, 2013 4:17pm

Steve Drain (222) 1620 posts

Ah! All is revealed. It momentarily left a problem with VSTR, which also stores the words in the wrong order to read in BASIC VI. I solved that by decomposing the double into two singles:


VSTR s0,Result+4;d0[0]
VSTR s1,Result  ;d0[1]

There may well be a better way to do this. ;-)
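The effect of the two VSTR instructions can be modelled in Python as a sanity check. This is only a sketch: the 8-byte buffer stands in for Result, and the function names are hypothetical.

```python
import struct

def store_d0_as_two_singles(x: float) -> bytes:
    """Model storing d0 via its two halves: s0 is the low word, s1 the high word."""
    raw = struct.pack("<d", x)
    s0, s1 = raw[:4], raw[4:]
    mem = bytearray(8)
    mem[4:8] = s0   # VSTR s0,Result+4
    mem[0:4] = s1   # VSTR s1,Result
    return bytes(mem)

def read_fpa_double(mem: bytes) -> float:
    """Read a double stored in FPA word order (high word at the lower address)."""
    return struct.unpack("<d", mem[4:8] + mem[0:4])[0]

if __name__ == "__main__":
    stored = store_d0_as_two_singles(300.0)
    print(stored.hex())
    print(read_fpa_double(stored))
```

The high word ends up at the lower address, which is the order an FPA-format reader such as BASIC VI expects.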

Nov 28, 2013 10:37pm

Kuemmel (439) 384 posts

Hi Jeffrey,

I might have come across another error in the BASIC assembler (though it might be fixed, as I use quite an old version…). Whereas

VMLA.F32 Q2,Q4,D14[0]

works without any problem as it should,

VMLA.F32 Q2,Q4,D22[0]

assembles, but produces the wrong result. When I looked in the ARM ARM, it says that only D0-D15 are allowed in those circumstances, so I suspect that the assembler doesn’t detect the mistake and assembles some wrong code. This might also be a problem for similar instructions.
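The D0-D15 restriction follows from how the by-scalar operand is encoded: the ARM ARM gives it a 4-bit Vm field plus an M bit, and for 32-bit elements M carries the index. The Python sketch below models that check. The function names are hypothetical, and the aliasing shown at the end is an assumption about how an unchecked register number could go wrong, not a description of the actual fault in the BASIC assembler.

```python
def encode_scalar_operand(size_bits: int, dreg: int, index: int) -> tuple:
    """Return the (M, Vm) fields for a Dm[x] scalar operand, or raise ValueError."""
    if size_bits == 32:
        if not 0 <= dreg <= 15:
            raise ValueError("32-bit scalars must be in D0-D15")
        if index not in (0, 1):
            raise ValueError("32-bit scalar index must be 0 or 1")
        return index, dreg                       # M = index, Vm = register
    if size_bits == 16:
        if not 0 <= dreg <= 7:
            raise ValueError("16-bit scalars must be in D0-D7")
        if not 0 <= index <= 3:
            raise ValueError("16-bit scalar index must be 0 to 3")
        return index >> 1, ((index & 1) << 3) | dreg   # index = M:Vm<3>
    raise ValueError("unsupported scalar size")

def naive_split(dreg: int) -> tuple:
    """Hypothetical unchecked encoding: treat the register as a 5-bit M:Vm number."""
    return dreg >> 4, dreg & 15

def decode_scalar_32(m_bit: int, vm: int) -> tuple:
    """Decode a 32-bit scalar operand back to (register, index)."""
    return vm, m_bit

if __name__ == "__main__":
    print(encode_scalar_operand(32, 14, 0))       # D14[0] is encodable
    print(decode_scalar_32(*naive_split(22)))     # D22 would be read back as D6[1]
```

Under that assumption, D22[0] would silently be treated as D6[1], which would fit "assembles, but gives the wrong answer".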

Nov 29, 2013 10:13am

Steve Drain (222) 1620 posts

If you have any questions about instructions, syntax, etc. then feel free to fire away.

Oops! In addition to what I reported elsewhere:

VNEG requires a suffix: VNEG.F32 or VNEG.F64

I have found that:

VSTM and VLDM do not require a suffix

Nov 29, 2013 10:52am

Jeffrey Lee (213) 6048 posts

I might have come across another error in the BASIC assembler (though it might be fixed, as I use quite an old version…). Whereas

VMLA.F32 Q2,Q4,D14[0]

works without any problem as it should,

VMLA.F32 Q2,Q4,D22[0]

assembles, but produces the wrong result

OK, I’ll check to see if this is still a problem with the latest version.

VNEG requires a suffix: VNEG.F32 or VNEG.F64

Yeah, looks like that was a mistake in the list I made. The assembler sources clearly show that the suffix is required.

VSTM and VLDM do not require a suffix

Do not require or do not support? The sources suggest that the suffix is optional, and that’s what I had in the instruction list I made:

VLDM<IA|DB>[<cond>][.32] <reg>[!],{<sregs>}
VLDM<IA|DB>[<cond>][.64] <reg>[!],{<dregs>}

Nov 29, 2013 5:34pm

Steve Drain (222) 1620 posts

In the UAL QRC that I have, a suffix is not shown for any of the store and load instructions.

I have so far just found out that the BASIC assembler does not seem to support a suffix. There may also be some problems with the stack type suffixes, but I have not had time to refine that yet, and I will go to the source to check some of these before I comment further.

Nov 29, 2013 5:41pm

Jeffrey Lee (213) 6048 posts

In the UAL QRC that I have, a suffix is not shown for any of the store and load instructions.

The ARM ARM shows them as being optional. They don’t really add anything to the instruction since you can only specify a size that’s identical to the register size.

Nov 30, 2013 2:34pm

Jeffrey Lee (213) 6048 posts

I’ve now submitted a fix for the VMLA issue. It was a mistake in the definitions of a few of the instructions – VMLA[L] and VMLS[L] were wrong but the others were all fine.

Dec 13, 2013 2:34pm

Steve Drain (222) 1620 posts

The reason you’re getting an undefined instruction abort is probably because the word-swapped values being loaded were either being interpreted as NaNs or denormalised numbers – which the VFP unit in the Pi can’t handle properly and so raises the exception so that support code on the CPU can handle it (except we haven’t implemented any support code for that yet).

I implemented your ‘RunFast’ mode and I do avoid various aborts. However, I am getting to the stage where I would like to detect and take action on exceptions. Is there a prospect of that support code becoming available, or is there something I can do myself to use them?

Dec 13, 2013 2:57pm

Steve Drain (222) 1620 posts

I’ll check that Chebyshev thingy out. Found a nice page on it that generates a C-Code from your input → http://metamerist.com/cheby/example38.htm From there should be easy to implement in VFP.

I have used the code from that site and I also found another that purported to do something similar. Unfortunately, neither can provide sufficient precision to return true double floats to IEEE standard.

However, I did find a couple of papers from ‘The Ganssle Group’ that were very helpful. These have allowed me to implement the basic trig functions, but do not have enough information for the others. The precision is even a little better than the FPE but the speed is only about 30% quicker for the most complex, TAN. Simple VFP is about 50 times faster.

In several places I have found mention of “Computer Approximations” by John Hart et al, but it is long out of print and quite expensive second hand. This appears to be the ‘bible’ for the algorithms I need. I will continue my search. ;-)
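For readers unfamiliar with the technique, polynomial approximations of this kind are normally evaluated with Horner's rule, where each step is one multiply and one add. The Python sketch below is only an illustration: it uses plain Taylor coefficients for sine on a small range, not the coefficients from the Ganssle papers or from Hart, and the names are hypothetical.

```python
import math

def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... by Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c          # one multiply and one add per coefficient
    return acc

# Taylor coefficients of sin(x)/x in powers of z = x*x: 1 - z/3! + z**2/5! - ...
SIN_COEFFS = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(8)]

def sin_poly(x):
    """Approximate sin(x) for small |x| (roughly |x| <= pi/4)."""
    z = x * x
    return x * horner(SIN_COEFFS, z)

if __name__ == "__main__":
    print(sin_poly(0.5), math.sin(0.5))
```

A production routine would add range reduction and use minimax coefficients, which is where the published tables come in.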

Dec 15, 2013 3:17pm

Jeffrey Lee (213) 6048 posts

I implemented your ‘RunFast’ mode and I do avoid various aborts. However, I am getting to the stage where I would like to detect and take action on exceptions. Is there a prospect of that support code becoming available, or is there something I can do myself to use them?

I will be implementing the support code at some point (unless someone beats me to it!), but I don’t have any timescales for when. It’s something which I’d consider a requirement before the Pi port can be considered suitable for a stable release, so if the plan is for an official 5.22 release on the Pi then I’ll presumably try and find the time to get it done before then!

Dec 18, 2013 4:54pm

Steve Drain (222) 1620 posts

“Computer Approximations” by John Hart et al

I have now acquired a copy of this book. It is certainly terse, but everything you could require is there, so I can see why it is so highly regarded. Putting it into practice is another mind-stretching operation. ;-)

I have realised that my timings were poorly implemented. I was not allowing for the time taken by the VFP context SWIs, which are quite expensive for small sections of code. I can now say that the VFP polynomial TAN code is over 20 times as fast as FPE, which is much more reasonable.

Dec 19, 2013 6:01pm

Steve Drain (222) 1620 posts

The key bit in that text is “supported on some systems”. At the moment I haven’t actually implemented support for that flag, so you’ll be able to use the contexts from user mode regardless of the setting. Depending on how lazy I’m feeling that might change in the future :)

This goes back a bit and refers to the User mode flag. I have just discovered where it matters. ;-)

I have been implementing my code as SWIs, with a context created by the module and changed to by each SWI. It worked just fine. Because of the delays introduced by the context change I decided to leave context control to the calling program, in this case BASIC.

I could check that the program context was active in the SWIs with VFPSupport_ActiveContext, but I was getting an ‘Undefined instruction’ error at the first VFP instruction, indicating that VFP was not actually correct. This led to a lot of head-banging until I eventually set the User mode flag. Now all is sweetness and light again, but I am not really sure what I have effected here.

I am using a Pi with v5.19.

Dec 19, 2013 6:28pm

Steve Drain (222) 1620 posts

There is something I am missing in the BASIC VFP assembler. I need a VLDRL pseudo-instruction to get at all the constants involved in these polynomial approximations. I cannot use an adaptation of the LDRL macros, because I have not spotted how you can pass a VFP register variable to a function; ARM registers are just an integer. I have cobbled something together, but VLDRL would be rather neater.

Dec 19, 2013 7:24pm

Kuemmel (439) 384 posts

…why not use VLDMIA or that multi-purpose VLDn? What’s your data structure like?

Dec 19, 2013 11:12pm

Kuemmel (439) 384 posts

@Jeffrey: I might have found another error, not completely sure about the ARM ARM wording though.

Can it be the case that the current BASIC assembler doesn’t allow things like VSTMIA R0!,{S0,S2} or VSTMIA R0!,{D0,D2}? Shouldn’t those be allowed, similar to STMIA/LDMIA … {R0,R2}? Only consecutive registers, e.g. {S0,S1} or {D0,D1} or {Dx-Dy}, seem to work; the others give syntax errors.

Dec 19, 2013 11:50pm

Jeffrey Lee (213) 6048 posts

Only consecutive registers are allowed – it’s a limitation of the instruction set.
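The reason is that VLDM/VSTM encode a starting register and a count, rather than the 16-bit register bitmask that LDM/STM use. A Python sketch of the check an assembler has to make (the function name is hypothetical, and registers are given as plain numbers):

```python
def encode_vfp_reglist(regs):
    """Return (first_register, count) for a VLDM/VSTM list, or raise ValueError.

    Only a run of consecutive, ascending register numbers can be expressed,
    because the instruction stores just a start register and a length.
    """
    regs = list(regs)
    if not regs:
        raise ValueError("register list is empty")
    first = regs[0]
    for offset, reg in enumerate(regs):
        if reg != first + offset:
            raise ValueError("registers must be consecutive and ascending")
    return first, len(regs)

if __name__ == "__main__":
    print(encode_vfp_reglist([0, 1]))        # {D0,D1} is fine
    print(encode_vfp_reglist([4, 5, 6, 7]))  # {D4-D7} is fine
    try:
        encode_vfp_reglist([0, 2])           # {D0,D2} has a gap
    except ValueError as err:
        print("rejected:", err)
```

So a list with a gap, such as {D0,D2}, has no encoding at all, and a syntax error is the expected outcome.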

Dec 20, 2013 10:01am

Steve Drain (222) 1620 posts

@Kuemmel

VLDn is for NEON only; I am using VFP double precision. The constants I am concerned with are not used consecutively, so VLDMIA is not appropriate, at the moment.

In my routines the polynomial coefficients are special and stored locally, so there is no problem there. I am concerned with the number of multiples of PI etc, only a handful of which are used in any one routine. I also have 0, 1 and 10 as constants, because the RPi does not support VFP 3 and immediate constants.

The parsimony of years of ARM assembler tells me to store these constants once only, but that means addresses can be outside the immediate range – hence the desire for VLDRL. I am now thinking that this is not really necessary, so I will probably store copies of such constants along with each routine instead.
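For reference, the immediate range in question is small: VLDR takes an 8-bit offset scaled by 4, so a PC-relative load can reach only about ±1020 bytes, compared with ±4095 for an ARM LDR. A Python sketch of the range check follows; the function name and the example addresses are hypothetical, and ARM state is assumed, where the PC reads as the instruction address plus 8.

```python
def vldr_pc_offset(instr_addr: int, target_addr: int) -> int:
    """Return the byte offset a PC-relative VLDR would need, or raise ValueError."""
    pc = instr_addr + 8                 # ARM state: PC is two instructions ahead
    offset = target_addr - pc
    if offset % 4 != 0:
        raise ValueError("target must be word aligned relative to the PC")
    if abs(offset) > 1020:              # imm8 * 4
        raise ValueError("target is out of range for VLDR")
    return offset

if __name__ == "__main__":
    print(vldr_pc_offset(0x8000, 0x8000 + 8 + 1020))   # furthest reachable forward
    try:
        vldr_pc_offset(0x8000, 0x8000 + 8 + 1024)      # one word too far
    except ValueError as err:
        print("rejected:", err)
```

Keeping a copy of each shared constant next to the routine that uses it, as suggested above, keeps every target inside this window.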

Here is a general question about VLDM. I believe that LDM is not always quicker than separate LDRs, and that some assemblers unroll LDM. I have also read some comments about the VFP vector operations not being particularly quick. Does this affect VLDM, and is there any substantial advantage in using it over separate VLDRs?

