VFP advice/tutorial
Jeffrey Lee (213) 6048 posts |
Yes, VFP common is correct. I do have one small correction to the NEON instruction list that I produced. VDUP needs to be two entries, because one can accept a condition code and the other can’t:
VDUP.8|16|32 <Dreg>,<Dreg[x]>
VDUP.8|16|32 <Qreg>,<Dreg[x]>
Duplicate element of NEON register into all elements of destination vector
VDUP[<cond>].8|16|32 <Dreg>,<reg>
VDUP[<cond>].8|16|32 <Qreg>,<reg>
Duplicate low bits of ARM register into all elements of destination vector |
Steve Drain (222) 1620 posts |
Thanks. I have made those corrections. If all are agreed that this provides the help required and the command names and format are suitable I will register the module and formally release it. |
Steve Drain (222) 1620 posts |
I need some help with VFP double precision. I think all the examples I have seen use single precision. What I have done with single so far seems to work, but if I just change to doing the same with double I hit problems. A small example with a Raspberry Pi and RO5.19, using the BASIC assembler:
This works fine, but:
This produces ‘Undefined instruction’ at the line marked +. If I remove the line marked * it works. I have checked the disassembly in StrongED and the assembled code appears to be just what I would expect. What am I missing? |
Kuemmel (439) 384 posts |
Hm…just used the instruction in some code on my PandaBoard (though a quite old ROM from a year ago or so) and it works correctly…is EQUFD correct !? Wasn’t there something like DCFD.vfp ? I remember the thread where I complained about DCFD not working. |
Martin Bazley (331) 379 posts |
If it works on the PandaBoard but not on the Pi, that strongly suggests to me that the Pi doesn’t support double-precision. Remember: the Pi is ARMv6, the dev boards are ARMv7. ARMv7 can do quite a few floating point-type things that ARMv6 can’t. While I know next to nothing about the subject myself, it’s always worth bearing in mind the possibility that you’ve run into one of the very few RISC OS-relevant differences between the two architectures. |
Kuemmel (439) 384 posts |
Hm, Googling suggests the Pi supports double precision, e.g. http://mindplusplus.wordpress.com/2013/06/25/arm-vfp-vector-programming-part-1-introduction/ On the other hand, it could still be a special VFP-activation issue on the Pi? Jeffrey? |
Steve Drain (222) 1620 posts |
I assumed that it was. It is there in the FPA directives to reserve an IEEE double float, which should be the same as VFP requires. |
Jeffrey Lee (213) 6048 posts |
The Pi does support double precision floats, but when they’re saved to memory the two words are saved in a different order compared to FPA. FPA saves them in big-endian word order, VFP saves them according to the CPU endianness (i.e. little-endian for all RISC OS machines). But since EQUFD was created back in the days when only FPA was available, it will by default be using the wrong word order for VFP. To work around this I added support for .vfp and .fpa suffixes on EQUFD & DCFD. So if you use EQUFD.vfp then BASIC will initialise the double using little-endian word order, ready for VFP. The reason you’re getting an undefined instruction abort is probably because the word-swapped values being loaded were either being interpreted as NaNs or denormalised numbers – which the VFP unit in the Pi can’t handle properly and so raises the exception so that support code on the CPU can handle it (except we haven’t implemented any support code for that yet). The exceptions can also be avoided by using “RunFast” mode (enable flush-to-zero and default NaN modes, by setting bits 24 & 25 of the FPSCR). |
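The word-order difference described above can be illustrated off the target machine. The following is a minimal Python sketch, not RISC OS code: the helper names (`vfp_bytes`, `fpa_bytes`, `read_as_vfp`, `classify`) are made up for this illustration, and it assumes the FPA layout is "high word first, low word second" with each word stored little-endian, as the post describes.

```python
import math
import struct

def vfp_bytes(value):
    """Bytes of a double as VFP stores it on a little-endian CPU."""
    return struct.pack("<d", value)

def fpa_bytes(value):
    """Bytes of a double in FPA word order: high word first, low word second."""
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return struct.pack("<II", high, low)

def read_as_vfp(raw):
    """Interpret eight bytes the way a VFP load would on a little-endian CPU."""
    return struct.unpack("<d", raw)[0]

def classify(value):
    """Rough classification of an IEEE 754 double."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "infinity"
    if value != 0.0 and abs(value) < 2.0 ** -1022:
        return "denormal"
    return "normal or zero"

# 1.0 laid out in FPA order (what a plain EQUFD would produce, per the post)
# and then loaded as if it were a VFP double.
misread = read_as_vfp(fpa_bytes(1.0))
print(classify(read_as_vfp(vfp_bytes(1.0))))
print(classify(misread))
```

With the words swapped, the exponent field of 1.0 ends up in the low half of the mantissa, so the loaded value has a zero exponent and a non-zero mantissa. That is the kind of denormalised input the post says the Pi's VFP unit hands off to (currently missing) support code.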
Steve Drain (222) 1620 posts |
Ah! All is revealed. It momentarily left a problem with VSTR, which also stores the words in the wrong order to read in BASIC VI. I solved that by decomposing the double into two singles:
There may well be a better way to do this. ;-) |
Kuemmel (439) 384 posts |
Hi Jeffrey, I might have come across another error in the Basic Assembler (though it might be fixed, as I use a quite old version…). Whereas
works without any problem as it should,
assembles, but produces the wrong result. When I looked in the ARM ARM, it says that only D0-D15 are allowed in those circumstances, so I suspect that the assembler doesn’t detect the mistake and assembles some wrong code. The same problem might also affect similar instructions. |
Steve Drain (222) 1620 posts |
Oops! In addition to what I reported elsewhere:
I have found that:
|
Jeffrey Lee (213) 6048 posts |
“I might have come across another error in the Basic Assembler (though it might be fixed, as I use a quite old version…). Whereas”
OK, I’ll check to see if this is still a problem with the latest version.
Yeah, looks like that was a mistake in the list I made. The assembler sources clearly show that the suffix is required.
Do not require or do not support? The sources suggest that the suffix is optional, and that’s what I had in the instruction list I made:
VLDM<IA|DB>[<cond>][.32] <reg>[!],{<sregs>}
VLDM<IA|DB>[<cond>][.64] <reg>[!],{<dregs>} |
Steve Drain (222) 1620 posts |
In the UAL QRC that I have, a suffix is not shown for any of the store and load instructions. I have so far just found out that the BASIC assembler does not seem to support a suffix. There may also be some problems with the stack type suffixes, but I have not had time to refine that yet, and I will go to the source to check some of these before I comment further. |
Jeffrey Lee (213) 6048 posts |
The ARM ARM shows them as being optional. They don’t really add anything to the instruction since you can only specify a size that’s identical to the register size. |
Jeffrey Lee (213) 6048 posts |
I’ve now submitted a fix for the VMLA issue. It was a mistake in the definitions of a few of the instructions – VMLA[L] and VMLS[L] were wrong but the others were all fine. |
Steve Drain (222) 1620 posts |
I implemented your ‘RunFast’ mode and I do avoid various aborts. However, I am getting to the stage where I would like to detect and take action on exceptions. Is there a prospect of that support code becoming available, or is there something I can do myself to handle them? |
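For reference, the bit arithmetic behind the ‘RunFast’ setting mentioned earlier in the thread (bits 24 and 25 of the FPSCR) can be sketched as follows. This is only a Python illustration of the mask: the constant and function names are invented here, and on real hardware the value would be read and written with VMRS/VMSR rather than passed around as an integer. As far as I know, ARM's own description of RunFast also expects the exception trap enable bits to be clear, which this sketch does not touch.

```python
FPSCR_FZ = 1 << 24  # flush-to-zero enable (bit 24)
FPSCR_DN = 1 << 25  # default NaN enable (bit 25)
RUNFAST_MASK = FPSCR_FZ | FPSCR_DN

def enable_runfast(fpscr):
    """Return the FPSCR value with both bits set, other bits unchanged."""
    return (fpscr | RUNFAST_MASK) & 0xFFFFFFFF

def is_runfast(fpscr):
    """True if both flush-to-zero and default NaN are enabled."""
    return (fpscr & RUNFAST_MASK) == RUNFAST_MASK

print(hex(RUNFAST_MASK))
```

The mask is the value one would ORR into the register after reading it, so the rounding mode and any other existing settings are preserved.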
Steve Drain (222) 1620 posts |
I have used the code from that site and I also found another that purported to do something similar. Unfortunately, neither can provide sufficient precision to return true double floats to IEEE standard. However, I did find a couple of papers from ‘The Ganssle Group’ that were very helpful. These have allowed me to implement the basic trig functions, but do not have enough information for the others. The precision is even a little better than the FPE but the speed is only about 30% quicker for the most complex, TAN. Simple VFP is about 50 times faster. In several places I have found mention of “Computer Approximations” by John Hart et al, but it is long out of print and quite expensive second hand. This appears to be the ‘bible’ for the algorithms I need. I will continue my search. ;-) |
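The general shape of a polynomial trig routine, range reduction followed by Horner evaluation, can be sketched as below. This is not the code from the post or from either reference: the coefficients here are plain Taylor series terms generated with `math.factorial`, chosen only because they are easy to check, whereas Hart and the Ganssle papers give tuned coefficients. The function names are hypothetical.

```python
import math

def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Taylor coefficients in powers of r*r (an assumption for this sketch,
# not the minimax coefficients a production routine would use).
SIN_COEFFS = [(-1.0) ** n / math.factorial(2 * n + 1) for n in range(8)]
COS_COEFFS = [(-1.0) ** n / math.factorial(2 * n) for n in range(9)]

def poly_sin(x):
    """Sine via reduction to |r| <= pi/4 and two short polynomials."""
    k = round(x / (math.pi / 2))   # nearest multiple of pi/2
    r = x - k * (math.pi / 2)      # reduced argument
    r2 = r * r
    s = r * horner(SIN_COEFFS, r2)  # approximates sin(r)
    c = horner(COS_COEFFS, r2)      # approximates cos(r)
    # sin(r + k*pi/2) cycles through sin, cos, -sin, -cos
    return (s, c, -s, -c)[k % 4]

print(poly_sin(1.0), math.sin(1.0))
```

Each Horner step is one multiply and one add, which is why this style of routine maps well onto a short run of VFP multiply-accumulate instructions with the coefficients held in a local table.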
Jeffrey Lee (213) 6048 posts |
I will be implementing the support code at some point (unless someone beats me to it!), but I don’t have any timescales for when. It’s something which I’d consider a requirement before the Pi port can be considered suitable for a stable release, so if the plan is for an official 5.22 release on the Pi then I’ll presumably try and find the time to get it done before then! |
Steve Drain (222) 1620 posts |
I have now acquired a copy of this book. It is certainly terse, but everything you could require is there, so I can see why it is so highly regarded. Putting it into practice is another mind-stretching operation. ;-) I have realised that my timings were poorly implemented. I was not allowing for the time taken by the VFP context SWIs, which are quite expensive for small sections of code. I can now say that the VFP polynomial TAN code is over 20 times as fast as FPE, which is much more reasonable. |
Steve Drain (222) 1620 posts |
This goes back a bit and refers to the User mode flag. I have just discovered where it matters. ;-) I have been implementing my code as SWIs, with a context created by the module and changed to by each SWI. It worked just fine. Because of the delays introduced by the context change I decided to leave context control to the calling program, in this case BASIC. I could check that the program context was active in the SWIs with VFPSupport_ActiveContext, but I was getting an ‘Undefined instruction’ error at the first VFP instruction, indicating that VFP was not actually correct. This led to a lot of head-banging until I eventually set the User mode flag. Now all is sweetness and light again, but I am not really sure what I have effected here. I am using a Pi with v5.19. |
Steve Drain (222) 1620 posts |
There is something I am missing in the BASIC VFP assembler. I need a VLDRL pseudo-instruction to get at all the constants involved in these polynomial approximations. I cannot use an adaptation of the LDRL macros, because I have not spotted how you can pass a VFP register variable to a function; ARM registers are just an integer. I have cobbled something together, but VLDRL would be rather neater. |
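The arithmetic a VLDRL-style macro would need can be sketched without touching the assembler. VLDR takes an 8-bit immediate scaled by 4, so it reaches at most 1020 bytes either side of the base, and in ARM state a PC-relative base reads as the instruction address plus 8. The Python below is only an illustration with made-up names; it assumes the long form would be an ADD or SUB of the out-of-range part into a scratch register followed by a VLDR with the remainder.

```python
VLDR_MAX_OFFSET = 1020  # 8-bit immediate scaled by 4

def vldr_reachable(instr_addr, const_addr):
    """True if a PC-relative VLDR at instr_addr can reach const_addr directly."""
    offset = const_addr - (instr_addr + 8)  # PC reads as instruction + 8
    return offset % 4 == 0 and abs(offset) <= VLDR_MAX_OFFSET

def split_offset(offset):
    """Split a word-aligned offset into (add_part, vldr_part).

    vldr_part fits the VLDR immediate; add_part is what an ADD/SUB
    would have to supply first. add_part may itself need more than one
    instruction if it does not fit an ARM rotated 8-bit immediate.
    """
    if offset % 4 != 0:
        raise ValueError("VLDR offsets must be a multiple of 4")
    sign = -1 if offset < 0 else 1
    magnitude = abs(offset)
    vldr_part = magnitude & 0x3FC        # bits 2..9
    add_part = magnitude - vldr_part     # bits 10 and above
    return sign * add_part, sign * vldr_part

print(split_offset(5000))
```

Keeping both parts the same sign means the scratch register moves towards the constant and the VLDR finishes the job, which mirrors how the integer LDRL macros are usually described.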
Kuemmel (439) 384 posts |
…why not use VLDMIA or that multi-purpose VLDn? What’s your data structure like? |
Kuemmel (439) 384 posts |
@Jeffrey: I might have found another error, though I’m not completely sure about the ARM ARM wording. Can it be the case that the current BASIC assembler doesn’t allow stuff like VSTMIA R0!,{S0,S2} or VSTMIA R0!,{D0,D2}? Shouldn’t those be allowed, similar to STMIA/LDMIA … {R0,R2}? Only consecutive registers, e.g. {S0,S1} or {D0,D1} or {Dx-Dy}, seem to work, whereas the others give syntax errors. |
Jeffrey Lee (213) 6048 posts |
Only consecutive registers are allowed – it’s a limitation of the instruction set. |
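The rule can be expressed as a small checker. This Python sketch uses invented names and only tests the "one bank, consecutive numbers" property discussed here; it does not check the other limits the architecture places on VLDM/VSTM lists, such as the maximum number of registers.

```python
import re

def parse_reg(name):
    """Return (bank, number) for names like 'S3' or 'D10'."""
    m = re.fullmatch(r"([SD])(\d+)", name.strip().upper())
    if m is None:
        raise ValueError("not a VFP register: %r" % name)
    return m.group(1), int(m.group(2))

def is_valid_vldm_list(text):
    """True if a list such as '{D0,D1}' or '{S4-S7}' is one consecutive run."""
    inner = text.strip().strip("{}")
    regs = []
    for item in inner.split(","):
        if "-" in item:
            first, last = item.split("-")
            bank1, n1 = parse_reg(first)
            bank2, n2 = parse_reg(last)
            if bank1 != bank2 or n2 < n1:
                return False
            regs.extend((bank1, n) for n in range(n1, n2 + 1))
        else:
            regs.append(parse_reg(item))
    if len({bank for bank, _ in regs}) != 1:
        return False  # S and D registers cannot be mixed
    numbers = [n for _, n in regs]
    return all(b == a + 1 for a, b in zip(numbers, numbers[1:]))

for example in ("{S0,S2}", "{D0,D2}", "{S0,S1}", "{D0-D3}"):
    print(example, is_valid_vldm_list(example))
```

The restriction follows from the encoding: the instruction holds a first register and a count, not a bit mask like LDM/STM, so gaps cannot be represented.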
Steve Drain (222) 1620 posts |
@Kuemmel VLDn is for NEON only; I am using VFP double precision. The constants I am concerned with are not used consecutively, so VLDMIA is not appropriate, at the moment. In my routines the polynomial coefficients are special and stored locally, so there is no problem there. I am concerned with the number of multiples of PI etc, only a handful of which are used in any one routine. I also have 0, 1 and 10 as constants, because the RPi does not support VFPv3 and its immediate constants. The parsimony of years of ARM assembler tells me to store these constants once only, but that means addresses can be outside the immediate range – hence the desire for VLDRL. I am now thinking that this is not really necessary, so I will probably store copies of such constants along with each routine instead. Here is a general question about VLDM. I believe that LDM is not always quicker than separate LDRs, and that some assemblers unroll LDM. I have also read some comments about the VFP vector operations not being particularly quick. Does this affect VLDM and is there any substantial advantage in using it over separate VLDRs? |