VFP advice/tutorial
Steve Drain (222) 1620 posts |
Head-banging mode again. I have just removed this flag and all still works. On further inspection, it seems as though I had a typo that might have been responsible in the first place, but it is still weird. Anyway, my apologies if I have stirred things up without reason. |
Kuemmel (439) 384 posts |
Even if you are using VFP double precision, you could still use NEON store/load instructions if it’s beneficial in terms of speed. Maybe you can post your piece of code? Regarding the VLDM issue: I made some tests a long time ago and, if I remember correctly, it was fastest to use VLDM/VSTM operations with 4 or 8 Dx registers. There are some Google results about optimizing memory copies with NEON. There were even experiments combining NEON with normal ARM instructions, but in the end the difference isn’t that big. When I write my inner loop routines I always do a performance check of VLDR/VSTR/VLDM/VSTM to see which is best. Kind of trial and error. The most important low-level optimization for VFP and NEON, I think, is to create a flow of independent consecutive operations. |
Jeffrey Lee (213) 6048 posts |
FYI I’ve started work on implementing the support code that’s needed to use VFP on the Pi outside of RunFast mode. It looks like the support code basically boils down to an IEEE-compliant software implementation of most of the data processing instructions. When the hardware finds some inputs it can’t handle, or (if you’ve got some of the exception flags enabled) it spots something which it thinks might trigger an exception, it fires off an undefined instruction exception and the support code has to deal with performing the operation itself. After considering my options I concluded that the best course of action was to use the SoftFloat library for the core maths, as it looked like it would be simple to integrate and 99% bug-free. Other options I’d considered were to take large chunks of code from FPEmulator (which looked like it would have been a bit tricky, especially since I need to add support for input subnormal exceptions) or to write my own code from scratch (time-consuming and buggy!) At the moment it’s at the stage where it can decode and handle most of the necessary instructions, but there’s still a lot more to do – mainly hooking up some code to throw floating point exceptions, fleshing out the control logic a bit more (there are three places instructions could come from), and then lots of testing. |
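The hand-off described above hinges on the hardware spotting awkward operands such as subnormals and bouncing them to software. As a rough illustration of what "inputs it can't handle" means (my own sketch in Python; the real support code does this on raw register contents in assembler, built on SoftFloat), a double can be classified from its bit pattern like this:

```python
import struct

def classify_double(x):
    """Classify an IEEE 754 double from its raw bit pattern.

    Illustrative sketch only, not the actual VFP support code.
    """
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        return 'nan' if mantissa else 'inf'
    if exponent == 0:
        return 'zero' if mantissa == 0 else 'subnormal'
    return 'normal'

# A subnormal input is one kind of operand the hardware bounces
# to the support code rather than handling itself.
print(classify_double(5e-324))   # smallest positive subnormal
print(classify_double(1.0))      # normal
```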
Steve Drain (222) 1620 posts |
You have got around to this far sooner than I expected. Many thanks. I am not sure of the extent of what you are writing, but I am assuming it is not an implementation of the FPE using VFP. I may have missed something, but I think I have come to realise that that might not be hugely profitable in a universal fashion, because of the need to maintain and switch contexts. Reading the SoftFloat library, I have not seen any advanced operations, such as trigonometry, so am I right that this is not in the scope of what you are doing? I have not made much progress over the holiday with my little project, but I have got code doing all the basic ops, including the trig functions, and I now have the exponentials and logarithms to go. I am engaged with it mainly for my own satisfaction and to get used to handling some VFP, but it might be useful to others. It is in the form of a module with SWIs operating on double-precision floats. It is designed to work on all machines, defaulting to FPA instructions. I have put a copy of the work-in-progress FloatMod on my site. At the moment it uses RunFast mode with some attempt at guarding against invalid values, but it would be handy to have a way to deal with exceptions instead. |
Jeffrey Lee (213) 6048 posts |
I wasn’t sure how long it was going to take, so I figured it was a good idea to get started on it early.
Correct. |
Jeffrey Lee (213) 6048 posts |
I’ve now got all the support code implemented, and have moved on to testing it. Although I’m yet to fully configure it, TestFloat reports that everything is OK, apart from one issue – there’s a small chance that a 64bit division will take about two minutes to complete. It looks like it’s a problem with an interrupt firing at the wrong time and corrupting a register or something – in a BASIC test harness I can run division operations in a loop (with exactly the same inputs) and about every 1 in 100,000 will suddenly decide to take two minutes to complete. So now I’ve got the fun task of disabling suspicious modules one by one until I find the culprit – and then hope the cause is easy to spot! |
Jeffrey Lee (213) 6048 posts |
Looks like it’s a variant of this bug, as the failure rate seems to decrease as I eliminate IRQ-generating modules, and I’ve spotted this sequence of code that the compiler’s generated inside one of the functions used by the division code: MOV R2,R13 ADD R13,R13,#8 LDMIA R2,{R9,R11} I guess I’d better reduce it down to a simple test case and then start prodding people until the bug gets fixed! |
WPB (1391) 352 posts |
This is related to this similar bug, isn’t it? In that if an interrupt occurs after the ADD but before the LDMIA, the stacked R9 and R11 could be trashed, assuming R13 in the above snippet is the SVC stack pointer. Is that right? |
Jeffrey Lee (213) 6048 posts |
Yeah, it’s the same basic mistake as in that code. |
Steve Drain (222) 1620 posts |
I have now completed all the regular transcendental functions and updated my site. What has become clear is that a few functions take too much processing in VFP to be able to maintain the IEEE precision of the FPE. In the worst case, raising to a power, the best I have done so far is 13 decimal places, as opposed to 17. On the other hand, the speed is of the order of 30 times faster. ;-) SWIs have been a useful way to test the code, but they add too much overhead to make a sensible interface for this sort of thing, so I shall not be getting an allocation and releasing it officially. However, the assembly source may be of interest to anyone wanting to do something similar. If anyone downloads FloatMod, they will find a copy of a draft StrongHelp manual for VFP instructions included. This is derived from the information Jeffrey posted. There is still some work to be done, but I would welcome comments. |
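The precision loss in raising to a power is easy to reproduce in any double-precision environment. A short Python sketch (my own illustration, not the FloatMod code) of the usual exp(y·ln x) formulation shows why: the rounding error in ln x is scaled up by y, costing a few of a double's 15 to 17 significant digits.

```python
import math

def naive_pow(x, y):
    """Power via exp(y * ln x): the half-ulp error in ln x is
    multiplied by y, so accuracy degrades for large exponents.
    Sketch only; library pow uses extra internal precision."""
    return math.exp(y * math.log(x))

x, y = 1.5, 200.0
a = naive_pow(x, y)
b = x ** y                 # library pow, typically close to correctly rounded
rel_err = abs(a - b) / b
print(rel_err)             # tiny, but can exceed one ulp
```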
Jeffrey Lee (213) 6048 posts |
The VFP coprocessor in the Pi is a very odd beast. If you have the inexact operation exception enabled then it will basically trigger an undefined instruction for every single data processing instruction, in order to force the support code to evaluate it and trigger any FP exception that’s required. I can understand that, even though it’ll make performance pretty sucky. But what I can’t get my head around is the myriad of situations in which non-data-processing instructions will cause the VFP coprocessor to trigger an abort (or at least the situations in which the undefined instruction vector is taken with the VFP coprocessor in an exception state). So far I’ve seen them fire in the following situations:
CMP R0,R0
RFS R0 ; (FPA instruction)
MOV pc,lr
VADDNE.F32 S24,S26,S27

Yep, that last one is an unreachable VFP instruction which won’t be executed due to failing the condition code test, yet it triggers a VFP exception on the RFS instruction. But all those cases I can understand, as in each situation it’s valid for the CPU to take the undefined instruction vector. It’s just the VFP coprocessor which is being a bit funny and claiming that it needs servicing when really it doesn’t. All easy enough to work around by adding some code to validate that the instruction at [LR_und-4] is a VFP data processing instruction, and if it isn’t, clear the VFP exception state and pass the exception on to the next handler. However, today I’ve come across this situation:

BL code
VSTMIA R0,{S0-S31}
(more code here blah blah blah)

.code
VMLA.F32 S9,S2,S1
; This next instruction must lie on the start of a new cache line
; Note that some of the above sequences seem to be cache line sensitive as well!
MOV PC,R14
VMLA.F32 S9,S2,S20

Guess which instruction we get an exception on? VSTMIA (and the first VMLA, but that’s expected). Guess what VSTMIA isn’t meant to do? Bounce to support code. The interesting thing is that it doesn’t seem to matter how many instructions occur between returning from ‘code’ and the next VFP instruction – the next VFP instruction will always trigger an exception (my original version of the code was in BASIC, with ‘code’ and the VSTMIA being two separate CALLs). So the CPU must be seeing the second VMLA further ahead in the pipeline and prematurely putting the VFP coprocessor into the exception state such that the next VFP instruction will trigger the exception. So now I have to work out how to deal with that. Unfortunately it isn’t as simple as making all false exceptions clear the exception flag from the VFP coprocessor and then retry the instruction, as it looks like it’s possible to get stuck in a loop with some code sequences (e.g. my first example of an FPA instruction followed by a VFP instruction). Hopefully I can get by with retrying just the VFP instructions, and passing all others on to the next exception handler. If that doesn’t work then I guess I’ll have to add support for emulating at least some of the non-data-processing instructions. |
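The validation step described above (checking whether the instruction at [LR_und-4] is a VFP data-processing instruction before claiming the exception) amounts to a bit-pattern test. Here is a rough Python sketch of that decision flow; the mask value, helper names and string results are my own illustration (the real check lives in the support code's assembler), based on the ARM encoding of CDP-space instructions targeting coprocessors 10 and 11.

```python
# Sketch of classifying the bounced instruction word found at
# [LR_und - 4]. VFP data-processing instructions occupy the CDP
# space: cond != 1111, bits 27-24 == 1110, coprocessor 10 or 11
# (bits 11-9 == 101), and bit 4 clear (bit 4 set would be MRC/MCR).
VFP_DP_MASK = 0x0F000E10
VFP_DP_VALUE = 0x0E000A00

def is_vfp_data_processing(instr):
    if (instr >> 28) == 0xF:      # unconditional space (NEON etc.)
        return False
    return (instr & VFP_DP_MASK) == VFP_DP_VALUE

def handle_undef(instr):
    """Decision flow suggested in the post: retry/emulate only VFP
    data-processing instructions, pass everything else on."""
    if is_vfp_data_processing(instr):
        return 'emulate'
    return 'clear VFP exception state and pass on'

print(handle_undef(0xEE300B01))   # VADD.F64 D0,D0,D1 -> emulate
print(handle_undef(0xED900B00))   # VLDR D0,[R0] -> passed on
```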
Steve Drain (222) 1620 posts |
That all sounds frightening. I cannot say I really understand what you have to do, but my good wishes are with you. |
Jeffrey Lee (213) 6048 posts |
It would probably help if I understood what I had to do as well ;) After experimenting some more, it looks like:
Which suggests my processing flow should be as follows:
Due to the need to correctly determine whether a VFP instruction is supported by the hardware or not, I think I’ll have a go at using decgen again – if all it has to do is classify the instruction and work out what VFP features it requires then it should work quite well, without much overhead compared to a bespoke lump of assembler. |
Rick Murray (539) 13840 posts |
Implementation bug, or really seriously quirky way to do it? Would an MPEG4 / AAC decoder use VFP? The Pi is quite happy playing 720P video, heats up one whole degree when playing it. ;-) I’m having difficulty imagining it can seriously manage this if firing off exceptions all over the place for the VFP bits, or is this weirdness only when working with inexact? Also, if anybody (not necessarily you) has a mo – could they/you explain (preferably in terms that a five year old can understand…) how a processor is supposed to calculate mathematics “inexactly”? I sort of imagine this to be the reason the economy is in a mess. “Well, we added it all up and the machine just sorta picked a nice looking result.” (^_^) |
Jeffrey Lee (213) 6048 posts |
Possibly, but I doubt it. NEON is more cut out for that kind of thing.
It’s only when trapped inexact exceptions are enabled. If they’re disabled then it’s generally a lot more sensible about which instructions will bounce to support code (and hopefully impossible for non-data-processing instructions to bounce)
The inexact exception is raised whenever the precision of a single/double precision floating point number is insufficient for representing the result of a calculation. For example, the number 2 can be represented exactly, but the result of sqrt(2) cannot because it’s some really long number (1.4142135623730950488016887242097 something something something according to Windows calculator – which is a lot higher precision than SciCalc seems to display!) If you have the inexact trap enabled (which is what I’m trying to sort out) then any inexact result will cause an error to be generated so that the program can deal with it appropriately. If the trap is disabled then all that happens is a status bit gets set in the FPSCR and it’s down to the program to check for it if/when it desires. In practice, I’d expect very little code to want to use trapped inexact exceptions, for the simple fact that most interesting calculations will have inexact results. Even 1/10 can’t be represented with full accuracy using binary floating point! |
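The untrapped behaviour described above, where the result is simply rounded and a status bit records the loss, is easy to observe from any language that exposes IEEE doubles. A short Python illustration (my own, just showing which values are exact and which are not):

```python
import math
from fractions import Fraction

# 2 is exactly representable as a double, so sqrt(2) squared
# drifting away from 2 shows the intermediate result was inexact.
r = math.sqrt(2.0)
print(r * r == 2.0)                        # False: sqrt(2) was rounded

# 1/10 has no finite binary expansion, so the double stored for
# the literal 0.1 is not exactly one tenth.
print(Fraction(0.1) == Fraction(1, 10))    # False

# Whereas 2 itself round-trips exactly.
print(Fraction(2.0) == 2)                  # True
```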
Steve Drain (222) 1620 posts |
I hardly dare suggest something more, but if you are already writing a lot of support code, had you considered whether the FPE could be rewritten to take advantage of VFP? The conclusion I came to when dabbling about with my Float module was that this could not be done effectively, because of the need to have a VFP context that would have to be switched for every FPA instruction emulated. I was only able to make things work fast by using an application VFP context. I presume that this could be overcome at a low enough level. As for the translation, the basic FPA instructions can be more or less written as VFP instructions, although only for double or single precision – there can be no extended precision. For the transcendental instructions I have now improved my double precision algorithms to match the FPE 1 and they do run acceptably fast. If it is not impertinent, you are welcome to use them. 1 except for POW which is still a place or two short |
Jeffrey Lee (213) 6048 posts |
I have considered it, but decided against it for pretty much the reasons you state. It’d be a lot of hassle to make it work, and the end result might not be that much faster than using FPE. Ultimately I’d rather promote the use of native VFP (or softfloat) than spend time trying to make legacy FPA code a little bit faster. The fact that the VFP support code needed for the Pi has to cover all of the data processing instructions means that a VFPEmulator module which provides full VFP emulation is a distinct possibility – which would make using VFP as default for most software a much more attractive proposition than sticking with FPA or softfloat, especially since any future hardware ports are almost certainly going to have hardware VFP. |
Steve Drain (222) 1620 posts |
Using immediate context switching I found the advantage was quite limited.
That is an idea that I like a lot. My module defaults to FPA without a context, but that would be better. |
Jeffrey Lee (213) 6048 posts |
The support code is now checked in and should be in today’s ROM. There are a couple of new SWIs (VFPSupport_Features 1 and VFPSupport_ExceptionDump) and a new *Command (*ShowVFPRegs). Let me know if you run into any issues! |
Chris Gransden (337) 1207 posts |
Would it be possible to add an optional build switch to enable building the ROM without FPE support? Something similar is mentioned in the RISC OS Rambles here. This would help open the way to adding hard-float VFP support to GCC. FPE/FPA support was removed from GCC 4.8 onwards some time ago; I think that’s why the RISC OS GCC port is stuck at 4.7.4. I’ve also been testing the new VFP support code with the latest RPi ROM. I used the VFP version of povray 3.6.1 to test it. The time taken to run teapot.pov (-w1024 -h768 +a0.3) was 49s with or without using the support code. |
Jeffrey Lee (213) 6048 posts |
Possibly. My main worry would be how much C stuff there is which is using FPA instructions. I think the code can be detected fairly easily by telling Norcroft to use a different APCS variant, but if you have to disable most of the C modules or something important like the shared C library then you won’t be left with a very useful ROM image for development work. Also, apart from making any code which uses FPA break horribly, I’m not sure how it would help get VFP working in GCC! Getting VFP working in GCC and getting a VFP ROM build are two completely different things.
Ah, but were you actually doing anything which would require support code intervention? If povray is only using RunFast mode (which I guess it must have done if it was running without the support code) then there shouldn’t be any difference in performance. |
Jeffrey Lee (213) 6048 posts |
On the subject of instructions, while looking at the new ARMv8 VFP/NEON instructions today I’ve realised that there are a few which could be mistaken for being conditionally executed. E.g. VSELEQ, VSELGE, VSELGT, VSELVS are ARMv8 VFP instructions, which look like they should be conditional, and it would make sense for them to be conditional considering that all other VFP instructions are conditional, but these ones are unconditional. Things are further compounded by the fact that an assembler could introduce aliases of the form VSELNE, VSELLT, VSELLE, VSELVC by swapping around two of the registers. There are also a few (ARMv7) NEON instructions like VCEQ, VCGE, VCGT, VCLE and VCLT, which also look like they could be conditional, except that it’s well-known that the NEON data processing instructions are all unconditional. |
Steve Drain (222) 1620 posts |
And VACGE, VACGT, VACLE and VACLT. The StrongHelp VFP manual is careful to make this distinction. If I understand correctly, the ‘suffixes’ indicate the type of comparison to be made between elements of the vectors. Would you list the new instructions for me to add, as before. ;-) |
Jeffrey Lee (213) 6048 posts |
So needy! ;-) Note that VFPv4/NEONv2 is a core feature in ARMv8(-A), so these are technically just dependent on the CPU being ARMv8, rather than having any particular VFP/NEON version.

Rounding mode suffixes:
* Round to nearest with ties to (A)way
* Round towards (M)inus infinity
* Round towards (P)lus infinity
* Round to (N)earest
* Round to (Z)ero
* (R)ound according to rounding mode specified in FPSCR
* Round according to rounding mode specified in FPSCR, raising an ine(X)act exception if this resulted in a change in value

VFP
---
VSEL<EQ|GE|GT|VS>.F64 <Dreg>, <Dreg>, <Dreg>
VSEL<EQ|GE|GT|VS>.F32 <Sreg>, <Sreg>, <Sreg>
Set destination to the first operand if the condition is true (according to the PSR), second operand if the condition is false (note that unlike e.g. VCGE, the ARMv8 ARM doesn't list any aliases for these, so I'm not planning on adding any NE|LT|LE|VC aliases to the BASIC assembler)

V<MAX|MIN>NM.F64 <Dreg>,<Dreg>,<Dreg>
V<MAX|MIN>NM.F32 <Sreg>,<Sreg>,<Sreg>
IEEE754-2008 min/max number selection. Differs from the older, NEON-only VMIN/VMAX in that if one operand is a quiet NaN and the other isn't a NaN, the non-NaN operand will be returned (VMIN/VMAX will return the default NaN if any operand was NaN)

VCVT<A|M|N|P>.<S32|U32>.F64 <Sreg>,<Dreg>
VCVT<A|M|N|P>.<S32|U32>.F32 <Sreg>,<Sreg>
Convert float to integer, with specified rounding mode. This differs from the older VFP VCVT instruction, which always rounded to zero.

VRINT<A|M|N|P|R|X|Z>.F64.F64 <Dreg>,<Dreg>
VRINT<A|M|N|P|R|X|Z>.F32.F32 <Sreg>,<Sreg>
Round a floating point value to integer, with specified rounding mode.

VCVT<B|T>[<cond>].F16.F64 <Sreg>,<Dreg>
VCVT<B|T>[<cond>].F64.F16 <Dreg>,<Sreg>
Convert between half-precision and double-precision float (as per the older VCVTB/VCVTT, which only supported conversion between half & single precision)

NEON
----
V<MAX|MIN>NM.F32 <Qreg>,<Qreg>,<Qreg>
V<MAX|MIN>NM.F32 <Dreg>,<Dreg>,<Dreg>
IEEE754-2008 min/max number selection for NEON vectors. Differs from the older, NEON-only VMIN/VMAX in that if one operand is a quiet NaN and the other isn't a NaN, the non-NaN operand will be returned (VMIN/VMAX will return the default NaN if any operand was NaN)

VCVT<A|M|N|P>.<S32|U32>.F32 <Qreg>,<Qreg>
VCVT<A|M|N|P>.<S32|U32>.F32 <Dreg>,<Dreg>
Convert NEON vector from floats to integers, with specified rounding mode. This differs from the older NEON VCVT instruction, which always rounded to zero.

VRINT<A|M|N|P|X|Z>.F32.F32 <Qreg>,<Qreg>
VRINT<A|M|N|P|X|Z>.F32.F32 <Dreg>,<Dreg>
Round a NEON vector of floating point values to integer, with specified rounding mode. |
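The VMAXNM/VMINNM NaN rule described above can be sketched in a few lines of Python (my own reference model of the quiet-NaN case only; the function names are mine and signaling NaNs are not modelled):

```python
import math

def maxnm(a, b):
    """IEEE 754-2008 maxNum, as per VMAXNM: if exactly one operand
    is a (quiet) NaN, return the other operand."""
    if math.isnan(a):
        return b              # NaN only if both operands are NaN
    if math.isnan(b):
        return a
    return max(a, b)

def vmax_legacy(a, b):
    """Older NEON VMAX behaviour: any NaN operand yields the default NaN."""
    if math.isnan(a) or math.isnan(b):
        return float('nan')
    return max(a, b)

print(maxnm(float('nan'), 3.0))        # 3.0
print(vmax_legacy(float('nan'), 3.0))  # nan
```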
Steve Drain (222) 1620 posts |
Thanks. That tells me what I have to include. As these look as if they are still VFP4/NEON2, I shall put them in those groups but in a sub-heading for ARMv8.
I have been to the ARM documents and I think they are conditional:
So the instruction is Is that ok? |