VFP implementations of transcendental functions and decimal conversion
Jeffrey Lee (213) 6048 posts |
The fused multiply & accumulate can help a bit (probably by one literal bit!). It’s the difference between rounding once (after the combined multiply and add) and rounding twice (once after the multiply, and again after the add).
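A quick way to see that single rounding in action is a standalone C sketch (mine, not VFP code; fma() here is the C99 library routine, and the values are chosen purely to make the double rounding visible):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* a*b is exactly 1 - 2^-60, which rounds up to 1.0 as a double */
    double a = 1.0 + 0x1p-30, b = 1.0 - 0x1p-30, c = -1.0;
    printf("a*b + c    = %.20g\n", a * b + c);    /* 0: the product was rounded before the add */
    printf("fma(a,b,c) = %.20g\n", fma(a, b, c)); /* -2^-60: a single rounding at the very end */
    return 0;
}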
Emphasis on the “might” :-) If double precision floats are too imprecise, the next step would be to create your own number type out of integers. You could do it using just the ARM integer registers, but NEON might run faster. |
Jeffrey Lee (213) 6048 posts |
This article gives a pretty good fixed-point sine function approximation (including ARM assembler): http://www.coranac.com/2009/07/sines/ Whether it would be directly useful for us, I don’t know (it’ll all depend on how accurately we need to handle small numbers – e.g. do we expect sin(1/1000000) to be accurate) |
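For comparison, the general shape of a polynomial approximation is simple enough. A minimal C sketch (mine, with a name of my own choosing, and using plain truncated-Taylor coefficients rather than the article's fixed-point minimax ones):

/* Fifth-order approximation of sin(x), assuming |x| <= pi/2 after range
   reduction. Truncated Taylor series, so the worst-case error is ~4.5e-3 at
   x = pi/2; minimax coefficients (as in the linked article) do much better. */
double sin_poly5(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 * (1.0 / 6.0 - x2 * (1.0 / 120.0)));
}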
jim lesurf (2082) 1438 posts |
I’m not any kind of expert on how IEEE compliance, etc, is achieved. But I can make a few comments that may be of interest/amusement.

In my experience injuneers and academics take a different tack on this in general. Both want repeatability and transferability, i.e. the same computer program as source code, compiled and used on different machines and fed the same values (in the same input formats as well as the digits you might see printed), should produce the same output. This is useful to check work and to know that if there is an error, we’re all making the same one. :-)

However, injuneers are happier with differences provided they know why they popped up and can satisfy themselves they are ‘small enough’. Whereas academics may still get into worries about the how and why, as exemplified by the first discoveries of ‘Chaos’, where tiny alterations of the input proceeded to cause changes, as processing was done, that ‘blow up in your face’. This is particularly worrying when two apparently identical programs end up giving different ‘explosions’. Academics start to wonder “who is wrong”, etc. 8-]

Not sure how it compares with the other texts people have quoted. But when I’ve wanted to calculate weird functions, etc, I’ve tended to look in Abramowitz and Stegun. But I don’t know if it is of any use in this context.

I did have an IEEE FP test routine someone gave me many years ago when I was working on the flaws in ‘DSD’ (the SACD audio format). Yes, the devices they use can show ‘Chaos’. But I don’t know if I still have that routine or who gave it to me. I’ll see if I can find it, but it may now be lost on an old machine. |
Jeffrey Lee (213) 6048 posts |
I used Berkeley TestFloat for testing the VFP support code (and the VFP support code itself is built around Berkeley SoftFloat). However it only tests the core arithmetic and exception handling – not decimal conversion or transcendental functions. Edit: Although, the TestFloat website has a link to UCBTEST, which allegedly does test the transcendental functions. Worth investigating. |
Steve Drain (222) 1620 posts |
Wouldn’t that be very similar to the existing FPE?
The advantage here would have to be in exploiting vectors, wouldn’t it? |
Steve Drain (222) 1620 posts |
It has a lot of great stuff for mathematics, but I do not think it is relevant to this computer/digital problem. |
Jeffrey Lee (213) 6048 posts |
If double precision floats are too imprecise, the next step would be to create your own number type out of integers. You could do it using just the ARM integer registers

Depends on which part of the FPE you’re talking about! The core math functions are implemented using integers, since that’s the only option available. The transcendental functions are implemented using (extended-precision) FPA instructions (so will be slower than necessary, because they’ll invoke the undefined instruction handler many times). Decimal conversion might be a mix of both? Optimising the FPE to not use FPA instructions for the transcendental functions would be a worthwhile task, if anyone feels up to it :-)

but NEON might run faster

Ideally, yes. However there might also be some gains from having more registers to play with. Basic 64bit arithmetic operations may also be faster due to being natively supported (so you can do them in one instruction instead of 2+) |
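To illustrate that last point, here is a trivial C model (mine, not FPE code; the names are my own) of the difference between doing a 64-bit add in 32-bit halves, which is what you are stuck with in the ARM integer registers, and doing it natively:

#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_split;

/* The 32-bit way: an add for the low word, then an add-with-carry for the high word. */
static u64_split add64_split(u64_split a, u64_split b)
{
    u64_split r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low word */
    return r;
}

/* The native way: a single 64-bit add (one instruction on a 64-bit-capable unit). */
static uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}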
Steve Drain (222) 1620 posts |
I have read through this, and it makes quite a good summary of the problems, even if there are better ones. Anyway, it is not for floating point and nowhere near the precision required for IEEE.
For small values, sin x = x. The error is of the order of the cube of x |
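That claim is easy to check numerically. A throwaway C snippet (mine) comparing sin x with x for a smallish argument:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1e-3;
    printf("sin(x) - x = %g\n", sin(x) - x);       /* about -1.67e-10              */
    printf("-x^3/6     = %g\n", -x * x * x / 6.0); /* matches: the error is ~x^3/6 */
    return 0;
}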
Steve Drain (222) 1620 posts |
That must be the slowest part by a long way, mustn’t it?
It looks as though that was done to unify the code for the actual FPA and the emulated FPA. Up until I started looking at this I had not appreciated that the FPA might not, of itself, implement the transcendental functions. Did any version do that?
Would that be what any of the ‘Super FPE’ modules did? In comparison to the speed of the integer arithmetic operations in extended precision, how much do the ‘undefined instruction handler’ delays alter the overall timings? Would it really be worthwhile? However, looking at using VFP you would not have those delays and, perhaps, you could slot in the VFP operations in place of the FPA operations, one for one. ;-) |
Jeffrey Lee (213) 6048 posts |
The core math functions are implemented using integers, since that’s the only option available.

I’m not sure. It certainly doesn’t help that the FPE does everything in extended precision, but there are also a lot of other overheads involved, which an inline software floating point implementation (like 5-byte BASIC floats or GCC’s softfloat) could mostly avoid:
The manual for the FPA coprocessor used with the ARM3 described the transcendental functions as not being supported in hardware. Which makes me think that the FPC podule may have been the only device that supported everything in hardware.
I’m not sure. There are a few ways I can think of for speeding things up (with varying effects on standards compliance):
I don’t know. You’ll also be able to gain some performance by removing or deferring some of the exception checks. E.g. for sin and cos, once you’ve sanitised the input value, you know that the intermediate computations won’t overflow so you don’t need to worry about checking for that at all. |
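A rough C outline of that shape (my own sketch, not FPEmulator or VFPSupport code, and the names are mine): deal with the awkward inputs once at the top, and the body then needs no overflow checks because everything is bounded. The reduction and polynomial here are crude placeholders, only good for modest arguments.

#include <math.h>

double sin_outline(double x)
{
    const double pi = 3.141592653589793;
    if (isnan(x)) return x;                  /* NaN propagates                       */
    if (isinf(x)) return NAN;                /* sin(+/-inf) is an invalid operation  */
    double n = nearbyint(x / pi);            /* crude reduction; fine for modest |x| */
    double r = x - n * pi;                   /* |r| <= pi/2                          */
    double r2 = r * r;
    double p = r * (1.0 - r2 * (1.0 / 6.0 - r2 * (1.0 / 120.0)));  /* rough polynomial */
    return (fmod(n, 2.0) == 0.0) ? p : -p;   /* sin(x) = (-1)^n * sin(r)             */
}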
Steve Drain (222) 1620 posts |
I had one of those, but even so I did not realise. I did know that the 7500E was limited.
One that occurred to me was to arrange for the arithmetic operation routines to be callable directly, rather than as FPA instructions. Anyway, I think all that is in the past.
Exceptions worried me a bit with SmartFP so I tried very hard to sanitise the inputs. ;-) |
Jeffrey Lee (213) 6048 posts |
In terms of accuracy, it’s worth bearing in mind that IEEE 754 only “recommends” that the transcendental functions are (a) provided and (b) correctly rounded (unlike the decimal conversion functions, which must be correct). But it would still be nice to have a framework which can show how accurate our versions are (and to test behaviour of special-case/exceptional inputs).

For floating point emulation, if you represent your numbers the right way and are willing to forgo rounding and error checks then you can perform (31bit) multiplication in just a handful of instructions:

; C = A*B
; Assume mantissa has 31 fraction bits, i.e. bit 31 is 1 for any non-zero value
; Assume exponents are unbiased
UMULL   Rlo,Rhi,RmantissaA,RmantissaB      ; Calculate 64bit product, with 62 fraction bits. Result will be in the range [1, 4).
MOVS    RmantissaC,Rhi,LSL #1              ; Normalise the result - shift left once to get 31 fraction bits. Sets carry if value was already normalised.
MOVCS   RmantissaC,Rhi                     ; Use Rhi as-is if it was already normalised
ORRCC   RmantissaC,RmantissaC,Rlo,LSR #31  ; If it wasn't normalised, fill the bottom bit with the top bit of Rlo
ADC     RexponentC,RexponentA,RexponentB   ; Accumulate exponents + normalisation offset
EOR     RsignC,RsignA,RsignB               ; Update sign

The same basic technique can be used for 63bit fractions (or larger), although the lack of 64×64 bit multiply means the code will start to get pretty bloated (and the best NEON can do is 32×32 → 64, the same as ARM) |
Steve Drain (222) 1620 posts |
That is a good escape clause.
And need to be worked on. They will be critical to:
Indeed.
Those NaNs and infinities. ;-) Can we say that denormals will not be used? I have been having a much closer look at the FPE source for transcendentals. It is explicit that the FPA instructions are used, in order to keep the size of the code small, important in those days. All the operations are based on C&W, but I have yet to trace them back to the book itself and then to Hart. It is not necessarily true that the C&W choices are the quickest or most accurate. The range reduction techniques might be very important here. |
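On the range reduction point, the classic Cody & Waite trick is to split the reduction constant into a short exact "head" plus a small correction, so the multiply-and-subtract loses almost nothing. A rough C sketch of the idea (mine, not taken from the book or the FPE; the name and constants are illustrative, and real code uses more terms, or Payne-Hanek, once |x| gets large):

#include <math.h>

/* Reduce x to r = x - n*(pi/2), returning r and setting the quadrant to n mod 4.
   pio2_hi has only 13 significant bits, so n*pio2_hi is exact for modest n;
   the small pio2_lo term then restores most of the accuracy. */
double reduce_pio2(double x, int *quadrant)
{
    const double pio2_hi = 3217.0 / 2048.0;               /* 1.57080078125, exactly representable */
    const double pio2_lo = 1.5707963267948966 - pio2_hi;  /* about -4.454e-6                      */
    double n = nearbyint(x * (2.0 / 3.141592653589793));
    *quadrant = ((int)fmod(n, 4.0) + 4) & 3;
    return (x - n * pio2_hi) - n * pio2_lo;
}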
Jeffrey Lee (213) 6048 posts |
They will be used (we can’t stop people!), but we can probably special-case them if processing them normally will cause problems. Clearly denormalised numbers are useful (otherwise they wouldn’t be part of the spec), so if there’s something we can do to make sure they give approximately correct answers then that’s probably better than treating them as zero. “This minuscule chasm between zero and the smallest positive normalized floating-point number yawned many orders of magnitude wider than the gaps between adjacent slightly larger floating-point numbers.” http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html |
Clive Semmens (2335) 3276 posts |
Indeed. As many orders of magnitude as the number of significant figures in the normal numbers (possibly minus one or two, not sure without careful thought!) |
Steve Drain (222) 1620 posts |
When I last looked, you sort-of recommended setting flush-to-zero because VFPSupport did not deal with denormalised trapping fully. Did I understand that correctly? At the moment SmartFP does that.

I did not detect any code in the FPE source I was looking at that dealt with them, but that may be in another file. ‘Uncommon’ numbers – NaNs and infinities – are dealt with.

The article you pointed to is fun and fascinating. Respect to the IEEE 754 committee. I now have a bit of a handle on what denormalised numbers are for, not just how they are defined. Please correct me if I am up-the-creek. We need them so that we do not lose track of the fact that an underflow has occurred, but that there is still a number that is not zero. If we assumed zero this could have unforeseen consequences in comparisons and small number arithmetic. Denormals are not actually used in calculations – hence the ‘input denormal’ exception flag – or are they? ;-) Laying my head on the block, that would mean that for these operations we could set denormals to zero and retain precision.

Edit: added ‘input denormal’ exception flag |
Jeffrey Lee (213) 6048 posts |
Yes (depending on when you last looked!)

VFPSupport was first written for the BeagleBoard. The VFP hardware in the Cortex-A8 is able to deal with denormalised numbers by itself, and it doesn’t generate undefined instruction traps on FP exceptions, so the only real thing which VFPSupport did was to manage switching contexts.

Fast-forward to the Pi 1, which contains VFP hardware which isn’t able to deal with denormalised numbers by itself (and which can generate undefined instruction traps on FP exceptions). The first versions of RISC OS for the Pi essentially used the same VFPSupport version as before, so in order to avoid “random” undefined instruction aborts you had to make sure that (a) all the exception traps are disabled in the FPSCR and (b) the ‘RunFast’ mode is used (so that denormalised numbers are flushed to zero by the hardware – breaking the spec, but avoiding the need for any support code).

In February 2014 VFPSupport 0.06 came along, introducing the necessary support code for handling denormalised numbers on the Pi (essentially a software emulation of most of the hardware, similar to FPEmulator), and also support for translating the different exception traps into more useful errors. So now you can run with RunFast disabled and whichever exception traps enabled that you want.

(However, looking at the commit message, I’m reminded that the emulation in VFPSupport currently doesn’t emulate “default NaN” or flush-to-zero, so it’s not a fully accurate emulation of the hardware. So e.g. flush-to-zero + exceptions will give you different results to flush-to-zero + no exceptions)
They are used in calculations (if they weren’t used, what would be the point in having them? They might as well just be replaced with an Underflow exception).

I think the “input denormal” exception exists because denormalised numbers are less accurate than normalised numbers. Normalised numbers always have the same number of fraction bits; denormalised numbers have fewer and fewer the closer you get to zero. So if your program is concerned with accuracy, knowing when you’ve got a denormalised number in your equation could be a big help (just like the “inexact” flag tells you that the result has been rounded).
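A small C illustration of that loss of precision (my own, nothing to do with VFPSupport): the absolute spacing between adjacent doubles stops shrinking once you go subnormal, so the relative precision gets steadily worse.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double normal    = DBL_MIN;           /* smallest normalised double, 2^-1022        */
    double subnormal = DBL_MIN / 1024.0;  /* 2^-1032: subnormal, ten fraction bits gone */
    printf("relative spacing at DBL_MIN        = %g\n",
           (nextafter(normal, 1.0) - normal) / normal);         /* ~2.2e-16 (2^-52) */
    printf("relative spacing at DBL_MIN / 1024 = %g\n",
           (nextafter(subnormal, 1.0) - subnormal) / subnormal); /* ~2.3e-13 (2^-42) */
    return 0;
}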
Chop! |
Clive Semmens (2335) 3276 posts |
Of course the interesting question then is, “how inaccurate might my results be?” at which point you want a number indicating how many short of the normal number of fraction bits the denormalized number is. This number takes as many bits to report as you’d have needed to lengthen the exponent to enable the number to have been normalized…(again, +/- 1 or two, haven’t thought carefully enough to be exact) You could go on like that forever. |
Steve Drain (222) 1620 posts |
OK. Tucks head under ARM. ;-) I have also read some more about denormals (or subnormals in IEEE 754-2008).
I started on Float in Nov 2013, so before v0.06. I am afraid I have probably not checked since. ;-(
So how does that affect the current problem? We want denormals to be dealt with. Do we need to disable the ID exception for the duration? Denormals are small and might be caught in a small number filter. SmartFP does not do this, but there is something in FPE like that. Then we can use sin x = x, cos x = 1-x^2/2 etc. One question I could not find an answer to, and it may not be relevant anyway: what is the result of taking the reciprocal of a denormal? |
Steve Drain (222) 1620 posts |
Of course I have. I wrote the VFP StrongHelp manual, so I copied all the information for v0.06, including VFPSupport_Features, but I clearly did not pay attention. |
Clive Semmens (2335) 3276 posts |
Overflow, surely. It’s not zero, so the result isn’t infinity, and it’s definitely out of the range of normal numbers. But you can’t say it’s “not a number.” |
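A quick numerical check (mine, not from the thread) suggests it depends on just how small the denormal is: the very smallest ones do overflow to infinity, but the reciprocal of a large subnormal is still a finite, normal number.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double tiny  = nextafter(0.0, 1.0);  /* smallest subnormal, 2^-1074 */
    double large = DBL_MIN / 2.0;        /* a large subnormal, 2^-1023  */
    printf("1/%g = %g\n", tiny,  1.0 / tiny);   /* overflows to inf             */
    printf("1/%g = %g\n", large, 1.0 / large);  /* ~8.99e307, finite and normal */
    return 0;
}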
Jeffrey Lee (213) 6048 posts |
No need to explicitly store the number of fraction bits you’ve lost (compared to a regular normalised number) – you can easily work that out by looking at the number itself. For the basic arithmetic operations (which are required to be accurate), this should directly correlate to how much error there is in the result. For the other ops, it’s less of a guarantee.
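For example, a helper along these lines (hypothetical, mine; not a VFPSupport routine) recovers that information from a double directly:

#include <float.h>
#include <math.h>

/* How many fraction bits a subnormal double has lost relative to a normal one.
   frexp() gives x = m * 2^exp with 0.5 <= m < 1; any exponent below
   DBL_MIN_EXP (-1021) means the value came from the subnormal range. */
int bits_lost(double x)
{
    int exp;
    if (x == 0.0) return 0;
    frexp(fabs(x), &exp);
    return (exp < DBL_MIN_EXP) ? DBL_MIN_EXP - exp : 0;
}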
I’m not sure. For the transcendental functions, the spec lists the exceptions which they should generate (and when), but it doesn’t seem to obviously state that no other exceptions should be generated. It’s also worth bearing in mind that the rules for FPEmulator are likely to be different than those for VFPSupport. FPEmulator provides emulation of hardware operations, so in effect all the internal operations are hidden from the user, including any unexpected exceptions they might generate (e.g. you wouldn’t expect overflow or division by zero to be generated by ‘sin’). But the code going into VFPSupport would be a software library, so maybe exceptions unexpectedly generated by internal operations should be trapped (according to whatever traps the user had enabled when he called the function). |
Clive Semmens (2335) 3276 posts |
That’s fine while you’ve still got the number that caused the loss of precision, but you will need to store that information through to the end of any subsequent chain of calculations (by which time the results may be back in normal range, but no longer accurate). Okay, that’s more economical than having more exponent bits throughout. |
Jeffrey Lee (213) 6048 posts |
Looking closer, I’m reminded that the input denormal/subnormal exception isn’t part of the IEEE spec. The ARM ARM says that it’s only triggered when a denormal input is flushed to zero (i.e. only when flush-to-zero is enabled). Flush-to-zero also alters the behaviour in a few other areas (e.g. denormalised outputs are flushed to zero as well) So unless you want to give yourself a headache, I’d say it’s best to ignore flush-to-zero & default NaN modes, and assume that you’re writing IEEE-compliant code to run on a system that’s configured for IEEE compliance. This means that the only bits of the FPSCR we need to worry about are:
|
Steve Drain (222) 1620 posts |
Within that, I also have a model in my head that is closer to VFP. There is an ‘engine’ in FPE that emulates the FPA arithmetic operations using integer instructions when an FPA is not present. This is IEEE compliant. Then there is code that uses the FPA operations to emulate the transcendental functions. It is this second level that is like VFP, although the API is different. If I remember correctly, the errors of the transcendentals are from the ‘engine’ exceptions, rather than more informative ones. SmartFP attempts to sanitise inputs to avoid those, but BASIC VI (FP) still reports them.
I was hoping you would say that. ;-)
Leave that at default, round to nearest?
Sanitising inputs might avoid these. Here’s hoping. |