VFP implementations of transcendental functions and decimal conversion
Jeffrey Lee (213) 6048 posts |
The fused multiply & accumulate can help a bit (probably by one literal bit!). It’s the difference between rounding once (after the combined multiply and add) and rounding twice (once after the multiply, and again after the add).
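A quick way to see that single rounding in action is a standalone C sketch (mine, not VFP code; fma() here is the C99 library routine, and the values are chosen purely to make the double rounding visible):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* a*b is exactly 1 - 2^-60, which rounds up to 1.0 as a double */
    double a = 1.0 + 0x1p-30, b = 1.0 - 0x1p-30, c = -1.0;
    printf("a*b + c    = %.20g\n", a * b + c);    /* 0: the product was rounded before the add */
    printf("fma(a,b,c) = %.20g\n", fma(a, b, c)); /* -2^-60: a single rounding at the very end */
    return 0;
}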
Emphasis on the “might” :-) If double precision floats are too imprecise, the next step would be to create your own number type out of integers. You could do it using just the ARM integer registers, but NEON might run faster. |
Jeffrey Lee (213) 6048 posts |
This article gives a pretty good fixed-point sine function approximation (including ARM assembler): http://www.coranac.com/2009/07/sines/ Whether it would be directly useful for us, I don’t know (it’ll all depend on how accurately we need to handle small numbers – e.g. do we expect sin(1/1000000) to be accurate) |
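For comparison, the general shape of a polynomial approximation is simple enough. A minimal C sketch (mine, with a name of my own choosing, and using plain truncated-Taylor coefficients rather than the article's fixed-point minimax ones):

/* Fifth-order approximation of sin(x), assuming |x| <= pi/2 after range
   reduction. Truncated Taylor series, so the worst-case error is ~4.5e-3 at
   x = pi/2; minimax coefficients (as in the linked article) do much better. */
double sin_poly5(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 * (1.0 / 6.0 - x2 * (1.0 / 120.0)));
}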
jim lesurf (2082) 1438 posts |
I’m not any kind of expert on how IEEE compliance, etc, is achieved. But I can make a few comments that may be of interest/amusement.

In my experience injuneers and academics take a different tack on this in general. Both want repeatability and transferability, i.e. the same computer program as source code, compiled and used on different machines and fed the same values (in the same input formats as well as the digits you might see printed), should produce the same output. This is useful to check work and to know that if there is an error, we’re all making the same one. :-)

However, injuneers are happier with differences provided they know why they popped up and can satisfy themselves they are ‘small enough’. Whereas academics may still get into worries about the how and why, as exemplified by the first discoveries of ‘Chaos’, where tiny alterations of the input proceeded to cause changes, as processing was done, that ‘blow up in your face’. This is particularly worrying when two apparently identical programs end up giving different ‘explosions’. Academics start to wonder “who is wrong”, etc. 8-]

Not sure how it compares with the other texts people have quoted. But when I’ve wanted to calculate weird functions, etc, I’ve tended to look in Abramowitz and Stegun. But I don’t know if it is of any use in this context.

I did have an IEEE FP test routine someone gave me many years ago when I was working on the flaws in ‘DSD’ (the SACD audio format). Yes, the devices they use can show ‘Chaos’. But I don’t know if I still have that routine or who gave it to me. I’ll see if I can find it, but it may now be lost on an old machine. |
Jeffrey Lee (213) 6048 posts |
I used Berkeley TestFloat for testing the VFP support code (and the VFP support code itself is built around Berkeley SoftFloat). However it only tests the core arithmetic and exception handling – not decimal conversion or transcendental functions. Edit: Although, the TestFloat website has a link to UCBTEST, which allegedly does test the transcendental functions. Worth investigating. |
Steve Drain (222) 1620 posts |
Wouldn’t that be very similar to the existing FPE?
The advantage here would have to be in exploiting vectors, wouldn’t it? |
Steve Drain (222) 1620 posts |
It has a lot of great stuff for mathematics, but I do not think it is relevant to this computer/digital problem. |
Jeffrey Lee (213) 6048 posts |
If double precision floats are too imprecise, the next step would be to create your own number type out of integers. You could do it using just the ARM integer registers

Depends on which part of the FPE you’re talking about! The core math functions are implemented using integers, since that’s the only option available. The transcendental functions are implemented using (extended-precision) FPA instructions (so will be slower than necessary, because they’ll invoke the undefined instruction handler many times). Decimal conversion might be a mix of both? Optimising the FPE to not use FPA instructions for the transcendental functions would be a worthwhile task, if anyone feels up to it :-)

but NEON might run faster

Ideally, yes. However there might also be some gains from having more registers to play with. Basic 64bit arithmetic operations may also be faster due to being natively supported (so you can do them in one instruction instead of 2+) |
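To illustrate that last point, here is a trivial C model (mine, not FPE code; the names are my own) of the difference between doing a 64-bit add in 32-bit halves, which is what you are stuck with in the ARM integer registers, and doing it natively:

#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_split;

/* The 32-bit way: an add for the low word, then an add-with-carry for the high word. */
static u64_split add64_split(u64_split a, u64_split b)
{
    u64_split r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low word */
    return r;
}

/* The native way: a single 64-bit add (one instruction on a 64-bit-capable unit). */
static uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}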
Steve Drain (222) 1620 posts |
I have read through this, and it makes quite a good summary of the problems, even if there are better ones. Anyway, it is not for floating point and nowhere near the precision required for IEEE.
For small values, sin x = x. The error is of the order of the cube of x |
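That claim is easy to check numerically. A throwaway C snippet (mine) comparing sin x with x for a smallish argument:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1e-3;
    printf("sin(x) - x = %g\n", sin(x) - x);       /* about -1.67e-10              */
    printf("-x^3/6     = %g\n", -x * x * x / 6.0); /* matches: the error is ~x^3/6 */
    return 0;
}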
Steve Drain (222) 1620 posts |
That must be the slowest part by a long way, mustn’t it?
It looks as though that was done to unify the code for the actual FPA and the emulated FPA. Up until I started looking at this I had not appreciated that the FPA might not, of itself, implement the transcendental functions. Did any version do that?
Would that be what any of the ‘Super FPE’ modules did? In comparison to the speed of the integer arithmetic operations in extended precision, how much do the ‘undefined instruction handler’ delays alter the overall timings? Would it really be worthwhile? However, looking at using VFP you would not have those delays and, perhaps, you could slot in the VFP operations in place of the FPA operations, one for one. ;-) |
Jeffrey Lee (213) 6048 posts |
The core math functions are implemented using integers, since that’s the only option available.

I’m not sure. It certainly doesn’t help that the FPE does everything in extended precision, but there are also a lot of other overheads involved, which an inline software floating point implementation (like 5-byte BASIC floats or GCC’s softfloat) could mostly avoid:
The manual for the FPA coprocessor used with the ARM3 described the transcendental functions as not being supported in hardware. Which makes me think that the FPC podule may have been the only device that supported everything in hardware.
I’m not sure. There are a few ways I can think of for speeding things up (with varying effects on standards compliance):
I don’t know. You’ll also be able to gain some performance by removing or deferring some of the exception checks. E.g. for sin and cos, once you’ve sanitised the input value, you know that the intermediate computations won’t overflow so you don’t need to worry about checking for that at all. |
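A rough C outline of that shape (my own sketch, not FPEmulator or VFPSupport code, and the names are mine): deal with the awkward inputs once at the top, and the body then needs no overflow checks because everything is bounded. The reduction and polynomial here are crude placeholders, only good for modest arguments.

#include <math.h>

double sin_outline(double x)
{
    const double pi = 3.141592653589793;
    if (isnan(x)) return x;                  /* NaN propagates                       */
    if (isinf(x)) return NAN;                /* sin(+/-inf) is an invalid operation  */
    double n = nearbyint(x / pi);            /* crude reduction; fine for modest |x| */
    double r = x - n * pi;                   /* |r| <= pi/2                          */
    double r2 = r * r;
    double p = r * (1.0 - r2 * (1.0 / 6.0 - r2 * (1.0 / 120.0)));  /* rough polynomial */
    return (fmod(n, 2.0) == 0.0) ? p : -p;   /* sin(x) = (-1)^n * sin(r)             */
}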
Steve Drain (222) 1620 posts |
I had one of those, but even so I did not realise. I did know that the 7500E was limited.
One that occurred to me was to arrange for the arithmetic operation routines to be callable directly, rather than as FPA instructions. Anyway, I think all that is in the past.
Exceptions worried me a bit with SmartFP so I tried very hard to sanitise the inputs. ;-) |
Jeffrey Lee (213) 6048 posts |
In terms of accuracy, it’s worth bearing in mind that IEEE 754 only “recommends” that the transcendental functions are (a) provided and (b) correctly rounded (unlike the decimal conversion functions, which must be correct). But it would still be nice to have a framework which can show how accurate our versions are (and to test behaviour of special-case/exceptional inputs).

For floating point emulation, if you represent your numbers the right way and are willing to forgo rounding and error checks then you can perform (31bit) multiplication in just a handful of instructions:

; C = A*B
; Assume mantissa has 31 fraction bits, i.e. bit 31 is 1 for any non-zero value
; Assume exponents are unbiased
UMULL   Rlo,Rhi,RmantissaA,RmantissaB      ; Calculate 64bit product, with 62 fraction bits. Result will be in the range [1, 4).
MOVS    RmantissaC,Rhi,LSL #1              ; Normalise the result - shift left once to get 31 fraction bits. Sets carry if value was already normalised.
MOVCS   RmantissaC,Rhi                     ; Use Rhi as-is if it was already normalised
ORRCC   RmantissaC,RmantissaC,Rlo,LSR #31  ; If it wasn't normalised, fill the bottom bit with the top bit of Rlo
ADC     RexponentC,RexponentA,RexponentB   ; Accumulate exponents + normalisation offset
EOR     RsignC,RsignA,RsignB               ; Update sign

The same basic technique can be used for 63bit fractions (or larger), although the lack of 64×64 bit multiply means the code will start to get pretty bloated (and the best NEON can do is 32×32 → 64, the same as ARM) |
Steve Drain (222) 1620 posts |
That is a good escape clause.
And need to be worked on. They will be critical to:
Indeed.
Those NaNs and infinities. ;-) Can we say that denormals will not be used? I have been having a much closer look at the FPE source for transcendentals. It is explicit that the FPA instructions are used, in order to keep the size of the code small, important in those days. All the operations are based on C&W, but I have yet to trace them back to the book itself and then to Hart. It is not necessarily true that the C&W choices are the quickest or most accurate. The range reduction techniques might be very important here. |
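On the range reduction point, the classic Cody & Waite trick is to split the reduction constant into a short exact "head" plus a small correction, so the multiply-and-subtract loses almost nothing. A rough C sketch of the idea (mine, not taken from the book or the FPE; the name and constants are illustrative, and real code uses more terms, or Payne-Hanek, once |x| gets large):

#include <math.h>

/* Reduce x to r = x - n*(pi/2), returning r and setting the quadrant to n mod 4.
   pio2_hi has only 13 significant bits, so n*pio2_hi is exact for modest n;
   the small pio2_lo term then restores most of the accuracy. */
double reduce_pio2(double x, int *quadrant)
{
    const double pio2_hi = 3217.0 / 2048.0;               /* 1.57080078125, exactly representable */
    const double pio2_lo = 1.5707963267948966 - pio2_hi;  /* about -4.454e-6                      */
    double n = nearbyint(x * (2.0 / 3.141592653589793));
    *quadrant = ((int)fmod(n, 4.0) + 4) & 3;
    return (x - n * pio2_hi) - n * pio2_lo;
}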
Jeffrey Lee (213) 6048 posts |
They will be used (we can’t stop people!), but we can probably special-case them if processing them normally will cause problems. Clearly denormalised numbers are useful (otherwise they wouldn’t be part of the spec), so if there’s something we can do to make sure they give approximately correct answers then that’s probably better than treating them as zero. “This minuscule chasm between zero and the smallest positive normalized floating-point number yawned many orders of magnitude wider than the gaps between adjacent slightly larger floating-point numbers.” http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html |
Clive Semmens (2335) 3276 posts |
Indeed. As many orders of magnitude as the number of significant figures in the normal numbers (possibly minus one or two, not sure without careful thought!) |
Steve Drain (222) 1620 posts |
When I last looked, you sort-of recommended setting flush-to-zero because VFPSupport did not deal with denormalised trapping fully. Did I understand that correctly? At the moment SmartFP does that.

I did not detect any code in the FPE source I was looking at that dealt with them, but that may be in another file. ‘Uncommon’ numbers – NaNs and infinities – are dealt with.

The article you pointed to is fun and fascinating. Respect to the IEEE 754 committee. I now have a bit of a handle on what denormalised numbers are for, not just how they are defined. Please correct me if I am up-the-creek. We need them so that we do not lose track of the fact that an underflow has occurred, but that there is still a number that is not zero. If we assumed zero this could have unforeseen consequences in comparisons and small number arithmetic. Denormals are not actually used in calculations – hence the ‘input denormal’ exception flag – or are they? ;-) Laying my head on the block, that would mean that for these operations we could set denormals to zero and retain precision.

Edit: added ‘input denormal’ exception flag |
Jeffrey Lee (213) 6048 posts |
Yes (depending on when you last looked!)

VFPSupport was first written for the BeagleBoard. The VFP hardware in the Cortex-A8 is able to deal with denormalised numbers by itself, and it doesn’t generate undefined instruction traps on FP exceptions, so the only real thing which VFPSupport did was to manage switching contexts.

Fast-forward to the Pi 1, which contains VFP hardware which isn’t able to deal with denormalised numbers by itself (and which can generate undefined instruction traps on FP exceptions). The first versions of RISC OS for the Pi essentially used the same VFPSupport version as before, so in order to avoid “random” undefined instruction aborts you had to make sure that (a) all the exception traps are disabled in the FPSCR and (b) the ‘RunFast’ mode is used (so that denormalised numbers are flushed to zero by the hardware – breaking the spec, but avoiding the need for any support code).

In February 2014 VFPSupport 0.06 came along, introducing the necessary support code for handling denormalised numbers on the Pi (essentially a software emulation of most of the hardware, similar to FPEmulator), and also support for translating the different exception traps into more useful errors. So now you can run with RunFast disabled and whichever exception traps enabled that you want.

(However, looking at the commit message, I’m reminded that the emulation in VFPSupport currently doesn’t emulate “default NaN” or flush-to-zero, so it’s not a fully accurate emulation of the hardware. So e.g. flush-to-zero + exceptions will give you different results to flush-to-zero + no exceptions)
They are used in calculations (if they weren’t used, what would be the point in having them? They might as well just be replaced with an Underflow exception).

I think the “input denormal” exception exists because denormalised numbers are less accurate than normalised numbers. Normalised numbers always have the same number of fraction bits; denormalised numbers have fewer and fewer the closer you get to zero. So if your program is concerned with accuracy, knowing when you’ve got a denormalised number in your equation could be a big help (just like the “inexact” flag tells you that the result has been rounded).
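A small C illustration of that loss of precision (my own, nothing to do with VFPSupport): the absolute spacing between adjacent doubles stops shrinking once you go subnormal, so the relative precision gets steadily worse.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double normal    = DBL_MIN;           /* smallest normalised double, 2^-1022        */
    double subnormal = DBL_MIN / 1024.0;  /* 2^-1032: subnormal, ten fraction bits gone */
    printf("relative spacing at DBL_MIN        = %g\n",
           (nextafter(normal, 1.0) - normal) / normal);         /* ~2.2e-16 (2^-52) */
    printf("relative spacing at DBL_MIN / 1024 = %g\n",
           (nextafter(subnormal, 1.0) - subnormal) / subnormal); /* ~2.3e-13 (2^-42) */
    return 0;
}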
Chop! |
Clive Semmens (2335) 3276 posts |
Of course the interesting question then is, “how inaccurate might my results be?” at which point you want a number indicating how many short of the normal number of fraction bits the denormalized number is. This number takes as many bits to report as you’d have needed to lengthen the exponent to enable the number to have been normalized…(again, +/- 1 or two, haven’t thought carefully enough to be exact) You could go on like that forever. |
Steve Drain (222) 1620 posts |
OK. Tucks head under ARM. ;-) I have also read some more about denormals (or subnormals in IEEE 754-2008).
I started on Float in Nov 2013, so before v0.06. I am afraid I have probably not checked since. ;-(
So how does that affect the current problem? We want denormals to be dealt with. Do we need to disable the ID exception for the duration? Denormals are small and might be caught in a small number filter. SmartFP does not do this, but there is something in FPE like that. Then we can use sin x = x, cos x = 1-x^2/2 etc. One question I could not find an answer to, and it may not be relevant anyway: what is the result of taking the reciprocal of a denormal? |
Steve Drain (222) 1620 posts |
Of course I have. I wrote the VFP StrongHelp manual, so I copied all the information for v0.06, including VFPSupport_Features, but I clearly did not pay attention. |
Clive Semmens (2335) 3276 posts |
Overflow, surely. It’s not zero, so the result isn’t infinity, and it’s definitely out of the range of normal numbers. But you can’t say it’s “not a number.” |
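A quick numerical check (mine, not from the thread) suggests it depends on just how small the denormal is: the very smallest ones do overflow to infinity, but the reciprocal of a large subnormal is still a finite, normal number.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double tiny  = nextafter(0.0, 1.0);  /* smallest subnormal, 2^-1074 */
    double large = DBL_MIN / 2.0;        /* a large subnormal, 2^-1023  */
    printf("1/%g = %g\n", tiny,  1.0 / tiny);   /* overflows to inf             */
    printf("1/%g = %g\n", large, 1.0 / large);  /* ~8.99e307, finite and normal */
    return 0;
}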
Jeffrey Lee (213) 6048 posts |
No need to explicitly store the number of fraction bits you’ve lost (compared to a regular normalised number) – you can easily work that out by looking at the number itself. For the basic arithmetic operations (which are required to be accurate), this should directly correlate to how much error there is in the result. For the other ops, it’s less of a guarantee.
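For example, a helper along these lines (hypothetical, mine; not a VFPSupport routine) recovers that information from a double directly:

#include <float.h>
#include <math.h>

/* How many fraction bits a subnormal double has lost relative to a normal one.
   frexp() gives x = m * 2^exp with 0.5 <= m < 1; any exponent below
   DBL_MIN_EXP (-1021) means the value came from the subnormal range. */
int bits_lost(double x)
{
    int exp;
    if (x == 0.0) return 0;
    frexp(fabs(x), &exp);
    return (exp < DBL_MIN_EXP) ? DBL_MIN_EXP - exp : 0;
}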
I’m not sure. For the transcendental functions, the spec lists the exceptions which they should generate (and when), but it doesn’t seem to obviously state that no other exceptions should be generated. It’s also worth bearing in mind that the rules for FPEmulator are likely to be different than those for VFPSupport. FPEmulator provides emulation of hardware operations, so in effect all the internal operations are hidden from the user, including any unexpected exceptions they might generate (e.g. you wouldn’t expect overflow or division by zero to be generated by ‘sin’). But the code going into VFPSupport would be a software library, so maybe exceptions unexpectedly generated by internal operations should be trapped (according to whatever traps the user had enabled when he called the function). |
Clive Semmens (2335) 3276 posts |
That’s fine while you’ve still got the number that caused the loss of precision, but you will need to store that information through to the end of any subsequent chain of calculations (by which time the results may be back in normal range, but no longer accurate). Okay, that’s more economical than having more exponent bits throughout. |
Jeffrey Lee (213) 6048 posts |
Looking closer, I’m reminded that the input denormal/subnormal exception isn’t part of the IEEE spec. The ARM ARM says that it’s only triggered when a denormal input is flushed to zero (i.e. only when flush-to-zero is enabled). Flush-to-zero also alters the behaviour in a few other areas (e.g. denormalised outputs are flushed to zero as well) So unless you want to give yourself a headache, I’d say it’s best to ignore flush-to-zero & default NaN modes, and assume that you’re writing IEEE-compliant code to run on a system that’s configured for IEEE compliance. This means that the only bits of the FPSCR we need to worry about are:
|
Steve Drain (222) 1620 posts |
Within that, I also have a model in my head that is closer to VFP. There is an ‘engine’ in FPE that emulates the FPA arithmetic operations using integer instructions when an FPA is not present. This is IEEE compliant. Then there is code that uses the FPA operations to emulate the transcendental functions. It is this second level that is like VFP, although the API is different. If I remember correctly, the errors of the transcendentals are from the ‘engine’ exceptions, rather than more informative ones. SmartFP attempts to sanitise inputs to avoid those, but BASIC VI (FP) still reports them.
I was hoping you would say that. ;-)
Leave that at default, round to nearest?
Sanitising inputs might avoid these. Here’s hoping. |