Taking FP seriously
GavinWraith (26) 1563 posts |
On another thread Rick said:
“CPUs generally seem to offer fp arithmetic (+, -, *, /, abs, round) and an assortment of basic functions (exp, log, power, sqrt, sin, cos, tan, asin, acos, atan).”
If I remember rightly, BBC BASIC implements these with lookup tables for continued fraction approximants (i.e. rational function approximants), which is the only sensible way to go, given that rational functions are precisely those expressible using only the arithmetic operations. It would have made sense for BBC BASIC to supply, as a separate function, a generic routine with a user-supplied lookup table. Admittedly, it would not be able to take advantage of the various optimizations that can be used for particular functions such as those mentioned above, but even so it would have been useful to have (are you listening, Steve?). I do not know enough about VFP, but I suspect that it must use a similar algorithm in microcode. Does it expose it for public use? |
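For illustration, a minimal sketch of the kind of generic routine described above – Horner evaluation of a polynomial from a user-supplied coefficient table, using VFP. The table layout, label names and register conventions here are assumptions for the sketch, not any actual BASIC or VFP interface; a rational approximant P(x)/Q(x) would then be two such calls and a VDIV.F64.

    ; In:  r0 -> table (word count n, then n doubles, highest power first)
    ;      d0 = x.   Out: d1 = polynomial value.   Corrupts r1, d2.
    evalpoly
            LDR      r1, [r0], #4       ; n = number of coefficients (>= 1)
            VLDR     d1, [r0]           ; accumulator = leading coefficient
            ADD      r0, r0, #8
            SUBS     r1, r1, #1
            MOVEQ    pc, lr             ; constant polynomial: done
    eploop  VLDR     d2, [r0]           ; next lower coefficient
            ADD      r0, r0, #8
            VMLA.F64 d2, d1, d0         ; d2 := coefficient + accumulator * x
            VMOV.F64 d1, d2             ; becomes the new accumulator
            SUBS     r1, r1, #1
            BNE      eploop
            MOV      pc, lr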
Clive Semmens (2335) 3276 posts |
Not sure whether it’s microcoded or hardwired – I’m pretty sure that’s not specified in the architecture, so an implementer would be free to do whichever they preferred. I’m pretty sure it’s got to be that kind of algorithm though, and I’m very sure it’s not exposed for public use. |
Rick Murray (539) 13805 posts |
To me that question is like “does the ARM expose how MUL works?”. Am I misreading something?
It’s a lot like the older FPA, only with different instructions and more capabilities. And, from the point of view of a programmer, the data word order is the other way around, so using VFP in a program written for FPE (any Norcroft C program) isn’t simply a matter of patching in some VFP instructions – though you can fake it by loading the double precision register as two single precision halves, like:
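    ; a sketch – assumes r0 points at an FPA-format double, which stores
    ; its most significant word first; VFP on little-endian ARM expects
    ; the least significant word first, so load the halves swapped:
    FLDS    s1, [r0]        ; high word of the value -> top half of d0
    FLDS    s0, [r0, #4]    ; low word -> bottom half of d0
    ; d0 now holds the value; store the pair back in the same order
    ; (FSTS) to produce an FPA-format result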
|
Jeffrey Lee (213) 6048 posts |
VFP only provides addition, subtraction, multiplication, division, abs, negation, multiply-accumulate (including fused multiply-accumulate in VFPv4), and square root (and different rounding modes). NEON supports the same operations as VFP, except for square root. Instead it supports reciprocal and reciprocal square root. They use lookup tables and Newton-Raphson iteration to arrive at the result – with the programmer in control of how many iterations are performed (VRECPE, VRECPS, VRSQRTE, VRSQRTS instructions) |
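The VRECPS step computes the correction factor 2 − (x · est) of the Newton-Raphson iteration estₙ₊₁ = estₙ(2 − x · estₙ). A minimal sketch for the reciprocal of four packed floats (the register choice is arbitrary, an assumption for the example):

    VRECPE.F32  q1, q0      ; table-driven initial estimate of 1/q0 (~8 bits)
    VRECPS.F32  q2, q1, q0  ; 2 - (q1 * q0)
    VMUL.F32    q1, q1, q2  ; first Newton-Raphson refinement
    VRECPS.F32  q2, q1, q0
    VMUL.F32    q1, q1, q2  ; second refinement: near full single precision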
Clive Semmens (2335) 3276 posts |
I think what Gavin wants is to be able to use the VFP’s Newton-Raphson iteration with user-supplied look-up tables for user-defined functions, which he definitely can’t do. |
GavinWraith (26) 1563 posts |
But they do not let the user specify a lookup table in memory? That would seem to me to be rather useful.
Newton-Raphson is fine if you know the derivative, as with the standard functions. For functions that are specified in some other way – say by an integral equation or a minimization problem – it might not be the most convenient algorithm. |
Jeffrey Lee (213) 6048 posts |
“They use lookup tables…”
No, the lookup table is fixed in the hardware. If you want custom lookup tables, then I think you have three options:
|
Steve Drain (222) 1620 posts |
As does an FPA/FPE, with the addition of POL (equivalent to ATAN2).
BASIC V with 5-byte floats uses Chebyshev polynomials where appropriate, not continued fractions as far as I can tell. BASIC VI uses FPA instructions. The FPE integer code is beyond me, but it is very long. ;-(
Note the missing ‘basic functions’. Also, VFP is only 64-bit, whereas the FPA can do 80-bit and always does internally. My Float module does its best to add those other basic functions and to make calling VFP friendly and backward compatible with FPE. Remember ‘context switching’.
But those are only 32-bit, which I think makes them a no-go for exploiting in BASIC, at least. |
Jeffrey Lee (213) 6048 posts |
“CPUs generally seem to offer fp arithmetic (+, -, *, /, abs, round) and an assortment of basic functions (exp, log, power, sqrt, sin, cos, tan, asin, acos, atan).”
The original FPA co-processor that was available for the ARM3 implemented all the arithmetic functions in hardware. But for the FPA in the ARM7500FE, there’s no hardware support for the trig or power functions (except square root) – POW, RPW, POL, LOG, LGN, EXP, SIN, COS, TAN, ASN, ACS and ATN are handled by FPEmulator (see section 10.2.3 of the ARM7500FE data sheet).
“NEON supports the same operations as VFP, except for square root.”
For floating-point, yeah. But maybe there’s some fun stuff which could be done for integers (array/matrix operations). |
Steve Drain (222) 1620 posts |
Now, I did not know that. Having read the data sheet, I see that the recommendation is to use library functions (in C) to implement the missing instructions, because that is more efficient than letting the FPE implement them through exceptions. This would also seem to be the best route for using VFP, rather than an FPE that uses VFP instructions. The problem is backward compatibility, I suppose. Is the source for that version of the FPE likely to be available anywhere?
Having read every NEON instruction while writing the StrongHelp VFP manual, I am staggered by the things that might be done. |
Rick Murray (539) 13805 posts |
Mmm – like unpacking an RGB bitmap into separate registers for each colour component with one instruction, and then, after processing, putting the RGB data back together from the separate registers, again using one instruction. There’s some interesting stuff in NEON. Just a shame I’m too dumb to understand most of it… |
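That unpacking is NEON’s structured load/store family. A minimal sketch for eight pixels of packed 24-bit RGB (the source and destination pointers in r0 and r1 are assumptions for the example):

    VLD3.8  {d0, d1, d2}, [r0]!   ; de-interleave: d0 = 8 reds,
                                  ; d1 = 8 greens, d2 = 8 blues
    ; ... per-channel processing here ...
    VST3.8  {d0, d1, d2}, [r1]!   ; re-interleave back to packed RGB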
Jeffrey Lee (213) 6048 posts |
ARM allowed the FPEmulator source to be released under a BSD license a few years ago: https://www.riscosopen.org/viewer/view/mixed/RiscOS/Sources/HWSupport/FPASC/ (“FPASC” = “FPA support code”, I guess). All the interesting stuff will be in the coresrc folder. |
Clive Semmens (2335) 3276 posts |
There is indeed. I loved documenting it, it’s so elegantly designed. It’s a pity I’ve never had the opportunity to use it in anger…
Yeah, right, of course you are. Not. Never learnt the relevant skills to make use of it, perhaps, but that’s not the same thing at all at all. |
Rick Murray (539) 13805 posts |
Unfortunately, there is a technical issue. 😕 |
Clive Semmens (2335) 3276 posts |
If that’s purely an arithmetical dyscalculia, it shouldn’t be an impediment. If it’s more general, fair point and I’m sorry. (And sad, but not ashamed.) |
Steve Pampling (1551) 8154 posts |
Funny things, mental blocks.1
1 According to my A level maths teacher I was useless at it; according to my college lecturer, my A level maths teacher was an idiot. |
Rick Murray (539) 13805 posts |
It isn’t really. Maths is not a big part of my life, though in a discussion regarding FP behaviour I’m probably the worst person to be in the discussion. ;-)
I have no difficulty with three dimensional visualisations, and I see estimating which of two things is more as a logic problem rather than a maths one. It’s like when you’re at the till and it comes to 9.86 and you give the woman 10.06, and she hands the extra six cents back because she isn’t able to understand that giving back twenty cents is going to be quicker and simpler than finding fourteen cents.
It was fascinating to see that the inability to associate a name with a face is part of this. I’m not sure why, but that’s something I have trouble with. I can remember faces and recognise them with the briefest glance – I proved this when I was about eight, waiting in Basingstoke to catch a train to Southampton. I spotted another crew member on the intercity as it barrelled through the station. Everybody said that was impossible, so I told them where on the train he was, what time it was, and described what he was wearing. All correctly. But could I tell you his name? Nope!
No trouble with time. I used to get detention a lot (teachers don’t like being corrected, it seems, but then I was a bit of a troll1), so I used to count off my “stand in the corner” time with great accuracy. I’m less accurate now, but I think that’s a perception thing to do with aging.
1 Some might say “obnoxious git”, but then I’d say that of some teachers. Truth is, I found school mind-numbingly boring until maybe third form senior. Of course, we are dealing with an education system where the response to pre-teen me reading John Wyndham when the other kids were looking at books with Ladybird print and pictures was “he can’t”. What? Sc**w you, lady! |
Steve Pampling (1551) 8154 posts |
Oh, that routine. My father had to deal with people doing that three times over, although with my youngest sibling being so much younger, the teenage me had lots of fun winding up the critical adults. |
Theo Markettos (89) 919 posts |
I don’t think they failed to take it seriously; I think they made some bad decisions early on that caused trouble later. The idea of specifying FP instructions and emulating them doesn’t seem a bad idea from the perspective of 1983, when the instruction set was designed. Using instructions as an API between applications and the OS (see also SWI) seems like a natural fit – especially when it’s possible to easily replace them with a hardware implementation.
For instance, they had a podule with a Western Electric WE32206 floating point unit on it. FPEmulator translated FP instructions into native commands to the FPU, waited for the results and returned them. That made sense in a 1985 kind of world (indeed, the contemporary MC68881 FPU documentation suggests a similar idea). But as soon as caches came along this became infeasibly expensive – easier to compute on the integer CPU than pay the I/O latency, and the CPU can’t get any useful work done while waiting for the FPU reply.
Bearing in mind that ARM1/2/3 were designed on the basis of no money and no people, they had to be as simple as possible to make it work at all. So adding an on-die FPU (expensive in terms of die area) was out of the question.
Move forward into the ARM6/7 timeframe and the CPU is being developed by ARM. The increasing costs of the undefined instruction start to bite: pipeline flush, register save, FP instruction decode, potentially cache and TLB invalidates (bearing in mind ARM was not designing for RISC OS any more). It’s cheaper just to compile soft-float.
Meanwhile the market has moved from workstations (eg Acorn) to embedded (Psion, Nokia and friends). Nokia wasn’t interested in FP; they cared about power consumption. Process scaling meant the FPA could fit on-die (viz the 7500FE), but the integer side was only competitive for a short while in the STB market before others overtook it.
Having made those wrong turns, I think ARM were right to scrap it all and start again. Perhaps Acorn’s final wrong turn was to cling to the FPA world for too long, rather than admitting defeat and going soft-float.
Today, RISC OS still suffers from the compounded bad decisions: FPA code is still being generated, even though the last processor to execute it natively was twenty years ago. FPEmulator still emulates FPA in software, even though VFP instructions are now available. BASIC, unusually, supported soft-float from the beginning – getting the decision right with hindsight. Its later FP cousin is still using emulated FPA rather than VFP. And SWIs are still the general-purpose API between software components (rather than the means of escalating privilege that most OSes restrict them to today, due to the costs). Most of those are fixable, but we are where we are. |
GavinWraith (26) 1563 posts |
Many thanks for this, Theo. There were some very bright people at Acorn, and I think that there could have been intellectual reasons for underestimating the importance of FP.
Computers are finite digital machines. The leap in mathematics from integers to real numbers is over an abyss that inescapably involves infinitary notions (suprema). Those notions have to be faked on a computer. For example, a stack is a way of faking an arbitrarily long array. Similarly, in programs we fake meaning with types. But there are lots of ways of doing the faking, and the utility of each depends on the intended use. The concept of program lulls us into forgetting that useful programming is a public business; protocols are needed so that everybody concerned is singing from the same hymn-sheet. Which is why IEEE standards for FP are important, for example. In other words, this becomes an organizational as well as a merely intellectual problem of implementing mathematical ideas.
I often get upset when I find people thinking that real numbers have to be implemented using floating point. Convenient if a number must be represented with a fixed number of bytes, but you sacrifice the rules of algebra – multiplication of floating point numbers is not always associative (in IEEE double precision, (0.1 × 3) × 10 gives 3.0000000000000004, while 0.1 × (3 × 10) gives exactly 3). Whereas if you use lazy streams of rational approximants you get the algebra correct, but hellish difficulties in estimating storage requirements.
Acorn had plenty of intellectual skills, but perhaps I do not need to complete this sentence … ;) |
Rick Murray (539) 13805 posts |
“…and the CPU can’t get any useful work done while waiting for the FPU reply.”
Is this still the case? I thought the likes of VFP could work in parallel so long as results were not needed immediately, rather like the behaviour of the Cortex dual pipeline.
“FPA code is still being generated…”
Yup. We’ve long passed the point at which the Norcroft compiler should have been emitting FPA only as a compatibility option, rather than FPA being the only thing it can do for floats.
“…admitting defeat and going soft-float.”
How is soft float more optimal than hardware FP? Wouldn’t the most logical way forward be to introduce VFP support and transition software to use that, rather than an emulation of an ancient device?
There is floating point and there is fixed point, but more specifically both are approximations. Just like how 96kHz 24-bit audio is the new best thing, despite decades of 44.1kHz (CD) or 48kHz (DVD) at 16-bit – possibly more levels of accuracy than a lot of hardware is capable of reproducing, and certainly more than our ears are capable of sensing – it’s only an approximation of the sound, and the higher spec protocol is just the same, only with more data to (hopefully) make it a better approximation. This shows up in my weather program, where reading wind speed data from the sensor (it gives raw data in m/s) and translating it to a human friendly format like km/h or mph (in BASIC) often results in wind speeds like 12.3000000001 km/h. Because real world stuff on a digital logic device is merely an approximation. The question is, what level of accuracy is acceptable?
Yet time and again you will see two things: |
Jon Abbott (1421) 2641 posts |
Wasn’t the FPA10 restricted to 25MHz? Was a faster FPA ever released? I expect the reason we’re in this state is because Acorn implemented a “RISC like FPA” (their words); by RISC they simply meant “not a full implementation”. As a consequence the FPA needed assistance before you could use it, making it pretty useless as a general-purpose FPA and only really relevant to carefully written code that used only normalized values and specific instructions. Requiring hand-holding meant several versions of FPEmulator: one which was totally software based, and another which handled the unimplemented scenarios. That’s fair enough – the FPA10 was pretty advanced for the day. The only problem is, they never produced an FPA that implemented the full instruction set and handled non-normalized numbers etc. So, in short, hardware FPA was rocky from day 1. Do any ARM chips these days even implement an FPA? I expect not; as it’s a co-processor, it will at best be implementation defined or, at worst, not even an option. |
Clive Semmens (2335) 3276 posts |
You expect right. Nothing since the ARM7500FE. I think that’s actually the only chip that’s ever had the FPA onboard. |
Jeffrey Lee (213) 6048 posts |
I think someone needs to teach Rick the concept of “history” :-)
“…and the CPU can’t get any useful work done while waiting for the FPU reply.”
Theo was talking about 80s-era FPUs. Modern FPUs definitely avoid blocking other parts of the pipeline as much as possible.
Theo was talking about emulated FP on ARM6/7. Clearly if you’ve got hardware FP being handled by the CPU (either integrated or on the coprocessor bus) then there should be no extra cost in instruction decoding. (Although it wouldn’t surprise me if earlier coprocessor-based designs did have an extra cycle of latency because they had to decode the instruction after being fed it by the ARM.)
“How is soft float more optimal than hardware FP?”
It isn’t. Theo is talking about soft float vs. emulating FP hardware.
“Wouldn’t the most logical way forward be to introduce VFP support…?”
Theo is talking about the mid-90s, when VFP didn’t exist. |
Rick Murray (539) 13805 posts |
History… oh, that’s that thing where lots of people die for the same stupid reasons over and over again… What I was asking was to compare what was with what is: for example, how does ARM rationalise FP operations now, given the potential for stalling while awaiting a result? Is it more efficient to string together FP instructions, or to interleave FP and ARM instructions so the ARM has something to do while the FP unit is busy? Not that I plan to be using FP ops. Just wondering… |
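A sketch of the interleaving being asked about – instruction latencies vary by core, so treating the divide as long-latency is an assumption here rather than a guaranteed win:

    VDIV.F64  d2, d0, d1    ; long-latency divide begins
    LDR       r3, [r4], #4  ; integer work continues while it runs
    ADD       r5, r5, r3
    VSTR      d2, [r6]      ; first use of d2: the pipeline only stalls
                            ; here if the divide has not yet finished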