Taking FP seriously
GavinWraith (26) 1563 posts |
On another thread Rick said:
“CPUs generally seem to offer fp arithmetic (+, -, *, /, abs, round) and an assortment of basic functions (exp, log, power, sqrt, sin, cos, tan, asin, acos, atan).”
If I remember rightly, BBC BASIC implements these with lookup tables for continued fraction approximants (i.e. rational function approximants), which is the only sensible way to go, given that rational functions are precisely those expressible using only the arithmetic operations. It would have made sense for BBC BASIC to supply, as a separate function, a generic routine with a user-supplied lookup table. Admittedly, it would not be able to take advantage of the various optimizations that can be used for particular functions such as those mentioned above, but even so it would have been useful to have (are you listening, Steve?). I do not know enough about VFP, but I suspect that it must use a similar algorithm in microcode. Does it expose it for public use? |
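For illustration, a minimal sketch of the kind of generic routine described above – Horner evaluation of a polynomial from a user-supplied coefficient table, using VFP. The table layout, label names and register conventions here are assumptions for the sketch, not any actual BASIC or VFP interface; a rational approximant P(x)/Q(x) would then be two such calls and a VDIV.F64.

    ; In:  r0 -> table (word count n, then n doubles, highest power first)
    ;      d0 = x.   Out: d1 = polynomial value.   Corrupts r1, d2.
    evalpoly
            LDR      r1, [r0], #4       ; n = number of coefficients (>= 1)
            VLDR     d1, [r0]           ; accumulator = leading coefficient
            ADD      r0, r0, #8
            SUBS     r1, r1, #1
            MOVEQ    pc, lr             ; constant polynomial: done
    eploop  VLDR     d2, [r0]           ; next lower coefficient
            ADD      r0, r0, #8
            VMLA.F64 d2, d1, d0         ; d2 := coefficient + accumulator * x
            VMOV.F64 d1, d2             ; becomes the new accumulator
            SUBS     r1, r1, #1
            BNE      eploop
            MOV      pc, lr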
Clive Semmens (2335) 3276 posts |
Not sure whether it’s microcoded or hardwired – I’m pretty sure that’s not specified in the architecture, so an implementer would be free to do whichever they preferred. I’m pretty sure it’s got to be that kind of algorithm though, and I’m very sure it’s not exposed for public use. |
Rick Murray (539) 13805 posts |
To me that question is like “does the ARM expose how MUL works?”. Am I misreading something?
It’s a lot like the older FPA, only with different instructions and more capabilities. And, from the point of view of a programmer, the data word order is the other way around, so using VFP in a program written for FPE (any Norcroft C program) isn’t simply a matter of patching in some VFP instructions – though you can fake it by loading the double precision register as two single precision halves, like:
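    ; a sketch – assumes r0 points at an FPA-format double, which stores
    ; its most significant word first; VFP on little-endian ARM expects
    ; the least significant word first, so load the halves swapped:
    FLDS    s1, [r0]        ; high word of the value -> top half of d0
    FLDS    s0, [r0, #4]    ; low word -> bottom half of d0
    ; d0 now holds the value; store the pair back in the same order
    ; (FSTS) to produce an FPA-format result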
|
Jeffrey Lee (213) 6048 posts |
VFP only provides addition, subtraction, multiplication, division, abs, negation, multiply-accumulate (including fused multiply-accumulate in VFPv4), and square root (and different rounding modes). NEON supports the same operations as VFP, except for square root. Instead it supports reciprocal and reciprocal square root. They use lookup tables and Newton-Raphson iteration to arrive at the result – with the programmer in control of how many iterations are performed (VRECPE, VRECPS, VRSQRTE, VRSQRTS instructions) |
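The VRECPS step computes the correction factor 2 − (x · est) of the Newton-Raphson iteration estₙ₊₁ = estₙ(2 − x · estₙ). A minimal sketch for the reciprocal of four packed floats (the register choice is arbitrary, an assumption for the example):

    VRECPE.F32  q1, q0      ; table-driven initial estimate of 1/q0 (~8 bits)
    VRECPS.F32  q2, q1, q0  ; 2 - (q1 * q0)
    VMUL.F32    q1, q1, q2  ; first Newton-Raphson refinement
    VRECPS.F32  q2, q1, q0
    VMUL.F32    q1, q1, q2  ; second refinement: near full single precision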
Clive Semmens (2335) 3276 posts |
I think what Gavin wants is to be able to use the VFP’s Newton-Raphson iteration with user-supplied look-up tables for user-defined functions, which he definitely can’t do. |
GavinWraith (26) 1563 posts |
But they do not let the user specify a lookup table in memory? That would seem to me to be rather useful.
Newton-Raphson is fine if you know the derivative, as with the standard functions. For functions that are specified in some other way – say by an integral equation or a minimization problem – it might not be the most convenient algorithm. |
Jeffrey Lee (213) 6048 posts |
“They use lookup tables…”
No, the lookup table is fixed in the hardware. If you want custom lookup tables, then I think you have three options:
|
Steve Drain (222) 1620 posts |
As does an FPA/FPE, with the addition of POL (equivalent to ATAN2).
BASIC V with 5-byte floats uses Chebyshev polynomials where appropriate, not continued fractions as far as I can tell. BASIC VI uses FPA instructions. The FPE integer code is beyond me, but it is very long. ;-(
Note the missing ‘basic functions’. Also, VFP is only 64-bit, whereas the FPA can do 80-bit and always does internally. My Float module does its best to add those other basic functions and to make calling VFP friendly and backward compatible with FPE. Remember ‘context switching’.
But those are only 32-bit, which I think makes them a no-go for exploiting in BASIC, at least. |
Jeffrey Lee (213) 6048 posts |
“CPUs generally seem to offer fp arithmetic (+, -, *, /, abs, round) and an assortment of basic functions (exp, log, power, sqrt, sin, cos, tan, asin, acos, atan).”
The original FPA co-processor that was available for the ARM3 implemented all the arithmetic functions in hardware. But for the FPA in the ARM7500FE, there’s no hardware support for the trig or power functions (except square root) – POW, RPW, POL, LOG, LGN, EXP, SIN, COS, TAN, ASN, ACS and ATN are handled by FPEmulator (see section 10.2.3 of the ARM7500FE data sheet).
“NEON supports the same operations as VFP, except for square root.”
For floating-point, yeah. But maybe there’s some fun stuff which could be done for integers (array/matrix operations). |
Steve Drain (222) 1620 posts |
Now, I did not know that. Having read the data sheet, I see that the recommendation is to use library functions (in C) to implement the missing instructions, because that is more efficient than letting the FPE implement them through exceptions. This would also seem to be the best route for using VFP, rather than an FPE that uses VFP instructions. The problem is backward compatibility, I suppose. Is the source for that version of the FPE likely to be available anywhere?
Having read every NEON instruction while writing the StrongHelp VFP manual, I am staggered by the things that might be done. |
Rick Murray (539) 13805 posts |
Mmm – like unpacking an RGB bitmap into separate registers for each colour component with one instruction, and then, after processing, putting the RGB data back together from the separate registers, again using one instruction. There’s some interesting stuff in NEON. Just a shame I’m too dumb to understand most of it… |
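That unpacking is NEON’s structured load/store family. A minimal sketch for eight pixels of packed 24-bit RGB (the source and destination pointers in r0 and r1 are assumptions for the example):

    VLD3.8  {d0, d1, d2}, [r0]!   ; de-interleave: d0 = 8 reds,
                                  ; d1 = 8 greens, d2 = 8 blues
    ; ... per-channel processing here ...
    VST3.8  {d0, d1, d2}, [r1]!   ; re-interleave back to packed RGB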
Jeffrey Lee (213) 6048 posts |
ARM allowed the FPEmulator source to be released under a BSD license a few years ago: https://www.riscosopen.org/viewer/view/mixed/RiscOS/Sources/HWSupport/FPASC/ (“FPASC” = “FPA support code”, I guess). All the interesting stuff will be in the coresrc folder. |
Clive Semmens (2335) 3276 posts |
There is indeed. I loved documenting it, it’s so elegantly designed. It’s a pity I’ve never had the opportunity to use it in anger…
Yeah, right, of course you are. Not. Never learnt the relevant skills to make use of it, perhaps, but that’s not the same thing at all at all. |
Rick Murray (539) 13805 posts |
Unfortunately, there is a technical issue. 😕 |
Clive Semmens (2335) 3276 posts |
If that’s purely an arithmetical dyscalculia, it shouldn’t be an impediment. If it’s more general, fair point and I’m sorry. (And sad, but not ashamed.) |
Steve Pampling (1551) 8154 posts |
Funny things, mental blocks.1
1 According to my A level maths teacher I was useless at it; according to my college lecturer, my A level maths teacher was an idiot. |
Rick Murray (539) 13805 posts |
It isn’t really. Maths is not a big part of my life, though in a discussion regarding FP behaviour I’m probably the worst person to be in the discussion. ;-)
I have no difficulty with three dimensional visualisations, and I see estimating which of two things is more as a logic problem rather than a maths one. It’s like when you’re at the till and it comes to 9.86 and you give the woman 10.06, and she hands the extra six cents back because she isn’t able to understand that giving back twenty cents is going to be quicker and simpler than finding fourteen cents.
It was fascinating to see that the inability to associate a name with a face is part of this. I’m not sure why, but that’s something I have trouble with. I can remember faces and recognise them with the briefest glance – I proved this when I was about eight, waiting in Basingstoke to catch a train to Southampton. I spotted another crew member on the intercity as it barrelled through the station. Everybody said that was impossible, so I told them where on the train he was, what time it was, and described what he was wearing. All correctly. But could I tell you his name? Nope!
No trouble with time. I used to get detention a lot (teachers don’t like being corrected, it seems, but then I was a bit of a troll1), so I used to count off my “stand in the corner” time with great accuracy. I’m less accurate now, but I think that’s a perception thing to do with aging.
1 Some might say “obnoxious git”, but then I’d say that of some teachers. Truth is, I found school mind-numbingly boring until maybe third form senior. Of course, we are dealing with an education system where the response to pre-teen me reading John Wyndham when the other kids were looking at books with Ladybird print and pictures was “he can’t”. What? Sc**w you, lady! |
Steve Pampling (1551) 8154 posts |
Oh, that routine. My father had to deal with people doing that three times over, although with my youngest sibling being so much younger, the teenage me had lots of fun winding up the critical adults. |
Theo Markettos (89) 919 posts |
I don’t think they failed to take it seriously; I think they made some bad decisions early on that caused trouble later. The idea of specifying FP instructions and emulating them doesn’t seem a bad idea from the perspective of 1983, when the instruction set was designed. Using instructions as an API between applications and the OS (see also SWI) seems like a natural fit – especially when it’s possible to easily replace them with a hardware implementation.
For instance, they had a podule with a Western Electric WE32206 floating point unit on it. FPEmulator translated FP instructions into native commands to the FPU, waited for the results and returned them. That made sense in a 1985 kind of world (indeed, the contemporary MC68881 FPU documentation suggests a similar idea). But as soon as caches came along this became infeasibly expensive – easier to compute on the integer CPU than pay the I/O latency, and the CPU can’t get any useful work done while waiting for the FPU reply.
Bearing in mind that ARM1/2/3 were designed on the basis of no money and no people, they had to be as simple as possible to make it work at all. So adding an on-die FPU (expensive in terms of die area) was out of the question.
Move forward into the ARM6/7 timeframe and the CPU is being developed by ARM. The increasing costs of the undefined instruction start to bite: pipeline flush, register save, FP instruction decode, potentially cache and TLB invalidates (bearing in mind ARM was not designing for RISC OS any more). It’s cheaper just to compile soft-float.
Meanwhile the market has moved from workstations (eg Acorn) to embedded (Psion, Nokia and friends). Nokia wasn’t interested in FP; they cared about power consumption. Process scaling meant the FPA could fit on-die (viz the 7500FE), but the integer side was only competitive for a short while in the STB market before others overtook it.
Having made those wrong turns, I think ARM were right to scrap it all and start again. Perhaps Acorn’s final wrong turn was to cling to the FPA world for too long, rather than admitting defeat and going soft-float.
Today, RISC OS still suffers from the compounded bad decisions: FPA code is still being generated, even though the last processor to execute it natively was twenty years ago. FPEmulator still emulates FPA in software, even though VFP instructions are now available. BASIC, unusually, supported soft-float from the beginning – getting the decision right with hindsight. Its later FP cousin is still using emulated FPA rather than VFP. And SWIs are still the general-purpose API between software components (rather than the means of escalating privilege that most OSes restrict them to today, due to the costs). Most of those are fixable, but we are where we are. |
GavinWraith (26) 1563 posts |
Many thanks for this, Theo. There were some very bright people at Acorn, and I think that there could have been intellectual reasons for underestimating the importance of FP.
Computers are finite digital machines. The leap in mathematics from integers to real numbers is over an abyss that inescapably involves infinitary notions (suprema). Those notions have to be faked on a computer. For example, a stack is a way of faking an arbitrarily long array. Similarly, in programs we fake meaning with types. But there are lots of ways of doing the faking, and the utility of each depends on the intended use. The concept of program lulls us into forgetting that useful programming is a public business; protocols are needed so that everybody concerned is singing from the same hymn-sheet. Which is why IEEE standards for FP are important, for example. In other words, this becomes an organizational as well as a merely intellectual problem of implementing mathematical ideas.
I often get upset when I find people thinking that real numbers have to be implemented using floating point. Convenient if a number must be represented with a fixed number of bytes, but you sacrifice the rules of algebra – multiplication of floating point numbers is not always associative (in IEEE double precision, (0.1 × 3) × 10 gives 3.0000000000000004, while 0.1 × (3 × 10) gives exactly 3). Whereas if you use lazy streams of rational approximants you get the algebra correct, but hellish difficulties in estimating storage requirements.
Acorn had plenty of intellectual skills, but perhaps I do not need to complete this sentence … ;) |
Rick Murray (539) 13805 posts |
“…and the CPU can’t get any useful work done while waiting for the FPU reply.”
Is this still the case? I thought the likes of VFP could work in parallel so long as results were not needed immediately, rather like the behaviour of the Cortex dual pipeline.
“FPA code is still being generated…”
Yup. We’ve long passed the point at which the Norcroft compiler should have been emitting FPA only as a compatibility option, rather than FPA being the only thing it can do for floats.
“…admitting defeat and going soft-float.”
How is soft float more optimal than hardware FP? Wouldn’t the most logical way forward be to introduce VFP support and transition software to use that, rather than an emulation of an ancient device?
There is floating point and there is fixed point, but more specifically both are approximations. Just like how 96kHz 24-bit audio is the new best thing, despite decades of 44.1kHz (CD) or 48kHz (DVD) at 16-bit – possibly more levels of accuracy than a lot of hardware is capable of reproducing, and certainly more than our ears are capable of sensing – it’s only an approximation of the sound, and the higher spec protocol is just the same, only with more data to (hopefully) make it a better approximation. This shows up in my weather program, where reading wind speed data from the sensor (it gives raw data in m/s) and translating it to a human friendly format like km/h or mph (in BASIC) often results in wind speeds like 12.3000000001 km/h. Because real world stuff on a digital logic device is merely an approximation. The question is, what level of accuracy is acceptable?
Yet time and again you will see two things: |
Jon Abbott (1421) 2641 posts |
Wasn’t the FPA10 restricted to 25MHz? Was a faster FPA ever released? I expect the reason we’re in this state is because Acorn implemented a “RISC like FPA” (their words); by RISC they simply meant “not a full implementation”. As a consequence the FPA needed assistance before you could use it, making it pretty useless as a general-purpose FPA and only really relevant to carefully written code that used only normalized values and specific instructions. Requiring hand-holding meant several versions of FPEmulator: one which was totally software based, and another which handled the unimplemented scenarios. That’s fair enough – the FPA10 was pretty advanced for the day. The only problem is, they never produced an FPA that implemented the full instruction set and handled non-normalized numbers etc. So, in short, hardware FPA was rocky from day 1. Do any ARM chips these days even implement an FPA? I expect not; as it’s a co-processor, it will at best be implementation defined or, at worst, not even an option. |
Clive Semmens (2335) 3276 posts |
You expect right. Nothing since the ARM7500FE. I think that’s actually the only chip that’s ever had the FPA onboard. |
Jeffrey Lee (213) 6048 posts |
I think someone needs to teach Rick the concept of “history” :-)
“…and the CPU can’t get any useful work done while waiting for the FPU reply.”
Theo was talking about 80s-era FPUs. Modern FPUs definitely avoid blocking other parts of the pipeline as much as possible.
Theo was talking about emulated FP on ARM6/7. Clearly if you’ve got hardware FP being handled by the CPU (either integrated or on the coprocessor bus) then there should be no extra cost in instruction decoding. (Although it wouldn’t surprise me if earlier coprocessor-based designs did have an extra cycle of latency because they had to decode the instruction after being fed it by the ARM.)
“How is soft float more optimal than hardware FP?”
It isn’t. Theo is talking about soft float vs. emulating FP hardware.
“Wouldn’t the most logical way forward be to introduce VFP support…?”
Theo is talking about the mid-90s, when VFP didn’t exist. |
Rick Murray (539) 13805 posts |
History… oh, that’s that thing where lots of people die for the same stupid reasons over and over again… What I was asking was to compare what was with what is: for example, how does ARM rationalise FP operations now, given the potential for stalling while awaiting a result? Is it more efficient to string together FP instructions, or to interleave FP and ARM instructions so the ARM has something to do while the FP unit is busy? Not that I plan to be using FP ops. Just wondering… |
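A sketch of the interleaving being asked about – instruction latencies vary by core, so treating the divide as long-latency is an assumption here rather than a guaranteed win:

    VDIV.F64  d2, d0, d1    ; long-latency divide begins
    LDR       r3, [r4], #4  ; integer work continues while it runs
    ADD       r5, r5, r3
    VSTR      d2, [r6]      ; first use of d2: the pipeline only stalls
                            ; here if the divide has not yet finished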