FP support
David Feugey (2125) 2709 posts |
Quick question about FP support. What’s the state and possibilities for FPU support in… ??? |
jim lesurf (2082) 1445 posts |
Just want to echo the question to flag that I also would like FP hardware support. Jim |
David Feugey (2125) 2709 posts |
Note, there are two subjects here: That also supposes that FPEM will trap FPA instructions for software that uses Hard Float mode on computers without an FPU. |
Steffen Huber (91) 1966 posts |
Basic for all Basic code: certainly not, because Basic V has its own float implementation and does not use the FPE at all. Maybe Basic VI could be enhanced? |
Jeffrey Lee (213) 6048 posts |
Using VFP for BASIC64 is certainly a possibility. However there are a couple of annoyances which would need resolving:
For the DDE, objasm 4.0 introduced full support for the VFP/NEON instruction set, so assembler code can use it as much as it wants. C support is IIRC held back by lack of development time/money, and lack of a plan on what PCS should be used – i.e. whether to create a new version of the APCS or whether to use an existing version (e.g. whatever GCC currently uses – AAPCS?). The ARMv7 inline assembler bounty will get things part-way there, but it’s just the tip of the iceberg. For the FPEmulator, producing a version which uses VFP is something that I did consider early on in development but have later dismissed, for a few reasons:
|
Ben Avison (25) 445 posts |
Technically, objasm 4.01, but yes. Objasm supports a huge number of PCS variants (but then it’s easy for objasm, it doesn’t need to do anything to marshall the arguments itself, just set flags in the object file).
One of the easier approaches is to base it on the softfp PCS that the compiler already supports – which uses FPA endianness for doubles. The main thing lacking is an implementation in the C library of all the runtime support functions that would need, though it would provide a simple way to avoid optimising for any one platform over the others, by ensuring that the ROM C library uses the best instruction set for the platform. For example, the double + double function _dadd could be implemented in the IOMD ROM C library as STMDB r13!, {r0-r3} and in the ROM C library for ARMv6+ VMOV d0, r1, r0 ; note, no particular penalty for FPA ordering and for the ROM C library for Iyonix, or for softloadable versions, you’d pinch the integer implementation from FPEmulator to do the same thing. Quite a lot of work to do the whole set of functions three times over, but doesn’t need too much exposure to the compiler (other than breaking its assumption that even in the softfp case you can return floating point types in FPA registers, because they don’t actually exist in new hardware). |
Steve Drain (222) 1620 posts |
Here’s what I think I know, which is not comprehensive:
FPE remains largely as it has been. Assembler can be written using FPA co-processor instructions that will use it. BASIC VI uses those instructions for its float type.
Assembler can be written using VFP/NEON instructions for those processors that support them. The more recent versions of BASIC will assemble such instructions, but they are not documented under the HELP [ command.
The VFP instructions are limited in scope and cannot directly do all that the FPA/FPE offers. There are single and double precision instructions, but not extended. There are no transcendental instructions: SIN, EXP etc.
I have written two documentary aids:
A BASICHelp module that attempts to reproduce BASIC’s HELP command as a *command, with more information and the extended VFP instructions. This is not registered and uses a syntax that might not meet approval, so comments are welcome. I have not done anything with it for more than a year.
A VFP/NEON StrongHelp manual encapsulating the extensive information Jeffrey posted here a long while back. This is fairly sound for VFP, but needs more editing of the complex NEON instructions, although I doubt that these are likely to be needed any time soon. ;-)
I have also written a Float module that provides double precision floating point support through SWIs. This uses VFP when available and FPA when not, through a single interface. This does have transcendental functions implemented using VFP instructions. The SWI interface is a large overhead and the speed increase is small, but the code can be called directly for significantly better performance. This is not registered and comments are welcome.
I have done some work with Basalt to implement double precision floats in BASIC V using my Float module, but this is not published.
In reply to Jeffrey:
Float treats all external double precision floats as big-endian and converts to little-endian for VFP.
Float provides all the transcendental operations of FPA.
Float uses FPA in the absence of VFP.
Float has my best efforts at providing suitable algorithms. As far as I can judge, accuracy is as good as FPA, except for the POW function, which may lose one place in 17. The source is available, so if anyone wants to use those algorithms in a more suitable format, they have my blessing.
Float requires a program to create and release a context with Float_Start and Float_Stop, which do a little more than just VFPSupport.
Calling the code directly is a very significant gain over FPA, despite the overheads.
Float takes the reverse approach, but to the same end. Changing that approach would not be difficult. I did nearly all this about 18 months ago and have hardly visited it since. There are certain to be improvements to be made, so I would welcome some feedback. ;-) |
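The word-order conversion mentioned in the first point of the list above can be sketched in C. This is only an illustration, not the Float module's code: it assumes that "big-endian" there means the FPA convention of storing the most significant 32-bit word of a double first, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a native double from two 32-bit words held in FPA order
   (most significant word first). Assumes the host stores a double
   with the same byte order as a 64-bit integer, as little-endian
   ARM with VFP does. Function names are hypothetical. */
double double_from_fpa_words(uint32_t high_word, uint32_t low_word)
{
    uint64_t bits = ((uint64_t)high_word << 32) | low_word;
    double value;

    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Split a native double back into FPA-ordered words. */
void double_to_fpa_words(double value, uint32_t *high_word, uint32_t *low_word)
{
    uint64_t bits;

    assert(sizeof value == sizeof bits);
    memcpy(&bits, &value, sizeof bits);
    *high_word = (uint32_t)(bits >> 32);
    *low_word = (uint32_t)(bits & 0xFFFFFFFFu);
}
```

Going through memcpy rather than a pointer cast keeps the sketch within standard C; on real hardware the swap would more likely be done in registers, as in Ben's VMOV d0, r1, r0 example.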
David Feugey (2125) 2709 posts |
Thanks for this very long (and useful) answer. I agree on the VFP versus legacy FPA point. Just need to get it in C :) For Basic, could I suggest opening a branch for a beta of BBC Basic VI? So if something does not work, it’ll be a compatibility problem we don’t need to solve. The same way some code made for BBC Basic IV does not work on V. Of course, if ABC can support BBC Basic VI, everything will be perfect. If you borrow trig functions from existing ones, it could even be a 5 minute job :) It could be a good idea to borrow some ideas from the PC version. The closer the two products are, the better it’ll be for us. For example, BBCBasic4Win provides 80bit floats and 64bit integers. Unicode is supported too. Could be cool for BBC Basic VI, as there are efforts towards Unicode support in RISC OS 5. Another fantastic idea would be the possibility to define & rewrite keywords. It was possible, but never done. Basically it’s a trick to implement in the parser. Could be fantastic, to change some features, or extend Basic (by loading a library of keywords implemented in Basic or assembly). A good way to make the core of BBC Basic smaller and to give ‘non power users’ the possibility to work on the evolution of the project. For more modern use, a parser that can accept keywords in lower case, too. |
Fred Graute (114) 645 posts |
A useful resource Steve, especially as I’m extending StrongED code colouring to cover ARMv7 and VFPv3. Unfortunately the links on the VFP page don’t work due to a rogue #prefix command. I also notice that negative multiply instructions (VNMUL etc) are missing. (They are also missing in BASIC V 1.60 but Debugger 1.90 does know them.)
That will be very handy to test the new code colouring. Most instructions are already coloured correctly, just a few more to go. |
Rick Murray (539) 13958 posts |
Do you have any statistics? Every so often when discussing CLib, the “quirk” of linking directly into the module itself is raised, and calling C functions via SWIs is often suggested instead. Aside from the constant jumping in and out of SVC mode, I can only imagine the sort of speed hit that repeatedly calling the SWI handler would cause, over one load and two branches (which is what the jumptable method requires). Your post implies that you might have tested things and so have figures for direct entry vs SWI’d entry. Do you? If so, please share! |
David Feugey (2125) 2709 posts |
Note: Basic VI could definitely be a bounty. I have about 300 € ready for this one. |
jim lesurf (2082) 1445 posts |
Afraid I don’t understand all the above details. My situation is that I write programs that get compiled/linked using the ‘ROOL’ ‘C’ compiler. For precision, etc, these use double floats a lot of the time. So I simply wish for a situation I had eons ago when I had a machine with real FP hardware. i.e. that all the double floating point instructions got done by the relevant hardware. The result tended to be about a 20x speed hike. Makes a big difference when doing something like going through a 100 MB audio file and FFT’ing all the chunks in that. Maybe I could use some of the above I don’t understand like ‘Neon’ or ‘Float’. But would that mean total re-writes of the programs? If so, not keen as it would seem more sensible for the hardware to support the established language and not make added assumptions about what is available. Jim |
Steve Drain (222) 1620 posts |
I can see that #prefix command, but with the old version of StrongHelp that I mainly use it does not cause any problem, ie: the prefix is a null string. I have a much more fully edited version here and I will try to check it and upload it soon. Thanks.
If they are missing it is most likely because Jeffrey did not include them. The basic production of the manual was directly from his file. If you have details of what is missing, I will include it. |
Steve Drain (222) 1620 posts |
I think I have fooled you, and perhaps myself. My attention is almost entirely towards BASIC, so when I refer to calling SWIs I am really thinking of SYS. That adds quite a large additional overhead. At a theoretical level, a Float SWI call is in place of a single machine code instruction. With the FPE this actually hides a considerable amount of integer code and overheads may not be very significant. With VFP it could be just that one instruction and any overhead is likely to be significant. Even for the transcendental operations that need a number of VFP instructions, SYS/SWI overheads seem to be significant. The testing I did is from BASIC and the code is included with the module. It is the usual FOR NEXT timing loops. I did not record timings, but noted the differences and satisfied myself that a direct call of the code was very much faster. It might be worth noting that BASIC VI has to have a fair amount of overhead before it calls just one FPA instruction to implement its floating point operations. ;-) |
Steve Drain (222) 1620 posts |
BASIC VI already exists; I claim my 300 euros. ;-) Seriously, I am puzzled by what you want to see, or imagine can be done. If you could produce an actual specification it could be assessed more accurately.
To some extent I think this reflects the underlying machine, but remember, ARM BASIC dates back more than 25 years and has not had the continual attention that BB4W has enjoyed from Richard Russell. If you want to use 80-bit floats, then you can write assembler for the FPE and hide it away in BASIC routines, but it will be slow and no modern ARM processor supports them. If you want to use 64-bit integers you can do the same, and I am not alone in writing a library to do this. It is also on my list of things to include in Basalt.
How possible? And why? I do extend the use of keywords with Basalt, but that is not integral to BASIC. I have long considered the possibility of modularising, but have yet to see a way. As for BASIC itself, the code is very unfriendly to such a concept, I think.
How is this different from libraries of PROCs and FNs? |
Dave Higton (1515) 3584 posts |
That’s what I thought. I was puzzled by these references. Didn’t BASIC VI exist in the 1980s? What became of it? |
Martin Avison (27) 1512 posts |
Enter *help basic64 at a command line! Basic VI is the FP version of Basic V |
Steve Drain (222) 1620 posts |
BASIC VI is the release of interpreter version 5 [1] that uses 64-bit floats. Otherwise BASIC V and VI are pretty well identical to the programmer. It was a soft-load option but is now included in the ROM.
[1] In case anyone quibbles with this terminology, look at the identity word in the BASIC module preceding the environment information pointer passed in R14 to CALL. In both cases it is &BA51C005. |
Rick Murray (539) 13958 posts |
The thing is, consider the following program:
The nonsense with _kernel_swi() is to prevent the compiler being smart and optimising out most of the code. ;-) This translates to:
It looks pretty good, right? The problem is, these are FPA instructions. If a hardware FPA is available, it will execute the instructions. If not, the ARM will raise an undefined instruction exception at which point the FPEmulator will step in and perform the operation. As you can see, executing this nice tight little three instruction FP multiply could involve something in the order of three hundred instructions being executed. If you are a maths nerd, you could probably do something better using fixed point and integer maths to fake it… There are two alternatives. The first is “VFP”, a newer type of floating point built into most ARMs made in the last decade.
I think. I reserve the right to be utterly wrong. At any rate, it takes about six FP instructions instead of hundreds of ARM ones. The alternative FP implementation, found on ARMv7A (Cortex-A) is called NEON. It is supposed to be three times faster than VFPv2 (ARMv5) and twice as fast as VFPv3?4? (ARMv6); but it comes with caveats. It isn’t IEEE compatible and only works with single precision mathematics.
Why? The program above could be recompiled to use VFP or NEON simply by passing flags to the compiler and recompiling. Unfortunately, passing “-cpu cortex-a8” to the compiler still generates FPA code. Maybe in the future it would be able to use something more appropriate to the processor type/family chosen. It would require a little more complication in CLib, though, in order for it to recognise different types of float in printf() and so on, for the float variations may be NEON, VFP, or FPE.
Sometimes you have to draw a line. Would you prefer half a dozen FP instructions, or hundreds of ARM instructions? Remember, I am talking about potentially one hundred ARM instructions per FP instruction. Sooner or later we will have to accept that going native is the only sensible option for programs that make heavy use of FP. For me, I am not that bothered as I rarely use FP. I used to for working out percentages, but by rearranging the calculation I can do it in straight integer maths. For you, with the work you do on audio samples, I can imagine a lot of FP would be necessary. I wonder how much faster (less system load) AMPlayer would be if there was a VFP version… |
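As an illustration of the integer-only approach mentioned above (rearranged percentages, and fixed point to "fake" fractional maths), here is a minimal C sketch. The function names and the choice of a Q16.16 format are assumptions made for the example, not anything taken from Rick's own code.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of part in total, rounded to nearest, using only integer
   arithmetic. Multiplying before dividing keeps the precision that
   (part / total) * 100 would throw away. */
uint32_t percent_of(uint32_t part, uint32_t total)
{
    assert(total != 0);
    return (uint32_t)(((uint64_t)part * 100u + total / 2u) / total);
}

/* Q16.16 fixed point: the top 16 bits hold the integer part and the
   bottom 16 bits the fraction. */
typedef int32_t q16_16;

#define Q16_ONE 65536 /* 1.0 in Q16.16 */

/* Multiply two Q16.16 values. The 64-bit product carries 32 fraction
   bits, so it is scaled back down by 2^16 (truncating toward zero). */
q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) / Q16_ONE);
}
```

The price of this approach is range and precision: Q16.16 only covers roughly -32768 to 32767 in steps of 1/65536, so it suits percentages and simple scaling far better than something like an FFT over audio data.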
David Feugey (2125) 2709 posts |
I mean VII :)
Nothing, or many things. SupermanLee said that VFP support could break compatibility with some code. So just change the version, and give people the choice to use V, VI or VII. No more problems, and many opportunities to make other, bigger changes.
To provide Basic & non system programmers a way to create and modify keywords. Basic programmers prove every day that they can make useful things. But they need simple interfaces to help RISC OS. The same with other parts of the OS (skeletons for image conversion modules, for example, could help non system C developers to port things). I love plug-ins :)
You could define new keywords, or even change some existing ones. Richard, for example, made a big change to the sound command, available as a patch for BBCB4Win. Just load it, and play. To change BBC Basic is not for everyone, but to extend it from Basic would be – IMHO – simpler. That’s just a parser thing anyway (the new sound could simply be replaced on the fly by some FNnewsound) |
jim lesurf (2082) 1445 posts |
Sorry, afraid I still didn’t understand how to use VFP/NEON without changing my existing ‘C’ code. I think you may be answering a question I wasn’t asking! :-) I do understand how any FP instructions are caught, and lacking real access to FP hardware are emulated by bucketloads of int instructions, etc. That’s why the process is so tediously slow. What I’m asking is if there is a way now (or soon) for having the hardware simply access and use FP hardware without my having to change my existing C code, etc. At present I’ll have lots of lines with things like a = b*c; z = g/pi; etc. where a, b, etc are all double precision floats, and of course all the calls like a = cos(pi2*f); ditto. How do I tell the machine now to handle the resulting compiled and linked code using FP hardware? And if not now, when/how? One point of course is to avoid users having to recompile if their machine lacks these hardware alternatives to the FPE. I understand the point of having an FPE which can trap and handle via integer or pass on to accessible FP hardware. It means the person compiling and linking doesn’t have to worry about generating multiple versions and ensuring the user runs the ‘correct’ one. The only worry being the dramatic difference in speed for the users between the two, which is unavoidable. FWIW I did give away my remaining FPA11 (IIRC the chip number) years ago. Maybe I should have kept it as a reminder of what we have since lost because Acorn seemed to decide this area simply didn’t matter. My guess is that what is needed is a modern update to the FPE which traps the instructions and then sends something appropriate to NEON, etc. But I’m not sure I’ve understood. Jim |
Rick Murray (539) 13958 posts |
You don’t, unless you want to drop to assembler and add your own routines. As it is, we are stuck with something that was “old” a quarter of a century ago.
That’s the question.
Or to give the compiler the ability to output VFP (NEON?) instructions and let the programmer decide? Personally, I’m concerned about the majority suffering for the minority. We ought to offer two versions of programs in that case – one for old machines, one for newer.
Oh hell yes! I never understood why Acorn seemed so against hardware FP, when it was being introduced on the competitor platform. I know that the FPEmulator is clever, but it is no match for real FP. |
Rick Murray (539) 13958 posts |
And if not now, when/how? Actually… If we can forget about NEON for now, it might be simpler than it first seems. I have the FPA10 datasheet and it claims to be IEEE 754 compliant. I also have the VFP data in the ARM ARM 2 and it claims to be IEEE 754 compliant. In this case, I would imagine saving FP registers to memory would use the same format? With this in mind, I’m going to have a crack at writing the program I gave earlier to drop to an assembler routine to use VFP instructions instead. With any luck, printf() will work, which would imply that as long as the FP values are stored to memory, the CLib functions ought to still work. |
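Since both units claim IEEE 754 compliance, the layout of a double's fields is fixed by the standard: 1 sign bit, 11 exponent bits biased by 1023, and 52 fraction bits. A small C sketch of that layout follows; the type and function names are hypothetical. It says nothing about the order in which the two 32-bit words sit in memory, which, as Ben and Steve note earlier in the thread, is where FPA and VFP differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE 754 double (hypothetical type name). */
typedef struct {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, biased by 1023 */
    uint64_t fraction;  /* 52 bits; the implicit leading 1 is not stored */
} ieee_double_fields;

/* Pull a double apart into its IEEE 754 fields. */
ieee_double_fields split_double(double value)
{
    uint64_t bits;
    ieee_double_fields f;

    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);

    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.fraction = bits & ((((uint64_t)1) << 52) - 1u);
    return f;
}
```

So a value stored by one unit should be readable by the other provided the two words are swapped where necessary; whether printf() then copes depends on which order CLib expects.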
Dave Higton (1515) 3584 posts |
So let me ask a naive question or several. Could the shared C library discover what FP system (if any) a given platform has, and use the best available? Could BASIC use the shared C library? Should it? And presumably anyone writing code could use it, although the documentation may or may not be adequate. |
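One way to picture the "discover and use the best available" idea is a function pointer chosen once at start-up. The sketch below is purely illustrative: the enum, the function names and the init step are all hypothetical, the two implementations are plain C stand-ins for what would really be separate FPA-free and VFP assembler routines, and the hardware detection itself is deliberately left out because it is platform specific.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of what the platform offers. */
typedef enum { FP_NONE, FP_FPA, FP_VFP } fp_backend;

typedef double (*dadd_fn)(double, double);

/* Stand-ins: a real library would supply an integer-only routine and
   a VFP routine here. Both are written in plain C for illustration. */
double dadd_soft(double a, double b) { return a + b; }
double dadd_vfp(double a, double b) { return a + b; }

/* Pick the implementation that suits the detected hardware. */
dadd_fn select_dadd(fp_backend backend)
{
    switch (backend) {
    case FP_VFP:
        return dadd_vfp;
    case FP_FPA:
    case FP_NONE:
    default:
        return dadd_soft;
    }
}

static dadd_fn current_dadd = NULL;

/* Called once at start-up with the result of hardware detection. */
void fp_init(fp_backend detected)
{
    current_dadd = select_dadd(detected);
}

/* The entry point the rest of the program calls. */
double dadd(double a, double b)
{
    assert(current_dadd != NULL); /* fp_init must have run first */
    return current_dadd(a, b);
}
```

The indirect call adds a small cost per operation, which matters when the operation itself is a single VFP instruction. That is the trade-off Steve describes for SWI calls, and part of why Ben suggests building a different C library into each ROM instead of choosing at run time.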
David Feugey (2125) 2709 posts |
Not with a VFPEmulator module
Perhaps because it’s much more difficult to design than an ALU? :)
Only needed for the VFP/Neon problem. For the other cases (FPU/no FPU), VFP emulation will be OK. Of course, if there is no VFP emulation, we need two versions of a program: one for VFP, another for FPA (that will use FPEmulator). For speed issues and old software, FPEmulator should be able to use VFP if present. It’s what we call Soft FP. So 3 things here: For Basic and other ASM software, it can be directly VFP code. But you’ll need one of the two: For a potential BBC Basic VII, I suggest VFP mode, so just for modern motherboards… until we have a VFPEmulator module. Some functions will be a bit different, as between V and VI. Could also be the occasion to solve some problems with zero page. |