FP support

280 posts, 29 voices


Jul 30, 2015 6:28am David Feugey (2125) 2709 posts	Quick question about FP support. What are the current state and the possibilities for FPU support in:
- FPEmulator (for old applications)
- DDE (for new ones)
- Basic (for all Basic code)?

Jul 30, 2015 8:20am jim lesurf (2082) 1445 posts	Just want to echo the question to flag that I also would like FP hardware support. Jim

Jul 30, 2015 11:10am David Feugey (2125) 2709 posts	Note that there are two subjects here:
- FPEM to FPA bridge (SoftFP)
- Hard Float for DDE & Basic

This also supposes that FPEM will trap FPA instructions for software that uses Hard Float mode on computers without an FPU. The state today is Soft Float. SoftFP could be a first step in the right direction.

Jul 30, 2015 12:21pm Steffen Huber (91) 1969 posts	Basic for all Basic code: certainly not, because Basic V has its own float implementation and does not use the FPE at all. Maybe Basic VI could be enhanced?

Jul 30, 2015 1:42pm Jeffrey Lee (213) 6048 posts	Using VFP for BASIC64 is certainly a possibility. However there are a couple of annoyances which would need resolving:
- Word ordering. Although FPA was little-endian, the two words that made up a double were stored in big-endian order (i.e. high 32 bits were first, low 32 bits second). We’d have to decide whether to stick with that big-endian ordering or switch to VFP native little-endian ordering. This has the potential to result in compatibility issues, because there are places where BASIC allows the FP values to leak into external code (e.g. assembler routines are given access to some of BASIC’s internals).
- Lack of trig functions in VFP. We’d need to track down a suitable software implementation (either full integer math or using VFP), or have it fall back to using the FPA instructions. Not an impossible task, just something that changes it from a quick five minute job to something that will need a bit more planning and testing to make sure the new implementations are sufficiently accurate.

For the DDE, objasm 4.0 introduced full support for the VFP/NEON instruction set, so assembler code can use it as much as it wants. C support is IIRC held back by lack of development time/money, and lack of a plan on what PCS should be used – i.e. whether to create a new version of the APCS or whether to use an existing version (e.g. whatever GCC currently uses – AAPCS?). The ARMv7 inline assembler bounty will get things part-way there, but it’s just the tip of the iceberg.

For the FPEmulator, producing a version which uses VFP is something that I did consider early on in development but have later dismissed, for a few reasons:
- One reason is that the design of VFPSupport doesn’t really allow for it – with FPEmulator there is always a FPA context which programs can rely on existing. But with VFPSupport programs need to manually request a context. So FPEmulator can’t assume that a context is going to be available at any given point, and the overhead of switching context for every emulated FPA instruction could be quite significant. I could probably add some kind of back-door to work around it, or we could switch to a design where there is always a default context available, but it’s the kind of change I don’t want to make unless there’s a clear need for it.
- Even if there was a VFP context already set up, it’s not clear how much of a performance gain you’d get – you’d still have the overhead of going through the undefined instruction vector and decoding the FPA instruction. Plus there are some operations (trig operations, or anything using extended precision) which can’t be done in VFP.
- FPA is legacy. Like, really really legacy. VFP, on the other hand, is almost a given – you’re highly unlikely to see a (popular) port of RISC OS to a device which lacks VFP hardware. So it would make much more sense to make VFP the default FP target rather than FPA, and to produce a version of VFPSupport which provides full emulation so that older computers can run VFP code.

New machines will benefit from full hardware FP performance, without impacting the performance of old machines (unless they were lucky enough to have FPA hardware).
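The word-ordering point in the post above can be illustrated with a minimal C sketch. It assumes `double` is an IEEE 754 64-bit value, as it is with VFP; the helper names `store_fpa_order` and `load_fpa_order` are hypothetical and are not part of BASIC, FPEmulator or VFPSupport.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a double as two 32-bit words in FPA order,
 * i.e. the high 32 bits first and the low 32 bits second. */
void store_fpa_order(double value, uint32_t words[2])
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);      /* raw IEEE 754 bit pattern */
    words[0] = (uint32_t)(bits >> 32);       /* sign, exponent, top of mantissa */
    words[1] = (uint32_t)(bits & 0xFFFFFFFFu);
}

/* Hypothetical helper: rebuild a native double from FPA-ordered words. */
double load_fpa_order(const uint32_t words[2])
{
    uint64_t bits = ((uint64_t)words[0] << 32) | (uint64_t)words[1];
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Storing 1.0 this way puts &3FF00000 in the first word and zero in the second; a native little-endian VFP store would put the same two words in memory in the opposite order, which is the incompatibility external code could trip over.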

Jul 30, 2015 7:18pm Ben Avison (25) 445 posts	> For the DDE, objasm 4.0 introduced full support for the VFP/NEON instruction set, so assembler code can use it as much as it wants.

Technically, objasm 4.01, but yes. Objasm supports a huge number of PCS variants (but then it’s easy for objasm, it doesn’t need to do anything to marshall the arguments itself, just set flags in the object file).

> C support is IIRC held back by lack of development time/money, and lack of a plan on what PCS should be used.

One of the easier approaches is to base it on the softfp PCS that the compiler already supports – which uses FPA endianness for doubles. The main thing lacking is an implementation in the C library of all the runtime support functions that would need, though it would provide a simple way to avoid optimising for any one platform over the others, by ensuring that the ROM C library uses the best instruction set for the platform. For example, the double + double function _dadd could be implemented in the IOMD ROM C library as

```
        STMDB   r13!, {r0-r3}
        LDFD    f0, [r13], #8
        LDFD    f1, [r13]
        ADFD    f0, f0, f1
        STFD    f0, [r13]
        LDMIA   r13!, {r0-r1} ; TBD whether double results are returned in 2 integer registers
        MOV     pc, r14
```

and in the ROM C library for ARMv6+

```
        VMOV    d0, r1, r0    ; note, no particular penalty for FPA ordering
        VMOV    d1, r3, r2
        VADD.f64 d0, d0, d1
        VMOV    r1, r0, d0
        MOV     pc, lr
```

and for the ROM C library for Iyonix, or for softloadable versions, you’d pinch the integer implementation from FPEmulator to do the same thing. Quite a lot of work to do the whole set of functions three times over, but doesn’t need too much exposure to the compiler (other than breaking its assumption that even in the softfp case you can return floating point types in FPA registers, because they don’t actually exist in new hardware).

Jul 30, 2015 7:29pm Steve Drain (222) 1620 posts	Here’s what I think I know, which is not comprehensive:
- FPE remains largely as it has been. Assembler can be written using FPA co-processor instructions that will use it. BASIC VI uses those instructions for its float type.
- Assembler can be written using VFP/NEON instructions for those processors that support them. The more recent versions of BASIC will assemble such instructions, but they are not documented under the HELP [ command.
- The VFP instructions are limited in scope and cannot directly do all that the FPA/FPE offers. There are single and double precision instructions, but not extended. There are no transcendental instructions: SIN, EXP etc.

I have written two documentary aids:
- A BASICHelp module that attempts to reproduce BASIC’s HELP command as a *command, with more information and the extended VFP instructions. This is not registered and uses a syntax that might not meet approval, so comments are welcome. I have not done anything with it for more than a year.
- A VFP/NEON StrongHelp manual encapsulating the extensive information Jeffrey posted here a long while back. This is fairly sound for VFP, but needs more editing of the complex NEON instructions, although I doubt that these are likely to be needed any time soon. ;-)

I have also written a Float module that provides double precision floating point support through SWIs. This uses VFP when available and FPA when not, through a single interface. It does have transcendental functions implemented using VFP instructions. The SWI interface is a large overhead and the speed increase is small, but the code can be called directly for significantly better performance. This is not registered and comments are welcome.

I have done some work with Basalt to implement double precision floats in BASIC V using my Float module, but this is not published.

In reply to Jeffrey:

> Word ordering

Float treats all external double precision floats as big-endian and converts to little-endian for VFP.

> Lack of trig functions in VFP

Float provides all the transcendental operations of FPA.

> have it fall back to using the FPA instructions

Float uses FPA in the absence of VFP.

> We’d need to track down a suitable software implementation … using VFP … something that changes it from a quick five minute job to something that will need a bit more planning and testing to make sure the new implementations are sufficiently accurate

Float has my best efforts at providing suitable algorithms. As far as I can judge, accuracy is as good as FPA, except for the POW function, which may lose one place in 17. The source is available, so if anyone wants to use those algorithms in a more suitable format, they have my blessing.

> with VFPSupport programs need to manually request a context

Float requires a program to create and release a context with Float_Start and Float_Stop, which do a little more than just VFPSupport.

> it’s not clear how much of a performance gain you’d get

Calling the code directly is a very significant gain over FPA, despite the overheads.

> it would make much more sense to make VFP the default FP target rather than FPA, and to produce a version of VFPSupport which provides full emulation so that older computers can run VFP code

Float takes the reverse approach, but to the same end. Changing that approach would not be difficult.

I did nearly all this about 18 months ago and have hardly visited it since. There are certain to be improvements to be made, so I would welcome some feedback. ;-)
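Since VFP has no SIN instruction, a transcendental function has to be built from the add, multiply and divide operations it does offer. The following C sketch shows one generic way to do that for sine (range reduction plus a Taylor series). It is not the algorithm used in the Float module, whose source is not reproduced here; the name `sketch_sin` and the choice of ten series terms are assumptions for illustration, and the simple range reduction loses accuracy for large arguments.

```c
#define SKETCH_PI 3.14159265358979323846

/* Illustrative only: sine from basic arithmetic, the kind of operations
 * a VFP unit provides in hardware. */
double sketch_sin(double x)
{
    /* Range reduction: write x = k*pi + r with r in [-pi/2, pi/2]. */
    double scaled = x / SKETCH_PI;
    long k = (long)(scaled + (scaled >= 0.0 ? 0.5 : -0.5)); /* round to nearest */
    double r = x - (double)k * SKETCH_PI;

    /* Taylor series r - r^3/3! + r^5/5! - ..., built term by term. */
    double r2 = r * r;
    double term = r;
    double sum = r;
    int n;
    for (n = 1; n <= 10; n++) {
        term *= -r2 / (double)((2 * n) * (2 * n + 1));
        sum += term;
    }

    /* sin(k*pi + r) = (-1)^k * sin(r) */
    return (k % 2 != 0) ? -sum : sum;
}
```

A production version would more likely use a tuned polynomial and a more careful reduction, which is exactly the planning and accuracy testing the thread says is needed.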

Jul 30, 2015 7:30pm David Feugey (2125) 2709 posts	Thanks for this very long (and useful) answer. I agree on the point about VFP versus legacy FPA. We just need to get it in C :)

For Basic, could I suggest opening a branch for a beta of BBC Basic VI? Then if something does not work, it’ll be a compatibility problem we don’t need to solve, in the same way that some code made for BBC Basic IV does not work on V. Of course, if ABC can support BBC Basic VI, everything will be perfect. If you borrow trig functions from existing ones, it could even be a 5 minute job :)

It could be a good idea to borrow some ideas from the PC version. The closer the two products are, the better it’ll be for us. For example, BBCBasic4Win provides 80bit floats and 64bit integers. Unicode is supported too. That could be cool for BBC Basic VI, as there are efforts towards Unicode support in RISC OS 5.

Another fantastic idea would be the possibility to define & rewrite keywords. It was possible, but never done. Basically it’s a trick to implement in the parser. It could be fantastic for changing some features, or extending Basic (by loading a library of keywords implemented in Basic or assembly). It would be a good way to make the core of BBC Basic smaller and to give ‘non power users’ the possibility to work on the evolution of the project.

For more modern use, a parser that can accept keywords in lower case, too.

Jul 30, 2015 10:12pm Fred Graute (114) 651 posts	> A VFP/NEON StrongHelp manual encapsulating the extensive information Jeffrey posted here a long while back.

A useful resource Steve, especially as I’m extending StrongED code colouring to cover ARMv7 and VFPv3. Unfortunately the links on the VFP page don’t work due to a rogue `#Prefix` command. I also notice that negative multiply instructions (VNMUL etc) are missing. (They are also missing in BASIC V 1.60 but Debugger 1.90 does know them.)

> I have also written a Float module that provides double precision floating point support through SWIs.

That will be very handy to test the new code colouring. Most instructions are already coloured correctly, just a few more to go.

Jul 30, 2015 10:38pm Rick Murray (539) 14047 posts	> The SWI interface is a large overhead and the speed increase is small, but the code can be called directly for significantly better performance.

Do you have any statistics? Every so often when discussing CLib, the “quirk” of linking directly into the module itself comes up, and calling C functions via SWIs is often suggested instead. Aside from the constant jumping in and out of SVC mode, I can only imagine the sort of speed hit that repeatedly calling the SWI handler would cause, compared with one load and two branches (which is what the jumptable method requires). Your post implies that you might have tested things and so have figures for direct entry vs SWI’d entry. Do you? If so, please share!

Jul 31, 2015 6:08am David Feugey (2125) 2709 posts	Note: Basic VI could definitely be a bounty. I have about 300 € ready for this one.

Jul 31, 2015 8:24am jim lesurf (2082) 1445 posts	Afraid I don’t understand all the above details. My situation is that I write programs that get compiled/linked using the ‘ROOL’ ‘C’ compiler. For precision, etc, these use double floats a lot of the time. So I simply wish for the situation I had eons ago when I had a machine with real FP hardware, i.e. that all the double floating point instructions got done by the relevant hardware. The result tended to be about a 20x speed hike. That makes a big difference when doing something like going through a 100 MB audio file and FFT’ing all the chunks in it.

Maybe I could use some of the above that I don’t understand, like ‘Neon’ or ‘Float’. But would that mean total re-writes of the programs? If so, I’m not keen, as it would seem more sensible for the hardware to support the established language and not make added assumptions about what is available.

Jim

Jul 31, 2015 9:02am Steve Drain (222) 1620 posts	> Unfortunately the links on the VFP page don’t work due to a rogue #Prefix command.

I can see that #prefix command, but with the old version of StrongHelp that I mainly use it does not cause any problem, ie: the prefix is a null string. I have a much more fully edited version here and I will try to check it and upload it soon. Thanks.

> I also notice that negative multiply instructions (VNMUL etc) are missing.

If they are missing it is most likely because Jeffrey did not include them. The basic production of the manual was directly from his file. If you have details of what is missing, I will include it.

Jul 31, 2015 9:20am Steve Drain (222) 1620 posts	> Your post implies that you might have tested things and so have figures for direct entry vs SWI’d entry.

I think I have fooled you, and perhaps myself. My attention is almost entirely towards BASIC, so when I refer to calling SWIs I am really thinking of SYS. That adds quite a large additional overhead.

At a theoretical level, a Float SWI call is in place of a single machine code instruction. With the FPE this actually hides a considerable amount of integer code, so the overheads may not be very significant. With VFP it could be just that one instruction, and any overhead is likely to be significant. Even for the transcendental operations that need a number of VFP instructions, SYS/SWI overheads seem to be significant.

The testing I did is from BASIC and the code is included with the module. It is the usual FOR NEXT timing loops. I did not record timings, but noted the differences and satisfied myself that a direct call of the code was very much faster.

It might be worth noting that BASIC VI has to have a fair amount of overhead before it calls just one FPA instruction to implement its floating point operations. ;-)

Jul 31, 2015 9:47am Steve Drain (222) 1620 posts	> Basic VI could definitively be a bounty. I have about 300 € ready for this one.

BASIC VI already exists; I claim my 300 euros. ;-)

Seriously, I am puzzled by what you want to see, or imagine can be done. If you could produce an actual specification it could be assessed more accurately.

> BBCBasic4Win provides 80bit floats and 64bit integers.

To some extent I think this reflects the underlying machine, but remember, ARM BASIC dates back more than 25 years and has not had the continual attention that BB4W has enjoyed from Richard Russell.

If you want to use 80-bit floats, then you can write assembler for the FPE and hide it away in BASIC routines, but it will be slow and no modern ARM processor supports them.

If you want to use 64-bit integers you can do the same, and I am not alone in writing a library to do this. It is also on my list of things to include in Basalt.

> Another fantastic idea would be the possibility to define & rewrite keywords. It was possible, but never done.

How possible? And why? I do extend the use of keywords with Basalt, but that is not integral to BASIC. I have long considered the possibility of modularising, but have yet to see a way. As for BASIC itself, the code is very unfriendly to such a concept, I think.

> extend basic (by loading a library of keywords implemented in Basic or assembly)

How is this different from libraries of PROCs and FNs?

Jul 31, 2015 10:09am Dave Higton (1515) 3592 posts	> BASIC VI already exists;

That’s what I thought. I was puzzled by these references. Didn’t BASIC VI exist in the 1980s? What became of it?

Jul 31, 2015 10:41am Martin Avison (27) 1517 posts	> Didn’t BASIC VI exist in the 1980s? What became of it?

Enter *help basic64 at a command line! Basic VI is the FP version of Basic V.

Jul 31, 2015 10:51am Steve Drain (222) 1620 posts	BASIC VI is the release of interpreter version 5¹ that uses 64-bit floats. Otherwise BASIC V and VI are pretty well identical to the programmer. It was a soft-load option but is now included in the ROM.

¹ In case anyone quibbles with this terminology, look at the identity word in the BASIC module preceding the environment information pointer passed in R14 to CALL. In both cases it is &BA51C005.

Jul 31, 2015 11:27am Rick Murray (539) 14047 posts	> Afraid I don’t understand all the above details. My situation is that I write programs that get compiled/linked using the ‘ROOL’ ‘C’ compiler. For precision, etc, these use double floats a lot of the time. […] Maybe I could use some of the above I don’t understand like ‘Neon’ or ‘Float’.

The thing is, consider the following program:

```c
#include <stdio.h>
#include "kernel.h"
#include "swis.h"

int main(void)
{
    float one;
    float two;
    _kernel_swi_regs r;

    /* Read a run-time value so the compiler cannot fold the arithmetic away. */
    _kernel_swi(OS_ReadMonotonicTime, &r, &r);
    one = (float)r.r[0];
    two = one * (float)2;
    printf("%f", two);
    return 0;
}
```

The nonsense with _kernel_swi() is to prevent the compiler being smart and optimising out most of the code. ;-)

This translates to:

```
main
        ; lots of APCS init baggage snipped
        MOV     a1, #&42
        BL      _kernel_swi
        LDR     a1, [sp]
        FLTS    f1, a1
        FMLS    f2, f1, #2
        STFD    f2, [sp, #-8]!
        LDMIA   sp!, {a2, a3}
        ADD     a1, pc, #2, 30
        BL      printf
        ; exit and data follows
```

It looks pretty good, right? The problem is, these are FPA instructions. If a hardware FPA is available, it will execute the instructions. If not, the ARM will raise an undefined instruction exception, at which point the FPEmulator will step in and perform the operation.
- The FLT instruction could take up to 35 instructions, plus about 40 instructions to decode the instruction in the FPE, plus overheads in RISC OS dealing with the exception in the first place.
- The FML instruction takes around 33 instructions if using fast multiply (UMULL etc). Otherwise? It’s long. Plus 40+exception.
- The STF takes between 25 and 45ish (depending on what is being saved). Plus 40+exception.

[aside: why does it ‘load’ a single, multiply a single, then store a double?]

As you can see, executing this nice tight little three instruction FP multiply could involve something in the order of three hundred instructions being executed. If you are a maths nerd, you could probably do something better using fixed point and integer maths to fake it…

There are two alternatives.
The first is “VFP”, a newer type of floating point built into most ARMs made in the last decade.

```
        MOV     a2, #2
        FMSR    s0, a1          ; load value to multiply
        FSITOD  d1, s0
        FMSR    s0, a2          ; load what to multiply it with
        FSITOD  d2, s0
        FMULD   d0, d1, d2      ; multiply the two converted values
        FSTD    d0, [sp, #-8]
```

I think. I reserve the right to be utterly wrong. At any rate, it takes about six FP instructions instead of hundreds of ARM ones.

The alternative FP implementation, found on ARMv7A (Cortex-A), is called NEON. It is supposed to be three times faster than VFPv2 (ARMv5) and twice as fast as VFPv3?4? (ARMv6); but it comes with caveats. It isn’t IEEE compatible and only works with single precision mathematics. But, then, it wasn’t designed for accuracy. It was designed to allow MP3s to be decoded on a processor clocking 10MHz.

> But would that mean total re-writes of the programs?

Why? The program above could be recompiled to use VFP or NEON simply by passing flags to the compiler and recompiling. Unfortunately, passing “-cpu cortex-a8” to the compiler still generates FPA code. Maybe in the future it would be able to use something more appropriate to the processor type/family chosen. It would require a little more complication in CLib, though, in order for it to recognise different types of float in printf() and so on, for the float variations may be NEON, VFP, or FPE.

Of course, there is always the problem of using system-specific features – NEON code probably won’t work on an XScale and definitely won’t work on a RiscPC. However, going forward it is more logical to look to supporting these (and having a small function trap and abort the program if not present) than remaining with FPE. FPE served a very real purpose, but it is ridiculous now that most ARM processors that RISC OS runs on have two FP units inside…

> If so, not keen as it would seem more sensible for the hardware to support the established language and not make added assumptions about what is available.

Sometimes you have to draw a line.
Would you prefer half a dozen FP instructions, or hundreds of ARM instructions? Remember, I am talking about potentially one hundred ARM instructions per FP instruction. Sometime or other we will have to accept that going native is the only sensible option for programs that make heavy use of FP.

For me, I am not that bothered as I rarely use FP. I used to for working out percentages, but by rearranging the calculation I can do it in straight integer maths. For you, with the work you do on audio samples, I can imagine a lot of FP would be necessary. I wonder how much faster (less system load) AMPlayer would be if there was a VFP version…
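The “order of three hundred instructions” figure can be checked with a little arithmetic. This C sketch adds up the upper-end counts quoted in the post for FLT, FML and STF, plus the roughly 40-instruction decode cost for each. The post gives no number for the exception-handling overhead, so it is left as a parameter; the value of 20 used in the note below is purely an assumption.

```c
/* Decode cost per emulated FPA instruction, as quoted in the post. */
enum { FPE_DECODE_COST = 40 };

/* Cost of one emulated FPA instruction: the operation itself, the decode,
 * and an exception overhead the post does not quantify. */
unsigned emulated_cost(unsigned operation_cost, unsigned exception_cost)
{
    return operation_cost + FPE_DECODE_COST + exception_cost;
}

/* Total for the three-instruction FLT / FML / STF sequence. */
unsigned three_instruction_total(unsigned exception_cost)
{
    return emulated_cost(35, exception_cost)   /* FLT: up to 35 */
         + emulated_cost(33, exception_cost)   /* FML with fast multiply: about 33 */
         + emulated_cost(45, exception_cost);  /* STF: 25 to 45, upper end */
}
```

With no exception overhead at all the total is 233 instructions; assuming 20 instructions per trap brings it to 293, which is consistent with the estimate in the post.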

Jul 31, 2015 12:05pm David Feugey (2125) 2709 posts	> BASIC VI already exists; I claim my 300 euros. ;-)

I mean VII :)

> Seriously, I am puzzled by what you want to see, or imagine can be done. If you could produce an actual specification it could be assessed more accurately.

Nothing, or many things. SupermanLee said that VFP support could break compatibility with some code. So just change the version, and give people the choice to use V, VI or VII. No more problems, and many opportunities to make other bigger changes.

> How possible? And why?

To provide Basic & non system programmers a way to create and modify keywords. Basic programmers prove every day that they can make useful things. But they need simple interfaces to help RISC OS. The same goes for other parts of the OS (skeletons for image conversion modules, for example, could help non system C developers to port things). I love plug-ins :)

> How is this different from libraries of PROCs and FNs?

You could define new keywords, or even change some existing ones. Richard, for example, made a big change to the sound command, available as a patch for BBCB4Win. Just load it, and play. Changing BBC Basic is not for everyone, but extending it from Basic would be – IMHO – simpler. That’s just a parser thing anyway (the new sound could simply be replaced on the fly by some FNnewsound).

Jul 31, 2015 1:40pm jim lesurf (2082) 1445 posts	> It looks pretty good, right?

Sorry, afraid I still didn’t understand how to use VFP/NEON without changing my existing ‘C’ code. I think you may be answering a question I wasn’t asking! :-)

I do understand how any FP instructions are caught and, lacking real access to FP hardware, are emulated by bucketloads of int instructions, etc. That’s why the process is so tediously slow.

What I’m asking is if there is a way now (or soon) of having the machine simply access and use FP hardware without my having to change my existing C code, etc. At present I’ll have lots of lines with things like a = b*c; z = g/pi; etc where a, b, etc are all double precision floats, and of course all the calls like a = cos(pi2f); ditto.

How do I tell the machine now to handle the resulting compiled and linked code using FP hardware? And if not now, when/how?

One point of course is to avoid users having to recompile if their machine lacks these hardware alternatives to the FPE. I understand the point of having an FPE which can trap and handle via integer or pass on to accessible FP hardware. It means the person compiling and linking doesn’t have to worry about generating multiple versions and ensuring the user runs the ‘correct’ one. The only worry is the dramatic difference in speed for the users between the two, which is unavoidable.

FWIW I did give away my remaining FPA11 (IIRC the chip number) years ago. Maybe I should have kept it as a reminder of what we have since lost because Acorn seemed to decide this area simply didn’t matter.

My guess is that what is needed is a modern update to the FPE which traps the instructions and then sends something appropriate to NEON, etc. But I’m not sure I’ve understood.

Jim

Jul 31, 2015 2:43pm Rick Murray (539) 14047 posts	> How do I tell the machine now to handle the resulting compilied and linked code using FP hardware?

You don’t, unless you want to drop to assembler and add your own routines. As it is, we are stuck with something that was “old” a quarter of a century ago.

> And if not now, when/how?

That’s the question.

> One point of course is to avoid users having to recompile if their machine lacks these hardware alternatives to the FPE.

Or to give the compiler the ability to output VFP (NEON?) instructions and let the programmer decide? Personally, I’m concerned about the majority suffering for the minority. We ought to offer two versions of programs in that case – one for old machines, one for newer.

> what we have since lost because Acorn seemed to decide this area simply didn’t matter.

Oh hell yes! I never understood why Acorn seemed so against hardware FP, when it was being introduced on the competitor platform. I know that the FPEmulator is clever, but it is no match for real FP.

Jul 31, 2015 3:41pm Rick Murray (539) 14047 posts	> And if not now, when/how?
> That’s the question.

Actually… If we can forget about NEON for now, it might be simpler than it first seems.

I have the FPA10 datasheet and it claims to be IEEE 754 compliant. I also have the VFP data in the ARM ARM 2 and it claims to be IEEE 754 compliant. In this case, I would imagine saving FP registers to memory would use the same format?

With this in mind, I’m going to have a crack at writing the program I gave earlier to drop to an assembler routine to use VFP instructions instead. With any luck, printf() will work, which would imply that as long as the FP values are stored to memory, the CLib functions ought to still work. The result may be a mess of VFP and FPE, but if it does work, it may indicate a possible step forward?

Jul 31, 2015 3:42pm Dave Higton (1515) 3592 posts	So let me ask a naive question or several. Could the shared C library discover what FP system (if any) a given platform has, and use the best available? Could BASIC use the shared C library? Should it? And presumably anyone writing code could use it, although the documentation may or may not be adequate.

Jul 31, 2015 4:26pm David Feugey (2125) 2709 posts	> We ought to offer two versions of programs in that case – one for old machines, one for newer.

Not with a VFPEmulator module.

> Oh hell yes! I never understood why Acorn seemed so against hardware FP

Perhaps because it’s much more difficult to design than an ALU? :)

> Could the shared C library discover what FP system (if any) a given platform has, and use the best available?

That is only needed for the VFP/NEON problem. For the other cases (FPU/no FPU), VFP emulation will be OK. Of course, if there is no VFP emulation, we need two versions of a program: one for VFP, another for FPA (that will use FPEmulator). For speed issues and old software, FPEmulator should be able to use VFP if present. It’s what we call Soft FP.

So 3 things here:
1/ VFP support in FPEmulator
2/ Direct VFP support for DDE
3/ VFPEmulator for software with VFP code, that you’ll run on older hardware (without VFP)

For Basic and other ASM software, it can be directly VFP code. But you’ll need one of the two:
1/ Keep FPA version alive, or
2/ Provide a VFPEmulator module

For a potential BBC Basic VII, I suggest VFP mode, so just for modern motherboards… until we have a VFPEmulator module. Some functions will be a bit different, as between V and VI. It could also be the occasion to solve some problems with zero page.
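The idea raised in the thread of a library discovering which FP system is present and using the best available can be sketched in C as a one-off selection of a function table. Everything here is hypothetical: the `fp_hw` values, the `fp_backend` structure and `select_backend` are illustrative names, the three add routines are plain C stand-ins for what would be integer, FPA and VFP implementations, and how the hardware is actually detected on RISC OS is not shown.

```c
/* Hypothetical result of a platform probe (the probe itself is not shown). */
typedef enum { FP_HW_NONE, FP_HW_FPA, FP_HW_VFP } fp_hw;

/* One entry per implementation of the runtime support functions. */
typedef struct {
    const char *name;
    double (*dadd)(double, double);
} fp_backend;

/* Stand-ins: a real library would use integer code, FPA instructions
 * and VFP instructions respectively. */
static double dadd_integer(double a, double b) { return a + b; }
static double dadd_fpa(double a, double b)     { return a + b; }
static double dadd_vfp(double a, double b)     { return a + b; }

static const fp_backend backends[] = {
    { "integer", dadd_integer },
    { "fpa",     dadd_fpa },
    { "vfp",     dadd_vfp },
};

/* Pick the best implementation once, e.g. at library initialisation. */
const fp_backend *select_backend(fp_hw detected)
{
    switch (detected) {
    case FP_HW_VFP: return &backends[2];
    case FP_HW_FPA: return &backends[1];
    default:        return &backends[0];
    }
}
```

A caller would then go through the selected table, for example `select_backend(FP_HW_VFP)->dadd(1.5, 2.25)`, so application code never needs to know which FP unit is present. The cost is an indirect call per operation, which is the same kind of overhead trade-off discussed earlier for SWI versus direct entry.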
