FP support
GavinWraith (26) 1563 posts |
Jim:
Not with a new Shared C Library.
Dave:
Yes. Charm does this already. See the files lib.src.fp and lib.src.maths in the Charm 2.6.6 distribution. Recompilation would be necessary if the C library were extended with functions to check which FP system was available. It may be that the C runtime would need changing to save and restore VFP state – I am not sure about that. Otherwise I think it might be possible to have different CLib modules to suit each FP system. |
Rick Murray (539) 13840 posts |
Okay. For the lulz. Here’s a C program (you’re on your own for the MakeFile but note that objasm will whinge like hell as we’re mixing FPA and VFP and pre-UAL and UAL):
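Something along these lines – a minimal sketch only; do_fpe, do_vfp, the constants and the loop count are illustrative stand-ins rather than the listing itself. The two routines are assumed to be assembled separately, one with FPA and one with VFP instructions, with results passed through pointers to sidestep floating-point calling-convention differences:

    /* Sketch: time an FPA/FPE multiply routine against a VFP one.
       do_fpe() and do_vfp() are external assembler routines that each
       load two doubles, multiply them and store the result. */
    #include <stdio.h>
    #include <time.h>

    extern void do_fpe(const double *a, const double *b, double *result);
    extern void do_vfp(const double *a, const double *b, double *result);

    #define LOOPS 100000   /* illustrative iteration count */

    static clock_t time_it(void (*fn)(const double *, const double *, double *),
                           double *result)
    {
        double a = 1.234567, b = 8.7654321;
        clock_t start;
        int i;

        start = clock();   /* clock() ticks in centiseconds on RISC OS */
        for (i = 0; i < LOOPS; i++)
            fn(&a, &b, result);
        return clock() - start;
    }

    int main(void)
    {
        double r1 = 0.0, r2 = 0.0;

        printf("FPE: %ld cs  result %f\n", (long)time_it(do_fpe, &r1), r1);
        printf("VFP: %ld cs  result %f\n", (long)time_it(do_vfp, &r2), r2);
        return 0;
    }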
And here’s some assembler to go with it:
Everything is double, as the C compiler promotes single (float) to double before passing it to printf(). The FPE code seems to be broken. It consistently outputs an insanely large value. It used to work, and the code is more or less equivalent to the FPE code generated by the compiler, so I don’t know what’s going wrong. I don’t see what can go wrong with “load a value, load another value, multiply them, store the result”… Doesn’t matter, really. The point is the timings. Um.
290cs vs 7cs. This is fairly consistent, running on a standard Pi, single-tasking. For the time it takes to do this once with FPE, I could do it forty-one times with VFP. This is why the DDE ought to start supporting native hardware FP instead of ancient emulated FP. |
Steve Pampling (1551) 8170 posts |
I may be off track here, but isn’t one of the reasons for RO multimedia handling being a bit sucky something to do with an absence of decent FP? Or rather the use of FPE instead of hardware FP? |
David Feugey (2125) 2709 posts |
This, and slow disc accesses. |
Steve Pampling (1551) 8170 posts |
I did say “one of the reasons”. Unless you have a magic wand, deal with problems one at a time. Faster disc access is at least partly limited by the current hardware. |
Rick Murray (539) 13840 posts |
Err, not really. Same hardware, different OS, makes the standard Pi quite a nice media playback system. One of the main issues is the closed nature of many of the GPUs. The things can be controlled with a binary blob supplied by the manufacturer, which will slot into Linux and, together with the media framework, will provide what is necessary for HD H.264 video. You can see, by looking at our MPlayer port and its just-about 320×240 capabilities, exactly how much the GPU does assist. Without this, we’re kind of stuck.
Anyway, for now, for today, it might be a nice idea if our compiler could perhaps make better use of the easily available facilities. A possible alternative could be to supply code paths for VFP and non-VFP and select which one is used at runtime, depending on the facilities of the host system? This might impact the efficiency of the compiler, so… |
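For what it’s worth, the run-time selection idea can be as simple as picking a function pointer once at start-up. A sketch, where cpu_has_vfp() stands in for whatever probe the host OS offers and the two dot routines are the same code built twice, once with VFP and once without:

    /* Sketch: choose between a VFP build and a non-VFP build of the
       same routine once, at start-up, then call through a pointer. */
    #include <stddef.h>

    extern double dot_vfp(const double *a, const double *b, size_t n);
    extern double dot_fpe(const double *a, const double *b, size_t n);
    extern int cpu_has_vfp(void);          /* hypothetical probe */

    static double (*dot)(const double *, const double *, size_t);

    void fp_init(void)
    {
        dot = cpu_has_vfp() ? dot_vfp : dot_fpe;
    }

    double example(const double *x, const double *y, size_t n)
    {
        return dot(x, y, n);               /* one indirect call per use */
    }

The cost to the non-VFP path is a single indirect call per use, which is the sort of trade-off being suggested here.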
David Feugey (2125) 2709 posts |
Holidays? :) The problem is more on the Neon side. We could have a NeonEmulator, but the use of Neon code will not be very efficient on VFP (without Neon) systems. Conclusion: VFP is definitely possible without any drawback, except on FPA systems (rare today). Neon is possible, but with speed problems on VFP-only systems.
One solution would be to have both versions of the assembler code in the binary. A simpler solution would be to have some *ifNeon CLI command to launch a specific RunImage if Neon is present (easy to make [compile twice], easy to remove if some people want to save space [remove the unneeded RunImage]). If you really want, some *ifFPA and *ifVFP commands could be provided too. IMHO, several binaries will be much more flexible than a fat binary. And it’s more RISCOSish too :) |
jim lesurf (2082) 1438 posts |
IIRC that was always the case, and I assumed it was because the IEEE compliance was based on double-precision error/rounding specs. BTW, for me IEEE compliance was important. And at least one version of the FPE emulation failed at one time. Most were fine, though. As was a faster 3rd-party version I used for a while (I’ve forgotten who produced that). And back in the day when I still had a machine with real FP hardware, I found that hardware was indeed 20–40 times faster than emulation. Made a big difference to programs that used a lot of floating-point number bashing.
I confess to being wary of having to generate multiple binary versions, etc. It opens up scope for odd failures as people start saying “doesn’t work here” or, even more frustrating, “I get a slightly different answer”. Seems to me that dealing with this via CLib/FPEmulator as the go-between makes most sense. But of course I have no idea what VFP/NEON entail, so what I’m saying may not be possible.
In the past I mainly wanted FP hardware for engineering/academic/scientific calculations. These days I’d be in a boat with more passengers, as I feel it would help a lot with processing ‘AV’ data files and streams. These often involve bucketloads of data as well as requiring a lot of number crunching. Sometimes it can all be integer, but other times not. Jim |
Steve Drain (222) 1620 posts |
I have tried to keep an eye on Charm. When I looked several months ago I think I remember the floating point only extended to the arithmetic operations; now I see that all the operations offered by FPA are there. If I understand correctly, to exploit VFP you have to recompile the compiler (written in Charm), so object code will only work in one of the two environments we might be concerned with, not both. I have not delved into the source, but I would be interested to find out how Charm provides the transcendental functions using VFP, and how it handles VFP contexts. Nevertheless, it is clearly an exemplar for changing the C compiler. |
Steve Drain (222) 1620 posts |
I have never been in doubt about the huge speed advantages of using VFP. My own tests confirmed it, but I thought it too obvious to make a point of. Thanks for your explicit numbers. ;-) |
Steve Drain (222) 1620 posts |
Have you looked at my Float module? I considered all the issues raised here when writing it. My solution may not be suitable as it stands, but it is a solution.
First, as Jeffrey pointed out, you have to deal with context. At what level you do this is an important decision, and one that did not have to be made when using FPA. My solution is at task level – not per instruction nor at system level. If BASIC were to do this, it would be during task initialisation with *Basic, and the context pointers stored in the spare words still available in the workspace. That is how I have experimented with it for Basalt. I expect that this can be done similarly in C.
Next is the compatibility between FPA and VFP. The general feeling here seems to be for separate compilation depending on the processor, as with Charm, but I dislike the idea of having different code, despite arguments about what is the route forward. ;-) My solution is to set up a context on systems that can use VFP, but to have null pointers otherwise. The choice of which code to run is then made on whether there is a context or not. This single instruction and branch is of no significance to FPA and a minuscule delay to VFP.
Then there is the problem of endianness:
I agree. Any existing data is stored for FPA, cf. BASIC. So my solution is the same, and requires the overhead of swapping registers before and after VFP code. I think this is acceptable for compatibility.
A problem that has only been discussed tangentially, I think, is precision. Certainly double-precision IEEE is needed, and that is where I stopped, because that is all BASIC requires. However, single precision is used. VFP can provide this, and I think we can ignore NEON. Extended precision is out of the question with VFP, but how important is that?
Lastly, there is the problem of transcendental operations not provided by VFP. This is not trivial, but it has been a problem for computers for a long time. As an amateur, my solutions took some time and effort to tease out, but I expect those with computer science qualifications could rustle them up. How many of those do we have here? The algorithms might need explanation, but that is for another time, if anyone is interested. ;-)
A final comment. I do not see an FPE replacement using VFP as feasible, for all the reasons Jeffrey listed, and I think a VFPE alternative would impose unnecessary conditions on programs running on non-VFP machines, so I would rule both out. |
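To make the word-order point concrete (an illustration only, not code from the Float module): FPA keeps the most significant 32-bit word of a double at the lower address, while VFP uses the opposite order, so data laid out for FPA has to have its two words swapped around any VFP code that touches it:

    /* Sketch: convert the 8-byte in-memory form of a double between
       FPA word order (most significant word first) and VFP word
       order (least significant word first) by swapping the halves. */
    #include <string.h>

    void swap_fp_words(void *p)            /* p points at the stored double */
    {
        unsigned int w[2], t;              /* two 32-bit words on ARM */

        memcpy(w, p, sizeof w);
        t = w[0]; w[0] = w[1]; w[1] = t;
        memcpy(p, w, sizeof w);
    }

In use, the swap is applied to values just before VFP code reads them from FPA-format storage, and again to anything VFP writes back – the overhead being accepted above in exchange for compatibility.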
David Feugey (2125) 2709 posts |
Yep, an elegant solution. But I was just thinking that perhaps it’s time to get this by default.
We could then fall back to VFPEmulator-specific software functions made to be close to FPEmulator. That’s why I suggested Basic VII: a new solution with support for VFP/Neon (and non-VFP computers via VFPEmulator), the fastest way possible, but with small differences from BBC Basic VI that can lead to incompatible code. The same as between BBC Basic V and VI. And perhaps it would be a good time to add some features present in BBC BASIC for Windows: “data structures, PRIVATE variables, long strings, event interrupts, an address of operator, byte variables, a line continuation character, indirect procedure and function calls and improved numeric accuracy.” Directives for the (de)tokeniser (AllowLowcaseKeywords, RemoveFN, RemovePROC, AllowKeywordRewrites, Aliases, Renames) could be added too, to get something more modern (I do this today with a very limited and buggy preprocessor).
So the solution is not hardware FP?
A VFPEmulator, in the same way that current software needs FPEmulator. Not really a big change for users. |
David Feugey (2125) 2709 posts |
From a strategic point of view, RISC OS attracts some people because of BBC Basic. And simply because it’s probably the fastest interpreter on ARM in the world (with no JIT). I think it’s really important to keep this advantage. On Windows, BBC Basic is one of the smallest, and the fastest, interpreters too. That’s really a good reason to use it, and to make cross-platform software (OK: games) with it. I’m OK with add-ons (such as the really excellent Basalt), and with support for legacy platforms, but we can also move on. Just to claim that we still have the fastest ARM interpreter in the world :) |
Steffen Huber (91) 1953 posts |
Can someone summarize the situation with GCC and float stuff? Before we got all that shiny new hardware, I remember that the choice was between “hard float” (FPA/FPE) and “soft float” (an internal GCC math lib). “Soft float” was a lot faster on non-FPU hardware. ISTR that libraries needed to be compiled for the correct calling standard. |
David Feugey (2125) 2709 posts |
On non-FPU hardware, soft float is the fastest solution and hard float the slowest. VFP support should be complete in both GCC and UnixLib. Is it available in stock GCC or in a specific beta version? I don’t know. |
GavinWraith (26) 1563 posts |
It probably depends on what program you run, but I would claim that Lua is faster. The Lua binary is about 88K; that includes extra libraries like lpeg (parsing expression grammars) and bc (big numbers). Basic 64 is smaller at 51K. But Lua does not have the nostalgia factor. |
David Feugey (2125) 2709 posts |
Lua has some libs that can lead to much faster results, but I doubt that each opcode is decoded and run with only a few ARM instructions. The BBC Basic engine is very optimised here (I could say the same of BBC BASIC for Windows). That does not take away from Lua’s qualities, anyway. |
Rick Murray (539) 13840 posts |
Well, there’s one way to settle this. CODEFIGHT!!!
|
GavinWraith (26) 1563 posts |
The standard Lua distribution has always had “#define LUA_NUMBER double” as the default.
You would be right. Some of the Lua VM instructions are pretty complex, especially the ones dealing with tables. Almost every operation depends on whether its operands have metatables. So an addition (+), which in the simplest case would come down to an “ADD result, arg1, arg2” ARM instruction, might be implemented by arbitrarily long user code in Lua (or C or assembler), if either of the operands has been set up to demand it. This is the penalty that has to be paid for user-controlled syntax. In this sense, Lua is not so much a language as a language-kit. It mandates certain aspects (garbage-collected memory management, lexical scoping, multiple return values from functions) but leaves a great deal else free to be defined by the user. The intention behind the register-based Lua VM was that each instruction should do as much work as possible, to cut down on interpretive overhead. |
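A tiny illustration of that point, using the stock Lua C API (the script and its numbers are made up for the example; link against the Lua library):

    /* Sketch: a C host embedding Lua, where the script's '+' runs an
       __add metamethod (arbitrary user code) rather than a plain ADD. */
    #include <stdio.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    static const char *script =
        "local mt = { __add = function(a, b) return a.x + b.x end }\n"
        "local p = setmetatable({x = 1.5}, mt)\n"
        "local q = setmetatable({x = 2.5}, mt)\n"
        "print(p + q)\n";

    int main(void)
    {
        lua_State *L = luaL_newstate();    /* create an interpreter state */
        luaL_openlibs(L);                  /* standard libraries, for print */

        if (luaL_dostring(L, script) != 0)
            fprintf(stderr, "lua: %s\n", lua_tostring(L, -1));

        lua_close(L);
        return 0;
    }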
Rick Murray (539) 13840 posts |
Doesn’t this run the risk of marginalising the language away from serious use? After all, there are always two sides to the story. For example, all of my Windows programs are written with (true)VB because I didn’t grok how C programs started themselves up and I wasn’t confident to alter one of the demo apps to figure it out. VB on the other hand is overly friendly and extremely simple to use even if you pay for it in efficiency (the 6502 emulator I started in VB was taking the piddly more than anything else). Perhaps a person might feel more confident with Lua?
Just like the FPE instructions…
Wouldn’t that be the same in BASIC? |
Jeffrey Lee (213) 6048 posts |
VFP support should be complete in both GCC and UnixLib
Currently the only way to get a VFP/NEON-capable version of GCC is to build it yourself. It’s also worth pointing out that any programs compiled to use VFP/NEON (using GCC) will need SharedUnixLibrary 1.13 – which hasn’t seen a public release yet. If you build GCC yourself you’ll get a copy of it, but to avoid potential differences once the official 1.13 comes along I don’t think the GCC team will be happy with you distributing your own version.
Rick: The reason your do_fpe function returns the wrong value is because of the differing word order for doubles between FPA & VFP. It looks like objasm decided you wanted to use VFP word ordering, which is why your VFP code needs to swap the order on save (for interaction with the FPA CLib) but not on load.
Lack of VNMUL & friends in BASIC looks like an oversight – they should be there now in BASIC 1.61 |
Fred Graute (114) 645 posts |
Thanks for the quick fix, Jeffrey. Here are a few more anomalies I found while extending StrongED’s ASM colouring:
|
Jeffrey Lee (213) 6048 posts |
What’s the hex for those instructions? For VLDM/VSTM the register count is stored in an 8-bit field, so you could theoretically load/store up to 255 registers if the hardware had that many. So I suspect that the debugger is disassembling it correctly, and it’s actually the instruction which is at fault. Which would then lead on to a second question of how you assembled those instructions!
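For reference, the count being described is the low byte of the instruction word, so it is easy to check from the hex by hand or with something like this (the example word is illustrative only):

    /* Sketch: pull the imm8 register-count field (bits 7:0) out of a
       VLDM/VSTM encoding.  For the double-precision form the field
       holds twice the number of registers transferred. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int instr = 0xEC810B08;   /* illustrative VSTM encoding */
        unsigned int imm8  = instr & 0xFF;

        printf("imm8 = %u (%u doubles)\n", imm8, imm8 / 2);
        return 0;
    }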
Steve Drain (222) 1620 posts |
I have looked back at the long list of VFP instructions you provided, but I cannot see VNMUL & friends in it. Could you, or Fred, please post them or point me directly to the relevant ARM document? I will then add them to the VFP manual. However, I notice that I missed out the VFPv4 fused multiply instructions VF[N]MA and VF[N]MS. I cannot recall why, but I do not suppose that they are the same. |
GavinWraith (26) 1563 posts |
Lua has been designed with very specific aims. In particular, it is designed as a C library to be embedded in applications written in C. Lua as a separate programming language is something of a side-issue. The idea is that if you are after speed, you cater for that on the C side of things. So something like RiscLua, which is a statically compiled C application to interpret an appropriate dialect of Lua for RISC OS, is only showing half the story. So, yes, embedding Lua in a pre-existing number-crunching package, to make it easier to use, makes sense. Adding number-crunching facilities to RiscLua makes less sense, IMHO. That is not to say that a future version of RiscLua won’t be using VFP or NEON. Lua 5.3, the latest version, after years of discussion on the forums, addresses the problem that, whereas doubles may be a useful number type to expose to the user, the internal code is only interested in pointers, essentially an integer type. My personal preference is for no coercion and keeping types separate, but that is not seen as making things simple for the casual user. |