FP support
David Feugey (2125) 2709 posts |
Really? What a surprise. It’s so new to me. OK, replace “So users have no word to say until they become GCCSDK developers?” with “So users have no word to say until they become GCCSDK contributors?” if you want. Strange sensation of déjà vu in this thread. I would prefer to stick to the main subject, please. |
Steve Pampling (1551) 8170 posts |
Not what I want to do at all, it doesn’t convey the point. It’s about doing rather than complaining that someone else isn’t doing. With which I shall go and play with the Filer again. |
Rick Murray (539) 13840 posts |
I’ll attempt to summarise it: Whinge whinge it’s complicated.
This won’t happen. Well… I suppose somebody clever could rewrite a version of the FPEmulator that picks up on the FPA instructions and calls the corresponding VFP instruction where it exists… Things won’t be as fast as using VFP natively, but ought to be faster than emulating everything in ARM code.
Ah, now herein lies a potential gotcha. Did you notice that my VFP code was wrapped in calls to create a VFP context, and then destroy it afterwards, yet the FPE code didn’t do this? If I remove the line that calls VFPSupport_CreateContext and then call the VFP code, I see this:
Add it back in again, and the code works. Therefore – either RISC OS is going to have to assign itself a “default” VFP context so that applications can use the VFP hardware like they used the FPA (creating/destroying subsequent contexts if a specific application and/or state requires it), or every application is going to need to be aware of this and manage VFP contexts manually – which can include saving and restoring them around Wimp_Poll and the like. It might be a good idea to consider the OS itself maintaining a default VFP context, so that VFP instructions can be used more freely than at present.
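In the meantime, the create/use/destroy pattern looks roughly like this. A minimal sketch in C – the SWI numbers and the register usage are my assumptions from memory, so check the VFPSupport documentation rather than trusting this:

#include <stdio.h>
#include "swis.h"   /* DDE: _swix() and the _IN/_OUT macros */

/* Assumed SWI numbers - believed to be in the &58EC0 chunk, but verify
   against your headers before relying on them. */
#ifndef VFPSupport_CreateContext
#define VFPSupport_CreateContext  0x58EC1
#define VFPSupport_DestroyContext 0x58EC2
#endif

int main(void)
{
   unsigned int ctx;

   /* Create and activate a context first; the R0 (flags) and R1 (size)
      values given here are illustrative assumptions. */
   if (_swix(VFPSupport_CreateContext, _INR(0,1) | _OUT(0), 1, 16, &ctx))
   {
      printf("No VFP context available - use the FPA/FPE path\n");
      return 1;
   }

   /* ...VFP instructions can now be executed without faulting... */

   /* Tidy up afterwards, or the next program may inherit our FP state. */
   _swix(VFPSupport_DestroyContext, _IN(0), ctx);
   return 0;
}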
For a while, there existed FPA chips that could be inserted into the early computers, but Acorn never seemed to consider FP to be something worth promoting. Not present as standard on most machines: rare, expensive, and forgotten by the RiscPC era. Heck, even CJE don’t have any in stock! :-) The thing that hurts, though, is that we are still emulating an FP unit as we did in 1987, when most modern hardware has one, if not two, entirely functional FP units built in. |
Rick Murray (539) 13840 posts |
The Rick Guide to hacking VFP into a C program: Here’s how I made my example. I wrote it in C. Then I compiled with the “-S” flag set. This tells the compiler to spit out a textual “source” listing of the assembler instructions that it translated the C source into. It places this in the ‘o’ directory, so don’t panic if the build then fails. Just open the o.whatever file and look at it. So… I looked at the code to see what the C compiler was doing, and I created my own bit of assembler code, and copied the FPA code into a function. In the C code, I call the function. I pass a pointer to the “double” variable to the function to make it easier for the assembler code to pick it up – it’s there in R0 (a1). For instance:
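Something like this – a reconstruction; the constants and the argc trick are as described below:

#include <stdio.h>

int main(int argc, char *argv[])
{
   double result;

   /* argc is (nearly always) 1, but the compiler cannot know that,
      so it cannot fold the whole expression into a single constant. */
   result = 123.456 * argc * 654.321;

   printf("%f\n", result);
   return 0;
}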
and the -S output was like this:
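Paraphrased from memory (the literal-pool and stack offsets are elided):

        LDFD    f0, [pc, #...]     ; load 123.456 from the literal pool
        FLTD    f1, r0             ; convert argc to floating point
        MUFD    f0, f0, f1         ; 123.456 * argc
        LDFD    f1, [pc, #...]     ; load 654.321
        MUFD    f0, f0, f1         ; ... * 654.321
        STFD    f0, [sp, #...]     ; store into 'result'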
Okay, now the complication with the above is that I multiply with “argc” (from the main() definition). This should usually be ‘1’, so I am telling the code to multiply 123.456 by 1. This is important, because the compiler is smart: if I did not multiply by an unknown value, the compiler would recognise that 123.456 and 654.321 are both constants, and it would simply work out the result and use that, skipping the calculation entirely. So, the FLTD and the first MUFD are unnecessary. The important parts here are the two LDFDs, the MUFD, and the STFD. As what I really want is a less complicated calculation, I can load the first FP value into F0, load the second into F1, then calculate the result into F2 and save F2. This means my assembler routine would look like this:
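Roughly this – from memory, so double-check the objasm syntax (if your objasm dislikes the ‘=constant’ form, DCFD the values somewhere and LDFD from their address):

fpcode  LDFD    f0, =123.456      ; let objasm park the constant in a literal pool
        LDFD    f1, =654.321
        MUFD    f2, f0, f1        ; f2 = f0 * f1
        STFD    f2, [a1]          ; store via the pointer passed in a1 (R0)
        MOV     pc, lr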
Notice that the LDFDs are less complicated. I tell objasm what I want to have loaded into F0 and F1 and let it worry about how best to do it. The end result is the same… And the C code? At the top I do this:
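A declaration for the assembler routine – the name ‘fpcode’ is my stand-in for whatever you exported:

extern void fpcode(double *result);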
and in place of the calculation, I do this:
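With the same stand-in name:

   fpcode(&result);   /* the assembler routine does the calculation */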
Or, to put it all together, this:
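The complete s.fpcode, reconstructed (note the AREA line, explained next):

        AREA    |C$$code|, CODE, READONLY

        EXPORT  fpcode

; In: a1 (R0) = pointer to the double to receive the result
fpcode  LDFD    f0, =123.456
        LDFD    f1, =654.321
        MUFD    f2, f0, f1
        STFD    f2, [a1]
        MOV     pc, lr

        END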
Note – the AREA specified in the assembler code should be “|C$$code|” to place the code alongside the C code. This requires you to also specify the area as being CODE and READONLY. So… I got the compiler to build a proper version this time (no more -S) and before running it I looked it over in Zap. Once the start of the main() code was found, I just stepped through it to make sure it was doing what it used to do, only now with a branch to the FPA code, and back again afterwards. It was good, it worked as expected. I mention this just in case you are feeling adventurous with trying out some VFP code. Write your function in C as you would normally, then take a look at what the compiler is doing with the FPA parts. Then have a go at moving them to an assembler file and calling those functions instead of performing the calculation in C. Once that is working, then you have the fun part – have a crack at replacing the FPA instructions with VFP ones. To help you along the way – save the C code as “c.fptest” and the assembler code as “s.fpcode”. Here is a MakeFile to build the code:
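A sketch of the sort of thing (amu format – the command lines must be indented with tabs, and the objasm/link invocations may need adjusting for your installation):

# MakeFile for fptest
fptest:    o.fptest o.fpcode
	link -o fptest o.fptest o.fpcode C:o.stubs

o.fptest:  c.fptest
	cc -c -IC: c.fptest

o.fpcode:  s.fpcode
	objasm -from s.fpcode -to o.fpcode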
And, remember, we’re here to help. Stuck? Just ask. [disclaimer: I can help with the nuts and bolts but I suck at maths] |
Steve Drain (222) 1620 posts |
Jeffrey has implemented Wimp context switching on 26 Nov 2010:
|
Dave Higton (1515) 3525 posts |
It makes me uneasy… it’s not as automatic as I’d like it to be. Is there a better way? Back in the old days, we had FP instructions – one and one only set. Apps were written to use them. If the hardware that the app was running on had hardware FP, the app ran quickly; if not, it ran slower. The key desirable feature was that it was all automatic. There was no “if hardware FP, do this, else do that”. Can we get back to that degree of being automatic? |
jim lesurf (2082) 1438 posts |
Persackly. The point here is that, from my POV, what is needed is for the FPE/CLib/whatever simply to trap and deal with this, so the persons writing and using the program just get ‘best performance for their box’. No need for hacking about or special cases. If you have accessible FP hardware, it gets used. If not, it’s emulated. We had that once. We need it again. I’d like this for the ROOL compiler even though people seem to focus on GCC. TBH it would seem odd if it was done for GCC but not ROOL’s ‘own’ compiler. But that’s a different detail to the above more fundamental point. Jim |
Rick Murray (539) 13840 posts |
It would be nice if the OS owned a “default context” so VFP instructions could just be used without faulting or needing contexts created and destroyed.
Hear hear. It makes no sense at all on a Pi (Beagle, iMX6, …) for cc to build an executable using FPA instructions. NO current system supports them natively, and the older ones that did were rare.
Please be aware that there are several co-existing issues here. First of all, the FPA is faked. Not only is it faked, it is ancient. I’m surprised ARM have not deprecated the use of its instruction space in order to put something else there… The proper hardware FP is VFP. The XScale (Iyonix) doesn’t have it. Every RISC OS machine afterwards does. We can discount NEON. It is a non-IEEE compliant mini-FP designed to get “good enough” results extremely quickly and sometimes in parallel. This isn’t aimed at precision, it is aimed at media decoding, so it is a somewhat specialist FP unit that can be largely ignored (a person wanting to use it can always drop to assembler…). It is preferable to drop support for FPA and instead consider VFP to be the new floating point system. This cannot be done for the primary reason that the FPA and the VFP store their data back to front in comparison to each other. Therefore modifying CLib to deal with VFP-style data (a simple enough modification) would immediately make it incompatible with all FPA code, including anything built using the current compiler. The only logical approach is to retain the current FPA behaviour in the compiler, and add an option to use VFP instead. This part of the compiler will know about the backwards word ordering and will swap the words after STFD (or whatever that is in UAL). Suboptimal? Yes, a bit. But it is far far better than sticking with FPA and certainly a lot nicer than arbitrarily wiping out support for the FPA (which would affect anything using floating point compiled today and earlier). If you have any question as to why, consider what is supposed to happen if you were to use the We also run into the problem of how to deal with code that is going to run on an earlier machine. For these, I would say the best solution would be to have an FPA build of the software for them. It ought to be viable, instead, to have a VFPEmulator to fake the VFP instructions, but the question there is who is going to write such a thing? Is it even viable to take the time to do it when simply building an ‘old machine’ version is easier all around? So back to this:
Today, we have an old set and the new set. For the majority, the older set was never ‘real’; the newer set is. For various reasons, the two – even though they are IEEE compliant and understand the same basic number styles – are not directly compatible, due to an annoying quirk. We need to make use of the new FP hardware. We can’t arbitrarily kill off the old FP mechanism. What has been written above are my ideas. Yours are welcome too. But to be honest, getting the best use of the new FP hardware while keeping FPA code going is likely to involve compromises. |
Jeffrey Lee (213) 6048 posts |
it’s down to each program to create/destroy contexts as needed. There are several things here:
Also note that although it would be possible to emulate VFP on all hardware, NEON can only be emulated as far back as ARMv5, since the instruction encodings make heavy use of the ‘NV’ condition code (which will be ignored on ARMv4 and below). |
Steve Drain (222) 1620 posts |
Version 021 is now available. ;-) Edit: updated version There are bound to be errors, especially among the NEON instructions, so please let me know. |
jim lesurf (2082) 1438 posts |
Given that the older set wasn’t “real”, but worked, that seems a strange basis for some of what you say! :-) The whole point of the FPE was to trap and deal with situations where the hardware didn’t match the compiled instructions. If there’s something like an endian change in the data values, then I guess it would make life easier for all if that was also trapped. It seems a mess to have to recompile separate versions. The compiled code defines what the program is meant to do. Provided there is one set of rules for the meaning, the method of how to get that enacted is something I’d assumed an FPE/CLib could handle if those were suitably machine specific. Byte-shuffles might be a slow-down, but probably not as bad as having to do everything via bucketloads of int operations. And if most platforms are now VFP, that could be the default. That might mean the compilers have to flag what they now do so the FPE/CLib can tell them from old code, but again I can’t see why every programmer should have to generate multiple compiled versions, or think that makes more sense than having an FPE/CLib deal with it. However, what do I know? Just that this did work fine for a while, so it seems weird that things have been ‘improved’ so much that what was done then is now impossible. Not what I’d have envisaged as ‘progress’, I guess… 8-] Jim |
GavinWraith (26) 1563 posts |
Interpreted languages are often slow for number crunching because of interpretive overhead. Interpreters can only give you off-the-peg operations: standard arithmetic operations, square root, exponential, logarithm, trigonometric functions and so on. What they really need are more fundamental operations, such as inner-product of vectors, determinants, Horner’s method for evaluating polynomials, continued fraction evaluation, etc. Joe Taylor and I produced the MATROM, a sideways ROM for the BBC B, to extend BBC BASIC with matrix arithmetic, some time before Acorn did. Ours did matrix inversion, but Acorn’s did not. So what mathematical operations would you like to see in an interpreted language? Or does one throw up one’s hands and say use assembly language instead ? To avoid interpretive overhead one wants to get as much as possible, particularly looping, done with low level code. I always felt that BBC BASIC was never developed as far as it could have been in this direction. |
Rick Murray (539) 13840 posts |
Jim – there is not a big problem in dealing with VFP now/soon (it depends how much the hardware can do and how much needs to be picked up in software – typically the more esoteric functions are handled in software). The problem, as you might have guessed from my lengthy message(s), is being able to do this in such a way as to not affect code compiled with the existing FPA instructions. Unfortunately the word endianness (words, not bytes) is different, which means that one cannot easily make assumptions as to what an FP value is when it is seen in memory. If you refer to the original program I posted when doing the comparison between VFP and FPE, I got a really weird value for the FPA code because objasm spotted the VFP instructions and stored the FP data in the appropriate format. The FPA code loaded them and saw something completely different because the two words of data that comprise the FP value were back to front. Not invalid data, just something else.
Indeed, and it will typically aim for the lowest common denominator in order to have the widest compatibility. It wasn’t until the 32 bit change that the compiler started to use the MRS instructions. I’ve just looked through a program of mine and I see MUL a number of times, but no UMULL or any of those. Why? Well, I have a vague recollection that a 20 year old processor might be upset by it, and since RISC OS 5 could potentially run on a 23 year old processor that doesn’t support it at all, the base default option is to not use it.
It isn’t ideal, but to my knowledge there is no VFP emulator for older machines nor one planned…
Sarcasm aside, the problem (as I see it) is not in supporting VFP or even whether or not RISC OS should mandate basic VFP (v2?) going forward. |
Steve Drain (222) 1620 posts |
Not a problem without any solution – I carefully described one a few days ago. You might not want to do it that way, but you should not make it seem more difficult than necessary. However, a slot-in replacement for the FPE is going to be difficult and is way beyond my pay grade. ;-) |
Steve Drain (222) 1620 posts |
A7000+ with ARM7500FE. ;-) |
Steve Drain (222) 1620 posts |
@Gavin I find myself in total agreement, but I think this should be in a new topic |
Rick Murray (539) 13840 posts |
Yup – you did. I was going to plug it into my code to test it, but Float crashes on my Pi. The second instruction of the module init code is SWI &54C81 which is not recognised. Hmmm…
The SWI with the VFP is OS_IntOn, which seemed to be the simplest SWI I could think of. I had used XOS_GenerateError but that took ages to do nothing. It’s to see what the impact of an overlying SWI call would be. The answer? “It varies”. I’m not worried about your module returning the wrong data. You say you use FPA format data, so you’re reading it the same as FPEmulator, and indeed get the same result. Is your module safe to call in USR mode? I wonder if I could look up the address of your module, work out where the SWI handler is, then set up the registers to permit a direct jump into there rather than the SWI mechanism. Perhaps this might permit some speed to be gained?
It would be nice if the compiler could emit code that would use VFP if available, or FPE otherwise. I think your idea of hanging onto the context is the way to go – a context means VFP, no context means fall back to FPE. I think in the compiler this could be checked with a load and a compare, with a branch to the FPA code (let VFP be the fall through case, as FPE is slow so an extra branch won’t change much). Sort of like this, to have a (slightly larger) executable that will work best on anything:
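A sketch of the idea – the context word and labels are invented for illustration, and the VFP loads would still need the word-order handling discussed earlier:

; Hypothetical word, set non-zero once VFPSupport_CreateContext succeeds
; (in real code this would live in a writable data area)
vfp_ctx DCD     0

; In: a1 = pointer to two doubles; result is written at a1+16
multiply
        LDR     ip, vfp_ctx
        CMP     ip, #0
        BEQ     fpa_path                ; no context: FPA code, FPE picks it up
        VLDR    d0, [a1, #0]            ; VFP path (word-order fix omitted)
        VLDR    d1, [a1, #8]
        VMUL.F64 d2, d0, d1
        VSTR    d2, [a1, #16]
        MOV     pc, lr
fpa_path
        LDFD    f0, [a1, #0]
        LDFD    f1, [a1, #8]
        MUFD    f2, f0, f1
        STFD    f2, [a1, #16]
        MOV     pc, lr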
I really like your idea of storing backwards using the single versions of the FP registers. I wish I’d thought of that! ;-) |
Steve Drain (222) 1620 posts |
Sorry about that. The easier thing is to set the debug flag to 0 at the start of the program.
My pleasure. ;-)
You did assemble in BASIC VI, as noted in the REMs, didn’t you?
That is much as I would expect.
No need for that. SWI “Float_Start” returns the start of the SWI table – it is all documented. And I CALL the code from BASIC, so it is USR safe. There is even a library to put the routine addresses into variables. ;-) |
Rick Murray (539) 13840 posts |
Err… No. I didn’t read them. ;-)
Where? I’m looking at “Float_Start (&0C0040)” in StrongHelp. It returns R0 = context, R1 = prev context. There’s a nit about how it uses VFPSupport to set itself up, plus some extra notes below. Nothing about the SWI table. |
Steve Drain (222) 1620 posts |
Oh b****r. Mea culpa. It’s 18 months since I uploaded that and I have been using 0.65 here. Look tomorrow and that version will be up. It will be worth it. ;-) Edit: www.kappasite.pwp.blueyonder.co.uk/Modules/swFloat065.zip Note that it is unregistered and was always intended as something for discussion rather than use. |
Theo Markettos (89) 919 posts |
I hate to quell the storm in this teacup but…
I can’t speak for Lee or John (there isn’t really a ‘GCC team’ except a bunch of people on a mailing list, which anyone is free to join) but if there is call for an ‘official’ release of SUL 1.13 then we can do something about that. Things only happen when people have time, so if nobody gets round to it it doesn’t happen – but asking for it is a good way to make it happen. Additionally we don’t have any automated testing at present, so any help with manual testing is very much appreciated. Test is the biggest blocker to releases we have currently: due to the huge amount of infrastructure that has been done over the years, building is now easy but ensuring the quality of the vast amount of stuff we build is hard.
Regarding GCC, if you want to make a test build, fine. In fact, if you make a test build and give feedback on what does/doesn’t work, that’s very useful. We can very easily turn the handle and make an ‘official’ release so there’s not much to be gained by having confusing third party forks, though if we are tardy in doing that then I don’t think anyone will be mortally offended. SUL is a slightly more complex story because it’s a system-wide thing – you can only run one version, like SCL. That means forking is more tricky to handle. We don’t quite have nightly builds (of everything), but it’s very close – just requires time (you can see the build server here – there are still some blockers that need sorting). Again, asking for things is a good form of encouragement. If there’s anyone interested, we can also give you access to infrastructure to do the handle-turning yourself.
We tried; it didn’t work out. The problem is that GCCSDK is more than just the compiler – you need make, bash, fileutils, autoconf, etc etc to build any pre-existing programs of size. This is possible with cygwin, but there are all kinds of niggly differences between cygwin and Linux. The larger target programs become, the more sensitive they become to little differences. Worse, cygwin, the upstream compiler and the packages are moving targets, so your fix today might not work tomorrow. This makes maintaining a Windows branch an equivalent amount of work to the Linux branch, which we already have enough trouble finding manpower for. In fact, given that cygwin is a hack on top of Windows and doesn’t always work, probably more so. The simple answer today is to run a VM. An Ubuntu VM is easy to set up on Windows (VirtualBox is free), and GCCSDK will just work in there. It’s not worth the hassle of trying to deal with Windows+cygwin at present, given the small number of prospective users. Again, if someone is interested in changing this then do go ahead – I don’t know what happened with politics 15 years ago but any input is encouraged. TL;DR: If you want something, please at least ask. If you can help, so much the better – we’re happy to show you how. If you can’t, we don’t always (often?) have developer time to do something about it (as with ROOL bounties), but simple things are quick, and if you pester enough someone might find some time. |
jim lesurf (2082) 1438 posts |
It wasn’t really sarcasm TBH. It was a mix of accepting that I don’t grasp all the complications, and regret that what are in many ways “improvements” in the hardware we can use have also come with a lot of complications. But from my POV the ideal aim would be to have a compiler that generates a ‘standard’ set of instructions for the floating point ops it wants. Then have the FPE/CLib/whatever on a given machine deal with that by making interpretations that make sense on that machine. I would guess converting FP code to run on different FP hardware would be faster than converting it to run on int hardware. It makes sense that – if possible – this approach should be arranged so the newer, faster machines would need the least ‘interpretation’. So code should work on an Iyonix or other ‘old’ system, but the user couldn’t expect that to run as fast anyway. For word order, presumably the compiler can put a new flag into the executable. Then the FPE/CLib/whatever can spot when this is absent and go “Aha, old word order mode needed”. So again, the result might be slower, but would work. And given the source code, it could be recompiled to run faster because the word order was native for the new hardware. Indeed, the ‘new’ instructions from a new compiler that can flag that it’s new could match the main new hardware, I’d guess. I can’t help feeling this would turn out easier for program creators and users because the system deals with the hardware differences. Presumably at the expense of more effort having to go into the compilers and the new FP model adopted. However I admit I have no idea how feasible all this would be, or if it simply requires so much more work on FPE/CLib/whatever that it simply doesn’t make sense because there are other things which need more urgent attention. Hence my “what do I know?” after explaining why this way seems preferable to me. I realise I’m in no position to judge how hard the above would be. Just that it would seem desirable for users who are compiling, distributing, and using many small programs. Jim |
David Feugey (2125) 2709 posts |
Thanks. So I ask :) For the tests, there are a lot of Colin’s tools that are waiting. And to be honest, I have no problems with them. I agree anyway that a specific disclaimer could be provided with SUL 1.13.
Ours worked. But I no longer have access to the developer team responsible for that project. |
As usual, you can count on me here :) Just one stupid and completely off-topic question (I’m sure someone has already tried to explain this to me…). It was possible before to switch from x86 gcc to ARM gcc (with a very simple bash command – like a link or an environment variable; I don’t remember). It was very useful: launch configure with x86 options, adapt the makefile a bit, switch to ARM gcc and run make to get the tool. It was a trick, but very useful, even for complex projects (eQ compiled all the SDL and X stuff this way). Autoconf rarely supports RISCOS+ARM, but the makefile generated for Linux+x86 did work most of the time for a RISC OS build. Is it still possible to make this kind of switch, and how? I have no Linux VM here, but it could convince me to make some small ports of old SDL things. I just don’t have enough brain power left (and time too) to write autobuilder scripts. |
Steffen Huber (91) 1953 posts |
David, with GCCSDK, this “trick” is no longer necessary, because the ARM GCC outputting RISC OS binaries is already compatible with all this bash/autoconf/configure/make stuff. Or did I misunderstand you? |
Rick Murray (539) 13840 posts |
If we could take the Linux approach and just mandate that “this has changed” then things would be a lot simpler. The hardware has improved. RISC OS itself is improving. But we’re stuck with having to keep a link to the past. Try
You do realise – the compiler inserts FP instructions directly into the code when required? These are not library calls, they are instructions for the co-processor to execute. I suppose the compiler could change FP instructions to call an FP library that would make a decision based upon the system in use; I’m just wondering if it is necessarily viable. The two machine classes that do not support the VFP are the RiscPC era (RiscPC, A7000, etc) and the Iyonix. Both are old, one is very old. Frankly, I still regard the ability to select FPA only or VFP only as the sensible way forward (though for now the VFP will need to load the data backwards – thanks to Steve I see it can be done fairly easily in only two instructions). Does this mean that developers will need to offer two different versions? Yes and no. In essence, the support for VFP will be there for those of us who want to make use of it. The default state is to not.
…is only marginally different from converting 6502 code to run on the ARM. There are similarities, but there are also differences. Consider the two FP systems (three if you want to count NEON) to be like two (or three) different processors: similar in operation but different in implementation and behaviour. Oh, and for fun, we are thus far discussing using VFP as a replacement for FPA; we haven’t even touched upon VFP-specific functionality – the “vector” part of the name, SIMD, strides, and the like.
Or just assume the data is in the old order and load/save it in two instructions instead of one. That way, CLib (etc) doesn’t even need to be touched. VFP-enabled code would need to call two new functions (to enable/disable the VFP context) but these can be placed into “Stubs”. It is probably better there, where it can fail kindly if VFP does not exist, instead of bombing out with an invalid instruction message. CLib will need to be updated eventually (it uses FPA instructions), but that isn’t necessarily required for the compiler to support VFP. Additionally, with data stored in FPA format, one could mix FPA and VFP code…
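For example, exploiting the fact that s0 and s1 are the two halves of d0 – an FPA-format double (most significant word first) can be loaded correctly in two instructions:

        VLDR    s1, [a1, #0]    ; high word into the top half of d0
        VLDR    s0, [a1, #4]    ; low word into the bottom half

and stored back in FPA order the same way:

        VSTR    s1, [a1, #0]
        VSTR    s0, [a1, #4]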
If the compiler remains able to output FPA instructions, and this is the ‘default setting’, then at the very least everything will appear to work “as before”. This might be the most sensible option. CC emits VFP code when the -cpu option is suitable. In doing so, it will call two functions in stubs – one to create a VFP context, one to destroy it. There will be a third function, to allow the client to read the current/previous context. Doubles are loaded as two single reads, and written likewise (for FPA word order). Me? If FP was a core part of my software, I’d make a VFP version available for those able to run it. You just can’t ignore FPE running 40x slower than the native hardware……… |