FP support
Rick Murray (539) 13840 posts |
This is too weird. I have taken your suggested code and tweaked it as follows – I have pushed the entry R0 to R4; and you don’t need to preserve R0-R3 (APCS defines them as “can be corrupted” – normally a function returns something in R0). Here it is:
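(The listing did not survive the trip to this page; the sketch below is reconstructed from the description that follows, so the exact instructions are illustrative only, not the verbatim code.)

   MOV     R4, R0          ; stash the entry R0 in R4 (R0-R3 are corruptible)
   MOV     R3, R4          ; copy R4 into R3...
   SWI     "Float_Start"   ; ...call Float_Start
   MOV     R1, R3          ; ...then copy R3 into R1
   SWI     "Float_MUL"     ; and do the multiply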
I have used R4 for a very specific reason. The reason being that R4 is not R3. Confused? Look at the code. The code doesn’t make sense, does it? I copy R4 into R3 and then call Float_Start, and then I copy R3 into R1 for Float_MUL. Easy. This nonsense code works with Float 0.65. What you see works. Uh… Okay. Over to you. My head hurts. |
Steve Drain (222) 1620 posts |
Perhaps that is what makes it odd to me. ;-)
I test from BASIC and use 10000000. I have to run the Float_Dummy SWI to discount the BASIC overheads, but I see those impressive speed advantages, too.
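(Roughly, the shape of such a test – a sketch of the method, not the actual harness:)

   t% = TIME
   FOR i% = 1 TO 10000000
     SYS "Float_Dummy"
   NEXT
   base% = TIME - t%
   REM time the SWI under test the same way, then subtract base%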
Maybe I picked up on this, without analysing it properly:
Yes. I spotted that a while after I posted, but I thought I would give you the pleasure of pointing it out. ;-) So, there is a use for R4 after all, although I would employ R3 for that purpose, for style only. If correcting that is not effective then I am not sure what to suggest. I would expect Float_Start to be called when the program runs and Float_Stop when it closes down. Then any Float SWI can be used in between, with the use of VFP or FPA code being automatically chosen. I had not thought it would be used the way you have. |
Rick Murray (539) 13840 posts |
Got it! (wheeeee!) Your function for Float_Start preserves R3; however you call OS_Module 18 to check that VFPSupport is present. That call corrupts R4 and R5. I’ve made the change (STMFD R13!,{R3-R5,R14} and the corresponding LDMFD) and rebuilt the module, and now Float 0.65 works. Phew!
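(Illustratively, the shape of the fix – the label and surrounding code here are hypothetical, not the module’s actual source:)

   .float_start
   STMFD   R13!, {R3-R5, R14}   ; now saves R4 and R5 too - OS_Module 18 corrupts them
   MOV     R0, #18              ; OS_Module reason 18: look up a module by name
   ADR     R1, vfp_name         ; -> "VFPSupport"
   SWI     "XOS_Module"         ; V set on return means VFPSupport is absent
   LDMFD   R13!, {R3-R5, PC}    ; the corresponding restore
   .vfp_name
   EQUS    "VFPSupport" : EQUB 0 : ALIGN
|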
Steve Drain (222) 1620 posts |
My first thought is that Float_Start is corrupting R4, but I am not near the machine to check. I will report back. I am glad that otherwise the code now works. ;-) Edit: We crossed, but good news. I will make the change. |
Rick Murray (539) 13840 posts |
Pfffft…. I spent an hour and a half chasing ghosts. I was so busy getting the “fake BL by messing with R14 and PC” to work nicely that I forgot to pass the parameters to the Float_MUL code, and then wondered why it crashed. Duh. Observations:
Okay. Here’s the final result, on a Pi model B revision 2, clocking at default speed (what is that, 700MHz or something?).
The OS_IntOn SWI is the SWI with the lowest overheads I could think of that doesn’t muck up registers. This was mainly to see what influence a SWI call would make. Kappa Multiply is calling Float_MUL. KappaDirect Multiply is calling the same by branching directly into the jump table. The direct call into the Float module has the danger that it sets up a CLib “fixed pointers problem”, but unlike CLib, Float will refuse to quit if there are active clients – and for that Steve is to be commended.

It has been shown that, for those using assembler (and BASIC), there is a viable option to call FP operations regardless of whether FPA, FPEmulator, or VFP is installed. Float’s 32cs is slower than raw VFP’s 7cs, but it is far better than FPA’s 388cs. I suppose Float on a non-VFP system might be kind of slow, but there are fewer and fewer such machines and really we ought to look at methods to make use of the VFP we have instead… hang on, I have a feeling I’ve said this before.

Actually, I noticed that Float can be optimised further – it looks like most of the instructions check the context pointer given and branch if VFP. Why? As the FPEmulator is the slow path on most 32-bit machines, having the VFP code as the fall-through and branching to the FPE code could bring that number down a little. ;-)

Okay. I’ve spent way too long on this stuff, but I wasn’t willing to let it go until the test program worked.
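(For reference, the usual shape of that direct call – faking the BL by hand – is sketched below; the register choices and label are illustrative, not the actual test code.)

   MOV     R14, PC                 ; PC reads as .+8, so R14 -> .resume
   LDR     PC, [R12, R0, LSL #2]   ; branch through entry R0 of the jump table at R12
   .resume                         ; the routine's MOV PC,R14 comes back here
|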
Rick Murray (539) 13840 posts |
PS: Steve, get Float registered. :-) |
Steve Drain (222) 1620 posts |
@Rick Thanks for all your help. I think it is true to say that developers crave feedback, and reports of errors as much as anything. Silence is the killer of ambition. ;-)
I use an alternative method that reduces the ‘messing about’. This is taken straight from a bit of BASIC assembler:
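(The listing has not survived here; the sketch below shows one arrangement consistent with the description – the labels are hypothetical, not the actual code.)

   ADR     R14, table              ; r14 is both the return address and the table base
   ADD     PC, R14, R0, LSL #2     ; dispatch: enter entry R0 of the table (R0 >= 1)
   .table
   B       carry_on                ; entry 0: a routine's MOV PC,R14 lands here
   B       float_add               ; entry 1
   B       float_mul               ; entry 2
   .carry_on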
In this case r14 is also the branch table base; substitute another register in other situations. A more extensive version of this is used in Basalt.
I have been caught with that error, but not consistently. I think it may be partly a consequence of not preserving R0 over the call. I have changed that and I will see what happens in the future.
There is a Float_Dummy SWI. It is not documented, but I did mention it before.
;-)
Those figures are commensurate with what I found and justified the pursuit of this method.
I took this point when you mentioned it before and I am in the process of reversing the branch order. It makes a tiny, but detectable, change.
OK, but it cannot be registered as Float, which is the name of Steve Fryatt’s application. Any suggestions? After all this, you have really not addressed the primary purpose of Float, which is to provide the transcendental operations that VFP does not have. Perhaps you could run your test on COS, for instance. |
Dave Higton (1515) 3525 posts |
FloatingPoint |
Rick Murray (539) 13840 posts |
SmartFP ? |
Rick Murray (539) 13840 posts |
Tell me about it. I have in the past come across some very obvious problems and thought: how has nobody else seen this? People – pay attention. Whinge (nicely!) or there is a chance we may not ever see the problem. I don’t know about you, but sometimes in my programs I add features that I don’t need myself but that are in some way relevant, to make the program more useful. Consider the idea of mail merge. If I was writing a word processor, I would add something like that to it, despite having never had a use for it in my life…
;-) I was just looking for ways to handle non-integer numbers better than emulating a piece of hardware nearly a third of a century old….. Sure – I would be happy to do that. How? Isn’t COS the one to draw circles on the screen? I read your brief overview of the extra operations. Didn’t understand a word of it…. ;-) |
Rick Murray (539) 13840 posts |
Update – looked COS up on wiki. https://en.wikipedia.org/wiki/Trigonometric_functions Riiiiight. Is there a version written in English? And I don’t mean https://simple.wikipedia.org/wiki/Trigonometric_function ;-) |
GavinWraith (26) 1563 posts |
I think I am right in saying that BBC BASIC, at least back in the days of the BBC B, implemented the transcendental functions that VFP does not have by using rational function approximants, as continued fractions. That is to say, for each function f you store two tables of numbers a[i], b[i] whose size, n, determines the accuracy and range of the approximation and then define
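(The definition itself was lost in conversion; the usual shape of such a continued-fraction approximant built from the two tables is something like:)

   f(x) ~ a[0] + b[0]·x / (a[1] + b[1]·x / (a[2] + … + b[n-1]·x / a[n]))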
No doubt appropriate values of these tables for a given size of fp-number can be found in the holy books. This gives a straightforward and uniform method of implementation, at least for an appropriate range of argument. |
Steve Drain (222) 1620 posts |
I just mean that you substitute Float_COS for Float_MUL in your routine, remembering that COS has only one parameter. You are relieved of all responsibility for the mathematics. ;-) |
Steve Drain (222) 1620 posts |
That could well be the case – I have no knowledge – but continued fractions involve repeated division, and even on a system with hard float, division is more costly than multiplication.

Float uses Chebyshev polynomials, which are derived from Taylor series expansions. The latter are interesting mathematically, but very inefficient computationally. The aim with Chebyshev polynomials is to have as few terms as possible for the required precision. The number of terms is also governed by the range over which the precision is required, so there is an element of range reduction before actual computation starts. The calculation of the polynomial coefficients for the required precision is not simply done. The Holy Book for this is ‘Computer Approximations’ by John Hart et al., first published in 1968, half of which is computer-typeset tables. The other half is ‘Notes’ on how to use them – but what notes! Continued fractions are mentioned, but only to derive polynomial approximations.

Here is an example for cos(x) to 8 decimal places for -1/2 < x < +1/2:

             313x^4 + 6900x^2 + 15120
   cos(x) ~ --------------------------
              13x^4 + 660x^2 + 15120

For the computation we first calculate X = x^2, then do:

   P = 15120 + X(6900 + X(313))
   Q = 15120 + X(660 + X(13))

and lastly cos(x) = P/Q. That is 4 additions, 5 multiplications and 1 division. Here endeth the lesson. ;-)
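(In BASIC the recipe transcribes directly – a sketch for illustration:)

   DEF FNratcos(x)
   LOCAL X, P, Q
   REM valid for -1/2 < x < +1/2, to 8 decimal places
   X = x * x
   P = 15120 + X * (6900 + X * 313)
   Q = 15120 + X * (660 + X * 13)
   = P / Q
|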
GavinWraith (26) 1563 posts |
And only integer coefficients!
You can always get away with only one division, and use Horner’s method to minimize the number of multiplications for the numerator and denominator polynomials. Rational functions may not make sense for approximating asymptotic behaviour. Theoretically you can approximate any continuous function by polynomials over a bounded closed interval, but using rational functions probably has computational advantages. The code to evaluate a rational function can be completely generic, taking as parameters the argument x to the function, the size n of the two arrays and pointers to them. That might explain why VFP does not bother to include individual transcendental functions – but then the same might apply to other fp systems, I suppose.
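Concretely, with Horner’s method the single-division form is:

   f(x) ~ P(x)/Q(x), where P(x) = a[0] + x·(a[1] + x·(a[2] + … + x·a[n]))

and likewise Q(x) from the b[i] – n multiplications and n additions per polynomial, plus the one division.
|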
Steve Drain (222) 1620 posts |
It took some effort to twist my head around what I have done already, but I am a sucker for more punishment, so can you point me to any books/papers on this? And the Holy Book, of course. |
GavinWraith (26) 1563 posts |
Not sure what to recommend. I only mean that the library would contain code compiled from something like this for evaluating rational functions, with the user able to cook up her own. I am no numerical analyst. It is a shame Joe Taylor is not on hand as this kind of thing is up his street, and he would know which Holy books to consult.
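(The listing itself is lost; a generic evaluator of the kind described, sketched in BASIC with illustrative names:)

   DEF FNrational(x, a(), b(), n%)
   LOCAL i%, p, q
   REM Horner's method over both tables; one division at the end
   p = a(n%) : q = b(n%)
   FOR i% = n% - 1 TO 0 STEP -1
     p = p * x + a(i%)
     q = q * x + b(i%)
   NEXT
   = p / q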
|
Steve Drain (222) 1620 posts |
@Gavin On closer inspection, I think we are talking about the same thing, but coming from different directions and with only slightly overlapping terminology. What you have above is equivalent to what I have in Float, except that mine is in assembler. I do not think there is likely to be an advantage to be gained looking further. |
Rick Murray (539) 13840 posts |
Okay then. FPEmulator’s COSD versus Float’s Float_COS (using VFP). This is a rather obvious race, but okay, let’s do it, we might be surprised… Oh, really? Surprising would be America being dumb enough to vote Donald Trump for president¹. Surprising would not be Float kicking FPEm’s ass, that’s just so obvious.
So, Steve, a minute and a half versus seven seconds (worst case with SWI decode). Does that answer your question? ☺

¹ Before anybody says naah, it’ll never happen, remember they re-elected Dubyah, leading to a perfect front page of the Daily Mirror. |
Steve Drain (222) 1620 posts |
Many thanks. It is consistent with what I found, but your tests are more rigorous and do not have BASIC getting in the way. I have asked for the module to be registered as SmartFP. |
jim lesurf (2082) 1438 posts |
Not sure it is relevant to the above, but FWIW whenever I need to compute maths functions I tend to start from either:

1) Numerical Recipes in [whatever language], or
2) Abramowitz and Stegun, Pocketbook of Mathematical Functions.

(1) is, I assume, well known by programmers and has useful code examples. (2) may be less well known unless the programmer is also a mathematician. Dunno, I just latched on to it as a ‘go to’ for solving all the problems I’m not mathematician enough to do for myself. :-)

There is also a Handbook of Mathematical Functions. Bigger, and useful if you want tables of values to check against your own results.

Jim |
Jeffrey Lee (213) 6048 posts |
Steve: If you’re going to release Float/SmartFP ‘officially’ then I should probably point out a few concerns that came to mind when I had a look at the code the other night:
If you can fix those then I think the code will be in a pretty good state to use as the basis for a VFP version of BASIC64. |
Steve Drain (222) 1620 posts |
@Jim Thanks. I have come across the Abramowitz books. They are fascinating, but when it comes to approximations at double precision they come nowhere close to providing a solution. |
Steve Drain (222) 1620 posts |
@ Jeffrey
Something I was aware of, but I did not see anything simple to do with it, and my immediate attention then was on the calculation. I will go back and look with your suggestions in mind.
Without looking, I think I have done this somewhere in the module already.
It so happens that for LOG10 it is much easier to find coefficients that cover the range with the least computation at the necessary precision. I did read quite a lot about LOG2 algorithms, but in practical terms it did not seem the best solution. It may still be worth another look, but I am happy the speed is reasonably good.
The same rule I quoted to Gavin. Your suggestion sounds like an excellent idea.
Oh, yes! I rapped Rick’s knuckles for not reading the header REMs where this is specified. ;-)
So that is how it works! I am at this moment doing the FPSCR page for an update to the StrongHelp VFP manual and I was puzzling over how you use it. The Float SWIs do attempt to forestall exceptions by checking the input. They provide more useful error messages than the FPA ones, I think.
I would prefer to trap NaN and Infinity at the input stage, because I do not foresee a situation where the SWIs should be using them.
Thanks. I have been mentally preparing a hack of the BASIC64 module to do that. ;-) |
David Feugey (2125) 2709 posts |
Very good news for Basic
Very good too :) |