ARM NEON in ObjAsm
Paolo Fabio Zaino (28) 1882 posts |
hello guys, I have a working fast memcpy/memset/memmove rewrite for RISC OS that uses NEON (and other ARM instructions when NEON is not present). Right now it’s written in GNU ASM and GCC, and I would like to convert the ASM side so it can build on ObjAsm as well (the stuff I post in the RISC OS Community on GitHub generally compiles on both). Now, the two assemblers are profoundly different, so probably the best solution is to separate the ASM files and keep the C code C99. I have experience from the past with ObjAsm and (these days) more experience with GAS when it comes to newer ARM features such as NEON. So, my question is: does ObjAsm support a “NEON is enabled on your target architecture” macro definition (something like the official ARM C language extension __ARM_NEON) when assembling for a target that has the NEON extensions? Obviously I am using that in GCC/GAS to compile for NEON-enabled targets vs NEON-disabled targets. If ObjAsm supports such a macro (or a related one), I imagine it will work fine in an ObjAsm IF statement.
Or does it need special treatment (for example, does it have to equal a specific value)? I don’t seem to be able to find any hint in the documentation about this, so thanks in advance! In case my English may be confusing, what I am asking is: if I assemble a piece of code in ObjAsm with the CPU target --cpu=Cortex-A15.no_neon instead of --cpu=Cortex-A15, does ObjAsm have a macro that tells my code to switch off the NEON sections? |
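[For reference, this is how the ACLE macro __ARM_NEON mentioned above is used on the GCC side: a minimal sketch of a NEON-vs-scalar switch. The function name and loop structure here are illustrative, not Paolo’s actual code.]

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Illustrative: copy n bytes, using 128-bit NEON loads/stores when the
 * ACLE macro __ARM_NEON says the target has the extension, and a plain
 * byte loop otherwise (and for the tail). */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
#if defined(__ARM_NEON)
    while (n >= 16) {                     /* 16 bytes per NEON register */
        vst1q_u8(dst, vld1q_u8(src));
        dst += 16; src += 16; n -= 16;
    }
#endif
    while (n--)                           /* scalar tail, or whole copy */
        *dst++ = *src++;
}
```

Built with `--cpu=Cortex-A15` the NEON path is compiled in; with `--cpu=Cortex-A15.no_neon` (or on a non-NEON target) only the byte loop remains.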
Paolo Fabio Zaino (28) 1882 posts |
Never mind, found it! :) Sorry, late night coding… For future reference, the answer is: {TARGET_FEATURE_NEON}. It is, as expected, {TRUE} if the target offers the NEON extensions and {FALSE} if it doesn’t. Cheers! P.S. Reminder to myself: read the manuals using Linux or Windows, so I can search in the PDFs! XD |
Matthew Phillips (473) 721 posts |
This sounds a very worthwhile development. If the SharedCLibrary incorporated it, and it appeared in the ROMs for the different platforms, I imagine this would give a speed boost to any C application using the standard library. How great is the increase in speed? Some of our applications throw a lot of data around so this could be beneficial. |
Rick Murray (539) 13850 posts |
Practically nothing. The problem is that the C compiler directly inserts FPA instructions into the executable. That which is present in CLib is the maths library for complex functions. Things that use square roots and powers and such would gain a small benefit, while ordinary maths on real numbers in programs (say, stuff like adding up currency) would see no benefit whatsoever as that’ll still be a set of emulated FPA instructions. Furthermore, there’s the additional complication that the word order is back to front (whether from the perspective of VFP or FPA, they’re not the same), so simply replacing FPA code with VFP won’t work. Not to mention even more complications in the form of contexts. None of this is insurmountable, however I think that it might be better to have a separate VFP enabled compiler, and a set of clone functions in CLib that are VFP enabled, and a Stubs that will link into the appropriate set of functions (remember, there’s all sorts of other things like atof, printf, etc!). |
Stuart Swales (8827) 1357 posts |
@Rick: Paolo and, I think Matthew, are interested in speeding up memmove here using NEON, not f.p. |
Rick Murray (539) 13850 posts |
<facepalm!> I wonder whether, as DavidS suggests, there would be any benefit in using NEON for shifting memory around when that’s the only thing happening (as is usually the case for a memmove); both plain ARM and NEON are liable to max out the memory bus. As my mom would say: suck it and see |
Clive Semmens (2335) 3276 posts |
Aaaarghhh! Anyone who uses Floating Point for currency calculations wants their head seeing to… |
Stuart Swales (8827) 1357 posts |
FPA does have packed decimal for just those reasons, Clive. |
Rick Murray (539) 13850 posts |
Indeed, but I have seen it done.

10 A = 0
20 FOR l% = 1 TO 100
30 A = A + 0.1
40 PRINT A
50 NEXT

This will start off well: 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 But it’ll get odd when it comes to three and a half: 3.4 3.5 3.59999999 3.69999999 3.79999999 3.89999999 3.99999999 And note that we have two values beginning 3.5. This could be resolved by rounding to nearest, but now you’re effectively changing every value to the next one up. A little further on is this: 8.99999999 9.09999999 9.2 9.3 Plus, what happens if we add 7.49999998 to 4.39999999? Should that be 11.88 or 11.90? Interestingly, this test appears to work with BASIC64. I think the simplest way of handling currency is to multiply by a hundred and just use integers to count cents/pennies. Or multiply by a thousand if accuracy is required to within a tenth of a penny. This is, of course, assuming a decimal currency and not something batpoop crazy like British pre-decimalisation. :-) |
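[The same drift is easy to reproduce in C; a small sketch comparing naive binary floating point accumulation with the integer-pennies approach Rick describes:]

```c
/* Accumulating 0.1 in binary floating point: each addition rounds,
 * and the error builds up, just like the BASIC loop above. */
double sum_tenths(int n)
{
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += 0.1;                /* 0.1 has no exact binary form */
    return a;                    /* for n == 100, NOT exactly 10.0 */
}

/* Counting pennies in an integer instead is exact by construction. */
long sum_tenths_in_pennies(int n)
{
    long pennies = 0;
    for (int i = 0; i < n; i++)
        pennies += 10;           /* 10 pennies per 0.1 added */
    return pennies;              /* for n == 100, exactly 1000 */
}
```

The double result lands very close to 10 but compares unequal to it, which is exactly the kind of surprise that breaks naive currency code.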
Clive Semmens (2335) 3276 posts |
Batpoop crazy indeed, but fairly typical of pre-decimalization units of all kinds and currencies everywhere. Pre-decimalization Indian currency was actually binary – down to a 64th of a rupee. None of these things are a problem though. You make the smallest coin, or 1/10th of it, or 1/12th of it or whatever, be the unit, and work in integer multiples of it. Maybe you need 64-bit integers if you’re doing orbital calculations to millimetre accuracy… |
Jeffrey Lee (213) 6048 posts |
Packed decimal is just a memory storage format. As soon as a value is loaded into a FP register it gets converted to ordinary binary floating point. |
Stuart Swales (8827) 1357 posts |
Fairy nuff, but ISTR FPE uses the equivalent of 80 bits for exponent/mantissa internally? So STFP will then re-round, so that were A in Rick’s loop to be defined as a decimal type (in some imaginary variant of BASIC), it wouldn’t accumulate those round-off errors. Not that it does anyway with BASIC64 FPA. |
Clive Semmens (2335) 3276 posts |
The round-off errors aren’t the biggest issue in my thinking: it’s speed. Integer arithmetic is far faster – and consumes far less energy – than floating point. Only an issue of any significance if you’re doing an awful lot of it, of course. |
Stuart Swales (8827) 1357 posts |
For decimal arithmetic (and other odd currency values) where you are forced to use f.p. (e.g. JavaScript Numbers), you just need to pre-scale by the non-power-of-two factors. So to get 0.01 accuracy, just prescale values by 25, rather than 100. |
Chris Evans (457) 1614 posts |
Stuart can you explain that a bit more please, I don’t understand. |
Stuart Swales (8827) 1357 posts |
It’s down to fractions whose denominators aren’t a power of two not having a finite binary representation (1/5, for instance, is 0.00110011… in binary), so IEEE 754 floating point values aren’t precise for those numbers (or multiples thereof). So to represent hundredths precisely, you could multiply all your values by 100 (e.g. keeping track of an integral number of pennies rather than £.pp). But dividing by two doesn’t lose any precision: the floating point mantissa remains the same and just the exponent is reduced by one, so you can take the powers of two out of the scale factor, leaving 25.

Little practical difference for examples you could do by hand, but if you need to interwork with different fractions and implement a system that keeps track of prime factors to help maintain precision, you don’t need to track the factors of two. The arithmetic rules are simple, just like you learned at school for fractions: when adding/subtracting, multiply numerators and denominators until you get a common denominator; when multiplying, multiply the denominators together; when dividing, cancel out common factors from the denominators. If you need a plain floating point value, just divide by the denominator.

I did hack PipeDream in about 2000 to do just this, but foolishly replaced the numeric arithmetic section with the new behaviour, so, as you’d expect/hope, it gave different results, which would cause spreadsheets with existing cross-checks to fail, so it was never released. If I’d had half a brain, I’d have made it a spreadsheet evaluator option to supplement the existing code, not replace it. |
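[Stuart’s point can be checked directly: a whole number of pennies prescaled by 25 is pounds × 25 = pennies / 4, a quarter-integer, which is a dyadic rational and therefore exactly representable in binary floating point. A tiny sketch (function names are illustrative):]

```c
/* 0.1 + 0.2 in raw binary floating point is famously not 0.3... */
int naive_is_exact(void)
{
    return 0.1 + 0.2 == 0.3;                       /* false */
}

/* ...but pennies prescaled by 25 (pounds * 25 == pennies / 4) are
 * dyadic rationals, so sums of such values are computed exactly
 * (within the mantissa's 53-bit range). */
double prescaled(long pennies)
{
    return pennies / 4.0;        /* exact: divide by a power of two */
}

int prescaled_is_exact(void)
{
    /* 0.10 + 0.20 == 0.30, expressed in the prescaled domain */
    return prescaled(10) + prescaled(20) == prescaled(30);   /* true */
}
```

Dividing the final result by 25 (and letting the exponent absorb the remaining factor of 4) recovers the pounds value for display.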
Clive Semmens (2335) 3276 posts |
On the other hand, if instead of multiplying only by all the non-power-of-two factors you do multiply by all the powers of two as well, then you don’t need floating point at all. Then you can use integers, which are altogether more efficient – faster, and less power-hungry. Less memory-hungry as well. (Of course if you’re dealing with national budgets, or corporate accounts and working to the nearest penny, then +/-2^31 may not be big enough – +/-2^63 is of course plenty big enough for most purposes. BBC BASIC doesn’t provide that option, although it would be more efficient for most purposes than floating point. But what you do elsewhere is up to the programmer.) |
Stuart Swales (8827) 1357 posts |
Doing it with 64-bit integer numerators (and tracking the twos as well) would be the way to go these days – back then ye olde Norcroft cc 4.05 didn’t support 64-bit integer types! Weren’t there plans to support i%% in BASIC? [@Gavin: your link to Zahl below is broken (put here to save another post)] |
Chris Evans (457) 1614 posts |
Thanks Stuart, I’m suitably enlightened. |
GavinWraith (26) 1563 posts |
While we are talking numbers, may I put in a puff for Zahl? I made this as a substitute for Nick Craig-Wood’s BigNumber module. It provides two integer types, small and big. There is an example to show how to implement RSA cryptography with it.
Clive Semmens (2335) 3276 posts |
The ARM instruction set (up to v7; I don’t know v8 at all) is particularly well suited to very large integer handling. I wrote a whole set of very large integer arithmetic routines back in the day – before the shift from 26-bit addressing, so they’d need updating to be any use now. Just addition, subtraction, multiplication, division, x^y, int(x^(a/b)) – that by Newton-Raphson – conversion between any two bases (not just binary/hex and decimal…) and a couple of fast methods of finding factors. I don’t know how Zahl represents integers – my trick was simply to use the first (32-bit) word to specify how many following words were used to represent the integer. That limits you to integers less than (2^32)^(2^32 – 1)… it would be easy enough to use one bit of the “size” word to specify whether to use a second size word (and perhaps a third… etc), but let’s not get too silly… |
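[Clive’s length-prefixed layout is easy to sketch in C; on ARM the inner loop maps naturally onto ADDS/ADCS. A minimal big-number add, assuming (as he describes) that word 0 holds the count of 32-bit value words that follow, least significant word first:]

```c
#include <stdint.h>

/* Add two length-prefixed big integers: buf[0] = number of 32-bit
 * words, buf[1..buf[0]] = value, least significant word first.
 * dst must have room for max(a[0], b[0]) + 2 words. Illustrative
 * sketch only; a real routine would use ADDS/ADCS in assembler. */
static void big_add(uint32_t *dst, const uint32_t *a, const uint32_t *b)
{
    uint32_t na = a[0], nb = b[0];
    uint32_t n = na > nb ? na : nb;
    uint64_t carry = 0;

    for (uint32_t i = 1; i <= n; i++) {
        uint64_t s = carry;              /* carry from previous word */
        if (i <= na) s += a[i];
        if (i <= nb) s += b[i];
        dst[i] = (uint32_t)s;            /* low 32 bits of the sum */
        carry = s >> 32;                 /* high bit becomes carry */
    }
    if (carry)                            /* result grew by one word */
        dst[++n] = (uint32_t)carry;
    dst[0] = n;                           /* store the new length */
}
```

As Clive notes, the size word caps values at (2^32)^(2^32 − 1), which is rather more than enough.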
Paolo Fabio Zaino (28) 1882 posts |
I totally forgot about this thread lol, what the heck has happened here? XD @ Matthew As soon as I have a good benchmark running on RISC OS I’ll share the numbers I get on all my boards. @ Rick yes the original topic is about improving memory (copy, set and move) performance in C on RISC OS. |
Paolo Fabio Zaino (28) 1882 posts |
@ Stuart Gavin’s link to Zahl is here: http://www.wra1th.plus.com/soft.html Scroll down. And great work, Gavin! :) |
Paolo Fabio Zaino (28) 1882 posts |
For the new topic about maths operations using NEON: NEON can benefit these types of operations only when a piece of software does a lot of similar computations that can be parallelised. In other words, if you just need to do a single sqrt on a single value, there will be little to no benefit in a NEON-based implementation of it. One reason I am working on another project using NEON for maths functions is to improve Genann performance: each neuron performs an FP calculation, and (in networks with backpropagation) this can happen multiple times for the same neuron; given that a single neural network layer may have many neurons, and a neural network is composed of multiple layers of neurons, you can imagine the totals. There are many applications for NEON in AI: https://www.youtube.com/watch?v=lUmjnCdGtGE The video above is a presentation of a Tech Talk that should happen on the 21st of September 2021, and it’s specifically for ARM. Some of the techniques that will be explained in that Tech Talk I am implementing on RISC OS, and hopefully I’ll have enough time to get them to a demo-able state at some point. All of this has very little to do with BBC BASIC-type uses and/or retro-coding approaches on RISC OS, so maybe it’s not of interest here, in which case sorry for the noise. |
GavinWraith (26) 1563 posts |
The big integer type in !Zahl is realized by the IMath library (copyright © 2002-2009 Michael J. Fromberger), an ISO C arbitrary precision integer and rational library that will tell you what you want to know about its representation of numbers. Nick Craig-Wood originally wrote his BigNum (relocatable) module in ARM assembler, offering SWI calls so that any kind of RISC OS application could use it. When he left RISC OS-land I offered to maintain it for him. I came across some bugs. For example, the Newton-Raphson algorithm for square root does not work in integer arithmetic when you start with a number one less than a power of two. After a while I found the ARM assembler code too difficult to maintain and, finding C easier to work with, I produced !Zahl as a partial replacement. If I may mount my hobby horse at this point: there are indefinitely many ways of representing numbers in a computer, but it seems that most people are fixated on the standard floating point representations and are unaware of alternatives. Floating point numbers suit engineering very well (I am not being rude here), but they do have their own drawbacks. Their principal virtue is to break the bond between the value of a number and the number of bytes required to store its representation. Their principal vice is that the ordinary laws of arithmetic, e.g. that (a + b) + c = a + (b + c), no longer hold exactly. |
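[For the curious, the usual integer Newton-Raphson square root looks like this in C. The iteration must stop as soon as the estimate stops decreasing, because in integer arithmetic it can otherwise oscillate between two neighbouring values (a hazard of the kind Gavin describes; this sketch starts from n itself rather than a power-of-two guess, and is not the BigNum module’s code):]

```c
#include <stdint.h>

/* Integer square root by Newton-Raphson: returns floor(sqrt(n)).
 * Starting from x = n, the sequence x, (x + n/x)/2, ... decreases
 * until it reaches floor(sqrt(n)), after which it can bounce between
 * two neighbours -- hence the "y < x" stopping test. */
static uint64_t isqrt64(uint64_t n)
{
    if (n < 2)
        return n;                        /* sqrt(0)=0, sqrt(1)=1 */
    uint64_t x = n;
    uint64_t y = (x + 1) / 2;            /* first Newton step */
    while (y < x) {
        x = y;
        y = (x + n / x) / 2;             /* Newton-Raphson update */
    }
    return x;
}
```

For example, isqrt64(15) is 3 while isqrt64(16) is 4, and values just below a power of two (such as 2^32 − 1) come out correctly.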