RISC OS Open: Forum: ARM NEON in ObjAsm

Sep 5, 2021 11:15pm

Paolo Fabio Zaino (28) 1882 posts

Hello guys,
a quick question about recent changes in ObjAsm on the matter of ARM extensions:

I have a working fast memcpy/memset/memmove rewrite for RISC OS that uses Neon, and other ARM instructions when Neon is not present. Right now it’s written in GNU ASM and GCC. I would like to convert the ASM side so it can build on ObjAsm as well (the stuff I am posting in the RISC OS Community on GitHub generally compiles on both).

Now, the two assemblers are profoundly different, so probably the best solution is to separate the ASM files and keep the C code C99. I have experience from the past with ObjAsm and (these days) more experience with GAS when it comes to new ARM features such as Neon.

So, my question is:

Does ObjAsm support a “Neon is enabled on your target architecture” macro definition (something like the official ARM C Language Extensions macro __ARM_NEON) when compiling (or assembling) for a target that has Neon extensions? Obviously I am using it in GCC/GAS to compile for Neon-enabled vs Neon-disabled targets.

If ObjAsm supports the macro above (or a related one), I imagine it will work fine in an ObjAsm IF statement like the following:


[ __ARM_NEON

    ; ARM Neon related instructions

|

    ; Non Neon related instructions

]

Or does it need special treatment? (For example, does it have to be compared against a specific value?)

I don’t seem to be able to find any hint in the documentation about this, so thanks in advance!

In case my English is confusing, what I am asking is:

If I assemble a piece of code in ObjAsm with the following CPU target:

--cpu=Cortex-A15.no_neon

instead of:

--cpu=Cortex-A15

Does ObjAsm have a macro that tells my code to switch off the Neon sections?

Sep 5, 2021 11:38pm

Paolo Fabio Zaino (28) 1882 posts

Never mind, found it! :)

Sorry late night coding…

For future refs, the answer is:

{TARGET_FEATURE_NEON}

And it is, as expected, {TRUE} if the target offers Neon extensions and {FALSE} if it doesn’t.

Cheers!

P.S. Reminder to myself: Read the manuals using Linux or Windows, so I can search in the PDFs! XD

Sep 15, 2021 7:17am

Matthew Phillips (473) 721 posts

This sounds a very worthwhile development. If the SharedCLibrary incorporated it, and it appeared in the ROMs for the different platforms, I imagine this would give a speed boost to any C application using the standard library. How great is the increase in speed?

Some of our applications throw a lot of data around so this could be beneficial.

Sep 15, 2021 12:34pm

Rick Murray (539) 13840 posts

How great is the increase in speed?

Practically nothing.

The problem is that the C compiler directly inserts FPA instructions into the executable. That which is present in CLib is the maths library for complex functions. Things that use square roots and powers and such would gain a small benefit, while ordinary maths on real numbers in programs (say, stuff like adding up currency) would see no benefit whatsoever as that’ll still be a set of emulated FPA instructions.

Furthermore, there’s the additional complication that the word order is back to front (whether from the perspective of VFP or FPA, they’re not the same), so simply replacing FPA code with VFP won’t work. Not to mention even more complications in the form of contexts.

None of this is insurmountable, however I think that it might be better to have a separate VFP enabled compiler, and a set of clone functions in CLib that are VFP enabled, and a Stubs that will link into the appropriate set of functions (remember, there’s all sorts of other things like atof, printf, etc!).

Sep 15, 2021 1:08pm

Stuart Swales (8827) 1357 posts

@Rick: Paolo and, I think, Matthew are interested in speeding up memmove here using NEON, not f.p.

Sep 15, 2021 4:07pm

Rick Murray (539) 13840 posts

interested in speeding up memmove here using NEON, not f.p.

<facepalm!>

I wonder if, as DavidS suggests, there would be benefit to using NEON for shifting memory around if that’s the only thing that is happening (as is usually the case for a mem move), in which case both ARM and NEON are liable to max out the memory bus.

As my mom would say: suck it and see

Sep 15, 2021 5:07pm

Clive Semmens (2335) 3276 posts

(say, stuff like adding up currency) would see no benefit whatsoever as that’ll still be a set of emulated FPA instructions

Aaaarghhh!

Anyone who uses Floating Point for currency calculations wants their head seeing to…

Sep 15, 2021 5:18pm

Stuart Swales (8827) 1357 posts

FPA does have packed decimal for just those reasons, Clive.

Sep 15, 2021 5:42pm

Rick Murray (539) 13840 posts

Anyone who uses Floating Point for currency calculations wants their head seeing to…

Indeed, but I have seen it done.

10 A = 0
20 FOR l% = 1 TO 100
30   A = A + 0.1
40   PRINT A
50 NEXT

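This will start off well:

0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2

But it’ll get odd when it comes to three and a half:

3.4
3.5
3.59999999
3.69999999
3.79999999
3.89999999
3.99999999

And, note, that we have two values beginning 3.5. This could be resolved by rounding to nearest, but now you’re effectively changing every value to the next one up.

A little further on is this:

8.99999999
9.09999999
9.2
9.3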

Plus, what happens if we add 7.49999998 to 4.39999999? Should that be 11.88 or 11.90?

Interestingly, this test appears to work with @%="F.2". Maybe Sophie figured somebody was bound to try it in BASIC?

I think the simplest way of handling currency is to multiply by a hundred and just use integers to count cents/pennies. Or multiply by a thousand if accuracy is required to be within a tenth of a penny.

This is, of course, assuming a decimal currency and not something batpoop crazy like British pre-decimalisation. :-)
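
The integer-pennies approach can be sketched as follows. This is a minimal illustration in Python, which uses IEEE 754 doubles rather than BBC BASIC's 5-byte reals, so the drift shows up at different digits than in the BASIC run; the helper name `format_pence` is made up for the example.

```python
# Accumulate 0.1 one hundred times in binary floating point.
# (IEEE 754 doubles here, so the exact digits differ from what BBC BASIC prints.)
total_float = 0.0
for _ in range(100):
    total_float += 0.1

# The same sum counted in integer pennies: 10 pence, one hundred times.
total_pence = 0
for _ in range(100):
    total_pence += 10


def format_pence(pence):
    """Render a non-negative integer number of pence as pounds.pence."""
    pounds, rest = divmod(pence, 100)
    return f"{pounds}.{rest:02d}"


print(total_float == 10.0)        # False: rounding error has accumulated
print(format_pence(total_pence))  # 10.00
```

The value is only converted to a decimal string at the point of display, so no rounding policy is needed during the arithmetic itself.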

Sep 15, 2021 6:02pm

Clive Semmens (2335) 3276 posts

something batpoop crazy like British pre-decimalisation. :-)

Batpoop crazy indeed, but fairly typical of pre-decimalization units of all kinds and currencies everywhere. Pre-decimalization Indian currency was actually binary – down to a 64th of a rupee.

None of these things are a problem though. You make the smallest coin, or 1/10th of it, or 1/12th of it or whatever, be the unit, and work in integer multiples of it. Maybe you need 64-bit integers if you’re doing orbital calculations to millimetre accuracy…

Sep 15, 2021 7:12pm

Jeffrey Lee (213) 6048 posts

FPA does have packed decimal for just those reasons, Clive.

Packed decimal is just a memory storage format. As soon as a value is loaded into a FP register it gets converted to ordinary binary floating point.

Sep 15, 2021 9:16pm

Stuart Swales (8827) 1357 posts

As soon as a value is loaded into a FP register it gets converted to ordinary binary floating point.

Fairy nuff, but ISTR FPE is using the equivalent of 80 bits for exp/1mantissa internally? So STFP will then re-round so that were A in Rick’s loop to be defined as decimal type (in some imaginary variant of BASIC), it wouldn’t accumulate those round-off errors. Not that it does anyway with BASIC64 FPA.

Sep 16, 2021 5:23am

Clive Semmens (2335) 3276 posts

wouldn’t accumulate those round-off errors

The round-off errors aren’t the biggest issue in my thinking: it’s speed. Integer arithmetic is far faster – and consumes far less energy – than floating point. Only an issue of any significance if you’re doing an awful lot of it, of course.

Sep 16, 2021 12:47pm

Stuart Swales (8827) 1357 posts

work in integer multiples of it

For decimal arithmetic (and other odd currency values) where you are forced to use f.p. (e.g. JavaScript Numbers), you just need to pre-scale by the non-power-of-two factors. So to get 0.01 accuracy, just prescale values by 25, rather than 100.

Sep 16, 2021 6:05pm

Chris Evans (457) 1614 posts

Stuart can you explain that a bit more please, I don’t understand.

Sep 17, 2021 9:44am

Stuart Swales (8827) 1357 posts

It’s down to fractions that aren’t a power-of-two not having a finite binary representation (like 1/5, being 0.00110011…) so floating point values in IEEE 754 aren’t precise for those numbers (and multiples thereof). So to represent hundredths precisely, you could multiply all your values by 100 (e.g. keeping track of an integral number of pennies rather than £.pp). Dividing by two doesn’t lose any precision: the floating point mantissa remains the same, just the exponent is reduced by one, so you can take the powers of two out of the scale factor, leaving 25. Little practical difference for examples you could do by hand, but if you need to interwork with different fractions, to implement a system to keep track of prime factors to help maintain precision, you don’t need to keep track of the number of two factors. Arithmetic rules are simple, just like you learned at school for fractions: when adding/subtracting, multiply both numerators and denominators until you get a common denominator; when multiplying, multiply the denominators together; when dividing, cancel out common factors from the denominators. If you need a common floating point value, just divide by the denominator.

I did hack PipeDream in about 2000 to do just this, but foolishly replaced the numeric arithmetic section with this new behaviour, so as you’d expect/hope, it gave different results, and so would cause spreadsheets with existing cross-checks to fail, so it was never released. If I’d have had half a brain, I’d have made it a spreadsheet evaluator option to supplement the existing code, not replace it.
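
The prescale-by-25 point can be checked directly. This is a minimal sketch in Python (IEEE 754 doubles), not the PipeDream code; the names `SCALE` and `to_scaled` are invented for the illustration. An amount of n hundredths scaled by 25 is n/4, and quarters are exactly representable in binary.

```python
SCALE = 25  # 100 with its powers of two (2 * 2) divided out


def to_scaled(hundredths):
    """Hold an amount given in whole hundredths as a float scaled by 25.

    hundredths / 100 * 25 == hundredths / 4, which is exact in binary.
    """
    return hundredths / 4


# Add 0.10 one hundred times, held in scaled form (each term is exactly 2.5).
total = 0.0
for _ in range(100):
    total += to_scaled(10)

print(total)          # 250.0
print(total / SCALE)  # 10.0
```

Every intermediate value is a multiple of 0.25, so the additions are exact until the totals outgrow the mantissa; the division by 25 happens once, at the end.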

Sep 17, 2021 12:19pm

Clive Semmens (2335) 3276 posts

On the other hand, if instead of multiplying only by all the non-power-of-two factors you do multiply by all the powers of two as well, then you don’t need floating point at all. Then you can use integers, which are altogether more efficient – faster, and less power-hungry. Less memory-hungry as well.

(Of course if you’re dealing with national budgets, or corporate accounts and working to the nearest penny, then +/-2^31 may not be big enough – +/-2^63 is of course plenty big enough for most purposes. BBC BASIC doesn’t provide that option, although it would be more efficient for most purposes than floating point. But what you do elsewhere is up to the programmer.)
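
To put figures on those two ranges when the unit is one penny, here is a small Python sketch; the pound formatting and the helper name `as_pounds` are assumptions made for the illustration.

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def as_pounds(pence):
    """Format a non-negative penny count as pounds with thousands separators."""
    pounds, rest = divmod(pence, 100)
    return f"£{pounds:,}.{rest:02d}"


print(as_pounds(INT32_MAX))  # £21,474,836.47
print(as_pounds(INT64_MAX))  # £92,233,720,368,547,758.07
```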

Sep 17, 2021 12:58pm

Stuart Swales (8827) 1357 posts

Doing it with 64-bit integer numerators (and tracking the twos as well) would be the way to go these days – back then ye olde Norcroft cc4.05 didn’t support 64-bit integer types!

Weren’t there plans to support i%% in BASIC?

[@Gavin: your link to Zahl below is broken (put here to save another post)]

Sep 17, 2021 1:17pm

Chris Evans (457) 1614 posts

Thanks Stuart, I’m suitably enlightened.

Sep 17, 2021 1:21pm

GavinWraith (26) 1563 posts

While we are talking numbers may I put in a puff for Zahl? I made this as a substitute for Nick Craig-Wood’s BigNumber module. It provides two integer types: small int32 for 32-bit integers compatible with RISC OS internals, and big integer for integers as large as your RAM will tolerate. The variables $0, … ,$9 are initialized to big 0, … ,9. For example:

print ($2^($2^10))  -->
179769313486231590772930519078902473361797697894230657273430081157732675805500
963132708477322407536021120113879871393357658789768814416622492847430639474124
377767893424865485276302219601246094119453082952085005768838150682342462881473
913110540827237163350510684586298239947245938479716304835356329624224137216

There is an example to show how to implement RSA cryptography with it.

Sep 17, 2021 2:01pm

Clive Semmens (2335) 3276 posts

The ARM instruction set (up to v7; I don’t know v8 at all) is particularly well suited to very large integer handling. I wrote a whole set of very large integer arithmetic routines back in the day – before the shift from 26-bit addressing, so they’d need updating to be any use now. Just addition, subtraction, multiplication, division, x^y, int(x^(a/b)) – that by Newton-Raphson – conversion between any two bases (not just binary/hex and decimal…) and a couple of fast methods of finding factors.

I don’t know how Zahl represents integers – my trick was simply to use the first (32-bit) word to specify how many following words were used to represent the integer. That limits you to integers less than (2^32)^(2^32 – 1)… it would be easy enough to use one bit of the “size” word to specify whether to use a second size word (and perhaps a third… etc), but let’s not get too silly…
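
The length-prefixed layout described above can be sketched like this. It is a Python illustration only, not Clive's routines: the least-significant-word-first order and the function names `encode`, `decode` and `add` are assumptions, and only non-negative values are handled.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def encode(value):
    """Non-negative int -> [count, word0, word1, ...], least significant word first."""
    words = []
    while value:
        words.append(value & WORD_MASK)
        value >>= WORD_BITS
    return [len(words)] + words


def decode(block):
    """Inverse of encode: read the count word, then that many 32-bit words."""
    count = block[0]
    value = 0
    for i in range(count):
        value |= block[1 + i] << (WORD_BITS * i)
    return value


def add(a, b):
    """Word-by-word addition with carry propagation between 32-bit words."""
    n = max(a[0], b[0])
    out, carry = [], 0
    for i in range(n):
        wa = a[1 + i] if i < a[0] else 0
        wb = b[1 + i] if i < b[0] else 0
        s = wa + wb + carry
        out.append(s & WORD_MASK)
        carry = s >> WORD_BITS
    if carry:
        out.append(carry)
    return [len(out)] + out


print(encode(2**32))                                     # [2, 0, 1]
print(decode(add(encode(2**64 - 1), encode(1))) == 2**64)  # True
```

The carry handling in `add` is the part that maps naturally onto an add-with-carry instruction, one word per step.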

Sep 17, 2021 3:15pm

Paolo Fabio Zaino (28) 1882 posts

I totally forgot about this thread lol, what the heck has happened here? XD

@ Matthew
Performance obviously varies with the size of each memcpy/memmove/memset call your code makes. For instance, if it copies a single byte, or something like 3 bytes at a time, you get no performance improvement (and btw in these cases I fall back to the standard RISC OS implementation as it’s lighter). But if your code needs to copy/set/move a lot of bytes at once then the performance improvement can be substantial (depending on the number of bytes in a single call, the board, memory bandwidth, the ARM chip etc.).

As soon as I have a good benchmark running on RISC OS I’ll share the numbers I get on all my boards.

@ Rick

yes the original topic is about improving memory (copy, set and move) performance in C on RISC OS.

Sep 17, 2021 3:16pm

Paolo Fabio Zaino (28) 1882 posts

@ Stuart

Gavin’s link to Zahl is here: http://www.wra1th.plus.com/soft.html

Scroll down. And great work Gavin! :)

Sep 17, 2021 3:29pm

Paolo Fabio Zaino (28) 1882 posts

For the new topic about math operations using NEON:

NEON can benefit this type of operation only when a piece of software does a lot of similar computations that can be parallelised. In other words, if you just need to do a single sqrt for a single value, then there will be little to no benefit in using a NEON-based implementation of it.

One reason I am working on another project using NEON for math functions is to improve Genann performance. There, each neuron performs an FP calculation, and (on backfeed neural networks) this can happen multiple times for the same neuron. Given that a single neural network layer may have many neurons, and a neural network is composed of multiple layers of neurons, you can imagine how much computation that adds up to.

There are many applications for NEON in AI: https://www.youtube.com/watch?v=lUmjnCdGtGE

The video above is a presentation of a Tech Talk that should happen on the 21st of September 2021, and it’s specifically for ARM. Some of the techniques that will be explained in that Tech Talk I am implementing on RISC OS and hopefully I’ll have enough time to get to demo-able state at some point.

All of this does very little to help BBC BASIC-type uses and/or retro-coding approaches on RISC OS. So, maybe not of interest here, in which case sorry for the noise.

Sep 17, 2021 4:59pm

GavinWraith (26) 1563 posts

The big integer type in !Zahl is realized by the IMath library (copyright © 2002-2009 Michael J. Fromberger), an ISO C arbitrary precision integer and rational library that will tell you what you want to know about its representation of numbers.

Nick Craig-Wood originally wrote his BigNum (relocatable) module in ARM assembler, offering SWI calls so that any kind of RISC OS application could use it. When he left RISC OS-land I offered to maintain it for him. I came across some bugs. For example the Newton-Raphson algorithm for square root does not work in integer arithmetic when you start with a number one less than a power of two. After a while I found the ARM assembler code too difficult to maintain, and, finding C easier to work with, I produced !Zahl as a partial replacement.
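
For reference, an integer-only Newton-Raphson square root can be written so that inputs of the form 2^k - 1 are handled. This is a generic Python sketch, not the BigNum or !Zahl code, and it does not reproduce the specific bug mentioned above; the name `isqrt_newton` is made up. It relies on starting from a guess that is at least the true root and stopping as soon as the estimate stops decreasing.

```python
def isqrt_newton(n):
    """Return floor(sqrt(n)) for a non-negative integer n using Newton's method."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Initial guess: a power of two that is guaranteed to be >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) // 2
        if y >= x:  # no longer decreasing: x is the floor of the root
            return x
        x = y


print(isqrt_newton(15))         # 3
print(isqrt_newton(16))         # 4
print(isqrt_newton(2**64 - 1))  # 4294967295
```

Stopping on "no longer decreasing" rather than on "two successive estimates are equal" avoids the case where integer division makes the estimates oscillate between two neighbouring values.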

If I may mount my hobby horse at this point, there are indefinitely many ways of representing numbers in a computer, but it seems that most people are fixated on the standard floating point representations and are unaware of alternatives. Floating point numbers suit engineering very well (I am not being rude here), but they do have their own drawbacks. Their principal virtue is to break the bond between the value of a number and the number of bytes required to store its representation. Their principal vice is that the ordinary laws of arithmetic, e.g. that x + (y + z) be equal to (x + y) + z, do not hold. There are systems for which they do, but not ones for which numbers take up a fixed number of bytes to store.
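
The failure of associativity is easy to demonstrate. A minimal Python sketch, using IEEE 754 doubles for the floating point case and the standard library's `fractions.Fraction` as one example of an exact representation whose storage is not fixed-size:

```python
from fractions import Fraction

# Binary floating point: the grouping changes the rounded result.
x, y, z = 0.1, 0.2, 0.3
left = (x + y) + z
right = x + (y + z)
print(left == right)  # False

# Exact rationals: the law holds, at the cost of numerators and
# denominators that can grow without bound.
fx, fy, fz = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
print((fx + fy) + fz == fx + (fy + fz))  # True
```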

