C to BASIC struggle
David Feugey (2125) 2709 posts |
Any speed-up with ABC? |
Steve Drain (222) 1620 posts |
It looks straightforward enough, so it might. In BASIC V with Basalt it is really easy: SORT array%(). But that is just a disguise for the ArmSort *command. ;-) |
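For readers unfamiliar with it, a minimal sketch of the usage Steve describes, assuming the Basalt module is loaded; the array size and fill are made up for the example:

DIM data%(999)
FOR i% = 0 TO 999 : data%(i%) = RND : NEXT   : REM fill with signed 32-bit random integers
SORT data%()                                 : REM Basalt keyword; a wrapper around the ArmSort module
PRINT data%(0), data%(999)                   : REM first and last elements after the sort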
Martin Avison (27) 1494 posts |
@Kuemmel: I ran your latest version … output was:
Sorting of 1000000 32-bit integers
Wait while creating random integers (same for all)…
Wait while copying same random integers to sorting array…
Wait while copying same random integers to sorting array… |
Kuemmel (439) 384 posts |
…it makes no sense to me at all; my Panda doesn’t report any invalid results. Is that only on the Iyonix?! Is there some problem with using the double load LDRD? I remember reading somewhere that an LDRD needs to be aligned to an 8-byte boundary, but why would the Cortex-A9 cope with misalignment (I didn’t check or ensure any alignment) and an Iyonix CPU not? Super strange… or perhaps the older CPU can’t cope and the later generation can… |
Kuemmel (439) 384 posts |
@Martin: I made a new test version using only LDR/LDM and no LDRD. You can find it here …hopefully that explains the problem… @Steve: Great stuff regarding your BASIC optimisation; I think I had already forgotten all about those tricks. I used your version and crunched it down, and it now runs in 4.55 s instead of my initial 12.44 s :-) I will put it in the final version, but I’ll also leave a “readable” version somewhere. |
Rick Murray (539) 13850 posts |
The ARM ARM (2005, ARMv6 release) says: “Prior to ARMv6, doubleword (LDRD/STRD) accesses to memory, where the address is not doubleword-aligned, are UNPREDICTABLE.” The XScale is ARMv5, and some older cores don’t properly support LDRD at all. So, the XScale: is it an E variant (that would be good), or an ARMv5TExP (that would be bad)? |
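For illustration, one way to guarantee the doubleword alignment quoted above is to over-allocate the buffer and round the pointer up; a hedged sketch with made-up names:

size% = 1000000 * 4
DIM raw% size% + 7
aligned% = (raw% + 7) AND NOT 7    : REM round up to the next 8-byte boundary
REM the assembler can then step through the data with, for example,
REM   LDRD R0, [R4], #8            where R4 starts at aligned%, so every pair load is 8-byte aligned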
Steve Drain (222) 1620 posts |
@Kuemmel |
Steve Drain (222) 1620 posts |
This uses a pointer to an array block rather than a BASIC array.
|
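For readers following along, a minimal sketch of that idea, using a DIMmed block indexed with the ! operator instead of a BASIC array (names are illustrative):

n% = 1000000
DIM block% n% * 4 - 1
FOR i% = 0 TO n% - 1
  block%!(i% * 4) = RND            : REM word store at byte offset i%*4
NEXT
REM the sort routine is then handed block% (and n%) rather than a BASIC array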
Steve Revill (20) 1361 posts |
Any reason for doing:
rather than just:
That’ll be needlessly slowing your routine down. Also, a quick look at that code and the
line could just be replaced with:
to eliminate the multiply, if you also change another line to:
(All untested, I just looked at the code for a minute or so.) |
Steve Drain (222) 1620 posts |
Using Thanks for spotting the +=4 improvement; it is more satisfying. On the other hand, that loop is only 256, whereas the other two are aimed at 1000000, so the gain is likely to be undetectable. ;-) |
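To make the += 4 change concrete (the actual lines are not quoted above, so this is a hedged before/after sketch with illustrative names; both loops compute the same total):

DIM count% 256 * 4 - 1
FOR i% = 0 TO 255 : count%!(i% * 4) = i% : NEXT
REM Before: a multiply in every iteration
total% = 0
FOR i% = 0 TO 255
  total% += count%!(i% * 4)
NEXT
REM After: walk a byte pointer in steps of 4, with no multiply
total% = 0 : ptr% = count%
FOR i% = 0 TO 255
  total% += !ptr%
  ptr% += 4
NEXT
PRINT total%                       : REM same result from both loops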
jan de boer (472) 78 posts |
Made an assembler version of PROCradix_sort by Steve Drain, but I cannot get it faster than 2.72 secs (Iyonix); the original ARMv3 version is faster! Only 60M instructions are needed for 1M integers, so at the Iyonix’s 600 MHz one would expect something nearer 0.1 sec. |
Kuemmel (439) 384 posts |
@Jan: Interesting! So on the Iyonix the quicksort is faster? Did you test your quicksort code on a Panda or BeagleBoard? Could you post or send a link to that quicksort code? I suspect, as stated in the original C test code for the radix sort (see the link in my first post), that quicksort should be slower on machines with high-speed memory, as it uses many more compare operations than memory operations, whereas the radix sort only does memory transfers. EDIT: Found an ARM quicksort also on Link . That code uses “stmfd sp!, {r4, r6, lr}” and “ldmlefd sp!, {r4, r6, pc}” … is that still valid regarding the 26/32-bit issues? |
Martin Avison (27) 1494 posts |
@Kuemmel: I have now run your latest !radix_sort_no_ldrd program successfully on my Iyonix, after modifying it to include ArmSort. The times are:
So, assembler radix is much faster than HeapSort, and slightly faster than ArmSort. However, both HeapSort and ArmSort are generalised sort routines, whereas the radix sorts are integer only. ArmSort sorts Basic arrays, and will sort several together forming a complex key. I would consider adding a radix sort to ArmSort, but I do not think the size and complexities of the extra storage required are worth it for the possible performance improvement. |
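For context, here is a hedged BASIC sketch of the byte-wise LSD radix sort being discussed, showing the extra storage Martin mentions: a 256-entry count table plus a scratch buffer the same size as the data. It treats values as unsigned (fine for the non-negative test data in this thread), the procedure name is made up, and it differs in detail from the routines actually posted:

DEF PROCradix_sketch(src%, n%)
LOCAL pass%, i%, b%, c%, sum%, o%, count%, dst%
DIM count% 256 * 4 - 1                 : REM 256 bucket counters (allocated here to keep the sketch self-contained)
DIM dst% n% * 4 - 1                    : REM scratch buffer, same size as the data
FOR pass% = 0 TO 3                     : REM one pass per byte, least significant first
  FOR i% = 0 TO 255 : count%!(i% * 4) = 0 : NEXT
  FOR i% = 0 TO n% - 1                 : REM count occurrences of each byte value
    b% = (src%!(i% * 4) >>> (pass% * 8)) AND &FF
    count%!(b% * 4) = count%!(b% * 4) + 1
  NEXT
  sum% = 0                             : REM prefix sums give each bucket its start index
  FOR i% = 0 TO 255
    c% = count%!(i% * 4) : count%!(i% * 4) = sum% : sum% = sum% + c%
  NEXT
  FOR i% = 0 TO n% - 1                 : REM scatter: loads and stores only, no compares
    b% = (src%!(i% * 4) >>> (pass% * 8)) AND &FF
    o% = count%!(b% * 4) : count%!(b% * 4) = o% + 1
    dst%!(o% * 4) = src%!(i% * 4)
  NEXT
  SWAP src%, dst%                      : REM ping-pong; after the four passes the sorted data is back in the caller's block
NEXT
ENDPROC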
Jeffrey Lee (213) 6048 posts |
Yes, that’s perfectly fine 26/32-bit neutral code. |
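For anyone wondering why, here is a small runnable sketch of the entry/exit pattern quoted above, wrapped around a deliberately trivial routine (the routine itself is made up; the comments are the point):

DIM code% 64
FOR opt% = 0 TO 2 STEP 2
  P% = code%
  [OPT opt%
  .double_if_positive            ; trivial routine, just to show the quoted save/restore pattern
  STMFD   R13!, {R4, R6, R14}    ; same save as the quoted code (R13 = sp, R14 = lr)
  MOV     R4, R0
  CMP     R4, #0                 ; set the flags cleanly before the conditional return
  ADDGT   R0, R4, R4             ; double the argument if it was positive
  LDMLEFD R13!, {R4, R6, PC}     ; conditional return: reloads PC but not the PSR,
                                 ; so it behaves the same in 26-bit and 32-bit modes
  LDMFD   R13!, {R4, R6, PC}     ; unconditional return, same neutral form
  ]
NEXT
A% = 21
PRINT USR(double_if_positive)    : REM prints 42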
jan de boer (472) 78 posts |
FWIW, I have uploaded my ARM implementations to http://home-1.tiscali.nl/~jandboer; it is the last entry in the list of goodies. |
Kuemmel (439) 384 posts |
Thanks Jan, I’ve got to look into your code; it seems you found a much better implementation of that radix sort, at least on the Pandaboard (1200 MHz) it’s much better. Timings are:
I’ll try to incorporate your radix and quicksort in my code for direct comparison when I get time. Did you also use RND to create the random numbers?
At least it shows what I suspected: for modern CPUs the radix code seems faster than the quicksort, I guess due to memory bandwidth/prefetching. |
jan de boer (472) 78 posts |
When I fill the array with RND*(2^31) everything slows down (again): 4.02 sec for 16-bit on the Iyonix. When I use a 32-bit CRC routine (eor=&AF, seed=TIME, output looks random) it goes a lot faster. Puzzled. Theoretically 60M instructions on the Iyonix should only take 0.1 sec. The IGEPv5 seems to perform closer to this; maybe timings on the other machines are unreliable because of ? |
Martin Avison (27) 1494 posts |
Can I suggest that if results of timing sort algorithms are posted, then we all try to: I know from experience that it is too easy to jump to conclusions about sorts based on incorrect information! |
Martin Avison (27) 1494 posts |
@Jan: Just downloaded your radixx.zip, and found the assembler sorts ran in about 0.67 sec … but on closer investigation I suspect the array being sorted is all zeros! That would explain why it goes ‘a lot faster’ than when RND is used. |
Kuemmel (439) 384 posts |
@Martin: I put your remarks into my code, so it now uses the suggested seed and then creates 1000000 random integers with RND*2^31. I also do a &FF data alignment. I also test your ArmSort along with the other code (though I don’t know whether one can align BASIC arrays; I think alignment doesn’t matter too much here anyway). Of course it’s not quite fair to compare the specialized sorts to your code. @Jan: I also included your quicksort code. For your radix code, could you try the random number generation as proposed? Are there some errors along the lines of what Martin said? The latest code can be found here. It still needs some clean-up but should run fine. I got rid of the ARMv3 version as it didn’t show any speed-up, and instead tuned the ARMv2 version to 4-times loop unrolling, which helps a bit (and of course no more LDRDs). The latest result compilation is here: The good old quicksort is still quite fast, but as said before, on the IGEPv5 the radix sort is a clear winner. Let’s see if Jan’s code can squeeze out more :-)
|
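As a hedged illustration of the kind of 4-way unrolling mentioned above (a simple word-summing loop with made-up names, not Kuemmel’s actual sort code):

DIM code% 256, data% 1023
FOR i% = 0 TO 255 : data%!(i% * 4) = i% : NEXT
FOR opt% = 0 TO 2 STEP 2
  P% = code%
  [OPT opt%
  .sum_words                    ; R0 = pointer, R1 = word count (a multiple of 4); sum returned in R0
  STMFD  R13!, {R4-R6, R14}
  MOV    R2, #0
  .sum_loop
  LDMIA  R0!, {R3-R6}           ; fetch four words per iteration instead of one
  ADD    R2, R2, R3
  ADD    R2, R2, R4
  ADD    R2, R2, R5
  ADD    R2, R2, R6
  SUBS   R1, R1, #4             ; four elements consumed per pass
  BGT    sum_loop
  MOV    R0, R2
  LDMFD  R13!, {R4-R6, PC}      ; 26/32-bit neutral return
  ]
NEXT
A% = data% : B% = 256
PRINT USR(sum_words)            : REM prints 32640 (0+1+...+255)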
Martin Avison (27) 1494 posts |
@Kuemmel: Just ran your latest code on my Iyonix …
Quicksort impressive … but I have not yet checked the code. Why are you using |
Kuemmel (439) 384 posts |
…actually no specific reason at all; I just didn’t remember that kind of RND syntax, I only thought of the RND function… maybe I have been working with floats for too long ;-) I can use |
jan de boer (472) 78 posts |
Martin Avison: you are right. It proves not to be so smart to use TIME both as a timer and as a seed. I bow my head in shame. Asm (32768 iterations of size=256): 16, 7, 13, 6209, 563, 2244. For both the Iyonix and the BBxM the timings go up as soon as the requirements of the dataset (2 × 4 bytes × dataset size) exceed respectively the primary cache (32 KB) or the secondary cache (BBxM, 256 KB); for 1M integers that is 8 MB, far beyond either. For the 16-bit radix sort, the requirements of the count/bucket array (2^16 four-byte counters alone come to 256 KB) always exceed what is offered on these three machines, so those timings turn out to be extremely slow. To compare with quicksort: |
Martin Avison (27) 1494 posts |
@Jan: I have had a quick look at your new code … and I strongly suspect that the “random integers” you generate in the quicksort filling routine may be random, but they all seem to be less than 256! Again, I suggest that you use BASIC’s ABS to generate unsigned numbers (or just RND for signed). They can be generated once, then copied, to save time; that also makes the results much more repeatable by others. It would also help understanding of your code if it were not so compressed! One assembler instruction per line is MUCH easier to read, and will not make any appreciable time difference, as the assembly is only done once (ok, twice). You hint that the tests were run in a TaskWindow: this is another cause of time variations, and single-tasking would give more reliable results. |
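A hedged sketch of the procedure Martin suggests: fill one reference block with RND once, then copy it into the working block before each sort so every routine sees identical data (block names and the simple copy loop are illustrative):

n% = 1000000
DIM ref% n% * 4 - 1 : DIM work% n% * 4 - 1
FOR i% = 0 TO n% - 1
  ref%!(i% * 4) = RND              : REM signed 32-bit values; use ABS(RND) for unsigned
NEXT
REM before timing each sort, refresh the working copy:
FOR i% = 0 TO n% - 1
  work%!(i% * 4) = ref%!(i% * 4)
NEXT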
jan de boer (472) 78 posts |
@Martin: As the thread title says, it’s a struggle. The MOVS seed,seed,LSR#1 should have read MOVS seed,seed,LSL#1. Timings for quicksort don’t change (much) as far as I can see. |
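For completeness, a hedged sketch of the corrected generator as described above (shift left, EOR with &AF when the top bit drops into the carry, seeded from TIME). The register use and calling wrapper are illustrative, and no claim is made about the period of the sequence:

DIM code% 32
FOR opt% = 0 TO 2 STEP 2
  P% = code%
  [OPT opt%
  .next_random                   ; R0 holds the seed on entry, the next value on exit
  MOVS  R0, R0, LSL #1           ; shift left; the old top bit lands in the carry flag
  EORCS R0, R0, #&AF             ; feed the tap pattern back in when a 1 fell out
  MOV   PC, R14
  ]
NEXT
A% = TIME OR 1                   : REM non-zero seed; an all-zero seed would stay zero
A% = USR(next_random)            : REM each call returns the next value in the sequence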