RISC OS Open: Forum: C to BASIC struggle

Nov 2, 2014 10:50pm

Kuemmel (439) 384 posts

@Jan: Here are the IgepV5 (1 GHz) results from your code:

Speed of quicksort/non-recursive quicksort
Filling 4194304 (1M) numbers
Time (32768X quicksort @size=     32)=6  csec/1.000.000
Time (32768X nrecqsort @size=     32)=6  csec/1.000.000
Time (16384X quicksort @size=     64)=6  csec/1.000.000
Time (16384X nrecqsort @size=     64)=7  csec/1.000.000
Time ( 8192X quicksort @size=    128)=10 csec/1.000.000
Time ( 8192X nrecqsort @size=    128)=8  csec/1.000.000
Time ( 4096X quicksort @size=    256)=10 csec/1.000.000
Time ( 4096X nrecqsort @size=    256)=10 csec/1.000.000
Time ( 2048X quicksort @size=    512)=13 csec/1.000.000
Time ( 2048X nrecqsort @size=    512)=12 csec/1.000.000
Time ( 1024X quicksort @size=   1024)=15 csec/1.000.000
Time ( 1024X nrecqsort @size=   1024)=14 csec/1.000.000
Time (  512X quicksort @size=   2048)=16 csec/1.000.000
Time (  512X nrecqsort @size=   2048)=15 csec/1.000.000
Time (  256X quicksort @size=   4096)=17 csec/1.000.000
Time (  256X nrecqsort @size=   4096)=16 csec/1.000.000
Time (  128X quicksort @size=   8192)=17 csec/1.000.000
Time (  128X nrecqsort @size=   8192)=17 csec/1.000.000
Time (   64X quicksort @size=  16384)=17 csec/1.000.000
Time (   64X nrecqsort @size=  16384)=17 csec/1.000.000
Time (   32X quicksort @size=  32768)=19 csec/1.000.000
Time (   32X nrecqsort @size=  32768)=17 csec/1.000.000
Time (   16X quicksort @size=  65536)=19 csec/1.000.000
Time (   16X nrecqsort @size=  65536)=18 csec/1.000.000
Time (    8X quicksort @size= 131072)=20 csec/1.000.000
Time (    8X nrecqsort @size= 131072)=19 csec/1.000.000
Time (    4X quicksort @size= 262144)=20 csec/1.000.000
Time (    4X nrecqsort @size= 262144)=19 csec/1.000.000
Time (    2X quicksort @size= 524288)=20 csec/1.000.000
Time (    2X nrecqsort @size= 524288)=20 csec/1.000.000
Time (    1X quicksort @size=1048576)=20 csec/1.000.000
Time (    1X nrecqsort @size=1048576)=20 csec/1.000.000

No of bits in one go 8/16? (enter 8 or 16, 8=4 passes,16=2 passes) 8
Asm entire array: time for 1.000.000 is 118 csec (size= 8388608 words)
Asm (32768 iterations of size=256): time for 10000000 is 6 csec
Asm (16384 iterations of size=512): time for 10000000 is 6 csec
Asm (8192 iterations of size=1024): time for 10000000 is 5 csec
Asm (4096 iterations of size=2048): time for 10000000 is 5 csec
Asm (2048 iterations of size=4096): time for 10000000 is 5 csec
Asm (1024 iterations of size=8192): time for 10000000 is 5 csec
Asm (512 iterations of size=16384): time for 10000000 is 6 csec
Asm (256 iterations of size=32768): time for 10000000 is 7 csec
Asm (128 iterations of size=65536): time for 10000000 is 7 csec
Asm (64 iterations of size=131072): time for 10000000 is 8 csec
Asm (32 iterations of size=262144): time for 10000000 is 8 csec
Asm (16 iterations of size=524288): time for 10000000 is 10 csec
Asm (8 iterations of size=1048576): time for 10000000 is 37 csec
Asm (4 iterations of size=2097152): time for 10000000 is 61 csec
Asm (2 iterations of size=4194304): time for 10000000 is 87 csec
Asm (1 iterations of size=8388608): time for 10000000 is 118 csec

No of bits in one go 8/16? (enter 8 or 16, 8=4 passes,16=2 passes) 16
Asm entire array: time for 1.000.000 is 82 csec (size= 8388608 words)
Asm (32768 iterations of size=256): time for 10000000 is 205 csec
Asm (16384 iterations of size=512): time for 10000000 is 106 csec
Asm (8192 iterations of size=1024): time for 10000000 is 57 csec
Asm (4096 iterations of size=2048): time for 10000000 is 32 csec
Asm (2048 iterations of size=4096): time for 10000000 is 20 csec
Asm (1024 iterations of size=8192): time for 10000000 is 14 csec
Asm (512 iterations of size=16384): time for 10000000 is 11 csec
Asm (256 iterations of size=32768): time for 10000000 is 10 csec
Asm (128 iterations of size=65536): time for 10000000 is 9 csec
Asm (64 iterations of size=131072): time for 10000000 is 9 csec
Asm (32 iterations of size=262144): time for 10000000 is 9 csec
Asm (16 iterations of size=524288): time for 10000000 is 17 csec
Asm (8 iterations of size=1048576): time for 10000000 is 40 csec
Asm (4 iterations of size=2097152): time for 10000000 is 54 csec
Asm (2 iterations of size=4194304): time for 10000000 is 64 csec
Asm (1 iterations of size=8388608): time for 10000000 is 82 csec

Regarding the 4-pass versus 2-pass choice, it seems that for larger amounts of data (>1000000) the 2-pass solution becomes better.

I agree about the beauty of the algorithm; according to Wikipedia it dates back to tabulating machines from 1887!

I’m intrigued by the idea of finally using the NEON unit for sorting. The NEON Programmer’s Guide describes a NEON-enhanced bitonic sort with merge, used for a 7×7 median filter… but that’s a lot to read up on before getting it done… and it still might be slower, who knows…
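To make the 4-pass/2-pass trade-off concrete, here is a minimal Python sketch of an LSD radix sort with a configurable digit width. It is not the assembler that produced the timings above; the names `radix_sort_lsd` and `passes_and_table_bytes` are made up for illustration, and it assumes unsigned 32-bit keys (signed integers would need extra handling of the sign bit).

```python
def radix_sort_lsd(values, bits_per_pass=8, key_bits=32):
    """LSD radix sort of unsigned integers below 2**key_bits (illustrative sketch)."""
    if key_bits % bits_per_pass != 0:
        raise ValueError("bits_per_pass must divide key_bits")
    buckets = 1 << bits_per_pass
    mask = buckets - 1
    src = list(values)
    for shift in range(0, key_bits, bits_per_pass):
        # Histogram: count how many keys fall into each bucket for this digit.
        counts = [0] * buckets
        for v in src:
            counts[(v >> shift) & mask] += 1
        # Exclusive prefix sum turns the counts into start offsets.
        total = 0
        for b in range(buckets):
            counts[b], total = total, total + counts[b]
        # Stable scatter into a second buffer, then swap buffers.
        dst = [0] * len(src)
        for v in src:
            d = (v >> shift) & mask
            dst[counts[d]] = v
            counts[d] += 1
        src = dst
    return src


def passes_and_table_bytes(bits_per_pass, key_bits=32, counter_bytes=4):
    """Number of passes and histogram size in bytes (assuming 4-byte counters)."""
    return key_bits // bits_per_pass, (1 << bits_per_pass) * counter_bytes
```

With 8 bits per pass the histogram has 256 counters (1 KB at an assumed 4 bytes each) and the data is moved four times; with 16 bits it has 65536 counters (256 KB) and the data is moved only twice. One plausible reading of the figures above is that the large table is a fixed cost per sort, which would hurt small arrays and pay off only once the array is big enough, but that is an interpretation rather than something measured here.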

Nov 2, 2014 11:53pm

Jeffrey Lee (213) 6048 posts

That NEON programmer’s guide looks like a useful doc – thanks for pointing it out! It’s a shame, though, that it’s patchy in some areas: e.g. it doesn’t give examples of how VRECPE/VRECPS and VRSQRTE/VRSQRTS should be used, and although it goes into detail about the Cortex-A8 NEON performance, it has very little information about the other processors.

But on the bright side, it does explain what the polynomial data type is meant to be used for, which I haven’t seen elsewhere at all.

Nov 3, 2014 12:20am

Martin Avison (27) 1494 posts

@Jan: It is your code, so you can write it how you like. And as you say, your aim was to do a very valid investigation into the effect of data sizes on the performance of your sort code.
However, any ‘published’ code would be easier to compare with other sorts if it were obvious and easy to change it so that the comparisons are valid. I never remarked on comments, just that your layout made it very difficult for others to peer review. But no criticism was implied!

I know from bitter experience that testing sorts is not easy. There are many variables to consider, and the ‘best’ will be a compromise. Some sorts can be slow for relatively few duplicates, or for runs in sequence (ascending or descending), or for other ‘non-random’ distributions.
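As a rough illustration of the kinds of input mentioned here, the following Python sketch builds a handful of test distributions and checks a sort routine against them. The function names, the seed and the particular distributions are assumptions chosen for the example; it only checks correctness, and timing would have to be added around the call.

```python
import random


def make_test_inputs(n, seed=1):
    """Build a few input distributions that are known to trouble some sorts."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return {
        "random": [rng.randrange(-2**31, 2**31) for _ in range(n)],
        "few_distinct": [rng.randrange(4) for _ in range(n)],
        "ascending": list(range(n)),
        "descending": list(range(n, 0, -1)),
        "all_equal": [7] * n,
    }


def check_sort(sort_fn, n=1000):
    """Return the names of the inputs that sort_fn gets wrong."""
    failures = []
    for name, data in make_test_inputs(n).items():
        # Pass a copy so that an in-place sort cannot disturb the reference data.
        if sort_fn(list(data)) != sorted(data):
            failures.append(name)
    return failures
```

For example, `check_sort(sorted)` returns an empty list, while a routine that mishandles duplicates or signed values should show up under `few_distinct` or `random`.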

Nov 3, 2014 9:40pm

Kuemmel (439) 384 posts

@Jeffrey: A good example of how to use VRECPE/VRECPS and VRSQRTE/VRSQRTS can be found in Lachlan Tychsen’s math function database => Link

I extracted the lines here:

//fast invsqrt approx
"vmov.f32       d1, d0             \n\t"   //d1 = d0
"vrsqrte.f32    d0, d0             \n\t"   //d0 = ~ 1.0 / sqrt(d0)
"vmul.f32       d2, d0, d1         \n\t"   //d2 = d0 * d1
"vrsqrts.f32    d3, d2, d0         \n\t"   //d3 = (3 - d0 * d2) / 2
"vmul.f32       d0, d0, d3         \n\t"   //d0 = d0 * d3
"vmul.f32       d2, d0, d1         \n\t"   //d2 = d0 * d1
"vrsqrts.f32    d3, d2, d0         \n\t"   //d3 = (3 - d0 * d2) / 2
"vmul.f32       d0, d0, d3         \n\t"   //d0 = d0 * d3
//fast reciprocal approximation
"vrecpe.f32     d1, d0             \n\t"   //d1 = ~ 1 / d0
"vrecps.f32     d2, d1, d0         \n\t"   //d2 = 2.0 - d1 * d0
"vmul.f32       d1, d1, d2         \n\t"   //d1 = d1 * d2
"vrecps.f32     d2, d1, d0         \n\t"   //d2 = 2.0 - d1 * d0
"vmul.f32       d0, d1, d2         \n\t"   //d0 = d1 * d2

You can “stop” the calculation before the last VRECPS or VRSQRTS, but of course you’ll get less accuracy. You could even use only the VRECPE or VRSQRTE estimate, but you’d end up less accurate still.
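The effect of stopping early can be sketched numerically. The Python below models the two step instructions as plain functions, using the formulas from the comments in the listing (`2 - a*b` and `(3 - a*b) / 2`), and applies them to a deliberately rough starting estimate. It uses Python floats (double precision) rather than NEON’s single precision, and the helper names and the starting estimates are assumptions for illustration only, not what the hardware’s VRECPE/VRSQRTE return.

```python
def vrecps(a, b):
    """Model of the VRECPS step: 2 - a*b."""
    return 2.0 - a * b


def vrsqrts(a, b):
    """Model of the VRSQRTS step: (3 - a*b) / 2."""
    return (3.0 - a * b) / 2.0


def refine_reciprocal(x, estimate, steps):
    """Improve an estimate of 1/x with `steps` Newton-Raphson iterations."""
    r = estimate
    for _ in range(steps):
        r = r * vrecps(r, x)        # r = r * (2 - r*x)
    return r


def refine_rsqrt(x, estimate, steps):
    """Improve an estimate of 1/sqrt(x) with `steps` Newton-Raphson iterations."""
    r = estimate
    for _ in range(steps):
        r = r * vrsqrts(x * r, r)   # r = r * (3 - x*r*r) / 2
    return r
```

Starting from an estimate that is off by about 1 part in 256, each step roughly squares the relative error, so one step gives an error of about 1 part in 65536 and a second step takes it beyond what single precision can represent. That is why the listing uses two steps, and why dropping them costs accuracy.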

Nov 3, 2014 9:47pm

Jeffrey Lee (213) 6048 posts

@Jeffrey: A good example how to use VRECPE/VRECPS and VRSQRTE/VRSQRTS can be found in Lachlan Tychsen’s math function database

Yes, I know how to use them. The problem is that I doubt everyone else knows :-) I’ve seen people write general-purpose math code without using the ‘step’ instructions – not good!

Nov 4, 2014 7:01am

Chris Hall (132) 3554 posts

what ‘step’ instructions?

Nov 4, 2014 10:32am

Jeffrey Lee (213) 6048 posts

See what I mean? ;-)

VRECPS and VRSQRTS are the step instructions (S = ‘step’), which are designed to be used after the VRECPE/VRSQRTE ‘estimate’ instructions in order to iteratively improve on the accuracy of the result.

Nov 4, 2014 10:52am

Steve Drain (222) 1620 posts

I was interested to see what these were, too, so I looked at the StrongHelp VFP manual I prepared from Jeffrey’s comprehensive lists of instructions. I knew I had not checked over the NEON bits and I have found that it is not good for the ones here. So, if you were going to look there, please wait for a couple of days until I do a proper job. ;-(

Nov 6, 2014 12:33pm

Martin Avison (27) 1494 posts

@Steve Drain: re program posted 27th Oct: I have been unable to get this to work! Does it only work for certain types of data? If I try to sort 100 random numbers from 0 to 255 it seems to sort the first 24 and leave the rest unsorted! Truly random signed integers seem even worse. And I have just cut’n’pasted your code! Honest!!

Nov 6, 2014 1:08pm

Martin Avison (27) 1494 posts

@Steve: Sorry for casting doubts on your program! I have now realised that the size% parameter needs to be set to 4*(elements+1)-1 for correct operation.

Nov 6, 2014 2:20pm

Steve Drain (222) 1620 posts

Yes, that was naughty of me – size% is the DIM size. I have been doing it that way for so long I forgot that others would not realise it.

The effect is to change the end condition of the FOR loops when byte% is added.

Sorry.
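The arithmetic behind that size% value can be sketched as follows, in Python for convenience. It assumes that `elements` is the highest array index (so there are elements+1 four-byte integers), and that size% is used the way a byte-block DIM uses it in BBC BASIC, where `DIM block% size%` reserves size%+1 bytes. The loop modelled by `word_offsets` is a guess at the general shape of such code, not the routine from the thread.

```python
WORD_BYTES = 4  # one BASIC integer


def dim_size_for(elements):
    """size% for integers with indices 0..elements: 4*(elements+1)-1."""
    return WORD_BYTES * (elements + 1) - 1


def word_offsets(size):
    """Byte offsets visited by a loop shaped like FOR i% = 0 TO size% STEP 4."""
    return list(range(0, size + 1, WORD_BYTES))
```

For 100 numbers (indices 0 to 99), `dim_size_for(99)` gives 399, and a word-stepped loop up to 399 visits offsets 0, 4, ..., 396, i.e. all 100 words. Passing a smaller value for size% would make such a loop stop early and leave the rest of the array untouched.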

Nov 9, 2014 2:36am

jan de boer (472) 78 posts

@Kuemmel: Thank you for taking the trouble to run all these tests. Quicksort is a nice example of O(N·log2 N)! For Radixsort on IGEPv5, however, your code was faster (0.09 csec), so please stick to that code. My main reason for posting was to ask why these timings seem so slow. Cache issues? What to do about it?
(1) There is a variation upon radixsort, called radix quicksort. Do an MSD radix, then quicksort the fragments; I think it’s also well suited for multi-core. Its timings fall between quicksort and radix.
(2) Radix with more passes, e.g. 5 or 6, with fewer bits per pass: actually faster than radix, possibly because the cache lines aren’t thrashed?
(3) Add the preload (PLD) instruction (available from ARMv5 on, iirc; it certainly works on Iyonix). PLD preloads main memory into the secondary cache before the processor needs it. Unfortunately I also tried it on a 4-pass (i.e. 8 bits at a time) radixsort, but at >=8 bits there is no advantage: timings begin to rise at #bits=6. Benefits are not as dramatic as I had hoped for: 10-30% faster.
Timings on Iyonix, respectively: 103, 221 (6 passes, no PLD), 198 (6 passes, PLD).
Ref: Radixsort and next-of-kin: Google for ‘18RadixSort’. A set of lecture slides.
Ref: (old) ArmArm containing PLD instruction: https://www.scss.tcd.ie/~waldroj/3d1/arm_arm.pdf page 240
@Martin Avison: Point taken. I’ll try to keep in mind that other people must read it and not get a headache. I don’t have experience with sorting; this quicksort algorithm I found somewhere and it works, and the surrounding code I wrote in a hurry, so it was rather unsightly. Sorry.
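Idea (1), radix quicksort, can be sketched in a few lines of Python: one MSD radix pass splits the keys into buckets by their top bits, then each bucket is quicksorted on its own. This is only an illustration of the general technique under stated assumptions (unsigned 32-bit keys, an 8-bit top digit, a simple out-of-place quicksort); it is not the code that was timed, and the function names are made up.

```python
def quicksort(a):
    """Simple out-of-place quicksort (middle pivot, three-way split)."""
    if len(a) <= 1:
        return list(a)
    pivot = a[len(a) // 2]
    less = [v for v in a if v < pivot]
    equal = [v for v in a if v == pivot]
    greater = [v for v in a if v > pivot]
    return quicksort(less) + equal + quicksort(greater)


def radix_quicksort(values, top_bits=8, key_bits=32):
    """One MSD radix pass on the top bits, then quicksort each fragment."""
    shift = key_bits - top_bits
    buckets = [[] for _ in range(1 << top_bits)]
    for v in values:
        buckets[v >> shift].append(v)   # keys must be in 0 .. 2**key_bits - 1
    out = []
    for bucket in buckets:
        # Buckets are already in order relative to each other, so sorting
        # each one independently and concatenating gives a sorted result.
        out.extend(quicksort(bucket))
    return out
```

Because the buckets do not depend on each other after the first pass, each could in principle be sorted on a different core, which fits the multi-core remark. Each fragment is also much smaller than the whole array, so it is more likely to fit in cache.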

Nov 11, 2014 12:25pm

Martin Avison (27) 1494 posts

If anyone is still interested in testing and comparing various sort routines, I have created SortTest, a test bed for integer sort routines. It aims to make it easy to create consistent, comparable results, using plug-in sort programs that just do a sort. Test runs of the sorts are controlled by a simple script file. This should enable effort to be concentrated on the sort programs themselves!

Some of the recent postings on this thread have been converted to plug-in sorts within SortTest. For a few further details and download please see my website. Comments always welcome!

Mar 5, 2015 10:16pm

Kuemmel (439) 384 posts

I finally updated my !radix_sort app code with the latest input from Jan. You can find the latest version here.

It adds Jan’s “Radix,MSB first,then LSBs” and “Radix,MSB then LSB’s + Cache” routines, which give a further major speed-up for sorting integers on the Pandaboard (and, I guess, on other hardware too). It also includes his code for experimenting with the cache optimisations.

