C to BASIC struggle
David Feugey (2125) 2709 posts |
Any speed-up with ABC? |
Steve Drain (222) 1620 posts |
It looks straightforward enough, so it might. In BASIC V with Basalt it is really easy: SORT array%(). But that is just a disguise for the ArmSort *command. ;-) |
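For readers unfamiliar with it, a minimal sketch of the usage Steve describes, assuming the Basalt module is loaded; the array size and fill are made up for the example:

DIM data%(999)
FOR i% = 0 TO 999 : data%(i%) = RND : NEXT   : REM fill with signed 32-bit random integers
SORT data%()                                 : REM Basalt keyword; a wrapper around the ArmSort module
PRINT data%(0), data%(999)                   : REM first and last elements after the sort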
Martin Avison (27) 1494 posts |
@Kuemmel: I ran your latest version … output was:
Sorting of 1000000 32-bit integers
Wait while creating random integers (same for all)…
Wait while copying same random integers to sorting array…
Wait while copying same random integers to sorting array… |
Kuemmel (439) 384 posts |
…it makes no sense to me at all; my Panda doesn’t report any invalid results. Is that only on the Iyonix?! Is there some problem with using the double load LDRD? I remember reading somewhere that an LDRD needs to be aligned to an 8-byte boundary, but why would the Cortex-A9 cope with misalignment (I didn’t check or ensure any alignment) and an Iyonix CPU not? Super strange… or perhaps the older CPU can’t cope and the later generation can… |
Kuemmel (439) 384 posts |
@Martin: I made a new test version using only LDR/LDM and no LDRD. You can find it here …hopefully that explains the problem… @Steve: Great stuff regarding your BASIC optimisation; I think I had already forgotten all about those tricks. I used your version and crunched it down, and it now runs in 4.55 s instead of my initial 12.44 s :-) I will put it in the final version, but I’ll also leave a “readable” version somewhere. |
Rick Murray (539) 13850 posts |
The ARM ARM (2005, ARMv6 release) says: “Prior to ARMv6, doubleword (LDRD/STRD) accesses to memory, where the address is not doubleword-aligned, are UNPREDICTABLE.” The XScale is ARMv5, and some older cores don’t properly support LDRD at all. So, the XScale: is it an E variant (that would be good), or an ARMv5TExP (that would be bad)? |
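For illustration, one way to guarantee the doubleword alignment quoted above is to over-allocate the buffer and round the pointer up; a hedged sketch with made-up names:

size% = 1000000 * 4
DIM raw% size% + 7
aligned% = (raw% + 7) AND NOT 7    : REM round up to the next 8-byte boundary
REM the assembler can then step through the data with, for example,
REM   LDRD R0, [R4], #8            where R4 starts at aligned%, so every pair load is 8-byte aligned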
Steve Drain (222) 1620 posts |
@Kuemmel |
Steve Drain (222) 1620 posts |
This uses a pointer to an array block rather than a BASIC array.
|
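For readers following along, a minimal sketch of that idea, using a DIMmed block indexed with the ! operator instead of a BASIC array (names are illustrative):

n% = 1000000
DIM block% n% * 4 - 1
FOR i% = 0 TO n% - 1
  block%!(i% * 4) = RND            : REM word store at byte offset i%*4
NEXT
REM the sort routine is then handed block% (and n%) rather than a BASIC array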
Steve Revill (20) 1361 posts |
Any reason for doing:
rather than just:
That’ll be needlessly slowing your routine down. Also, a quick look at that code and the
line could just be replaced with:
to eliminate the multiply, if you also change another line to:
(All untested, I just looked at the code for a minute or so.) |
Steve Drain (222) 1620 posts |
Using Thanks for spotting the +=4 improvement; it is more satisfying. On the other hand, that loop is only 256, whereas the other two are aimed at 1000000, so the gain is likely to be undetectable. ;-) |
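To make the += 4 change concrete (the actual lines are not quoted above, so this is a hedged before/after sketch with illustrative names; both loops compute the same total):

DIM count% 256 * 4 - 1
FOR i% = 0 TO 255 : count%!(i% * 4) = i% : NEXT
REM Before: a multiply in every iteration
total% = 0
FOR i% = 0 TO 255
  total% += count%!(i% * 4)
NEXT
REM After: walk a byte pointer in steps of 4, with no multiply
total% = 0 : ptr% = count%
FOR i% = 0 TO 255
  total% += !ptr%
  ptr% += 4
NEXT
PRINT total%                       : REM same result from both loops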
jan de boer (472) 78 posts |
Made an assembler version of PROCradix_sort by Steve Drain, but I cannot get it faster than 2.72 secs (Iyonix); the original ARMv3 version is faster! Only 60M instructions are needed for 1M integers, so at the Iyonix’s 600 MHz one would expect something nearer 0.1 sec. |
Kuemmel (439) 384 posts |
@Jan: Interesting! So on the Iyonix the quicksort is faster? Did you test your quicksort code on a Panda or BeagleBoard? Could you post or send a link to that quicksort code? I suspect, as stated in the original C test code for the radix sort (see the link in my first post), that quicksort should be slower on machines with high-speed memory, as it uses many more compare operations than memory operations, whereas the radix sort only does memory transfers. EDIT: Found an ARM quicksort also on Link . That code uses “stmfd sp!, {r4, r6, lr}” and “ldmlefd sp!, {r4, r6, pc}” … is that still valid regarding the 26/32-bit issues? |
Martin Avison (27) 1494 posts |
@Kuemmel: I have now run your latest !radix_sort_no_ldrd program successfully on my Iyonix, after modifying it to include ArmSort. The times are:
So, assembler radix is much faster than HeapSort, and slightly faster than ArmSort. However, both HeapSort and ArmSort are generalised sort routines, whereas the radix sorts are integer only. ArmSort sorts Basic arrays, and will sort several together forming a complex key. I would consider adding a radix sort to ArmSort, but I do not think the size and complexities of the extra storage required are worth it for the possible performance improvement. |
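For context, here is a hedged BASIC sketch of the byte-wise LSD radix sort being discussed, showing the extra storage Martin mentions: a 256-entry count table plus a scratch buffer the same size as the data. It treats values as unsigned (fine for the non-negative test data in this thread), the procedure name is made up, and it differs in detail from the routines actually posted:

DEF PROCradix_sketch(src%, n%)
LOCAL pass%, i%, b%, c%, sum%, o%, count%, dst%
DIM count% 256 * 4 - 1                 : REM 256 bucket counters (allocated here to keep the sketch self-contained)
DIM dst% n% * 4 - 1                    : REM scratch buffer, same size as the data
FOR pass% = 0 TO 3                     : REM one pass per byte, least significant first
  FOR i% = 0 TO 255 : count%!(i% * 4) = 0 : NEXT
  FOR i% = 0 TO n% - 1                 : REM count occurrences of each byte value
    b% = (src%!(i% * 4) >>> (pass% * 8)) AND &FF
    count%!(b% * 4) = count%!(b% * 4) + 1
  NEXT
  sum% = 0                             : REM prefix sums give each bucket its start index
  FOR i% = 0 TO 255
    c% = count%!(i% * 4) : count%!(i% * 4) = sum% : sum% = sum% + c%
  NEXT
  FOR i% = 0 TO n% - 1                 : REM scatter: loads and stores only, no compares
    b% = (src%!(i% * 4) >>> (pass% * 8)) AND &FF
    o% = count%!(b% * 4) : count%!(b% * 4) = o% + 1
    dst%!(o% * 4) = src%!(i% * 4)
  NEXT
  SWAP src%, dst%                      : REM ping-pong; after the four passes the sorted data is back in the caller's block
NEXT
ENDPROC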
Jeffrey Lee (213) 6048 posts |
Yes, that’s perfectly fine 26/32-bit neutral code. |
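For anyone wondering why, here is a small runnable sketch of the entry/exit pattern quoted above, wrapped around a deliberately trivial routine (the routine itself is made up; the comments are the point):

DIM code% 64
FOR opt% = 0 TO 2 STEP 2
  P% = code%
  [OPT opt%
  .double_if_positive            ; trivial routine, just to show the quoted save/restore pattern
  STMFD   R13!, {R4, R6, R14}    ; same save as the quoted code (R13 = sp, R14 = lr)
  MOV     R4, R0
  CMP     R4, #0                 ; set the flags cleanly before the conditional return
  ADDGT   R0, R4, R4             ; double the argument if it was positive
  LDMLEFD R13!, {R4, R6, PC}     ; conditional return: reloads PC but not the PSR,
                                 ; so it behaves the same in 26-bit and 32-bit modes
  LDMFD   R13!, {R4, R6, PC}     ; unconditional return, same neutral form
  ]
NEXT
A% = 21
PRINT USR(double_if_positive)    : REM prints 42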
jan de boer (472) 78 posts |
FWIW, I have uploaded my ARM implementations to http://home-1.tiscali.nl/~jandboer; it is the last entry in the list of goodies. |
Kuemmel (439) 384 posts |
Thanks Jan, I’ve got to look into your code; it seems you found a much better implementation of that radix sort, at least on the Pandaboard (1200 MHz) it’s much better. Timings are:
I’ll try to incorporate your radix and quicksort in my code for direct comparison when I get time. Did you also use RND to create the random numbers?
At least it shows what I suspected: for modern CPUs the radix code seems faster than the quicksort, I guess due to memory bandwidth/prefetching. |
jan de boer (472) 78 posts |
When I fill the array with RND*(2^31) everything slows down (again): 4.02 sec for 16-bit on the Iyonix. When I use a 32-bit CRC routine (eor=&AF, seed=TIME, output looks random) it goes a lot faster. Puzzled. Theoretically 60M instructions on the Iyonix should only take 0.1 sec. The IGEPv5 seems to perform closer to this; maybe timings on the other machines are unreliable because of ? |
Martin Avison (27) 1494 posts |
Can I suggest that if results of timing sort algorithms are posted, then we all try to: I know from experience that it is too easy to jump to conclusions about sorts based on incorrect information! |
Martin Avison (27) 1494 posts |
@Jan: Just downloaded your radixx.zip, and found the assembler sorts ran in about 0.67 sec … but on closer investigation I suspect the array being sorted is all zeros! That would explain why it goes ‘a lot faster’ than when RND is used. |
Kuemmel (439) 384 posts |
@Martin: I put your remarks into my code, so it now uses the suggested seed and then creates 1000000 random integers with RND*2^31. I also do a &FF data alignment. I also test your ArmSort along with the other code (though I don’t know whether one can align BASIC arrays; I think alignment doesn’t matter too much here anyway). Of course it’s not quite fair to compare the specialized sorts to your code. @Jan: I also included your quicksort code. For your radix code, could you try the random number generation as proposed? Are there some errors along the lines of what Martin said? The latest code can be found here. It still needs some clean-up but should run fine. I got rid of the ARMv3 version as it didn’t show any speed-up, and instead tuned the ARMv2 version to 4-times loop unrolling, which helps a bit (and of course no more LDRDs). The latest result compilation is here: The good old quicksort is still quite fast, but as said before, on the IGEPv5 the radix sort is a clear winner. Let’s see if Jan’s code can squeeze out more :-)
|
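As a hedged illustration of the kind of 4-way unrolling mentioned above (a simple word-summing loop with made-up names, not Kuemmel’s actual sort code):

DIM code% 256, data% 1023
FOR i% = 0 TO 255 : data%!(i% * 4) = i% : NEXT
FOR opt% = 0 TO 2 STEP 2
  P% = code%
  [OPT opt%
  .sum_words                    ; R0 = pointer, R1 = word count (a multiple of 4); sum returned in R0
  STMFD  R13!, {R4-R6, R14}
  MOV    R2, #0
  .sum_loop
  LDMIA  R0!, {R3-R6}           ; fetch four words per iteration instead of one
  ADD    R2, R2, R3
  ADD    R2, R2, R4
  ADD    R2, R2, R5
  ADD    R2, R2, R6
  SUBS   R1, R1, #4             ; four elements consumed per pass
  BGT    sum_loop
  MOV    R0, R2
  LDMFD  R13!, {R4-R6, PC}      ; 26/32-bit neutral return
  ]
NEXT
A% = data% : B% = 256
PRINT USR(sum_words)            : REM prints 32640 (0+1+...+255)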
Martin Avison (27) 1494 posts |
@Kuemmel: Just ran your latest code on my Iyonix …
Quicksort impressive … but I have not yet checked the code. Why are you using |
Kuemmel (439) 384 posts |
…actually no specific reason at all; I just didn’t remember that kind of RND syntax, I only thought of the RND function… maybe I have been working with floats for too long ;-) I can use |
jan de boer (472) 78 posts |
Martin Avison: you are right. It proves not to be so smart to use TIME both as a timer and as a seed. I bow my head in shame. Asm (32768 iterations of size=256): 16, 7, 13, 6209, 563, 2244. For both the Iyonix and the BBxM the timings go up as soon as the requirements of the dataset (2 × 4 bytes × dataset size) exceed respectively the primary cache (32 KB) or the secondary cache (BBxM, 256 KB); for 1M integers that is 8 MB, far beyond either. For the 16-bit radix sort, the requirements of the count/bucket array (2^16 four-byte counters alone come to 256 KB) always exceed what is offered on these three machines, so those timings turn out to be extremely slow. To compare with quicksort: |
Martin Avison (27) 1494 posts |
@Jan: I have had a quick look at your new code … and I strongly suspect that the “random integers” you generate in the quicksort filling routine may be random, but they all seem to be less than 256! Again, I suggest that you use BASIC’s ABS to generate unsigned numbers (or just RND for signed). They can be generated once, then copied, to save time; that also makes the results much more repeatable by others. It would also help understanding of your code if it were not so compressed! One assembler instruction per line is MUCH easier to read, and will not make any appreciable time difference, as the assembly is only done once (ok, twice). You hint that the tests were run in a TaskWindow: this is another cause of time variations, and single-tasking would give more reliable results. |
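A hedged sketch of the procedure Martin suggests: fill one reference block with RND once, then copy it into the working block before each sort so every routine sees identical data (block names and the simple copy loop are illustrative):

n% = 1000000
DIM ref% n% * 4 - 1 : DIM work% n% * 4 - 1
FOR i% = 0 TO n% - 1
  ref%!(i% * 4) = RND              : REM signed 32-bit values; use ABS(RND) for unsigned
NEXT
REM before timing each sort, refresh the working copy:
FOR i% = 0 TO n% - 1
  work%!(i% * 4) = ref%!(i% * 4)
NEXT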
jan de boer (472) 78 posts |
@Martin: As the thread title says, it’s a struggle. The MOVS seed,seed,LSR#1 should have read MOVS seed,seed,LSL#1. Timings for quicksort don’t change (much) as far as I can see. |
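For completeness, a hedged sketch of the corrected generator as described above (shift left, EOR with &AF when the top bit drops into the carry, seeded from TIME). The register use and calling wrapper are illustrative, and no claim is made about the period of the sequence:

DIM code% 32
FOR opt% = 0 TO 2 STEP 2
  P% = code%
  [OPT opt%
  .next_random                   ; R0 holds the seed on entry, the next value on exit
  MOVS  R0, R0, LSL #1           ; shift left; the old top bit lands in the carry flag
  EORCS R0, R0, #&AF             ; feed the tap pattern back in when a 1 fell out
  MOV   PC, R14
  ]
NEXT
A% = TIME OR 1                   : REM non-zero seed; an all-zero seed would stay zero
A% = USR(next_random)            : REM each call returns the next value in the sequence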