C to BASIC struggle
Kuemmel (439) 384 posts |
I recently tried to port some nice radix integer-sort code (from Link), written in C, to BBC BASIC. Here’s the C code:
…and here’s my try:
…but it doesn’t work… kind of stuck now in that pointer or byte/word chaos… maybe somebody’s got a clue. |
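(For reference, a working BBC BASIC version of the usual LSD radix sort might look like the sketch below. This is an illustrative reconstruction, not the thread’s actual listing; it treats keys as unsigned words, so signed keys would need the top byte biased by &80.)

  REM LSD radix sort on four 8-bit digits; n% random words at src%
  n% = 1000
  DIM src% n%*4, dst% n%*4, cnt%(255)
  FOR i% = 0 TO n%-1 : src%!(i%*4) = RND : NEXT
  FOR pass% = 0 TO 3
    sh% = pass% * 8
    cnt%() = 0
    REM count how often each byte value occurs at this digit position
    FOR i% = 0 TO n%-1 : cnt%((src%!(i%*4) >>> sh%) AND &FF) += 1 : NEXT
    REM prefix-sum: turn the counts into starting offsets
    o% = 0
    FOR i% = 0 TO 255 : t% = cnt%(i%) : cnt%(i%) = o% : o% += t% : NEXT
    REM stable scatter into the other buffer, then swap the buffers
    FOR i% = 0 TO n%-1
      v% = src%!(i%*4)
      b% = (v% >>> sh%) AND &FF
      dst%!(cnt%(b%)*4) = v%
      cnt%(b%) += 1
    NEXT
    SWAP src%, dst%
  NEXT
  REM after an even number of passes the sorted data is back at src%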
Chris Hall (132) 3554 posts |
For sorting arrays and strings in BASIC, I think you are best off using ArmSort. It is excellent. |
Holger Palmroth (487) 115 posts |
Two things:
- You are dealing with 4-byte-wide addresses here, so just incrementing by 1 won’t do it.
- Here you were tricked by the C syntax: q will only be incremented afterwards, at the end of the loop.
Addresses!! :)
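In BASIC terms, those two points amount to something like this (a minimal sketch, not the actual routine):

  REM The C idiom  v = *q++;  uses the word at q first, THEN advances q.
  REM A BASIC address must also step by 4 bytes per word, not by 1:
  DIM buf% 16
  q% = buf%
  FOR i% = 0 TO 3
    v% = !q%   : REM fetch the 32-bit word at address q%
    q% += 4    : REM advance by one word (4 bytes), as *q++ does
  NEXT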
Thank you, that was fun! |
André Timmermans (100) 655 posts |
The following C line is puzzling:
I would decompose it as:
which means that the ++ serves no purpose. On second thought, considering the manipulation of count[] a few lines above, I guess it is:
|
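(The quoted lines aren’t reproduced above, but the usual radix-sort idiom of that shape, x[count[b]++] = v, decomposes into two BASIC statements like this; an illustrative sketch with hypothetical variable names:)

  DIM x% 40
  DIM cnt%(255)
  b% = 0 : v% = 42
  REM the C post-increment does two things in sequence:
  x%!(cnt%(b%)*4) = v%   : REM 1: store the value at the current slot for digit b%
  cnt%(b%) += 1          : REM 2: only then advance that digit's slot index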
Kuemmel (439) 384 posts |
@Chris: Thanks for the link, I’ll check that out, but it seems when it comes to speed for sorting huge amounts of integers the radix sort can’t be beaten… At first I also tried not to use the dummy “d”, but BASIC doesn’t seem to like a pointer to a pointer there, as it gives an error.
…now it’s time to port it to assembler :-) |
Chris Hall (132) 3554 posts |
BASIC is quite happy with pointer to pointer (to pointer, to pointer … ) if you get the arguments right. |
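For example (a minimal sketch of nested indirection, not taken from the thread):

  REM Each unary ! follows one pointer level.
  DIM cell% 4, p1% 4, p2% 4
  !cell% = 42        : REM a word holding the value
  !p1% = cell%       : REM a pointer to that word
  !p2% = p1%         : REM a pointer to the pointer
  PRINT !(!(!p2%))   : REM three dereferences: prints 42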
Martin Avison (27) 1494 posts |
I would be astonished if a radix sort written in Basic beats ArmSort (written in assembler) used from Basic! But I have been astonished before. |
Holger Palmroth (487) 115 posts |
Obviously …
… porting it to assembler is Kuemmel’s next step. |
Rick Murray (539) 13840 posts |
I had a lame bubble-sort that moved 512-byte blocks. Written in BASIC, it took a rather long time. I don’t recall the time exactly (it was around twenty years ago) but I think sorting a 200-entry array in 40-odd seconds on an ARM3 was about right.
The problem with BASIC is that it doesn’t really have any “memory”, so the same block of code will be parsed, interpreted, and executed again and again; instead of, say, interpret once, execute many, which is a very basic description of a compiler.
I do not believe that assembler would have offered much benefit. It would be capable of running faster, certainly, but compare that against the extra time it would have taken to write and test. The leap of ~40s → <1s is significant. It is harder to justify trimming one “under a second” value to a different “under a second” value. |
Steve Pampling (1551) 8170 posts |
ping response 400ms: naff link^1
All < 1 second. Harder to justify trimming? I must introduce you to a few of my 9000 userbase; I’m sure you’d have an interesting debate.
^1 Location of selected gateway varies depending on BT load |
Paul Sprangers (346) 524 posts |
Since you want to sort integers in a memory block in Basic, is there any particular reason for not using OS_HeapSort? It’s easy to implement and very fast. Actually, I wouldn’t be surprised if ArmSort were wrapped around that very call. But like Martin, I have been surprised before. |
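For instance, something along these lines sorts a block of words in place (a sketch, not Paul’s code; the R2 value selecting the built-in signed-integer comparison is an assumption, so check the OS_HeapSort entry in the PRM):

  REM Sort n% words at block% in place via the OS_HeapSort SWI.
  n% = 1000
  DIM block% n%*4
  FOR i% = 0 TO n%-1 : block%!(i%*4) = RND : NEXT
  SYS "OS_HeapSort", n%, block%, 3 : REM R2=3: built-in signed-integer compare (assumed)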
Rick Murray (539) 13840 posts |
If you are going to comment on a cherry-picked quote, some context might demonstrate that I wasn’t saying what you thought I was saying. I was referring to a specific case where compiler code would have been unlikely to be that much slower than pure assembler when implementing the same algorithm. In this case, a specific event was considered (on a much older machine) to take “less than a second”, as was its potential replacement. The envisaged difference did not justify the time expended on creating a third implementation. It would have been unlikely to offer the same dramatic difference as was offered by moving from BASIC to C. This was also for irregular events, so the gains from doing the same job twenty thousand times would be dispersed over several lifetimes of the software.
Yes, if you look around you can find many examples of where trimming milliseconds counts. That internet connection with the treacle-slow ping: how fast is it going to load a web page that refers to numerous files on numerous servers, as opposed to the one that can manage the trip in 8ms? As a cumulative effect, it will add up. Also, the freakishly long ping indicates a potential connection problem, so whilst the link may remain active, it could be retrying over and over in the background, the net result being that “it’s slow”.
Another example? My PVR claims that it is capable of D1 recording (which means it should cope with 720×576@25fps). It doesn’t. It manages 640×480, and I think part of the image decode resizes the vertical to 288px in order to reduce bandwidth, though I have not confirmed this. Perhaps optimised assembler might help eke out a bit more from this oversold and underpowered device? When you are trying to record live video to H.263 in real time, encode the audio samples as AAC at the same time, and run some sort of OS to tie it all together, optimisations of microseconds could make big differences.
Away from tech, differences of less than a second can be the difference between being a winner and being a corpse. Most forms of racing fall into this category. Did you ever watch “Tokyo Drift” (one of the Fast/Furious franchise)? There’s a scene where the cars “drift” through a road crossing with loads of people around (probably supposed to be that infamous crossing in Shibuya). A difference of less than a second would be the difference between drifting the crossing… and wrecking out, having slaughtered a significant proportion of the pedestrians. |
Steve Pampling (1551) 8170 posts |
I was pointing out that everything is relative. I could equally have pointed to an instance of a database query which contained a leftover from a test setup: it queried a non-existent test table and, rather than aborting, timed out.^1
Duplex mis-matches causing collisions and back-off; high traffic and prioritised traffic in a shaped link; traffic policing capping things (great fun – silent drops when the limit is reached); QoS slowing less important traffic where someone deems your ICMP echo to be less important. Yeah, I’ve played with a few variants. Which was the exact point I was making: cumulative effect.
I have seen TV clips of stuff from the genre. Strikes me as simulated game footage using live-action figures. I don’t do games – built a games machine a few decades ago and handed it over to my sisters as soon as the fun bit was done.
^1 The authors claimed it was a network problem; our capture of traffic showed the delay element happening repeatedly and generating the total delay. It was of course a quick fix, since we had already done the diagnosis. |
Martin Avison (27) 1494 posts |
In fact ArmSort can use OS_HeapSort, but for single, ascending, large integer arrays it is slower than the ArmSort code! ArmSort also starts with BASIC arrays… and there can be several of them included, with different data types and sort requirements. Try sorting a set of arrays by Age%(), FirstName$(), Surname$(), Address$() using HeapSort!
It is worth noting that sort times can vary dramatically depending on the original sequence of the data, the number of duplicate keys, and the length of the keys.
I was also amused to discover that the radix sort is the algorithm used by a mechanical card sorter to sort 80-column punched cards… which I first used in 1969!! One big disadvantage is that the time taken increases with the length of the sort key. |
Kuemmel (439) 384 posts |
@Martin: Just for comparison, can you check how fast your ArmSort can sort 1,000,000 random 32-bit integers? To create them I use:
I managed to do the assembler port and will put it online soon, after cleaning up the code. The results on my Panda (1200 MHz) are:
In command-line mode (no desktop yet) on my IGEPv5 (1000 MHz), the speed-up of the assembler version is like “wow”
|
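(The generator line itself isn’t shown above; a plausible equivalent, filling a word block with full-range random integers from BASIC’s RND, would be:)

  max% = 1000000
  DIM data% max%*4
  REM RND with no argument returns a random signed 32-bit integer
  FOR i% = 0 TO max%-1 : data%!(i%*4) = RND : NEXT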
Paul Sprangers (346) 524 posts |
Just for your interest: Basic with HeapSort does it in 1.60 seconds. So, that’s a big win for Assembler – and I’ve only timed the sorting, not the creation of the memory block. Here’s the code (without the printing to screen):
Timings are done on an ARMiniX. |
Martin Avison (27) 1494 posts |
@Kuemmel: For 1,000,000 random integers in a BASIC array, generated (with none deliberately set the same) by:
Times with ArmSort:
Note these times are for a BASIC array, not for an array of integers in storage (eg DIM array% max%*4). |
Kuemmel (439) 384 posts |
Find my code here. It runs three assembler versions (a basic implementation and two different ways of loop unrolling, not much faster), the BASIC version, and also, for comparison, OS_HeapSort. You can change the maximum amount of data at the top of the BASIC code and toggle s% to show the values… which doesn’t make much sense for high numbers, it’s basically for debugging purposes. I’m still trying to find a better way of unrolling the loops and creating a flow of independent instructions, but due to the nature of the algorithm most of the LDR/STRs depend on each other and on the content… let’s see if I can find something. @Martin: can you run it on your machines? Maybe something interesting comes up. I’m still impressed with the five-times speed-up on the Cortex-A15. |
Martin Avison (27) 1494 posts |
@Kuemmel: I downloaded your sort test. Note that the data used by each sort is random (ie different)… for comparisons it is better to use the same fixed seed for each test, so I have added:
Here are the results on my Iyonix.
Note: These sorts cheat! … the numbers after sorting are not the same as before sorting … so their times should be discounted. |
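(The added line isn’t shown above; a negative argument to RND reseeds BASIC’s generator, so a one-liner along these lines before each test makes every routine sort identical data:)

  seed% = RND(-7)   : REM fixed reseed: same sequence every run (the value is arbitrary)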
Jeffrey Lee (213) 6048 posts |
So who’s going to be first to try something like this? :-) http://researcher.watson.ibm.com/researcher/view_person_subpage.php?id=3990 (Note that the multi-core part isn’t a requirement… but you’ll obviously get bonus points if you get a multi-core implementation working on RISC OS ;-)) |
Kuemmel (439) 384 posts |
@Martin: The v2 and v3 are not meant to cheat… there must be something wrong with the code, I’ll check that out. @Jeffrey: I found that before on the net as well. Not really sure if NEON would help, as there’s so much memory-indexed addressing… but it would be worth a try someday. Nowadays you can google for graphics-card-accelerated sort algorithms that are up to 100 times faster on big amounts of data. |
Kuemmel (439) 384 posts |
@Martin: Here (Link) I’ve got a new version to test for you. I didn’t find any error for the 1,000,000-value test in the last one, but I made some loop code more “safe”. I also put in a comparison of all sorted data against OS_HeapSort, to check that my code produces exactly the same results. Maybe you can check again, then I’ll update my original link. I also made sure each routine gets the same random values, as you proposed. If successful, can you also run it on the Beagle and RPi?
P.S.: I sped up the BASIC version a bit, as it doesn’t like long variable names. |
Martin Avison (27) 1494 posts |
@Kuemmel: I will try your revised tests … but may not be until Monday after I have prepared for, attended & recovered from the show. |
Steve Drain (222) 1620 posts |
I realise that any assembler or C version is going to far outstrip anything in BASIC, but it is possible to speed up your routine in many ways besides shortening the variable names. However, such tweaks are not very significant in comparison to re-assessing the algorithm to use the strengths of BASIC, rather than transliterating C code. This runs about twice as fast as your routine, and there is still scope for quite a few easy tweaks before going into full crunching.
|
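(Steve’s listing isn’t reproduced above. Purely as an illustration of the kind of rework he describes, and not his code: one way to play to BASIC’s strengths is to cut the number of interpreted statements per element, for example by sorting on two 16-bit digits held in BASIC arrays instead of four 8-bit digits over raw memory:)

  REM Two 16-bit passes instead of four 8-bit ones: half the scatter loops.
  REM Unsigned treatment again; signed keys need the top bit biased.
  n% = 100000
  DIM d%(n%-1), w%(n%-1), c%(65535)
  FOR i% = 0 TO n%-1 : d%(i%) = RND : NEXT
  FOR p% = 0 TO 1
    s% = p% * 16
    c%() = 0   : REM whole-array assignment: a genuine BASIC strength
    FOR i% = 0 TO n%-1 : c%((d%(i%) >>> s%) AND &FFFF) += 1 : NEXT
    o% = 0
    FOR i% = 0 TO 65535 : t% = c%(i%) : c%(i%) = o% : o% += t% : NEXT
    FOR i% = 0 TO n%-1
      v% = d%(i%)
      k% = (v% >>> s%) AND &FFFF
      w%(c%(k%)) = v%
      c%(k%) += 1
    NEXT
    SWAP d%(), w%()
  NEXT
  REM after the two swaps the sorted data ends up back in d%()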