Boolean arrays in BBC BASIC
Steve Drain (222) 1620 posts |
That just makes explicit how …
Which is just the same. The comparison yields a boolean result, ‘true’ or ‘false’, …
If that is so – I have not checked – and you are testing it in a shortish …
It is not ‘coercing’ and ‘fudging’. It is called ‘casting’ and BASIC does it all without fuss. Of course, there is a time penalty, but not huge. All floating point systems can hold any 32-bit integer exactly, which is handy when you want to deal with unsigned integers in BASIC. I feel a great weight hanging over my head. ;-) Edit: Not all floating point systems. Not single precision as with NEON.
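A quick sketch of the ‘unsigned integers in BASIC’ trick, for anyone who has not met it (my illustration, not code from the post): BASIC’s float type has enough mantissa to hold every 32-bit value exactly, so a negative integer can be cast and adjusted.
@% = &00000A0A            : REM widen PRINT to 10 significant figures
i% = -1                   : REM bit pattern &FFFFFFFF, negative as a signed integer
PRINT i%                  : REM -1
u = i%                    : REM cast to floating point...
IF u < 0 THEN u += 2^32   : REM ...and adjust to read it as unsigned
PRINT u                   : REM 4294967295, held exactly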
Would you actually find … |
Frank de Bruijn (160) 228 posts |
And you would be right, because it is. |
Steve Drain (222) 1620 posts |
I have edited my last post to remove a glitch. I should make clear that … |
Clive Semmens (2335) 3276 posts |
I think, but haven’t checked, that shifts and … |
Steve Drain (222) 1620 posts |
AFAICT the BASIC … There is a lovely phrase I have picked up, ‘premature optimisation’, that warns against making these changes at the wrong stage of programming. Here are the optimised routines:
DEFPROCsetBit(a%,b%)PROCresBit:?a%=?a%OR1<<b%:ENDPROC
DEFPROCresetBit(a%,b%)PROCresBit:?a%=?a%ANDNOT1<<b%:ENDPROC
DEFFNgetBit(a%,b%)PROCresBit:=?a%AND1<<b%
DEFPROCresBit:a%+=b%>>3:b%=b%AND7:ENDPROC
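If anyone wants to try those out, a minimal harness might look like the following (my sketch; it assumes the four DEF lines above are placed after the END of the program, as usual):
DIM flags% 124                : REM 125 bytes, room for 1000 bits
FOR i% = 0 TO 124 : flags%?i% = 0 : NEXT
PROCsetBit(flags%, 42)
PRINT FNgetBit(flags%, 42)    : REM non-zero: bit 42 is set
PROCresetBit(flags%, 42)
PRINT FNgetBit(flags%, 42)    : REM 0: bit 42 is clear
END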
Steve Drain (222) 1620 posts |
Rambling on … The BASIC integer type is used in three domains: signed numbers, bit fields and boolean values. If the use of a variable is confined exclusively to one of those there is never a problem. However, there are gains to be made employing the functions of one domain in another. It is in this cross-over that premature optimisation is a danger and care must be taken. ;-) |
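To make that concrete (my own example, not Steve’s): because TRUE is -1 and FALSE is 0, a comparison result can be carried straight into the other two domains.
x% = 7
over% = (x% > 5)              : REM boolean domain: TRUE is -1, FALSE is 0
PRINT over%                   : REM -1
count% = 0
count% -= (x% > 5)            : REM number domain: subtracting TRUE adds one
PRINT count%                  : REM 1
mask% = %1010 AND (x% > 5)    : REM bit-field domain: AND with -1 keeps every bit
PRINT ~mask%                  : REM A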
Clive Semmens (2335) 3276 posts |
Your optimized routines look even more like what I’d done already, except that as I wrote, I’d already optimized them further by inlining them. Sadly, still too slow for the biggest arrays I’d like to try. Not to worry. I’ve not run out of memory with byte-sized booleans yet (which is an order of magnitude faster…), and when/if I do, I’m obviously going to have to go to assembler – which might still be too slow. Maybe book myself in for a spell in cryogenic hibernation. Or give up on this ridiculous project. Or bite the AArch64 assembler bullet and get an M1-powered Mac 8~) As for the nice ramble – spot on. |
Rick Murray (539) 13840 posts |
I don’t know if it’s true. I just can’t help but think that the creator of the language used checks against zero instead of FALSE for better reasons than “it’s quicker to type”.
I’m thinking of things from the point of view of what the machine is doing inside, which is probably promoting the integer to a float and then comparing them.
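A minimal sketch of the distinction (mine, for illustration): FALSE is 0 and TRUE is -1, so testing for non-zero and testing for equality with TRUE are not the same thing.
x% = 1                        : REM non-zero, but not equal to TRUE (-1)
IF x% THEN PRINT "IF treats any non-zero value as true"
IF x% <> FALSE THEN PRINT "comparing against FALSE (0) agrees with IF"
IF x% = TRUE THEN PRINT "never printed: 1 is not -1"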
I was going to say, reading it, that shifting down and shifting back up by the same amount looks as though an AND might do the job… Glad I read ahead. Let’s put it differently: I suck at maths, but I’m quite good at binary, so yes, shifts are easier for me. You can throw in OR, AND, NOT, and EOR too; I can work with those. I’ve not had to work with bit arrays larger than a word. And since I’ve given my code in C, there’s always the other option: epic cheat and use a bitfield. ;-) |
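On the shift point, a small sketch of the equivalence Rick alludes to (my code, not his): shifting down and back up by n clears the low n bits, and a single AND with a mask does the same.
v% = &12345678
n% = 8
a% = (v% >>> n%) << n%          : REM shift down then back up: low 8 bits cleared
b% = v% AND NOT ((1 << n%) - 1) : REM one AND with a mask gives the same result
PRINT ~a%, ~b%                  : REM 12345600  12345600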
GavinWraith (26) 1563 posts |
I have a small query, which is easier to ask here than look up in an ARM^n document. It concerns parallel execution. If two adjacent instructions are independent of one another, the order in which they are written clearly does not matter. Do modern ARMs execute them in parallel? I would expect this to be the case. Indeed, the whole question of parallelizability is obviously a very big deal, but I have not seen any documents with explicit details. Can anybody quote chapter and verse? |
Rick Murray (539) 13840 posts |
I think it depends upon the processor. This talks about the A8 – https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/3919/cortex-a8-instruction-fetch-for-dual-issue/10314#10314 Later cores use out of order execution, to further muddle things up. ;-) |
Rick Murray (539) 13840 posts |
Class lecture slides. More to do with Intel, but the concepts are broadly the same. |
Sprow (202) 1158 posts |
It turned out that even ARM’s own documents (for the Cortex-A8 at least) didn’t contain a full cycle description of how the various instructions get pipelined, especially when one pipeline gets stalled because it’s waiting for a result. In other words you can’t just write … Recent-ish versions of the DDE include an application that does the modelling for you (A8time), accepting arbitrary code in ObjAsm format as input and outputting an annotated diagram of what happens when. Chapter 7 of the Desktop Tools manual explains what it all means. As the name suggests, A8time models a Cortex-A8. I suspect fiddling round with instruction and register orders is a law of diminishing returns, and that some reordering is better than none, but that it is not worth worrying too much beyond that. |
GavinWraith (26) 1563 posts |
Thanks for that. I had not really noticed A8time in the DDE. Very interesting. It certainly brings home the gulf between the early ARM CPUs and present-day ones. |
David J. Ruck (33) 1635 posts |
Given the wide range of ARM cores that RISC OS runs on, I would not worry about superscalar instruction ordering. Use the generic compiler options, and if you are still writing assembler by hand, aim for maintainability. Only if you are targeting one specific hardware platform should you consider using the compiler tuning options for its CPU. Generally you are only going to have to worry about instruction-level optimisation on the oldest, slowest processors, which have little or no superscalar capability (the awful XScale had so many additional delays compared to every other ARM that generic code, or even StrongARM-optimised code, performed poorly). You don’t have to worry about it at all on the fastest cores now, as out-of-order execution allows the processor to re-order instructions in an optimal way for its internal architecture at run time, which gives better results than doing it at compile time, even when targeting a specific core. Unless your problem is unrealistically CPU-bound pure computation, you are always going to get more gains from feeding the data to the processor in a way which avoids cache misses than from worrying about instruction ordering. |
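For anyone who wants to see the cache effect from BASIC, something along these lines will show it, although the interpreter overhead softens the difference compared with compiled code (the 4 MB block size and 64-byte line length are my assumptions; adjust for the machine, and make sure the WimpSlot is big enough):
size% = 1 << 22                  : REM 4 MB of bytes, big enough to defeat the data caches on typical boards
DIM data% size% - 1
sum% = 0
t% = TIME
FOR i% = 0 TO size% - 1          : REM sequential: consecutive bytes share cache lines
  sum% += data%?i%
NEXT
PRINT "sequential pass: "; TIME - t%; " cs"
sum% = 0
t% = TIME
FOR j% = 0 TO 63                 : REM strided: nearly every access fetches a new line
  FOR i% = j% TO size% - 1 STEP 64
    sum% += data%?i%
  NEXT
NEXT
PRINT "strided pass:    "; TIME - t%; " cs"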
Steve Drain (222) 1620 posts |
This is for Clive really, but I have been playing around with some code to implement actual bit arrays, not really BASIC, but very nearly. I am not absolutely sure I have got the offset calculation right, but my head hurts, so I thought I would offer it for view at: http://www.kappa.me.uk/Miscellaneous/swBitArray002.zip
Calling it looks like:
CALLbitDim:a`(),1,2,3
PRINT DIM(a`(),1)
CALLbitSet:a`(),0,1,0
PRINT USRbitGet:(a`(),0,1,0)
CALLbitReset:a`(),0,1,0
PRINT USRbitGet:(a`(),0,1,0)
As always, comments are welcome. ;-) |
Clive Semmens (2335) 3276 posts |
I do believe you’re putting more effort into this than I am, Steve! For the moment I’m staying entirely in BASIC and living with the limited array size – I’ve got this horrid suspicion that it’s all too slow however I do it, once I try array sizes that won’t fit in my Pi at a byte per point.
I don’t have much hope – no, I don’t have any hope – of avoiding cache misses in these operations, more’s the pity. |
David J. Ruck (33) 1635 posts |
But by reducing the data size from integers to bytes, you are reducing the cache misses by a factor of 4. If you use bits, you reduce the cache misses by a factor of 32. When dealing with large amounts of data, the more efficient use of the cache far outweighs the more complex code needed to extract the correct bits. |
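To put rough numbers on that (my arithmetic, taking a million flags as the example):
N% = 1000000                                      : REM one million flags
PRINT "as an integer array: "; 4 * N%; " bytes"   : REM DIM flags%(N%-1)
PRINT "as a byte block:     "; N%; " bytes"       : REM DIM bytes% N%-1
PRINT "as packed bits:      "; (N% + 7) DIV 8; " bytes"   : REM 8 flags per byte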
Clive Semmens (2335) 3276 posts |
Except that it’s not so much reducing the cache misses as increasing the cache hits – from almost exactly zero to 4 or 32 times almost exactly zero. I’m doing random accesses into arrays as big as available memory allows. Cache hits are extremely improbable. |
Steve Drain (222) 1620 posts |
Quite possibly, but I am enjoying it. Your challenge tied in with something I was fiddling with and gave me a point of focus. The code has advanced and has nearly all the checks and errors. It looks like this:
PROCbitAssemble
CALLbit:DIM a`(1,2,3)
PRINT DIM(a`(),1)
CALLbit:a`(0,1,0)=1
PRINT USRbit:a`(0,1,0)
CALLbit:a`(0,1,0)=0
PRINT USRbit:a`(0,1,0)
CALLbit:a`(0,1,0)=2:REM toggle
PRINT USRbit:a`(0,1,0)
It is at: http://www.kappa.me.uk/Miscellaneous/swBitArray003.zip
Byte arrays would only require small changes. ;-) |
Steve Drain (222) 1620 posts |
Having fixed a couple of bugs, I have added byte arrays for anyone still interested. ;-) |