RISC OS Open: Forum: BASIC Assembler & Service calls

Aug 11, 2019 3:11pm

Martin Avison (27) 1494 posts

When writing a module in BASIC Assembler that includes a Service Call Handler, I tend to refer to the OS SWIs StrongHelp manual v3.39 (OS → Module → Module Format → Service handler) as a guide.

This page includes the statements: "It is recommended that you write service handling code so as to be cache-line aligned for cache optimisation, and that on reception of service calls that are not catered for by your module your code returns as soon as possible. This means that the entry point for the service call code should be at 16n-4 from the base of the module (modules are always loaded at offset 4)."
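
To see where the 16n-4 figure comes from, here is a small sketch in BBC BASIC. The load address is invented for illustration; the only assumption is that it ends in 4, as the manual says.

```basic
REM Hypothetical load address ending in 4 (not taken from a real machine)
base%=&20300004
FOR n%=1 TO 3
  offset%=16*n%-4 : REM candidate service entry offsets 12, 28, 44
  REM Last column: low four bits of the run-time address (0 = 16-byte aligned)
  PRINT ~offset%, ~(base%+offset%), (base%+offset%) AND 15
NEXT
```

The last column is 0 for every n%, so an entry at offset 16n-4 lands on a 16-byte boundary only while the base address really does end in 4.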

The example code given includes the line
]:P%=(P%+15) AND NOT 15:O%=(O%+15) AND NOT 15:[ OPT pass%
which seems to be attempting to implement the alignment.

When offset assembly is used with an initial P%=0 and O%=DIMmed memory, the memory block is only word aligned. I believe that the difference between P% and O% should be constant, as the assembler increases them in step.

However, surely the above line will add differing amounts to P% and O%, unless the initial O% is 16-byte aligned?
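
One way to see the problem is to run the line outside the assembler. This is only a sketch with invented addresses: P% stands for the module offset and O% for a word-aligned buffer address of the kind DIM might return.

```basic
REM Invented values: P% is an offset into the module, O% a word-aligned buffer address
P%=&24 : O%=&9A08
before_p%=P% : before_o%=O%
REM The line from the manual's example, run as plain BASIC
P%=(P%+15) AND NOT 15 : O%=(O%+15) AND NOT 15
PRINT "P% advanced by ";P%-before_p%
PRINT "O% advanced by ";O%-before_o%
```

With these values P% moves on by 12 bytes and O% by only 8, so anything assembled afterwards is stored 4 bytes earlier in the buffer than its P% address implies.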

This caused me significant and obscure code problems (aborts and lockups), until I realised what was happening.

Therefore I suspect that the example code given is dangerously wrong.
But it led me to wonder whether any alignment would depend on the processor, and whether the alignment advice is still valid.

I can find no other reliable reference to this.

Aug 11, 2019 3:42pm

Rick Murray (539) 13850 posts

Seems to be mentioned here – http://www.riscos.com/support/developers/strongarm/perf.htm
But note that the cache alignment thing refers to a potential future alignment of 32, not the current 16 (so even if that code worked, it’s wrong).

Personally, I’d not bother. I was actually unaware of this advice so I put my service call handler wherever it came up, and everything appears to be as it should be. Any lag is likely to be of the order of fractions of a nanosecond on modern cores, so nothing to lose too much sleep over.

Hmm, does CMHG follow this advice, I wonder?

Aug 11, 2019 3:55pm

Martin Avison (27) 1494 posts

I have certainly stopped bothering. But I wanted to warn others, and see the consensus before I asked for the StrongHelp manual to be changed.

Aug 11, 2019 3:59pm

Frank de Bruijn (160) 228 posts

Never really bothered with it either, except in one module, where I used this:


] X%=P%AND15 : IFX% P%+=16-X% : O%+=16-X%
[OPT pass%

O% is just where the code is placed during assembly. P% is the actual value to check.
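
As a rough check of that approach, the same line can be tried on invented values outside the assembler. The addresses and the gap% variable are assumptions made for this sketch only.

```basic
REM Invented values: module offset in P%, buffer address in O%
P%=&24 : O%=&9A08 : gap%=O%-P%
REM The line above: work out the padding from P% alone and apply it to both
X%=P% AND 15 : IF X% P%+=16-X% : O%+=16-X%
PRINT "P%=&";~P%;" O%=&";~O%
IF O%-P%=gap% PRINT "O%-P% is unchanged"
```

Both pointers move by the same 12 bytes here, so the offset between the buffer and the module addresses is preserved.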

Aug 11, 2019 8:53pm

Dave Higton (1515) 3534 posts

The advice is many years out of date. It might have made a small amount of sense in the days of ARM2 and ARM3, but it doesn’t these days; the time saving would be imperceptible in any reasonable scenario.

Aug 12, 2019 9:31am

Rick Murray (539) 13850 posts

> in the days of ARM2 and ARM3

ARM2 doesn’t have a cache, and the advice comes from a document about the StrongARM. Aside from that, yes, it’s many years out of date.

I suspect better overall savings might be possible by writing code to maximise the superscalar behaviour of modern cores rather than worry about exactly where the service call handler lies.

Aug 12, 2019 10:40am

nemo (145) 2554 posts

> ]:P%=(P%+15) AND NOT 15:O%=(O%+15) AND NOT 15:[ OPT pass%

Not only wrong, but unless you’d arranged to align P% with O% at the start, utterly useless. I’d do:

DEFFNass(L%,S%):REM L%=length, S%=start address
LOCALM%,O%,P%,Q%:DIMM%L%-1,L%-1
FORQ%=12TO14STEP2:O%=M%:P%=S%:[OPTQ%
...
           FNalign(16)
...
]:NEXT:=M%

DEFFNalign(S%):S%-=1:IF(S%+1)ANDS%:ERROR99,"Bad alignment"
IFQ%AND4:O%-=P%
P%=(P%+S%)ANDNOTS%:IFQ%AND4:O%+=P%
=""
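
The first line of FNalign subtracts 1 from S% and then tests (S%+1) AND S%, which is the usual "size AND (size-1)" power-of-two check. A small sketch of that test on its own, with made-up variable names:

```basic
REM Same test as FNalign's first line, tried on a few sizes
FOR size%=14 TO 17
  mask%=size%-1
  IF size% AND mask% THEN PRINT ;size%;" is not a power of two" ELSE PRINT ;size%;" is a power of two"
NEXT
```

Only 16 passes, which is why FNalign(16) is accepted while something like FNalign(12) would raise the "Bad alignment" error.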

If you wanted FNalign to be so portable it doesn’t even rely on the OPT variable Q%, you’d need to add:

LOCALQ%:Q%=?(&86E0+4*(!&86E0+0*RND-!&86E0=0))

<smiles evilly>

Aug 12, 2019 10:16pm

Martin Avison (27) 1494 posts

> <smiles evilly>

<shudders> no No No NO NO

I have arranged that the next version of the OS StrongHelp manual will not have these references to cache alignment. (thanks to Sprow).

Aug 13, 2019 12:56am

nemo (145) 2554 posts

> no No No NO NO

If people didn’t keep moving stuff around it wouldn’t be necessary to be quite that flaming clever.

Oh it’s a pity OPT isn’t a pseudo variable.

Aug 13, 2019 3:58pm

Rick Murray (539) 13850 posts

> I have certainly stopped bothering.

Just had a look at various compiled modules. The compiler doesn’t bother to align the service call code to 16, 32 or anything specific.

Aug 16, 2019 6:12am

Jon Abbott (1421) 2651 posts

> note that the cache alignment thing refers to a potential future alignment of 32, not the current 16 (so even if that code worked, it’s wrong)

So long as code is aligned to a 32-byte boundary, it will be cache aligned for all current ARM CPUs.

> Just had a look at various compiled modules. The compiler doesn’t bother to align the service call code to 16, 32 or anything specific.

Even if it did align entries, RISC OS no longer loads Modules at a predictable cache alignment offset, so none of the header entry points can be cache aligned.

Aug 16, 2019 11:07am

Rick Murray (539) 13850 posts

> So long as code is aligned to a 32-byte boundary, it will be cache aligned for all current ARM CPUs.

Now, which would be better to align, should one bother to do so? The ServiceCall handler code, or the preceding ServiceCall table? ;-)

Aug 16, 2019 1:20pm

nemo (145) 2554 posts

You know there’s two separate caches, yes?

Aug 16, 2019 4:29pm

Rick Murray (539) 13850 posts

> You know there’s two separate caches, yes?

Since forever. But one can’t normally engineer both to be on a 32 byte boundary. :-)

Aug 17, 2019 6:08pm

nemo (145) 2554 posts

Some confusion. The Service entry is (if signposted by a NOP) preceded by a pointer to the Service Table – so both the entry and the table can be aligned, but the table is only accessed once anyway, so its alignment is irrelevant. It is used to add the module to a list for each* service call it supports.

*actually they’re shared, but that’s an implementation detail.

Aug 18, 2019 7:02am

Jon Abbott (1421) 2651 posts

> if signposted by a NOP

MOV R0, R0

NOP could encode to something different, depending on the platform.

On the main point, I would say that aligning anything that isn’t entered millions of times a second is not required; the lengths needed to align Module code are not pretty.

I’d like to see a Module flag bit that forces the OS to load the Module cache aligned.

Aug 19, 2019 12:18pm

nemo (145) 2554 posts

> NOP could encode to something different, depending on the platform.

ORLY? ;-)

> I’d like to see a Module flag bit that forces the OS to load the Module cache aligned.

Unless it’s been changed very recently, this is nonsense. Modules are loaded 16-byte aligned as they have always been.

Anyone changing that doesn’t know what they’re doing. It’s part of the API contract and isn’t necessarily anything to do with cache optimisation in an individual module, which may be relying on the bottom bits of certain addresses being zero for algorithmic reasons.

<checks> Well it was still aligned as of 5.24 – and now to a 32-byte boundary (which I presume is an RO5 innovation).

Aug 20, 2019 5:49am

Jon Abbott (1421) 2651 posts

> Modules are loaded 16-byte aligned as they have always been.

Not quite correct: the PRM states Modules load at XXXXXXX4. Back in the day, that meant cache aligned+4 – which was Acorn’s intention, as the PRM specifically mentions using that fact to align IRQ entry points.

> and now to a 32byte boundary

If that’s the case, Module load alignment has changed recently, as when I last checked, Modules did not load at a known cache alignment.

There was a conversation around 5 years ago about the pros/cons of changing the Module load alignment, along with the potential knock-on effect on variations of LDR/STR. I don’t recall the outcome though.

It raised its head again when it was observed that a particular piece of code ran 50% quicker when run at a particular cache alignment offset. I don’t recall the CPU this was noted on; it may have been a Pi3. Either way, it appears some modern CPUs are particularly cache alignment sensitive in some scenarios.

Aug 20, 2019 10:22am

Jeffrey Lee (213) 6048 posts

> There was a conversation around 5 years ago about the pros/cons of changing the Module load alignment, along with the potential knock-on effect on variations of LDR/STR. I don’t recall the outcome though.

Not sure about the thread from 5 years ago – but there was one from 7 months ago:

https://www.riscosopen.org/forum/forums/11/topics/13957

TL;DR is that OS_Module alignment is poorly specified, poorly implemented (e.g. using OS_Heap to allocate from the RMA will break the alignment of a future OS_Module call), and awkward for modern uses.

If nemo’s assertion that there are modules which store flags in the low bits of addresses is true, then perhaps the safest solution would be to add some new OS_Module reason codes for properly aligned memory allocation, and retain &xxxxxxx4 alignment for the existing reason codes (+ fix it to actually guarantee the alignment instead of relying on the alignment of the previous block).

Aug 20, 2019 6:34pm

Rick Murray (539) 13850 posts

> which store flags in the low bits of addresses is true,

On some cores – wouldn’t that either branch into Thumb mode, or throw an exception due to a bogus address?
Case in point: ZapObey, which set the offset address of the init code but forgot to ALIGN beforehand, so the code (auto-aligned) was at, say, offset +124, but the address held was the one following the string, so would have been offset +122. It seemingly worked fine up until the Pi2 (ARMv7), where a non-aligned address was faulted.
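
The arithmetic behind that example can be sketched in BBC BASIC. The offsets are the ones quoted above; the variable names are made up for the illustration.

```basic
REM Offset just past the string, as in the example above
after_string%=122
REM What ALIGN would have done: round up to the next multiple of 4
aligned%=(after_string%+3) AND NOT 3
PRINT "Stored entry offset ";after_string%;", code actually at ";aligned%
PRINT "Low two bits of the stored offset: ";after_string% AND 3
```

The stored offset has non-zero low bits, which is what an alignment-checking core objects to, while the assembled code itself sits at the rounded-up offset.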

> perhaps the safest solution would be to

…find any code that stores flags in addresses, take it out back, and shoot it.
Twice.
Head and heart. Double tap.
Make sure it’s dead.

Aug 20, 2019 9:09pm

Rick Murray (539) 13850 posts

From five years ago – https://www.riscosopen.org/forum/forums/11/topics/2982

Aug 22, 2019 6:05pm

nemo (145) 2554 posts

Jon observed

> Not quite correct, the PRM states Modules load at XXXXXXX4

Quite so. The block is at ..0, but then there’s the length word.

I had misremembered the “32 byte alignment” – it’s still 16-byte aligned, but the length is now rounded to a multiple of 32 bytes.

> If nemo’s assertion that there are modules which store flags in the low bits of addresses is true

I didn’t assert there are, but no one can assert there are not.

> fix it to actually guarantee the alignment instead of relying on the alignment of the previous block

Absolutely. Which is an interesting exercise in itself. Probably OS_Heap will have to enforce a suitable granularity for the RMA, as trying to adapt heap blocks after allocation is tricky, what with the implicit guarantee of what an RMA module address is.
