BASIC Assembler & Service calls
Martin Avison (27) 1494 posts |
When writing a module in BASIC Assembler that includes a Service Call Handler, I tend to refer to the OS SWIs StrongHelp manual v3.39 (OS → Module → Module Format → Service handler) as a guide. This page includes the statements: "It is recommended that you write service handling code so as to be cache-line aligned for cache optimisation, and that on reception of service calls that are not catered for by your module your code returns as soon as possible. This means that the entry point for the service call code should be at 16n-4 from the base of the module (modules are always loaded at offset 4)."
The example code given includes the line
]:P%=(P%+15) AND NOT 15:O%=(O%+15) AND NOT 15:[ OPT pass%
When offset assembly is used with an initial P%=0 and O%=DIMmed memory, the memory block is only word aligned. I believe that the difference between P% and O% should be constant, as the assembler increases them in step. However, surely the above line will add differing amounts to P% and O%, unless the initial O% is 16-byte aligned? This caused me significant and obscure code problems (aborts and lockups), until I realised what was happening. Therefore I suspect that the example code given is dangerously wrong. I can find no other reliable reference to this. |
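The drift described above can be checked with a little arithmetic. The following is only a sketch in Python rather than BASIC; the value &24 for P% and the four block addresses are hypothetical, chosen to model a DIMmed block that is word aligned but not necessarily 16-byte aligned.

```python
def align_up(value, boundary=16):
    """Round value up to the next multiple of boundary (a power of two)."""
    return (value + boundary - 1) & ~(boundary - 1)


def offset_drift(p, o, boundary=16):
    """Return the change in O%-P% when both are rounded up independently."""
    before = o - p
    after = align_up(o, boundary) - align_up(p, boundary)
    return after - before


# Hypothetical values: P% = &24 part-way through assembly, and four
# possible word-aligned addresses for the DIMmed block.
P = 0x24
for block in (0x8000, 0x8004, 0x8008, 0x800C):
    o = block + P  # O% started at the block and advanced in step with P%
    print(hex(block), offset_drift(P, o))
```

Only the 16-byte aligned block gives a drift of zero. In the other three cases O% is advanced by less than P%, so under these assumptions code assembled after that point no longer sits where its P% says it should.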
Rick Murray (539) 13850 posts |
Seems to be mentioned here – http://www.riscos.com/support/developers/strongarm/perf.htm Personally, I’d not bother. I was actually unaware of this advice so I put my service call handler wherever it came up, and everything appears to be as it should be. Any lag is likely to be in the order of fractions of a nanosecond on modern cores so nothing to lose too much sleep over. Hmm, does CMHG follow this advice, I wonder? |
Martin Avison (27) 1494 posts |
I have certainly stopped bothering. But I wanted to warn others, and see the consensus before I asked for the StrongHelp manual to be changed. |
Frank de Bruijn (160) 228 posts |
Never really bothered with it either, except in one module, where I used this:
O% is just where the code is placed during assembly. P% is the actual value to check. |
Dave Higton (1515) 3534 posts |
The advice is many years out of date. It might have made a small amount of sense in the days of ARM2 and ARM3, but it doesn’t these days; the time saving would be imperceptible in any reasonable scenario. |
Rick Murray (539) 13850 posts |
ARM 2 doesn’t have a cache, and it’s written in a document about the StrongARM. Aside from that, yes, it’s many years out of date. I suspect better overall savings might be possible by writing code to maximise the superscalar behaviour of modern cores rather than worrying about exactly where the service call handler lies. |
nemo (145) 2554 posts |
]:P%=(P%+15) AND NOT 15:O%=(O%+15) AND NOT 15:[ OPT pass%
Not only wrong, but unless you’d arranged to align P% with O% at the start, utterly useless. I’d do:
DEFFNass(L%,S%):REM L%=length, S%=start address
LOCALM%,O%,P%,Q%:DIMM%L%-1,L%-1
FORQ%=12TO14STEP2:O%=M%:P%=S%:[OPTQ%
...
FNalign(16)
...
]:NEXT:=M%
DEFFNalign(S%):S%-=1:IF(S%+1)ANDS%:ERROR99,"Bad alignment"
IFQ%AND4:O%-=P%
P%=(P%+S%)ANDNOTS%:IFQ%AND4:O%+=P%
=""
If you wanted
LOCALQ%:Q%=?(&86E0+4*(!&86E0+0*RND-!&86E0=0))
<smiles evilly> |
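For readers who find the compact BASIC hard to follow, the arithmetic in FNalign can be modelled roughly as below. This is a Python sketch of the idea, not the original code; the name fn_align and the offset_assembly flag (standing in for the Q% AND 4 test) are made up for illustration.

```python
def fn_align(p, o, boundary, offset_assembly=True):
    """Align P% to boundary, then move O% by exactly the same amount."""
    mask = boundary - 1
    if boundary <= 0 or boundary & mask:
        raise ValueError("Bad alignment")  # boundary must be a power of two
    new_p = (p + mask) & ~mask
    if offset_assembly:
        o += new_p - p  # same step as P%, so O% - P% is unchanged
    return new_p, o


# Hypothetical values: P% = &28, O% = &9004 (word aligned only).
p, o = fn_align(0x28, 0x9004, 16)
print(hex(p), hex(o), hex(o - p))
```

The point of the design is that only P% is tested against the boundary; O% simply follows it, so the alignment holds wherever the DIMmed block happens to land.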
Martin Avison (27) 1494 posts |
I have arranged that the next version of the OS StrongHelp manual will not have these references to cache alignment. (thanks to Sprow). |
nemo (145) 2554 posts |
If people didn’t keep moving stuff around it wouldn’t be necessary to be quite that flaming clever. Oh it’s a pity |
Rick Murray (539) 13850 posts |
Just had a look at various compiled modules. The compiler doesn’t bother to align the service call code to 16, 32 or anything specific. |
Jon Abbott (1421) 2651 posts |
So long as code is aligned to a 32-byte boundary, it will be cache aligned for all current ARM CPUs.
Even if it did align entries, RISC OS no longer loads Modules at a predictable cache alignment offset, so none of the header entry points can be cache aligned. |
Rick Murray (539) 13850 posts |
Now, which would be better to align, should one bother to do so? The ServiceCall handler code, or the preceding ServiceCall table? ;-) |
nemo (145) 2554 posts |
You know there’s two separate caches, yes? |
Rick Murray (539) 13850 posts |
Since forever. But one can’t normally engineer both to be on a 32 byte boundary. :-) |
nemo (145) 2554 posts |
Some confusion. The Service entry is (if signposted by a NOP) preceded by a pointer to the Service Table – so both the entry and the table can be aligned but, the table is only accessed once anyway, so alignment is irrelevant. It is used to add the module to a list for each* service call it supports. *actually they’re shared, but that’s an implementation detail. |
Jon Abbott (1421) 2651 posts |
MOV R0, R0 NOP could encode to something different, depending on the platform. On the main point, I would say that aligning anything that isn’t entered millions of times a second is not required; the lengths required to align Module code are not pretty. I’d like to see a Module flag bit that forces the OS to load the Module cache aligned. |
nemo (145) 2554 posts |
ORLY? ;-)
Unless it’s been changed very recently, this is nonsense. Modules are loaded 16-byte aligned as they have always been. Anyone changing that doesn’t know what they’re doing. It’s part of the API contract and isn’t necessarily anything to do with cache optimisation in an individual module, which may be relying on the bottom bits of certain addresses being zero for algorithmic reasons. <checks> Well it was still aligned as of 5.24 – and now to a 32-byte boundary (which I presume is an RO5 innovation). |
Jon Abbott (1421) 2651 posts |
Not quite correct, the PRM states Modules load at XXXXXXX4. Back in the day, that meant cache aligned+4 – which was Acorn’s intention, as the PRM specifically mentions using the fact to align IRQ entry points.
If that’s the case, Module load alignment has changed recently, as when I last checked, Modules did not load at a known cache alignment. There was a conversation around 5 years ago about the pros and cons of changing the Module load alignment, along with the potential knock-on effect on variations of LDR/STR. I don’t recall the outcome though. It raised its head again when it was observed that a particular piece of code ran 50% quicker when run at a particular cache alignment offset. I don’t recall the CPU this was noted on; it may have been a Pi3. Either way, it appears some modern CPUs are particularly cache alignment sensitive in some scenarios. |
Jeffrey Lee (213) 6048 posts |
Not sure about the thread from 5 years ago – but there was one from 7 months ago https://www.riscosopen.org/forum/forums/11/topics/13957 TL;DR is that OS_Module alignment is poorly specified, poorly implemented (e.g. using OS_Heap to allocate from the RMA will break the alignment of a future OS_Module call), and awkward for modern uses. If nemo’s assertion that there are modules which store flags in the low bits of addresses is true, then perhaps the safest solution would be to add some new OS_Module reason codes for properly aligned memory allocation, and retain &xxxxxxx4 alignment for the existing reason codes (+ fix it to actually guarantee the alignment instead of relying on the alignment of the previous block). |
Rick Murray (539) 13850 posts |
On some cores – wouldn’t that either branch into thumb mode, or throw an exception due to a bogus address?
…find any code that stores flags in addresses, take it out back, and shoot it. |
Rick Murray (539) 13850 posts |
From five years ago – https://www.riscosopen.org/forum/forums/11/topics/2982 |
nemo (145) 2554 posts |
Jon observed
Quite so. The block is at ..0, but then there’s the length word. I had misremembered the “32 byte alignment” – it’s still 16 byte aligned, but the length is now rounded to 32 bytes
I didn’t assert there are, but no one can assert there are not.
Absolutely. Which is an interesting exercise in itself. Probably OS_Heap will have to enforce a suitable granularity for the RMA, as trying to adapt heap blocks after allocation is tricky, what with the implicit guarantee of what an RMA module address is. |