RISC OS Open: Forum: Norcroft 64

Dec 17, 2023 8:22pm

Colin Ferris (399) 1814 posts

Looks like the Norcroft C compiler can output ARM 64bit code :-)

Dec 17, 2023 8:31pm

Rick Murray (539) 13840 posts

Brilliant! Now we just need to tweak the MakeFiles and RISC OS will be 64 bit…

…oh, wait.

Dec 17, 2023 10:34pm

Stuart Swales (8827) 1357 posts

“This was once revealed to me in a dream”

Dec 17, 2023 11:08pm

Cameron Cawley (3514) 158 posts

I’m guessing that the main focus is to allow AArch64 applications to run within AArch32 RISC OS on existing platforms like the Pi 4. That feels like a more achievable goal (especially if we stick to ILP32 so that we don’t need as many API changes), while still laying foundations for Cortex-A76 support later on.

That said, I don’t have any insight into ROOL’s plans (and the Icon Bar article is a bit light on details), but it’s good to see that progress is being made.

Dec 17, 2023 11:13pm

Simon Willcocks (1499) 513 posts

That’s not how it works; a 32-bit core can’t run 64-bit code. Updating Norcroft seems like a waste of effort, to me, given gcc.

Dec 17, 2023 11:17pm

Steffen Huber (91) 1953 posts

Looks like the Norcroft C compiler can output ARM 64bit code :-)

Probably the last mover amongst the currently available C compilers to add an AArch64 backend.

Dec 17, 2023 11:22pm

Michael Grunditz (8594) 259 posts

@Simon

But a 64-bit core can. Just send off your Norcroft-compiled 64-bit program to a core running in 64-bit mode. Perfectly doable, but not very useful as it is now; it requires a bit of infrastructure.

Dec 17, 2023 11:22pm

Cameron Cawley (3514) 158 posts

That’s not how it works; a 32-bit core can’t run 64-bit code.

Fair enough. Could you elaborate a bit on how interactions between 32-bit and 64-bit mode work?

Dec 17, 2023 11:41pm

Simon Willcocks (1499) 513 posts

OK, so basically (it’s been a few years since I played with it), 64-bit EL2 can manage 32-bit code by giving it a memory space that looks to the 32-bit code like physical memory. The 32-bit code can manage that memory as if it were real. Interrupts, aborts, etc. come up to the 64-bit code that can pass them back down to 32-bit code.

What can potentially be done is have 64-bit code request services (like Wimp_Poll) from 32-bit code. At the moment, I’m hoping that we can transition to processor agnostic communication mechanisms (pipes and task queues) that allow for multiple cores co-operating with 32 and 64 bit code.

Dec 17, 2023 11:48pm

Paolo Fabio Zaino (28) 1882 posts

Fair enough. Could you elaborate a bit on how interactions between 32-bit and 64-bit mode work?

AFAIR, a core can only change state at reset (at least on ARMv8 and v9). However, for the well-known arrangement of AArch64 in kernel mode with AArch32 in user mode, a different rule applies:

“an AArch64 entity can host AArch32 children, but not the other way round”

This means that an AArch64 hypervisor can host an AArch32 VM, but an AArch32 hypervisor cannot host an AArch64 VM. Note the terms hypervisor and VM (so EL2 and EL1/EL0). AFAIU, this is how RISC OS on Linux should work on the Pi 5: RISC OS runs in user-mode AArch32, while Linux runs in privileged-mode AArch64.

However, as mentioned by Michael, you can send the AArch64 code to a core that IS in AArch64 mode (set in such a mode at boot or reset). As Michael also pointed out, that code will probably need a kernel or something similar to do anything useful. So what could possibly work in such a situation is an approach similar to the one used on Hydra, where minimal kernels were used on the extra CPUs. But that would only allow execution of things like math processing etc. So, still limited.

HTH

[edit]
Note that the switching is possible only on CPUs that have both AArch32 user mode and AArch64 privileged mode through a hypervisor. CPUs that have AArch32 and AArch64 but no EL2 can only be set to one or the other at boot or reset time, AFAIR. I am sure Jeffrey or Sprow will know more details!
[/edit]

Dec 17, 2023 11:54pm

Michael Grunditz (8594) 259 posts

You can PCIRamAlloc memory and use some of it as a workspace shared with RISC OS and the rest for the 64-bit code. That scenario could easily work for a game, although with this simple setup the game needs to poll for input from its workspace.
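A rough sketch of what such a polled workspace might look like, written as portable C11 rather than anything RISC OS specific. Everything here is an assumption made for illustration: the `input_mailbox` layout, the field names and the three functions are hypothetical, and the allocation of the shared block is not shown. It is a single-slot mailbox: the 32-bit side publishes an input snapshot only when the previous one has been consumed, and the 64-bit side polls a sequence counter. Fixed-width types keep the layout identical for 32-bit and 64-bit builds; real cross-core use would also depend on how the memory is mapped and cached.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the shared workspace (not a RISC OS structure). */
typedef struct {
    _Atomic uint32_t seq;   /* incremented by the producer per snapshot  */
    _Atomic uint32_t ack;   /* last snapshot the consumer has taken      */
    uint32_t keys;          /* illustrative payload: key bitmask         */
    int32_t  mouse_x;
    int32_t  mouse_y;
} input_mailbox;

void mailbox_init(input_mailbox *mb)
{
    atomic_init(&mb->seq, 0);
    atomic_init(&mb->ack, 0);
    mb->keys = 0;
    mb->mouse_x = 0;
    mb->mouse_y = 0;
}

/* Producer side (the 32-bit code). Returns false if the previous
   snapshot has not been consumed yet, so the slot is never overwritten
   while the consumer might be reading it. */
bool mailbox_publish(input_mailbox *mb, uint32_t keys, int32_t x, int32_t y)
{
    uint32_t seq = atomic_load_explicit(&mb->seq, memory_order_relaxed);
    if (atomic_load_explicit(&mb->ack, memory_order_acquire) != seq)
        return false;
    mb->keys = keys;
    mb->mouse_x = x;
    mb->mouse_y = y;
    /* Release: payload writes become visible before the new seq value. */
    atomic_store_explicit(&mb->seq, seq + 1, memory_order_release);
    return true;
}

/* Consumer side (the 64-bit code). Returns false if nothing new. */
bool mailbox_poll(input_mailbox *mb, uint32_t *keys, int32_t *x, int32_t *y)
{
    uint32_t seq = atomic_load_explicit(&mb->seq, memory_order_acquire);
    if (seq == atomic_load_explicit(&mb->ack, memory_order_relaxed))
        return false;
    *keys = mb->keys;
    *x = mb->mouse_x;
    *y = mb->mouse_y;
    /* Release: hand the slot back only after the reads are done. */
    atomic_store_explicit(&mb->ack, seq, memory_order_release);
    return true;
}
```

The game loop would call `mailbox_poll` once per frame; input that arrives while the slot is still full is dropped by `mailbox_publish`, which is usually acceptable for a "latest state" snapshot but not for an event queue.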

Dec 18, 2023 6:02am

Rick Murray (539) 13840 posts

but it’s good to see that progress is being made.

<shrug>

Every RISC OS machine that isn’t a dinosaur has three 32 bit cores doing exactly nothing, the compiler screws everybody with the emulated FP rather than the onboard FP that’s been around on RISC OS hardware for a decade and a half, no SIMD without resorting to assembler…

Dec 18, 2023 11:01am

Colin Ferris (399) 1814 posts

Didn’t Stuart (the one who spills tea when I mention 512MB for RPC) do a hardware FPU library for C users?

Dec 18, 2023 11:43am

Stuart Swales (8827) 1357 posts

Didn’t Stuart do a hardware FPU Library for C users?

There are quite useful gains to be had by using said library with Norcroft, but that’s only a halfway house. A compiler with true VFP support would knock it out of the park for properly FP intensive work. Some authors might be willing to support two builds of applications – one just for ARMv7 w/VFP and one for ARMv3. I probably would fall into that camp.

Dec 18, 2023 12:00pm

Rick Murray (539) 13840 posts

As Stuart says, proper VFP support would be the way forward.
His library is useful, certainly, but make no mistake – it exists entirely because the compiler can’t do this stuff for itself.

Dec 18, 2023 12:20pm

Colin Ferris (399) 1814 posts

I don’t know how Stuart has worked it – but some C progs load like a module at startup – I wonder if that system could be used here.
Like a BASIC library: one lib for software FP and another lib for hardware.

Dec 18, 2023 12:30pm

Stuart Swales (8827) 1357 posts

I don’t know how Stuart has worked it …

I haven’t gone to that extreme. If the system has VFP, then it will be used, otherwise it falls back to using the equivalent FPA instructions in most cases.

[See https://www.riscosopen.org/forum/forums/2/topics/3457?page=12#posts-146247]

Dec 18, 2023 2:15pm

Rick Murray (539) 13840 posts

falls back to using the equivalent FPA

Ideally Norcroft would have the following options:

FPA
VFP
VFP with FPA fallback (probably the best option for most of RISC OS)

Question is, how do you handle the data? Annoyingly the byte order of the words is back to front (VFP vs FPA).

Dec 18, 2023 2:25pm

Stuart Swales (8827) 1357 posts

Annoyingly the byte order of the words is back to front (VFP vs FPA).

It’s the word order that’s different in VFP (and in most other IEEE754 double implementations)

VFP with FPA fallback

This would produce bonkers code if done natively by the compiler; you’d really need an implementation like mine.

Dec 18, 2023 5:39pm

Rick Murray (539) 13840 posts

It’s the word order that’s different in VFP

Hmm, I knew something was the other way around…

This would produce bonkers code if done natively

Not necessarily.

FPA – use the traditional FP ops
VFP – use VFP ops
Choose at runtime – don’t use any ops, call library routines and let the library work out what to do.

The choose option will always be slower (at least a branch ¹ followed by a load ² followed by a compare ³ followed by the op) but it’ll be far quicker than assuming FPA and it’ll work on old and new systems alike by just “doing the useful thing”.

Yes, it would lead to crazy code if the compiler tried to sort it out for itself every time. So it shouldn’t.

In essence, it’s like using your library, but doing so while writing normal C code…

¹ To the library.

² The “which FP to use” variable.

³ To choose what to do, one would expect to then branch to FPA and fall through to VFP for speed.
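
The "choose at runtime" option can be sketched in C roughly as follows. This is only an illustration of the branch, load, compare sequence described above, not how Norcroft or any existing library does it: `fp_init`, `fp_add`, `add_fpa`, `add_vfp` and the call counters are all hypothetical names, and the two back ends are plain C additions standing in for routines that would really contain FPA or VFP instructions.

```c
#include <stdbool.h>

/* Counters exist only so the sketch can show which path was taken. */
static int fpa_calls = 0;
static int vfp_calls = 0;

/* Stand-ins for the real back ends (hypothetical). */
static double add_fpa(double a, double b) { fpa_calls++; return a + b; }
static double add_vfp(double a, double b) { vfp_calls++; return a + b; }

/* The "which FP to use" variable, set once at program start. */
static bool have_vfp = false;

void fp_init(bool vfp_present)
{
    have_vfp = vfp_present;
}

/* What the compiler would call instead of emitting an FP op inline:
   load the flag, compare, branch away for FPA, fall through for VFP. */
double fp_add(double a, double b)
{
    if (!have_vfp)
        return add_fpa(a, b);
    return add_vfp(a, b);
}
```

With this shape, `x = y + z` in ordinary C source would compile to a call to something like `fp_add`, and the decision costs one load and one compare per operation.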

Dec 20, 2023 12:23pm

Graeme (8815) 106 posts

Annoyingly the byte order of the words is back to front

You can use two VLDR/VSTR instructions with 32-bit (S) registers to store in the reverse order because D0-D15 are the same registers as S0-S31.

Dec 20, 2023 12:33pm

Stuart Swales (8827) 1357 posts

You can use two VLDR/VSTR instructions with 32-bit (S) registers to store in the reverse order because D0-D15 are the same registers as S0-S31.

Helpfully, in the -apcs /softfp world (as has long been the case with FPA), floating-point function parameters (and the return value) are passed in ARM registers, so all you do is twizzle during double loading/saving.

        VMOV        d0, r1, r0 ; x in {r0,r1}, FPA ordered
        VMOV        d1, r3, r2 ; y in {r2,r3}, FPA ordered
...
        VMOV        r1, r0, d0 ; result in {r0,r1}, FPA ordered
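
For code that is not written in assembler, the same twizzle can be sketched in C. This assumes the host stores doubles in the usual IEEE 754 word order (as VFP does); `swap_words`, `store_fpa_double` and `load_fpa_double` are hypothetical helper names, not part of any existing library.

```c
#include <stdint.h>
#include <string.h>

/* Exchange the two 32-bit halves of a 64-bit double image.
   The operation is its own inverse, so it converts in both directions. */
static uint64_t swap_words(uint64_t bits)
{
    return (bits << 32) | (bits >> 32);
}

/* Write a native (VFP-ordered) double to memory in FPA word order. */
static void store_fpa_double(unsigned char out[8], double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);   /* avoids aliasing problems */
    bits = swap_words(bits);
    memcpy(out, &bits, sizeof bits);
}

/* Read an FPA-ordered double from memory into a native double. */
static double load_fpa_double(const unsigned char in[8])
{
    uint64_t bits;
    double value;
    memcpy(&bits, in, sizeof bits);
    bits = swap_words(bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```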

Dec 20, 2023 7:17pm

David J. Ruck (33) 1635 posts

Obviously you want to use the default word ordering for whichever FP system you are using at runtime for the best performance. However, you do have a problem if the program either stores doubles as part of a file format or sends them over the network in binary. The saver/sender and loader/receiver could be using different word ordering, which will make life rather interesting, as there are only a few values out of the 64-bit space which are not valid when the ordering is wrong, although it will be very obvious to the user.
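
A small C illustration of that point, assuming a host with ordinary IEEE 754 doubles: reading a double with its two words exchanged still gives a perfectly legal number, just a wildly wrong one. The function name `misread_with_swapped_words` is made up for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Simulate loading a double that was saved with the other word order:
   the two 32-bit halves of the 64-bit image are exchanged. */
double misread_with_swapped_words(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits = (bits << 32) | (bits >> 32);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, 1.0 has the image 0x3FF0000000000000; with the words exchanged it becomes 0x000000003FF00000, which is a tiny positive denormal rather than an error. A file format that stores raw doubles therefore cannot rely on the FP hardware to reject bad data and needs to fix one word order in its specification.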

Dec 20, 2023 7:31pm

Rick Murray (539) 13840 posts

Is it possible to get VFP to swap word order on loading FP values from memory ¹, or must it be done via ARM registers?

¹ Sounds like the sort of weird things that NEON can do.

Dec 21, 2023 12:41am

Martin Philips (9013) 48 posts

Is there an ARM64 assembler that uses the same syntax as the DDE assembler, or can the DDE assembler be made to output ARM64 instructions?
