VFP Execution Environment in Library

category: Specification

<div id="toc_heading"></div><div id="toc"></div>

h2. Goals

There is currently a bias towards using the older FPA instruction set for floating point operations, ostensibly because the majority of RISC OS systems were based around the ARM7500FE or lacked floating point hardware entirely. In that situation a floating point emulator was a good way to let the same code run in either scenario.

Since ARM11 the scales have tipped the other way as these cores implement one variant or another of the VFP hardware macrocell, Arm's replacement for the FPA. Ignoring VFPv1, since RISC OS never ran on any of the chips and version 1 has since been obsoleted, this means VFPv2 or VFPv3 or VFPv4.

This proposal looks at how to better integrate hardware floating point support within RISC OS, primarily for those clients of the Shared C Library, but also setting out examples by which hardware floating point support can be detected so that when multiple variants are supplied the correct one is run.

h2. Existing documentation

h3. Relevant specifications

"IHI0042":https://github.com/ARM-software/abi-aa/releases - Procedure Call Standard for the Arm 32-bit Architecture
"DUI0041":https://developer.arm.com/documentation/dui0041/c/ARM-Object-Format/ARM-Object-Format - ARM Object Format (AOF)

h3. Relevant forum threads

"VFP implementations of transcendental functions and decimal conversion":/forum/forums/2/topics/9917
"BASICVFP":/forum/forums/3/topics/9081
"Taking FP seriously":/forum/forums/5/topics/9032
"FP support":/forum/forums/2/topics/3457
"VFP context switching":/forum/forums/3/topics/401

h2. Terminology

*FPA* is the original Floating Point Accelerator, coprocessor numbers 1 & 2 for the ARM.
*VFP* is the Vector Floating Point unit, coprocessor numbers 10 & 11 for the ARM.
The *APCS* is the ARM Procedure Calling Standard, treated as a synonym here for ATPCS (for ARM/Thumb PCS), and AAPCS (for ARM Architecture PCS).
An *ABI* is an Application Binary Interface, comprising a base ABI for the core registers only, which is the basis for several variants.

h2. Detail

h3. APCS amendments

h4. Recap of existing calling standards

The original APCS had 4 independent choices, leading to a potential 16 combinations:
* 32b versus 26b addressing mode (whether LR preserves a function's flags)
* Implicit versus explicit stack limit checking
* How floating point arguments are passed
* Whether code is reentrant or not

Of those 16 combinations the two most commonly encountered on RISC OS are

table(bordered).
|_=. Name   |_=. PC |_=. Stack checking       |_=. FP arguments        |_=. Reentrancy   |_=. Equivalent tool setting  |
|=. APCS-R  |=. 26b |=. Explicit, register SL |=. In integer registers |=. Not reentrant |<. @-apcs 3/26/swst/nofpregargs/nonreent@ |
|=. APCS-32 |=. 32b |=. Explicit, register SL |=. In integer registers |=. Not reentrant |<. @-apcs 3/32/swst/nofpregargs/nonreent@ |

APCS-U and APCS-M were never used on RISC OS, and APCS-A support was withdrawn in "SharedCLibrary 5.06":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/tree/RISC_OSLib-5_06.

h4. Change for VFP

In the context of this proposal the only choice to review is how FP arguments are passed. With APCS-R and APCS-32 this is a compromise for the lack of hardware floating point: moving floating point values into F0-F3 before making a function call would cause extra instruction aborts which FPEmulator would need to decode and emulate, when in fact F0-F3 aren't real registers anyway (they're emulated). Instead, the values are moved into integer registers, such that FPEmulator can recover them without having to do another round of emulated moves.

With _hardware_ VFP it is advantageous to switch to passing floating point values in FP registers, not least because the register bank is larger (either 16 or 32 double precision registers) with D0-D7 available for argument passing. This reduces register pressure on the integer register set too, so a function could have up to 4 integer parameters and 8 double (or 16 single) precision arguments and pass them all in registers.

By toggling only a single design choice a new APCS-VFP is defined. Notice that provided no floating point is used it degrades to the base standard and is ABI compatible with APCS-32.
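
The effect on register pressure can be illustrated with a minimal sketch in C. It is not part of the proposal: the @arg_kind@ type and @stack_words()@ function are hypothetical names, only @int@ and @double@ arguments are modelled, and the allocation rule is assumed to be a simple first come, first served fill of R0-R3 and D0-D7.

```c
#include <stddef.h>

/* Hypothetical argument kinds, for this sketch only. */
typedef enum { ARG_INT, ARG_DOUBLE } arg_kind;

/* Count how many 32-bit words of arguments spill onto the stack.
 * fp_in_fp_regs == 0 models APCS-32: a double occupies two integer
 *                    argument words, competing for R0-R3.
 * fp_in_fp_regs != 0 models the proposed APCS-VFP: doubles go to D0-D7
 *                    (assumed to be filled in order, no back-filling). */
static size_t stack_words(const arg_kind *args, size_t n, int fp_in_fp_regs)
{
    size_t int_words = 0;  /* words competing for R0-R3 */
    size_t dregs = 0;      /* doubles already placed in D0-D7 */
    size_t spilled = 0;    /* words that end up on the stack */

    for (size_t i = 0; i < n; i++) {
        if (args[i] == ARG_DOUBLE && fp_in_fp_regs) {
            if (dregs < 8)
                dregs++;
            else
                spilled += 2;          /* ninth and later doubles */
        } else {
            int_words += (args[i] == ARG_DOUBLE) ? 2 : 1;
        }
    }
    if (int_words > 4)
        spilled += int_words - 4;      /* only R0-R3 carry arguments */
    return spilled;
}
```

Under these assumptions a function taking 4 integers and 8 doubles spills nothing with FP register arguments, whereas the same signature spills 16 words when the doubles must share the integer registers.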

h4. No further changes

This proposal seeks to make the minimum changes to yield most of the benefit with a manageable amount of work, in light of the original VFPv1 documentation being made public by Arm in 2000 yet no ABI being defined prior to this document. Therefore it does not change:
* Stack alignment requirements - while this does provide a performance increase, it is expected to be dwarfed by the performance gains from using hardware floating point more generally, since in a _typical_ function 16 or 32 FP registers are probably enough to avoid needing to spill into memory. What this does prevent is direct use of the @LDREXD@ and @STREXD@ (and @LDRD@ and @STRD@ on the ARMv5TE *only*) instructions which require 8 byte stack alignment.
* Position independence - code will continue to be statically linked within ROM and relocated when RAM loaded, data is not reentrant
* High numbered registers - unchanged
** R12 is intra procedure scratch
** R11 the frame pointer (when enabled)
** R10 the stack limit (when enabled)
** R9 free for use except within the HAL which uses it as a static base pointing to the HAL workspace
* No interworking with Thumb

h4. Linking with other code

The "GCCSDK":https://www.riscos.info/index.php/GCCSDK is a collection of compilers derived from the "GNU Compiler Collection":https://gcc.gnu.org/. As of version 4.7.4 rel 6 of the RISC OS port, when a VFP floating point unit is selected as the target the compiler emits code which uses a 'softfp' calling convention. That is, intermediate calculations use native VFP instructions but any function calls transfer the values back into integer registers, which is not ABI compatible with the APCS choice proposed. This means programs compiled with GCCSDK cannot use the SharedCLibrary for support; they must use their own matching C libraries.

However, there is no double word stack pointer alignment requirement either, so in future if the RISC OS port opted to support a 'hardfp' calling convention they would become ABI compatible. Note that since GCCSDK no longer supports AOF object files it's not possible to link objects together even if they follow the same APCS; each would first need to be linked separately, with calls then made from one binary to the other.

h3. The Shared C Library

h4. Client registration

Code which was built to use VFP, and thus follows the new APCS-VFP, will need to call a new library initialisation SWI during startup. One SWI for application clients and one for module clients (cf. SharedCLibrary_LibInitAPCS_32 and SharedCLibrary_LibInitModuleAPCS_32). If the SharedCLibrary is too old the unknown SWI error returned will prevent the application or module from starting.

The C library runtime will [[VFPSupport_CreateContext|create a VFP context]], since there isn't a global context as there is for FPA, and [[VFPSupport_DestroyContext|destroy it]] on library close down when the application or module quits, alongside closing file handles and the other usual exit clear up.

h4. Code which is not a client of the Shared C Library

Code which doesn't use the Shared C Library is responsible for managing its own context. For example BASIC compiled using the ABC compiler uses its own ABCLibrary support module and uses [[VFPSupport]] directly to manage contexts.

h4. Functions affected

A number of the C functions from <math.h> are currently implemented in assembler which uses FPA opcodes, and some are directly inlined by the compiler as FPA opcodes. The bulk of the functions are in "cl_body.s":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/clib/s/cl_body, "cxsupport.s":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/s/cxsupport, "fenv.s":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/s/fenv and "mathasm.s":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/s/mathasm, plus a couple of lines in "longlong.s":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/s/longlong and "k_body.s":https://gitlab.riscosopen.org/RiscOS/Sources/Lib/RISC_OSLib/-/blob/master/kernel/s/k_body.

For ease of maintenance, moving the floating point library code out of cl_body.s would be worthwhile. Each function will need an equivalent implementation in VFP, although some of the more obscure functions could instead be veneered by swapping the word order from VFP to FPA and calling the existing proven code.

Variadic functions such as printf() receive all their arguments in integer registers. This causes an issue for floating point values which are always widened to doubles, since doubles for an FPA client have the word order reversed compared with VFP. By the time the variadic function is called the type information which was to hand at the compiling stage has been lost so this will either require two copies of the variadic functions, or some abstraction at the point va_arg() is used to retrieve a double.
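
The word order conversion itself is small. The following is a minimal sketch in C of the kind of helper a veneer or a @va_arg()@ abstraction might use; the function names are hypothetical and not part of any existing library interface.

```c
#include <stdint.h>
#include <string.h>

/* Exchange the two 32-bit words of a 64-bit double image. */
static uint64_t swap_words(uint64_t bits)
{
    return (bits << 32) | (bits >> 32);
}

/* Convert a double between FPA and VFP word order.
 * The operation is its own inverse, so one helper serves both directions. */
static double swap_double_word_order(double value)
{
    uint64_t bits;

    memcpy(&bits, &value, sizeof bits);    /* copy out the raw bit image */
    bits = swap_words(bits);
    memcpy(&value, &bits, sizeof value);   /* copy the swapped image back */
    return value;
}
```

Applying the helper twice returns the original value, which is convenient when a VFP client calls through a veneer into existing FPA code and then needs to convert the result back.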

h4. Exceptions

Some later variants of the VFP macrocell don't raise an abort for invalid operations such as the square root of a negative number. The compiler may opt to insert checks of the FPSCR after certain operations or on return from a function for example in order to prevent invalid results from silently being propagated. This will need support functions in the Shared C Library to report them to the signal handler in the C environment, so that they may be handled in a controlled manner.
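
As an illustration of what such a support function might test, here is a minimal sketch in C which examines a copy of the FPSCR. The cumulative flag bit positions are those of the architecture; the choice of which flags get reported (@FPSCR_REPORTED@) and the function names are assumptions made for this example only.

```c
#include <stdint.h>

/* Cumulative exception flag bits in the VFP FPSCR. */
#define FPSCR_IOC (1u << 0)   /* invalid operation */
#define FPSCR_DZC (1u << 1)   /* division by zero  */
#define FPSCR_OFC (1u << 2)   /* overflow          */
#define FPSCR_UFC (1u << 3)   /* underflow         */
#define FPSCR_IXC (1u << 4)   /* inexact           */
#define FPSCR_IDC (1u << 7)   /* input denormal    */

/* Hypothetical policy: the flags the library would report as a signal. */
#define FPSCR_REPORTED (FPSCR_IOC | FPSCR_DZC | FPSCR_OFC)

/* Nonzero if the FPSCR value holds a flag that should be reported. */
static int fpscr_should_raise(uint32_t fpscr)
{
    return (fpscr & FPSCR_REPORTED) != 0;
}

/* The value to write back to the FPSCR once the flags have been reported. */
static uint32_t fpscr_clear_reported(uint32_t fpscr)
{
    return fpscr & ~FPSCR_REPORTED;
}
```

A compiler-inserted check would read the FPSCR, call something like @fpscr_should_raise()@, and if it returns nonzero hand over to the library to raise the signal and clear the reported flags.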

h3. Toolchain switches

Selecting an environment using !Builder runs the corresponding "ToolOptions":https://gitlab.riscosopen.org/RiscOS/Library/-/tree/master/ToolOptions for the selected APCS to alias the CC and ObjAsm commands.

Besides CC and ObjAsm the CMHG, resgen, and DefMod commands will need updating to ensure they at least set the VFP area flag in the object file. They don't output floating point instructions themselves of course, but without the matching area flag _many_ link time warnings will be produced when trying to mix object files with conflicting FP calling standards.

h4. ROM based components

The target for a complete operating system ROM is known and fixed, therefore for a ROM build a specific combination of CPU, FPU, and APCS can be selected to take full advantage of the instruction set available.

For example, a Cortex-A15 based system is new enough to have VFPv4 and uses the ARMv7 opcodes. Its memory system permits halfword loads and stores but forbids unaligned loads, so the APCS settings might look like:

bc. CC -apcs 3/32bit/fpregargs/vfp -cpu 7 -fpu VFPv4 -memaccess +L22+S22-L41
ObjAsm -apcs 3/32bit/fpregargs/vfp -cpu 7 --fpu VFPv4

Assuming the other options take their default values.

h4. Disc based components

Any disc based component will need to adopt the lowest common denominator settings for any processor supported by RISC OS which also runs VFP opcodes, potentially under emulation. If support for FPA is also required, the source code will need to be compiled/assembled twice and two binaries produced.

That would mean disc based components could only assume VFPv2 and continue to target a generic ARMv3. Its memory system forbids halfword loads and stores as well as unaligned loads, so the APCS settings might look like:

bc. CC -apcs 3/32bit/fpregargs/vfp -arch 3 -fpu VFPv2 -memaccess -L22-S22-L41
ObjAsm -apcs 3/32bit/fpregargs/vfp -arch 3 --fpu VFPv2

Assuming the other options take their default values.

h3. VFP Emulator

On platforms which predate the VFP coprocessor an attempt to execute VFP instructions causes an undefined instruction abort. This could be used to trap attempts to run code which was written to use VFP and instead emulate it using the integer register set.

The existing [[VFPSupport]] module already contains a partial integer-only floating point engine based on John Hauser's "softfloat":https://gitlab.riscosopen.org/RiscOS/Sources/HWSupport/VFPSupport/-/tree/master/softfloat.

Because all of the steps (abort, decode, emulate, return) are the same as those currently performed by FPEmulator, we would reasonably expect performance to be no worse than emulating FPA. This would mean it would be possible to deliver a single binary to a user and they would be able to run it regardless, much as is done with programs written to use FPA, except that now most users would see a considerable performance boost on platforms containing a VFP macrocell (and no change in performance if not).

A copy of the VFPSupport module would also need to be present in order to provide management of contexts and similar [[VFPSupport SWI Calls|SWIs]] to client programs in addition to the emulator itself.

h4. NEON and the NeVer condition code

All of the Single Instruction Multiple Data (SIMD) operations are fitted into the portion of the instruction space previously used by the @NV@ condition code. Prior to ARMv5 this was treated as a no-operation and would be silently stepped over regardless of the bit pattern in the remaining 28 bits.

This means any platforms prior to ARMv5 could not trap NEON instructions for emulation, therefore any VFP Emulator would be bound by only using the floating point instructions. This is convenient because it removes a great deal of complexity in attempting to emulate the whole of the SIMD operations, and is consistent with the 'lowest common denominator' toolchain switches for disc components covered earlier.
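
The decode step of such an emulator can be sketched in C as below. This is an illustration only: the @insn_class@ type and @classify()@ function are hypothetical, and a real undefined instruction handler would need to do considerably more. It relies on the standard ARM encoding, where coprocessor instructions carry the coprocessor number in bits 8-11 and the NV condition code is the value 15 in bits 28-31.

```c
#include <stdint.h>

/* Hypothetical classification used by this sketch only. */
typedef enum { INSN_OTHER, INSN_VFP, INSN_NV_SPACE } insn_class;

static insn_class classify(uint32_t insn)
{
    uint32_t cond  = insn >> 28;            /* condition code field */
    uint32_t cpnum = (insn >> 8) & 0xF;     /* coprocessor number   */
    int coproc_ldst = ((insn >> 25) & 0x7) == 0x6;  /* LDC/STC/MCRR/MRRC */
    int coproc_data = ((insn >> 24) & 0xF) == 0xE;  /* CDP/MCR/MRC       */

    /* NV space: holds the SIMD operations, never trapped before ARMv5. */
    if (cond == 0xF)
        return INSN_NV_SPACE;

    /* Coprocessors 10 and 11 are the VFP; these are candidates to emulate. */
    if ((coproc_ldst || coproc_data) && (cpnum == 10 || cpnum == 11))
        return INSN_VFP;

    /* Everything else, including FPA on coprocessors 1 and 2. */
    return INSN_OTHER;
}
```

For example @0xEE300A81@ (@VADD.F32 S0,S1,S2@) is classified as @INSN_VFP@, whereas a value with all of the top four bits set, such as @0xF2220844@, falls in the NV space and would be left alone.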

h4. Why not retrofit FPEmulator?

Conversely the FP operations invoked by executing an FPA opcode (such as @ADFS@ to add two single precision numbers) could in theory be sped up by doing the addition using the corresponding VFP operation within FPEmulator (a @VADD.F32@ in this example).

However this concept suffers from a number of drawbacks, negating "some benefit":https://www.riscosopen.org/forum/forums/3/topics/2172?page=5#posts-29897 at the expense of considerable engineering effort:
* As there is no global VFP context, a context would need activating/deactivating around each sequence of FPA opcodes being emulated
* The overhead of taking an instruction abort and decoding the FPA opcode would still be required, which is a significant fraction of the overall time
* It removes some of the incentive to update software to use the preferred APCS-VFP
* The handling of FP exceptions when an illegal result is calculated differs between VFP versions, some do not throw exceptions at all, requiring an amount of post processing of results to faithfully replicate what the FPA would have done
* There is no 80-bit extended precision with VFP, so any 'E' precision ops would still need to be emulated with the integer registers
* There is no 'P' format packed decimal with VFP, so any 'P' precision ops would still need to be emulated with the integer registers
* It requires careful validation to ensure that it is at least as accurate as the previous implementation

h3. Selecting a !RunImage

At a minimum, applications should check for VFPSupport 0.18 or later: attempt to load the module if it is absent, and if a suitable version is still not present refuse to continue, concluding that the system does not support floating point on a VFP coprocessor:

bc. RMEnsure VFPSupport 0.00 RMLoad System:Modules.VFPSupport
RMEnsure VFPSupport 0.18 Error This application requires VFPSupport 0.18 or later to run

Rather than throwing an error, this technique can be extended to fall back to an FPA copy of the application if it is found that VFP isn't available. A typical application directory for _MyApp_ might contain the files shown below.

[[DualApp.png|An application tuned for both FPA and VFP use:pic]]

Notice a second !RunImage is present with a 'V' suffix. The obey file !Run can now automatically select the most appropriate executable:

bc. RMensure SharedCLibrary 0.00 RMLoad System:Modules.CLib
RMensure SharedCLibrary 7.00 Error MyApp needs SharedCLibrary 7.00 or later to run
RMensure FPEmulator 0.00 RMLoad System:Modules.FPEmulator
RMensure FPEmulator 4.09 Error MyApp needs FPEmulator 4.09 or later to run
Set MyApp$Suffix V
RMensure VFPSupport 0.18 Unset MyApp$Suffix
Run <Obey$Dir>.!RunImage<MyApp$Suffix>

Notes:
* While earlier versions of VFPSupport do exist, version 0.18 is the first to contain a full set of transcendental functions so would be considered the minimum required if functions such as @sin()@ or @log()@ have been used
* The check on FPEmulator corresponds to the version from 1998 which was shipped in the RISC OS 4 ROMs
* This sequence assumes the SharedCLibrary will have a major version number bump when the extra APCS-VFP initialisation is integrated
* "Registering":/content/allocate your application _MyApp_ also reserves system variables beginning with _MyApp$_

It is assumed that a softloading version of VFPSupport, which would be required on systems where the VFP opcodes are trapped and emulated by a VFPEmulator, would load the prerequisite VFPEmulator itself during module initialisation. Therefore there is no check on the presence of VFPEmulator required.

h2. Implementation progress

table(bordered).
|_<. Phase                 |_=. Status     |_=. Completion |_<. Latest updates |
|<. Conceptual design      |=. Complete    |=. 100%        |<. 11-Nov-2023 Proposed design filled |
|<. Mock ups/visualisation |=. Complete    |=. 100%        |<. 11-Nov-2023 Example application directory added |
|<. Prototype coding       |=. In progress |=. 5%          |<. 11-Nov-2023 DefMod prototyped to test required ToolOptions changes |
|<. Final implementation   |=. -           |=. -           |<. - |
|<. Testing/integration    |=. -           |=. -           |<. - |

h2. Document history

v1.00 - 26-Dec-2021

* Outline added

v1.01 - 11-Nov-2023

* All sections populated with proposed solution