Taking FP seriously

30 posts, 9 voices

Jan 28, 2017 1:28pm Theo Markettos (89) 919 posts	A side note for the historians… I wonder to what extent the FPA instruction set was influenced by the ISA of the AT&T (Western Electric) FPUs? Sophie states that the ‘fifth member of the four chip set’ was an interface chip to the WE32206. In that kind of chip, in that kind of technology, you don’t want to be doing complicated instruction transformations. The 3B2 assembly manual describes the instruction set (the same FP ISA as the WE32106; the WE32206 also has SIN, COS, ATAN and PI support).

In the case of the podule, there was still an FPEmulator. I wonder how much work it had to do to convert instructions, or whether FPA is basically a clone of the 3B2 Math Acceleration Unit (MAU) architecture. (Sadly I don’t think the car-sized 3B15 depicted shares many features, fun though it would be to say FPA descended from minicomputers.)

Jan 28, 2017 1:55pm Theo Markettos (89) 919 posts	“What I was asking was to compare what was with what is, for example how does ARM rationalise FP operations, given the potential for stalling awaiting a result. Is it more efficient to string together FP instructions, or to interleave FP and ARM so the ARM has something to do while the FP is busy?”

I can’t speak for specific ARM cores, but it depends on what kind of core you have:

In-order: instructions executed one-by-one, no parallelism.
Out-of-order with separate FP and integer pipelines: can dispatch FP, then get on with executing integer until the results come back. You probably want to carefully mix FP and int for best performance.
Superscalar: extracts parallelism from the instruction stream, can execute multiple instructions at once on different units (e.g. several iterations of an unrolled loop).

You’d need to look at the architecture of the specific core you want to use and its instruction timings to get specific information. Only in out-of-order is it likely to matter a lot, though superscalar implementations have their limitations.

In the historical case, the system had separate integer and float units (ARM2 and WE FPU) but the FPEmulator mechanism prevented parallel dispatch. In other contemporary systems you could dispatch an operation to the FP unit and get on with something else on the integer CPU. In the FPEmulator case you can’t make forward progress in the application until you have the FP result. Even if your code was 100% float, you still pay a penalty because you can’t dispatch more than one FP operation at a time.

Basically FP has to be integrated into instruction dispatch inside the core. You can’t farm it out to a third party (hardware or software).
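To put rough numbers on the dispatch argument, here is a minimal sketch of a toy cycle-count model. Everything in it is an assumption for illustration: the function names, the 5-cycle FP latency and the 1-cycle integer cost are made up, data dependencies are ignored, and the "overlapped" case allows only one FP operation in flight at a time. It does not model any real ARM core or the FPEmulator itself.

```python
# Toy cycle-count model; all latencies are made-up illustrative numbers.

def blocking_cycles(ops, fp_latency, int_latency=1):
    """Every op must finish before the next one starts (trap-and-wait style)."""
    return sum(fp_latency if kind == "fp" else int_latency for kind in ops)


def overlapped_cycles(ops, fp_latency, int_latency=1):
    """One FP op may run in the background while integer ops continue.

    A second FP op waits until the FP unit is free, and the sequence is
    only complete once the last FP result has arrived.
    """
    now = 0          # cycle at which the next op can be issued
    fp_free_at = 0   # cycle at which the FP unit finishes its current op
    for kind in ops:
        if kind == "fp":
            now = max(now, fp_free_at)   # stall while the FP unit is busy
            fp_free_at = now + fp_latency
            now += 1                     # one cycle to dispatch the FP op
        else:
            now += int_latency
    return max(now, fp_free_at)


back_to_back = ["fp", "fp"] + ["int"] * 8
interleaved = ["fp"] + ["int"] * 4 + ["fp"] + ["int"] * 4

for name, ops in [("back-to-back", back_to_back), ("interleaved", interleaved)]:
    print(name, blocking_cycles(ops, 5), overlapped_cycles(ops, 5))
```

With these assumed numbers the blocking model costs the same for either ordering, while the overlapped model rewards interleaving the integer work between the two FP operations.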

Jan 28, 2017 2:52pm Jeffrey Lee (213) 6048 posts	I’m not quite sure I agree with Theo’s classifications, but “modern” ARMs definitely make use of multiple, independent pipelines. For example, the venerable old VFP11 (as used in the ARM11) has three separate execution pipelines:

FMAC for basic ALU operations (add, subtract, multiply, etc.). Has a high throughput, so you can continually push instructions into the pipeline with minimal stalling.
Divide/square root. Only has a low throughput (the block diagram shows it using a loop to calculate the result), so you want to avoid executing multiple divide/square root instructions back-to-back.
Load/store pipeline. Throughput will mainly be governed by the memory/cache performance.

It looks like the ARM11 has a couple of pipelines of its own (ALU pipeline and load/store pipeline). So you should be able to have a sequence of mixed integer and FP ALU operations interspersed with (integer/FP) loads/stores and the occasional FP divide/square root without hitting any pipeline stalls, i.e. the CPU will be dispatching instructions at its full speed of one instruction per clock cycle.

ARM continue to evolve the pipeline structures, so for example the instruction dispatcher in the Cortex-A8 feeds NEON instructions into a 16-entry queue in order to help hide any pipeline stalls that short/medium-length NEON instruction sequences might otherwise introduce. Of course the downside of the A8 design is that you’d then get horrible delays if you wanted to transfer values from NEON registers back to ARM registers. So I think that they ditched that idea in their later in-order designs, e.g. the A7. Although I guess the true successor to the A8 would be the horsepower-oriented out-of-order chips like the A9, while the A7 and friends are aimed more at low die space and low power consumption, so there are undoubtedly other factors at work which influence ARM’s choices in pipeline structure.
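The throughput point about back-to-back divides can be sketched with a small single-issue model. This is only an illustration under stated assumptions: the pipeline names mirror the post, but the initiation intervals (in particular 15 cycles for divide) are hypothetical numbers rather than VFP11 datasheet figures, and data dependencies and result latencies are ignored.

```python
# Hypothetical initiation intervals: cycles before a pipeline can accept
# its next instruction. Illustrative only, not VFP11 datasheet figures.
INTERVAL = {"fmac": 1, "div": 15, "ldst": 1, "int": 1}


def issue_cycles(stream):
    """Cycles needed to issue every instruction in order, at most one per cycle."""
    free_at = {pipe: 0 for pipe in INTERVAL}   # when each pipeline accepts work
    now = 0
    for pipe in stream:
        now = max(now, free_at[pipe])          # stall until that pipeline is free
        free_at[pipe] = now + INTERVAL[pipe]
        now += 1                               # single issue: one per cycle
    return now


# Two divides separated by other work, versus two divides back-to-back.
spread_out = ["div"] + ["fmac", "int", "ldst"] * 5 + ["div"]
back_to_back = ["div", "div"] + ["fmac", "int", "ldst"] * 5

print(issue_cycles(spread_out), issue_cycles(back_to_back))
```

Under these assumptions the spread-out stream issues one instruction per cycle, while the back-to-back stream stalls on the second divide until the divide pipeline frees up.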

Jan 28, 2017 9:58pm Chris Evans (457) 1614 posts	“Wasn’t the FPA10 restricted to 25MHz? Was a faster FPA ever released?”

FPA11 was 33MHz and the FPA macrocell in an ARM7500FE was 56MHz!

Feb 19, 2017 2:10am Jeffrey Lee (213) 6048 posts	The original FPA co-processor that was available for the ARM3 implemented all the arithmetic functions in hardware. But for the FPA in the ARM7500FE, there’s no hardware support for the trig or power functions (except square root): POW, RPW, POL, LOG, LGN, EXP, SIN, COS, TAN, ASN, ACS, and ATN are handled by FPEmulator (see section 10.2.3 of the ARM7500FE data sheet).

Correction: In addition to the instructions listed above, LDFP, SDFP, SQT, RMF and RND are handled by FPEmulator (sections 10.1.1 and 10.5). Also the FPA10 didn’t implement everything in hardware; the same set of instructions that ARM7500FE left unimplemented were also unimplemented in FPA10 (datasheet available from Chris’s Acorns). It looks like, from a functional perspective, FPA10, FPA11 and ARM7500FE are actually identical (the ARM7500FE datasheet mentions the FPU is basically just an FPA11, and the instruction timings and 90% of the manual text are identical to FPA10).

So I’m guessing it’s the WE32206 podule which implemented everything in hardware (with a little help from FPEmulator to translate the instructions), apart from the LFM/SFM instructions which were invented for the FPA hardware. I’d never heard of the WE32206 podule until Theo mentioned it, so this has been a bit of an eye-opener for me.
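For anyone wanting to check a routine against that list, the split described in the post can be written down as a simple lookup. This is a sketch only: the mnemonics are copied exactly as posted (not re-checked against the data sheets), and the names `EMULATED` and `needs_emulation` are hypothetical, not part of any RISC OS or FPEmulator API.

```python
# Mnemonics the post says are handled by FPEmulator rather than by the
# FPA10/FPA11/ARM7500FE hardware. Spellings are kept exactly as posted.
TRIG_AND_POWER = {
    "POW", "RPW", "POL", "LOG", "LGN", "EXP",
    "SIN", "COS", "TAN", "ASN", "ACS", "ATN",
}
FROM_CORRECTION = {"LDFP", "SDFP", "SQT", "RMF", "RND"}
EMULATED = TRIG_AND_POWER | FROM_CORRECTION


def needs_emulation(mnemonic):
    """True if the post lists this mnemonic as handled by FPEmulator."""
    return mnemonic.upper() in EMULATED


print(needs_emulation("sin"), needs_emulation("ADF"))
```

A `False` result only means the mnemonic is not on the posted list; it is not a guarantee that the hardware executes it in every case.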
