Taking FP seriously
Pages: 1 2
Theo Markettos (89) 919 posts |
A side note for the historians… I wonder to what extent the FPA instruction set was influenced by the ISA of the AT&T (Western Electric) FPUs? Sophie states that the ‘fifth member of the four chip set’ was an interface chip to the WE32206. In that kind of chip, in that kind of technology, you don’t want to be doing complicated instruction transformations. The 3B2 assembly manual describes the instruction set (the same FP ISA as the WE32106; the WE32206 also has SIN, COS, ATAN and PI support). In the case of the podule, there was still an FPEmulator. I wonder how much work it had to do to convert, or whether FPA is basically a clone of the 3B2 Math Acceleration Unit (MAU) architecture. (Sadly I don’t think the car-sized 3B15 depicted shares many features, fun though it would be to say FPA descended from minicomputers.) |
Theo Markettos (89) 919 posts |
I can’t speak for specific ARM cores, but it depends on what kind of core you have:

In-order: instructions executed one-by-one, no parallelism

You’d need to look at the architecture of the specific core you want to use, and its instruction timings, to get specific information. Only in out-of-order cores is it likely to matter a lot, though superscalar implementations have their limitations.

In the historical case, the system had separate integer and float units (ARM2 and WE FPU), but the FPEmulator mechanism prevented parallel dispatch. In other contemporary systems you could dispatch an operation to the FP unit and get on with something else on the integer CPU. In the FPEmulator case you can’t make forward progress in the application until you have the FP result. Even if your code was 100% float, you would still pay a penalty, because you can’t dispatch more than one FP operation at a time. Basically, FP has to be integrated into instruction dispatch inside the core; you can’t farm it out to a third party (hardware or software). |
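To put rough numbers on the dispatch point, here is a toy cycle-count model. It is only a sketch under loud assumptions: the `Op` type, both function names and all latencies are invented for illustration, data dependencies are ignored, and the FP unit is treated as unpipelined. It compares a machine where every operation must finish before the next starts (the FPEmulator-style case) with one where the integer and FP units run independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str     # "int" or "fp" (hypothetical labels)
    latency: int  # cycles the op occupies its unit (made-up values)


def cycles_blocking(ops, trap_overhead=0):
    """Each op must complete before the next one starts.

    trap_overhead is an assumed extra cost added to every FP op,
    standing in for the emulator's trap and decode work.
    """
    return sum(op.latency + (trap_overhead if op.kind == "fp" else 0)
               for op in ops)


def cycles_overlapped(ops):
    """In-order issue, at most one op per cycle, independent int and FP units.

    An op waits only while its own unit is busy. Dependencies between
    ops are not modeled, so this is a best case.
    """
    free_at = {"int": 0, "fp": 0}  # cycle at which each unit becomes free
    next_issue = 0                 # earliest cycle the next op may issue
    finish = 0
    for op in ops:
        start = max(next_issue, free_at[op.kind])
        free_at[op.kind] = start + op.latency
        finish = max(finish, free_at[op.kind])
        next_issue = start + 1
    return finish


stream = [Op("fp", 10), Op("int", 1), Op("int", 1), Op("int", 1), Op("fp", 10)]
print(cycles_blocking(stream))    # 23
print(cycles_overlapped(stream))  # 20
```

In this model the three integer ops hide behind the first FP op, so only the two FP latencies remain. Because the FP unit here is unpipelined, a stream of pure FP ops costs the same either way; showing the 100% float penalty would need a pipelined FP unit or a non-zero `trap_overhead`.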
Jeffrey Lee (213) 6048 posts |
I’m not quite sure I agree with Theo’s classifications, but “modern” ARMs definitely make use of multiple, independent pipelines. For example, the venerable old VFP11 (as used in the ARM11) has three separate execution pipelines: a multiply-accumulate pipeline (FMAC), a divide/square root pipeline (DS) and a load/store pipeline (LS).
It looks like the ARM11 has a couple of pipelines of its own (ALU pipeline and load/store pipeline). So you should be able to have a sequence of mixed integer and FP ALU operations interspersed with (integer/FP) loads/stores and the occasional FP divide/square root without hitting any pipeline stalls. I.e. the CPU will be dispatching instructions at its full speed of one instruction per clock cycle. ARM continue to evolve the pipeline structures, so for example the instruction dispatcher in the Cortex-A8 feeds NEON instructions into a 16-entry queue in order to help hide any pipeline stalls that short/medium-length NEON instruction sequences might otherwise introduce. Of course the downside of the A8 design is that you’d then get horrible delays if you wanted to transfer values from NEON registers back to ARM registers. So I think that they ditched that idea in their later in-order designs, e.g. the A7. Although I guess the true successor to the A8 would be the horsepower-oriented out-of-order chips like the A9, while the A7 and friends are aimed more at low die space and low power consumption, so there are undoubtedly other factors at work which influence ARM’s choices in pipeline structure. |
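The "no stalls as long as you mix the work" idea can be sketched with a tiny in-order, single-issue model. Everything here is an assumption made for illustration: the pipeline labels, the initiation intervals (how many cycles before a pipeline accepts another instruction) and the function names are invented, not taken from any ARM manual, and data dependencies are ignored.

```python
# Hypothetical initiation intervals, in cycles. Illustrative values only.
ISSUE_INTERVAL = {
    "int_alu": 1,
    "load_store": 1,
    "fp_alu": 1,
    "fp_divsqrt": 15,  # assumed: divide/square root blocks its pipeline
}


def issue_cycles(stream):
    """Return the cycle at which each instruction issues.

    The core issues in order, at most one instruction per cycle, and
    stalls only while the target pipeline cannot accept a new instruction.
    """
    ready = {pipe: 0 for pipe in ISSUE_INTERVAL}
    cycle = 0
    issued = []
    for pipe in stream:
        cycle = max(cycle, ready[pipe])  # wait for the target pipeline
        issued.append(cycle)
        ready[pipe] = cycle + ISSUE_INTERVAL[pipe]
        cycle += 1
    return issued


def stall_cycles(stream):
    """Cycles lost compared with issuing one instruction every cycle."""
    if not stream:
        return 0
    return issue_cycles(stream)[-1] + 1 - len(stream)


mixed = ["fp_alu", "int_alu", "load_store", "fp_divsqrt",
         "fp_alu", "int_alu", "load_store", "fp_alu"]
print(stall_cycles(mixed))                         # 0
print(stall_cycles(["fp_divsqrt", "fp_divsqrt"]))  # 14
```

With these made-up numbers the mixed stream issues one instruction per cycle, while two back-to-back divides stall because the second has to wait for the divide pipeline to free up.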
Chris Evans (457) 1614 posts |
FPA11 was 33MHz and the FPA macrocell in an ARM7500FE was 56MHz! |
Jeffrey Lee (213) 6048 posts |
The original FPA co-processor that was available for the ARM3 implemented all the arithmetic functions in hardware. But for the FPA in the ARM7500FE, there’s no hardware support for the trig or power functions (except square root): POW, RPW, POL, LOG, LGN, EXP, SIN, COS, TAN, ASN, ACS, and ATN are handled by FPEmulator (see section 10.2.3 of the ARM7500FE data sheet).

Correction: In addition to the instructions listed above, LDFP, STFP, SQT, RMF and RND are handled by FPEmulator (sections 10.1.1 and 10.5). Also, the FPA10 didn’t implement everything in hardware; the same set of instructions that the ARM7500FE left unimplemented were also unimplemented in the FPA10 (datasheet available from Chris’s Acorns). It looks like, from a functional perspective, FPA10, FPA11 and ARM7500FE are actually identical (the ARM7500FE datasheet mentions the FPU is basically just an FPA11, and the instruction timings and 90% of the manual text are identical to FPA10).

So I’m guessing it’s the WE32206 podule which implemented everything in hardware (with a little help from FPEmulator to translate the instructions), apart from the LFM/SFM instructions, which were invented for the FPA hardware. I’d never heard of the WE32206 podule until Theo mentioned it, so this has been a bit of an eye-opener for me. |