DivEmulator
Adrian Lees (1349) 122 posts |
Amusing myself for a couple of days, I’ve written a Module akin to the FPEmulator which implements the OPTIONAL ARMv7 SDIV/UDIV hardware divide instructions. It runs on ARMv7 without that hardware, on ARMv5, and likely on all 32-bit OS targets. I shall make the full source and test application available soon, once I’ve done a bit more testing, as a freebie. The point of this post is both to make people aware that a solution to ARMv7-targeted div/non-div binaries exists, and to solicit thoughts on whether this should be extended to other instructions. Of course the DDE tools do not yet take full advantage of the ARMv6/7 extensions, but we should be moving towards that, and I thought such a module could help? |
Theo Markettos (89) 919 posts |
What is the performance hit these days of emulated instructions v library calls? Does that make a soft instruction something you can put up with, or something to avoid at all costs? Given that Cortex A7 and A15 support hardware divide, any feelings for how it would be if code was compiled with that by default? |
Ben Avison (25) 445 posts |
Interesting idea. Also, it occurs to me that since a division by zero still throws an undefined instruction exception on Cortex-A15 and later, presumably you could re-use such a module to throw a more RISC OS like Divide-by-zero error? Though for backward compatibility’s sake I suspect you should really enable the zero case to branch into the C library’s existing cleanup code (or BASIC’s, for that matter). Perhaps there’s a case for adding a new OS_ChangeEnvironment handler, with the default handler throwing the error and the C and BASIC runtimes installing their own? |
Jeffrey Lee (213) 6048 posts |
If we’re adding a new environment handler, then I think we should bite the bullet and make some more fundamental changes to how program environment is handled, e.g. so that system() and friends don’t need to do quite so much low-level work to save and restore their state. Also, a new environment handler won’t work if you’re using DivEmulator on a machine with an older kernel. Although I guess it might be possible for something like the CallASWI module to add support for it. Regarding SDIV/UDIV use in the C compiler, we could easily get some gains for existing code by updating the CLib __rt_udiv/__rt_sdiv functions to use the instructions. Although it is a bit disappointing that those functions require the remainder to be returned. Sure, calculating the remainder is easy, but I’d imagine it would also add an extra stall of a few cycles while it waits for the divide result to be calculated (on Cortex-A7, at least; on A15 I guess it would be hidden by the out-of-order execution). |
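As a rough illustration of the remainder point, here is a minimal C sketch of a combined quotient/remainder routine built on a single divide. The names `udivmod_result` and `udivmod_sketch` are hypothetical, and returning a struct is only a stand-in for however the real CLib routines hand back their two results.

```c
#include <stdint.h>

/* Hypothetical result pair, used here only for illustration. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udivmod_result;

/* Quotient from one divide, remainder from one multiply-subtract.
 * The caller must ensure den != 0. */
static udivmod_result udivmod_sketch(uint32_t num, uint32_t den)
{
    udivmod_result r;
    r.quot = num / den;           /* a compiler targeting hardware divide can emit UDIV here */
    r.rem  = num - r.quot * den;  /* depends on quot, so it has to wait for the divide */
    return r;
}
```

The second statement is where the stall mentioned above would come from: it cannot start until the divide result is available.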
Theo Markettos (89) 919 posts |
CLib stubs is a good idea. I wonder if it’s possible to do the CPU test in the CLib jump table initialisation code, so that the jump table points to the right routine for your architecture. It could also work for ROM linking, if the ROM linker can be made CPU aware. My gut feeling is that the double branch used in calling an SCL function is probably less expensive (thanks to branch prediction) than an undefined instruction trap, though I have no data for that. |
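A minimal sketch of that idea in C, assuming a one-off CPU test whose result is passed in as `has_hw_divide`. All names here (`udiv_entry`, `init_divide_table`, `udiv_longhand`, `udiv_native`) are hypothetical and are not taken from the real CLib sources.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t (*udiv_fn)(uint32_t num, uint32_t den);

/* Longhand shift-and-subtract division for CPUs without UDIV.
 * den must be non-zero. */
static uint32_t udiv_longhand(uint32_t num, uint32_t den)
{
    uint32_t quot = 0, rem = 0;
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((num >> bit) & 1u);  /* bring down the next bit */
        if (rem >= den) {
            rem -= den;
            quot |= 1u << bit;
        }
    }
    return quot;
}

/* Stand-in for a routine built around the hardware instruction. */
static uint32_t udiv_native(uint32_t num, uint32_t den)
{
    return num / den;
}

/* The table entry defaults to the routine that works everywhere. */
static udiv_fn udiv_entry = udiv_longhand;

/* Run once at initialisation; callers then always go through udiv_entry. */
static void init_divide_table(bool has_hw_divide)
{
    udiv_entry = has_hw_divide ? udiv_native : udiv_longhand;
}
```

The test happens once, so every later call costs one indirect branch regardless of which routine was selected.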
Adrian Lees (1349) 122 posts |
From my reading of the ARM documentation, the Cortex-A15 implements the ARMv7-A profile, which does not raise exceptions when a divide-by-zero occurs; instead it is defined as returning a result of zero. Only for ARMv7-R is there a configurable exception. So, my module returns zero too. Modifying CLib would give a transparent gain for older applications, and yes, the check should only be done at initialisation. That would perhaps mostly benefit ported applications, division being frowned upon in the RISC OS/ARM world, generally. Ultimately when the compiler is updated, which I may be interested in tackling at some point since my final year project was writing a C compiler for RISC processors, it would be best to make it use the division instructions, IMHO, because there are hidden costs in calling into the CLib library, such as constraints upon register usage. The idea behind the DivEmulator module was to permit the use of a single binary on all ARMv7-A targets with the instructions used directly. The undefined instruction trap is definitely more expensive than calling into CLib because of the requirement to save/restore context registers and because I haven’t done any clever caching (unlike in Aemulor); it simply fetches, decodes and executes the instruction (as per the FPEmulator). I see no reason that both solutions cannot co-exist. |
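To make the fetch/decode/execute step concrete, here is a hedged C sketch of what the decode and execute part of such an emulator might look like. It is not the module's actual code. The function name `emulate_divide` and the flat `regs[16]` array are assumptions for illustration, and the sketch ignores the condition field and the UNPREDICTABLE use of R15. The bit patterns are the ARM-state SDIV/UDIV encodings from the architecture manual.

```c
#include <stdint.h>
#include <stdbool.h>

/* ARM-state encodings:
 *   SDIV: cond 0111 0001 Rd 1111 Rm 0001 Rn
 *   UDIV: cond 0111 0011 Rd 1111 Rm 0001 Rn
 * Bit 21 distinguishes the two, so it is left out of the mask. */
#define DIV_MASK         0x0FD0F0F0u
#define DIV_MATCH        0x0710F010u
#define DIV_UNSIGNED_BIT (1u << 21)

/* Returns true if insn was SDIV/UDIV and regs[] has been updated;
 * false means the trap should be passed on to the next handler. */
static bool emulate_divide(uint32_t insn, uint32_t regs[16])
{
    if ((insn >> 28) == 0xFu || (insn & DIV_MASK) != DIV_MATCH)
        return false;

    uint32_t rd  = (insn >> 16) & 0xFu;
    uint32_t rm  = (insn >> 8) & 0xFu;   /* divisor */
    uint32_t rn  = insn & 0xFu;          /* dividend */
    uint32_t num = regs[rn], den = regs[rm], result;

    if (den == 0) {
        result = 0;                      /* ARMv7-A behaviour: divide by zero gives zero */
    } else if (insn & DIV_UNSIGNED_BIT) {
        result = num / den;
    } else if (num == 0x80000000u && den == 0xFFFFFFFFu) {
        result = 0x80000000u;            /* INT_MIN / -1 wraps; avoids C undefined behaviour */
    } else {
        /* Assumes the usual two's complement conversion to int32_t. */
        result = (uint32_t)((int32_t)num / (int32_t)den);
    }
    regs[rd] = result;
    return true;
}
```

For example, `0xE730F211` is `UDIV r0, r1, r2`, so with `regs[1] = 100` and `regs[2] = 7` the call leaves 14 in `regs[0]`.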
Adrian Lees (1349) 122 posts |
First version uploaded |
Ben Avison (25) 445 posts |
Well, it makes perfect sense considering they were designed when it was unthinkable that an ARM would have division instructions. If you’re doing division longhand, then the remainder falls out as the value in one of the working registers used in the calculation, so is available with zero cycles cost. Given that it’s quite a common use case that a program requires both results of the division, getting both results from the same call will often halve the time you’d otherwise need. You could add new runtime functions which don’t bother with the remainder, to enable it to be returned a bit quicker on Cortex-A15 and later – but if timing was that critical, I expect you’d probably want to be inlining the instructions instead, wouldn’t you?
Good point. I never noticed that when they extended SDIV and UDIV to the ARMv7-A profiles, they removed the exception behaviour. I see that ARMv8-A follows the ARMv7-A behaviour, so I guess that’s here to stay. It’s a shame because it means that a general-purpose division routine needs to do
    TEQ denominator, #0
    BEQ divide_by_zero
    UDIV quotient, numerator, denominator
which increases the cycle count, whereas an exception (which wouldn’t be taken most of the time) would effectively turn this into a “lazy” check, and would avoid corrupting the flags for good measure. Still, we have no power over ARM’s design decision.
Hmm, the more I think about this, the more it looks like a can of worms. At the moment, the compiler will feel free to take advantage of any instructions introduced in the architecture version up to and including the one specified by the -arch command line option. Yes, at the moment it doesn’t go beyond ARMv6, but there’s a logical extension to v6K, v6T2, v7, v7-with-divide, v8 and so on. We use this switch in ROM builds, where you know the target architecture exactly, but don’t use it for any disc targets. Anyone wanting to use the switch for third-party software is on their own – if they want to include binaries optimised for different architectures, they have to sort out a way of selecting between different binaries themselves. Most developers will just want to leave things at their default settings, where the binary will work on all RISC OS machines. If you know exactly which architecture you’re targeting (as in ROM builds) then you already know whether or not you have divide instructions; either an inline divide instruction or BL __rt_udiv will be the clear winner. In either case, a DivEmulator module is unnecessary. If you’re thinking of a single binary for multiple CPUs, I assume you’re thinking about something soft-loadable. But something compiled with -arch Cortex-A15 could contain any other ARMv7 instructions. While you could potentially trap and emulate all the extra instructions on anything back to ARMv5, before that the “NV” instruction space didn’t generate exceptions and so you’d need to switch to a JIT approach. By their nature these older CPUs would have been slow and really don’t lend themselves to this sort of thing. So maybe you’d argue that the use of division instructions be controlled by an independent compiler switch. 
That’s a bit ugly since there are currently only two degrees of freedom in the API, the -arch switch and the floating point instruction set (controlled by the -apcs switch, though in objasm I adopted the -fpu switch in addition, to match ARM’s fork of the tools). But assume we go with that anyway. Next problem is that code compiled with -use-inline-divides (or whatever it’s called) has a dependency upon the DivEmulator module, but code that doesn’t doesn’t. Following the example of floating point support, the test to load the module would be performed by the stubs. So suddenly you need two separate versions of the stubs, with and without the relevant RMEnsure line. Since there have never previously been separate versions of the stubs, this would need documentation to avoid confusion. To deal with people inevitably not reading the documentation, you’d probably want a link-time check to ensure the correct version is used, which means a new AOF attribute analogous to the halfword attribute, so factor in updating the linker, objasm, cmhg, decaof, ResGen etc too. Now remember to update all the tool frontends and the documentation to match. I’m tempted to wonder, is it really worth the effort? Remember, the C library will be built into ROM on any machine new enough to include UDIV and SDIV, so can unconditionally use the instructions in __rt_udiv and __rt_sdiv. These function calls (in their longhand form) will beat taking an exception on older machines, and anything new enough to have UDIV and SDIV will have sophisticated branch predictors (including function call return prediction) so the function call overhead should be minimal for any soft-loadable software that uses them. |
Sprow (202) 1155 posts |
    TEQ denominator, #0
    UDIVNE quotient, numerator, denominator
    MLSNE remainder, quotient, denominator, numerator
    MOVNE pc, lr
divide_by_zero : handler for that
Can I have a banana for finding a use for MLS? |
Ben Avison (25) 445 posts |
Very good. Though if we’re playing optimisation games (and remember Cortex-A7 has UDIV but is in-order so it’s worth hand-scheduling for it) there will be loads of stalls between the UDIV and the MLS, so you’re better off with
    UDIV quotient, numerator, denominator
    TEQ denominator, #0
    BEQ divide_by_zero ; there will be multiple division routines, they can't all fall through
    MLS remainder, quotient, denominator, numerator
    MOV pc, lr |
Colin (478) 2433 posts |
In that case I suggest
|
Adrian Lees (1349) 122 posts |
Ben: I accept your points, and shall simply leave the code available on my site ‘as is’ for now. It was never intended that it implement all the additional ARMv7 instructions for use on earlier platforms, but just to avoid the – IMHO – very fine-grained distinction between ARMv7-with-divide and ARMv7-without targets for third-party code. It’s an awkward situation, with that minimal extra hardware (division really isn’t all that expensive in a modern CPU) being ‘OPTIONAL’, because clearly the compiler should be able to use the hardware instructions directly. There are additional costs involved in calling into an APCS routine within CLib; for example, the compiler ought really to assume that R0-R3,R12,R14 are corrupted, in addition to likely needing to introduce extra code to move the numerator/denominator into the correct registers, and collecting the quotient. Plus, with the division being in a routine of its own which – currently – computes the remainder too, there is no way to avoid the result latency of the U/SDIV instruction. Perhaps a good alternative would be to introduce* some non-APCS stub entries that implement compiler primitives which may/may not be available as native instructions, with the compiler being aware of exactly which registers may be corrupted, and – as you suggested – there be entries which do not produce the remainder. Then, of course, CLib decides at startup which implementations shall be used for each of these primitives. This would at least remove some of the above objections. *I’m writing this suggestion ‘off-the-cuff’…perhaps the |__rt_…| routines already have interfaces that are more constrained than APCS, or could be redefined as such. |
Richard Keefe (1495) 81 posts |
What would happen under Aemulor if I replaced the divide functions in a still-26-bit app with a div instruction? Is there an approved, lightweight way to confirm at run time whether the div instruction is available? Or to find out the current architecture version? |
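On the detection question, one hardware-level source of this information on ARMv7 is the Divide_instrs field (bits 27:24) of the ID_ISAR0 register. That register is only readable from a privileged mode, so the sketch below assumes the raw value has already been obtained by some means and only decodes it. The names `div_support` and `decode_divide_support` are hypothetical, and this is not a statement of what RISC OS itself provides or approves.

```c
#include <stdint.h>

typedef enum {
    DIV_NONE,           /* no SDIV/UDIV */
    DIV_THUMB_ONLY,     /* SDIV/UDIV in the Thumb instruction set only */
    DIV_ARM_AND_THUMB,  /* SDIV/UDIV in both ARM and Thumb */
    DIV_UNKNOWN         /* field value not covered by this sketch */
} div_support;

/* Decode the Divide_instrs field, bits [27:24], of an ID_ISAR0 value. */
static div_support decode_divide_support(uint32_t id_isar0)
{
    switch ((id_isar0 >> 24) & 0xFu) {
    case 0:  return DIV_NONE;
    case 1:  return DIV_THUMB_ONLY;
    case 2:  return DIV_ARM_AND_THUMB;
    default: return DIV_UNKNOWN;
    }
}
```

Code running in ARM state would need `DIV_ARM_AND_THUMB` before using the instructions natively.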