[Cortex] Unaligned loads/stores
Jeffrey Lee (213) 6048 posts |
Over on Drobe Adrian Lees has suggested that we enable alignment checking in CP15 and use an exception handler to emulate ARMv5 unaligned load behaviour, in order to provide compatability with existing code. What are ROOL’s (and everyone else’s) thoughts on this? Obviously it would cause some problems if we want to use ARMv6 unaligned loads anywhere (or ARMv5 unaligned loads due to performance reasons), but I’m not sure if there’s much benefit to unaligned loads anyway on hardware that supports LDRH/STRH. |
Ben Avison (25) 445 posts |
And just when I’d finished putting build options in all the places in the ROM which used LDMs to facilitate unaligned loads to use LDR instead on ARMv6. What a can of worms. :( Trapping unaligned loads is indeed a possibility, and will give us better ARMv5 compatibility. (But we’ll never fully achieve that: things like ALU operations that target the PC effecting instruction set switches isn’t something we can trap, for a start.) But Adrian’s right about the compiler and halfword loads – there will be a substantial body of binary code out there that relies on pre-ARMv6 style unaligned loads for this. Having said that, on performance grounds, it’s very undesirable to do that. Running through the data abort system is a huge overhead at the best of times, and would be particularly painful for things like graphics renderers and video codec motion compensation routines, which will often make very heavy use of such instructions, and where more often than not what you actually want is the unaligned access which the hardware is now capable of giving you natively. Even aiming for architecture neutrality by replacing unaligned LDRs with a macro to LDM and then shift the two registers together is a huge performance hit (I estimate 10-20 times slower than a native unaligned LDR) on Cortex due to its pipeline structure, so it is to be avoided whenever possible. At the moment, the compiler appears to do:LDR Rd, (address EOR 2) MOV Rd, Rd, ASR/LSR #16 which loads the containing 32-bit word in a manner such that the required 16 bits are loaded to bits 16-31 of Rd, then it shifts them down to bits 0-15, sign-extending if appropriate. If you tell the compiler you’re compiling for a target CPU of v4 or later, it already knows to use LDRH or LDRSH instead. So what needs to be done? I propose
LDR Rd, (address AND NOT 2) IF (address AND 2)=0 MOV Rd, Rd, LSL #16 ENDIF MOV Rd, Rd, ASR/LSR #16 We haven’t got anyone at ROOL with much experience of changing the compiler, and I’ve a horrible feeling it’s going to end up being me having to make this change, by default… |
Ben Avison (25) 445 posts |
Of course, I was forgetting the Risc PC address bus prohibits using LDRH/STRH reliably as well… OK, the question is, how many developers doing Beagle builds are not using one of a) an Iyonix, b) a StrongARM Risc PC emulator or c) a Kinetic Risc PC as their build machine? |
Peter Naulls (143) 147 posts |
Maybe none. But if the question is one of native vs target tools, then that’s some thing definitely worth addressing. Someone might want to do for example an Iyonix build on RiscPC (kind of slow, but possible) or A9, where the native is ARMv4 and the target is ARMv5. I appreciate that this might be a non-trivial change to the build. |
Theo Markettos (89) 919 posts |
I don’t know about specifically Beagle developers, but I imagine the SA RPC is the most common machine out there by several orders of magnitude. So it might be a bit dangerous to start barring swathes of your target market if you can possibly avoid it. A low barrier to entry is what the shared source project is all about. As Peter says, fixing the build system would probably seem to make more sense. And would be something that needs doing in any crosscompile environment in any case. (For the record, I’m not a Beagle developer but have a SA RPC as my only build machine) On the unaligned loads issue, I always assumed this behaviour was a bit of a side effect and not to be relied upon. Would it be possible to somehow log problems so that they can be fixed later? So if you have a fixup routine on an abort, have a mode that logs the location of the instruction (‘module TrackerPlayer+&17C0’) so someone so-motivated can go in and change it. |
Jeffrey Lee (213) 6048 posts |
I’m starting to think it’s about time the 32bit compatability flag in modules/AOFs was expanded to include more information about the architecture features required to run the code, similar to the flags in Hdr:CPU.Arch. That way the OS will be able to much more easily determine whether code is safe to run or not, or whether emulation of old features is required. Of course it won’t help old OS’s detect new code very easily – the code would have to have some way of checking the machine’s suitability for itself, even if it’s as simple as calling a new SWI that performs the check using the information contained in the file header. This work could be made part of the work that’s being done to help code identify which OS fork features are available, thus killing two birds with one stone. Also it’s probably about time that the build system was updated to seperate the host compiler settings from the target! |
Steve Revill (20) 1361 posts |
An LDR (or STR) can take an address at any byte alignment in ARM v7 and it will do the right thing, unless the CPU has been (software) configured to abort unaligned loads. This isn’t “implementation defined” behaviour; it’s in the ARM v7 ARM. Thus, unaligned loads become one instruction on the Beagle, which is very handy for codecs (and BASIC indirection operators!). |
John-Mark Bell (94) 36 posts |
While it strikes me as being seriously awkward that the behaviour for unaligned addresses has changed, it seems sensible to minimise the amount of legacy behaviour we retain support for. With hindsight, the move to 32bit could have been more brutal in the kinds of features that suddenly became unsupported. I think, therefore, some kind of hybrid of Ben’s USR-mode abort handling and Jeffrey’s flags word extension is probably the most sane. With such a scheme, old applications should run without issue, and new applications should avoid any associated performance hit. This gives us the best of both worlds. Alternatively, new OS versions simply refuse to load applications and modules that don’t claim to be ARMv7 compatible and old application support is left up to legacy support code (such as an extended Aemulor). |
Adrian Lees (168) 23 posts |
The rotation behaviour undoubtedly comes from the implementation of LDRB, which needs the 4:1 MUXing on the 8 LSBs of the input data path anyway. Having that logic replicated on the other 4 bytes achieved greater flexibility with minimal extra hardware. I’m not really sure what the best solution is w.r.t. enabling/disabling the handling of unaligned loads; if modified OS code/SVC module code relies upon the new behaviour, then it seems that we’d have to not only decide on a task-by-task basis, but also have a range check in the exception handler that can work out whether the aborted instruction expects old or new behaviour. Martin Wuerthner has already reported that the ArtWorks modules presently rely upon the old behaviour, and there are undoubtedly others; remember many modules are compiled C too. Either way, provided that the dev tools can be changed not to rely upon the old behaviour, I think the exception handling should be in a separate module, not the OS core itself. I have most of the relevant code (fairly well optimised) from Geminus work, having implemented R/B colour swapping for 16bpp modes, for example. Having ARMv7-compliant apps and modules flagged as such would seem to make sense. My concern with this change of behaviour is that it will lead to far more subtle failures than the transition from 26- to 32-bit PC, and it will not be obvious to users that an application does need the exception handling enabled; it’d be better if this could happen automatically. |
Ben Avison (25) 445 posts |
It’s interesting to compare the situation with the 32-bit transition. Just looking at function call returns, all APCS-R applications and modules, and roughly half of function returns written in assembler, were not 32-bit compatible, and would fail catastrophically, often due to processor mode switches. By comparison, the unaligned load change may have no effect at all on many modules and applications – and when it does, as Adrian says, the effect can be subtle. Just loook at how functional the desktop already is on Beagle for evidence of this. There may be some value in Jeffrey’s suggestion of including bitfields in a module header, similar to Hdr:CPU.Arch, which says which architecture options are and are not supported by the binary – this would save duplicate architecture-check code in each module’s initialisation code. I still think that disc-based software should strive to support a reasonably wide range of CPUs if not all of them: we don’t have anything like the Apple concept of universal binaries, so if you’re going to have multiple binaries targetted for specific architectures, you’re going to have to hold them in separate files on disc and select the correct one at run time. But module flags and AIF/ELF flags can’t address the wider problem that we have – namely that they can only ever tell you about architecture options that were known at the time that it was built. The 32-bit change was special in that 99% of existing C and assembler modules and applications were going to break, so refusing to run anything that didn’t have the 32-bit flag set was a reasonable precaution. What we have here is less clear-cut: if we add a new flag to say “I’m aware of the new unaligned load behaviour” and require anything that we run to have set the flag, then we’ll be shutting the door on a lot of software that would otherwise work. I wonder how practical it would be to develop a set of heuristics to identify problematic modules and applications? For example, searching forLDR{<cond>}{T} Rd,[Rn,#offset_with_bit_1_set]{!}then searching up until the next branch, instruction that loads PC, or other instruction that writes to Rd, to see whether any intervening instructions with the matching <cond> do Rd, LSR #16 or Rd, ASR #16 . If a module/application had any such instructions and didn’t have a flag set to say “I know what I’m doing with unaligned loads” then we could refuse to run it – or maybe kick off some sort of Aemulor-style environment.
One other thing occurs to me – I think it would be valuable for developers to have a build switch for the Beagle OS to change it so that it doesn’t itself do any unaligned loads. This would enable the developer to globally enable the abort behaviour, to help them track down all the instances of unaligned loads in their own code. At this early stage, it would be fairly easy to add. |
Ben Avison (25) 445 posts |
After getting some surprising results from some unaligned load tests I’ve been doing on a Beagle, I’ve spent the last few hours staring at the kernel abort handler. It contains a huge mass of code whose purpose is initially obscure, but I think its main purpose is offer to a series of handlers the opportunity to patch up any data aborts. In particular I think it was used for VIDC1 emulation on VIDC20 machines. Among its complexities, it attempts to manage LDMs/STMs that extend across the boundary between aborting and non-aborting pages. That part of the load from the non-aborting page is effectively achieved via a software emulation – so it obeys all the word-alignement, rotation etc rules that you would expect from a pre-v6 CPU. This has the confusing implication that if you turn on unaligned aborts for LDR/STR on a Beagleboard (and also if you do an unaligned LDM/STM, which aborts either way on ARMv6+) then the kernel is already faking pre-ARMv6 behaviour for you, because it assumes that the instruction aborted because it partially accessed an aborting page! However, not all is good. As discussed above, I don’t think this is desirable long-term. Also, the implementation is strictly limited to LDRB, LDR, LDM, STRB, STR and STM instructions. LDRSB, LDRH, LDRSH, LDRD, LDREXB, LDREXH, LDREX, LDREXD, LDF, LFM, VLDR, VLDM, VLD, STRH, STRD, STREXB, STREXH, STREX, STREXD, STF, SFM, VSTR, VSTM, VST and even SWP and SWPB are actually handled wrongly. It can’t cope with Thumb code at all, let alone Thumb-2, Jazelle or ThumbEE. I’m not even sure that it’s aware of the changes to page table formats in ARMv6 (I’m pretty certain it never learned about the short-lived 1K pages). It’s telling that the lazy task swapping code does not use this mechanism – presumably because it’s so incomplete. I’d be tempted to drop the code altogether (in fact I think that’s the right course in the immediate short term) except that we’re going to want to add back in some patch-up code for legacy applications. Instead, I propose to create a vector to be called from the abort code, which would provide an integer register dump, information about whether unaligned loads are natively rotated or unaligned, the (late/restored) abort model, page table variant, and fault address and status registers, then leave an external module to decode the aborting instruction and page tables. The return code for the vector could indicate whether to a) restart the aborting instruction, b) reload from the integer register dump and resume the instruction after the aborting instruction, or c) call the data abort environment handler. The external module could then be written in C, making it hopefully more legible, extendable and maintainable than the existing assembler code. It would also enable a clean API for registering handlers: at present you have to add a link to a linked list at a magic address in the kernel workspace. (Bleurgh.) However, that’s still a fairly major step, so I thought I’d solicit opinions first. |
Jeffrey Lee (213) 6048 posts |
That wouldn’t surprise me. I’ve often seen the data abort handler get stuck in a loop by causing an abort inside itself (according to the DebugAborts output at least). But each time it always seemed to be something else causing the initial abort, e.g. dodgy lazy task swapping, so I’ve never actually looked into it.
There is OS_ClaimProcessorVector, but I guess it isn’t appropriate in this instance if you want to save the claimant from having to decide for himself what the abort model is, the page table location and format, etc. Taking the load/store fixup code out of the kernel does sound like a good idea, if only for the fact it allows the fixup code to be rewritten in a higher-level language. |
Jeffrey Lee (213) 6048 posts |
It looks like Castle already anticipated some of the potential problems with unaligned loads/stores: http://www.iyonix.com/tools/halfword.shtml So with ’-cpu 4 -memaccess +L22+S22-L41’ inserted somewhere sensible (e.g. Library.ToolOptions.APCS-32 for now) we should be able to force the C compiler to use LDRH/STRH where appropriate, and never ever use unaligned loads/stores. I’ll give this a go tonight and see what happens. |
Jeffrey Lee (213) 6048 posts |
After looking at some disassembly I can report that it certainly seems to convince the compiler to use LDRH/STRH wherever it should. But unfortunately the change doesn’t seem to have fixed any of the dubious bugs I know of (connect/disconnect messages from USB hubs, Territory_SetTime/OS_Word 14/15 not setting the correct time, command-line zip exiting because it’s been ‘interrupted’). So either it isn’t a complete solution or all of the above bugs are caused by something else (Territory_SetTime in particular, since that’s likely to be plain assembler). Also, for anyone looking to implement this change themselves:
|
Ben Avison (25) 445 posts |
Well spotted Jeffrey, that’s a compiler switch I wasn’t aware of. I suspect it will have been added – like most changes of that era (about the time C99 support was added) – to maintain compatibility with ARM’s ADS compiler, which shares a common heritage with the Acorn C compiler. In fact, there also seems to be a more cryptic way of specifying the same thing using the switch -zh2(+|-){3}. As I think I mentioned before, specifying -cpu 4 in ToolOptions will cause problems for people who want to build Beagle ROMs on Risc PCs. It appears to be permissable for -memaccess to be specified without -cpu, so for now I suggest just the option -memaccess -L22-S22-L41 for maximum compatibility. Since the compiler’s -cpu switch is nominally supposed to mean that the generated code will run on the specified architecture or any later version, I propose to change the default -memaccess setting in the next version of the compiler to +L22+S22-L41 (note that unless you specify -cpu 4 or later, it won’t try to use LDRH/STRH so the first two options are meaningless). Despite this, ToolOptions will need to change, at least for Beagle builds, to support people who haven’t upgraded to the latest compiler. The thing is, it would be a shame to turn off this optimisation for (say) Iyonix ROM builds, which would benefit from this (particularly because LSL #16 followed by LSR #16 tends to cause a pipeline stall on XScale), and where the ARMv6 incompatibility is a non-issue. But ToolOptions is currently switched based upon APCS, and arguably it needs a more advanced switch, like the ones I recently created in Hdr:CPU.Arch. But ToolOptions is an obey file, not an assembler source file. |
Ben Avison (25) 445 posts |
For those that haven’t been monitoring the CVS changes page, I’ve updated the default compiler options for OMAP3 builds to stop it trying to use pre-ARMv6 unaligned loads. I’ve also made the relevant changes to makefiles so that the tools built to run on the build host during the build use different compiler options from those for the target, which allowed me to change the default compiler options for OMAP3 ROM builds to use v4, v5 and v6 instructions. Having done this, I was able to scan the ROM image for remaining unaligned loads – and I found quite a lot of them, mostly mistakes. I’ve fixed them all (hence the flurry of commits), so running with unaligned data aborts enabled (for performing further bugfinding) is probably a practical possibility now. Anyone trying this should be sure to set the NoUnaligned option in Hdr:Machine.Machine to ensure that any optimisations that rely on ARMv6-style unaligned loads are disabled. |
Jeffrey Lee (213) 6048 posts |
Excellent! |
Steve Revill (20) 1361 posts |
I think this does mean that you need a reasonably recent C Tools release in order to build for Beagle now – one with which cc includes the ARMv6 stuff. Is that correct? |
Ben Avison (25) 445 posts |
Oh, did you have to find the flaw in my otherwise brilliant plan? Yes, anyone who’s compiling the ROM using a crusty old copy of Norcroft that doesn’t support the -cpu 6 switch won’t get very far. At least anyone in this situation only needs to be change one place – RiscOS.Library.ToolOptions.APCS-32. But be warned, if your compiler doesn’t at least support the -memaccess switch then you won’t be able to build a reliable Beagleboard ROM (though you’ll have more luck with Iyonix or IOMD ROMs). |
Jeffrey Lee (213) 6048 posts |
I’ve just tried grabbing a clean copy of the OMAP source tree from CVS, and it looks like it’s slightly broken when compiling on RISC OS 5.12/Norcroft 5.65. SpriteExtend is failing to build, because at several points s.putscaled literally contains the string ‘Undefined instruction’ instead of an assembler opcode! From looking at the makefile I can see that s.putscaled is generated by Norcroft, so I’ll leave it to you to work out exactly why Norcroft is capable of generating code that it can’t describe :) [edit] The source lines it breaks on are 265, 274, 301, 462, 771, 773, 778, 864, and 871. I’ll try getting it to assemble to an AOF and see if I can track down the opcodes it can’t produce a listing for. |
Jeffrey Lee (213) 6048 posts |
According to StrongEd (which I guess uses the Debugger module), the undefined instructions are: E6FF0070 E6FF0071 E6FF2070 E6BF1072 E6BF3071 E6BF1071 E6BF2072 E6FF0072 E6FF0072 ...which looks to be SXTAH and UXTAH. |
Ben Avison (25) 445 posts |
SXTH and UXTH actually (Rn=15 means no accumulator is used). When I cleverly worked around objasm not understanding SXTH and UXTH, I was forgetting that unless you have the latest version of CC, although it has code generation for ARMv6 instructions, the built-in disassembler doesn’t know how to disassemble them. Sorry about that. The effect is that you need CC 5.67 to compile a version of SpriteExtend that uses ARMv6 instructions. You can either downgrade your entire build to only use ARMv5, or override the -cpu switch in SpriteExtend’s makefile (but remembering that v4 and v5 instructions need to be disabled for IOMD builds). Ugh, this is messy. |
Ben Avison (25) 445 posts |
Or perhaps a sed script to replace the “Undefined instruction” lines with the appropriate UXTH or SXTH instruction, since there are so few of them to worry about in this case. |
Jeffrey Lee (213) 6048 posts |
Or perhaps an answer to the question of whether cheap/free upgrades for users of the Castle C/C++ tools are going to be made available :) |
Jan Rinze (235) 368 posts |
I am surprised the CC does not convert the Undefined instruction per default to ‘DCI &xxxx ; Undefined instruction.’ which would at least have kept it working for Ben. Also that would make it easier if there would be more instructions that the disassembler can’t understand. So that way the dissasembler allows for unknown code if necessary.. (if possible ofcourse.) Is there a AOF disassembler that can do this separately so that the disassembling step can be done with older compilers. And possibly patching methods? |