[Cortex] Unaligned loads/stores
Jan Rinze (235) 368 posts |
Come to think of it… the macro solution for the ARMv6 instructions could be put in some nice header file. That way all assembler code would be able to use it, and even compile correctly with older objasm. Maybe even some C header files to use the macros on older C compilers :-) |
Ben Avison (25) 445 posts |
Hadn’t spotted that – I’m not accustomed to checking for comments in the news pages (we get so few). Thanks for pointing it out. As to upgrades, we haven’t finalised how to handle them. We don’t have lists of previous owners of the tools so it’ll probably involve people snailmailing their Castle C/C++ CDs to us as proof of purchase. There hasn’t really been any clear need for anyone to upgrade until this build issue became apparent, so we haven’t worked out a pricing policy yet – I’ll have to ask you to be patient until we sort it out.
That’s not a bad idea. In fact I may sneak that into the next version of the disassembler (shared by CC and decaof), thanks.
Well there’s decaof – but that uses the same library as CC and so has the same issue – and is under the same distribution restrictions. There might be an equivalent tool in GCCSDK if you’re lucky.
Theoretically yes, with sufficiently complicated macros. But the parsing could get quite complex, especially since a lot of the instructions don’t fall into the category of opcodes where all the parameters are register numbers (these can simply be shifted and ORed with a constant to create the instruction). Difficult to maintain, and potentially very slow.
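For the simple register-field case, the arithmetic is trivial – here’s a rough C illustration (function name made up), using REV as an example ARMv6 instruction. The base constant is the REV encoding with the AL condition, and rd/rm are plain register numbers:

  #include <stdint.h>

  /* Build the 32-bit encoding of REV rd,rm by ORing the register numbers
     into the fixed bits: Rd goes in bits 15:12, Rm in bits 3:0. */
  uint32_t encode_rev(uint32_t rd, uint32_t rm)
  {
      return 0xE6BF0F30u | (rd << 12) | rm;  /* e.g. REV R0,R1 = &E6BF0F31 */
  }

The same shift-and-OR trick works in objasm macros via :SHL: and :OR: feeding a DCI – it’s everything else that gets messy.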
I wish – sadly, inline assembler in C files can’t just accept random DCI opcodes, the compiler needs to do register allocation, peepholing, scheduling and so on after the stage where inline assembler is inserted into the flowgraph, so it needs at least some understanding of what each instruction does. |
Terje Slettebø (285) 275 posts |
Hi all. I’ve read this thread, and searched the net, as well as checked the ARM ARM (but I only have the ones for the latest ARM versions), but I haven’t found the answer to this: Could someone please tell me what the change in unaligned access is, and at what ARM version it changed? Regards, Terje |
Jeffrey Lee (213) 6048 posts |
The easiest way to explain it is probably via code. On ARMv5 and below, “LDR R1,[R0]” is essentially implemented as follows:

  BIC temp,R0,#3
  LDR R1,[temp]
  AND temp,R0,#3
  MOV temp,temp,LSL #3
  MOV R1,R1,ROR temp

I.e. it ignores the bottom two bits of the address, and then rotates the result so the data is in kinda the right place. I believe this behaviour was originally just a side-effect of the LDR/STR circuitry being shared with the LDRB/STRB circuitry (since the LSB always ends up containing the byte pointed to by R0).

For ARMv7 it’s implemented as follows:

  LDRB R1,[R0]
  LDRB temp,[R0,#1]
  ORR R1,R1,temp,LSL #8
  LDRB temp,[R0,#2]
  ORR R1,R1,temp,LSL #16
  LDRB temp,[R0,#3]
  ORR R1,R1,temp,LSL #24

I.e. 4 sequential bytes are loaded and packed into R1.

For ARMv6 things are a bit more complicated; I believe that some architecture versions allowed you to control which behaviour was used (via the ‘U’ bit of the system control register), while other versions were fixed in the ‘new’ behaviour.

For STR, the old behaviour is to just treat the lowest two bits of the address as being 0. For the new behaviour, the data is stored to sequential memory locations, mirroring LDR.

For LDM/STM it’s always been the case that LDM/STM treat the lowest two bits of the address as being 0. But for ARMv7 (and perhaps ARMv6?) any attempt to LDM/STM with an unaligned address will cause an abort, irrespective of the setting of the ‘alignment exceptions’ bit in the system control register.

That’s the basics of it; if you want to know the specifics of all the other memory access instructions then it looks like section A3.2 (“Alignment support”) and appendices G and H (“ARMv6 Differences”, and “ARMv4 and ARMv5 Differences”) of the ARMv7 ARM contain most of the required information.
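To make the rotation concrete, here’s a little C model of the two behaviours (just a sketch, little-endian assumed – not how the hardware actually does it):

  #include <stdint.h>
  #include <string.h>

  /* Old (<=ARMv5) behaviour: aligned load, then rotate right by 8*offset */
  uint32_t old_ldr(const uint8_t *mem, uint32_t addr)
  {
      uint32_t word, rot = (addr & 3) * 8;
      memcpy(&word, mem + (addr & ~3u), 4);
      return rot ? (word >> rot) | (word << (32 - rot)) : word;
  }

  /* New (ARMv7) behaviour: four sequential bytes packed into the result */
  uint32_t new_ldr(const uint8_t *mem, uint32_t addr)
  {
      return mem[addr] | (mem[addr+1] << 8) | (mem[addr+2] << 16)
             | ((uint32_t)mem[addr+3] << 24);
  }

So with memory containing the bytes 11 22 33 44 55 and an LDR from offset 1, the old behaviour gives &11443322 and the new behaviour gives &55443322 – the bottom three bytes agree, and only the top byte differs. |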
Terje Slettebø (285) 275 posts |
Hi Jeffrey. Thanks for the explanation. It’s coming back to me now. :) I.e. I remember that this was the way unaligned load/store used to work, and I understand that they work “properly” in ARMv7.

Given that unaligned load/store worked kind of strangely (possibly as a side effect of the way LDRB was implemented, as you say), one would think that they wouldn’t be much used in pre-ARMv6 code. As for unaligned LDM/STM, I’d think that would be even rarer, since the misalignment had no effect before.

The difference between before and now, when it comes to LDR, is that now we get the same bytes in the same positions as before, but we also get bytes from the next word. This means that if the code just cares about the bytes it used to get, it should also work on ARMv7. I admit that’s a big “if”. :) The same goes for STR: ARMv7 writes the bytes starting at the given address, without clobbering the start of the word (unlike the old behaviour), but it may write into the next word. Again, this might not matter, depending on the code. Just trying to get this clear in my mind (and it might help someone else as well). :)

I understand that this is a problem for some code, though. Since code like this may or may not work depending on the ARM model, it’s probably best to avoid it completely (and instead use LDRB/LDRH, etc.), and to enable unaligned data aborts to identify potential problem code, as has been suggested. If code using unaligned load/store relies on the old behaviour, either patch it to work the same way on any ARM version, or register it for unaligned trapping and old-style emulation by an Aemulor type of application (for those running ARMv6 and above). I know this has been suggested by others in this thread: just giving my support to it. :)

Regards, Terje |
John-Mark Bell (94) 36 posts |
No, it doesn’t. It simply discards the bottom two bits of the address, just like LDM/STM. Only LDR has the rotational behaviour.
Unfortunately, reality differs :) For starters, both Norcroft and GCC regularly make use of the rotational properties of unaligned LDR. This can be disabled (and is by default in GCC4), but that doesn’t change all the pre-existing binaries out there. The upshot of this is that some existing programs will appear to work as normal until an unaligned access occurs, when they’ll end up with junk data to work with. |
Jeffrey Lee (213) 6048 posts |
Whoops! Post should be fixed now, to avoid further confusion.

Today I had a go at making a build with the NoUnaligned option turned on (and alignment exceptions enabled). It turned out there were a few bits of code that needed fixing, so I’ve checked the required fixes into CVS. Most of the code that needed fixing was trying to load data from arrays of halfwords, so I’ve added a new pair of macros to Hdr:Macros to deal with it in a clean manner. I’ve also changed the Cortex kernel so that it enables alignment exceptions automatically when NoUnaligned is set to TRUE, which should aid testing in the future.

Also, unrelated to the above fixes, I’ve fixed the bug that was causing the register names to be corrupted in Debugger’s disassembly. It turns out it was a bug in OS_SetVarVal where the kernel wasn’t calling XOS_SynchroniseCodeAreas after copying a code variable’s code block into the heap :(
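For anyone wondering what the fix amounts to, the idea – sketched here in C rather than in the actual objasm macros, with a made-up helper name – is simply to read the halfword a byte at a time, so the access never needs to be aligned:

  #include <stdint.h>

  /* Hypothetical equivalent of the new macros: assemble a (little-endian)
     halfword from two byte loads instead of using an unaligned LDR */
  uint16_t load_halfword(const uint8_t *p)
  {
      return (uint16_t)(p[0] | (p[1] << 8));
  }

That’s the general shape, anyway – the real macros live in Hdr:Macros. |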
Jeffrey Lee (213) 6048 posts |
In fact, I’ve just checked in another change to enable NoUnaligned (and therefore alignment exceptions) by default for the Cortex builds. This should help a lot in identifying which existing programs do/don’t work (and where they’re going wrong if they don’t). |
Jeffrey Lee (213) 6048 posts |
I’ve decided that getting this custom abort handler working is the next big thing I’m going to work on.

Although I started work on writing an instruction decoder last week, I soon ran into the problem that the ARMv7 ARM doesn’t make things quite as clear or concise as the previous versions. The code I was writing was going to deal with all ARMv7 load/store instructions (including 16 & 32bit Thumb encodings), but due to the large number of special cases I don’t really trust my decoding logic (it doesn’t really help that I was writing the code on a PDA and so couldn’t have multiple windows on screen at once!). So I’m going to start again with some code purely for detecting & decoding ARM load/store instructions, release that, and then (hopefully) come up with a general-purpose tool that will generate C (or assembler?) for decoding instruction sets. Using a tool to generate the decoding logic has a number of benefits over most hand-crafted decoders:
Of course writing a general-purpose tool for rapid decoding of instruction sets is a bit of a major task in itself, due to all the special cases that crop up, which is why I’m postponing it until the first version of the abort handler is working. |
Jeffrey Lee (213) 6048 posts |
It looks like LLVM uses a similar tool to my proposed disassembler generator (docs here and ARM specific source here) However it doesn’t strike me as a particularly great solution (or not a solution that’s great for us). In particular it doesn’t seem to come up with an elegant solution to all the special cases; although I can see where the unconditionally executed instructions are defined, I can’t see any of the rules that state that conditionally executed instructions never use cond=1111 – leading me to suspect that it’s enforced explicitly in some of the C++ source, or even worse, isn’t enforced at all. In fact, I’m not even sure that all the conditionally executed instructions support disassembly/assembly of the condition code field (it looks like only the branch instructions support it). |
Jeffrey Lee (213) 6048 posts |
After a couple of days of fruitless search attempts, I found the magic search terms that would yield useful results from Google. This led me to information about the ID3 algorithm, and the realisation that it’s that algorithm I was attempting to reinvent for the decoder generator. I had done a bit about machine learning and ID3 while at Uni, but seem to have forgotten some of the finer points since then!

So, long story short, I’ve now got a fairly solid plan for how to write a decoder generator that will produce sensible decoding logic for the ARM/Thumb instruction sets, including a few different ideas on how it can handle the special cases. I’ve even got a proof-of-concept BASIC program for generating a decoder for the Thumb instruction set, but that code will soon be abandoned since it doesn’t deal with special cases in an elegant manner.

Although I was toying with the idea of writing the generator in an interpreted language (to allow code snippets to be used to describe the constraints on particular encodings), I’ve decided that it’s best to just do it in C and implement a simple expression parser instead. This will allow the generator to evaluate the constraints itself, as well as allow it to translate the expressions directly to C/assembler/whatever if it needs to evaluate them as part of the decoding logic. And since I’ve worked out most of the design issues, don’t be surprised if I skip my plan of hand-coding an ARM decoder first and just go straight for the machine-generated one!
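For the curious, the heart of it is the ID3-style “which bit do I test next?” decision. A heavily simplified, hypothetical version in C (the real generator also has to weigh up the constraints, so this is just the flavour of it):

  #include <stdint.h>
  #include <stddef.h>

  typedef struct { uint32_t mask, value; } Encoding; /* mask = fixed bits */

  /* Pick the bit to branch on at this tree node: prefer bits which are
     fixed in as many of the remaining encodings as possible, and which
     split those encodings reasonably evenly (a crude stand-in for ID3's
     information gain measure). */
  int pick_split_bit(const Encoding *enc, size_t n)
  {
      int best_bit = -1;
      size_t best_score = 0;
      for (int bit = 0; bit < 32; bit++) {
          size_t fixed = 0, ones = 0;
          for (size_t i = 0; i < n; i++) {
              if (enc[i].mask & (1u << bit)) {
                  fixed++;
                  if (enc[i].value & (1u << bit)) ones++;
              }
          }
          size_t zeros = fixed - ones;
          size_t score = fixed + (ones < zeros ? ones : zeros);
          if (score > best_score) { best_score = score; best_bit = bit; }
      }
      return best_bit;
  }

The special-case handling is where it stops being this simple, of course. |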
Steve Revill (20) 1361 posts |
To be honest, that sounds like a lot more fun. Plus, if we can use the resultant generated decoder code in the Debugger module as well, then it’s an added bonus. We also have the DecAOF program, which does a similar job to the disassembler function of the Debugger module, and its decoder code is shared by the C compiler. This has had some updates recently, including the addition of a switch for disassembling into ARM’s UAL format rather than the traditional ARM assembler format we’re all familiar with. At some point, we’ll probably have to make objasm understand UAL. |
Jeffrey Lee (213) 6048 posts |
Exactly my thinking :) Although the more I think about the generator, the more I keep switching between the “Oh God, the code’s got to work with a data set containing 2^32 elements” viewpoint and the “Nah, everything will be fine!” viewpoint. |
Jeffrey Lee (213) 6048 posts |
One good day of coding and I’ve now got a working generator. So far I’ve only tested it on the 16bit Thumb instruction set (and only asked it to generate a decoder suitable for performing a handful of actions), but after a few tweaks it seems to generate sensible trees for dealing with (simple) constraints, and the self-verification reports that the trees are accurate. Next step is to try typing out the definitions for the full ARMv7 instruction set and see what happens! On an Iyonix it does take a couple of seconds to generate the top level of the decision tree (which is by far the slowest part of the tree generation), so I have a feeling at least some optimisation will be required when it’s asked to deal with the full instruction set. Luckily I already have a couple of ideas for how to speed it up. |
Jeffrey Lee (213) 6048 posts |
Things are progressing nicely. I’ve made numerous improvements to the generator, and have transcribed the ARM & Thumb instruction sets into a set of encoding definitions for use by the generator. The encodings were taken from the ARMv7 ARM, so will cover ARMv4-ARMv7, VFPv2, VFPv3, Advanced SIMD, and ThumbEE. Apart from needing a couple of minor changes, the encodings are 100% unambiguous, so once I’ve filled in the few remaining undefined/unpredictable encodings I’ll be ready to test out the tree generation. At some point I’ll also need to go back and add support for the discontinued instructions (e.g. TEQP, and the FPA instruction set).

At the moment it takes several minutes for an Iyonix to check a 2^32 instruction set (e.g. ARM or Thumb2) for ambiguities. Tree generation should hopefully be quicker, although I can’t say for certain until I’ve tried it. It’ll also be interesting to see how long the brute-force tree verification takes to run – although since that’s mainly for debugging the generator, I doubt we’d need to use it much. Of course the generator source code is fully portable, so it’s easy to get a meaty PC to verify the tree or to debug the code using GDB/Visual Studio (which I’ve already had to do once or twice so far).

Also, although I initially said I was going to write the generator in C, I quickly saw sense and realised that C++ with some STL usage would save a lot of time and effort. So no compiling it with Norcroft, I’m afraid!

Although there are a few bits of the generator that I’m unhappy with, I’m expecting to have most of them fixed by the time the first release is made, except for the following two. Luckily they won’t affect our use of the decoder much.
|
Steve Revill (20) 1361 posts |
Sounds like good progress. When you say “tree”, this is presumably the data loaded into a general purpose parser algorithm so that the parser knows how to parse any input binaries (machine code) into human-readable output? If so, once a tree is generated, this can be saved out as a file or built into the parser module? E.g. you’d have the Debugger module which calls your parser routine saying “give me some ARMv7” and your parser just references the appropriate pre-built tree… In this case, I wonder how we’d implement things like the comments that say “Unsafe on ARMv5 or later” and stuff like that…? |
Jeffrey Lee (213) 6048 posts |
Not really. The tree representation is only used internally by the generator; instead of outputting the tree and requiring code to use a generic tree parser, it generates custom C code that has the effect of walking the tree for you. Once it reaches a leaf node it executes an instruction/encoding-specific block of code (as specified by the ‘action files’ fed to the generator); this code can then either perform the entirety of the action (e.g. produce and output a line of text for the disassembler) or return a value to whatever called the tree function (e.g. a value from some big enum indicating which instruction it was). The former is likely the easiest, because for each action you can specify which fields (i.e. register numbers and operands) of the instruction the action requires. The generator then outputs code to extract them into variables ready for use by the action.
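To give a feel for it, the generated code might look vaguely like this – purely illustrative, the real output will differ, and action_branch/action_undefined stand in for whatever the action files specify:

  #include <stdint.h>

  extern void action_branch(uint32_t cond, int32_t offset); /* hypothetical */
  extern void action_undefined(uint32_t instr);             /* leaf actions */

  void decode(uint32_t instr)
  {
      if ((instr & 0x0E000000) == 0x0A000000) {  /* B/BL-shaped encoding */
          uint32_t cond   = instr >> 28;
          /* bits 23:0, sign-extended and shifted left 2 */
          int32_t  offset = ((int32_t)(instr << 8)) >> 6;
          action_branch(cond, offset);
      } else {
          action_undefined(instr);
      }
  }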
I think you’ve touched on two things here – unpredictable/deprecated behaviour, and handling different architecture variants.

At the moment, detecting unpredictable/deprecated instructions is best done in the ‘actions’. This is simply because it would have taken me too long to translate all the rules into the encoding files, and it would have slowed down the generator by another order of magnitude (some of the rules are quite complex!). Furthermore, not all code will be interested in which instructions are/aren’t unpredictable (e.g. an emulator isn’t likely to care if it’s performing an LDM with writeback and the base register in the register list; if that variant of the instruction is unpredictable then it doesn’t really matter what the emulator does when it sees it, as long as it doesn’t go rogue and delete all your files).

The second thing you’ve touched on – handling architecture versions – is still a bit up in the air. Although it isn’t quite possible at the moment, you could have an encoding file which covers all architecture versions. Then you can either provide one action file for each architecture you’re interested in (producing multiple “trees” as output – which may be wasteful, since there’d be a lot of duplicated code between them), or you could have one action file where each action checks the architecture version and performs a different behaviour as necessary (which isn’t always ideal either, since for situations where the instruction has changed from one architecture to another – e.g. WFI actually reuses a NOP/MSR encoding – you’d have to extract the register numbers, etc. from the instruction manually instead of relying on the generated code to do it for you).

Once the first version of the code is out, I expect I’ll have to work quite a bit on making it easier to use with ‘flexible’ data sets. E.g. over the weekend I added simple macro support to the encoding files, so for each encoding in the FPA file I’ve added a macro which either expands to nothing or expands to a constraint which prevents the use of the NV condition code. This allows the same encoding file to be used for <=ARMv4 and >=ARMv5, just by specifying different options on the command line.

I should have the first version of the code released sometime this week, so if none of the above makes any sense to you then enlightenment shouldn’t be too far off ;) |
Steve Revill (20) 1361 posts |
It mostly makes sense. I’ve written a number of (top-down, recursive descent) parsers in the past, and parser generators which take a form of EBNF as input. But what you’re describing sounds a lot more powerful in some respects and a lot more focused at assembly language in other respects. I’ll be interested to see the finished product. |
Jeffrey Lee (213) 6048 posts |
Still very much rough around the edges, but the first release of ‘decgen’ is now available here. Included are the set of encoding files I created based around the ARMv7 ARM, along with some ARMv2/2a/3/FPA encodings created from whatever old datasheets I could get my hands on. There’s also an example program, and a most likely incomprehensible manual/readme. Plus source and binaries for RISC OS/Windows/Linux.

The example program is a hastily written 16bit Thumb ‘disassembler’. I verified the output of the disassembler against that of the Debugger module as a way of making sure the largely untested code generator was working fine (it wasn’t!). Note that although the disassembler will warn about some unpredictable instructions, it won’t warn about all of them, so don’t put too much trust in it (although the Debugger module doesn’t seem to warn about all of them either!).

Now that I’ve verified the basic functionality of the program I’ll move onto the next stage and start work on the unaligned load/store exception handler. I expect I’ll be making various improvements to decgen along the way, so should you or anyone else start using it yourselves, feel free to send any suggestions/improvements/bug reports back my way.

Regarding speed – tree/encoding verification can still be very slow. The ‘checkall’ script completes in under half an hour on both my Windows and Linux PCs. My (overclocked) Beagleboard has just completed it in 57 minutes. An Iyonix, however, will take at least twice as long, most likely longer. The last time I tried verifying the most complex instruction set (ARMv7 + VFP + ASIMD) it took around 50 minutes; the ‘checkall’ script verifies much more than that (although I think that 50 minutes was before I added some more optimisations to the code). Since I haven’t yet tried generating a decoder for a 32bit instruction set, I’m not sure how long that will take compared to the time taken to verify the encodings. Encoding verification can be skipped, so if tree generation doesn’t take too long there shouldn’t be any problems with using decgen for building various RISC OS components. |
Jeffrey Lee (213) 6048 posts |
Not sure if anyone besides me has tried using it yet, but there were quite a few problems with that first version of decgen. Apart from the bad bugs, it turned out that the tree generator was horribly slow – taking over 18 hours for my Iyonix to generate a tree for the ARM instruction set (and then aborting halfway through due to a spurious assert).

Luckily it wasn’t too hard to find a better algorithm, so tree generation now takes around 10 minutes. Several minutes of that is spent evaluating the worth of just one constraint, so if I add some code to split up complex constraints then that should result in another big performance boost. IIRC verifying the tree took around an hour, which isn’t too bad, and encoding verification is now faster as well – the ‘checkall’ script completed in around 50 minutes on my Iyonix, which must be at least a 50% reduction judging by the time the Beagleboard took.

There are also a few tree optimisations I want to try – e.g. at the moment it seems to favour placing the NV condition code check near the leaves, when ideally it should be placed nearer the top. This would reduce the number of nodes in the tree, resulting in reduced code size, and therefore increased performance due to better I-cache utilisation.

If all goes well I should have the new version up sometime over the weekend, so keep an eye on the website if you’re interested. |
Jeffrey Lee (213) 6048 posts |
After working my way through the ARMv7 ARM, I’ve now got some code that should allow fixup of unaligned accesses for instructions which (a) were present in ARMv5 or below, and (b) would have behaved sensibly on ARMv5 (e.g. LDRH doesn’t count, since it never supported the rotated load behaviour). So now I need to tackle the job of hooking it up to the OS.

To do this I’m planning on following the basic formula that Ben suggested in this post – i.e. add a new vector (DataAbortV?) and have the kernel abort handler use the vector’s return value to decide whether to retry/continue/call the data abort environment handler. However, I’m planning on being a bit more frugal with the information the vector is given – e.g. there’s no point telling it whether the late or restored abort model is in use, or the page table format, because those properties will be fixed and can just be read on startup via a SWI.

Also, I’m thinking that in order to get the best performance we should use multiple vectors, one per fault reason code (32 vectors in total). We might want to use some kind of bespoke vector registration API for this, to make sure that the lazy task swapping code isn’t knocked off its perch as the first claimant of the relevant fault code(s) (just in case someone’s been naughty and registered an abort handler that’s in application space).
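For concreteness, the 32 cases come from the 5-bit fault status field in the DFSR; assuming the ARMv7 short-descriptor DFSR layout (and this is exactly the bit ARM may shuffle in future), picking the vector would look something like this sketch:

  #include <stdint.h>

  /* Extract the 5-bit fault status from a short-descriptor format DFSR;
     FS[3:0] live in bits 3:0 and FS[4] in bit 10.  The result (0-31)
     would index the vector to call - e.g. 0b00001 is an alignment fault. */
  unsigned dfsr_fault_status(uint32_t dfsr)
  {
      return (dfsr & 0xFu) | ((dfsr >> 6) & 0x10u);
  }

Any thoughts? |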
Jeffrey Lee (213) 6048 posts |
Actually, thinking about it now, if we were to use multiple vectors we’d certainly need a bespoke API, just to protect ourselves against problems if the number of fault status codes increases in the future. |
Ben Avison (25) 445 posts |
Fair enough – I see at least some of them are already available via OS_PlatformFeatures, so that would be the logical place to add any others.
Some context might help here. I assume you’re talking about separate handling for each combination of the 5 fault status bits in the DFSR (data fault status register)? My gut instinct is that the layout of the DFSR is the sort of thing that ARM is bound to change at some point in the future, so I’d be reluctant to use it as the basis of an API.

Thinking about the ways in which this vector might be used, I think they basically fall into two classes – those which are going to cause the instruction to be re-run, and those which are going to do some other action instead. Examples of the former are lazy task swapping and a virtual memory system, and examples of the latter are old-style unaligned emulation and VIDC emulation. (At least we should never have to worry about emulation of hardware beyond VIDC/IOMD, as any later drivers should be using the HAL!) The former always has to be done before the latter – imagine if the first access to a page which is lazily swapped out is an unaligned access: in this case, both the lazy task swapping and unaligned handlers need to be called, in that order.

Slightly more tricky would be something which I once implemented for the RMA, but which would be nice to have working in application space with lazy task swapping enabled – and that is watchpoints. This would need some way to resolve conflicts between the lazy fixup code and the debugger (by which I include the Debugger module, DDT, DeskDebug or any other debugger which might come along). Not something I have time to plan out at the moment, but worth bearing in mind when designing the abort API.

Oh yes, and it occurred to me that it would be really nice if the DFSR could be used to give data abort error messages more context. Maybe something like “Abort on data transfer (type 1)” for alignment faults, or even more explicitly “Abort on data alignment fault”? |
Jeffrey Lee (213) 6048 posts |
Yes, that’s right.
Yes, it’s likely to change, but if we use a custom API for it then the API can change as well. We could try to defend against breakages by introducing our own translation layer to map RISC OS fault statuses to ARM ones, but at the end of the day, if an abort handler doesn’t check for the correct CPU architecture before registering itself then it will break regardless of how many safeguards we put in place.
Yes, that’s true. Thinking about the issue further, what would happen if someone was to try emulating ARMv6+ unaligned load behaviour on pre-ARMv6? If the paging system contains its own disassembler (in order to catch LDMs/STMs that cross page boundaries) then it won’t have a clue that there’s an emulator running that causes unaligned LDRs/STRs to span two pages, and so it will fail to page in the second page, causing the emulator to fail. With that in mind, and with your example of watchpoints, it may be worth considering a system where any emulated memory access that an abort handler wants to perform has to go through a common API to allow on-demand paging systems, debuggers, etc. to react to the access.
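Something shaped like this, perhaps (an entirely hypothetical interface, purely to illustrate the idea – the name and signature are invented):

  #include <stddef.h>
  #include <stdint.h>

  typedef enum { EMU_LOAD, EMU_STORE } EmuAccessType;

  /* An abort handler that wants to emulate a load/store would call this
     instead of touching memory directly, so that on-demand paging,
     watchpoints, etc. further down the chain still get to see the access.
     Returns 0 on success, or non-zero if the access itself aborted. */
  int emulated_access(EmuAccessType type, uint32_t addr,
                      uint8_t *buffer, size_t size);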
Yes, that’s definitely something I was planning on implementing. |
Jeffrey Lee (213) 6048 posts |
I’m getting mixed success with my testing of the abort handler. Some programs work fine, others don’t. Unfortunately, the programs that don’t work don’t appear to be failing due to any bugs in the abort handler. E.g. zip (compiled with GCC 3.4.6, IIRC) and egrep (from the ROOL BBE) both crash due to unintentionally entering Thumb mode. And while my copy of Norcroft seems to work, an OMAP3 ROM build falls over when building SuperSample, because amu makes a mess of the ‘${RUN}$* $@’ line in the makefile and tries to execute the command ‘-o o.Matrix2 s.Matrix2’ (without triggering any alignment faults, so I know it’s not a bug in my code!) |