ARM on ARM discussion
Jeffrey Lee (213) 6048 posts |
It’s only the cases where the branch destination is in the same block that I’m interested in. If the branch leads to a different block (even if it’s one that already has code generated) the JIT will take the approach of writing back all the dirty registers and forcing the target block to re-load registers as required. Similarly, it’s only going to consider cases where the JIT was able to predict at code generation time that the branch was going to stay within the block. So B and BL instructions will be dealt with, but most other branches probably won’t. |
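In rough C terms, the rule looks something like this (a sketch with made-up emitter names, not the actual arcem code):

    #include <stdint.h>

    /* Hypothetical emitter hooks. */
    extern void emit_store_reg(int host_reg, int emu_reg);
    extern void emit_local_branch(uint32_t dest);
    extern void emit_exit_to_dispatcher(uint32_t dest);

    typedef struct {
        int host_reg[16];  /* host register caching emulated reg r, or -1 */
        int dirty[16];     /* emulated reg modified since last write-back */
    } RegMap;

    /* When leaving the block, flush every dirty register so the target
       block can re-load registers as required. */
    static void writeback_dirty(RegMap *map)
    {
        for (int r = 0; r < 16; r++) {
            if (map->host_reg[r] >= 0 && map->dirty[r]) {
                emit_store_reg(map->host_reg[r], r);
                map->dirty[r] = 0;
            }
        }
    }

    static void gen_branch(RegMap *map, uint32_t dest,
                           uint32_t block_start, uint32_t block_end)
    {
        if (dest >= block_start && dest < block_end) {
            emit_local_branch(dest);        /* in-block: keep mappings live */
        } else {
            writeback_dirty(map);           /* leaving the block */
            emit_exit_to_dispatcher(dest);  /* target re-loads as required */
        }
    }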
Jon Abbott (1421) 2651 posts |
If the block has a fixed set of register mappings, determined by the JIT when it analysed the source instructions, I don’t see an issue; just branch directly to the in-block address, there’s no requirement to either store or restore any registers.
That would probably be near optimal anyway, as with so few working registers it would likely take more cycles to work out which to swap and which to store/restore. |
Jeffrey Lee (213) 6048 posts |
Yes, if the block has a fixed set of register mappings :-) Some loops could be annoying and simply use lots of registers, or some instructions could need lots of temporary registers (e.g. load/store). However, a quick skim through the red dragon book makes me realise that it’s probably best to just go with a tried-and-tested method if a simple code generator is deemed insufficient: analyse the block and generate a code graph, analyse the graph to assign registers, break the graph into DAGs, and then generate code (with a simple instruction scheduler to help lessen the impact of macro-ops like load/store instructions). |
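Sketched as data structures (hypothetical types, just to make that pipeline concrete):

    #include <stdint.h>

    /* One node per source instruction in the code graph. */
    typedef struct GraphNode {
        uint32_t emu_addr;          /* address of the original instruction */
        uint32_t opcode;            /* original ARM opcode */
        int uses[4], nuses;         /* emulated registers read */
        int defs[2], ndefs;         /* emulated registers written */
        struct GraphNode *succ[2];  /* fall-through / branch-taken edges */
    } GraphNode;

    /* The tried-and-tested pass order described above. */
    GraphNode *build_code_graph(const uint32_t *code, int len);
    void assign_registers(GraphNode *graph);  /* host register allocation */
    void split_into_dags(GraphNode *graph);   /* break cycles at back edges */
    void generate_code(GraphNode *graph);     /* schedule macro-ops + emit */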
Jon Abbott (1421) 2651 posts |
How are you planning to cope with register saturation? Are register mappings going to be dynamic through the block, or will there be a fixed set of most-used registers with some spare for dynamic register assignment? What are R0..R14 assigned to when JIT’d code is running?
Sounds like the final stages of a compiler. Is the code generation single pass? At what stage does it perform optimisation, if any? |
Jeffrey Lee (213) 6048 posts |
Yeah, that’s basically what it’s going to end up being.
I’ll let you know once I’ve implemented it! But before going for the full-on compiler-based approach I’ll be implementing a simpler system which focuses on register allocation. The first version would probably still do single-pass code generation, with host registers being assigned on an LRU basis. It won’t generate particularly optimal code, but it should be a lot better than the current code generator. Later on, e.g. when I add support for in-block branches, I may switch to a simple two-pass approach which does some basic register lifetime analysis so that it can get a more accurate idea of what registers will need to be active for each branch. |
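A minimal sketch of the LRU idea (hypothetical names; assumes a fixed pool of host registers with a timestamp per mapping):

    #include <stdint.h>

    #define NHOST 8  /* host registers available to the allocator */

    /* Hypothetical emitter hooks. */
    extern void emit_load_reg(int host_reg, int emu_reg);   /* LDR from state */
    extern void emit_store_reg(int host_reg, int emu_reg);  /* STR to state */

    static int mapped_emu[NHOST];    /* emulated reg in each host reg, or -1 */
    static int dirty[NHOST];
    static uint32_t last_use[NHOST];
    static uint32_t now;

    static void reset_allocator(void)
    {
        for (int h = 0; h < NHOST; h++) {
            mapped_emu[h] = -1;
            dirty[h] = 0;
            last_use[h] = 0;
        }
        now = 0;
    }

    /* Return a host register holding emulated register 'emu', spilling
       the least-recently-used mapping if there's no free one. (A real
       allocator would also pin registers used by the current instruction.) */
    static int get_host_reg(int emu)
    {
        int lru = 0;
        for (int h = 0; h < NHOST; h++) {
            if (mapped_emu[h] == emu) {   /* already mapped: just touch it */
                last_use[h] = ++now;
                return h;
            }
            if (last_use[h] < last_use[lru])
                lru = h;
        }
        if (mapped_emu[lru] >= 0 && dirty[lru])
            emit_store_reg(lru, mapped_emu[lru]);  /* spill the old value */
        emit_load_reg(lru, emu);
        mapped_emu[lru] = emu;
        dirty[lru] = 0;
        last_use[lru] = ++now;
        return lru;
    }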
Jon Abbott (1421) 2651 posts |
You really want to get all instructions running under the JIT so you can analyse where optimisation is required. Do you have a clear enough design for PC-referencing instructions to be able to do that? What you might want to do is leave the emulator to run all code, but direct branches to a specific address to the JIT, so you can run specific test code without having to support the OS, state machine, timers etc. |
Jeffrey Lee (213) 6048 posts |
I have added some metrics reporting now, so I’ve got a pretty good idea of which areas need improvement. E.g. I can see that (for my test case of booting to the desktop) over 50% of executed instructions are being handled by the JIT, but the vast majority of executed code blocks are only one (input) instruction long. Out of the instructions which aren’t being handled by the JIT, the vast majority are B/BL or LDR/STR, so those are the next instructions to focus on trying to support. I can also see the relative code density of the input vs. the output; at the moment the output is about 10x larger than the input, but I think that’s mostly down to the short code block length (the epilogue which updates the PC, cycle counter, etc. is quite lengthy). Longer code blocks should be better, and I can also experiment with passing some values into the JIT functions in registers (the PC and cycle counter would be the obvious candidates, since every code block will need to update those).
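For illustration, the register-passing idea could look like this (an assumed interface, not what arcem currently does):

    #include <stdint.h>

    /* Each generated block takes the emulated PC and cycle counter in
       r0/r1 and returns the updated pair packed into a uint64_t (which
       the 32-bit AAPCS returns in r0:r1), so the two hottest values
       never touch memory between blocks. */
    typedef uint64_t (*JitBlock)(uint32_t pc, uint32_t cycles);

    static inline uint64_t pack_state(uint32_t pc, uint32_t cycles)
    {
        return ((uint64_t)cycles << 32) | pc;
    }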
Yeah, I think so. I just need a few easy-to-use utility functions for generating the appropriate code sequences.
Good idea. Adding a simple address range filter to the JIT should be easy enough, and if I wanted I could add extra hooks onto the ArcEm SWI handler so that test code running in the emulator can control the filter. Another option I was thinking of was to just replace the RISC OS ROM image with a binary which performs whatever tests are necessary. |
Jon Abbott (1421) 2651 posts |
I’m guessing they’re one instruction long due to unimplemented instructions. Implementing LDR/STR/LDM/STM that don’t reference PC should make a big difference to block length, as will BL, SWI and conditional B.
Doesn’t sound too bad, but I wouldn’t expect that to improve, as LDR/STR/B/BL aren’t short code sequences, particularly where LDR/STR reference PC. It gives you a rough idea of final speed, which I’m going to guesstimate at around 8% of the host CPU speed. An ARM610-based RiscPC might be viable on a Pi1. |
Jeffrey Lee (213) 6048 posts |
Indeed. I’ve extended the ALU instruction handling to allow for PC reads, and I’ve implemented support for LDR/LDM (as long as they don’t write to PC) and B/BL, and I’m seeing some promising improvements in performance. Implementing the remaining (non-rare) instructions shouldn’t be too much trouble. However, single-pass code generation doesn’t lend itself to producing very good code, so if I’m going to take this much further then at some point I think I’ll have to start prototyping a more advanced, graph-based recompiler. |
Jeffrey Lee (213) 6048 posts |
Not much progress over the past week; I’ve hit a bit of a wall in terms of what can easily be done to the JIT (developing a useful code graph system is going to take a lot of time), and my other RISC OS work is starting to take priority. And in true tradition, working on the JIT has revealed a couple of nasty bugs in ArcEm, so I’m going to have to spend some time tracking down and fixing those. In any case I expect I’ll submit the JIT sources to arcem’s CVS sometime this week, so that other people can at least get an idea of how (not) to do things. |
Jon Abbott (1421) 2651 posts |
I’ll take a look once it’s available. Did you get the majority of instructions encoded? Which are outstanding?

Whilst doing some thought experiments on reencoding methods the other day, I realised blocks need only perform one state machine update per block; where blocks can branch within themselves (for local loops), the B is preceded by a state machine update. This allows the code to use most registers, substantially reducing the amount of register reallocation and reencoding of instructions required.

One of the more appealing methods for triggering the state update is via an inline SWI which passes the number of cycles since the last update, avoiding the requirement to store/restore user registers. This assumes the state machine is within a Module, not the application. There’s a further optimisation: instead of counting cycles in a register, include the cycle count in the SWI number, so base SWI + cycles with no parameters. E.g. if the SWI base is &100000, SWI &100020 would indicate an update of 32 cycles since the last update.

This reduces the number of run-time registers required by the JIT’d code down to two (a stack pointer and a fastmap pointer) and avoids the issue of directly interfacing with C. Think of the JIT’d code as the primary code that’s running, with the JIT, emulation and state machine as background tasks that need to preserve registers – not the other way around, as would be the case with an emulator.

Are you going to plan the graphing before attempting to code it? What are your ideas around this? How many registers can the JIT’d code use? |
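A sketch of the handler side of that (hypothetical; assumes the handler can see the SWI instruction via its return address):

    #include <stdint.h>

    #define SWI_BASE 0x100000u  /* the example base above (&100000) */

    /* Given the address a SWI will return to (the word after the SWI
       instruction), recover the cycle count encoded in the SWI number. */
    static uint32_t cycles_from_swi(uint32_t return_addr)
    {
        uint32_t instr = *(const uint32_t *)(return_addr - 4);
        uint32_t swi_num = instr & 0x00FFFFFFu;  /* 24-bit SWI field */
        return swi_num - SWI_BASE;               /* e.g. &100020 -> 32 */
    }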
Jeffrey Lee (213) 6048 posts |
No. I basically stopped implementing instructions once I realised that producing good quality code is effectively impossible with a single-pass system (and implementing some instructions within the constraints of the system was pretty tedious).
Ouch – I think the overhead of the SWI dispatcher will quickly add up if you start calling SWIs millions of times a second. Even if you were to hook directly onto the SWI vector (bypassing the kernel) the overhead of state switching within the CPU may still be pretty significant.
Yeah, I’ve been doing some pre-planning, but nothing too extensive yet. The initial prototype will be a standalone app, so I can easily feed it sections of code and examine the output. Exact details are yet to be nailed down, but it will likely do multiple transformation passes on the graph before it arrives at the final form. My current thoughts are that the passes would proceed as follows:
The current setup allows it to use R0-R12 and R14. But this might need tweaking a bit to make sure SL and FP are available for calls into C. |
Jon Abbott (1421) 2651 posts |
Would it be called millions of times a second? How often does it need to be called? Is it performing sound/video DMA as well as updating timers and IRQ’s? Sound DMA could be taken out of the equation by using a Channel Handler that emulates DMA. There’s probably a way of handling video DMA separately as well, if an IRQ could be raised at least once per raster.
What’s the overhead on state switching? I’ve seen this occur many millions of times/sec on some games under ADFFS, but they’re still running way too fast. The only game it was noticeable on was DarkWood, which stacks code, self-modifies and has intermixed code/data – it was still running quicker than the fastest StrongARM though.
5+ passes – that’s a lot of pre-processing. Will that not have an impact on perceived performance? By that I mean the output code will be optimised, but it may introduce stuttering whilst it generates blocks, particularly when stacked and self-modifying code are used. Quite a lot of legacy C code I’ve analysed had a nasty habit of stacking SWI calls for no logical reason. It’s possibly worth looking for those stack frames so they’re not continually being reencoded and impacting performance unnecessarily. Another potential performance impact is where the stack is placed in a code page, possibly triggering code to be dropped unnecessarily depending on how self-modifying code detection is performed.
Does it not want to retain the fastmap pointer in a register, considering the number of times it’s going to be referenced? Memory access is going to be the biggest impact on performance. |
Jeffrey Lee (213) 6048 posts |
“the overhead of the SWI dispatcher will quickly add up if you start calling SWIs millions of times a second”

“How often does it need to be called?” is definitely the key question here. I’d want it to be used to control everything which isn’t the core CPU emulation – timers, audio/video DMA, keyboard/mouse interrupts, multitasking with the host OS, etc. So depending on exactly how the events are grouped together (e.g. whether the screen is redrawn all at once or one scanline at a time) this could be anywhere from a couple of hundred events per second (100Hz timer emulation + ~100Hz audio DMA IRQ + ~60Hz VSync IRQ) to several thousand (320×256 at 60Hz would be ~15k events/sec if it was on a per-scanline basis).

But the problem with having an “increase cycle counter” SWI is that the number of calls to the SWI is going to significantly outweigh the number of times that an event needs to be acted on. The ArcEm JIT will increase the cycle counter, but it won’t trigger events while it’s in the middle of a block – it has to wait until it returns to the emulator. This might sound OK, but it seems to be enough to break some games (e.g. Lotus II gets stuck on the pre-race loading screen – I haven’t checked exactly where/why, but I suspect it’s sat in a loop reading two regular RAM locations, waiting for an interrupt handler to set a flag to say that the palette swap interrupt routine is correctly synchronised). For a RiscPC emulator you might be able to get by with fewer event triggers than would be needed for an Archimedes emulator, but you’re still going to have to be pretty cautious (e.g. if a generated code block contains a loop, you’re going to have to insert a trigger check in the loop body).

Storing the “time till next event” value in a register and only calling the SWI when it hits zero makes sense, but relying on a SWI to do the decrement does not.
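In other words, something like this in each generated block (a sketch; the names are illustrative):

    #include <stdint.h>

    extern void handle_due_events(void);  /* hypothetical event dispatcher */

    static int32_t cycles_left;  /* "time till next event", kept in a host
                                    register in real generated code */

    /* What each generated block would effectively do: decrement the
       countdown inline, and only take the expensive path (SWI or BL)
       when an event is actually due. */
    static inline void block_epilogue(int32_t block_cycles)
    {
        cycles_left -= block_cycles;
        if (cycles_left <= 0)        /* a SUBS + BLE pair when emitted */
            handle_due_events();     /* refills cycles_left */
    }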
If the event doesn’t need to generate a virtual IRQ/FIQ in the emulated CPU then yeah, you can easily use real interrupts (or threads) to take over some of the events. But if you’re using real interrupts to generate IRQ/FIQ within the emulated CPU then you’ll have the problem that the interrupt handler will need to be able to unpick the generated code to work out which registers were being used for what, and what the correct value of the emulated PC is.

“if you were to hook directly onto the SWI vector (bypassing the kernel) the overhead of state switching within the CPU may still be pretty significant”

I’m not sure. I think I’ve had FIQs running at ~1MHz on an Iyonix before without any serious side-effects. But the cost of calling a SWI is almost certainly going to be higher than that of calling a regular subroutine using BL.
Yeah, that’s certainly a concern I have. Once the design is fully implemented, it should hopefully be possible to merge together some of the passes, which should help to reduce the overheads.
I’m still planning on making the self-modification check be on a per-word basis. So unless an instruction which has been translated by the JIT is overwritten, the generated code shouldn’t be invalidated.

“The current setup allows it to use R0-R12 and R14”

That’s why I said “current setup” :-) Once the new system is working there’ll be plenty of time to tweak register usage. |
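A per-word check could be as simple as a bitmap over emulated RAM (a sketch with hypothetical names and sizes):

    #include <stdint.h>

    #define RAM_BYTES (16u << 20)  /* assumed emulated RAM size */

    extern void invalidate_code_at(uint32_t addr);  /* hypothetical */

    /* One bit per word of emulated RAM: set if that word has been
       translated by the JIT. 16MB of RAM needs a 512KB bitmap. */
    static uint32_t translated[RAM_BYTES / 4 / 32];

    static void mark_translated(uint32_t addr)
    {
        translated[addr >> 7] |= 1u << ((addr >> 2) & 31);
    }

    /* Called on every emulated store: only a store which overwrites a
       translated instruction invalidates the generated code. */
    static void check_store(uint32_t addr)
    {
        if (translated[addr >> 7] & (1u << ((addr >> 2) & 31)))
            invalidate_code_at(addr);
    }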
Jon Abbott (1421) 2651 posts |
There are a few other games that also require accurate VSync/Timer emulation to get past their loaders, such as Overload (Paradise).
Correct emulation of the 2MHz timers and FLYBK is the issue. If those could be IRQ-driven there’s no requirement to update the state machine in-line. Audio DMA can be handled externally, as can virtual IRQs for I/O. Multitasking is more of a challenge, but could be built into something else. This leaves cycle counting, the routine for which could also handle the multitasking if it’s going to stall the code to achieve a set MHz.

With regard to the screen blit, the limiting factor is syncing the 2MHz timers with the raster and raising FLYBK around the blit. I’ve proved you can accurately sync the timers, but FLYBK is a problem when blitting the screen in one go, particularly with code that waits on it to update the screen (there are only two games that I’m aware of that do this). In short, the bulk of the state machine could be handled outside of the JIT’d code.

“What’s the overhead on state switching?”

It was more a rhetorical question; I don’t think the hit is a major one. I’ve seen upwards of 16 million state switches/sec on some games, with minimal impact. Games do however have to stall for VSync, so they’re possibly not a good example compared to application code.
Without doubt, a lot slower. I wasn’t sure if the code was going to be within range of a BL. If it’s intermixed then BL is the obvious choice; even if the routine it’s branching to then loads PC to jump to a shared routine, BL is the preferred option. Although we both shudder at the thought of using SWI, VMware’s ARMv5 virtualization used SWI for STR/STM. |
Jeffrey Lee (213) 6048 posts |
I’ve now submitted the code to the ‘jit’ branch of ArcEm’s CVS. http://arcem.cvs.sourceforge.net/viewvc/arcem/arcem/?pathrev=jit I’ve tried to keep the JIT fairly self-contained, so most of the bits of interest will be within the ‘jit’ folder. |
Jeffrey Lee (213) 6048 posts |
While looking for something completely different, I stumbled upon this research paper about dynamic translation on ARM. Searching for “Strata binary translator” turns up several other research papers, some of which may also be useful. |
Jan Rinze (235) 368 posts |
https://github.com/kesl/khypervisor looks like a candidate for running RISC OS. |
Jeffrey Lee (213) 6048 posts |
I decided to take advantage of the Easter weekend by continuing to ignore my other duties and start work on the “version 2” arcem JIT. Currently it’s just a standalone C++ app that loads a binary, generates a code/flow graph from it (representing the original structure of the binary), and then spits out a poorly-formatted draw file so that I can visually inspect the results. Not much in the grand scheme of things, but it should be enough to get me over the initial hurdle of “I don’t know how any of this code needs to be structured”.

I’m also thinking of changing my approach for handling interrupts/scheduled events. My original plan was to have it so that the generated code would regularly check to see if an event needs to be triggered, and then drop out of the JIT to handle the event, immediately re-entering the JIT code afterwards. This would maximise the amount of time spent in the JITted code sections, but would probably result in a lot of bloat due to all the entry/exit points being generated. You’d probably be doing millions of interrupt checks per second, but an average machine only triggers around 300 interrupts per second. OK, things get a bit worse when you take into account the non-interrupt-generating events the emulator uses for IO, but the number of redundant checks would still be very high.

My new plan is to have the code generator break the code down into a sequence of blocks and, for each block, calculate the maximum number of CPU/clock cycles that can occur when executing that block. Then it can add a check at the start of the block to see if it’s possible for a scheduled event to fire while the block is executing. If it thinks an event will fire, it will execute the code for that block using the interpreter engine instead of the JIT. Only once it reaches the end of the block will it allow re-entry into JIT code.

Some care will be needed to make sure the “don’t re-enter the JIT until the end of the block” logic works correctly when the scheduled event causes the emulated CPU to take an interrupt – e.g. by flagging the block as being in ‘interpreter mode’, and only clearing the flag the next time the CPU reaches the normal entry point for the block (or when the JIT page is regenerated). |
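A sketch of that per-block check (illustrative names; assumes the event queue can report the next deadline in cycles):

    #include <stdint.h>
    #include <stdbool.h>

    extern uint32_t next_event_cycle(void);    /* hypothetical queue query */
    extern void interpret_block(uint32_t pc);  /* fall back to interpreter */

    typedef struct {
        uint32_t max_cycles;   /* static worst-case cost of the block */
        void (*code)(void);    /* generated host code */
        bool interp_mode;      /* set while an event splits the block */
    } Block;

    static uint32_t cycle_count;

    static void run_block(Block *b, uint32_t pc)
    {
        /* If an event could fire mid-block, run this pass through the
           interpreter so the interrupt can be taken at the right point.
           The flag is cleared when the CPU next reaches the block's
           normal entry point (or when the JIT page is regenerated). */
        if (b->interp_mode ||
            cycle_count + b->max_cycles >= next_event_cycle()) {
            b->interp_mode = true;
            interpret_block(pc);
        } else {
            b->code();
        }
    }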
Jon Abbott (1421) 2651 posts |
Have you considered scheduling physical interrupts and avoiding having to worry about IRQs altogether in the code blocks? The physical IRQ handler can preserve the state of the JIT before repointing it at the interrupting code, and restore the JIT state once the interrupt has been handled. Checking every code block is potentially a large chunk of CPU cycles if the block isn’t preserved once seen, and checking for a potential IRQ at the start of each block is also going to consume many cycles, depending on how many IRQs you’re checking for. I’m assuming your main concern is around scheduling synchronous IRQs? Async IRQs such as keyboard/mouse could schedule an IRQ to execute once the block completes, with a pending IRQ check done at branches to handle looping code. |
Jeffrey Lee (213) 6048 posts |
It has crossed my mind. But I’m not really a fan of it, since (a) it requires the host OS to support accurate timer interrupts (should be supported “soon” under RISC OS, but arcem is meant to be cross-platform), and (b) I’d like to avoid doing anything which would harm arcem’s timing accuracy (OK, the current version isn’t particularly accurate because the CPU frequency isn’t fixed, but it would be pretty trivial to add support for fixed-speed emulation in the future). Plus it’ll be harder to implement than my current plan – so I might as well try my current plan first before deciding whether an alternative will be needed.
I’m not sure I’m following you. The only time the JIT will throw away generated code is if the code is modified, or if it discovers that an extra entry point is required. So if it’s executing a loop of code, and each iteration of the loop contains a check for a pending interrupt, then it can execute N iterations of the loop using the generated code, execute 1 iteration using the interpreter engine (to allow the interrupt to occur in the middle of the block), and then execute the remaining iterations using the generated code from earlier.
Well at the moment arcem checks for interrupts on every single instruction, so it can’t be worse than that ;-)
Yeah, it’s mainly synchronous/timing-sensitive interrupts that I’m concerned about. There’s certainly the option to have ‘lazy’ interrupts/events which won’t be triggered in the middle of blocks, and if/when the code finds its way into RPCEmu things could probably be relaxed even further (since there’s less software which relies on precise CPU timing). |
Jon Abbott (1421) 2651 posts |
ARM-on-ARM is inherently not cross-platform, unless you’re proposing a generic cross-platform JIT. Yes, timers do need to be supported by the OS which is why I raised them as a pre-req for Virtualization many years ago.
It will need to count the cycles for each code block on entry, to see if an IRQ might occur; the code block cycle count might be dependent on entry register values. What approach are you taking? Are you attempting to improve the speed of ArcEm, in which case 2x speed is seen as a major improvement? Or are you coming at it from the opposite end, i.e. how close to the host CPU speed can the JIT get? The two approaches fundamentally change the implementation, as the former requires accurate instruction timing and the latter needs a low instruction ratio. I raise this point as achieving StrongARM performance is not going to be possible without using virtualization techniques. We know the hardware has to be emulated, as no modern SoC supports IOC/IOMD, but the majority of instructions are supported, so a CPU paravirtualization approach is going to get closer to a 1:1 instruction ratio than binary translation into a CPU emulator. |
Jeffrey Lee (213) 6048 posts |
I was using platform as an umbrella term for the CPU architecture and the OS.
No – it’ll do static analysis of the block to work out the maximum number of cycles. I.e. it’ll assume worst-case performance for MUL/MLA. Or, since the CPU can’t be interrupted in the middle of an instruction, I could make it so that (a) the last instruction in a block doesn’t contribute to the cycle count estimate, and (b) blocks immediately end after MUL/MLA instructions (or any other instruction that could have a wide variance in timing). The interrupt check that it does before starting the next block will be able to spot any interrupt that was missed and handle it, without needing to fall back to the interpreter mode (unless of course the next interrupt after that is due to occur in the middle of the next block).
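For illustration, the static estimate might be accumulated like this (a sketch; the cycle numbers are placeholders, not real ARM2 timings):

    #include <stdint.h>

    /* Worst-case cycles for one instruction (placeholder numbers). */
    static uint32_t worst_case_cycles(uint32_t op)
    {
        if ((op & 0x0FC000F0u) == 0x00000090u)  /* MUL/MLA */
            return 17;                          /* longest multiply */
        if ((op & 0x0E000000u) == 0x08000000u)  /* LDM/STM */
            return 20;                          /* assume all 16 registers */
        return 3;                               /* generic upper bound */
    }

    /* Sum the worst case over a block. The last instruction doesn't
       contribute, since the CPU can't be interrupted mid-instruction. */
    static uint32_t block_max_cycles(const uint32_t *code, int len)
    {
        uint32_t total = 0;
        for (int i = 0; i < len - 1; i++)
            total += worst_case_cycles(code[i]);
        return total;
    }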
Currently I’m focusing on improving the speed of ArcEm – so yes, I probably would consider a 2x improvement to be an acceptable outcome. I guess the ideal outcome would be to get it fast enough to make a StrongARM version feasible (I did a test build once when I was working on Kinetic support for RISC OS 5 – it ran at about 50% ARM2 speed). |
Jon Abbott (1421) 2651 posts |
What are you proposing for SWIs? Are they always going to be emulated and not included in the block analysis?
Only needs improving by several orders of magnitude then :| I’ve just tested the ADFFS JIT with RISCOSmark on a Pi3. The ARM3 JIT is 65% of a StrongARM 200 and the (unreleased) StrongARM JIT is 1368%. Running natively, the Pi3 is 1383% so the JIT’d code must be running very close to 1:1 instructions. The reason the ARM3 JIT is so slow is due to RISCOSmark intermixing code/data in the same page, so the JIT is having to check every memory write for code self-modification. |
Jeffrey Lee (213) 6048 posts |
SWIs will be handled in a similar manner to branches – i.e. the code generator will generate code for the SWI instruction, but then the block will have to end in order to switch to the block containing the destination code (the zero page SWI vector). There’ll be quite a lot of overhead because it’ll branch through several more blocks before arriving at the right SWI handler, so if it turns out to be particularly slow I could probably work out some way of reducing the number of hoops it jumps through. The code after the SWI instruction will be the start of a new block – which could be a single instruction in length if it’s another SWI. Maybe there’s some mileage there in having some common code for executing SWI & branch instructions, to reduce the amount of pressure on the host’s instruction cache (in fact, it’ll be practically the same as the emulator functions for SWIs/branches, so it might be worth just using those directly). |
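Sketched in emulator terms (hypothetical helper names), a SWI ends its block much like a branch does:

    #include <stdint.h>

    /* Hypothetical helpers. */
    extern void write_emu_reg(int reg, uint32_t value);
    extern void set_emu_mode_svc(void);
    extern void end_block_and_dispatch(uint32_t pc);

    #define SWI_VECTOR 0x08u  /* zero page SWI vector */

    /* The effect of a SWI at emulated address 'pc': save the return
       address, switch mode, and end the block so execution continues
       from the vector - just like a branch to another block. (On a
       26-bit ARM the PSR flags are saved as part of R14, which this
       sketch glosses over.) */
    static void gen_swi(uint32_t pc)
    {
        write_emu_reg(14, pc + 4);  /* R14_svc = address after the SWI */
        set_emu_mode_svc();
        end_block_and_dispatch(SWI_VECTOR);
    }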