ARM on ARM discussion
Jon Abbott (1421) 2651 posts |
The purpose of this discussion is to come to some agreement on the best way to develop an ARM on ARM backend for emulators, to provide support for 26bit software on ARMv5 through ARMv7. We can assume emulation of VIDC, IOC, MEMC, VIDC20, IOMD and the MMU is provided elsewhere. CPU mode switching is also handled elsewhere, with a suitable interface to allow translated code to switch CPU modes. Requirements that must be met:
Additional support for some of the following may also be required:
Methods for code translation:
JIT methods:
Hypercall methods (triggers code translation when running natively):
Codelet call methods (a codelet implements a problematic instruction in native code):
Memory map specifics:
And that’s just some of the possible options off the top of my head. |
Jeffrey Lee (213) 6048 posts |
My preference is to go with the “full JIT” approach. 1:N mapping of instructions, definitely a stack (the JIT will need to call into C functions for some functionality), all code running in user mode, with all address translation handled in software rather than relying on any advanced memory-mapping features offered by the host OS.

To cope with this the JIT will have to be capable of remapping registers in the generated code; this will make the JIT more complex, but it will also allow it to cache commonly needed state in registers (CPU cycle counter, pointer to the emulated ARM register file, address translation results, etc.).

It won’t reach the performance of native code, but it will be portable, and to me that should be the primary goal with any kind of emulation solution (both to reduce future maintenance costs for the emulator, and to make sure as many people as possible can benefit from it – a good ARM backend could benefit emulators running on RISC OS, Linux, AArch32, AArch64, etc.)

Remember that RPCEmu has already solved some of the technical challenges with this approach. The x86 and x64 backends both have to deal with the fact that all instructions need translation and that you can’t do a 1:1 address translation. I’m not expecting us to get acceptable performance right off the bat, but it’s a working foundation which we can build upon. |
Jon Abbott (1421) 2651 posts |
The goal with the JIT in ADFFS was to minimise the number of instructions being recoded, which is around the 10% mark. Cycle counting is not used as it wasn’t really relevant, the aim being to run code as quickly as possible. BL and instructions that reference or use PC are the only instructions recoded. To that end, it’s split data/code, 1:1 instruction, stackless, translates code once, uses co-pro instructions for Hypercalls and branches for codelets, and memory is mapped to source.

I opted for split data/code to allow quick self-modifying code detection: translated code writes all go to the data area, and write a Hypercall to the code area if there’s an instruction at the matched address.

I originally looked into 1:1 vs 1:N instructions, but considered the overheads of going 1:N too high. It would require a separate address translation table to track the instruction address and size, and make self-modifying code quite a challenge to implement without possible memory fragmentation.

The drawback of 1:1 is the impact on execution speed when a problematic instruction is encountered, as it has to break execution. I considered various options here and decided to use some of the lower 32MB to store codelets, so branches could be used from the translated code to codelets which implement the instruction. The disadvantage is that instructions that don’t set PC need individual codelets, as they have to branch back to the next instruction. The more common instructions, such as LDM R13!, {PC}, can however reuse codelets.

One area it doesn’t implement is CPU paravirtualization. I put it low on the initial implementation list, where in hindsight it should have been the first thing implemented. The consequence of this is that self-modifying code that’s elevated isn’t handled; for the most part this meant ensuring games ran as USER and didn’t elevate themselves. There’s one game that won’t run under it (Adventures classic compilation), although this doesn’t run under any emulator either, as none of them implement the instruction set accurately enough.

If I was to code another, capable of hosting an OS, I’d probably stick with 1:1 instruction mapping, look at redirecting L1PT/L2PT to retain 1:1 memory translation, and allow hypervised page table changes and paravirtualization of memory access levels through multiple L1/L2PTs. I’d also use a co-pro or SWI instruction to call codelets, to get around the 32MB branch limit and allow all codelets to be reused.

Stacked vs stackless is a tricky one; it would need careful statistical analysis of code to determine which register to use for stacked. Provided that register isn’t used much in the client code, there would be good advantages to be had around codelet interruption and codelet initialisation and exit. The disadvantages with stackless are the requirement to preserve codelet state during IRQ handling and having to store registers locally, which adds to data cache usage. The advantage of stackless is that only problematic instructions need recoding. |
Jeffrey Lee (213) 6048 posts |
Just to clarify: We are talking about a JIT to use with full-system emulators, yes? A JIT for a full-system emulator like RPCEmu is going to be very different from a JIT for a user-space emulator like Aemulor or ADFFS.
Yeah, it does make things tricky. But if it was easy, where would the fun be? :-)

RPCEmu uses a simple hash table to map the virtual address to the code block – essentially a 1-way associative cache. The memory for the code blocks is preallocated at startup. Simple and fragmentation-free, but a bit wasteful with memory and liable to suffer performance issues (both from code being evicted from the cache, and because – AIUI – every entry point to a routine will result in a new code block being generated, so routines with multiple entry points won’t share code blocks).

The approach I’d be tempted to try would be a lot more complicated:
I’m assuming most code we’d be interested in running would be StrongARM-compatible, so the amount of (performance-critical) self-modifying code should be minimal. Which means we shouldn’t have to worry too much if the cost of overwriting code is quite a bit higher than the cost of writing pure data, or if the cost of generating JIT code blocks is rather higher than with 1:1 translation schemes.

Another way of doing things, which wouldn’t require a massive lookup table the size of ROM+RAM, would be to have a lookup table with one entry per JIT page. The generated code for that JIT page would then have one entry point, which starts by checking the low bits of the PC to see which part of the code block to jump to (or whether it’s a new address and the JIT generator needs invoking). This would trade off RAM usage (and D-cache thrashing) for increased I-cache and branch predictor usage. |
Jon Abbott (1421) 2651 posts |
I started writing on the same subject this morning, which I’ve pasted at the bottom. We’re thinking along similar lines about how 1:N could be implemented.
Correct, but as a replacement for Aemulor/ADFFS. Wimp apps can be presented on the host desktop via a Wimp shim Module (codename: PixieDust™). We can put the specifics of how PixieDust™ works to one side at this stage, as the JIT is the complicated bit; suffice to say it handles passing Wimp messages and window redraws between the client/host.
I considered this exact method originally, but quickly realised you only need to determine if an address contains an instruction or data. You know there’s an instruction there as it will be in the instruction address table; if it’s an instruction you haven’t seen yet, it’s still considered data. The end result being you don’t require a separate table for attributes.
That might have quite an impact if the JIT pages aren’t sufficiently small…32 or 64 instructions perhaps?
The method I employed in ADFFS was to translate until whichever came first: 128 instructions, an instruction that set PC, or a terminal SWI such as OS_GenerateError, OS_Exit etc. It also does predictive branching, but I ended up turning it off as I found it was impacting the performance too much. For example, a branch may rarely if ever be taken, or some self-modifying code means we’ve either wasted time translating instructions or, worse still, it’s branched outside of code so we end up translating data.
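In rough C, that termination rule might look something like the following – the helper, masks and SWI handling below are illustrative rather than ADFFS’s actual code, and the PC-write test is deliberately conservative (a few non-ALU encodings match it too):

#include <stdint.h>

#define MAX_BLOCK_INSTRS 128

#define SWI_OS_Exit          0x11
#define SWI_OS_GenerateError 0x2B
#define SWI_OS_ExitAndDie    0x50

static int instr_writes_pc(uint32_t instr)
{
    /* Data processing with Rd == PC, LDR PC, LDM with PC in the list, B/BL */
    if ((instr & 0x0E000000) == 0x0A000000) return 1;   /* B / BL           */
    if ((instr & 0x0C00F000) == 0x0000F000) return 1;   /* ALU with Rd = PC */
    if ((instr & 0x0C10F000) == 0x0410F000) return 1;   /* LDR PC,...       */
    if ((instr & 0x0E108000) == 0x08108000) return 1;   /* LDM ...{..PC}    */
    return 0;
}

static int terminates_block(uint32_t instr, int count)
{
    if (count >= MAX_BLOCK_INSTRS) return 1;
    if (instr_writes_pc(instr))    return 1;
    if ((instr & 0x0F000000) == 0x0F000000) {           /* SWI              */
        uint32_t swi = (instr & 0x00FFFFFF) & ~0x20000u;  /* drop the X bit */
        if (swi == SWI_OS_Exit || swi == SWI_OS_GenerateError ||
            swi == SWI_OS_ExitAndDie)
            return 1;
    }
    return 0;
}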
Why not retain the state variables in registers? I assume we’re talking about PSR/PC, the stack etc.? I’d leave state management within the JIT as it’s going to be used quite frequently and would waste a lot of space if management were done within each block.
Why do entry points need to be tracked?
I’d let the JIT code itself trigger when the code generator kicks in: pages are already filled with Hypercalls (“do the JIT” as you put it), so will trigger code generation. Self-modifying code can be handled in the same way.
Is memory space really going to be an issue to require code block caching? Apps don’t tend to contain a lot of code; the bulk of the space is going to go on the client OS. Emulating IOC caps client memory to 32MB and IOMD to 128MB (IIRC 256MB caused issues), and the bulk of that memory space is either going to be unallocated or allocated to client DAs that won’t contain code.
StrongARM-compatible self-modifying code is actually more problematic than ARM2/3 code, if the JIT starts clearing code based on OS_SynchroniseCodeArea. The problem being lazy code that simply flushes the whole cache, instead of a range. I had real performance issues with StrongARM code in ADFFS until I chose to completely ignore OS_SynchroniseCodeArea.

It would need to cover both IOC (re ARM2/3) and IOMD (re ARM610/ARM700/ARM710/SA) based machines, be that in the same emulator or via separate emulators.

I can’t speak for applications or the OS, as I’ve not done any code analysis on them. So far as games go, the performance-critical self-modifying bits tend to be small code changes specific to one routine, covering one or two instructions. Protection and decompression do however tend to take a big hit during startup, but that’s not so much of a concern.

The biggest performance impact is going to be stacked code; some C code has a nasty habit of putting a small SWI code block onto the stack and branching to it, instead of using OS_CallASWI. These grated on me so much, I re-wrote the damn things when getting games running under ADFFS! They tend to follow the same code structure so could be detected and handled during code generation if required. Stacked code is very bad for performance, as it marks the stack page as potentially containing code. There’s also the issue of code that places the stack within a code page.

DarkWood running with high graphics settings at 25fps…there’s a challenge!

My post that I didn’t quite finish this morning:

Implementing 1:N

Ignoring how self-modifying code is detected for the moment, what methods could be employed to implement 1:N? The challenges that need resolving are:

Translating the client code address to a translated address

I see a couple of potential issues here. With 1:N you’d be translating code as you see it, so the order of the translated code doesn’t necessarily match the order of the client code. This raises two questions:

1. How to track the translated instruction code address and size?
2. Where to place translated code?
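One illustrative answer to both questions (all names here are hypothetical): keep a per-page table mapping each client word to its generated code, with the generated code itself living in fixed-size chunks so it can’t fragment:

#include <stdint.h>

#define PAGE_WORDS 1024                  /* one 4KB client page */

typedef struct {
    uint32_t *host[PAGE_WORDS];          /* client word -> start of generated code */
    uint8_t   length[PAGE_WORDS];        /* host instructions generated per word   */
} page_map_t;

static inline uint32_t *lookup_translation(page_map_t *map, uint32_t client_addr)
{
    /* NULL means "data, or not translated yet" */
    return map->host[(client_addr & 0xFFF) >> 2];
}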
|
Jeffrey Lee (213) 6048 posts |
When a memory write hits a location flagged as “data”, the attributes for that entire JIT page will be reset to default, and the generated code will be discarded

Typo there from me – when a write hits a location flagged as code the attributes for the entire page will be reset. Clearly if you’re just overwriting data then the JIT doesn’t need to care about it!

When the JIT reaches an instruction which may branch elsewhere, an epilogue instruction sequence will be generated which will write any dirty state variables back to the main emulator state structure and then either branch to the next code block (looking it up in the translation map) or exit to the main emulator loop (which will then do the branch). (I’m not sure which approach would be best)

By state variables I mean the state of the virtual machine – the emulated ARM registers, the cycle counter/event timer, etc. There won’t be any fixed mapping of emulated registers to host registers; it will be different for each code block. E.g. a simple round-robin or LRU strategy to assign emulated registers to host registers as code for each instruction is generated. Because the mapping will be unique for each block, and potentially unique for each exit point within a block, it makes sense to have the block contain code to write back the dirty registers itself.

If control flow passes over another location marked as an entry point

The answer to this will depend on what viewpoint/approach you’re coming from. RPCEmu generates a new, independent code block for each initial address the JIT is invoked for. So if you have a function which has two entry points but a shared body then the JIT will produce two code blocks for the different entry points, and possibly also end up duplicating the function body. If you end up with multiple copies of code like this then it will bloat/pollute the JIT code cache and hurt the host machine’s I-cache usage. For something closer to ADFFS, where there’s no duplication of JIT’d code, tracking entry points will allow for more efficient code to be generated.

We’ve got to squeeze 15 general-purpose registers and some special registers like the PC, PSR, cycle counter, etc. into about 10 host registers. You can’t have all those registers active at once, so you need to load and save them on demand from the emulator’s state structure. But loading and saving them around every instruction (in order to allow JIT routines to be entered from any point) would be wasteful – you want to load a register, keep it resident for a few instructions, and then save it back at a later date. To make that possible you need to limit the number of entry and exit points for each generated code block.

I guess my initial description of it being a 1:N JIT is inaccurate. Really it would be an N:M JIT, because it will be designed to work on indivisible blocks of code. There’s also the possibility that code sequences could be optimised or rewritten entirely (NOP instructions like MOV R0,R0 can be omitted, sequences of conditional instructions – especially complex ones like load/store or SWI – may use branches to split the execution flow, etc).
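As a sketch of that allocation strategy (purely illustrative – not RPCEmu’s code, and all names are hypothetical), round-robin assignment with dirty write-back might look like:

#include <stdint.h>

#define NO_REG 0xFF

typedef struct {
    uint8_t  host_for_emu[16];  /* emulated reg -> host reg, or NO_REG      */
    uint8_t  emu_for_host[16];  /* host reg -> emulated reg, or NO_REG      */
    uint16_t dirty;             /* bit n set: emulated Rn modified in block */
    unsigned next;              /* round-robin cursor over the host pool    */
} regmap_t;                     /* initialise all mappings to NO_REG        */

static const uint8_t host_pool[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };

void emit_load_from_state(int emu_reg, int host_reg);   /* hypothetical emitters */
void emit_store_to_state(int emu_reg, int host_reg);

int map_reg(regmap_t *m, int emu_reg)
{
    if (m->host_for_emu[emu_reg] != NO_REG)
        return m->host_for_emu[emu_reg];      /* already resident in a host reg */

    int host = host_pool[m->next++ & 7];      /* round-robin victim selection   */
    int old  = m->emu_for_host[host];
    if (old != NO_REG) {
        if (m->dirty & (1u << old))
            emit_store_to_state(old, host);   /* spill the previous occupant    */
        m->host_for_emu[old] = NO_REG;
    }
    emit_load_from_state(emu_reg, host);      /* fetch from the state structure */
    m->host_for_emu[emu_reg] = host;
    m->emu_for_host[host]    = emu_reg;
    return host;
}

/* At each exit point, walk 'dirty' and emit_store_to_state() for every
   modified register, so the epilogue writes the state structure back. */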
Well, it will be useful to know that your emulator isn’t going to slowly gobble all the free memory on the host system :-) The cache size would probably be quite generous, but placing a hard upper bound is always a good idea.
Yeah, the intention is that the JIT will ignore most/all of the cache maintenance requests made by the emulated code. In effect it will be emulating a cacheless ARM with no instruction prefetch pipeline.
In terms of the JIT, I think IOMD is all we need to worry about for now. We already have plenty of Arc emulators that run on RISC OS. |
Jon Abbott (1421) 2651 posts |
How many client instructions are there in the block? I presume it would pre-scan the client block, decide on register allocation and then encode the whole block, or finish at a terminating instruction.
There may be some mileage in counting register use and allocating based on the most heavily used. It will slow encoding, but will result in faster code if there’s no self-modifying done within the block.
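A minimal sketch of such a pre-scan, assuming a deliberately crude decoder (only the data-processing case is shown; the names are hypothetical):

#include <stdint.h>
#include <stddef.h>

void count_register_use(const uint32_t *block, size_t n, unsigned counts[16])
{
    for (size_t i = 0; i < n; i++) {
        uint32_t instr = block[i];
        if ((instr & 0x0C000000) == 0) {     /* data processing, roughly    */
            counts[(instr >> 16) & 0xF]++;   /* Rn                          */
            counts[(instr >> 12) & 0xF]++;   /* Rd                          */
            if (!(instr & (1u << 25)))       /* register form of operand2   */
                counts[instr & 0xF]++;       /* Rm                          */
        }
        /* ...similar cases for loads/stores, multiplies, etc. */
    }
    /* Host registers are then handed out to the highest counts first. */
}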
Definitely a requirement with each block entry/exit being unique.
I presume C entry/exit would be managed so flags are retained in the USER PSR for faster code execution? Can R14 not be used if entry/exit to C is managed, retaining just the stack, with state variables being accessed as required?
The ability to optimize code is definitely worth implementing, as it would allow problematic code sequences to be rewritten.
I suppose it gives the user some control over the amount of memory being used, beyond the base memory allocation.
You can safely ignore all cache maintenance requests, as the JIT will be managing the cache itself, essentially cleaning any writes to JIT’d code blocks. Writes to the client memory space will be data so far as the JIT is concerned, irrespective of client code being overwritten by self-modifying writes. Not implementing at least stage 2 of the pipeline will cause some software to fail, particularly writes up to and including the current instruction, which are quite common in decrypt, decompress and protection routines. I implemented it in ADFFS for the following games:
The majority of uses I’ve come across rely on STR Rx,[PC,#-4] working, closely followed by STMs that overwrite themselves. These could be detected and handled by the optimization routine to avoid implementing a pipeline, which for a JIT is an interesting challenge! |
Jeffrey Lee (213) 6048 posts |
Most of the time I’d expect the encoding to stop when it finds a terminating instruction, e.g. a SWI or a branch to a location outside of the current JIT code page. I guess a good performance optimisation to make to the JIT would be to make it automatically flag the next instruction after a SWI or BL as being an entry point, to save on the code generator being invoked again the first time the SWI/BL returns (it would only be in very rare cases that a SWI/BL never returns).
Yeah, the JIT will save/restore the emulated NZCV flags as necessary. Chances are it will treat them the same as other registers – e.g. if you’ve got a sequence of unconditional instructions, none of which set the flags, then there’s no need to bother loading the NZCV into the PSR.
Probably, yeah. I think it mostly boils down to whether you want to get useful stack traces when the code crashes (and if you’re using CLib, whether you want the stack unwinding that’s done for upcall/event dispatch to work correctly). One of the nuisances with C is that APCS functions are allowed to clobber R0-R3 and R12, so the JIT will probably want to favour R4-R11 for most cases and only use R0-R3, R12 and R14 for temporaries or for C calls.
Yeah, I guess the JIT would be able to detect dangerous PC-relative STRs pretty easily. It’s the more general case of using a different base register that has me worried – you’d need to do a PC check for every memory write. Although I guess you might be able to get 99% compatibility by building it into the code that does the JIT code page overwrite detection.

One option for dealing with pipelining, assuming the emulator also contains a (correctly pipelined) interpreter engine, would be to cause the STR to be cancelled and then re-execute the instruction using the interpreter. For each instruction the interpreter could cross-check the prefetched instructions against what’s in memory, and when it spots that it’s gone past the self-modifying section it can go back to using the JIT. |
Jon Abbott (1421) 2651 posts |
Losing 6 registers could seriously hit performance, but this will depend greatly on where most performance is lost. I expect memory access may be the slowest area, as stores will be performing self-modifying checks as well as address translation. How would LDR/STR/LDM/STM be implemented? Within the block or external? If they’re not touching PC and only translating Rd, in-block should be easy enough; if they’re altering PC, there’s an argument for centralising the code so it can be used by ALU ops that set PC.

The majority of uses I’ve come across rely on STR Rx,[PC,#-4]…

I don’t believe detecting the write will impact performance noticeably; STR/STM are probably going to be the slowest instructions anyhow. Implementing the pipelined instruction dynamically however would pose some challenges.
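Detecting the STR Rx,[PC,#-4] idiom at translation time should at least be cheap. A sketch, using the masks for the basic immediate-offset STR encoding (any condition code, any source register):

#include <stdint.h>

static int is_str_pc_minus_4(uint32_t instr)
{
    /* STR Rx,[PC,#-4]: single data transfer, immediate offset, pre-indexed,
       U=0 (subtract), L=0 (store), Rn=PC (R15), offset of 4 */
    return (instr & 0x0FFF0FFF) == 0x050F0004;
}

The generator could then route any match to a codelet, or fall back to the interpreter as suggested above.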
Falling back to emulation is probably the easiest solution. How would cycle counting and timers be implemented in the JIT’d code? Via a centralised routine that’s called every N instructions and at B/BL? How would DMA channels be implemented? Threaded? Or as part of the cycle counter/timer code? |
Jeffrey Lee (213) 6048 posts |
Doing them in-block sounds like a worthy goal to me. What I’d like to aim for is a situation where the generated code is able to cache the result of the address translation + permission check. That way sequences of load/store instructions which target regular (data) RAM pages should be nice and fast. To avoid over-complicating things, loads/stores which require more complex behaviour (IO memory accesses or write tracking for writes to code pages) will probably take an unoptimised path which uses function calls.
It’ll probably be very similar to how I implemented it for ArcEm: One priority queue which contains all of the events which are scheduled for a future time (timer interrupts, audio/video DMA, keyboard/mouse update, etc.) and a simple cycle counter which counts down the time remaining until the next event should fire. The generated code can then be responsible for decrementing the counter at suitable times and calling the event dispatcher when it expires.
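In outline (the names and layout below are mine, not ArcEm’s actual code), that scheme might look like:

#include <stdint.h>

typedef void (*event_fn)(void);

typedef struct event {
    uint64_t      due;     /* absolute emulated cycle at which to fire      */
    event_fn      fire;    /* timer interrupt, DMA fetch, vsync, ...        */
    struct event *next;    /* singly linked list kept sorted by 'due'       */
} event_t;

static event_t *queue;     /* head = soonest event                          */
static uint64_t now;       /* emulated cycles executed so far               */
static int32_t  countdown; /* cycles until the head event is due            */

void schedule_event(event_t *e, uint64_t when)
{
    event_t **p = &queue;
    e->due = when;
    while (*p && (*p)->due <= when) p = &(*p)->next;
    e->next = *p;
    *p = e;
    if (queue == e) countdown = (int32_t)(e->due - now);
}

void dispatch_events(void)             /* called when countdown reaches <= 0 */
{
    if (!queue) return;
    now = queue->due + (uint64_t)(-countdown);  /* account for any overshoot */
    while (queue && queue->due <= now) {
        event_t *e = queue;
        queue = e->next;
        e->fire();                     /* handlers may schedule_event() again */
    }
    countdown = queue ? (int32_t)(queue->due - now) : INT32_MAX;
}

/* The generated code then just does, in effect:
       countdown -= cycles_for_this_sequence;
       if (countdown <= 0) dispatch_events();                                */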
Using the event queue system. |
Jon Abbott (1421) 2651 posts |
How would LDR/STR/LDM/STM be implemented?

How would you implement the TLB in-block? Where, for example, would the pointers to L1PT/L2PT be stored? At the beginning/end of the block, or in-line with branches over them? Have you tried coding an example LDR to see what the implementation issues and possible optimizations/pitfalls are?
I’m not sure this would be possible, as the JIT won’t necessarily know if the load/store is addressing IOMD at the time it builds the code. Have you considered running the code on a dedicated core and mapping memory 1:1 to avoid emulated memory access? This would get it closer to ARM on ARM than emulation, which is essentially what you’re proposing, albeit with instructions coded into blocks instead of being done with centralised routines. Few original instructions would be executing without being modified, assisted or emulated in-line. I can’t help but think this is a very complicated implementation that’s not going to see much in the way of performance gains over an emulator – possibly two or three times the performance at best. |
Jeffrey Lee (213) 6048 posts |
I’d expect all pointers or other awkward constants would be stored at the end of the block. Since the blocks will be fixed-size chunks of memory this will be pretty straightforward. (Or for ARMv7+, it would be a good chance to see whether MOVW+MOVT really does make a difference)
I’d want to implement the MMU handling in a very similar way to the FastMap system that I implemented for ArcEm: a flat logical → physical translation table which encodes the entries in a devious manner in order to minimise the decoding cost. This meant that the MMU translation, and reads/writes for regular RAM pages, could easily be handled by inline code. IO memory or RAM pages which had special handlers associated with them (e.g. one of the video driver options was to track writes to screen memory) would be handled by calling a function pointer that was contained in the fast map entry (allowing it to go direct to the correct handler for the IO device being accessed).

The FastMap system did rely on the ability to detect when the page tables were modified, so that the FastMap could be kept up to date. This is trivial for MEMC emulation (page tables are hardware registers). For ARMv3+ it would probably have to be done by associating write handler functions with any pages which are used for page tables. A bit trickier, but still possible.
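At the C level, each (two-word) FastMap entry would presumably look something like the following – the flag layout is guessed from the description, so treat it as illustrative rather than ArcEm’s exact scheme:

#include <stdint.h>

typedef uint32_t (*access_fn)(uint32_t addr, uint32_t data, int is_write);

typedef struct {
    uint32_t  flags_and_data;   /* flags in the top byte; (host - logical)   */
                                /* address offset, shifted right 8, below it */
    access_fn func;             /* handler for IO / specially-tracked pages  */
} fastmap_entry_t;

extern fastmap_entry_t fastmap[];        /* one entry per 4KB logical page   */
#define FASTMAP_DIRECT_READ 0x80000000u  /* illustrative flag bit            */

static inline uint32_t fastmap_read32(uint32_t log)
{
    log &= 0x03FFFFFC;                           /* 26bit bus, word aligned  */
    fastmap_entry_t *e = &fastmap[log >> 12];
    if (e->flags_and_data & FASTMAP_DIRECT_READ) /* plain RAM: direct access */
        return *(uint32_t *)(uintptr_t)(log + (e->flags_and_data << 8));
    return e->func(log, 0, 0);                   /* IO/special: call handler */
}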
I think the way I’d want to do this is have the JIT generate two separate code paths. E.g.:

LDR R1,[R0,#foo]
LDR R2,[R0,#bar]
LDR R3,[R0,#wibble]

->

min_addr = (R0 + min(foo, bar, wibble)) & ~3
max_addr = (R0 + max(foo, bar, wibble)) & ~3
tlb = mmu.translate((R0 + foo) & ~3)
if tlb.is_ram && (min_addr >> 12) == (max_addr >> 12) {
  // Use log -> host address offset returned in TLB entry to generate a
  // sequence of load instructions for R1, R2, R3.
  // Potentially split this into another two code paths if we can predict
  // that all addresses are likely to be word-aligned (e.g. foo, bar, wibble
  // are all word-aligned)
} else {
  // Load R1 (could abort, could be direct access, could be function call)
  // Do another MMU translation for [R0,#bar]
  // Load R2 (could abort, could be direct access, could be function call)
  // Do another MMU translation for [R0,#wibble]
  // Load R3 (could abort, could be direct access, could be function call)
}

Of course the magic is going to be having the JIT detect the places where the above kind of optimisation is going to be possible.
Well, the obvious problem with that is that not all machines are multi-core :-) The second problem would be: where would the JIT code live? AIUI, for the CPU to be able to execute code it needs to be in a readable page, so without some kind of address filtering around all load instructions the emulated code will easily be able to see that it’s living in a simulation. And even with address filtering present there’d be a chance the emulated machine could pick a memory map which makes it awkward (or impossible) for the host to find spare logical address space in which to insert the emulator code. |
Jeffrey Lee (213) 6048 posts |
An example of how the FastMap stuff is intended to work for a simple LDR:

BIC   log, log, #&fc000003             ; Deal with 26bit ARMv2 address bus, and mask out bottom two bits to make it word aligned
MOV   temp, log, LSR #12
ADD   temp, fastmap_ptr, temp, LSL #3
LDMIA temp, {flags_and_data, func_ptr} ; flags_and_data contains flags in the top byte, and a (shifted) logical -> host address offset in the low bytes
AND   result, fastmap_mode, flags_and_data
MOVS  result, result, LSL #1
BEQ   abort                            ; abandon this memory access and trigger a data abort
LDRCC data, [log, flags_and_data, LSL #8]
BLXCS func_ptr                         ; return with 'data' containing the loaded value (assume the compiler arranged the in/out registers sensibly)
; + extra code here to deal with rotated loads

Possibly there’s an extra instruction there which can be dropped – my original notes (if they still exist) will be at home. fastmap_ptr would be the pointer to the fastmap table, fastmap_mode is a precomputed flags word indicating the current CPU/MEMC mode. Writes could be handled with the same kind of code sequence, you just needed to shift flags_and_data left one during the AND. (For more info see here, here, here – especially the FASTMAP_RESULT macros, here, and here for various fastmap build/rebuild + access functions) |
Jon Abbott (1421) 2651 posts |
Without a timer API, how could it regulate overall speed? It can cycle count the instructions so T1..T4 and DMA occur at the correct relative rate to the emulated CPU, but how could it regulate itself to a specific MHz? There was some mention of ArcEm being too fast on a Pi over on StarDot, which reminded me of the issue. It’s a problem I spent a lot of time on with ADFFS, as code runs at roughly the equivalent of an ARM3 at 50% of the MHz of the machine you’re running it on. If/when I add a StrongARM JIT, it would probably be close to a StrongARM at 75% of the host CPU MHz. I had to cheat in the end, by regulating at VSync/frame swap or modifying the code in games that don’t have any regulation. I built the timers into the blitter, as games only use them for palette swapping – again, a nasty hack.

The FastMap code is about as optimal as it could be; neat use of BLX. There’s possibly some optimization to be had by merging in instructions ahead that aren’t affected by the result, to prevent the stalls around the LDMIA, although it would make recoding branching a headache; probably not worth implementing for the few cycles it’s going to get back.

How’s the rotated load done? log is losing the key bits, so does it need another working register, or would you duplicate the code? i.e.
Probably not optimal, requiring two branches and a large chunk of memory. Alternatively:
What does func_ptr do? Can’t the LDRCC load the data directly by offsetting the address in flags_and_data by the logical address it’s mapped to? E.g. LDR R0, &8000 – the page for &8000 is stored at &40000, so the address in flags_and_data is &40000-&8000, or &38000
Jeffrey Lee (213) 6048 posts |
At the moment ArcEm doesn’t regulate the CPU speed, only the peripherals like timers/video/etc. Likewise, RPCEmu doesn’t attempt to regulate its CPU speed. The peripheral timing is based on measuring the number of emulated CPU cycles per second, and then using that to calculate a scaling factor that can convert values in real time (e.g. the desired number of ticks of the 2MHz IOC clock) to the number of emulated CPU cycles.

If we wanted to fix the CPU to run at a specific speed then I think the easiest method would be to disable the code that adjusts the scaling factor (e.g. for emulating an 8MHz ARM2, conversion from IOC clock cycles to CPU cycles would use a fixed scale factor of 4). Then, to regulate the speed of the emulator, use some kind of time-slicing system to put the emulator to sleep at regular intervals if it’s running too fast (stop the CPU, timers, video, etc). Since we’re primarily interested in human interaction, 100Hz time slices using TickerV would seem like a reasonable choice. E.g. for the 8MHz ARM2, you’d have entries in the event queue which will cause the emulator to go to sleep every 8MHz/100 CPU cycles, and have a TickerV handler on the host which wakes the emulator up again.
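In outline, with host_ticks()/host_sleep() standing in for the TickerV plumbing (both hypothetical, as is everything else named here):

#include <stdint.h>

#define CPU_HZ       8000000u            /* emulating a fixed 8MHz ARM2     */
#define IOC_HZ       2000000u
#define IOC_TO_CPU   (CPU_HZ / IOC_HZ)   /* fixed scale factor: 4           */
#define SLICE_CYCLES (CPU_HZ / 100)      /* one 100Hz time slice            */

extern uint32_t host_ticks(void);        /* centiseconds since startup      */
extern void     host_sleep(void);        /* block until the next host tick  */

static uint32_t slices_run;

void end_of_slice_event(void)            /* queued every SLICE_CYCLES       */
{
    slices_run++;
    while (slices_run > host_ticks())    /* ahead of real time?             */
        host_sleep();                    /* park the CPU/timers/video       */
    /* re-queue ourselves SLICE_CYCLES from now */
}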
There’s another temporary register. The code only does the rotation if the low two bits are non-zero, like with your second example.
It’s a pointer to a read/write handler for any special types of memory – e.g. IOC/MEMC hardware registers. The way the FastMap flags works allows for full control over whether the pointer is used just for reads, just for writes, or for both. And because each page can be given a unique function pointer, it avoids the need to have one big handler function which does a bunch of address decoding to work out what peripheral is being accessed (although I don’t think I actually did any profiling to prove that this was better than having a static BL to a generic handler function) With a JIT system, any of the RAM pages which contain code would have a pointer to a write handler which checks for the code being overwritten.
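Such a write handler might look like this – all of the names are hypothetical:

#include <stdint.h>

extern uint32_t *translate_to_host(uint32_t log);  /* ordinary RAM mapping   */
extern int       word_is_translated(uint32_t log); /* instruction seen here? */
extern void      discard_block_for(uint32_t log);  /* drop affected block(s) */

uint32_t codepage_access(uint32_t log, uint32_t data, int is_write)
{
    uint32_t *host = translate_to_host(log);
    if (!is_write)
        return *host;              /* reads behave like any other RAM page   */
    *host = data;                  /* the write itself proceeds as normal    */
    if (word_is_translated(log))   /* did it overwrite a known instruction?  */
        discard_block_for(log);
    return 0;
}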
That’s exactly what it does :-) Emulator wants to load logical address &8000. Let’s say that maps to (emulated) physical address &123000. The emulator allocates one big block of memory to contain all of RAM; let’s say that’s at &76500 in the host’s memory (n.b. must be 256-byte aligned due to the shift that’s applied). When the fast map is being generated, the entry for the logical page at &8000 will have flags_and_data set to ((&76500 + &123000) - &8000) >> 8 = &1915 in the low bytes, plus the appropriate flags in the top byte. |
Jon Abbott (1421) 2651 posts |
How would STR / STM detect self-modifying code? A couple of options:
Option 2 seems the better option as it’s essentially free; it will need to track the address, and possibly the size, of each instruction anyway, for re-encoding branches. Once self-modifying code is detected, we drop the whole JIT block.

How many instructions does a block handle, and how would we allocate/track the JIT blocks? The lower the number of instructions per block the better – 32, or even 16? I presume the JIT blocks would be heap based, to allow for easy allocation/deallocation?

Going back to how the JIT initially scans the instructions and builds up the JIT block: it may be preferable to ignore any terminating instruction and encode N instructions to a temp area, so we know the size of the block required; a heap allocation is then made and the block modified and copied to the heap-allocated block, re-encoding any branches if appropriate. This does however have the drawback that data after a terminating instruction may be interpreted as an instruction. The alternative would be to observe the terminating instruction, presuming it’s unlikely any further instructions in the initial block would be called.

This issue raises the question of how it would handle adding additional code to blocks that have already been partially handled, but terminated early. This would happen quite often due to subroutines exiting mid-block. Perhaps we should consider using a dynamic instruction count per JIT block, instead of a fixed number of instructions? What are the implications/challenges of going dynamic? |
Jeffrey Lee (213) 6048 posts |
My initial thoughts were to go with option 1. As with most things being discussed here, I think the key thing is to remember to profile the code and think about different algorithms as appropriate.
I’m not sure how many instructions there’d be. My initial thoughts were that the only hard limit would be that the JIT code block can’t cross an emulated page boundary, so you’d have a theoretical max of 1024 instructions. But since the JIT will be based on (rudimentary) code analysis I’d expect the algorithm to stop the code generation much sooner than that. If the code analysis is struggling then maybe some extra limits will be necessary; we’ll have to wait and see.

Earlier on I suggested that memory allocation for the JIT would be in fixed-size chunks (“JIT code pages”), with all of the blocks allocated on emulator startup to ensure branch instructions can be used to jump between them. Allocating/freeing blocks on demand from the heap would probably kill performance; in the distant past when working on DeathDawn I had some profiling results which suggested CLib malloc/free were pretty slow. Using fixed-size blocks is my way of pre-emptively combating that (since it’s trivial to create a lightning-fast small object allocator if all your blocks are the same size). I’d guess some kind of pseudo-LRU or round-robin scheme will be used to decide which blocks to evict from the JIT code cache when it runs out of space.
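For reference, the same-size-block allocator really is trivial – a free list threaded through the chunks themselves. A sketch (sizes illustrative):

#include <stddef.h>

#define CHUNK_SIZE  1024                 /* bytes per JIT code page          */
#define NUM_CHUNKS  4096                 /* 4MB cache, fixed at startup      */

static unsigned char pool[NUM_CHUNKS][CHUNK_SIZE];
static void *free_list;

void chunk_init(void)
{
    for (int i = 0; i < NUM_CHUNKS; i++) {
        *(void **)pool[i] = free_list;   /* link through the first word      */
        free_list = pool[i];
    }
}

void *chunk_alloc(void)
{
    void *c = free_list;
    if (c) free_list = *(void **)c;
    return c;                            /* NULL -> evict a block first      */
}

void chunk_free(void *c)
{
    *(void **)c = free_list;
    free_list = c;
}

Alloc and free are a couple of instructions each, which is why the fixed-size scheme sidesteps the malloc/free cost entirely.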
I think the major difficulty is having the JIT correctly identify where routines terminate. First-pass it would probably want to treat any undefined instruction, any unconditional branch (to outside the page), and any instruction which writes to the address bits of the PC (MOV PC, ADD PC, LDR PC, etc.) as being a terminating instruction. Then we’d look at the stats for how many code blocks are being invalidated due to suspected self-modifying code and try tweaking the algorithm as appropriate. |
Jon Abbott (1421) 2651 posts |
I don’t believe it would affect performance. ADFFS dynamically allocates and frees heap blocks whenever it builds code for an instruction, or when self-modifying code overwrites an instruction that was previously re-encoded; you wouldn’t even know it was occurring. It has a slight impact on the initial encoding of a block, but once it’s encoded it should never be touched again unless it’s self-modified or overwritten. Use malloc to allocate the initial heap space and a bespoke heap manager to alloc/dealloc within it; that’s pretty much what ADFFS does. Dynamically allocating/deallocating and supporting dynamic-length JIT blocks would reduce the memory wasted by using fixed block sizes. It would also reduce the hit of self-modifying code when encoded blocks have to be dropped.
ADFFS terminates on both conditional and unconditional B and BL, anything that alters PC, and the terminal SWIs OS_Exit / OS_ExitAndDie / OS_GenerateError. It terminates on B and BL regardless of them being conditional, as it improved performance on decrypt/uncompress startup sequences. In an attempt to improve the number of instructions encoded at each JIT entry, I added branch prediction, but found this also caused a performance hit due to self-modifying code, so it’s turned off in the public releases. |
Jeffrey Lee (213) 6048 posts |
One of the dangers of this thread – apart from us spending all our time talking instead of writing code – is that it makes me want to drop whatever I’m currently doing and work on a JIT instead. With that in mind I’ve been having a bit of a look at what would be necessary to build a RISC OS version of RPCEmu.
But it’s unlikely that I’ll be able to find the time to do that in any reasonable timeframe (too many other things I should be doing, and too much fighting with the autobuilder+Allegro sources to get over the initial hurdle of setting up a new platform/backend). So instead of working on a RISC OS-friendly JIT for RPCEmu, I think my options are:
Decisions, decisions… |
Jon Abbott (1421) 2651 posts |
My take on this… It’s nowhere near the point of being able to code anything, as very little detail has been defined. I spent four months writing notes, reading papers from EMC and others about ARM on ARM, considering different methods and writing (with pencil and paper!) code examples of various different ways to encode instructions, before writing one line of code. The design and specification need clearly defining, particularly decisive areas of the design that will not be easy to alter later, for example:
Lesser key decisions:
I’ve listed just a few off the top of my head, but there will be plenty more. I would envisage over six months of design and specification before coding anything.

RPCEmu…does licensing permit the IOMD emulation being ripped out and used elsewhere? As RPCEmu isn’t available for RISCOS, it wouldn’t be much use as a base emulator; in fact there are large chunks of it you’d rip out anyway for ARM on ARM. If what’s being proposed does not support RPC apps from the outset (re StrongARM), the pitchforks will probably come out, as the requirement is for a true replacement for Aemulor; as you point out however, ARM3 system emulation is more advanced and available for RISCOS.
On this point, it could paravirtualize the vector entry/exit within the client OS and only provide 26bit CPU modes. The RISCOS versions that need to be covered don’t use 32bit after all, immediately switching to 26bit. There is however the obvious drawback of not supporting client code that takes over the hardware vectors, because it expects a 32bit PC on entry. |
Jeffrey Lee (213) 6048 posts |
Wow – you and I have very different approaches to writing code! Admittedly, if I was aiming to implement a para/hyper virtualisation system then I’d also need to do a fair amount of research, since I’ve never needed to look into the details of virtualisation on ARM before. But in general I tend to take a very light approach to designing code; I’ll plan out some of the major challenges on paper/computer but leave the minor ones to sort themselves out during the implementation, refactoring or reimplementing bits of code as necessary.

My fear is that if I put too much effort into designing the code then I’ll find that the design document/pseudocode has turned into an almost-complete implementation of the software, except it’s written in this funny language called English which computers don’t understand, so I’m going to have to waste time rewriting it all. Or when it actually comes to implementing the design I’ll realise that it’s complete garbage, and I would have saved myself a lot of trouble if I’d gone down the rapid prototyping/evolutionary design approach and started the implementation much sooner.

Of course I also realise the choice of implementation language will have an effect on how much prep work you do. Refactoring C is going to be easier than refactoring assembler, and if I’m working on a non-trivial algorithm that’s going to be implemented in assembler then I’ll often prototype in BASIC or BASIC assembler before translating to the final form.
My POV is:
With the lesser decisions being:
Meanwhile, I’m ready to dive in right now :-) A few weeks of focused development could result in a working solution, or a useful failure to learn from. I guess part of the problem is that you’re approaching this from the angle of “how to seamlessly run 26bit ARMv3/v4 apps in a 32bit ARMv4+ OS”, while I’m approaching it from the angle of “how to get good performance from a RiscPC emulator”, with the assumption that people will either be happy with doing everything inside an emulator, or that the Wimp/OS integration will be solved at a future date by PixieDust™.
It’s GPLv2. |
Rick Murray (539) 13840 posts |
I just sorta sit down, bash out some code, and tweak as necessary. ;-) It’s a naff way to do it, I know, but I tend to find if I spend a long time thinking about how to write something, nothing gets written. At least with crappy code there is something to improve. |
Jon Abbott (1421) 2651 posts |
I look forward to seeing the result. A RiscPC emulator for RISCOS would be an achievement in itself; one that can match RiscPC performance would be even more impressive.
True, my interest is in providing backward compatibility. I’d also be looking in detail at where virtualization techniques could be used to improve performance. |
Jeffrey Lee (213) 6048 posts |
As you might expect, I had a go at this over the weekend, implementing a trivial JIT using some of the ideas discussed here. It’s not optimised yet – adding the control logic actually made the emulator run at half speed, but once I got the JIT working correctly it was able to regain a fair amount of that lost performance. So the next steps will be:
One interesting area of development will be working out how to generate code that minimises JIT overheads (e.g. register load/store/move) when dealing with multiple entry points, in-block loops, etc. I suspect that a simple one-pass code generator won’t cut it, and eventually it will have to be replaced with something more sophisticated (two-pass? scan forwards N instructions and plan register usage for each ‘window’? scan the entire page and generate a proper code graph?) |
Jon Abbott (1421) 2651 posts |
My head hurts just thinking about the complexity involved. It could track which registers need to be in the working registers for every address in the block, so branches can quickly restore the registers. This is made somewhat more complicated by the fact they potentially map to any registers, so the register mapping also needs tracking. This obviously needs a huge chunk of memory, so it could instead scan the block from the start to work out which registers map to which at the in-block address it’s going to branch to – not as fast, but it will only be a one-hit scan if the code isn’t self-modifying.

At the branch, it needs to swap registers that are already mapped in to the correct register, if they’re different in the block it’s jumping into. It obviously needs to store any registers which are no longer mapped in. When branching to an address that hasn’t yet been translated, it will need the JIT to build up the block before it can work out the register mappings.

ALU ops and LDR/LDM instructions that branch are a whole different challenge, as they’ll need to dynamically work out what to do with registers before branching. I’ve no idea how these could be implemented without a huge hit to performance and requiring tracking of register mappings for every address. |