ARM on ARM discussion
Jon Abbott (1421) 2651 posts |
The purpose of this discussion is to come to some agreement on the best way to develop an ARM on ARM backend for emulators, to provide support for 26bit software on ARMv5 through ARMv7. We can assume emulation of VIDC, IOC, MEMC, VIDC20, IOMD and the MMU is provided elsewhere. CPU mode switching is also handled elsewhere, with a suitable interface to allow translated code to switch CPU modes. Requirements that must be met:
Additional support for some of the following may also be required:
Methods for code translation:
JIT methods:
Hypercall methods (triggers code translation when running natively):
Codelet call methods (a codelet implements a problematic instruction in native code):
Memory map specifics:
And that’s just some of the possible options off the top of my head. |
Jeffrey Lee (213) 6048 posts |
My preference is to go with the “full JIT” approach. 1:N mapping of instructions, definitely a stack (the JIT will need to call into C functions for some functionality), all code running in user mode, with all address translation handled in software rather than relying on any advanced memory-mapping features offered by the host OS.

To cope with this the JIT will have to be capable of remapping registers in the generated code; this will make the JIT more complex, but it will also allow it to cache commonly needed state in registers (CPU cycle counter, pointer to the emulated ARM register file, address translation results, etc.).

It won’t reach the performance of native code, but it will be portable, and to me that should be the primary goal with any kind of emulation solution (both to reduce future maintenance costs for the emulator, and to make sure as many people as possible can benefit from it – a good ARM backend could benefit emulators running on RISC OS, Linux, AArch32, AArch64, etc.)

Remember that RPCEmu has already solved some of the technical challenges with this approach. The x86 and x64 backends both have to deal with the fact that all instructions need translation and that you can’t do a 1:1 address translation. I’m not expecting us to get acceptable performance right off the bat, but it’s a working foundation which we can build upon. |
Jon Abbott (1421) 2651 posts |
The goal with the JIT in ADFFS was to minimise the number of instructions being recoded, which is around the 10% mark. Cycle counting is not used as it wasn’t really relevant, the aim being to run code as quickly as possible. BL and instructions that reference or use PC are the only instructions recoded. To that end, it’s split data/code, 1:1 instruction, stackless, translates code once, uses co-pro instructions for Hypercalls and branches for codelets, and memory is mapped to source.

I opted for split data/code to allow quick self-modifying code detection: translated code writes all go to the data area, and write a Hypercall to the code area if there’s an instruction at the matched address.

I originally looked into 1:1 vs 1:N instructions, but considered the overheads of going 1:N too high. It would require a separate address translation table to track the instruction address and size, and make self-modifying code quite a challenge to implement without possible memory fragmentation.

The drawback of 1:1 is the impact on execution speed when a problematic instruction is encountered, as it has to break execution. I considered various options here and decided to use some of the lower 32MB to store codelets, so branches could be used from the translated code to codelets which implement the instruction. The disadvantage is that instructions that don’t set PC need individual codelets, as they have to branch back to the next instruction. The more common instructions, such as LDM R13!, {PC}, can however reuse codelets.

One area it doesn’t implement is CPU paravirtualization. I put it low on the initial implementation list, where in hindsight it should have been the first thing implemented. The consequence of this is that self-modifying code that’s elevated isn’t handled; for the most part this meant ensuring games ran as USER and didn’t elevate themselves. There’s one game that won’t run under it (Adventures classic compilation), although this doesn’t run under any emulator either, as none of them implement the instruction set accurately enough.

If I was to code another, capable of hosting an OS, I’d probably stick with 1:1 instruction mapping, look at redirecting L1PT/L2PT to retain 1:1 memory translation, and allow hypervised page table changes and paravirtualization of memory access levels through multiple L1/L2PTs. I’d also use a co-pro or SWI instruction to call codelets, to get around the 32MB branch limit and allow all codelets to be reused.

Stacked vs stackless is a tricky one; it would need careful statistical analysis of code to determine which register to use for stacked. Provided that register isn’t used much in the client code, there would be good advantages to be had around codelet interruption and codelet initialisation and exit. The disadvantages with stackless are the requirement to preserve codelet state during IRQ handling and having to store registers locally, which adds to data cache usage. The advantage of stackless is that only problematic instructions need recoding. |
Jeffrey Lee (213) 6048 posts |
Just to clarify: We are talking about a JIT to use with full-system emulators, yes? A JIT for a full-system emulator like RPCEmu is going to be very different from a JIT for a user-space emulator like Aemulor or ADFFS.
Yeah, it does make things tricky. But if it was easy, where would the fun be? :-)

RPCEmu uses a simple hash table to map the virtual address to the code block – essentially a 1-way associative cache. The memory for the code blocks is preallocated at startup. Simple and fragmentation-free, but a bit wasteful with memory and liable to suffer performance issues (both from code being evicted from the cache, and because – AIUI – every entry point to a routine will result in a new code block being generated, so routines with multiple entry points won’t share code blocks).

The approach I’d be tempted to try would be a lot more complicated:
I’m assuming most code we’d be interested in running would be StrongARM-compatible, so the amount of (performance-critical) self-modifying code should be minimal. Which means we shouldn’t have to worry too much if the cost of overwriting code is quite a bit higher than the cost of writing pure data, or if the cost of generating JIT code blocks is rather higher than with 1:1 translation schemes.

Another way of doing things, which wouldn’t require a massive lookup table the size of ROM+RAM, would be to have a lookup table with one entry per JIT page. The generated code for that JIT page would then have one entry point, which starts by checking the low bits of the PC to see which part of the code block to jump to (or whether it’s a new address and the JIT generator needs invoking). This would trade off RAM usage (and D-cache thrashing) for increased I-cache and branch predictor usage. |
Jon Abbott (1421) 2651 posts |
I started writing on the same subject this morning, which I’ve pasted at the bottom. We’re thinking along similar lines about how 1:N could be implemented.
Correct, but as a replacement for Aemulor/ADFFS. Wimp apps can be presented on the host desktop via a Wimp shim Module (codename: PixieDust™). We can put the specifics of how PixieDust™ works to one side at this stage, as the JIT is the complicated bit; suffice to say it handles passing Wimp messages and window redraws between the client/host.
I considered this exact method originally, but quickly realised you only need to determine if an address contains an instruction or data. You know there’s an instruction there as it will be in the instruction address table; if it’s an instruction you haven’t seen yet, it’s still considered data. The end result being you don’t require a separate table for attributes.
That might have quite an impact if the JIT pages aren’t sufficiently small…32 or 64 instructions perhaps?
The method I employed in ADFFS was to translate until whichever came first: 128 instructions, an instruction that set PC, or a terminal SWI such as OS_GenerateError, OS_Exit etc. It also does predictive branching, but I ended up turning it off as I found it was impacting the performance too much. For example, a branch may rarely if ever be taken, or some self-modifying code means we’ve either wasted time translating instructions or, worse still, it’s branched outside of code so we end up translating data.
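In rough C, that termination rule might look something like the following – the helper, masks and SWI handling below are illustrative rather than ADFFS’s actual code, and the PC-write test is deliberately conservative (a few non-ALU encodings match it too):

#include <stdint.h>

#define MAX_BLOCK_INSTRS 128

#define SWI_OS_Exit          0x11
#define SWI_OS_GenerateError 0x2B
#define SWI_OS_ExitAndDie    0x50

static int instr_writes_pc(uint32_t instr)
{
    /* Data processing with Rd == PC, LDR PC, LDM with PC in the list, B/BL */
    if ((instr & 0x0E000000) == 0x0A000000) return 1;   /* B / BL           */
    if ((instr & 0x0C00F000) == 0x0000F000) return 1;   /* ALU with Rd = PC */
    if ((instr & 0x0C10F000) == 0x0410F000) return 1;   /* LDR PC,...       */
    if ((instr & 0x0E108000) == 0x08108000) return 1;   /* LDM ...{..PC}    */
    return 0;
}

static int terminates_block(uint32_t instr, int count)
{
    if (count >= MAX_BLOCK_INSTRS) return 1;
    if (instr_writes_pc(instr))    return 1;
    if ((instr & 0x0F000000) == 0x0F000000) {           /* SWI              */
        uint32_t swi = (instr & 0x00FFFFFF) & ~0x20000u;  /* drop the X bit */
        if (swi == SWI_OS_Exit || swi == SWI_OS_GenerateError ||
            swi == SWI_OS_ExitAndDie)
            return 1;
    }
    return 0;
}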
Why not retain the state variables in registers? I assume we’re talking about PSR/PC, the stack etc.? I’d leave state management within the JIT as it’s going to be used quite frequently and would waste a lot of space if management were done within each block.
Why do entry points need to be tracked?
I’d let the JIT code itself trigger when the code generator kicks in: pages are already filled with Hypercalls (“do the JIT” as you put it), so will trigger code generation. Self-modifying code can be handled in the same way.
Is memory space really going to be an issue to require code block caching? Apps don’t tend to contain a lot of code; the bulk of the space is going to go on the client OS. Emulating IOC caps client memory to 32MB and IOMD to 128MB (IIRC 256MB caused issues), and the bulk of that memory space is either going to be unallocated or allocated to client DAs that won’t contain code.
StrongARM-compatible self-modifying code is actually more problematic than ARM2/3 code, if the JIT starts clearing code based on OS_SynchroniseCodeArea. The problem being lazy code that simply flushes the whole cache, instead of a range. I had real performance issues with StrongARM code in ADFFS until I chose to completely ignore OS_SynchroniseCodeArea.

It would need to cover both IOC (re ARM2/3) and IOMD (re ARM610/ARM700/ARM710/SA) based machines, be that in the same emulator or via separate emulators.

I can’t speak for applications or the OS, as I’ve not done any code analysis on them. So far as games go, the performance-critical self-modifying bits tend to be small code changes specific to one routine, covering one or two instructions. Protection and decompression do however tend to take a big hit during startup, but that’s not so much of a concern.

The biggest performance impact is going to be stacked code; some C code has a nasty habit of putting a small SWI code block onto the stack and branching to it, instead of using OS_CallASWI. These grated on me so much, I re-wrote the damn things when getting games running under ADFFS! They tend to follow the same code structure so could be detected and handled during code generation if required. Stacked code is very bad for performance, as it marks the stack page as potentially containing code. There’s also the issue of code that places the stack within a code page.

DarkWood running with high graphics settings at 25fps…there’s a challenge!

My post that I didn’t quite finish this morning:

Implementing 1:N

Ignoring how self-modifying code is detected for the moment, what methods could be employed to implement 1:N? The challenges that need resolving are:

Translating the client code address to a translated address

I see a couple of potential issues here. With 1:N you’d be translating code as you see it, so the order of the translated code doesn’t necessarily match the order of the client code. This raises two questions:

1. How to track the translated instruction code address and size?
2. Where to place translated code?
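One illustrative answer to both questions (all names here are hypothetical): keep a per-page table mapping each client word to its generated code, with the generated code itself living in fixed-size chunks so it can’t fragment:

#include <stdint.h>

#define PAGE_WORDS 1024                  /* one 4KB client page */

typedef struct {
    uint32_t *host[PAGE_WORDS];          /* client word -> start of generated code */
    uint8_t   length[PAGE_WORDS];        /* host instructions generated per word   */
} page_map_t;

static inline uint32_t *lookup_translation(page_map_t *map, uint32_t client_addr)
{
    /* NULL means "data, or not translated yet" */
    return map->host[(client_addr & 0xFFF) >> 2];
}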
|
Jeffrey Lee (213) 6048 posts |
When a memory write hits a location flagged as “data”, the attributes for that entire JIT page will be reset to default, and the generated code will be discarded

Typo there from me – when a write hits a location flagged as code the attributes for the entire page will be reset. Clearly if you’re just overwriting data then the JIT doesn’t need to care about it!

When the JIT reaches an instruction which may branch elsewhere, an epilogue instruction sequence will be generated which will write any dirty state variables back to the main emulator state structure and then either branch to the next code block (looking it up in the translation map) or exit to the main emulator loop (which will then do the branch). (I’m not sure which approach would be best)

By state variables I mean the state of the virtual machine – the emulated ARM registers, the cycle counter/event timer, etc. There won’t be any fixed mapping of emulated registers to host registers; it will be different for each code block. E.g. a simple round-robin or LRU strategy to assign emulated registers to host registers as code for each instruction is generated. Because the mapping will be unique for each block, and potentially unique for each exit point within a block, it makes sense to have the block contain code to write back the dirty registers itself.

If control flow passes over another location marked as an entry point

The answer to this will depend on what viewpoint/approach you’re coming from. RPCEmu generates a new, independent code block for each initial address the JIT is invoked for. So if you have a function which has two entry points but a shared body then the JIT will produce two code blocks for the different entry points, and possibly also end up duplicating the function body. If you end up with multiple copies of code like this then it will bloat/pollute the JIT code cache and hurt the host machine’s I-cache usage. For something closer to ADFFS, where there’s no duplication of JIT’d code, tracking entry points will allow for more efficient code to be generated.

We’ve got to squeeze 15 general-purpose registers and some special registers like the PC, PSR, cycle counter, etc. into about 10 host registers. You can’t have all those registers active at once, so you need to load and save them on demand from the emulator’s state structure. But loading and saving them around every instruction (in order to allow JIT routines to be entered from any point) would be wasteful – you want to load a register, keep it resident for a few instructions, and then save it back at a later date. To make that possible you need to limit the number of entry and exit points for each generated code block.

I guess my initial description of it being a 1:N JIT is inaccurate. Really it would be an N:M JIT, because it will be designed to work on indivisible blocks of code. There’s also the possibility that code sequences could be optimised or rewritten entirely (NOP instructions like MOV R0,R0 can be omitted, sequences of conditional instructions – especially complex ones like load/store or SWI – may use branches to split the execution flow, etc).
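As a sketch of that allocation strategy (purely illustrative – not RPCEmu’s code, and all names are hypothetical), round-robin assignment with dirty write-back might look like:

#include <stdint.h>

#define NO_REG 0xFF

typedef struct {
    uint8_t  host_for_emu[16];  /* emulated reg -> host reg, or NO_REG      */
    uint8_t  emu_for_host[16];  /* host reg -> emulated reg, or NO_REG      */
    uint16_t dirty;             /* bit n set: emulated Rn modified in block */
    unsigned next;              /* round-robin cursor over the host pool    */
} regmap_t;                     /* initialise all mappings to NO_REG        */

static const uint8_t host_pool[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };

void emit_load_from_state(int emu_reg, int host_reg);   /* hypothetical emitters */
void emit_store_to_state(int emu_reg, int host_reg);

int map_reg(regmap_t *m, int emu_reg)
{
    if (m->host_for_emu[emu_reg] != NO_REG)
        return m->host_for_emu[emu_reg];      /* already resident in a host reg */

    int host = host_pool[m->next++ & 7];      /* round-robin victim selection   */
    int old  = m->emu_for_host[host];
    if (old != NO_REG) {
        if (m->dirty & (1u << old))
            emit_store_to_state(old, host);   /* spill the previous occupant    */
        m->host_for_emu[old] = NO_REG;
    }
    emit_load_from_state(emu_reg, host);      /* fetch from the state structure */
    m->host_for_emu[emu_reg] = host;
    m->emu_for_host[host]    = emu_reg;
    return host;
}

/* At each exit point, walk 'dirty' and emit_store_to_state() for every
   modified register, so the epilogue writes the state structure back. */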
Well, it will be useful to know that your emulator isn’t going to slowly gobble all the free memory on the host system :-) The cache size would probably be quite generous, but placing a hard upper bound is always a good idea.
Yeah, the intention is that the JIT will ignore most/all of the cache maintenance requests made by the emulated code. In effect it will be emulating a cacheless ARM with no instruction prefetch pipeline.
In terms of the JIT, I think IOMD is all we need to worry about for now. We already have plenty of Arc emulators that run on RISC OS. |
Jon Abbott (1421) 2651 posts |
How many client instructions are there in the block? I presume it would pre-scan the client block, decide on register allocation and then encode the whole block, or finish at a terminating instruction.
There may be some mileage in counting register use and allocating based on the most heavily used. It will slow encoding, but will result in faster code if there’s no self-modifying done within the block.
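A minimal sketch of such a pre-scan, assuming a deliberately crude decoder (only the data-processing case is shown; the names are hypothetical):

#include <stdint.h>
#include <stddef.h>

void count_register_use(const uint32_t *block, size_t n, unsigned counts[16])
{
    for (size_t i = 0; i < n; i++) {
        uint32_t instr = block[i];
        if ((instr & 0x0C000000) == 0) {     /* data processing, roughly    */
            counts[(instr >> 16) & 0xF]++;   /* Rn                          */
            counts[(instr >> 12) & 0xF]++;   /* Rd                          */
            if (!(instr & (1u << 25)))       /* register form of operand2   */
                counts[instr & 0xF]++;       /* Rm                          */
        }
        /* ...similar cases for loads/stores, multiplies, etc. */
    }
    /* Host registers are then handed out to the highest counts first. */
}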
Definitely a requirement with each block entry/exit being unique.
I presume C entry/exit would be managed so flags are retained in the USER PSR for faster code execution? Can R14 not be used if entry/exit to C is managed, retaining just the stack, with state variables being accessed as required?
The ability to optimize code is definitely worth implementing, as it would allow problematic code sequences to be rewritten.
I suppose it gives the user some control over the amount of memory being used, beyond the base memory allocation.
You can safely ignore all cache maintenance requests, as the JIT will be managing the cache itself, essentially cleaning any writes to JIT’d code blocks. Writes to the client memory space will be data so far as the JIT is concerned, irrespective of client code being overwritten by self-modifying writes. Not implementing at least stage 2 of the pipeline will cause some software to fail, particularly writes up to and including the current instruction, which are quite common in decrypt, decompress and protection routines. I implemented it in ADFFS for the following games:
The majority of uses I’ve come across rely on STR Rx,[PC,#-4] working, closely followed by STMs that overwrite themselves. These could be detected and handled by the optimization routine to avoid implementing a pipeline, which for a JIT is an interesting challenge! |
Jeffrey Lee (213) 6048 posts |
Most of the time I’d expect the encoding to stop when it finds a terminating instruction, e.g. a SWI or a branch to a location outside of the current JIT code page. I guess a good performance optimisation to make to the JIT would be to make it automatically flag the next instruction after a SWI or BL as being an entry point, to save on the code generator being invoked again the first time the SWI/BL returns (it would only be in very rare cases that a SWI/BL never returns).
Yeah, the JIT will save/restore the emulated NZCV flags as necessary. Chances are it will treat them the same as other registers – e.g. if you’ve got a sequence of unconditional instructions, none of which set the flags, then there’s no need to bother loading the NZCV into the PSR.
Probably, yeah. I think it mostly boils down to whether you want to get useful stack traces when the code crashes (and if you’re using CLib, whether you want the stack unwinding that’s done for upcall/event dispatch to work correctly). One of the nuisances with C is that APCS functions are allowed to clobber R0-R3 and R12, so the JIT will probably want to favour R4-R11 for most cases and only use R0-R3, R12 and R14 for temporaries or for C calls.
Yeah, I guess the JIT would be able to detect dangerous PC-relative STRs pretty easily. It’s the more general case of using a different base register that has me worried – you’d need to do a PC check for every memory write. Although I guess you might be able to get 99% compatibility by building it into the code that does the JIT code page overwrite detection.

One option for dealing with pipelining, assuming the emulator also contains a (correctly pipelined) interpreter engine, would be to cause the STR to be cancelled and then re-execute the instruction using the interpreter. For each instruction the interpreter could cross-check the prefetched instructions against what’s in memory, and when it spots that it’s gone past the self-modifying section it can go back to using the JIT. |
Jon Abbott (1421) 2651 posts |
Losing 6 registers could seriously hit performance, but this will depend greatly on where most performance is lost. I expect memory access may be the slowest area, as stores will be performing self-modifying checks as well as address translation. How would LDR/STR/LDM/STM be implemented? Within the block or external? If they’re not touching PC and only translating Rd, in-block should be easy enough; if they’re altering PC, there’s an argument for centralising the code so it can be used by ALU ops that set PC.

The majority of uses I’ve come across rely on STR Rx,[PC,#-4]…

I don’t believe detecting the write will impact performance noticeably; STR/STM are probably going to be the slowest instructions anyhow. Implementing the pipelined instruction dynamically however would pose some challenges.
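Detecting the STR Rx,[PC,#-4] idiom at translation time should at least be cheap. A sketch, using the masks for the basic immediate-offset STR encoding (any condition code, any source register):

#include <stdint.h>

static int is_str_pc_minus_4(uint32_t instr)
{
    /* STR Rx,[PC,#-4]: single data transfer, immediate offset, pre-indexed,
       U=0 (subtract), L=0 (store), Rn=PC (R15), offset of 4 */
    return (instr & 0x0FFF0FFF) == 0x050F0004;
}

The generator could then route any match to a codelet, or fall back to the interpreter as suggested above.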
Falling back to emulation is probably the easiest solution. How would cycle counting and timers be implemented in the JIT’d code? Via a centralised routine that’s called every N instructions and at B/BL? How would DMA channels be implemented? Threaded? Or as part of the cycle counter/timer code? |
Jeffrey Lee (213) 6048 posts |
Doing them in-block sounds like a worthy goal to me. What I’d like to aim for is a situation where the generated code is able to cache the result of the address translation + permission check. That way sequences of load/store instructions which target regular (data) RAM pages should be nice and fast. To avoid over-complicating things, loads/stores which require more complex behaviour (IO memory accesses or write tracking for writes to code pages) will probably take an unoptimised path which uses function calls.
It’ll probably be very similar to how I implemented it for ArcEm: One priority queue which contains all of the events which are scheduled for a future time (timer interrupts, audio/video DMA, keyboard/mouse update, etc.) and a simple cycle counter which counts down the time remaining until the next event should fire. The generated code can then be responsible for decrementing the counter at suitable times and calling the event dispatcher when it expires.
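In outline (the names and layout below are mine, not ArcEm’s actual code), that scheme might look like:

#include <stdint.h>

typedef void (*event_fn)(void);

typedef struct event {
    uint64_t      due;     /* absolute emulated cycle at which to fire      */
    event_fn      fire;    /* timer interrupt, DMA fetch, vsync, ...        */
    struct event *next;    /* singly linked list kept sorted by 'due'       */
} event_t;

static event_t *queue;     /* head = soonest event                          */
static uint64_t now;       /* emulated cycles executed so far               */
static int32_t  countdown; /* cycles until the head event is due            */

void schedule_event(event_t *e, uint64_t when)
{
    event_t **p = &queue;
    e->due = when;
    while (*p && (*p)->due <= when) p = &(*p)->next;
    e->next = *p;
    *p = e;
    if (queue == e) countdown = (int32_t)(e->due - now);
}

void dispatch_events(void)             /* called when countdown reaches <= 0 */
{
    if (!queue) return;
    now = queue->due + (uint64_t)(-countdown);  /* account for any overshoot */
    while (queue && queue->due <= now) {
        event_t *e = queue;
        queue = e->next;
        e->fire();                     /* handlers may schedule_event() again */
    }
    countdown = queue ? (int32_t)(queue->due - now) : INT32_MAX;
}

/* The generated code then just does, in effect:
       countdown -= cycles_for_this_sequence;
       if (countdown <= 0) dispatch_events();                                */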
Using the event queue system. |
Jon Abbott (1421) 2651 posts |
How would LDR/STR/LDM/STM be implemented?

How would you implement the TLB in-block? Where, for example, would the pointers to L1PT/L2PT be stored? At the beginning/end of the block, or in-line with branches over them? Have you tried coding an example LDR to see what the implementation issues and possible optimizations/pitfalls are?
I’m not sure this would be possible, as the JIT won’t necessarily know if the load/store is addressing IOMD at the time it builds the code. Have you considered running the code on a dedicated core and mapping memory 1:1 to avoid emulated memory access? This would get it closer to ARM on ARM than emulation, which is essentially what you’re proposing, albeit with instructions coded into blocks instead of being done with centralised routines. Few original instructions would be executing without being modified, assisted or emulated in-line. I can’t help but think this is a very complicated implementation that’s not going to see much in the way of performance gains over an emulator – possibly two or three times the performance at best. |
Jeffrey Lee (213) 6048 posts |
I’d expect all pointers or other awkward constants would be stored at the end of the block. Since the blocks will be fixed-size chunks of memory this will be pretty straightforward. (Or for ARMv7+, it would be a good chance to see whether MOVW+MOVT really does make a difference)
I’d want to implement the MMU handling in a very similar way to the FastMap system that I implemented for ArcEm: a flat logical → physical translation table which encodes the entries in a devious manner in order to minimise the decoding cost. This meant that the MMU translation, and reads/writes for regular RAM pages, could easily be handled by inline code. IO memory or RAM pages which had special handlers associated with them (e.g. one of the video driver options was to track writes to screen memory) would be handled by calling a function pointer that was contained in the fast map entry (allowing it to go direct to the correct handler for the IO device being accessed).

The FastMap system did rely on the ability to detect when the page tables were modified, so that the FastMap could be kept up to date. This is trivial for MEMC emulation (page tables are hardware registers). For ARMv3+ it would probably have to be done by associating write handler functions with any pages which are used for page tables. A bit trickier, but still possible.
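At the C level, each (two-word) FastMap entry would presumably look something like the following – the flag layout is guessed from the description, so treat it as illustrative rather than ArcEm’s exact scheme:

#include <stdint.h>

typedef uint32_t (*access_fn)(uint32_t addr, uint32_t data, int is_write);

typedef struct {
    uint32_t  flags_and_data;   /* flags in the top byte; (host - logical)   */
                                /* address offset, shifted right 8, below it */
    access_fn func;             /* handler for IO / specially-tracked pages  */
} fastmap_entry_t;

extern fastmap_entry_t fastmap[];        /* one entry per 4KB logical page   */
#define FASTMAP_DIRECT_READ 0x80000000u  /* illustrative flag bit            */

static inline uint32_t fastmap_read32(uint32_t log)
{
    log &= 0x03FFFFFC;                           /* 26bit bus, word aligned  */
    fastmap_entry_t *e = &fastmap[log >> 12];
    if (e->flags_and_data & FASTMAP_DIRECT_READ) /* plain RAM: direct access */
        return *(uint32_t *)(uintptr_t)(log + (e->flags_and_data << 8));
    return e->func(log, 0, 0);                   /* IO/special: call handler */
}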
I think the way I’d want to do this is have the JIT generate two separate code paths. E.g.:

LDR R1,[R0,#foo]
LDR R2,[R0,#bar]
LDR R3,[R0,#wibble]

->

min_addr = (R0 + min(foo, bar, wibble)) & ~3
max_addr = (R0 + max(foo, bar, wibble)) & ~3
tlb = mmu.translate((R0 + foo) & ~3)
if tlb.is_ram && (min_addr >> 12) == (max_addr >> 12) {
  // Use log -> host address offset returned in TLB entry to generate a
  // sequence of load instructions for R1, R2, R3.
  // Potentially split this into another two code paths if we can predict
  // that all addresses are likely to be word-aligned (e.g. foo, bar, wibble
  // are all word-aligned)
} else {
  // Load R1 (could abort, could be direct access, could be function call)
  // Do another MMU translation for [R0,#bar]
  // Load R2 (could abort, could be direct access, could be function call)
  // Do another MMU translation for [R0,#wibble]
  // Load R3 (could abort, could be direct access, could be function call)
}

Of course the magic is going to be having the JIT detect the places where the above kind of optimisation is going to be possible.
Well, the obvious problem with that is that not all machines are multi-core :-) The second problem would be: where would the JIT code live? AIUI, for the CPU to be able to execute code it needs to be in a readable page, so without some kind of address filtering around all load instructions the emulated code will easily be able to see that it’s living in a simulation. And even with address filtering present there’d be a chance the emulated machine could pick a memory map which makes it awkward (or impossible) for the host to find spare logical address space in which to insert the emulator code. |
Jeffrey Lee (213) 6048 posts |
An example of how the FastMap stuff is intended to work for a simple LDR:

BIC   log, log, #&fc000003             ; Deal with 26bit ARMv2 address bus, and mask out bottom two bits to make it word aligned
MOV   temp, log, LSR #12
ADD   temp, fastmap_ptr, temp, LSL #3
LDMIA temp, {flags_and_data, func_ptr} ; flags_and_data contains flags in the top byte, and a (shifted) logical -> host address offset in the low bytes
AND   result, fastmap_mode, flags_and_data
MOVS  result, result, LSL #1
BEQ   abort                            ; abandon this memory access and trigger a data abort
LDRCC data, [log, flags_and_data, LSL #8]
BLXCS func_ptr                         ; return with 'data' containing the loaded value (assume the compiler arranged the in/out registers sensibly)
; + extra code here to deal with rotated loads

Possibly there’s an extra instruction there which can be dropped – my original notes (if they still exist) will be at home. fastmap_ptr would be the pointer to the fastmap table, fastmap_mode is a precomputed flags word indicating the current CPU/MEMC mode. Writes could be handled with the same kind of code sequence, you just needed to shift flags_and_data left one during the AND. (For more info see here, here, here – especially the FASTMAP_RESULT macros, here, and here for various fastmap build/rebuild + access functions) |
Jon Abbott (1421) 2651 posts |
Without a timer API, how could it regulate overall speed? It can cycle count the instructions so T1..T4 and DMA occur at the correct relative rate to the emulated CPU, but how could it regulate itself to a specific MHz? There was some mention of ArcEm being too fast on a Pi over on StarDot, which reminded me of the issue. It’s a problem I spent a lot of time on with ADFFS, as code runs at roughly the equivalent of an ARM3 at 50% of the MHz of the machine you’re running it on. If/when I add a StrongARM JIT, it would probably be close to a StrongARM at 75% of the host CPU MHz. I had to cheat in the end, by regulating at VSync/frame swap or modifying the code in games that don’t have any regulation. I built the timers into the blitter, as games only use them for palette swapping – again, a nasty hack.

The FastMap code is about as optimal as it could be; neat use of BLX. There’s possibly some optimization to be had by merging in instructions ahead that aren’t affected by the result, to prevent the stalls around the LDMIA, although it would make recoding branching a headache; probably not worth implementing for the few cycles it’s going to get back.

How’s the rotated load done? log is losing the key bits, so does it need another working register, or would you duplicate the code? i.e.
Probably not optimal, requiring two branches and a large chunk of memory. Alternatively:
What does func_ptr do? Can’t the LDRCC load the data directly by offsetting the address in flags_and_data by the logical address it’s mapped to? E.g. LDR R0, &8000 – the page for &8000 is stored at &40000, so the address in flags_and_data is &40000-&8000, or &38000
Jeffrey Lee (213) 6048 posts |
At the moment ArcEm doesn’t regulate the CPU speed, only the peripherals like timers/video/etc. Likewise, RPCEmu doesn’t attempt to regulate its CPU speed. The peripheral timing is based on measuring the number of emulated CPU cycles per second, and then using that to calculate a scaling factor that can convert values in real time (e.g. the desired number of ticks of the 2MHz IOC clock) to the number of emulated CPU cycles.

If we wanted to fix the CPU to run at a specific speed then I think the easiest method would be to disable the code that adjusts the scaling factor (e.g. for emulating an 8MHz ARM2, conversion from IOC clock cycles to CPU cycles would use a fixed scale factor of 4). Then, to regulate the speed of the emulator, use some kind of time-slicing system to put the emulator to sleep at regular intervals if it’s running too fast (stop the CPU, timers, video, etc). Since we’re primarily interested in human interaction, 100Hz time slices using TickerV would seem like a reasonable choice. E.g. for the 8MHz ARM2, you’d have entries in the event queue which will cause the emulator to go to sleep every 8MHz/100 CPU cycles, and have a TickerV handler on the host which wakes the emulator up again.
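In outline, with host_ticks()/host_sleep() standing in for the TickerV plumbing (both hypothetical, as is everything else named here):

#include <stdint.h>

#define CPU_HZ       8000000u            /* emulating a fixed 8MHz ARM2     */
#define IOC_HZ       2000000u
#define IOC_TO_CPU   (CPU_HZ / IOC_HZ)   /* fixed scale factor: 4           */
#define SLICE_CYCLES (CPU_HZ / 100)      /* one 100Hz time slice            */

extern uint32_t host_ticks(void);        /* centiseconds since startup      */
extern void     host_sleep(void);        /* block until the next host tick  */

static uint32_t slices_run;

void end_of_slice_event(void)            /* queued every SLICE_CYCLES       */
{
    slices_run++;
    while (slices_run > host_ticks())    /* ahead of real time?             */
        host_sleep();                    /* park the CPU/timers/video       */
    /* re-queue ourselves SLICE_CYCLES from now */
}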
There’s another temporary register. The code only does the rotation if the low two bits are non-zero, like with your second example.
It’s a pointer to a read/write handler for any special types of memory – e.g. IOC/MEMC hardware registers. The way the FastMap flags works allows for full control over whether the pointer is used just for reads, just for writes, or for both. And because each page can be given a unique function pointer, it avoids the need to have one big handler function which does a bunch of address decoding to work out what peripheral is being accessed (although I don’t think I actually did any profiling to prove that this was better than having a static BL to a generic handler function) With a JIT system, any of the RAM pages which contain code would have a pointer to a write handler which checks for the code being overwritten.
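Such a write handler might look like this – all of the names are hypothetical:

#include <stdint.h>

extern uint32_t *translate_to_host(uint32_t log);  /* ordinary RAM mapping   */
extern int       word_is_translated(uint32_t log); /* instruction seen here? */
extern void      discard_block_for(uint32_t log);  /* drop affected block(s) */

uint32_t codepage_access(uint32_t log, uint32_t data, int is_write)
{
    uint32_t *host = translate_to_host(log);
    if (!is_write)
        return *host;              /* reads behave like any other RAM page   */
    *host = data;                  /* the write itself proceeds as normal    */
    if (word_is_translated(log))   /* did it overwrite a known instruction?  */
        discard_block_for(log);
    return 0;
}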
That’s exactly what it does :-) Emulator wants to load logical address &8000. Let’s say that maps to (emulated) physical address &123000. The emulator allocates one big block of memory to contain all of RAM; let’s say that’s at &76500 in the host’s memory (n.b. must be 256-byte aligned due to the shift that’s applied). When the fast map is being generated, the entry for the logical page at &8000 will have flags_and_data set to ((&76500 + &123000) - &8000) >> 8 = &1915 in the low bytes, plus the appropriate flags in the top byte. |
Jon Abbott (1421) 2651 posts |
How would STR / STM detect self-modifying code? A couple of options:
Option 2 seems the better option as it’s essentially free; it will need to track the address, and possibly the size, of each instruction anyway, for re-encoding branches. Once self-modifying code is detected, we drop the whole JIT block.

How many instructions does a block handle, and how would we allocate/track the JIT blocks? The lower the number of instructions per block the better – 32, or even 16? I presume the JIT blocks would be heap based, to allow for easy allocation/deallocation?

Going back to how the JIT initially scans the instructions and builds up the JIT block: it may be preferable to ignore any terminating instruction and encode N instructions to a temp area, so we know the size of the block required; a heap allocation is then made and the block modified and copied to the heap-allocated block, re-encoding any branches if appropriate. This does however have the drawback that data after a terminating instruction may be interpreted as an instruction. The alternative would be to observe the terminating instruction, presuming it’s unlikely any further instructions in the initial block would be called.

This issue raises the question of how it would handle adding additional code to blocks that have already been partially handled, but terminated early. This would happen quite often due to subroutines exiting mid-block. Perhaps we should consider using a dynamic instruction count per JIT block, instead of a fixed number of instructions? What are the implications/challenges of going dynamic? |
Jeffrey Lee (213) 6048 posts |
My initial thoughts were to go with option 1. As with most things being discussed here, I think the key thing is to remember to profile the code and think about different algorithms as appropriate.
I’m not sure how many instructions there’d be. My initial thoughts were that the only hard limit would be that the JIT code block can’t cross an emulated page boundary, so you’d have a theoretical max of 1024 instructions. But since the JIT will be based on (rudimentary) code analysis I’d expect the algorithm to stop the code generation much sooner than that. If the code analysis is struggling then maybe some extra limits will be necessary; we’ll have to wait and see.

Earlier on I suggested that memory allocation for the JIT would be in fixed-size chunks (“JIT code pages”), with all of the blocks allocated on emulator startup to ensure branch instructions can be used to jump between them. Allocating/freeing blocks on demand from the heap would probably kill performance; in the distant past when working on DeathDawn I had some profiling results which suggested CLib malloc/free were pretty slow. Using fixed-size blocks is my way of pre-emptively combating that (since it’s trivial to create a lightning-fast small object allocator if all your blocks are the same size). I’d guess some kind of pseudo-LRU or round-robin scheme will be used to decide which blocks to evict from the JIT code cache when it runs out of space.
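For reference, the same-size-block allocator really is trivial – a free list threaded through the chunks themselves. A sketch (sizes illustrative):

#include <stddef.h>

#define CHUNK_SIZE  1024                 /* bytes per JIT code page          */
#define NUM_CHUNKS  4096                 /* 4MB cache, fixed at startup      */

static unsigned char pool[NUM_CHUNKS][CHUNK_SIZE];
static void *free_list;

void chunk_init(void)
{
    for (int i = 0; i < NUM_CHUNKS; i++) {
        *(void **)pool[i] = free_list;   /* link through the first word      */
        free_list = pool[i];
    }
}

void *chunk_alloc(void)
{
    void *c = free_list;
    if (c) free_list = *(void **)c;
    return c;                            /* NULL -> evict a block first      */
}

void chunk_free(void *c)
{
    *(void **)c = free_list;
    free_list = c;
}

Alloc and free are a couple of instructions each, which is why the fixed-size scheme sidesteps the malloc/free cost entirely.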
I think the major difficulty is having the JIT correctly identify where routines terminate. First-pass it would probably want to treat any undefined instruction, any unconditional branch (to outside the page), and any instruction which writes to the address bits of the PC (MOV PC, ADD PC, LDR PC, etc.) as being a terminating instruction. Then we’d look at the stats for how many code blocks are being invalidated due to suspected self-modifying code and try tweaking the algorithm as appropriate. |
Jon Abbott (1421) 2651 posts |
I don’t believe it would affect performance. ADFFS dynamically allocates and frees heap blocks whenever it builds code for an instruction, or when self-modifying code overwrites an instruction that was previously re-encoded; you wouldn’t even know it was occurring. It has a slight impact on the initial encoding of a block, but once it’s encoded it should never be touched again unless it’s self-modified or overwritten. Use malloc to allocate the initial heap space and a bespoke heap manager to alloc/dealloc within it; that’s pretty much what ADFFS does. Dynamically allocating/deallocating and supporting dynamic-length JIT blocks would reduce the memory wasted by using fixed block sizes. It would also reduce the hit of self-modifying code when encoded blocks have to be dropped.
ADFFS terminates on both conditional and unconditional B and BL, anything that alters PC, and the terminal SWIs OS_Exit / OS_ExitAndDie / OS_GenerateError. It terminates on B and BL regardless of them being conditional, as it improved performance on decrypt/uncompress startup sequences. In an attempt to improve the number of instructions encoded at each JIT entry, I added branch prediction, but found this also caused a performance hit due to self-modifying code, so it’s turned off in the public releases. |
Jeffrey Lee (213) 6048 posts |
One of the dangers of this thread – apart from us spending all our time talking instead of writing code – is that it makes me want to drop whatever I’m currently doing and work on a JIT instead. With that in mind I’ve been having a bit of a look at what would be necessary to build a RISC OS version of RPCEmu.
But it’s unlikely that I’ll be able to find the time to do that in any reasonable timeframe (too many other things I should be doing, and too much fighting with the autobuilder+Allegro sources to get over the initial hurdle of setting up a new platform/backend). So instead of working on a RISC OS-friendly JIT for RPCEmu, I think my options are:
Decisions, decisions… |
Jon Abbott (1421) 2651 posts |
My take on this… It’s nowhere near the point of being able to code anything, as very little detail has been defined. I spent four months writing notes, reading papers from EMC and others about ARM on ARM, considering different methods and writing (with pencil and paper!) code examples of various different ways to encode instructions, before writing one line of code. The design and specification need clearly defining, particularly decisive areas of the design that will not be easy to alter later, for example:
Lesser key decisions:
I’ve listed just a few off the top of my head, but there will be plenty more. I would envisage over six months of design and specification before coding anything.

RPCEmu…does licensing permit the IOMD emulation being ripped out and used elsewhere? As RPCEmu isn’t available for RISCOS, it wouldn’t be much use as a base emulator; in fact there are large chunks of it you’d rip out anyway for ARM on ARM. If what’s being proposed does not support RPC apps from the outset (re StrongARM), the pitchforks will probably come out, as the requirement is for a true replacement for Aemulor; as you point out however, ARM3 system emulation is more advanced and available for RISCOS.
On this point, it could paravirtualize the vector entry/exit within the client OS and only provide 26bit CPU modes. The RISCOS versions that need to be covered don’t use 32bit after all, immediately switching to 26bit. There is however the obvious drawback of not supporting client code that takes over the hardware vectors, because it expects a 32bit PC on entry. |
Jeffrey Lee (213) 6048 posts |
Wow – you and I have very different approaches to writing code! Admittedly, if I was aiming to implement a para/hyper virtualisation system then I’d also need to do a fair amount of research, since I’ve never needed to look into the details of virtualisation on ARM before. But in general I tend to take a very light approach to designing code; I’ll plan out some of the major challenges on paper/computer but leave the minor ones to sort themselves out during the implementation, refactoring or reimplementing bits of code as necessary.

My fear is that if I put too much effort into designing the code then I’ll find that the design document/pseudocode has turned into an almost-complete implementation of the software, except it’s written in this funny language called English which computers don’t understand, so I’m going to have to waste time rewriting it all. Or when it actually comes to implementing the design I’ll realise that it’s complete garbage, and I would have saved myself a lot of trouble if I’d gone down the rapid prototyping/evolutionary design approach and started the implementation much sooner.

Of course I also realise the choice of implementation language will have an effect on how much prep work you do. Refactoring C is going to be easier than refactoring assembler, and if I’m working on a non-trivial algorithm that’s going to be implemented in assembler then I’ll often prototype in BASIC or BASIC assembler before translating to the final form.
My POV is:
With the lesser decisions being:
Meanwhile, I’m ready to dive in right now :-) A few weeks of focused development could result in a working solution, or a useful failure to learn from. I guess part of the problem is that you’re approaching this from the angle of “how to seamlessly run 26bit ARMv3/v4 apps in a 32bit ARMv4+ OS”, while I’m approaching it from the angle of “how to get good performance from a RiscPC emulator”, with the assumption that people will either be happy with doing everything inside an emulator, or that the Wimp/OS integration will be solved at a future date by PixieDust™.
It’s GPLv2. |
Rick Murray (539) 13840 posts |
I just sorta sit down, bash out some code, and tweak as necessary. ;-) It’s a naff way to do it, I know, but I tend to find if I spend a long time thinking about how to write something, nothing gets written. At least with crappy code there is something to improve. |
Jon Abbott (1421) 2651 posts |
I look forward to seeing the result. A RiscPC emulator for RISCOS would be an achievement in itself; one that can match RiscPC performance would be even more impressive.
True, my interest is in providing backward compatibility. I’d also be looking in detail at where virtualization techniques could be used to improve performance. |
Jeffrey Lee (213) 6048 posts |
As you might expect, I had a go at this over the weekend, implementing a trivial JIT using some of the ideas discussed here. It’s not optimised yet – adding the control logic actually made the emulator run at half speed, but once I got the JIT working correctly it was able to regain a fair amount of that lost performance. So the next steps will be:
One interesting area of development will be working out how to generate code that minimises JIT overheads (e.g. register load/store/move) when dealing with multiple entry points, in-block loops, etc. I suspect that a simple one-pass code generator won’t cut it, and eventually it will have to be replaced with something more sophisticated (two-pass? scan forwards N instructions and plan register usage for each ‘window’? scan the entire page and generate a proper code graph?) |
Jon Abbott (1421) 2651 posts |
My head hurts just thinking about the complexity involved. It could track which registers need to be in the working registers for every address in the block, so branches can quickly restore the registers. This is made somewhat more complicated by the fact they potentially map to any registers, so the register mapping also needs tracking. This obviously needs a huge chunk of memory, so it could instead scan the block from the start to work out which registers map to which at the in-block address it’s going to branch to – not as fast, but it will only be a one-hit scan if the code isn’t self-modifying.

At the branch, it needs to swap registers that are already mapped in to the correct register, if they’re different in the block it’s jumping into. It obviously needs to store any registers which are no longer mapped in. When branching to an address that hasn’t yet been translated, it will need the JIT to build up the block before it can work out the register mappings.

ALU ops and LDR/LDM instructions that branch are a whole different challenge, as they’ll need to dynamically work out what to do with registers before branching. I’ve no idea how these could be implemented without a huge hit to performance and requiring tracking of register mappings for every address. |