Physical memory allocation
Jeffrey Lee (213) 6048 posts
A bit of fiddling today has resulted in some improvements to the dynamic area shrink code: it's now about twice as fast as it was before, which puts it on a par with the old AMB shrink code. That's the good news, so now it's time for the bad news again.
So for now I suspect I'll focus only on optimising the common AMB operations – mapping pages in and out – and leave the other memory mapping operations as-is.
Jeffrey Lee (213) 6048 posts
“optimising the common AMB operations” actually turned into “just use the old AMB memory mapping code”. The old code was so well optimised for its purpose that the only way to improve on it would be to pick a better algorithm (e.g. tree-based CAM), or avoid remapping memory altogether (e.g. ASID-based task swapping), two things that are outside the scope of my current changes. And when you consider that (a) a tree-based CAM won’t help much with lazy task swapping and (b) ASID-based task swapping is only possible on ARMv6+, you realise that on ARMv5 and below the current code is pretty much the only option (although a tree-based CAM would be nice for the machines which can’t support lazy task swapping). Since the AMB work has highlighted a lot of issues, I’m now looking at how best to tidy up s.ChangeDyn so that the memory mapping code it uses can more easily be tweaked/replaced in the future. Mainly this involves spotting duplicate bits of code that can be turned into subroutines, and making sure that all bits which use the PageFlags_Unsafe optimisation are standalone routines. A couple of other random observations I’ve made while working on the code:
Jeffrey Lee (213) 6048 posts
Actually, some investigation of the ARM ARM reveals that the newer the architecture, the more support there is for having the MMU read from the caches. So there are a few machines where we could easily make the page tables write-back cacheable, with both the CPU and MMU seeing the benefits.
Jeffrey Lee (213) 6048 posts
Early testing of cacheable pagetables on Cortex-A8 is promising:
The MMU in the A8 can't read from the L1 cache, so it's using an L1 write-through and L2 write-back cache policy for the page tables. I think the MMUs in all the other ARMv7/v8 devices we support are capable of reading from the L1 cache, so they should see even greater benefits. So the next step is to tidy up a couple of loose ends and then move on to testing on other systems. In particular I've replaced the ARMv6+ DSB+ISB sequences with a macro, so that I can see if there's any benefit to having cacheable pagetables with older CPUs. Possibly just XScale, since its data cache supports both write-back and write-through, whereas older CPUs are either much slower (ARMv3) or only support write-back (StrongARM).
Jeffrey Lee (213) 6048 posts
Testing results for the full set of CPUs I have at my disposal:
So all the machines show an appreciable performance gain for most operations, but it's really the high-performance chips (A8, A9, A15) which stand to gain the most from the changes. However, I don't have any ARMv3 machines (RiscPC 600, 700, A7000, A7000+) to test against, and my StrongARM doesn't support lazy task swapping, so there are a few datapoints missing. UPDATE: ARM6, ARM7 and lazy StrongARM results added.
Really I guess it's getting results for at least one ARMv3 machine that will be the most important, since they are likely to perform very differently to StrongARM. But if someone has a StrongARM that supports lazy task swapping then getting some results from that would be useful too. (SYS "OS_AMBControl",5,-1 TO ,A : PRINT A will show 1 if lazy task swapping is supported.) Also, it's possible that one or both won't boot at all on ARMv3 machines, so it's important to test them before I check in the changes!
Chris Hall (132) 3554 posts
Jeffrey wrote: "getting some results from that would be useful too. (SYS "OS_AMBControl",5,-1 TO ,A : PRINT A will show 1 if lazy task swapping is supported.)"
One of my two Risc PCs shows 1 (the Kinetic one running RISC OS 4.03). I'll have a look at this on Monday.
Chris Evans (457) 1614 posts
We can let you have ARM610, ARM710 and Rev T StrongARM cards. If anyone has a spare A7000 or A7000+, would that be useful to you?
Jon Abbott (1421) 2651 posts
Good improvements. Why is there no improvement for "Block map in" on the A53?
Jeffrey Lee (213) 6048 posts
An ARM610 card would be tempting – it's got a few special requirements compared to the other ARMv3 machines we support. And I can't imagine ARM610 cards are flying off your shelves any more! I'll have a think about it (I think Sprow has volunteered to do some ARM610 testing for me, and these kinds of changes are pretty rare, so it might be another year or so before I need to worry about ARMv3 testing again).
I’m not sure. The key routine for that operation would be AMB_movepagesin_L2PT. My local version is a bit different (some changes to PageNumToL2PT mean that it’s now only able to do 4 words of L2PT per loop instead of 8), but the key feature of using STM to write the page table entries is still there. A quick check through the A53 TRM suggests that it’s got a pretty wide memory bus (256 bit wide write interface from L1 to L2, 128 bit wide interfaces through the rest of the system), so it should be able to perform 4-word STMs to quadword-aligned addresses without breaking a sweat. But the A15 also has a 128 bit bus width, and that saw a 4x improvement, so there must be something else at play. I guess the A53 just has better write streaming to strongly-ordered memory |
Jeffrey Lee (213) 6048 posts
ARM610+ARM710 results added to the table, courtesy of Sprow.
I’ve re-checked the numbers, and with cacheable pagetables the results are actually marginally slower. Mapping in 16MB of RAM 256 times (so 4GB total) takes 11.5ms with non-cacheable pagetables and 12.1ms with cacheable pagetables. Maybe if I do any more testing on the Pi 3 I’ll redo that one but with a larger loop counter – 0.6ms out of 12.1ms sounds like it could easily be the result of a lengthy interrupt or two getting in the way at the wrong time. (The same operation on the Pi 2 took about twice as long, so the x1.2 result there is a bit more trustworthy) |
Chris Hall (132) 3554 posts
Earlier I wrote: "One of my two Risc PCs shows 1 (the Kinetic one running RISC OS 4.03). I'll have a look at this on Monday."
I get 'ROM type not supported' from !SoftLoad (which is already set up to softload Select2i5).
Jeffrey Lee (213) 6048 posts
Sounds like ROL’s softload tool doesn’t like ROOL ROMs – try the ROOL softload tool (up to date binary here, or you can grab the app from a recent IOMD ROM download) |
Colin Ferris (399) 1814 posts
Jeffrey, what does the n[odecompress] option do in Softload? Chris, if you run the ROL 'Softload' utility (i.e. double-click on it; it's inside !Softload), what options does it give? I did a StubG version of 'Softload' v1.18 for the RPC/Emulator, if anyone wants to try it: there's no need to load the CLib just to run 'Softload'.
Chris Hall (132) 3554 posts
Using the absolute file 'Softload' you provided, with the ROMs 'c' and 'nc', and the command:
I now get ‘application is not 32 bit compatible’ but it doesn’t tell me which one. However the ROM loads OK and *Desktop still works (but of course no system resources have been seen). Results with ROM ‘c’:
Results with ROM ‘nc’:
Hope this is helpful.
Jeffrey Lee (213) 6048 posts
The nodecompress option prevents pre-decompression of compressed ROMs. Normally, with a compressed ROM, the softload tool will decompress it before booting it, either to allow it to be patched or to double-check that that ROM isn't already running. With nodecompress that code is disabled, so the ROM has to decompress itself. Mainly it's for testing the self-decompression support within the ROM.
Yeah, that error is a bit crap. If you haven’t softloaded RISC OS 5 on your machine before then chances are you need to replace the boot sequence with the RISC OS 5 one (plus the “system resources” download) and then merge in anything else that you need from your old boot sequence.
Yep, that’s great – thanks. I’ve updated the table with the results. |
Rick Murray (539) 13840 posts
For reference (for others), it’s generated by CLib if you try to start an application using the old APCS call. You can either…
I would recommend the oddly named one, but then I'm just a wee bit biased ;-)
Colin Ferris (399) 1814 posts
With ref to Harinezumi (looks a useful prog): what is BootFX? *AddtoRMA doesn't seem to be available on RO4.02 during boot, but it answers to *Help in the Desktop. I can't find a reference to a utility of that name. Prefixing the call with an X seems to work. The BootLog file is of type Data; is it supposed to be Text? Why hasn't the PatchApp module been 32-bitted? The version here is dated 2014.
Chris Mahoney (1684) 2165 posts
It’s a module included with Pi ROMs that makes the “splash screen” appear when booting. |
Rick Murray (539) 13840 posts
As Chris said, it’s a Pi module that makes a fancy startup (big colourful screen, the normal init in a tiny text window in the middle, and a sliding-bar style status indicator). Eye candy.
It is provided by a module called “BootCommands”. It’s built into ROM on 5.xx; not having RO4.02 I don’t know what it’ll be called, but take a look in !System to see if anything looks like it might be it.
The X just means “if it throws an error, don’t run around screaming panic!”. Something that really irritates me with RISC OS is that if some component of the boot process fails, the entire boot is just abandoned. That’s part of why I wrote Harinezumi – it reads what is supposed to happen, does it, then reports on whether or not it worked. And then it continues.
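The difference between abandoning the boot at the first error and the "do it, report on it, then continue" approach can be sketched in Python. This is not Harinezumi's code (which isn't shown here). The step names, `run_boot_steps` and `load_missing_module` are hypothetical, and Python exceptions stand in for RISC OS errors.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    error: str = ""


def run_boot_steps(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run every step, record whether it worked, and carry on regardless."""
    results = []
    for name, action in steps:
        try:
            action()
            results.append(StepResult(name, True))
        except Exception as exc:  # a failing step is reported, not fatal
            results.append(StepResult(name, False, str(exc)))
    return results


def load_missing_module():
    # Hypothetical failing step, standing in for a RISC OS error.
    raise RuntimeError("File 'Foo' not found")


report = run_boot_steps([
    ("Set system variables", lambda: None),
    ("Load module", load_missing_module),
    ("Start desktop", lambda: None),
])
for r in report:
    print("OK  " if r.ok else "FAIL", r.name, r.error)
```

A boot sequence that stops at the first error would never reach "Start desktop" here; collecting a result per step keeps going and still leaves a log of what failed.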
Hmm, I don’t have Harinezumi on the Pi. I’d better pull the sources off the old PC and rebuild them with the latest compiler. You know…just in case. :-) The file, if it doesn’t exist, is opened as a Data file, and once it has been written, it is supposed to be set to be a text file. I can see “ Hang on, let me install it on my Pi. …the words we love to hate → “works for me”. $.!Boot.BootLog is a text file.
It’s supposed to mess around with stuff to be “StrongARM compatible”, which I think involves detecting potentially iffy code and patching in calls to OS_SynchroniseCodeAreas. The long and short of it is that it’s a legacy thing – 26 bit code plain won’t work and 32 bit code should be aware of the behaviour of split instruction/data caches. |
Pages: 1 2