OS_ChangeDynamicArea slowness
Jeffrey Lee (213) 6048 posts |
Last night I started looking into why OS_ChangeDynamicArea is so slow in some situations (moving all of RAM into the free pool on startup, growing screen memory by ~10MB, etc.). Initially I was expecting there to be some complex and difficult-to-follow algorithm at work within the code (Kernel.s.ChangeDyn is a big source file!), but on closer inspection OS_ChangeDynamicArea only takes up a small part of it.

Current code

There are four main algorithms that OS_ChangeDynamicArea uses:
As you can see, all of the above work by moving pages one at a time, calling “BangCamUpdate” on each page in order to actually perform the move. BangCamUpdate internally splits the task into two activities – mapping out the existing use of the page, and mapping in the page at its new location. For each of these sub-tasks, cache/TLB maintenance is dealt with:
There are also some extra complications within OS_ChangeDynamicArea when it comes to handling doubly-mapped regions. (For the uninitiated, doubly-mapped regions are ones in which the dynamic area is mapped in twice, with the two mappings adjacent to each other. They’re primarily used for screen memory on machines which support hardware scrolling, so that the CPU can access the same scrolled address range as the video hardware.)

When growing a doubly-mapped region, the second mapping (which is located before the start address of the DA) first needs to be shuffled down. It looks like this is done on a per-chunk basis – whether those are big chunks (when DoTheGrowNotSpecified is used) or small chunks (DoTheGrowPagesSpecified). Except it’ll pretty much always be DoTheGrowPagesSpecified, since screen memory will want to be using specific pages. This makes growing screen memory O(n^2): unless the grow amount is relatively small (<=252k), each page in the secondary mapping will get moved multiple times. Each page being moved will also cause a full TLB flush, which isn’t likely to help performance much. When shrinking a doubly-mapped region, the second mapping gets shuffled in one go once the main shrink has completed.

One important thing to observe is that DoTheGrowPagesSpecified is the only routine that manipulates pages which are actively being used. All the other routines only act on pages which are known to be unused. DoTheGrowPagesSpecified is also the only routine that disables IRQs during its operation (since it needs to be able to copy and replace each page without any IRQ process messing with it).

Potential improvements

First off there’s the really obvious one – at the moment cache/TLB maintenance is being performed one page at a time. This is almost certainly the cause of the slow performance when moving all of RAM into the free pool on startup (on a 512MB machine that would be around 100,000 pages).
Since AreaShrink and DoTheGrowNotSpecified know straight away which pages they’re operating on, they could easily be changed to do the cache/TLB maintenance up front using just one ARMop call. This would allow the ARMop code to decide for itself whether a full or partial cache flush is the most appropriate.

For DoTheGrowPagesSpecified/DoTheGrow things are a bit trickier, since they only move stuff in blocks of up to 252k, and they often have to deal with pages which are in active use. To fix the performance problems there I think we’d have to make some pretty major changes. One of the easier changes to make might be to try using DMA to copy the contents of each page instead of using the CPU.

Another thing I can think of at the moment is the handling of growing doubly-mapped regions. The easiest way I can think of for dealing with this would be to start off by unmapping the second mapping entirely, performing the grow, then recreating the second mapping in the appropriate location. This does run the risk of things going horribly wrong if OS_ChangeDynamicArea crashes, so maybe some fixup code could be added to the kernel’s abort handler to allow it to repair any doubly-mapped regions if something crashes while OS_ChangeDynamicArea is busy moving things around (and, as a special case, fix up any broken VDU driver variables – it should be easy enough for the kernel to spot if it was the screen DA which was in the middle of being manipulated). We should also be able to get a performance boost by making sure that full cache/TLB flushes only occur once when messing with doubly-mapped regions, instead of for each page being manipulated. |
Jeffrey Lee (213) 6048 posts |
I had a quick go at modifying DoTheGrowNotSpecified to do all the cache/TLB maintenance in one go. The OS_ChangeDynamicArea call that FreePool makes on startup now takes 15cs instead of 1346cs! |
Steffen Huber (91) 1949 posts |
1346cs? You mean more than 13s? That was the thing that really crippled startup performance? |
Michael Drake (88) 336 posts |
Is that on an Iyonix? How much memory does the machine have? Sounds great anyway. :) |
Jeffrey Lee (213) 6048 posts |
Yes. Inefficient code can be hidden pretty much anywhere!
BB-xM, 512MB. The Iyonix does suffer from poor performance as well, but I’m not sure if it’s quite as bad as on the BB. |
Andrew Rawnsley (492) 1443 posts |
Has this improvement been put into the main ROM build tree, or is it still in testing? Would love to begin testing this! |
Jeffrey Lee (213) 6048 posts |
Still in testing. Before I check it in I’m going to spend a little while playing around with DoTheGrowPagesSpecified to see if there’s anything easy I can do to speed up screen memory growing. Last night I added some profiling code to time the different steps it goes through, so I should be able to get a pretty good picture of which bits are eating all the CPU time. |
Steve Revill (20) 1361 posts |
Great work, Jeffrey. Thanks for that – it’s good to know our suspicions about FreePool were on the money and you’ve thought of some ways to sort it out, with various other improvements presenting themselves. This should make a massive difference to the pre-desk phase of the boot sequence in particular. The compressed ROM images are also a big help here. There’s some other work in progress that I’m aware of which will improve things still further. That’d leave the long, random DHCP pause as the only thing left to clobber (and I can think of a fairly simple solution to that, too). |
Jeffrey Lee (213) 6048 posts |
These changes are now checked in. I managed to find a few different ways to tweak DoTheGrowPagesSpecified, so for my test case of growing the BB’s screen memory by 16MB it’s now between 2 and 5 times faster, depending on the state the required pages are in. In real terms this was a reduction from 1.93s to 0.41s (4.7x speedup) for the “easy” case, and a reduction from 1.58s to 0.75s (2.1x speedup) for the “hard” case. From looking at my profiling results, it looks like the main difference between the two cases is that the “easy” one only needs to preserve the contents of a few pages, while the “hard” one has to preserve the contents of more pages (presumably all of them). The page copying doesn’t use DMA yet, so that’s why the “hard” case is so much slower (at least when dealing with noncacheable pages).
In the end I decided not to go with this approach. It would have involved calling the DA pregrow/postgrow handlers without the second mapping being present, which the DA owner might not be expecting. However, now that I’ve taken another look at my profiling results, it looks like this code is now taking most of the CPU time for the “easy” case, and a healthy chunk of the CPU time for the “hard” case (about 0.28s for each case). So I’ll probably implement this idea at some point in the future (doubly-mapped DAs are reserved for Acorn/OS use, so there shouldn’t be any knock-ons for user code).
Ah, yes – FileCore reading the disc map one sector at a time. Good to see that one finally get fixed! Regarding Iyonix performance – *FreePool wasn’t anywhere near as bad as it was on the BB. For a 512MB Iyonix, the time has been reduced from 1.18s to 0.23s. |
Steve Revill (20) 1361 posts |
Looks like RISC OS 5.20 (whenever that happens) will have quite a number of useful and significant performance improvements. :) |
Michael Drake (88) 336 posts |
Out of interest, how long does startup take on a BB-xM now? |
Rob Heaton (274) 515 posts |
From power-on to a usable desktop is around 38 seconds, it was around 55 seconds prior to the recent changes. |
Michael Drake (88) 336 posts |
Thanks Rob. So it’s certainly a lot faster now, but still some way off what I’d call quick. |
Chris Gransden (337) 1202 posts |
It’s possible to improve boot time by a few more seconds. Mine went from 62 seconds to 33 seconds. Using a manual IP address instead of DHCP saves a few. Using the latest MLO and u-boot.bin files saves a few more. The latest version of u-boot has support for preEnv.txt files. This enables changing the default boot delay of 3 seconds to 0. I’ve uploaded a zip file containing the latest versions here. They should work on all BeagleBoards. I’ve only tested on an xM rev A2. |
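For reference, a preEnv.txt along these lines should remove the delay (this assumes the standard u-boot `bootdelay` environment variable, which is what preEnv.txt sets; not taken from the post itself):

```
bootdelay=0
```

Put the file in the root of the SD card's boot partition alongside MLO and u-boot.bin.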
Chris Johnson (125) 825 posts |
Having installed the latest ROM, it is certainly quicker to boot on the ARMini. Am I correct in thinking that a substantial part of the boot time is now due to the BB checking its onboard memory before it ever gets around to running RISC OS? |
Jeffrey Lee (213) 6048 posts |
I don’t think there are any memory tests performed on startup. Some of these are just estimates based around the timings I performed when I was working on the compressed ROM support, but I think the current boot flow is broken down as follows:
|
Sprow (202) 1155 posts |
I’m not near a disassembler just now, but could the u-boot binary be patched with a cheeky branch past that test to save 3 seconds? |
Jeffrey Lee (213) 6048 posts |
Or, you could just use the new version Chris G. posted a few posts above which allows you to specify the delay using a preEnv.txt file. |
Andrew Rawnsley (492) 1443 posts |
Have been trying Chris G’s MLO/uboot files and even using !SDcreate as supplied with Jeffrey’s latest ROM, I can’t get my rev C to boot RISC OS properly with those files. The 3 mobo lights come on, but there’s no further “life” (ie. no keyboard/mouse init etc). Has anyone had more luck than me? |
Jeffrey Lee (213) 6048 posts |
BB-xM, I presume? If it was a standard BB then I’d be tempted to blame a version mismatch between the x-loader version in NAND and the u-boot version on the SD card. I haven’t tried the new versions yet, so I’m not sure what else could be the issue. |
Andrew Rawnsley (492) 1443 posts |
Yep, xM. I was basically trying to see if there were success reports for me to compare findings against. Trying to sort out a suitable uSD card release for the show this weekend. |