Contemplating better task swapping
Jeffrey Lee (213) 6048 posts |
One of the things I’ve been thinking about recently is how to improve task swapping in the OS (beyond the obvious fix of not flushing the caches on ARMv6+). As I’ve mentioned a few times in the past, ARMv6+ has two level 1 page table pointers, so that the OS can easily split the memory map into two sections (a global one and a process-specific one) and swap out the process-specific page table pointer when it performs a task/context switch. No O(N) iteration of page lists, no flushing of the TLB, just a couple of pointer swaps and some synchronisation instructions. And with the way RISC OS’s memory map is laid out this is a perfect fit for the OS. In terms of the realities of making it work, I think we’d need (at least) the following:
Of course, another benefit of this form of task swapping is that it’s multi-core safe. All you need is a little bit of core-local data to track the currently mapped-in ASID. The current approach of modifying both the CAM and page tables during a task switch simply won’t work (you’d end up needing core-specific L1 and L2 page tables, and either core-specific CAMs or adding an ASID-like thing to the CAM – so you might as well go the whole hog and use the native ASID support).
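For illustration, here’s a minimal sketch of the kind of switch sequence this enables on ARMv7. The registers are real (TTBR0 holds the process-specific level 1 page table base, CONTEXTIDR holds the ASID), but switch_to() and the reserved-ASID convention are assumptions for the sketch, and ASID rollover handling is omitted:

#include <stdint.h>

#define RESERVED_ASID 0u  /* assumption: ASID 0 kept back for switching */

static inline void write_contextidr(uint32_t v)
{   /* CONTEXTIDR: low 8 bits hold the ASID */
    __asm volatile("mcr p15, 0, %0, c13, c0, 1" :: "r"(v) : "memory");
}

static inline void write_ttbr0(uint32_t v)
{   /* TTBR0: base of the process-specific L1 page table */
    __asm volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(v) : "memory");
}

static inline void dsb(void) { __asm volatile("dsb" ::: "memory"); }
static inline void isb(void) { __asm volatile("isb" ::: "memory"); }

/* Switch to a task's L1 page table + ASID. No page list iteration,
   no TLB flush - just register writes and synchronisation. The
   reserved ASID stops speculative table walks from pairing the old
   ASID with the new tables (or vice versa) mid-switch. */
void switch_to(uint32_t l1pt_phys, uint8_t asid)
{
    dsb();                           /* complete outstanding accesses */
    write_contextidr(RESERVED_ASID);
    isb();
    write_ttbr0(l1pt_phys);          /* swap the process-specific L1PT */
    isb();
    write_contextidr(asid);          /* make the new ASID current */
    isb();
}

On a task switch, AMBControl (or whatever sits above it) would just call something like this with the task’s page table and ASID, and keep the per-core record of which ASID is currently mapped in. |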
Steffen Huber (91) 1953 posts |
I have understood maybe 5% of the details you mentioned – will it help if I just shout “Go for it, Jeffrey!”? |
Rick Murray (539) 13840 posts |
I can imagine Wimp_TransferBlock will be fun. I’ll echo what Steffen said – because getting the task handling to a state where it can contemplate using other cores would be quite a step forward. There are many hurdles to overcome with that, but each one passed is one impediment less… |
Jeffrey Lee (213) 6048 posts |
Just to clarify, I’m not planning on implementing this in the near future (too many other bits still need to fall into place first). I was just out of useful things to do at lunch, and wanted to get these thoughts out of my head before I lost them ;-) It’s also good to think ahead like this just to make sure the prerequisite tasks don’t end up getting implemented in a way which makes the final step harder/impossible.
The hardware is perfectly fine with Wimp_TransferBlock – multiple ASIDs can happily use the same pages as each other, and there’s nothing stopping a page from being part of a global mapping and a non-global mapping. The fun part is teaching RISC OS about multiply mapped pages.

The easiest (and multi-core safe) solution would probably be to use RAM-RAM DMA transfers. That way the OS doesn’t need to do any multiple mapping, and the DMA manager will automatically pause & resume the transfer if a DA handler decides it wants to claim one of the pages for itself.

If you’re running on a system where the DMA controller doesn’t support RAM-RAM transfers, you can still emulate it in software – you can make temporary mappings of the pages (bypassing the CAM’s bookkeeping), so that as far as the OS is concerned there’s no multiple mapping going on. And you can rely on the DMA manager to tell you if the transfer needs to be paused for any reason. (Disclaimer: Teaching DMAManager about RAM-RAM transfers is also another thing which hasn’t been done yet.)
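To make the software fallback a bit more concrete, here’s a rough sketch of the temporary-mapping idea. Everything here is hypothetical – map_phys_temp() and friends are not existing kernel calls, and a real version would also ask the DMA manager whether it needs to pause between chunks:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical helpers: map one physical page into a per-core scratch
   slot (bypassing the CAM's bookkeeping), and unmap it again
   (including the TLB invalidate for that slot). Two slots per core
   are assumed, so both sides can be mapped at once. */
void *map_phys_temp(uint32_t phys);
void  unmap_temp(void *log);
/* Hypothetical: walk a task's page tables for a page's physical address. */
uint32_t task_va_to_phys(int task, uint32_t va);

/* Copy between two tasks' address spaces a page at a time, touching
   only the scratch slots - as far as the rest of the OS is concerned,
   no multiple mapping ever happens. */
void transfer_block(int src_task, uint32_t src_va,
                    int dst_task, uint32_t dst_va, size_t len)
{
    while (len > 0) {
        /* largest chunk that stays within one page on both sides */
        size_t src_room = PAGE_SIZE - (src_va & (PAGE_SIZE - 1));
        size_t dst_room = PAGE_SIZE - (dst_va & (PAGE_SIZE - 1));
        size_t chunk = len;
        if (chunk > src_room) chunk = src_room;
        if (chunk > dst_room) chunk = dst_room;

        char *src = map_phys_temp(task_va_to_phys(src_task, src_va));
        char *dst = map_phys_temp(task_va_to_phys(dst_task, dst_va));
        memcpy(dst + (dst_va & (PAGE_SIZE - 1)),
               src + (src_va & (PAGE_SIZE - 1)), chunk);
        unmap_temp(dst);
        unmap_temp(src);

        src_va += chunk; dst_va += chunk; len -= chunk;
    }
}

With something like this the CAM never sees the temporary mappings, which is exactly the property described above. |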
rob andrews (112) 200 posts |
When you set yourself a task it’s always a big one, but I agree with your thinking about breaking it down into smaller tasks and making RISC OS use the on-chip support – so, as Steffen said, Go for It. |
Sprow (202) 1158 posts |
Sounds interesting, or complicated, or both! How would it work on pre-ARMv6? Do you think there should be a software emulation (handing out ASIDs and so on), or would Iyonix and earlier just stick with the current scheme and return “SWI not known” or similar errors from any new APIs?
Like the clone-o-matic-5000 so two or three of you can work on it at once. |
Jeffrey Lee (213) 6048 posts |
I think it would be best to just stick with the current scheme. Pre-ARMv6 you will need to flush pages from the cache when you’re mapping them out, so the current lazy task swapping system is pretty optimal. |
Jon Abbott (1421) 2651 posts |
I’ve been pondering this recently as well, looking through AMBControl to see how I could do the task swapping I’d require to code a full Hypervisor. Having reviewed the code, I came to the conclusion that the current method is far too slow for a Hypervisor, so I’d have to consider coding my own replacement for it.
Would flushing pages and reverting their cacheability when they’re released avoid the need to both track and flush the cache at allocation? How many areas in RISC OS allocate pages? Is it centralised?
I presume sharing ASIDs would only start occurring when all 256 have been exhausted?
Sounds sensible.
If the CAM is “internal use only”, is it worth starting from scratch? If it’s currently page based, make it task/range based? Afraid my knowledge of it is zero. How will Application space allocated to Modules fit into all this? Will they get ASIDs and be treated like any other task? |
Jeffrey Lee (213) 6048 posts |
“Would flushing pages and reverting their cacheability when they’re released avoid the need to both track and flush the cache at allocation?”

Almost. The tricky bit is DA pre-grow handlers which request specific pages; if they request a cacheable page which is currently mapped out then there’d have to be some extra logic there to deal with that (if the new owner wants it non-cacheable). Flushing pages from the cache when releasing them would almost certainly have worse performance than tracking the cacheability correctly, so unless it made things significantly easier (I’m not sure it would) it’s probably not worth considering.

“How many areas in RISC OS allocate pages? Is it centralised?”

When I added the PMP support I did tidy up the code a bit, so the only place pages will be allocated from is the free pool (previously OS_ChangeDynamicArea would pull pages directly from application space if the free pool was empty; now it shrinks application space into the free pool as an intermediate step). Claiming pages which are in use by someone else is the exception to this rule (they go straight from the old owner to the new owner), but the replacement page which the old owner is given will come from the free pool, so ultimately it’s still quite easy to find the places where page allocation is going on. A quick search through the kernel for where the free pool is being accessed suggests that AMBControl (growing an AMB), OS_ChangeDynamicArea (growing a DA), OS_DynamicArea (adding pages to a PMP), and AllocateBackingLevel2 (L2PT allocation for new DAs) are all involved in allocating pages. If L2PT was turned into a sparse DA or a PMP, and AMBControl was changed to use PMPs, then we’d only have two places to worry about. (Disclaimer: There are also other places which allocate pages during kernel init, but they’re special since they generally have to operate before the free pool or CAM even exist.)

“I presume sharing ASIDs would only start occurring when all 256 have been exhausted?”

Yes. Easily achieved by keeping a count of how many times each one is in use (see the sketch at the end of this post). A possible solution would be for the task ID to contain both the ASID and additional bits to extend it, which the task switcher makes use of when swapping. I’d expect the Wimp to continue to use AMB handles for task swapping, and internally AMBControl would deal with mapping AMBs/PMPs to ASIDs.

“If the CAM is ‘internal use only’, is it worth starting from scratch? If it’s currently page based, make it task/range based?”

Yes, the CAM is currently page based. For each RAM page the OS manages, it stores the page’s current logical address, flags, and PMP association (the table is indexed by physical page number, which can easily be calculated from the physical address). I think the PMP association information is good enough for all current and future needs, so the only motivation I can see for rewriting the CAM would be if we wanted a data structure that was more memory efficient than a linear array. But that would add a lot of complexity, so it’s probably not worth it at this point in time (at 16 bytes per entry, 0.4% of the RAM in the system will go towards storing the CAM). Although I guess there’s also multiply-mapped memory to think about – rewriting the CAM to allow multiply-mapped memory to be represented properly would certainly be useful.
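To put a shape on “16 bytes per entry”, here’s a guess at what a CAM entry holds – the field names and exact layout are invented for illustration, not the kernel’s actual definition:

#include <stdint.h>

/* Illustrative only. One entry per physical RAM page, indexed by
   physical page number. */
typedef struct {
    uint32_t log_addr;   /* current logical address of the page        */
    uint32_t flags;      /* page flags (cacheability, protection, ...) */
    uint32_t pmp;        /* PMP association (or none)                  */
    uint32_t pmp_index;  /* page's index within that PMP               */
} cam_entry;

/* 16 bytes per 4KB page = 16/4096, i.e. ~0.4% of RAM, matching the
   figure quoted above. */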
“How will Application space allocated to Modules fit into all this? Will they get ASIDs and be treated like any other task?”

Yes. For both regular tasks and module tasks the Wimp makes use of AMBControl for the memory management.
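As promised above, a minimal sketch of the ASID counting scheme (hypothetical code – no locking, single core, and a smarter allocator would avoid sharing an ASID that’s currently live on another core):

#include <stdint.h>

#define NUM_ASIDS 256   /* 8-bit ASIDs on ARMv6+ */

static uint32_t asid_users[NUM_ASIDS]; /* tasks currently holding each */

/* Hand out an unused ASID if one exists; once all 256 are exhausted,
   share the least-used one. Two tasks sharing an ASID means a TLB
   flush (for that ASID) when switching between them. */
uint8_t asid_alloc(void)
{
    unsigned best = 1;  /* assumption: ASID 0 reserved by the switcher */
    for (unsigned i = 1; i < NUM_ASIDS; i++) {
        if (asid_users[i] == 0) { best = i; break; }
        if (asid_users[i] < asid_users[best]) best = i;
    }
    asid_users[best]++;
    return (uint8_t)best;
}

void asid_free(uint8_t asid)
{
    asid_users[asid]--;
}

AMBControl would call asid_alloc() when creating an AMB/PMP mapping and asid_free() when it’s destroyed. |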
Jon Abbott (1421) 2651 posts |
Isn’t the cache currently being flushed based on MVA? When flushing at allocation, won’t you need to track the CPU core and logical mapping after they’re released, or will you switch to flushing based on physical address and do it across all cores? From memory, ARMv6 supports the new ranged cache flush instructions, so I don’t think performance would be hit as badly as you might think. If RISC OS’s memory subsystem is going to change, threading and multicore should probably be factored into the changes. Otherwise it’s simply change for change’s sake. |
Jeffrey Lee (213) 6048 posts |
Unfortunately those are an optional, ARMv6-specific extension, which isn’t supported by ARMv7+. Also (except for the PL310) there aren’t any cache operations for flushing by physical address. However, after spending a long time staring at the documentation, I think you’re right that flushing (at least some of) the caches when freeing pages (or mapping out global pages) is a sensible way to go. First, here are the details of all the different cache types we have to deal with:
Taken all together, this means the following:
The above could be simplified a bit if, when mapping out global pages, we flushed the page from any VIPT/IVIPT caches. That would allow us to avoid worrying about page colouring constraints (which is probably wise – for ARMv7+ the page colouring scheme isn’t documented), or having to remember the previous VA of mapped out pages. On ARMv6 this would equate to flushing the instruction + data cache, while on ARMv7 we’d only have to flush the instruction cache. (For anyone struggling with the above, I did find a few useful links while I was writing it up: ARM article on page colouring, Hints as to what IVIPT really means, and how the A8’s PIPT data cache actually isn’t, and Wikipedia’s explanation of the different cache types)
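To make the maintenance step concrete, here’s a sketch of flushing one page by MVA on ARMv7 (assuming a 64-byte cache line – real code would read the line size from the Cache Type Register; and per the above, on ARMv7 only the instruction-cache half would actually be needed when mapping out global pages):

#include <stdint.h>

#define PAGE_SIZE 4096u
#define LINE_SIZE 64u   /* assumption - query CTR in real code */

/* Clean+invalidate the data cache and invalidate the instruction
   cache for one page, line by line, by virtual address. */
void flush_page_by_mva(uint32_t va)
{
    for (uint32_t mva = va; mva < va + PAGE_SIZE; mva += LINE_SIZE) {
        /* DCCIMVAC: data cache clean & invalidate by MVA to PoC */
        __asm volatile("mcr p15, 0, %0, c7, c14, 1" :: "r"(mva) : "memory");
        /* ICIMVAU: instruction cache invalidate by MVA to PoU */
        __asm volatile("mcr p15, 0, %0, c7, c5, 1" :: "r"(mva) : "memory");
    }
    __asm volatile("dsb" ::: "memory");
    /* BPIALL: invalidate branch predictor after I-cache maintenance */
    uint32_t zero = 0;
    __asm volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(zero) : "memory");
    __asm volatile("dsb" ::: "memory");
    __asm volatile("isb" ::: "memory");
}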
The new code won’t be multi-core/multi-thread ready (I’m not ready for that quite yet), but it will certainly be a lot more multi-core friendly than the way we’re currently doing things. |
Jeffrey Lee (213) 6048 posts |
I guess if we were to redesign the CAM, it might make sense to go with a structure that complements the capabilities of the page tables:

Go with a 3-level tree structure. The first level describes memory in 1MB-aligned 1MB chunks (perfect for section mappings), the second level in 64KB chunks (large pages), and the third in 4KB chunks (small pages). If a 1MB chunk is being used as a section mapping, the 2nd and 3rd levels of the tree aren’t required.

Change the memory allocation strategy for the OS so that as dynamic areas grow the page mappings can be automatically changed from small pages, to large pages, to section mappings – i.e. for each 1MB logical window the OS will try to assign it its own 1MB chunk of physical RAM, and will only map pages such that the low 20 bits of the logical address match the low 20 bits of the physical address. This will allow the OS to make more efficient use of the TLB, and will result in more consistent page colouring across the memory map (at least until the OS starts having to share 1MB sections between multiple 1MB logical windows).

Multiply-mapped pages could be handled by giving each leaf node in the tree a pointer to a reference-counted list of logical address offsets. The use of offsets rather than absolute addresses will allow multiple nodes to share the same list, helping to minimise memory usage.

However, I’m not quite sure how PMPs would fit into the above. Maybe for simplicity PMPs would only support a fixed page size (selectable when creating the PMP), so that they can retain their ability to map their pages anywhere they want within their logical address space, and the OS can pre-allocate all the necessary memory (CAM, page tables) when the pages are added to the PMP. And instead of implementing AMB nodes as PMPs we’d have to implement them as hidden dynamic areas, so that they can use the same page allocation/coalescing logic as regular DAs.
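For illustration, the tree nodes might look something like this (invented names and layout, just to show the 16-way fan-out at each level and the shared multiply-mapped lists):

#include <stdint.h>

/* Reference-counted list of extra logical mappings, stored as offsets
   so several leaf nodes can share one list. */
typedef struct {
    uint32_t refcount;
    uint32_t count;
    int32_t  offsets[];  /* offsets from the primary logical address */
} multi_map_list;

typedef struct {         /* 3rd level: one 4KB small page */
    uint32_t log_addr;
    uint32_t flags;
    multi_map_list *extra;  /* NULL unless multiply mapped */
} cam_l3;

typedef struct {         /* 2nd level: one 64KB chunk */
    cam_l3  *small;      /* 16 x 4KB entries, or NULL if this chunk is
                            mapped as a single 64KB large page */
    uint32_t log_addr;   /* used for the large-page case */
    uint32_t flags;
    multi_map_list *extra;
} cam_l2;

typedef struct {         /* 1st level: one 1MB chunk */
    cam_l2  *large;      /* 16 x 64KB entries, or NULL if the whole
                            chunk is a 1MB section mapping */
    uint32_t log_addr;   /* used for the section case */
    uint32_t flags;
    multi_map_list *extra;
} cam_l1;

For a section-mapped megabyte only the cam_l1 entry exists, which is where the memory saving over the current flat array would come from. |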