Contemplating better task swapping
Jeffrey Lee (213) 6048 posts |
One of the things I’ve been thinking about recently is how to improve task swapping in the OS (beyond the obvious fix of not flushing the caches on ARMv6+). As I’ve mentioned a few times in the past, ARMv6+ has two level 1 page table pointers, so that the OS can easily split the memory map into two sections (a global one and a process-specific one) and swap out the process-specific page table pointer when it performs a task/context switch. No O(N) iteration of page lists, no flushing of the TLB, just a couple of pointer swaps and some synchronisation instructions. And with the way RISC OS’s memory map is laid out this is a perfect fit for the OS. In terms of the realities of making it work, I think we’d need (at least) the following:
Of course, another benefit of this form of task swapping is that it’s multi-core safe. All you need is a little bit of core-local data to track the currently mapped-in ASID. The current approach of modifying both the CAM and page tables during a task switch simply won’t work (you’d end up needing core-specific L1 and L2 page tables, and either core-specific CAMs or adding an ASID-like thing to the CAM – so you might as well go the whole hog and use the native ASID support).
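For illustration, here’s a minimal sketch of the kind of switch sequence this enables on ARMv7. The registers are real (TTBR0 holds the process-specific level 1 page table base, CONTEXTIDR holds the ASID), but switch_to() and the reserved-ASID convention are assumptions for the sketch, and ASID rollover handling is omitted:

#include <stdint.h>

#define RESERVED_ASID 0u  /* assumption: ASID 0 kept back for switching */

static inline void write_contextidr(uint32_t v)
{   /* CONTEXTIDR: low 8 bits hold the ASID */
    __asm volatile("mcr p15, 0, %0, c13, c0, 1" :: "r"(v) : "memory");
}

static inline void write_ttbr0(uint32_t v)
{   /* TTBR0: base of the process-specific L1 page table */
    __asm volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(v) : "memory");
}

static inline void dsb(void) { __asm volatile("dsb" ::: "memory"); }
static inline void isb(void) { __asm volatile("isb" ::: "memory"); }

/* Switch to a task's L1 page table + ASID. No page list iteration,
   no TLB flush - just register writes and synchronisation. The
   reserved ASID stops speculative table walks from pairing the old
   ASID with the new tables (or vice versa) mid-switch. */
void switch_to(uint32_t l1pt_phys, uint8_t asid)
{
    dsb();                           /* complete outstanding accesses */
    write_contextidr(RESERVED_ASID);
    isb();
    write_ttbr0(l1pt_phys);          /* swap the process-specific L1PT */
    isb();
    write_contextidr(asid);          /* make the new ASID current */
    isb();
}

On a task switch, AMBControl (or whatever sits above it) would just call something like this with the task’s page table and ASID, and keep the per-core record of which ASID is currently mapped in. |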
Steffen Huber (91) 1953 posts |
I have understood maybe 5% of the details you mentioned – will it help if I just shout “Go for it, Jeffrey!”? |
Rick Murray (539) 13840 posts |
I can imagine Wimp_TransferBlock will be fun. I’ll echo what Steffen said – because getting the task handling to a state where it can contemplate using other cores would be quite a step forward. There are many hurdles to overcome with that, but each one passed is one impediment less… |
Jeffrey Lee (213) 6048 posts |
Just to clarify, I’m not planning on implementing this in the near future (too many other bits still need to fall into place first). I was just out of useful things to do at lunch, and wanted to get these thoughts out of my head before I lost them ;-) It’s also good to think ahead like this just to make sure the prerequisite tasks don’t end up getting implemented in a way which makes the final step harder/impossible.
The hardware is perfectly fine with Wimp_TransferBlock – multiple ASIDs can happily use the same pages as each other, and there’s nothing stopping a page from being part of a global mapping and a non-global mapping. The fun part is teaching RISC OS about multiply mapped pages.

The easiest (and multi-core safe) solution would probably be to use RAM-RAM DMA transfers. That way the OS doesn’t need to do any multiple mapping, and the DMA manager will automatically pause & resume the transfer if a DA handler decides it wants to claim one of the pages for itself.

If you’re running on a system where the DMA controller doesn’t support RAM-RAM transfers, you can still emulate it in software – you can make temporary mappings of the pages (bypassing the CAM’s bookkeeping), so that as far as the OS is concerned there’s no multiple mapping going on. And you can rely on the DMA manager to tell you if the transfer needs to be paused for any reason. (Disclaimer: Teaching DMAManager about RAM-RAM transfers is also another thing which hasn’t been done yet.)
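To make the software fallback a bit more concrete, here’s a rough sketch of the temporary-mapping idea. Everything here is hypothetical – map_phys_temp() and friends are not existing kernel calls, and a real version would also ask the DMA manager whether it needs to pause between chunks:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical helpers: map one physical page into a per-core scratch
   slot (bypassing the CAM's bookkeeping), and unmap it again
   (including the TLB invalidate for that slot). Two slots per core
   are assumed, so both sides can be mapped at once. */
void *map_phys_temp(uint32_t phys);
void  unmap_temp(void *log);
/* Hypothetical: walk a task's page tables for a page's physical address. */
uint32_t task_va_to_phys(int task, uint32_t va);

/* Copy between two tasks' address spaces a page at a time, touching
   only the scratch slots - as far as the rest of the OS is concerned,
   no multiple mapping ever happens. */
void transfer_block(int src_task, uint32_t src_va,
                    int dst_task, uint32_t dst_va, size_t len)
{
    while (len > 0) {
        /* largest chunk that stays within one page on both sides */
        size_t src_room = PAGE_SIZE - (src_va & (PAGE_SIZE - 1));
        size_t dst_room = PAGE_SIZE - (dst_va & (PAGE_SIZE - 1));
        size_t chunk = len;
        if (chunk > src_room) chunk = src_room;
        if (chunk > dst_room) chunk = dst_room;

        char *src = map_phys_temp(task_va_to_phys(src_task, src_va));
        char *dst = map_phys_temp(task_va_to_phys(dst_task, dst_va));
        memcpy(dst + (dst_va & (PAGE_SIZE - 1)),
               src + (src_va & (PAGE_SIZE - 1)), chunk);
        unmap_temp(dst);
        unmap_temp(src);

        src_va += chunk; dst_va += chunk; len -= chunk;
    }
}

With something like this the CAM never sees the temporary mappings, which is exactly the property described above. |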
rob andrews (112) 200 posts |
When you set yourself a task it’s always a big one, but I agree with your thinking about breaking it down into smaller tasks and making RISC OS use the on-chip support – so, as Steffen said, Go for It. |
Sprow (202) 1158 posts |
Sounds interesting, or complicated, or both! How would it work on pre-ARMv6? Do you think there should be a software emulation (handing out ASIDs and so on), or would Iyonix and earlier just stick with the current scheme and return “SWI not known” or similar errors from any new APIs?
Like the clone-o-matic-5000 so two or three of you can work on it at once. |
Jeffrey Lee (213) 6048 posts |
I think it would be best to just stick with the current scheme. Pre-ARMv6 you will need to flush pages from the cache when you’re mapping them out, so the current lazy task swapping system is pretty optimal. |
Jon Abbott (1421) 2651 posts |
I’ve been pondering this recently as well, looking through AMBControl to see how I could do the task swapping I’d require to code a full Hypervisor. Having reviewed the code, I came to the conclusion that the current method is far too slow for a Hypervisor, so I’d have to consider coding my own replacement for it.
Would flushing pages and reverting their cacheability when they’re released avoid the need to both track and flush the cache at allocation? How many areas in RISC OS allocate pages? Is it centralised?
I presume sharing ASIDs would only start occurring when all 256 have been exhausted?
Sounds sensible.
If the CAM is “internal use only”, is it worth starting from scratch? If it’s currently page based, make it task/range based? Afraid my knowledge of it is zero. How will Application space allocated to Modules fit into all this? Will they get ASIDs and be treated like any other task? |
Jeffrey Lee (213) 6048 posts |
“Would flushing pages and reverting their cacheability when they’re released avoid the need to both track and flush the cache at allocation?”

Almost. The tricky bit is DA pre-grow handlers which request specific pages; if they request a cacheable page which is currently mapped out then there’d have to be some extra logic there to deal with that (if the new owner wants it non-cacheable). Flushing pages from the cache when releasing them would almost certainly have worse performance than tracking the cacheability correctly, so unless it made things significantly easier (I’m not sure it would) it’s probably not worth considering.

“How many areas in RISC OS allocate pages? Is it centralised?”

When I added the PMP support I did tidy up the code a bit, so the only place pages will be allocated from is the free pool (previously OS_ChangeDynamicArea would pull pages directly from application space if the free pool was empty; now it shrinks application space into the free pool as an intermediate step). Claiming pages which are in use by someone else is the exception to this rule (they go straight from the old owner to the new owner), but the replacement page which the old owner is given will come from the free pool, so ultimately it’s still quite easy to find the places where page allocation is going on. A quick search through the kernel for where the free pool is being accessed suggests that AMBControl (growing an AMB), OS_ChangeDynamicArea (growing a DA), OS_DynamicArea (adding pages to a PMP), and AllocateBackingLevel2 (L2PT allocation for new DAs) are all involved in allocating pages. If L2PT was turned into a sparse DA or a PMP, and AMBControl was changed to use PMPs, then we’d only have two places to worry about. (Disclaimer: There are also other places which allocate pages during kernel init, but they’re special since they generally have to operate before the free pool or CAM even exist.)

“I presume sharing ASIDs would only start occurring when all 256 have been exhausted?”

Yes. Easily achieved by keeping a count of how many times each one is in use (see the sketch at the end of this post). A possible solution would be for the task ID to contain both the ASID and additional bits to extend it, which the task switcher makes use of when swapping. I’d expect the Wimp to continue to use AMB handles for task swapping, and internally AMBControl would deal with mapping AMBs/PMPs to ASIDs.

“If the CAM is ‘internal use only’, is it worth starting from scratch? If it’s currently page based, make it task/range based?”

Yes, the CAM is currently page based. For each RAM page the OS manages, it stores the page’s current logical address, flags, and PMP association (the table is indexed by physical page number, which can easily be calculated from the physical address). I think the PMP association information is good enough for all current and future needs, so the only motivation I can see for rewriting the CAM would be if we wanted a data structure that was more memory efficient than a linear array. But that would add a lot of complexity, so it’s probably not worth it at this point in time (at 16 bytes per entry, 0.4% of the RAM in the system will go towards storing the CAM). Although I guess there’s also multiply-mapped memory to think about – rewriting the CAM to allow multiply-mapped memory to be represented properly would certainly be useful.
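To put a shape on “16 bytes per entry”, here’s a guess at what a CAM entry holds – the field names and exact layout are invented for illustration, not the kernel’s actual definition:

#include <stdint.h>

/* Illustrative only. One entry per physical RAM page, indexed by
   physical page number. */
typedef struct {
    uint32_t log_addr;   /* current logical address of the page        */
    uint32_t flags;      /* page flags (cacheability, protection, ...) */
    uint32_t pmp;        /* PMP association (or none)                  */
    uint32_t pmp_index;  /* page's index within that PMP               */
} cam_entry;

/* 16 bytes per 4KB page = 16/4096, i.e. ~0.4% of RAM, matching the
   figure quoted above. */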
“How will Application space allocated to Modules fit into all this? Will they get ASIDs and be treated like any other task?”

Yes. For both regular tasks and module tasks the Wimp makes use of AMBControl for the memory management.
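As promised above, a minimal sketch of the ASID counting scheme (hypothetical code – no locking, single core, and a smarter allocator would avoid sharing an ASID that’s currently live on another core):

#include <stdint.h>

#define NUM_ASIDS 256   /* 8-bit ASIDs on ARMv6+ */

static uint32_t asid_users[NUM_ASIDS]; /* tasks currently holding each */

/* Hand out an unused ASID if one exists; once all 256 are exhausted,
   share the least-used one. Two tasks sharing an ASID means a TLB
   flush (for that ASID) when switching between them. */
uint8_t asid_alloc(void)
{
    unsigned best = 1;  /* assumption: ASID 0 reserved by the switcher */
    for (unsigned i = 1; i < NUM_ASIDS; i++) {
        if (asid_users[i] == 0) { best = i; break; }
        if (asid_users[i] < asid_users[best]) best = i;
    }
    asid_users[best]++;
    return (uint8_t)best;
}

void asid_free(uint8_t asid)
{
    asid_users[asid]--;
}

AMBControl would call asid_alloc() when creating an AMB/PMP mapping and asid_free() when it’s destroyed. |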
Jon Abbott (1421) 2651 posts |
Isn’t the cache currently being flushed based on MVA? When flushing at allocation, won’t you need to track the CPU core and logical mapping after they’re released, or will you switch to flushing based on physical address and do it across all cores? From memory, ARMv6 supports the new ranged cache flush instructions, so I don’t think performance would be hit as badly as you might think. If RISC OS’s memory subsystem is going to change, threading and multicore should probably be factored into the changes. Otherwise it’s simply change for change’s sake. |
Jeffrey Lee (213) 6048 posts |
Unfortunately those are an optional, ARMv6-specific extension, which isn’t supported by ARMv7+. Also (except for the PL310) there aren’t any cache operations for flushing by physical address. However, after spending a long time staring at the documentation, I think you’re right that flushing (at least some of) the caches when freeing pages (or mapping out global pages) is a sensible way to go. First, here are the details of all the different cache types we have to deal with:
Taken all together, this means the following:
The above could be simplified a bit if, when mapping out global pages, we flushed the page from any VIPT/IVIPT caches. That would allow us to avoid worrying about page colouring constraints (which is probably wise – for ARMv7+ the page colouring scheme isn’t documented), or having to remember the previous VA of mapped out pages. On ARMv6 this would equate to flushing the instruction + data cache, while on ARMv7 we’d only have to flush the instruction cache. (For anyone struggling with the above, I did find a few useful links while I was writing it up: ARM article on page colouring, Hints as to what IVIPT really means, and how the A8’s PIPT data cache actually isn’t, and Wikipedia’s explanation of the different cache types)
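To make the maintenance step concrete, here’s a sketch of flushing one page by MVA on ARMv7 (assuming a 64-byte cache line – real code would read the line size from the Cache Type Register; and per the above, on ARMv7 only the instruction-cache half would actually be needed when mapping out global pages):

#include <stdint.h>

#define PAGE_SIZE 4096u
#define LINE_SIZE 64u   /* assumption - query CTR in real code */

/* Clean+invalidate the data cache and invalidate the instruction
   cache for one page, line by line, by virtual address. */
void flush_page_by_mva(uint32_t va)
{
    for (uint32_t mva = va; mva < va + PAGE_SIZE; mva += LINE_SIZE) {
        /* DCCIMVAC: data cache clean & invalidate by MVA to PoC */
        __asm volatile("mcr p15, 0, %0, c7, c14, 1" :: "r"(mva) : "memory");
        /* ICIMVAU: instruction cache invalidate by MVA to PoU */
        __asm volatile("mcr p15, 0, %0, c7, c5, 1" :: "r"(mva) : "memory");
    }
    __asm volatile("dsb" ::: "memory");
    /* BPIALL: invalidate branch predictor after I-cache maintenance */
    uint32_t zero = 0;
    __asm volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(zero) : "memory");
    __asm volatile("dsb" ::: "memory");
    __asm volatile("isb" ::: "memory");
}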
The new code won’t be multi-core/multi-thread ready (I’m not ready for that quite yet), but it will certainly be a lot more multi-core friendly than the way we’re currently doing things. |
Jeffrey Lee (213) 6048 posts |
I guess if we were to redesign the CAM, it might make sense to go with a structure that complements the capabilities of the page tables:

Go with a 3-level tree structure. The first level describes memory in 1MB-aligned 1MB chunks (perfect for section mappings), the second level in 64KB chunks (large pages), and the third in 4KB chunks (small pages). If a 1MB chunk is being used as a section mapping, the 2nd and 3rd levels of the tree aren’t required.

Change the memory allocation strategy for the OS so that as dynamic areas grow the page mappings can be automatically changed from small pages, to large pages, to section mappings – i.e. for each 1MB logical window the OS will try to assign it its own 1MB chunk of physical RAM, and will only map pages such that the low 20 bits of the logical address match the low 20 bits of the physical address. This will allow the OS to make more efficient use of the TLB, and will result in more consistent page colouring across the memory map (at least until the OS starts having to share 1MB sections between multiple 1MB logical windows).

Multiply-mapped pages could be handled by giving each leaf node in the tree a pointer to a reference-counted list of logical address offsets. The use of offsets rather than absolute addresses will allow multiple nodes to share the same list, helping to minimise memory usage.

However, I’m not quite sure how PMPs would fit into the above. Maybe for simplicity PMPs would only support a fixed page size (selectable when creating the PMP), so that they can retain their ability to map their pages anywhere they want within their logical address space, and the OS can pre-allocate all the necessary memory (CAM, page tables) when the pages are added to the PMP. And instead of implementing AMB nodes as PMPs we’d have to implement them as hidden dynamic areas, so that they can use the same page allocation/coalescing logic as regular DAs.
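For illustration, the tree nodes might look something like this (invented names and layout, just to show the 16-way fan-out at each level and the shared multiply-mapped lists):

#include <stdint.h>

/* Reference-counted list of extra logical mappings, stored as offsets
   so several leaf nodes can share one list. */
typedef struct {
    uint32_t refcount;
    uint32_t count;
    int32_t  offsets[];  /* offsets from the primary logical address */
} multi_map_list;

typedef struct {         /* 3rd level: one 4KB small page */
    uint32_t log_addr;
    uint32_t flags;
    multi_map_list *extra;  /* NULL unless multiply mapped */
} cam_l3;

typedef struct {         /* 2nd level: one 64KB chunk */
    cam_l3  *small;      /* 16 x 4KB entries, or NULL if this chunk is
                            mapped as a single 64KB large page */
    uint32_t log_addr;   /* used for the large-page case */
    uint32_t flags;
    multi_map_list *extra;
} cam_l2;

typedef struct {         /* 1st level: one 1MB chunk */
    cam_l2  *large;      /* 16 x 64KB entries, or NULL if the whole
                            chunk is a 1MB section mapping */
    uint32_t log_addr;   /* used for the section case */
    uint32_t flags;
    multi_map_list *extra;
} cam_l1;

For a section-mapped megabyte only the cam_l1 entry exists, which is where the memory saving over the current flat array would come from. |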