Thoughts on GraphicsV memory management
Jeffrey Lee (213) 6048 posts |
Here are my current thoughts on how video/GraphicsV memory management should be extended to cope with all the cool stuff that’s been on my wishlist for the past few years. First, there are a few major assumptions:
With that in mind, along with several other complex scenarios (mixing multi-monitor setups with display rotation, scaling, pixel format conversion, cacheable screen memory, etc.), here are my thoughts on how things will need to be restructured in order to work sensibly:
PMPs
Physical memory management
Logical memory management
Physical buffers: (note - alignment is for illustration purposes only)

|....|....|....|....|....|....|....|....| (page boundary guide)
+-----------+--+
|           |==| LineLength = 3 pages
| display 1 |==|
|           |==|
+-----------+--+
+-+-----------++
|=|           || LineLength = 3 pages
|=| display 2 ||
|=|           ||
+-+-----------++
+---+-----------+---+
|===|           |===| LineLength = 4 pages
|===| display 3 |===|
|===|           |===|
+---+-----------+---+

Logical mapping:

|....|....|XXXX|....|XXXX|....|....|....| (page boundary guide)
+---------+-+-----------++----------+---+
|         |=|           ||          |===| LineLength = 8 pages
| display |=| display 2 ||display 3 |===|
|         |=|           ||          |===|
+---------+-+-----------++----------+---+

The pages in the columns marked with XXXX will need special handling, for they are the locations where the physical framebuffers are overlapping. To cope with this it’s expected that a custom abort handler will be used to detect writes to the overlapping pages, so that any pixel data written to the out-of-bounds area can then be copied to the correct display (e.g. wait until VSync and then either have the CPU copy the data, or use memory-to-memory DMA). If the CPU is to copy the data then it will require the relevant pages from display 1 and 3 to be mapped in somewhere else (perhaps in a completely different DA, especially if PMPs take the ‘restricted’ approach to supporting doubly-mapped areas).

Observant people may spot that the buffer setup for this example isn’t optimal, and we could get by with one XXXX column instead of two – relax, it’s just an example. The final code will hopefully be smart enough to organise things in an optimal manner.
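To make the XXXX columns concrete, here is a minimal sketch of how the overlapping pages could be found. It assumes each display's physical row is placed at a page offset within the logical row; the offsets used (0, 2 and 4) are my reading of the diagram above, and the function and variable names are purely illustrative, not part of any existing API.

```python
def overlapping_pages(displays):
    """Return {page_index: [display names]} for pages claimed by more than one display.

    displays: list of (name, first_page, pages_per_line) tuples, where first_page
    is the (assumed) page offset of that display's row within the logical row.
    """
    owners = {}
    for name, first_page, pages_per_line in displays:
        for page in range(first_page, first_page + pages_per_line):
            owners.setdefault(page, []).append(name)
    return {page: names for page, names in sorted(owners.items()) if len(names) > 1}


def logical_line_length(displays):
    """Pages per logical row: one past the last page any display touches."""
    return max(first_page + count for _, first_page, count in displays)


# Layout from the example; the page offsets are an interpretation of the diagram.
EXAMPLE_LAYOUT = [
    ("display 1", 0, 3),
    ("display 2", 2, 3),
    ("display 3", 4, 4),
]
```

For the example layout this reports pages 2 and 4 as shared, which matches the two XXXX columns, and gives a logical LineLength of 8 pages.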
What if a driver needs to access a logical mapping of its own memory?
E.g. DisplayLink, or any of the ‘wrapper’ drivers. For this to work, there’ll have to be an API (most likely a SWI?) to allow a driver to request a logical mapping of a given rectangle of a framebuffer. The result of this SWI will be a list of rectangles – e.g. in the three-monitor setup above, if a mapping of all of display 1 was requested, there’d be one rectangle for the lefthand portion (which would be part of main screen memory), and one rectangle for the righthand portion (which might be off in some other DA). There’ll also have to be a “release mapping” call, which will allow any temporary mapping which was created to be released (although, the kernel will probably just take a lazy approach and leave the pages mapped in just in case they’re needed again later).

What about the HAL video API?
Most of these changes won’t reach the HAL video API; the required interactions between the driver and the OS are going to become too complex. So it’s probably best to consider the HAL video API to be deprecated.

What about ADFFS/Aemulor?
When it comes to screen memory, they’re mostly interested in BPP conversion, so can follow the same method as a BPP conversion wrapper driver (e.g. give the OS different page lists depending on whether the current pixel format is one the hardware supports or not). But they also want to emulate the Arc-era memory map, so will want to control the logical address of where the screen memory is mapped. To cope with this it’s probably best to just have a GraphicsV call to allow drivers to specify the base address of the DA that the kernel is about to create.

What about the screen dynamic area?
When you think about it, there are only really two choices – keep it (and make sure it works sensibly), or get rid of it. But I’m not sure which would be best.
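As a rough illustration of the rectangle-list idea from the driver mapping question above, the sketch below splits a requested rectangle wherever it crosses a boundary between differently-mapped column ranges. The region table, labels and sizes are all made up for the example; the real SWI interface hasn't been designed.

```python
def map_rectangle(x0, y0, x1, y1, regions):
    """Split the half-open rectangle [x0, x1) x [y0, y1) by column region.

    regions: list of (col_start, col_end, label) tuples, half-open and
    non-overlapping, describing which mapping backs each range of columns.
    Returns a list of (label, x0, y0, x1, y1) pieces, ordered left to right.
    """
    pieces = []
    for col_start, col_end, label in sorted(regions):
        left = max(x0, col_start)
        right = min(x1, col_end)
        if left < right:  # the request overlaps this region
            pieces.append((label, left, y0, right, y1))
    return pieces


# Hypothetical display 1: 1280 pixels wide, with the rightmost 256 columns
# living in an overlap page that is mapped in some other DA.
DISPLAY1_REGIONS = [
    (0, 1024, "main screen memory"),
    (1024, 1280, "other DA"),
]
```

Requesting the whole of this hypothetical display 1 gives two rectangles, one per mapping, while a request that stays within the lefthand portion gives just one.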
That’s all I can think of for now (or at least, all I can afford to write tonight). If anyone has any comments – good or bad – feel free to share them before I start implementing all of this! (which won’t be right away due to other tasks continually popping up, but it is the next big thing I want to do) [edit – For some reason textile has decided it wants to use a different font for HAL, API, and ADFFS within those h2. sections. Just ignore its silly formatting.] |
Jon Abbott (1421) 2651 posts |
From an ADFFS perspective, it doesn’t matter where DA2 is or who owns it – so long as RISCOS is using it for its VDU output, it’s at a legal address that’s always available regardless of appspace mappings, and it’s separate to the GPU memory. When a MODE is entered from an app that’s being Hypervised, ADFFS will map the RO3.1 double map (e.g. 1F88000 / 2000000) for DA2 for compatibility. When the MODE changes, ADFFS will unmap the RO3.1 double map and only remap it if the MODE is legacy and the task is one being Hypervised. As you note, if DA2 were to go, ADFFS would have to intercept the relevant SWIs and emulate the DA. If it’s kept, under ADFFS it’s always associated with the primary driver as RISCOS neatly falls back to DA2 if there’s no alternative GraphicsV driver. ADFFS achieves this by Hypervising GraphicsV and passing calls to either RISCOS or the GPU GraphicsV driver as appropriate, and in some cases doubles them up so both are aware of the call. Switching DA2 to the active (GPU) driver would break ADFFS, as it’s relying on DA2 and the GPU being separate entities. If however ADFFS was to become an active GraphicsV driver and not Hypervise GraphicsV, it would carry on working. Task switching would cause big complications here though, as ADFFS would have to register/unregister with GraphicsV so VDU output goes to the correct driver as tasks are switched. As it’s currently Hypervising GraphicsV, this is easily achieved by checking the DomainID and passing the call on if the task isn’t one it’s Hypervising. |
Jon Abbott (1421) 2651 posts |
I should probably add that if DA2 is to change, don’t let ADFFS hold it up. I certainly don’t expect RISCOS to include any botches to keep ADFFS working; I’ll recode to fit around RISCOS, and we’d just need to coordinate the changes. Short of a few tweaks to the blitter, general bug fixing and ARMv7 support, ADFFS is almost complete in terms of RO5 providing a single IOC / IOMD VMM. It may need additional SWIs either Hypervised or Paravirtualized to get the odd RO2/3 game working, but as a whole it’s near complete.
I’d forget double-mapping as a separate entity; just provide a means to map the same physical address space multiple times and leave the driver to manage it. If a driver then needs a double map for hardware scrolling, for example, it could just claim double the screen memory it requires and map the same physical space twice. This shifts the management of them out of the OS and gives total flexibility on how they’re mapped. The only thing RISCOS might need to track is the fact that there are multiple MVAs that might need cleaning in the event of an L2 cache flush – although from your notes and what I discovered yesterday, provided the multiple maps are all in sync flag-wise, this may not be necessary on current hardware.
Agree with all your points here, the only thing I’d add is that we need an easier means to get logical <> physical address translation. Having to remap the physical address to get the logical as is currently the case isn’t a long term solution.
This will heavily rely on the hardware supporting borders to blank out the padding, and on RISCOS / GraphicsV drivers / software accounting for the padding. I’m certainly using it in ADFFS and I believe the Iyonix uses something similar where screen widths aren’t divisible by 32 pixels, so I agree it’s the correct way to go. Do the iMX6 / Titanium / Pandaboard etc. all support this though? It probably needs technical input from Chris Evans, Andrew Rawnsley and a few others to confirm.
I’ve been looking at this since we last mentioned it as a means to only blit the display when it’s actually updated. As pony as the RO4 implementation was (moving the screen DA to a different Domain and trapping Domain access violations) it does seem like the only viable option and not really a massive overhead, as once you know the screen has been written to, you alter the TLB for the whole DA until the next VSync.
Almost certainly needs an API, so the pages only raise Aborts if a driver specifically requires it. The client should register its interest in being notified about writes to its logical space and leave the API to handle the detection of writes and the TLB changes required. Two options should be available here:
Where it’s known that a DMA-based write is about to take place, the DMA initiator should notify the API of the rectangle it’s about to modify and leave the API to work out the page list that’s subsequently sent to a delta change based driver. This puts a slight overhead on the DMA initiator and may cover a few pages that haven’t changed, but is possibly a good compromise for speed.

What about ADFFS/Aemulor?
To cope with multi-tasking / Wimp based tasks, the legacy DA2 needs to map in/out as the task switcher switches apps being covered by ADFFS/Aemulor. Something similar needs to happen to provide 26bit Module support without restricting the whole machine to a 32MB appspace limit, but on a per-Module basis. As things stand currently, only the DA2 side can be implemented (via MODE based service calls or Paravirtualizing Wimp_Poll / Wimp_PollIdle). I have some ideas on how to coerce RISCOS into working around the 32MB appspace limit, but ideally some fundamental changes to the way taskswitcher tracks per-app / per-module memory maps would be the preferred route.

If a 26bit app is running and it’s switched in, the legacy DA2 map can be switched in as part of the appspace TLB changes in one hit and unmapped when it switches out. Likewise if a 26bit Module is entered, the relevant pages below the 32MB limit are mapped in/out by taskswitcher. My current idea here, which I’ve partially implemented, is that ADFFS creates a stub Module in the actual RMA which then maps in the relevant pages below 32MB temporarily whilst the Module is doing its thing. The stub Module acts as an entry/exit Hypervisor that initiates the required memory map changes. I guess what I’m getting at here is that Aemulor/ADFFS could handle 26bit address space as things stand without OS support; it all depends on how far we want to push legacy support in the OS.
I personally don’t think the OS should be concerned in this regard, as the requirement is specific to ADFFS / Aemulor, but if we’re considering changes to GraphicsV to handle DA2 then it’s worth at least considering. In the longer term, extending taskswitcher to allow VMMs to operate efficiently could be a way around the multi-core / pre-emptive multitasking issue, allowing RISCOS to remain single tasking and leaving the VMM to deal with the multi-core / pre-emptive implementation by running multiple sandboxed copies of RISCOS within a Type 2 Hypervisor. Way beyond the topic of this discussion though and already raised elsewhere as a separate thread.
Making DA2 the active screen DA does make sense, with multi-head displays sharing the DA2 memory map as you’ve described with padding either side for alignment etc. Ironically, in the past week I’ve been considering a near identical implementation for ADFFS so it can handle screen geometry changes between successive frames. This would be implemented as a triple-head display in the GPU buffer with only one head being visible at any one time, but in theory it’s a triple-head display and you could have all three shown at once. Changing DA2 to only be the active display in a multi-head setup however could add major complications, with DA2 switching continually depending on which GraphicsV driver is being entered at the time. Huge potential for TLB overheads here. Having GraphicsV drivers sharing one DA2 in an interleaved fashion is probably the better route. In this scenario, ADFFS would simply touch a rectangle within DA2 when it blits. There is however the complication of dual/triple buffering, which we need to cover for gaming. ADFFS is currently dual buffering both the GPU and legacy DA2 and I’m testing triple buffering on the GPU for the next release. In a multi-head display setup, you could implement it as it currently is, with buffers one after the other, e.g. Head 1 buffer 0:Head 2 buffer 0:Head 1 buffer 1:Head 2 buffer 1. This would however break hardware scrolling. Alternatively, implementing it as Head 1 buffer 0:Head 1 buffer 1:Head 2 buffer 0:Head 2 buffer 1 would get around this, but possibly add complications elsewhere, as there’s the potential for one head to switch from single to dual/triple buffering and cause the logical address of Head 2/3 to change, so there would need to be a means to notify drivers of a logical address change to their view of DA2. |
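Coming back to the DMA case above: working out the page list from a rectangle should be cheap. A minimal sketch, assuming 4K pages, a linear framebuffer and a known row stride (the function name and parameters are illustrative only, not a proposed interface):

```python
PAGE_SIZE = 4096  # assumed page size


def pages_touched(x0, y0, x1, y1, bytes_per_pixel, line_length):
    """Return the sorted page indices covered by the half-open pixel rectangle
    [x0, x1) x [y0, y1), for a framebuffer whose rows are line_length bytes apart.
    Page indices are relative to the start of the framebuffer."""
    if x0 >= x1 or y0 >= y1:
        return []  # empty rectangle touches nothing
    pages = set()
    for y in range(y0, y1):
        first_byte = y * line_length + x0 * bytes_per_pixel
        last_byte = y * line_length + x1 * bytes_per_pixel - 1
        pages.update(range(first_byte // PAGE_SIZE, last_byte // PAGE_SIZE + 1))
    return sorted(pages)
```

Because whole pages are reported, the list can include pages where only a few bytes are about to change, which is the "may cover a few pages that haven't changed" trade-off.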
Jeffrey Lee (213) 6048 posts |
Supporting arbitrary multiple mapping of pages is harder than supporting double-mapping. And as you’ve discovered, bad things can happen if multiple mappings of pages aren’t handled correctly – which is why I’d want the kernel to be fully aware of any multiple mappings, instead of leaving it to the software which created the mapping to deal with all of the headaches. E.g. if we’re using aborts to track changes to pages, and that page is multiply-mapped, we’d typically want all mappings to be subject to abort trapping – something the kernel could easily manage but external software might not be aware of.

The main data structure which the kernel uses to keep track of memory is the CAM. It’s a simple table which is indexed by physical page number (which means it’s limited to coping with RAM pages), and for each entry it stores the current logical address of the page and the page flags. If you look deep enough into the kernel you’ll see the CAM being referred to as the “soft CAM”, which makes me think it was originally a direct softcopy of the MEMC-format page tables (and as I’m sure you’re aware, MEMC required things to be specified in terms of physical → logical mappings, rather than the logical → physical page tables which are used now).

Supporting doubly-mapped areas is “easy” because if the kernel can work out which DA a doubly-mapped page belongs to, it can easily work out the offset of the second mapping from the primary mapping (since it’s equal to the current size of the DA). But historically the CAM hasn’t stored the DA association, so that’s where things like OS_SetMemMapEntries get unstuck if you try modifying a doubly-mapped page. With the PMP changes, the CAM has been extended to allow it to track which PMP a RAM page belongs to.
This was required to allow reclaiming of PMP pages to work correctly, but the code could also easily be extended to allow the DA association of regular pages to be stored (although that would increase the risk of breaking legacy software which likes to be able to remap pages at will). Other solutions to allow doubly-mapped areas to be tracked properly would be to (if the doubly-mapped area is a PMP) use the PMP association to work out the second mapping, and to add a check to OS_SetMemMapEntries (and any other relevant APIs) to make sure that non-PMP doubly-mapped areas aren’t interacted with in dangerous ways (e.g. OS_SetMemMapEntries does now contain a check to make sure that pages belonging to PMPs aren’t interacted with in dangerous ways – if the OS loses track of the PMP association of pages then bad things are likely to happen).

Safely supporting arbitrary multiple mapping of pages would require the kernel to keep a list of all the logical addresses the page is currently mapped to. It’s not an impossible task, but it would add a fair bit of extra complexity, so it’s something I’d like to avoid (s.ChangeDyn is already seven and a half thousand lines of near-incomprehensible code!). So I’d prefer to avoid supporting arbitrary multiple mappings. But if we find out that that is the only sensible way to deal with certain situations then maybe I won’t have a choice!

Logical memory management
Yes. OS_Memory 0 is the standard way of doing logical to physical translation, but it currently has the limitation that it only works with 4K page sizes (so is useless for memory mapped by OS_Memory 13). OS_Memory 0 will definitely need expanding once PMPs can map IO memory, probably using the same scheme as ROL, PhysicalPageNumber = (PhysicalAddress>>12) + (1<<30). So it’s possible I’ll extend it to support non-4K page sizes at the same time.
Physical to logical translation is a trickier matter, since the only generic way the OS would be able to do it would be by scanning the page tables, which would be a bit slow (OS_Memory 0 can do physical to logical translation, but only for RAM pages, where it can easily work out the physical (i.e. RAM) page number and then have a quick peek at the CAM).

Things will get a bit complicated if the combined LineLength isn’t a multiple of 4K – this is where the end-of-line gap mentioned at the start of this post will come in handy.

This is fine for all TI chips (Titanium, Pandaboard, etc.), the Pi and Iyonix. iMX6 I’m not 100% sure about, but I’d be surprised if it didn’t support it. The only machine I know of which definitely doesn’t support gaps between the rows is IOMD, so it may be that some features are either a lot slower (more software emulation) or simply aren’t supported at all (e.g. anything requiring abort trapping may be tricky for chips which use the base-updated abort model or suffer from the abort restart bug). But on the other hand, we don’t have any USB drivers for IOMD, or any drivers for the video podules, so you’re highly unlikely to be in a situation where multiple spanned displays are needed.
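For anyone who wants to see the numbers, here is the quoted ROL-style formula as a small sketch. It is based only on the formula above; the constant and function names are mine, and it says nothing about how OS_Memory 0 would actually expose this.

```python
PAGE_SHIFT = 12          # 4K pages, as in the formula above
IO_PAGE_BASE = 1 << 30   # offset added to mark a non-RAM (IO) page number


def io_page_number(physical_address):
    """PhysicalPageNumber = (PhysicalAddress >> 12) + (1 << 30)."""
    return (physical_address >> PAGE_SHIFT) + IO_PAGE_BASE


def io_page_address(page_number):
    """Inverse of io_page_number: the base physical address of the 4K page."""
    return (page_number - IO_PAGE_BASE) << PAGE_SHIFT


def is_io_page_number(page_number):
    """True if the page number lies in the range produced by the formula."""
    return page_number >= IO_PAGE_BASE
```

So a physical address of &20000000 would become page number &40020000, while ordinary RAM page numbers stay well below 1<<30 and can't be confused with IO pages.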
One of the things on my todo list is to work out how to make application memory more flexible, essentially granting PMP-level control of memory mapping to applications (so they can map and unmap pages anywhere they want within the entire 512MB application space window). This would allow you to create your own (fake) dynamic areas within application space (I think you’re doing this already?), but have the kernel manage mapping them in and out on task switches. However this won’t solve the problem where you’re wanting to (a) doubly-map memory, and (b) doubly-map memory which doesn’t even belong to you. Another option (perhaps in addition to the above) would be to implement the “area bound to client application” DA flag. Then you could legitimately create a doubly-mapped area within application space (by manually specifying the DA base), which would only leave the problem of wanting to doubly-map memory which doesn’t belong to you. Or (as I’ve mentioned many times before) you could give up on the idea of running old games directly and go down the full system emulation route like ArcEm ;-)
Yeah, hardware scrolling is a bit of a bitch. There are some situations it would work (e.g. vertical scrolling would be fine if the displays are arranged horizontally, and all the same height), but for most multi-monitor situations it probably wouldn’t be possible. But, you’re unlikely to be running a game (especially an old one which wants to use hardware scrolling) that’s spanned across multiple monitors. When you start the game the OS would switch into single-monitor mode, allowing the DA to use a standard flat memory mapping, which would then allow for the same level of hardware scrolling that’s available now. |
Rick Murray (539) 13850 posts |
Just a small question: would this proposal permit two entirely different things on each screen? |
Jeffrey Lee (213) 6048 posts |
Yes. |
Jon Abbott (1421) 2651 posts |
We could probably do with some input from Adrian around how Aemulor provides legacy DA2 support. If there’s no legal way to provide the double (it’s actually tertiary) map of DA2, then let’s not worry about it. I can still directly modify L2PT on entry/exit as is currently happening, and if that eventually turns out to not work, I’ll figure out a workaround.
Where’s the challenge in that! ADFFS is a stepping stone to a full type 1/2 Hypervisor; one of the key things with a Hypervisor is that the code runs natively on the CPU and only falls back to Paravirtualization where absolutely necessary (e.g. CPU privilege level). The only bits that require emulation (or more correctly virtualizing) should be the IO and, in our case, some CPU behaviour to match 26bit CPUs. You just want me to speed up ArcEm … admit it ;-)
Yes, what’s being proposed here is to extend GraphicsV and relevant areas of RISCOS to support multiple graphics cards and multiple monitors. We’re also trying to figure out how to provide legacy MODE support for 1/2/4/8 bpp MODEs on GPUs that only support 24/32 bpp, in a manageable, legal way that allows ADFFS/Aemulor to work and eventually allows RISCOS to provide bpp upscaling natively. |
Jeffrey Lee (213) 6048 posts |
One of the things that’s been on the todo list for a while is implementing support for ROL’s OS_ScreenMode reason codes 7-10. On the surface they look trivial, but OS_ScreenMode 7 has always confused me a bit, preventing me from proceeding. Reading through the memory management docs today, I think it’s finally clicked into place:
A lot of the properties of ROL’s approach sound good for us (giving drivers more control over memory management, simplifying screen scrolling support). But there are some bits that feel a bit nasty (lack of strong guarantees that you’ll be able to get all the screen banks you want/need), and I suspect removing support for scrolling tall framebuffers will cause problems with games. To start with, I think we can implement OS_ScreenMode 7-10 without any significant changes to our memory management:
This should allow software to use multiple screen banks without having to worry about resizing DA 2 manually, as per RISC OS Select (and it should also help them avoid the pitfall of resizing DA 2 on systems that don’t use it). Once the basics are implemented we can then have another look at memory management overall and see if we can come up with a design which gives drivers more control over things without breaking any important use-cases. Perhaps we should take a leaf from the Pi’s book, and have separate “virtual” and “physical” framebuffer sizes? So software can explicitly request a framebuffer which is larger than the screen, and then scroll the display freely (ish) within that area, potentially with VIDC-style wrap around (if the hardware supports it). Single-monitor setups/modes could use this approach for handling multiple screen banks, while more complex scenarios (multi-monitor modes, hardware display rotation, etc.) can use the more restrictive ROL-style approach where each screen bank is a separate block of memory. |
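A quick sketch of what the virtual/physical split with wrap-around might mean in terms of which framebuffer rows get scanned out. This assumes whole-row vertical scrolling only; the names are illustrative and nothing here reflects an actual GraphicsV interface.

```python
def visible_rows(virtual_height, screen_height, scroll_y, wrap=True):
    """Return the virtual framebuffer row numbers shown on screen, top to
    bottom, when the display starts at row scroll_y."""
    if screen_height > virtual_height:
        raise ValueError("screen cannot be taller than the virtual framebuffer")
    if wrap:
        # VIDC-style: running off the bottom continues from row 0.
        return [(scroll_y + row) % virtual_height for row in range(screen_height)]
    # Without wrap-around, clamp so the screen stays inside the framebuffer.
    start = max(0, min(scroll_y, virtual_height - screen_height))
    return list(range(start, start + screen_height))
```

With an 8-row virtual framebuffer and a 4-row screen, starting at row 6 shows rows 6, 7, 0, 1 when wrapping, or rows 4 to 7 when the start is clamped instead.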
Jeffrey Lee (213) 6048 posts |
Some updates:
This has been dealt with via GraphicsV 19 and the pre-existing ExtraBytes control list item (although BCMVideo is the only driver that implements GV 19 at the moment).
The thought I’ve had today, is that for each (doubly-mapped) PMP the kernel could maintain a list which specifies the address offset(s) of the additional mapping(s). The list would store the details for ranges of addresses/pages, rather than storing the data on a per-page basis – avoiding excess memory overheads while keeping lookups reasonably fast. New OS_DynamicArea reason codes would be used to create/destroy/modify the multiply-mapped areas within the DA/PMP. |
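A minimal sketch of the kind of range list described above, assuming non-overlapping ranges kept sorted by first page so lookups can use a binary search. The class and method names are hypothetical, and the real thing would live in the kernel rather than look anything like this.

```python
import bisect


class MappingRanges:
    """Per-area list of page ranges that have additional mappings."""

    def __init__(self):
        self._starts = []  # sorted first-page numbers, one per range
        self._ranges = []  # parallel list of (first_page, page_count, offsets)

    def add(self, first_page, page_count, offsets):
        """Record that pages [first_page, first_page + page_count) are also
        mapped at each address offset in offsets. Ranges must not overlap."""
        index = bisect.bisect_left(self._starts, first_page)
        if index > 0:
            prev_first, prev_count, _ = self._ranges[index - 1]
            if prev_first + prev_count > first_page:
                raise ValueError("overlaps the preceding range")
        if index < len(self._starts) and first_page + page_count > self._starts[index]:
            raise ValueError("overlaps the following range")
        self._starts.insert(index, first_page)
        self._ranges.insert(index, (first_page, page_count, tuple(offsets)))

    def offsets_for(self, page):
        """Return the extra-mapping offsets for a page, or () if it has none."""
        index = bisect.bisect_right(self._starts, page) - 1
        if index >= 0:
            first_page, page_count, offsets = self._ranges[index]
            if page < first_page + page_count:
                return offsets
        return ()
```

Storing one entry per range keeps the memory cost proportional to the number of multiply-mapped regions rather than the number of pages, and the lookup cost grows with the logarithm of the number of ranges.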