Fast RAM to screen copy
Alan Buckley (167) 232 posts |
Having used SDL (Simple DirectMedia Layer) to port games etc. to RISC OS, I was wondering if it would make sense to add a couple of SWIs to RISC OS to allow fast copying of an area of RAM to the screen.

A lot of the games using SDL seem to use a buffer in RAM in which they prepare all or part of the screen. I believe this was done because reading from the memory on a graphics card was so much slower than writing to it. The area of RAM that is copied is created in the correct format for the screen, so it is usually just a straight transfer with no processing of the bytes.

The advantage of having the SWIs would be that it would be possible to create machine-specific versions of the module, which could then use the tricks appropriate for that machine type. As I see it there would be two calls: one to find out information about the screen, so the format the memory needs can be determined, and a second to do the actual screen copies.

Obviously it would be better if we could have a full graphics acceleration API, but that seems far too complicated for the resources available.

I believe there have been other modules that do this for some machines, but the ones I found and looked at always had some kind of restrictive copyright and machine dependencies. What I propose here would be available to all (preferably with the OS), and would have a generic (not so fast) version that could be used on all machines until a better version was written for a particular machine. Ideally the correct version would be built as part of the ROM build.

Any comments? |
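For illustration only, here is a minimal sketch of how the two proposed calls might be used from C. The SWI names, numbers and register assignments are entirely hypothetical – they simply show the “query the screen format, then copy a rectangle” pair described above, using _kernel_swi() from the shared C library’s kernel.h.

  #include "kernel.h"

  /* Hypothetical SWI numbers - nothing like these has been allocated */
  #define FastBlit_ScreenInfo  0x5A580
  #define FastBlit_CopyRect    0x5A581

  typedef struct {
      int width, height;    /* screen size in pixels  */
      int bpp;              /* bits per pixel         */
      int stride;           /* bytes per screen row   */
  } screen_info;

  /* Ask the module for the screen format so the caller can lay out
     its RAM buffer to match */
  static int fastblit_screen_info(screen_info *info)
  {
      _kernel_swi_regs r;
      r.r[0] = (int)info;
      return _kernel_swi(FastBlit_ScreenInfo, &r, &r) == NULL ? 0 : -1;
  }

  /* Copy a rectangle of already screen-formatted pixels to (x,y) */
  static int fastblit_copy(const void *src, int src_stride,
                           int x, int y, int w, int h)
  {
      _kernel_swi_regs r;
      r.r[0] = (int)src;
      r.r[1] = src_stride;
      r.r[2] = x;
      r.r[3] = y;
      r.r[4] = w;
      r.r[5] = h;
      return _kernel_swi(FastBlit_CopyRect, &r, &r) == NULL ? 0 : -1;
  }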
Jeffrey Lee (213) 6048 posts |
Sounds like a good idea to me.
It would also make sense if there was a way for the module to indicate whether direct screen access is preferable to using an offscreen buffer – e.g. for RISC OS 4/6, where screen memory is cacheable.

I’m tempted to suggest that we use an API similar to DirectX/OpenGL, where the graphics library does all the memory management for you and you just use lock/release calls to get pointers to buffers. Or to design it so that it can also be used for fast OS sprite plotting. But I think we should deliberately keep it simple, otherwise we’ll never find the time to finish designing or implementing it! |
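A sketch of what the lock/release idea could look like, loosely modelled on the way SDL and DirectX hand out surface pointers. Every name here is invented; it is only meant to show the shape of the interface, not a real RISC OS API.

  /* Hypothetical surface-locking interface.  The driver decides whether
     to hand back real screen memory or an offscreen buffer, and the
     caller never needs to know which it got. */
  typedef struct {
      void *pixels;     /* only valid between lock and release          */
      int   stride;     /* bytes per row, may exceed width * bpp / 8    */
      int   width, height, bpp;
      int   direct;     /* non-zero if pixels point at screen memory    */
  } gfx_surface;

  gfx_surface *gfx_lock(void);       /* returns whichever buffer the
                                        driver says is faster to render
                                        into on this machine            */
  void gfx_release(gfx_surface *s);  /* copies/flushes to the screen if
                                        the buffer was not direct       */

  /* Typical frame, assuming some application-supplied drawing routine: */
  void draw_frame(void *pixels, int stride);   /* the app's own code    */

  void render_one_frame(void)
  {
      gfx_surface *s = gfx_lock();
      draw_frame(s->pixels, s->stride);
      gfx_release(s);
  }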
Rik Griffin (98) 264 posts |
I wrote something along these lines for Popcorn (a game engine). I think the only “special case” for copying a frame buffer to screen memory currently is the Iyonix, which can use DMA…? All other machines have the screen memory in main RAM (I think – does this apply to BBs etc. too?).

Popcorn uses Thomas Milius’ IntelDMA module to manage DMA transfers, and falls back to a simple memcpy() if that’s not available. As an aside, it also has a memcpy() replacement that uses the application accelerator, which again is Iyonix only. |
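The fallback arrangement described here boils down to something like the sketch below. The dma_copy_available()/dma_copy() names are stand-ins for whatever the IntelDMA module (or any other accelerated copier) actually provides; only the memcpy() path is guaranteed to exist everywhere.

  #include <stddef.h>
  #include <string.h>

  int  dma_copy_available(void);                        /* hypothetical */
  void dma_copy(void *dst, const void *src, size_t n);  /* hypothetical */

  /* Copy one frame's worth of pixels, offloading to DMA when a suitable
     module is present and quietly falling back to the CPU otherwise. */
  static void blit_frame(void *screen, const void *buffer, size_t bytes)
  {
      if (dma_copy_available())
          dma_copy(screen, buffer, bytes);
      else
          memcpy(screen, buffer, bytes);
  }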
Jeffrey Lee (213) 6048 posts |
The BB screen memory is in main RAM, but RISC OS 5 doesn’t mark screen memory as cacheable, so read/write performance is still less than ideal. So until we get cacheable screen memory working with RISC OS 5, it would be nice to be able to use DMA to copy the data around. Updating DMAManager to support RAM-to-RAM DMA transfers has been on the todo list for quite a while, so it would be good if we could get that out of the way.

Hmm, the more I think about this, the more I see hidden complexities. E.g. you’d want the code to be in DMAManager so it can reuse the code which does the logical → physical address conversion, and so it can make sure the transfer doesn’t go wrong if the OS decides to remap some of the pages. But you’d also want the video drivers to be aware of the transfer, so that they can use their own DMA where sensible, and so that you can guarantee that the transfer has completed before any accelerated GraphicsV render ops are performed. And if the transfers are going to be left to run in the background then we’d ideally want a way of making sure the transfer is completed before the CPU is allowed to access screen memory.
The only downside with IntelDMA is that nobody’s worked out how to stop the IOP DMA from conflicting with the audio DMA and causing sound distortion/stuttering. Luckily I don’t think the application accelerator suffers from the same problem. |
Jeff Doggett (257) 234 posts |
My Doom port uses memcpy() to copy the internal Doom buffer to the display RAM. This works fine on the Iyonix, but not AFAIK on the BB. I’ve got some experimental code that writes via a sprite, but the FPS suffers badly.

As already stated above, the API for any new stuff would require a method of

Jeff

Edit: It also occurs to me that it could speed up the OS_SpriteOp plotting calls. Possibly. |
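For context, the reason a plain memcpy() works at all here is that the game’s buffer uses the same pixel format as the screen. A sketch of that kind of frame copy follows; it is not taken from the Doom port, just an illustration of the one complication (the screen’s row stride) that stops it always being a single memcpy().

  #include <stddef.h>
  #include <string.h>

  /* Copy a whole frame from an offscreen buffer into screen memory.
     When screen rows are packed exactly like buffer rows the frame can
     go across in one memcpy(); otherwise copy a row at a time so the
     screen's line stride is respected. */
  static void copy_frame(unsigned char *screen, int screen_stride,
                         const unsigned char *buf, int row_bytes, int rows)
  {
      int y;

      if (screen_stride == row_bytes) {
          memcpy(screen, buf, (size_t)row_bytes * rows);
      } else {
          for (y = 0; y < rows; y++)
              memcpy(screen + (size_t)y * screen_stride,
                     buf + (size_t)y * row_bytes, (size_t)row_bytes);
      }
  }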
Adrian Lees (168) 23 posts |
Wouldn’t it make more sense just to use the standard OS_SpriteOp SWIs and then, perhaps, have the OS_SpriteOp implementation invoke a HAL/GraphicsV routine for blitting, before falling back to its current ‘memcpy’ routine? Then all applications benefit from the speed increase, transparently and without code modification. I certainly don’t think application code should be directly invoking a blitting routine that knows nothing about the graphics clip rect, dual monitor setups, portrait/landscape mode etc.

Geminus already does this, though having been written before the days of ROOL, it has to do a lot more work, since it intercepts at the SWI/vector level and must do all the clipping, transformations etc. too. It admittedly does not have any optimisations specific to the BB yet. |
Jeffrey Lee (213) 6048 posts |
That would make sense, yes. Although judging by Jeff Doggett’s comments, there might already be something stopping OS_SpriteOp from using a basic memcpy() style routine. |
Alan Buckley (167) 232 posts |
This, I guess, would be the ideal solution, since everything could then use it, as you say.
The idea was for graphics-intensive programs like games, which would be working directly with the screen memory and usually taking full control of either the whole screen or the area of an uncovered window. The blitting call would take care of things like dual monitor setups. Graphics clipping would have to be taken into account by the application making the call.

The reasons I thought a new routine might be appropriate were:

1. Having the routines outside of OS_SpriteOp would give some speed increase, because the memory copied is already in the correct format for the screen and we could limit the complexity of the processing.

2. If we had a standard OS call then, regardless of graphics acceleration/DMA transfers, the call could use the best memcpy implementation for the hardware it runs upon. I seem to remember rewriting the memcpy used in SDL (with Adrian’s help) to take advantage of new instructions on the Iyonix, which improved it over the earlier, more generic version.

3. A version of the module could be shipped for RiscPCs and A9s etc. that wouldn’t rely on RISC OS 5. Maybe even ROL could do a version for RISC OS 6.

A nice API like DirectX or OpenGL, as Jeffrey mentioned, would also be great – but the more complex this becomes, the less likely it is to be implemented.

Maybe, as Adrian said, it would be better to just add acceleration to OS_SpriteOp. I am assuming there isn’t too much overhead to get to a routine that does a straight memcpy (or equivalent) if the screen and sprite memory formats match. The one thing that would help libraries like SDL is if the sprite header could be anywhere in memory and not directly before the pixel data (or is this now possible?).

So has my original idea now been shot down? If it has, should it be replaced by HAL/vector calls hidden inside SpriteOp? |
Jeff Doggett (257) 234 posts |
Not shot down, but perhaps smoking slightly. Putting the mods in the OS_SpriteOp call will produce an immediate gain across the board and will be easy to test. There must be plenty of simple programs that do full-screen sprite plotting…

Jeff |
Jeffrey Lee (213) 6048 posts |
Good point – that’s one thing a sprite-based interface would be lacking. I’m not sure whether the best thing to do would be to add a new OS_SpriteOp flag to indicate that you’re providing data/mask/palette pointers via registers, or to modify the sprite header format to allow pointers to be used. Or even to ditch the sprite header entirely, since it’s a bit of a crusty old format that’s difficult to add new features to, and it sometimes references things which aren’t always easy to decode (e.g. using mode numbers instead of the OS 3.5 style sprite mode word).

If each graphics driver has to deal with decoding sprite headers then that could introduce a lot of opportunities for bugs to creep in. So I’m thinking it would be best if the lower-level API was given the data in a more friendly format, and the OS_SpriteOp calls (which still take standard sprites as input) just act as wrappers around it. Software which doesn’t/can’t render via a sprite can bypass the sprite interface entirely and use the lower-level interface directly (but the lower-level interface would still be smart enough to deal with clipping rectangles and dual monitors and things). |
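One possible shape for that “more friendly format” – a decoded, pointer-based description that the sprite layer fills in and a graphics driver consumes. Everything below is hypothetical; it only illustrates the idea of keeping sprite-header parsing out of the drivers.

  /* Hypothetical pre-decoded description handed to a graphics driver,
     so drivers never have to parse sprite headers themselves. */
  typedef struct {
      void           *pixels;    /* start of pixel data                   */
      void           *mask;      /* 1bpp mask data, or NULL               */
      const unsigned *palette;   /* palette entries, or NULL              */
      int             width, height;
      int             stride;    /* bytes per row of pixel data           */
      unsigned        format;    /* e.g. an OS 3.5 style sprite mode word */
  } blit_source;

  /* The OS_SpriteOp wrappers would decode a standard sprite into a
     blit_source; software without sprites could fill one in directly. */
  void driver_blit(const blit_source *src, int x, int y);   /* hypothetical */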
Sprow (202) 1158 posts |
If it’s already in screen format then an OS_SpriteOp is probably the best way to go, and for OS_SpriteOp to be accelerated (where possible) using target-specific hardware. Similarly, memcpy() could be tailored on a target-by-target basis to decide if the region is big enough to merit hardware acceleration. Both of these seem preferable to introducing a new module, since they benefit many applications without needing any external changes. |
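The “big enough to merit hardware acceleration” test is essentially a size threshold, as in the sketch below. The threshold value and the hw_copy* names are made up; the point is only that small copies stay on the CPU, because the setup cost of DMA or other hardware assistance would outweigh the gain.

  #include <stddef.h>
  #include <string.h>

  int   hw_copy_available(void);                          /* hypothetical */
  void *hw_copy(void *dst, const void *src, size_t n);    /* hypothetical */

  #define HW_COPY_THRESHOLD 4096   /* bytes; would need tuning per machine */

  /* Tailored memcpy(): use hardware assistance only when the region is
     large enough for it to pay off, otherwise do a plain CPU copy. */
  void *fast_memcpy(void *dst, const void *src, size_t n)
  {
      if (n >= HW_COPY_THRESHOLD && hw_copy_available())
          return hw_copy(dst, src, n);
      return memcpy(dst, src, n);
  }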
Adrian Lees (168) 23 posts |
Does SDL not provide an “allocate image buffer/off-screen surface/similar” API call which /could/ be implemented such that it creates a sprite header immediately before the image data itself? If not, then I guess I’d rather see OS_SpriteOp extended cleanly such that the header and the image data can be separated. The sprite header provides a means of quickly supplying a number of the parameters (width, height, bits/pixel, component order via ‘sprite type’ etc.).

IMHO the low-level acceleration code/driver should be unaware of higher-level concepts like clip rects, palettes, component order etc. and should be concerned only with transferring bytes. Only if an application really can’t do what it wants via OS_SpriteOp should it incur all the complexity of doing that work itself. Multi-monitor support probably belongs in the GraphicsV/HAL layer, certainly not in the OS_SpriteOp code, but proper multi-monitor support is a much broader question.

As an aside, Geminus implements its caching of sprites in off-screen video memory by intercepting OS_SpriteOp calls. I’d like to see that implemented via a proper API extension, however. Applications that do not (frequently) manipulate their sprites directly and use only OS_SpriteOp calls could benefit from signalling this fact. |
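For reference, the standard sprite header (as documented in the PRM) is small enough that an SDL-style surface allocator could indeed place it directly in front of the pixel data, roughly as sketched below. The allocator is only an illustration and is simplified: rows are assumed to be whole words wide, and a real implementation would also need the sprite area header in front of this before OS_SpriteOp would accept it.

  #include <stdlib.h>
  #include <string.h>

  /* Standard RISC OS sprite header, 44 bytes, as described in the PRM */
  typedef struct {
      int  next;        /* offset to next sprite                         */
      char name[12];    /* sprite name, zero padded                      */
      int  width;       /* width in words, minus 1                       */
      int  height;      /* height in scan lines, minus 1                 */
      int  first_bit;   /* first bit used in each row                    */
      int  last_bit;    /* last bit used in each row                     */
      int  image;       /* offset from this header to the pixel data     */
      int  mask;        /* offset to mask data (== image if no mask)     */
      int  mode;        /* mode number or sprite mode word               */
  } sprite_header;

  /* Allocate pixel storage with a sprite header immediately before it.
     row_bytes must be a multiple of 4 in this simplified sketch. */
  static void *alloc_surface(int row_bytes, int rows, int mode_word,
                             sprite_header **hdr_out)
  {
      sprite_header *h = calloc(1, sizeof *h + (size_t)row_bytes * rows);
      if (h == NULL)
          return NULL;
      strncpy(h->name, "surface", sizeof h->name);
      h->width    = row_bytes / 4 - 1;
      h->height   = rows - 1;
      h->last_bit = 31;                        /* whole words used         */
      h->image    = h->mask = sizeof *h;       /* pixels follow the header */
      h->next     = sizeof *h + row_bytes * rows;
      h->mode     = mode_word;
      *hdr_out    = h;
      return (void *)(h + 1);
  }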
Alan Buckley (167) 232 posts |
SDL does have these calls, and I’m pretty sure I’ve even put some code in to always use sprites (I think this may have been as a test to see if Geminus could speed up SDL), so it may have been a bad example. I think the Chox11 library could have been simplified/speeded up if the header and data could have been separated. But I guess if I don’t have a concrete example, it shows that this requirement may not be as important as I thought.
I think this is what my original idea was about. The difference is that this call would be available to the OS but not to the end user, who should just use OS_SpriteOp. So in OS_SpriteOp, if it finds that the sprite memory format matches the screen memory, it would call something that took the memory pointer, the width, stride and height of the memory, and the x and y screen coordinates (in pixels).
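In C terms, that call would be something like the following prototype. It is purely illustrative; no such routine exists yet, and the real parameter order and units would be for whoever implements it to decide.

  /* Hypothetical low-level blit that OS_SpriteOp (or anything else) would
     drop through to once it knows the source pixels already match the
     screen's format. */
  void screen_blit(const void *src,    /* pre-formatted pixel data        */
                   int src_width,      /* width of the copy, in pixels    */
                   int src_stride,     /* bytes per row in src            */
                   int src_height,     /* number of rows to copy          */
                   int screen_x,       /* destination x, in pixels        */
                   int screen_y);      /* destination y, in pixels        */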
I have also wondered about extending OS_SpriteOp to help with graphics acceleration – things like marking a sprite read-only (or cacheable) so it can be put on the graphics card once and plotted from there. Of course it’s not that easy, as there then needs to be a whole lot of cache management. |
Sprow (202) 1158 posts |
I recently changed the vetting of sprite area pointers (in SpriteExtend) to distinguish which ops only ever do reads and hence could be candidates for caching. What’s missing is some way for an app to hint that the area contents haven’t changed between calls. Anything like a CRC of the data is probably going to take as long as just plotting it longhand, and anything weak like a 4-point check would probably miss the one pixel that got toggled. You want something like a master sequence number in the sprite header: all spriteops increment it by one on changing the sprite, so if the sequence number is the same it’s safe to plot from cache. |
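On the plotting side, the master sequence number idea reduces to a comparison like the one sketched below. The seq field and the cache helpers are hypothetical – no such field exists in the sprite header today, which is exactly the compatibility problem raised in the next post.

  /* Hypothetical cache check driven by a per-sprite sequence number that
     every modifying spriteop would increment. */
  typedef struct {
      unsigned seq;                  /* bumped on every modification      */
      /* ...remainder of an extended sprite header...                     */
  } cached_sprite;

  typedef struct {
      unsigned seq;                  /* sequence number when cached       */
      /* ...handle to the copy held in video memory...                    */
  } cache_entry;

  cache_entry *cache_lookup(const cached_sprite *spr);           /* hypothetical */
  void plot_from_cache(const cache_entry *c, int x, int y);      /* hypothetical */
  void plot_and_recache(const cached_sprite *spr, int x, int y); /* hypothetical */

  void plot_cached(const cached_sprite *spr, int x, int y)
  {
      cache_entry *c = cache_lookup(spr);
      if (c != NULL && c->seq == spr->seq)
          plot_from_cache(c, x, y);      /* contents unchanged since caching */
      else
          plot_and_recache(spr, x, y);   /* plot longhand and refresh cache  */
  }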
Jeffrey Lee (213) 6048 posts |
Unfortunately that isn’t something that will be easy/possible to get working with existing sprites/programs. I suspect a lot of code loads sprite files by simply querying the file size, allocating an appropriately-sized buffer, and loading it into memory. Even if it uses OS_SpriteOp to load the file, the buffer size won’t be big enough to allow sequence numbers to be inserted for each sprite. Then you’ve got to contend with all the programs which modify sprites directly, and deal with situations where sprite files are moved in memory (or tasks are paged in/out) causing the system to get confused, etc.

If we’re going to implement sprite caching then I think the only sensible options we have available are the following:
I’m not going to be a fan of creating a system that’s heavily reliant upon having a list of whitelisted/blacklisted apps which do/don’t work. |
Alan Buckley (167) 232 posts |
A couple more suggestions:
|
Sprow (202) 1158 posts |
I think I was trying to come up with a scheme that meant the vast majority of existing apps didn’t need changing and got the caching ‘for free’. I suppose the simpler way of doing it would be to cache nothing by default and just add a few new spriteops to add to the cache, paint from it, and remove from the cache, then update the apps that care about it to use the new scheme – but then you end up having to have fallbacks for the cases where caching isn’t available. By signalling the caching via the sprite header, such as with my master sequence number scheme, or some other flag, you can keep all the original spriteop reason codes. |
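To make the fallback point concrete: if caching were exposed through new spriteops, every application wanting to use them would need something like the pattern below, testing for the facility and dropping back to a plain plot when it isn’t there. All of the spriteop_* names are invented for the sake of the sketch.

  /* Hypothetical app-side pattern when caching is offered through new
     spriteops rather than 'for free' via the sprite header. */
  int  spriteop_cache_add(const void *sprite);            /* 0 = not available  */
  void spriteop_cache_paint(int handle, int x, int y);
  void spriteop_paint(const void *sprite, int x, int y);  /* ordinary plot      */

  static int cache_handle = 0;

  static void plot_my_sprite(const void *sprite, int x, int y)
  {
      if (cache_handle == 0)
          cache_handle = spriteop_cache_add(sprite);  /* may legitimately fail  */

      if (cache_handle != 0)
          spriteop_cache_paint(cache_handle, x, y);   /* accelerated path       */
      else
          spriteop_paint(sprite, x, y);               /* plain OS_SpriteOp plot */
  }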