Proposed GraphicsV enhancements

307 posts, 36 voices


Dec 20, 2013 4:09pm Jeffrey Lee (213) 6048 posts	Roger that! I’ll have a go at putting something together

Dec 20, 2013 10:30pm Steve Revill (20) 1376 posts	Cheers. If things aren’t at the right point, feel free to defer my request and keep focus on the important stuff; just be aware that I’ll be asking you for this at the point where we start gearing-up to 5.22.

Jan 1, 2014 12:23pm Chris Hall (132) 3588 posts	If you want to try out some stuff on a user, feel free to do so. I am good at commenting on documentation…

Jan 2, 2014 5:36pm Doug Webb (190) 1184 posts	Re: Jeffrey’s submission 15th Dec: “Iyonix: FX-series cards will now advertise 32K colour modes as being red/blue swapped. Also added support for 64K colour modes. I’ve only tested this with an FX 5500 card, so if people could test 32K colour modes and 64K colour modes on other card types then that would be appreciated, as there are a few bits which need to be done differently depending on the card architecture.”
I have tried it on my Iyonix with a Gainward Pro660 Fx5200 128MB card fitted and with Geminus installed, and I get a rather funky and garbled screen when trying 32K and 64K colour screens with the 2nd Jan 14 ROM and HardDisc image installed. Removing Geminus and rebooting allows the 32K and 64K screens to work correctly. I’m happy to continue without Geminus, unless someone knows how to disable the built-in RGB swapping, as I believe I also lose some screen acceleration options without it.
Apart from that, another excellent addition to RISC OS 5 and more fantastic work from Jeffrey.

Jan 3, 2014 10:04am Jeffrey Lee (213) 6048 posts	Yes, it looks like the red/blue swapped modes are causing problems with Geminus. At the moment I’m thinking the best way of fixing it would be to add a system variable or command which can be used to control red/blue swapping of modes on the Iyonix. That way people will be able to disable red/blue swapping of 32K modes if needed, and it could also allow control over red/blue swapping of 8bpp and 16M colour modes (to allow unmodified cards to be used – similar to the PC RGB feature in Geminus). Eventually I’m planning on updating the screen setup plugin in Configure to allow for driver-specific features to be configured (red/blue swapping on Iyonix, overscan settings on the Pi, TV offset on anything with TV-out, etc.)

Jan 23, 2014 1:50pm Jeffrey Lee (213) 6048 posts	When I next have a chance (maybe this weekend?) I think I’ll have a go at rewriting the proposal page so that it correctly reflects the current state of things and what’s left on the todo list. There’s been a fair bit of interest recently in a few different areas of the work so it would be good to update the page to either explain how I think certain things should be implemented or to give a basic roadmap on the implementation order that makes the most sense to me. And with any luck it will reveal some relatively self-contained tasks to allow more people to get involved if they wish.

Jan 23, 2014 4:35pm William Harden (2174) 244 posts	ScrnSetup needs a lot of work for different reasons (EDID, multi-monitor, plus your suggestions above). Writing down what you need from above (likely UI changes and outputs) would be useful. I personally think Screen saver may have to be a self-contained plugin as there are a lot of things that now need to come into ScrnSetup and ScreenSaver would function well independently of ScrnSetup. For my bits: ScrnSetup needs a grouped radio ‘Use monitor mode information (EDID)’ versus ‘Use Monitor Definition file’. We then move the MDF selector up with that into a new group. If EDID is selected, the MDF selector is greyed. At plugin start, we try to read EDID and if the registers are unchanged on exit we grey the EDID selection because we cannot support it. In the group below we have mode info. For outputs, if EDID is selected, the command is X ReadEDID. For MDFs we function as previously. I have not yet got a SWI allocation for preferred mode. So for now EDID should just offer its selections exactly as MDFs do. If we have a SWI ScreenModes_GetPreferredMode we read a VIDC3 data block of mode data, and present our preferred option when the radio is changed. A neat option would be to alter WimpMode to have a WimpMode Auto option which then calls ScreenModes_GetPreferredMode and sets the mode. This would allow an option button in modes of ‘Auto’ which if selected changes the screen setup output to *WimpMode Auto. The resulting configuration says on startup ‘read the EDID and set the mode to whatever the monitor wants’.

Jan 23, 2014 5:24pm William Harden (2174) 244 posts	Jeffrey – also meant to say: I have sorted out a Dropbox for ROOL for EDID stuff (whilst it is pending review and submission). ROOL folk can invite to it, so if you want/need access then let me know or them (I don’t have your email address). I will put the current revision in later tonight.

Jan 26, 2014 6:52pm Jeffrey Lee (213) 6048 posts	The rewritten proposal page is now up, although I think it might be a bit more scatterbrained than the previous version. However it should fully document everything that I have planned so far, whether it’s effectively a full specification (like the pointer changes) or just a quick “this needs fixing but I haven’t decided how yet”. You’ll also spot that there isn’t an area dedicated to the display manager or the screen setup plugin – because I haven’t really given those much thought. After all, the document is primarily about what needs changing in GraphicsV, not what needs changing in other parts of the system in order to properly take advantage of those GraphicsV changes. But there are a few notes dotted around in each individual section wherever I think display manager or screen setup changes are required.

Jan 27, 2014 3:29am Chris Hall (132) 3588 posts	Wow! I’m impressed. Coding and documentation…

Feb 25, 2014 2:06pm Jeffrey Lee (213) 6048 posts	There’s an interesting thing I’ve spotted recently – now that the display manager asks for a full 256 colour palette when you select a 256 colour mode, redrawing the desktop (particularly filer windows) is several times slower than it was before (on an Iyonix, at least). After doing a bit of profiling last night it looks like most of the time is being spent in ColourTrans_SelectTable, where the Wimp is building translation tables to map from the filer sprites to the desktop palette. If the default 256 colour palette is used then ColourTrans will use a specially optimised routine for generating the translation tables, but in full 256 colour modes the check for that optimisation is skipped (even though we’re still using the default palette) – so it falls back to a much slower brute-force routine for generating the translation tables.
So the easy fix for this is to allow ColourTrans to detect when the default 256 colour palette is being used in a full 256 colour mode, so that it can use its optimised routine as usual. But I’m also considering some other improvements that should help things further:
Optimising the brute-force routine. I don’t think I’ll go as far as completely rewriting it, but there are a few simple optimisations that can be made which will help newer CPUs (the current implementation seems to be optimised for ARM2/ARM3, which actually results in a few redundant instructions – e.g. converting numbers to unsigned before squaring them in order to cope with the terrible MUL performance on early ARMs).
A special-case routine for greyscale modes (I think ROL have done similar things?). Not terribly important since I don’t think many people use them, but as it should be pretty easy to implement and will give significant performance gains I think it’s worth spending some time on it.
Smarter translation table caching in the Wimp. At the moment the Wimp doesn’t seem to make any attempt to cache translation tables for palettised sprites – only for true colour sprites. So each individual filer icon that gets drawn results in the translation table being recalculated, even if it’s the same sprite as before. (ColourTrans doesn’t cache the tables either – it only does that for the “32K” style tables.)
Even smarter translation table caching in the Wimp. One thing I spotted during some testing was that performance was about the same regardless of whether the icon text, the sprite, or both were visible. This is because the Wimp doesn’t seem to do any checks for whether the sprite is visible – so even if the sprite part of an icon lies outside of the current graphics rectangle it’ll still waste time building its translation table. Not very nice if you’re dragging one window over another with the filer set to small icons or full info display!
Smarter filer window redraws. From my brief looks at the code in the past when trying to work out how to fix redraws of alpha sprites, it looks like the filer takes two approaches – redraw icons one at a time or redraw the entire window. For the ‘redraw entire window’ approach it should be easy enough to make the code build a list of icons to redraw and then sort them by filetype/sprite, so that (if the improved caching is implemented) the Wimp won’t have to recache the translation tables as often as if the files were drawn in their usual order. This should also give a boost in true colour modes or other situations by making better use of the cache.
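Editor's sketch of the visibility check described in the post above, in Python rather than the Wimp's own assembler. It only illustrates the idea of testing the sprite's bounding box against the current graphics rectangle before paying for a translation table; `Box`, `intersects`, `plot_icon_sprite` and `build_table` are hypothetical names, not Wimp or ColourTrans APIs.

```python
from typing import List, NamedTuple

class Box(NamedTuple):
    x0: int  # left edge (inclusive)
    y0: int  # bottom edge (inclusive)
    x1: int  # right edge (exclusive)
    y1: int  # top edge (exclusive)

def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one pixel."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def plot_icon_sprite(sprite_box, clip, build_table, plot) -> bool:
    """Build the translation table only if the sprite is inside the clip box."""
    if not intersects(sprite_box, clip):
        return False          # sprite not visible: skip the expensive step
    plot(build_table())       # visible: build the table, then plot with it
    return True

builds: List[str] = []

def build_table() -> List[int]:
    """Stand-in for the slow table generation; records each call."""
    builds.append("built")
    return list(range(256))

clip = Box(0, 0, 100, 100)
drawn_a = plot_icon_sprite(Box(200, 0, 234, 34), clip, build_table, lambda table: None)
drawn_b = plot_icon_sprite(Box(90, 90, 124, 124), clip, build_table, lambda table: None)
print(drawn_a, drawn_b, len(builds))
```

In this toy run the first sprite lies wholly outside the clip box, so only the second one triggers a table build.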

Feb 27, 2014 1:26pm Jon Abbott (1421) 2661 posts	I did wonder why desktop 8bpp modes were so slow under RO5; I can now see why. Where is the brute force routine? It sounds like it could probably do with a rewrite if it’s still optimised for ARM3. Everything you’ve suggested sounds like a good route to take. I’ve never looked at ColourTrans or had any reason to use it, but could it not cache translated palettes and just return the pointer to one it’s seen before? I don’t know how it works, so am not sure if that’s actually possible.

Feb 27, 2014 2:45pm Jeffrey Lee (213) 6048 posts	“Where is the brute force routine?”
ColourTrans_SelectTable calls best_colour_safe 256 times, in order to map each source palette entry to a destination palette entry. And best_colour_safe is merely a wrapper for best_colour_fast, which is just a wrapper for the FindCol macro, which will invoke the CompErr macro 256 times in a loop in order to find the closest output colour for the given input colour. (All those routines are in the same source file.) You can’t really avoid calling best_colour_safe 256 times (after all, you have 256 palette entries to find the nearest colour for), so the key bit is either making CompErr faster or avoiding calling it 256 times from within FindCol. (Or cache the results better.) There’s also best_colour256_safe, which is what’s used when ColourTrans detects the default 256 colour palette. That one’s significantly faster because it only loops 4 times (one per tint value) instead of 256 times.
“It sounds like it could probably do with a rewrite if it’s still optimised for ARM3.”
Yeah, I’ve basically ended up with three versions of each routine:
One optimised for ARMv3 and below (where MUL is slow for ‘large’ numbers)
One for ARMv4-ARMv5 (fast MUL)
One for ARMv6+ (fast MUL, MUL allows Rd==Rn, some useful new instructions like UXTB, and the instructions scheduled for Cortex-A8)
So far I’ve only been testing the code on my Iyonix, but tonight I’m planning on trying the BB and RiscPC to get some timings from those.
But having only touched ColourTrans so far, the prognosis is good – I’ve fixed the default palette detection so that full 256 colour modes aren’t hideously slow any more, I’ve added the optimised greyscale function (which makes 256 greyscale modes about 20% faster than 256 colour modes – in previous OS versions they would have been about 5 times slower than 256 colour modes due to using the generic best_colour_safe function), and I’ve optimised the various routines so that they’re significantly faster than before. E.g. Find256 (which is the core of best_colour256_safe) should now be over twice as fast as it was before, even on ARM2.
“I’ve never looked at ColourTrans or had any reason to use it, but could it not cache translated palettes and just return the pointer to one it’s seen before? I don’t know how it works, so am not sure if that’s actually possible.”
It caches the ‘32K’ style tables that are used to map true colour sprites to palettes (as those require significantly more work to compute – I think taking at least a second or two per table on an ARM610), but it doesn’t bother caching the regular lookup tables for mapping from one palette to another. Perhaps that would be worth implementing, at least for the case where the slow best_colour_safe routine is used (the other routines are fast enough that it’s not really necessary). However it’s also arguable that it should be the caller’s responsibility to cache the tables better, especially the Wimp, which will generally be the one piece of software which uses ColourTrans the most for translating palettes.
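Editor's sketch of the brute-force search described in the post above, in Python. It is loosely modelled on the roles of CompErr, FindCol and ColourTrans_SelectTable but is not a port of the ColourTrans source: the function names are hypothetical, and the (2, 4, 1) error loading in R, G, B order is an assumption borrowed from the "RISC OS 3 standard" loading mentioned later in the thread.

```python
# Per-component error loading. (2, 4, 1) for (R, G, B) is an assumption here.
LOADING = (2, 4, 1)

def comp_err(a, b, loading=LOADING):
    """Weighted sum of squared differences between two (r, g, b) colours.

    Squaring makes the sign of each difference irrelevant, so no abs() is needed.
    """
    return sum(w * (x - y) ** 2 for w, x, y in zip(loading, a, b))

def find_col(colour, palette):
    """Index of the palette entry with the smallest error (first entry wins ties)."""
    best_index, best_err = 0, comp_err(colour, palette[0])
    for i in range(1, len(palette)):
        err = comp_err(colour, palette[i])
        if err < best_err:
            best_index, best_err = i, err
    return best_index

def select_table(src_palette, dst_palette):
    """Map every source palette index to its closest destination index."""
    return [find_col(c, dst_palette) for c in src_palette]

# A 256-entry grey ramp as the destination palette.
grey = [(i, i, i) for i in range(256)]
src = [(12, 12, 12), (255, 0, 0), (255, 255, 255)]
table = select_table(src, grey)
print(table)
```

With two 256-entry palettes this performs 256 × 256 = 65,536 error computations per table, which is why the post concentrates on making the inner comparison cheaper or avoiding it altogether.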

Feb 27, 2014 6:19pm Sprow (202) 1167 posts	“There’s an interesting thing I’ve spotted recently […] full 256 colour palette when you select a 256 colour mode, redrawing the desktop (particularly filer windows) is several times slower than it was before (on an Iyonix, at least). After doing a bit of profiling last night it looks like most of the time is being spent in ColourTrans_SelectTable, where the Wimp is building translation tables to map from the filer sprites to the desktop palette. If the default 256 colour palette is used then ColourTrans will use a specially optimised routine for generating the translation tables, but in full 256 colour modes the check for that optimisation is skipped (even though we’re still using the default palette) – so it falls back to a much slower brute-force routine for generating the translation tables.”
I recall sitting through Ben (Avison’s) quite lengthy rant about the Iyonix sprites, each of which has a customised 256 entry palette, and how that thrashes the colour lookup code in the Wimp. Compare with the Ursula sprites, which all use a default desktop palette.
From your list:
Making the Wimp skip the calculation for icons that aren’t going to be plotted seems a nice win.
Improving ColourTrans’ calculation, as it sounds like you’re already doing.
I’d not bother doing anything special with 256 greys.
Cache strategies sound less good since I bet there’ll be mostly misses, and they make the code even less readable. On the other hand, if the Wimp pool has (say) 300 sprites in it then it’s only 75kB of RMA to keep every one to hand.

Feb 27, 2014 11:44pm Jeffrey Lee (213) 6048 posts	“I’d not bother doing anything special with 256 greys.”
Too late – I’ve already written the code! The ColourTrans changes have now been submitted to CVS. Here are some unscientific timings, based around how long it takes to redraw the desktop in a 256 colour mode, with the only window visible being a filer window containing around 80 files.
Iyonix (1920×1200): Apart from the obvious fixes, there are also some minor gains thanks to changes such as not converting the errors to absolute values before squaring them.
Old C256/G256 (i.e. best_colour_safe routine) = 254cs
Old C64 (i.e. best_colour256_safe) = 44cs
New C256/C64 (improved best_colour256_safe) = 41cs
New G256 (best_colour256_grey) = 38cs
New C256/G256/C64 non-default palette (i.e. improved best_colour_safe) = 234cs
BB-xM (1280×1024): Some significant gains here for non-default palettes due to creating a CompErr implementation that is scheduled for the A8 pipeline.
Old C256/G256 = 211cs
Old C64 = 13cs
New C256/C64 = 9cs
New G256 = 9cs
New non-default palette = 118cs
StrongARM RiscPC (1280×1024): Being an IOMD build, this has to support ARMv3, and so the code to deal with slow multiplies is still in place. So best_colour_safe doesn’t see any gains (and actually gets a bit slower somehow).
Old C256/G256 = 584cs
Old C64 = 75cs
New C256/C64 = 70cs
New G256 = 62cs
New non-default palette = 587cs

Feb 28, 2014 7:06am WPB (1391) 353 posts	Great results! You must be pleased with that. Do you expect similar speed-ups on the R-Pi, or is the instruction scheduling not going to result in the same gains?

Feb 28, 2014 9:27am Chris Evans (457) 1614 posts	Thanks. Great work. For comparison, do you have any figures for 16M colour modes? I.e. is 256C still slower than 16M? As your time is very valuable, maybe just on an Iyonix!

Feb 28, 2014 10:18am Jeffrey Lee (213) 6048 posts	“Do you expect similar speed-ups on the R-Pi, or is the instruction scheduling not going to result in the same gains?”
I’d expect the gains to be somewhere in between the gains for the Iyonix and the BB, although I’m not quite sure where. Reduced instruction count (due to some ARMv6+ instructions) will definitely help, as should the better scheduling. But the Cortex-A8 should always gain the most from proper scheduling because it has two pipelines to fill instead of one (and so can do roughly twice as much work while waiting for the results of slow instructions like MUL/MLA).
“For comparison, do you have any figures for 16M colour modes? I.e. is 256C still slower than 16M? As your time is very valuable, maybe just on an Iyonix!”
Luckily for you I did do a brief bit of testing in true-colour modes. These changes won’t speed them up at all, but the figures were (roughly) 60cs for 32K/64K modes and 90cs for 16M colour modes. So 256 colour modes are back as being the fastest available on the Iyonix.

Feb 28, 2014 1:03pm Jon Abbott (1421) 2661 posts	Out of interest, have you tried StrongARM without the fix for slow MUL? Certainly sounds like some great improvements, I’ll be going back to a 256 colour desktop as soon as it’s available :) Regarding the Wimp, I agree it should do the caching; perhaps it should cache all sprite palettes once translated. Sprow mentioned it’s using the RMA; would it not be advisable to move it to its own DA?

Feb 28, 2014 1:43pm Jeffrey Lee (213) 6048 posts	“Out of interest, have you tried StrongARM without the fix for slow MUL?”
No. But I’d expect it to see around the same gains as the Iyonix – i.e. about 10% faster for non-default palettes, and maybe a small gain for default palettes.
“Certainly sounds like some great improvements, I’ll be going back to a 256 colour desktop as soon as it’s available :)”
It’s available now – the changes are in today’s development ROMs.
“Regarding the Wimp, I agree it should do the caching; perhaps it should cache all sprite palettes once translated. Sprow mentioned it’s using the RMA; would it not be advisable to move it to its own DA?”
At the moment I’m only thinking about caching the most recently used translation table, whether the sprite is from the Wimp sprite pools or elsewhere. But extending it to cache translation tables for all the sprites in the Wimp sprite pools would probably be wise, and shouldn’t be too tricky to implement considering that the Wimp tries to keep those sprite pools shielded from direct manipulation. I.e. we can assume that nothing’s going to be modifying Wimp sprite palettes without the Wimp knowing about it.
One thing I’m interested in seeing is how much the filer redraw order will affect things, so I’ll definitely be trying that out at some point. The good thing about that change is that it should help all screen modes, not just palettised ones.

Mar 1, 2014 11:12pm Jeffrey Lee (213) 6048 posts	Now in CVS:
Basic translation table caching in the Wimp (just the last-used table is cached)
Fixes for cases where tables were generated but not used, including the case of shaded/inverted sprites where it used to first generate the standard translation table and then throw it away in order to generate the one used for shading/inversion!
Optimal redraw ordering for filer windows (for most cases, at least)
Some brief stats, using the same basic setup as before (although not all modes tested):
Iyonix
C256 redraw time has dropped to 35cs
Non-default palette time has dropped to 93cs
As I was hoping, the optimal redraw order has also helped true-colour modes. 32K/64K colour mode redraw times have dropped from 65cs to 54cs
16M colour modes have dropped from 80cs to 70cs
BB-xM
I think C256 is now 6cs or 7cs (I didn’t note down the time! But it was definitely faster than the 9cs I was getting before)
Non-default palettes are at 38cs
RiscPC
C256 is now at 56cs
Non-default palettes are at 215cs
I wonder how long it will take me to get used to the new filer window redraw order – it’s odd seeing them redraw in what looks like a random order. But it is definitely faster than it was before!
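Editor's sketch of why a single-entry ("last-used") table cache benefits from sorting the redraw list by sprite, as described in the post above. This is a Python toy model, not the Wimp or Filer code: `LastTableCache`, `redraw` and the file/sprite names are illustrative assumptions.

```python
_EMPTY = object()  # sentinel meaning "nothing cached yet"

class LastTableCache:
    """Remembers only the most recently built translation table."""

    def __init__(self, build):
        self._build = build   # function: sprite name -> translation table
        self._key = _EMPTY
        self._table = None
        self.builds = 0       # how many times the slow path ran

    def get(self, sprite):
        if sprite != self._key:
            self._table = self._build(sprite)
            self._key = sprite
            self.builds += 1
        return self._table

def redraw(icons, cache, sort_by_sprite):
    """Visit each (filename, sprite) icon, optionally grouped by sprite name."""
    order = sorted(icons, key=lambda icon: icon[1]) if sort_by_sprite else icons
    for _name, sprite in order:
        cache.get(sprite)     # real code would plot the sprite with this table
    return cache.builds

def fake_build(sprite):
    """Stand-in for the expensive table generation."""
    return [0] * 256

# Alternating filetypes: the worst case for a one-entry cache.
icons = [("a/txt", "file_fff"), ("b/c", "file_ffd"),
         ("c/txt", "file_fff"), ("d/c", "file_ffd")]

unsorted_builds = redraw(icons, LastTableCache(fake_build), sort_by_sprite=False)
sorted_builds = redraw(icons, LastTableCache(fake_build), sort_by_sprite=True)
print(unsorted_builds, sorted_builds)
```

In directory order every icon misses the cache; grouped by sprite, each distinct sprite is built once, which also explains why the icons appear to redraw in an unfamiliar order.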

Mar 2, 2014 8:10am Jon Abbott (1421) 2661 posts	I’ve just taken a quick look at the code, how accurate does the result have to be? Instead of using the Pythagorean equation in CompErr to work out the 3d vector distance between two colours, could we not use an approximation? FindCol could also exit early for an exact match. A further improvement may be to pre-calculate a distance table for each palette when it’s loaded that contains the distance from (0,0,0) for each entry and then in FindCol calculate the distance from (0,0,0) for the colour you’re searching for and look for the closest value, eliminating CompErr entirely during the redraw phase. What are all the references to load/loading? Is that a weighting to correct the RGB colour space?

Mar 2, 2014 10:20am William Harden (2174) 244 posts	Jeffrey: a less technical addition to the problem – but why do the Raspberry Pi sprites have custom palettes at all? Could we not redo them with default palettes?

Mar 2, 2014 10:53am Michael Drake (88) 336 posts	Because they don’t use colours that exist in the default palette, and have subtle graduated fills. You can convert them to use the default palette using error diffusion, but the result is pretty nasty looking. Anyway, iirc all the toolsprites share the same palette anyway, so it should be quite optimal with caching. Jeffrey: Is there any more scope for optimisation of 16M colour modes? I don’t think I’ve had reason to use the desktop with <16M colours for years.

Mar 2, 2014 1:44pm Jeffrey Lee (213) 6048 posts	“I’ve just taken a quick look at the code, how accurate does the result have to be? Instead of using the Pythagorean equation in CompErr to work out the 3d vector distance between two colours, could we not use an approximation?”
It basically already is an approximation, when you consider that it’s operating in RGB colourspace with an unspecified gamma curve :)
“A further improvement may be to pre-calculate a distance table for each palette when it’s loaded that contains the distance from (0,0,0) for each entry and then in FindCol calculate the distance from (0,0,0) for the colour you’re searching for and look for the closest value, eliminating CompErr entirely during the redraw phase.”
You can’t just calculate the distance from (0,0,0) in the source and destination palettes and match them up:
(r1² + g1² + b1²) - (r2² + g2² + b2²) != ((r1-r2)² + (g1-g2)² + (b1-b2)²)
However, you could generate a table mapping arbitrary source RGB values to the closest palette entry (which is what the “32K” style tables are). Then when calculating the translation table for a palette you use the source palette to get the RGB, and then use the table to get the destination palette entry. The only trouble is that generating such a table takes time, potentially more time than would be spent if the other algorithm was used (consider the case of a frequently changing palette).
At the moment the “32K” style tables are just a flat [R][G][B] array, with between 4 and 6 bits of precision per component. This is generally OK, but will give sub-optimal results for some palettes (e.g. you’ll get a fair bit of banding in greyscale modes). And since it’s the non-standard palettes which we’re looking to optimise (colour lookup for standard palettes is now back at O(1)), it would probably be best to find a better table structure to use, e.g. something like an octree which will be able to adjust itself to exactly the right level of accuracy that’s needed.
Octrees are commonly used (citation needed) for colour quantization (e.g. generating a custom palette for a true colour image), and I see no reason why they couldn’t also be used for pre-existing palettes.
“What are all the references to load/loading? Is that a weighting to correct the RGB colour space?”
Yes, those are weighting values (PRM 3-339). Surprisingly the APIs to control the values are all “for internal use only”, which makes me wonder if anything actually modifies them – if the weightings were fixed then it would allow for extra optimisations. It’s also worth pointing out that different algorithms appear to use different weighting methods:
The “32K” style tables just calculate their error using (abs(r1-r2) + abs(g1-g2) + abs(b1-b2)). No squaring and no weighting! However the builtin 32K table for the default 256 colour palette uses the original sum-of-squares approach, with RISC OS 3 standard (2,4,1) error loading.
InverseTable contains an algorithm which uses ((r1-r2)² + (g1-g2)² + (b1-b2)²), with no weighting. However that algorithm effectively never gets used, because in 2000 the module was switched over to use ColourTrans for generating the tables.
BlendTable (or at least my implementation of it) has a few different algorithms it picks from:
For <256 colour modes, it uses ((r1-r2)² + (g1-g2)² + (b1-b2)²), with no weighting.
For greyscale 256 colour modes it just averages the RGB values and uses a lookup table to find the nearest greyscale palette entry – which should effectively give the same result as the above algorithm.
For other 256 colour modes it just falls back on using a ColourTrans 32K style table.
So there’s definitely room for improvement by settling on a standard error weighting metric.
“Jeffrey: Is there any more scope for optimisation of 16M colour modes? I don’t think I’ve had reason to use the desktop with <16M colours for years.”
The best way of dealing with true colour modes (and palettised modes, really) would be to make better use of GPU acceleration. At 1920×1200×32bpp it takes my Iyonix 50cs to redraw an empty screen. Remove the backdrop and the time drops to 4cs. The pinboard creates a cached version of the backdrop sprite for the current screen mode, so when it’s drawn all that happens is that it uses the kernel OS_SpriteOp that uses a (somewhat) optimised block copy. So all the time is being spent waiting for the memory bus (in particular the slow VRAM writes over the PCI bus). If there was a mechanism for creating and rendering GPU-optimised sprites then the pinboard, Wimp sprite pool, font rendering, etc. could all be changed to use that and the CPU would rarely have to touch the screen itself. In fact, I should probably add this to the todo list since it’s a pretty major bottleneck, and should be relatively easy to implement once screen memory handling is revamped. Other improvements could come from screen memory caching, or an implementation of ROL’s tiled OS_SpriteOp. Even without GPU accelerated sprites the tiled OS_SpriteOp could be a big help – the implementation could just draw the sprite once and then use the existing rectangle copy GraphicsV operation to copy it to all the other tiles. Or for situations where that isn’t possible (masks, alpha blending, etc.) a custom plotter could be used which tries to ensure the sprite is kept in the CPU cache (i.e. draw the first row to all the tiles, then draw the second row to all the tiles, then the third, etc. so that the current sprite row stays in the cache).
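Editor's sketch of the row-interleaved tiled plot suggested at the end of the post above. It is a Python toy model that only shows the access order (one sprite row written to every tile before moving to the next row); it cannot demonstrate the CPU cache effect itself, and `plot_tiled_by_tile` / `plot_tiled_by_row` are hypothetical names, not OS_SpriteOp or GraphicsV calls. The screen and sprite are plain lists of pixel rows.

```python
def plot_tiled_by_tile(screen, sprite, tiles_x, tiles_y):
    """Baseline order: finish one whole tile before starting the next."""
    h, w = len(sprite), len(sprite[0])
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            for row in range(h):
                screen[ty * h + row][tx * w:(tx + 1) * w] = sprite[row]

def plot_tiled_by_row(screen, sprite, tiles_x, tiles_y):
    """Row-interleaved order: sprite row 0 goes to every tile, then row 1, ..."""
    h, w = len(sprite), len(sprite[0])
    for row in range(h):
        line = sprite[row]    # the single source row reused for all tiles
        for ty in range(tiles_y):
            y = ty * h + row
            for tx in range(tiles_x):
                screen[y][tx * w:(tx + 1) * w] = line

def blank_screen(width, height):
    return [[0] * width for _ in range(height)]

sprite = [[1, 2],
          [3, 4]]
screen_a = blank_screen(6, 4)
screen_b = blank_screen(6, 4)
plot_tiled_by_tile(screen_a, sprite, tiles_x=3, tiles_y=2)
plot_tiled_by_row(screen_b, sprite, tiles_x=3, tiles_y=2)
print(screen_a == screen_b)
```

Both orders produce the same pixels; the second simply touches each source row once per pass, which is the property the post expects to keep that row in the CPU cache.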