DVDs (and why I rip them)
Rick Murray (539) 13850 posts |
Taken to Aldershot…
A non-software example. DVDs. The main reason I rip is to watch on my phone (I don’t have a TV, I have a composite video to VGA adapter and a flat panel monitor). But two things really seriously bug the hell out of me:
[I usually watch movies with subtitles on – a habit from the days of page 888 – and I still do it with Netflix; but most DVD subs are an exercise in “how can we make this unbelievably UGLY”?] |
Jeff Doggett (257) 234 posts |
About what seems like 100 years ago, I made sure that before I purchased a DVD player that the firmware could be hacked to remove the UOP (User Operation Prohibition) so that these things could be skipped. I also made sure that it could be made multi region. With RISC OS games etc., the first thing I would always do was to remove the “protection”.
Me too, and I often have the Audio Description switched on as well! |
Rick Murray (539) 13850 posts |
Subs are useful, especially with the modern trend of mumbling dialogue that’s barely audible over the background music. Sometimes they go above and beyond, though. The other day, on Netflix, I was watching something – it might have been Spectros which, if it was, needed subs, being in Brazilian Portuguese (an English dub is available, but I tend to avoid those). Anyway, there was a scene of somebody making a phone call. Seen through a window. With the dialogue inaudible (because window) and the subs faithfully transcribing it all. Other amusing ones are where swear words are muted or, worse, switched out for lesser words. But not in the subs. ;-)

I don’t use audio description. I was watching something on… I think it was BBC One a couple of years back, and the satellite receiver gave me the choice of English or Described. So I switched to Described and tried to keep a straight face as I saw a door slam shut, heard a door slam shut, and then a disembodied voice that sounded bored cut in to say “door slams closed”. Sometimes I get descriptions on Netflix subs. Like “perky music” or “low growl” or “woman screams”.

The first one I ever bought, when DVDs were a new thing (and I got it just to watch Fly Away Home), had a complicated sequence to enter on the remote control to switch region. All the DVD players since then have been picked up from boot sales and the like, so there’s not much chance of choosing for features. It’s more a case of buying one if the price is right [or, as happened earlier in the year, turning one down even for a mere fiver because of a missing remote – I know that many DVD players are next to useless without the controller] |
Raik (463) 2061 posts |
I rip to watch on Ti with mplayer or Kino. |
David Feugey (2125) 2709 posts |
DVD is old school, but still very common. I wonder if Cino could play a DVD on a modern computer, or at least play the ripped MPEG2.
Rick Murray (539) 13850 posts |
Technically anything from the Pi2 onward ought to have enough grunt to decode SD MPEG-2 (DVDs) in software. But…alas…technically RISC OS uses only one of the four cores, so the question becomes “is one core capable?”. Probably the way to find out is to copy a VOB file to media and see if anything plays it satisfactorily. |
Steve Pampling (1551) 8172 posts |
I’ve no idea how far Jeffrey’s work on multicore support has progressed, but if it’s merely at a “throw a task at a second core and let it run” stage then the distinction between the player front end and the back-end ‘engine’ is probably important.
Rick Murray (539) 13850 posts |
Well… Just about to start watching Anon. A little notice at the top left said “Rated 16+” and “Violence”. Then a few seconds later “THIS PROGRAMME CONTAINS PRODUCT PLACEMENT”. Never seen that before, so either Netflix has changed something, or the placement is so damned blatant that they’re calling themselves (it’s a Netflix film!) out on it. |
Adrian Lees (1349) 122 posts |
Cino plays DVD streams at around 18-20fps on a single-core Cortex-A9 device such as the ARMX6/WandBoard without any hardware acceleration, and the code is entirely XScale-era, i.e. no use of NEON acceleration presently. The colour space conversion/upsampling can definitely be accelerated using NEON, and that change alone would probably attain the requisite 25fps. Ideally, YUV overlay hardware would do this to reduce the CPU load further; a GraphicsV API extension was introduced by Jeffrey to facilitate this, although I don’t know how far it’s progressed on the various targets.

The real sticking point on the XScale’s IOP321 was the memory access latency of the CPU when missing in the L1 Dcache, which heavily impacted the B-frame prediction code, but more recent SoCs have an L2 cache and a lower-latency memory system, so they perhaps wouldn’t even need any reworking of that code.

In short, I believe fully software DVD playback to be attainable on all of the PandaBoard, Pi3/ARMbook and WandBoard/ARMX6/ARMini.m even with a single core, and perhaps even the BeagleBoard/derivatives. At the show KinoAMP was playing full screen on the Titanium (albeit with the occasional frame drop) and on the Pi4. |
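For reference, the colour space conversion step mentioned above is, per pixel, roughly the following – a minimal sketch using the common BT.601 limited-range fixed-point approximation; Cino’s actual implementation will differ (lookup tables, whole rows, NEON or overlay hardware rather than per-pixel function calls):

  #include <stdint.h>

  /* One pixel of BT.601 limited-range YUV -> RGB, 8-bit fixed-point.
     Illustrative only; a real player works on whole rows at a time. */
  static inline uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

  static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                         uint8_t *r, uint8_t *g, uint8_t *b)
  {
      int c = (y - 16) * 298;                             /* 1.164 * 256           */
      int d = u - 128;
      int e = v - 128;

      *r = clamp8((c           + 409 * e + 128) >> 8);    /* + 1.596 Cr            */
      *g = clamp8((c - 100 * d - 208 * e + 128) >> 8);    /* - 0.391 Cb - 0.813 Cr */
      *b = clamp8((c + 516 * d           + 128) >> 8);    /* + 2.018 Cb            */
  }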
Raik (463) 2061 posts |
Sorry, it looks like my description was not clear. |
André Timmermans (100) 655 posts |
Dunno for other machines as I only have a Pi3, but on that machine, before any support for hardware overlays or NEON, KinoAmp was able to replay DVDs at full speed if you didn’t apply any scaling (and with occasional hiccups due to disc access). On the other hand, bi-linear interpolation scaling to even the nearest available resolution dropped the frame rate to something like 60%.

Since then, thanks to Jeffrey, we have overlays which make YUV→RGB conversion and scaling to Full HD a breeze. Unfortunately overlays are far from enough to decode Full HD MPEG2s, so I added some optimizations to the IDCT, motion compensation and de-interlacing. Even so I could only get between 13 and 18 fps on the MPlayer test samples. It should still help the replay of DVDs on older machines, since the 60% is what I now get for bi-linear scaling to Full HD in the absence of overlays. |
Rick Murray (539) 13850 posts |
I presume this is software scaling up to Full HD? I wonder how fast it would be if, with no hardware mode change, you selected a resolution under RISC OS that is close to native DVD (720×576 ought to fit into 800×600), plotted directly into that, and left it to the GPU to scale it up to the monitor resolution…? [I think this topic is in danger of needing to leave Aldershot!] |
Rick Murray (539) 13850 posts |
# You can check out any time you like, but… |
Steve Pampling (1551) 8172 posts |
You are Glenn Frey, and I claim my £5 |
Jeffrey Lee (213) 6048 posts |
Support is implemented on the Pi, OMAP3, OMAP4, and OMAP5. Implementing it for the Titanium will be a job for Elesar, due to the closed-source nature of the video driver. Iyonix support is buried somewhere in my todo list (there’s some code in the BeOS driver for configuring the overlays). I can’t remember the details offhand, but iMX6 support will be tricky. Not sure about the ARMbook. This thread has also reminded me that I still need to update the companion VideoOverlay module to add proper official support for using overlays from single-tasking programs. |
Adrian Lees (1349) 122 posts |
@André – there may be faster ways to upscale the image in software. Nearest-neighbour resampling can be done most quickly in my experience by something akin to:

  for (x = 0; x < w; x++)
    colOffset[x] = ...;   /* source offset for destination column x */

  /* Noting that the selection of source columns is invariant w.r.t. scanline... */
  for (y = 0; y < h; y++)
    for (x = 0; x < w; x++)
      pDest[x] = pSrc[colOffset[x]];

so in assembler the inner loop becomes just:

  LDR off,[offsets],#4
  LDR pix,[src,off]
  STR pix,[dst],#4

That’s per-pixel of course, and in reality you’d want to advance the LDR instructions, use as many registers as possible, and delay the use of those registers… by interleaving the processing of n separate pixels, doing a batch of n pixels at a time.

And for bilinear interpolation, as you probably know, use a separable kernel, performing the vertical interpolation first to reuse as many intermediates/calculations as possible as you move along the scanline. For the horizontal interpolation, instead of the ‘obvious’ approach:

  w*x0 + (1-w)*x1

you can rearrange as:

  w*(x0 - x1) + x1

or in practice:

  ((x1 << n) + wmod*(x0 - x1)) >> n

where wmod is w as an m.n fixed-point weighting |
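For what it’s worth, a small self-contained C version of that precomputed-offset idea might look like the following. This is only a sketch – the function and variable names are illustrative, not KinoAMP’s – but the inner loop is the same three-instruction pattern expressed in C:

  #include <stdint.h>
  #include <stdlib.h>

  /* Nearest-neighbour upscale of a 32bpp image. The source column for each
     destination column is computed once, since it is the same for every
     scanline; the inner loop then matches the three-instruction pattern above. */
  void scale_nearest(const uint32_t *src, int srcW, int srcH,
                     uint32_t *dst, int dstW, int dstH)
  {
      int *colOffset = malloc(dstW * sizeof *colOffset);
      if (!colOffset) return;

      for (int x = 0; x < dstW; x++)
          colOffset[x] = (x * srcW) / dstW;          /* nearest source column */

      for (int y = 0; y < dstH; y++) {
          const uint32_t *pSrc  = src + (size_t)((y * srcH) / dstH) * srcW;
          uint32_t       *pDest = dst + (size_t)y * dstW;
          for (int x = 0; x < dstW; x++)
              pDest[x] = pSrc[colOffset[x]];
      }

      free(colOffset);
  }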
André Timmermans (100) 655 posts |
@Adrian
so it’s probably a matter of how fast the extra LDR compares to the MOV + ADD. Regarding bilinear interpolation, the rearrangement to use a single multiplication is actually counterproductive, for two reasons.

Anyway, I have just finished the NEON version of bilinear scaling. The following tests were performed on a Pi3 with audio off, all frames shown without delay (no syncing, ignoring the timestamps), scaled to 1920×1080. The summaries give an idea of how much the NEON code helps in the various parts.

The NEON optimisation nearly halves the scaling time (which seems consistent with the other NEON-optimised parts when I timed them) and in practice lets me display around 90% of the frames of a video, making it very watchable. Of course it doesn’t hold a candle to the use of overlays (which are around 2.5 times faster in the test). One funny, unexpected side effect of the overlays is an apparent speedup of the disc access. This does not occur in normal play, so my guess is that reducing the time spent between disc accesses affects the performance of this old hard disc. |
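To give an idea of where the NEON gain comes from, the vertical half of a bilinear pass can be sketched with intrinsics along these lines. This is illustrative only – the names and the 0..256 weight convention are assumptions, and the real KinoAmp code is hand-written assembler that interleaves much more work:

  #include <arm_neon.h>
  #include <stdint.h>

  /* Blend two source rows into one: dst = (row0*(256-w) + row1*w + 128) >> 8.
     w is the 0..256 fixed-point vertical weight; width is assumed to be a
     multiple of 8 to keep the sketch short. */
  static void blend_rows_neon(const uint8_t *row0, const uint8_t *row1,
                              uint8_t *dst, int width, int w)
  {
      for (int x = 0; x < width; x += 8) {
          uint16x8_t a   = vmovl_u8(vld1_u8(row0 + x));      /* widen to 16 bit    */
          uint16x8_t b   = vmovl_u8(vld1_u8(row1 + x));
          uint16x8_t acc = vmulq_n_u16(a, (uint16_t)(256 - w));
          acc            = vmlaq_n_u16(acc, b, (uint16_t)w); /* acc += b * w       */
          vst1_u8(dst + x, vrshrn_n_u16(acc, 8));            /* round, narrow to 8 */
      }
  }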
Adrian Lees (1349) 122 posts |
@André I did read the KinoAMP upsampling kernel before posting. Whether the two LDRs or one+two DP instructions is faster perhaps depends upon the microarchitecture these days, because of the parallel internal pipes. On a Cortex-A15 there are two arithmetic pipes and two load/store pipes, so perhaps out-of-order execution and register renaming allows it to absorb the extra instructions. But, of course, the short list of ‘offset’ values in the three-instruction approach will be small enough that it just sits in the cache, so on simpler ARMs I’d certainly expect it to be cheaper, provided that you interleave the processing of multiple pixels as suggested. Similarly, dropping multiplications is useful if you are suffering from the /latency/ of the result, which is of course higher than the single-cycle /throughput/.

On the register renaming front, interestingly, in the Cortex-A72 TRM ARM recommends the use of a simple “LDRD r3,r4,[r1,#off] : STRD r3,r4,[r0,#off]” for implementing memcpy/memmove, with no use of more than those two data registers even though the sequence is repeated 8 times. A telling example of how the optimisation rules change over the years ;)

I take your point about operating on multiple components of already-interleaved data – a trick I’ve used myself in the past – but I think for maximal performance you’d ideally be operating on the sub-sampled planar YUV straight out of the decoder core, performing both the colour space conversion and the upsampling as a single operation on-the-fly, without storing out any unscaled RGB data. That’s the software-only case, of course, and I realise that it likely doesn’t fit well with the current structure of the software. As you note, hardware overlay is obviously the way to go; believe me, I spent enough time in my previous employment writing tuned per-pixel processing software for video feeds and bemoaning the fact that “this really isn’t a job software should be doing!” |
André Timmermans (100) 655 posts |
Structure-wise I already have two separate paths – YUV → RGB conversion followed by scaling, and YUV → overlay format – so it should not be too difficult to cope with a third one. The issue is that the YUV → RGB conversion only just fits in terms of the number of registers: 3 planes (Y, U, V), the destination, 3 conversion tables (the standard conversion adjusted with brightness, contrast and color settings) and the loop counters leave just enough work registers. One possibility to explore for the bilinear scaling case: perform the vertical scaling in YUV while at the same time switching from planar YUV to a packed format, and use that as input for the horizontal scaling and conversion to RGB. |
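To illustrate the data flow of that idea, a rough C skeleton for producing one output row might look like this. It is a sketch only – the names, the packing and the exact behaviour are illustrative rather than KinoAmp’s, and it reuses the same BT.601-style constants as the earlier conversion sketch:

  #include <stdint.h>
  #include <stdlib.h>

  /* Pass 1: vertical blend in YUV while converting planar 4:2:0 to a packed
     per-pixel Y,U,V line buffer. Pass 2: horizontal blend plus YUV->RGB in one
     go. Assumes srcW, dstW >= 2 and chroma at half horizontal resolution. */

  typedef struct { uint8_t y, u, v; } yuv_t;

  static inline uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }
  static inline uint8_t lerp8(int a, int b, int w)   /* w is a 0..256 weight */
  {
      return (uint8_t)((a * (256 - w) + b * w) >> 8);
  }

  static void scale_line(const uint8_t *y0, const uint8_t *y1,   /* two luma rows   */
                         const uint8_t *u0, const uint8_t *u1,   /* two chroma rows */
                         const uint8_t *v0, const uint8_t *v1,
                         int srcW, int vweight,
                         uint32_t *dst, int dstW)
  {
      yuv_t *line = malloc(srcW * sizeof *line);
      if (!line) return;

      /* Pass 1: vertical blend, planar -> packed (chroma duplicated across pairs). */
      for (int x = 0; x < srcW; x++) {
          line[x].y = lerp8(y0[x], y1[x], vweight);
          line[x].u = lerp8(u0[x / 2], u1[x / 2], vweight);
          line[x].v = lerp8(v0[x / 2], v1[x / 2], vweight);
      }

      /* Pass 2: horizontal blend plus YUV->RGB, writing 0x00RRGGBB words. */
      for (int x = 0; x < dstW; x++) {
          int pos = (x * (srcW - 1) << 8) / (dstW - 1);   /* 24.8 fixed point */
          int i = pos >> 8, w = pos & 0xff;
          if (i >= srcW - 1) { i = srcW - 2; w = 256; }   /* clamp the right edge */

          int y = lerp8(line[i].y, line[i + 1].y, w);
          int u = lerp8(line[i].u, line[i + 1].u, w);
          int v = lerp8(line[i].v, line[i + 1].v, w);

          int c = (y - 16) * 298, d = u - 128, e = v - 128;
          dst[x] = (uint32_t)clamp8((c + 409 * e + 128) >> 8) << 16 |
                   (uint32_t)clamp8((c - 100 * d - 208 * e + 128) >> 8) << 8 |
                             clamp8((c + 516 * d + 128) >> 8);
      }
      free(line);
  }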