Thinking ahead: Supporting multicore CPUs
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
The reason I refer you to the crusty ancient manual is that it is a pretty detailed guide to how the system behaves. I think that you will be setting yourself up for unnecessary pain if you dive in without a good knowledge of that behaviour. For instance, if you are in a timed callback in IRQ mode and the system “is threaded”, there are restrictions upon what you can do. You will probably receive “FileCore in use” if you try file operations in that state. I can imagine this will be frustrating as hell; but you should understand that your plan is rather more technically inclined than many (more so than anything I’ve ever written) so you will need to do the homework. This is true of any OS.
It is possible to “pre-empt” a singletasking program and run it as a co-operative Wimp task. This is exactly what TaskWindow does, and this is what happens (a lot) when you compile/build programs with the DDE. Possibly GCC as well, but I’ve never used that.
And there’s the problem. You want to run a PMT task as a PMT task, not as a CMT task. The problem arises with the concept of “task switching”. Under RISC OS, every task believes that it is located at &8000. This is the logical memory mapping, which the Wimp fiddles with for every task change.

1 Personally, for things like this, I prefer to keep scheduling a CallAfter instead of a CallEvery, because I know that another event will NOT fire until I have asked for it.
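To make that CallAfter pattern concrete, here is a minimal sketch in C, assuming the RISC OS DDE/GCC headers kernel.h and swis.h. The handler name, interval and workspace pointer are illustrative only, and in a real module the routine would be entered through a CMHG veneer rather than being called directly as a C function.

    /* Minimal sketch of the self-rescheduling CallAfter pattern.
       Assumes the RISC OS C headers kernel.h/swis.h; the names and the
       10cs interval are illustrative only. */
    #include "kernel.h"
    #include "swis.h"

    #define INTERVAL_CS 10                  /* hypothetical delay, centiseconds */

    static void schedule_next(void *workspace);

    /* In a real module this is reached via a CMHG veneer. */
    static void callafter_handler(void *workspace)
    {
        /* ...do the short, IRQ-safe work here... */

        /* Re-arm explicitly: unlike OS_CallEvery, nothing fires again
           until we ask, so a slow run cannot pile up re-entrant calls. */
        schedule_next(workspace);
    }

    static void schedule_next(void *workspace)
    {
        /* OS_CallAfter: R0 = delay in cs, R1 = address to call, R2 = R12 value. */
        _swix(OS_CallAfter, _INR(0, 2), INTERVAL_CS, callafter_handler, workspace);
    }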
Is your OS available anywhere? You make a lot of reference to Windows and Linux, so would it work on a generic “PC”? Can it boot from SD or USB?
I think the GPL question is a bit of a red herring. Unless your code is GPL or you plan to take some GPL code for this, the whole thing is irrelevant. RISC OS (at this level) is a GPL-free zone.
…so… am I reading this correctly? You want to be able to do something pretty low level and technical without first bothering to know about the system that you want to do this upon?
I can imagine that would be “doable”. If you design your code to run in the RMA (I’ll duck from the incoming missiles), that never gets swapped out, so you can switch around as much as you like with fixed-location tasks (as opposed to “all at &8000”). You could do similar within a regular application, but you would need to Poll periodically so other tasks get a chance to run; the problem there is that if another task blocks, your task will be left waiting (like all the rest). Really, this one isn’t a million miles from what TaskWindow does. |
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
Well, true, you literally can’t get better than straight from the source’s mouth. Sorry. |
h0bby1 (2567) 480 posts |
aaaaa |
Ronald May (387) 407 posts |
“On linux the way their kernel and X11 is made is just very bad for this. It’s a little bit less boring to program stuff like icecast or server things, but for multimedia it’s not good.”

You haven’t mentioned the Linux RT kernels available that appear to be for this purpose. Are they any good? I noted that the standard Linux kernel was referred to somewhere as ‘soft realtime’. |
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
I find that VLC and SMPlayer (Windows), VLC (iOS) and MXPlayer (Android) do not experience problems1 on a variety of hardware (x86 and ARM) with a variety of OS types. Both reading from local files and streaming from the internet.

1 Suffice to say that my PC cannot handle a lot of H.264/720P content; and is floored by H.264/1080P.
I don’t know about the technicalities, however I would be extremely surprised if the Windows application is directly driving the playback. Why? Because Windows has quite a bit of latency with its context switches, it isn’t an RTOS so it makes no guarantees about response times, and it has a seriously bad habit of “doing housekeeping” when you don’t want it to and claiming more than its fair share of processor time (or, rather, I think multitasking suspends for some internal tasks). And that isn’t even considering the fact that Windows 8 is supposed to be even worse at context switching. Instead, all the front-end application has to do is make sure the right data is in the right buffers on time. The back end will be taking the data upon interrupt – probably some sort of “buffer is emptying” threshold rather than a ticker, as frame sizes (in bytes) can vary greatly.
To put this into context: MPlayer on RISC OS can manage a 320×240 video while performing all of the decoding in software. Sure, it is pure ARM and could probably benefit from using some NEON/SIMD code, but I doubt you’d push it much beyond 480P. To do what RaspBMC does, you need to involve the GPU and actually hand off a lot of the work to something else; once that has been done, your process switching times are less important, just as long as the buffer is big enough that it won’t run out…

Back to my (infamous!) PVR. There’s no way a 200MHz ARM is going to be capable of either recording or playing back 640×480 video at 25fps. Ain’t happening. But it does. Magic? No. The secret is in the fact that when the unit is recording, it can take several seconds for the on-screen display to appear. There’s a lot of behaviour (file operations and such) all running low-level on interrupts. When the DSP has encoded some video, it has to be written and it has to be written now. All of that takes precedence. You’ll notice that the vrecorder application doesn’t do a lot. It passes some parameters to a (closed source) module and then busy-waits on a response. The front end UI thing? That’s like the least important part. It gets some processing time if nothing else gets in. Ditto telnet tty, lighttp, and the other rubbish built into the PVR. It all works during recording, but it works slowly.

If you don’t want to hear about my PVR any more, then here’s another example. XP running on a 466MHz Celeron cannot keep up with a raw 115kbps serial stream. My satellite receiver can spit out a dump of its firmware and settings (about 4MiB I think) but as it is a basic serial port, it has no concept of flow control. And that machine simply couldn’t achieve low enough latency under Windows. I wrote a DOS program which should have worked, but “DOS” under XP isn’t real DOS, so even in “DOS mode” it would lose data. The 16550 compatible UART ought to have an 8 or 16 byte FIFO, yet XP still couldn’t keep up.
Which means it is a matter of raw CPU power. Switching tasks is expensive (cache pollution, TLB flushes, not to mention the overheads of saving state and messing with memory mappings). Different processor families take different approaches to try to minimise the problem, however one could imagine that a typical context switch between different processes takes around 30µs on modern(ish) hardware, and considerably longer on older hardware.
Thinking about it, a thread switch is more like a “lightweight” process context switch. Lightweight in that you tend to use the same memory space between threads, so while you may suffer cache pollution, you probably won’t need to worry about the TLB or altering the memory mapping. Of course, we are chasing a moving target here: background activity that uses interrupts (USB, ticker, network, keyboard, video…) will mess with the cache and possibly the TLB. On more advanced systems that manage virtual memory by trapping page faults, there might be memory mapping involved as well. After all, a (single core) processor can only do one thing at a time, so…
How old are you? They fairly quickly burned through eight versions of DirectX before they had something that was actually usable. I remember each demo game in the Windows 95 days would come with a yet later release of DirectX.
By necessity. This, mind you, is a huge leap away from the UI task switching malarkey. You could run Quake in a window if you wanted, and on modern machines you might be able to run it at a decent speed with a resolution worth looking at. On older machines, it was painful. Because the power just wasn’t there and you just couldn’t get enough done in the task slice allocated.
What sort of interrupt? I would say a ticker interrupt can be used for pre-emption; however a “buffer needs data!” interrupt really ought to go directly to a low level driver that will copy data from a “big buffer” to hardware:

Application -> Big buffer -> Driver -> Hardware
I’m not sure these two things can be reconciled. The more task switching you have, the worse latency will become. The method of switching (CMT or PMT) doesn’t really matter here; every switch will suffer the overheads, and the more tasks that are being run by the machine, the slower everything will become.
If I was engineering the system, the application’s job would be to open the file, split it into the appropriate streams, and put some data (perhaps partially decoded depending upon specifics) into a large buffer. The only job the application would need to do aside from “the UI stuff” would be “is there space in the buffer? if so, fill it with something”.
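A rough sketch of that split in plain C follows; all names are invented for illustration. The application tops up the “big buffer” whenever it gets a turn, and the driver drains it at interrupt time. With a single producer and a single consumer, simple head/tail counters are enough.

    /* Illustrative "big buffer" between application and driver.
       One producer (application) and one consumer (interrupt code),
       so plain head/tail counters suffice.  Names are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    #define BUF_SIZE 65536u                       /* power of two */

    static volatile uint8_t  buffer[BUF_SIZE];
    static volatile uint32_t head;                /* advanced by the application */
    static volatile uint32_t tail;                /* advanced by the driver      */

    static uint32_t bytes_pending(void) { return head - tail; }
    static uint32_t space_free(void)    { return BUF_SIZE - 1 - bytes_pending(); }

    /* Application side: "is there space in the buffer? if so, fill it". */
    void app_top_up(size_t (*produce)(uint8_t *dst, size_t max))
    {
        while (space_free() > 0) {
            uint8_t chunk[4096];
            size_t  want = space_free() < sizeof chunk ? space_free() : sizeof chunk;
            size_t  got  = produce(chunk, want);  /* read/decode the next piece */
            if (got == 0)
                break;                            /* nothing more available yet */
            for (size_t i = 0; i < got; i++)
                buffer[(head + i) & (BUF_SIZE - 1)] = chunk[i];
            head += got;
        }
    }

    /* Driver side: called at interrupt time when the hardware wants data. */
    void irq_drain(void (*to_hardware)(uint8_t byte), size_t max_bytes)
    {
        while (max_bytes-- && bytes_pending() > 0) {
            to_hardware(buffer[tail & (BUF_SIZE - 1)]);
            tail++;
        }
    }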
Split off the things that need low latency from the things that don’t. Handle them specifically. Use buffers to deal with differences between the backend and the application. Or you could be like !DigitalCD and have the back end module deal with reading and decoding the data all by itself. [ http://www.riscos-digitalcd.net/digitalcd/player/intro.htm ]

If you handle interrupts by switching to the appropriate task to deal with the interrupt, not only is there the latency of the task switch, but there is also the risk of hijacking the system with a “favoured” task. My machine’s average latency is about 115µs. The maximum peaks at around 3500µs. Loading SMPlayer peaked at 17000µs before dropping to 450–1500µs during (lowish resolution) video playback. In short – you can’t really make assumptions about how fast the application level latency will be. So it is perhaps better to deal with the low level stuff in a specific “driver” module. If nothing else, it stops application code from directly messing with hardware.
Funny. My 800MHz ARM (Pi) can wipe the floor with what the PC (Atom slightly overclocked to 1.8GHz) can manage. Okay, it probably isn’t fair to compare a quad 1.5GHz processor with a faking-two-cores 1.6GHz processor; however the fact that the Pi can easily best the PC suggests that, really, Linux ain’t so bad…
See my previous rant pointing out that Windows software is more or less compatible with Windows across all sorts of different processor types and PC builds. Linux, on the other hand, is spectacularly bad in this respect and software either needs to be supplied as source, or it needs to be prebuilt for half a dozen “popular” versions of Linux – such as here: http://www.skype.com/en/download-skype/skype-for-linux/ |
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
I would differ on that opinion. Looking at the datasheet of my PVR (since modern GPUs tend to be under NDA), I can see it deals with Huffman blocks in hardware, colourspace conversion in hardware, realtime image rescaling in hardware – and this isn’t a proper GPU, it’s a DSP with some image-specific parts added.
Ah. And here is the reason for the confusion. We’re talking about pre-emptive multitasking and interrupt pre-emption at the same time. When I refer to pre-emption, I’m referring to multitasking because, as you observe, there is not much point in using the word “pre-emption” with interrupts as they cannot sensibly work any other way.
Which is possibly why hardware drivers should be special modules that can respond rapidly via the interrupt dispatch, as opposed to anything running as a multitasked user application.
The big question here is – what was the signal system devised for? I rather suspect it was in the days before GUIs were commonplace, when system architecture was different to nowadays. So maybe expecting signal to work with X11 is about as hopeful as expecting BBC VDU calls to work under the Wimp (they can, but maybe not quite as a Beeb programmer would expect).
Isn’t that just the normal POSIX way?
Again, it perhaps works well for console tasks, maybe not so good for GUI ones.
RISC OS method: Write a module to receive data from the device at low latency and buffer it. When data is available, the module sets a “pollword”. The UI application can process slowly (say Wimp_PollIdle asking to be called once per second) and when data is available, the Wimp will call the application with the pollword event. The Wimp actually passes you the value of the pollword so you could embed information in it, like the low byte records how many bytes of data are available, or something like that. NEVER should the multitasked program attempt to directly interrogate the hardware. Not only is it bad form, but most OSs cannot provide guarantees as to how often the program will be called.
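For the application side of that scheme, a hedged sketch in C: it assumes the task has already called Wimp_Initialise and that the driver module has somehow handed us the address of a pollword held in the RMA. The SWI numbers, mask bit and reason code are quoted from memory, so check them against the PRM before relying on them.

    /* Sketch of a Wimp_PollIdle loop driven by a module-owned pollword.
       Mask bit 22 and reason code 13 are from memory - verify against
       the Wimp documentation for your OS version. */
    #include "kernel.h"
    #include "swis.h"

    #ifndef Wimp_PollIdle
    #define Wimp_PollIdle        0x400E1
    #endif
    #ifndef OS_ReadMonotonicTime
    #define OS_ReadMonotonicTime 0x42
    #endif
    #define PollWord_NonZero     13          /* Wimp_Poll reason code */

    void poll_loop(volatile int *pollword)
    {
        int block[64];                       /* 256-byte poll block */

        for (;;) {
            int reason, now;
            _swix(OS_ReadMonotonicTime, _OUT(0), &now);

            /* Sleep until now + 100cs unless the pollword goes non-zero;
               bit 22 of the mask enables pollword scanning via R3. */
            _swix(Wimp_PollIdle, _INR(0, 3) | _OUT(0),
                  1u << 22, block, now + 100, pollword, &reason);

            if (reason == PollWord_NonZero) {
                *pollword = 0;               /* acknowledge                   */
                /* ...fetch the buffered data through the module's own API... */
            } else if (reason == 17 || reason == 18) {
                if (block[4] == 0)           /* Message_Quit                  */
                    break;
            }
            /* ...handle redraws and other events here... */
        }
    }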
;-) Things like this are complicated. That’s why most programmers write fart apps these days… [look on any App Store for the sheer amount of dross]
My observation was that a mobile ARM based device can wipe the floor with my PC. So either Linux isn’t so bad at media recording and playback, or the ARM is so far ahead of x86 in terms of “getting actual stuff done” that it is unreal. Take your pick.
No, you don’t need to decode 10,000 frames (indeed decoding that lot would take a while) but you do need large enough buffers to smooth things out and make sure everything “just works”. How and what depends a lot on the type of system and its capabilities.
You DO realise, I hope, that most pre-emptive multitaskers use a timer interrupt to call the scheduler to switch tasks. Okay, some might be more capable and respond to other things too; but time-sliced multitasking is… well, there’s a clue in the name. As for your example – games have been written for RISC OS; I’m quite partial to Cannon Fodder, and SF3000 is pretty fluid on modern ARMs. So either we’ve all reinvented threads over and over, or there are other ways that work. ;-) Okay, time to go out and pretend it’s an interesting weekend… |
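As a generic illustration of that timer-driven switching – nothing below is RISC OS code, and the context save/restore helpers are assumed, architecture-specific routines – the tick handler’s job is roughly this:

    /* Generic shape of a time-sliced pre-emptive scheduler (illustration
       only).  The timer ISR decides whether the current slice is used up
       and, if so, saves the current task and resumes the next one. */
    #include <stdint.h>

    #define MAX_TASKS   8
    #define SLICE_TICKS 2                     /* pre-empt after 2 timer ticks */

    typedef struct {
        uint32_t regs[16];                    /* saved register file (hypothetical) */
        int      runnable;
    } task_t;

    static task_t tasks[MAX_TASKS];           /* task 0: always-runnable idle task */
    static int    current;
    static int    ticks_left = SLICE_TICKS;

    extern void context_save(uint32_t *regs);     /* assumed, arch-specific */
    extern void context_restore(uint32_t *regs);  /* assumed, arch-specific */

    /* Called from the timer interrupt. */
    void timer_tick(void)
    {
        if (--ticks_left > 0)
            return;                           /* slice not used up yet */
        ticks_left = SLICE_TICKS;

        context_save(tasks[current].regs);

        /* Simple round robin: the next runnable task gets the CPU. */
        do {
            current = (current + 1) % MAX_TASKS;
        } while (!tasks[current].runnable);

        context_restore(tasks[current].regs); /* resumes the chosen task */
    }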
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
Chris Mahoney (1684) 2165 posts |
I see that there’s a new Raspberry Pi 2… and it’s quad core :) |
andym (447) 473 posts |
Just bought one from RS (available here now!) and am wondering if I’ve wasted £27 that I could have instead spent on RISC OS hardware? Anyone from ROOL shed any light on whether RISC OS will run? |
Rick Murray (539) 13840 posts |
If it doesn’t today, it probably will by the time you receive it. Won’t do quad core mind you :-) though I think the barrier to support for and playing with multiple cores has just dropped through the floor. More or less the same price as the B+? How is that even possible? |
Tony Noble (1579) 62 posts |
According to the Pi folks, it’s fully binary compatible, so it should work. Of course, should doesn’t always translate to does… |
Neil Young (2592) 2 posts |
“More or less the same price as the B+? How is that even possible?” Raspis use a very cheap off-the-shelf SoC; the new one is probably nearly identical in price. Yay for economies of scale! |
Wouter Rademaker (458) 197 posts |
The ARM core is just a small part of the SoC. It sounds like the (small?) redesign needed is being done by Raspberry Pi Foundation members. |
Anthony Vaughan Bartram (2454) 458 posts |
Is there a branch in SVN/CVS where any tasks or elements relating to re-factoring etc. for multi-core are being prototyped? I’m a newbie to RISC OS (so am trying to avoid saying something way off target or naive), however I’ve worked on re-factoring personally and professionally over the years and I have written various threaded systems… I’ve been reading David Feugey’s posts on AMP and it sounds like an interesting way forward. Having recently got a Pi 2 (which is gathering a little dust whilst I try and get a computer game released), I am tempted to try and contribute to this effort a little.

1) I see this as a re-factoring problem. In order to achieve large scale re-factoring in the past, I have applied the adaptor and facade design patterns successfully.

2) Would rapid application development help? For example, I could try writing something using LDREX and STREX (or perhaps other opcodes depending on which RISC OS CPU target was being built, i.e. SWP for legacy), which I think are intended for semaphore/lock construction. I was going to try this to dip my toe into ARM coding…

I have wondered whether cooperative multi-tasking and multiple cores are not mutually exclusive, i.e. whether the scheduling method, whilst ideally pre-emptive, would still work on multiple cores if they simply cooperated. Threading SWIs could be built using the same paradigm, i.e. child cooperatively scheduled lightweight tasks.

I wonder whether resources or operating system features that suffer contention in a multi-core/multi-process environment could be single threaded and service a job request FIFO queue that is polled cooperatively, rather than requiring to be isolated with a new interface put on them; i.e. instead of changing interfaces, create an adaptor which the existing code calls, thus preventing that code being changed or made aware that the operating system feature or resource is no longer local to the core it is running on.

Could an SWI’s parameters be queued (via a ring buffer) and re-directed via a different SWI handler on one core to a master handler on the main core? Perhaps responses could be returned and managed via a ring buffer to avoid serialisation between cores and to improve asynchronous execution of code. |
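Regarding the LDREX/STREX idea above: here is a minimal sketch of the kind of lock those instructions make possible, assuming an ARMv7 target (the Pi 2’s Cortex-A7) and GCC-style inline assembly. This only illustrates the primitive; it is not an existing RISC OS interface.

    /* LDREX/STREX spinlock sketch - assumes ARMv7 and GCC inline asm. */
    typedef volatile unsigned int spinlock_t;

    static inline void spin_lock(spinlock_t *lock)
    {
        unsigned int tmp, failed;
        __asm__ volatile(
            "1:  ldrex   %0, [%2]      \n"  /* read lock, open exclusive monitor */
            "    teq     %0, #0        \n"  /* already held?                     */
            "    bne     1b            \n"  /*   yes: spin                       */
            "    mov     %0, #1        \n"
            "    strex   %1, %0, [%2]  \n"  /* try to store 1                    */
            "    teq     %1, #0        \n"  /* lost exclusivity to another core? */
            "    bne     1b            \n"  /*   yes: retry                      */
            "    dmb                   \n"  /* barrier before the critical section */
            : "=&r" (tmp), "=&r" (failed)
            : "r" (lock)
            : "cc", "memory");
    }

    static inline void spin_unlock(spinlock_t *lock)
    {
        __asm__ volatile("dmb" ::: "memory");   /* drain writes made under the lock */
        *lock = 0;                              /* plain store releases the lock    */
    }

A cross-core ring buffer of queued SWI parameters could then either be protected with a lock like this, or built lock-free from the same LDREX/STREX primitives.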