Thinking ahead: Supporting multicore CPUs
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
The reason I refer you to the crusty ancient manual is that it is a pretty detailed guide to how the system behaves. I think that you will be setting yourself up for unnecessary pain if you dive in without a good knowledge of that behaviour. For instance, if you are in a timed callback in IRQ mode and the system “is threaded”, there are restrictions upon what you can do. You will probably receive “FileCore in use” if you try file operations in that state. I can imagine this will be frustrating as hell; but you should understand that your plan is rather more technically inclined than many (more so than anything I’ve ever written) so you will need to do the homework. This is true of any OS.
It is possible to “pre-empt” a singletasking program and run it as a co-operative Wimp task. This is exactly what TaskWindow does, and this is what happens (a lot) when you compile/build programs with the DDE. Possibly GCC as well, but I’ve never used that.
And there’s the problem. You want to run a PMT task as a PMT task, not as a CMT task. The problem arises with the concept of “task switching”. Under RISC OS, every task believes that it is located at &8000. This is the logical memory mapping, which the Wimp fiddles with for every task change.

1 Personally, for things like this, I prefer to keep scheduling a CallAfter instead of a CallEvery, because I know that another event will NOT fire until I have asked for it.
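To make that CallAfter pattern concrete, here is a minimal sketch in C, assuming the RISC OS DDE/GCC headers kernel.h and swis.h. The handler name, interval and workspace pointer are illustrative only, and in a real module the routine would be entered through a CMHG veneer rather than being called directly as a C function.

    /* Minimal sketch of the self-rescheduling CallAfter pattern.
       Assumes the RISC OS C headers kernel.h/swis.h; the names and the
       10cs interval are illustrative only. */
    #include "kernel.h"
    #include "swis.h"

    #define INTERVAL_CS 10                  /* hypothetical delay, centiseconds */

    static void schedule_next(void *workspace);

    /* In a real module this is reached via a CMHG veneer. */
    static void callafter_handler(void *workspace)
    {
        /* ...do the short, IRQ-safe work here... */

        /* Re-arm explicitly: unlike OS_CallEvery, nothing fires again
           until we ask, so a slow run cannot pile up re-entrant calls. */
        schedule_next(workspace);
    }

    static void schedule_next(void *workspace)
    {
        /* OS_CallAfter: R0 = delay in cs, R1 = address to call, R2 = R12 value. */
        _swix(OS_CallAfter, _INR(0, 2), INTERVAL_CS, callafter_handler, workspace);
    }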
Is your OS available anywhere? You make a lot of reference to Windows and Linux, so would it work on a generic “PC”? Can it boot from SD or USB?
I think the GPL question is a bit of a red herring. Unless your code is GPL or you plan to take some GPL code for this, the whole thing is irrelevant. RISC OS (at this level) is a GPL-free zone.
…so… am I reading this correctly? You want to be able to do something pretty low level and technical without first bothering to know about the system that you want to do this upon?
I can imagine that would be “doable”. If you design your code to run in the RMA (I’ll duck from the incoming missiles), that never gets swapped out, so you can switch around as much as you like with fixed-location tasks (as opposed to “all at &8000”). You could do similar within a regular application, but you would need to Poll periodically so other tasks get a chance to run; the problem there is that if another task blocks, your task will be left waiting (like all the rest). Really, this one isn’t a million miles from what TaskWindow does. |
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
Well, true, you literally can’t get better than straight from the source’s mouth. Sorry. |
h0bby1 (2567) 480 posts |
aaaaa |
Ronald May (387) 407 posts |
“On linux the way their kernel and X11 is made is just very bad for this. It’s a little bit less boring to program stuff like icecast or server things, but for multimedia it’s not good.”

You haven’t mentioned the Linux RT kernels available that appear to be for this purpose. Are they any good? I noted that the standard Linux kernel was referred to somewhere as ‘soft realtime’. |
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
I find that VLC and SMPlayer (Windows), VLC (iOS) and MXPlayer (Android) do not experience problems1 on a variety of hardware (x86 and ARM) with a variety of OS types. Both reading from local files and streaming from the internet.

1 Suffice to say that my PC cannot handle a lot of H.264/720P content; and is floored by H.264/1080P.
I don’t know about the technicalities, however I would be extremely surprised if the Windows application is directly driving the playback. Why? Because Windows has quite a bit of latency with its context switches, it isn’t an RTOS so it makes no guarantees about response times, and it has a seriously bad habit of “doing housekeeping” when you don’t want it to and claiming more than its fair share of processor time (or, rather, I think multitasking suspends for some internal tasks). And that isn’t even considering the fact that Windows 8 is supposed to be even worse at context switching. Instead, all the front-end application has to do is make sure the right data is in the right buffers on time. The back end will be taking the data upon interrupt – probably some sort of “buffer is emptying” threshold rather than a ticker, as frame sizes (in bytes) can vary greatly.
To put this into context: MPlayer on RISC OS can manage a 320×240 video while performing all of the decoding in software. Sure, it is pure ARM and could probably benefit from using some NEON/SIMD code, but I doubt you’d push it much beyond 480P. To do what RaspBMC does, you need to involve the GPU and actually hand off a lot of the work to something else; once that has been done, your process switching times are less important, just as long as the buffer is big enough that it won’t run out…

Back to my (infamous!) PVR. There’s no way a 200MHz ARM is going to be capable of either recording or playing back 640×480 video at 25fps. Ain’t happening. But it does. Magic? No. The secret is in the fact that when the unit is recording, it can take several seconds for the on-screen display to appear. There’s a lot of behaviour (file operations and such) all running low-level on interrupts. When the DSP has encoded some video, it has to be written and it has to be written now. All of that takes precedence. You’ll notice that the vrecorder application doesn’t do a lot. It passes some parameters to a (closed source) module and then busy-waits on a response. The front end UI thing? That’s like the least important part. It gets some processing time if nothing else gets in. Ditto telnet tty, lighttp, and the other rubbish built into the PVR. It all works during recording, but it works slowly.

If you don’t want to hear about my PVR any more, then here’s another example. XP running on a 466MHz Celeron cannot keep up with a raw 115kbps serial stream. My satellite receiver can spit out a dump of its firmware and settings (about 4MiB I think) but as it is a basic serial port, it has no concept of flow control. And that machine simply couldn’t achieve low enough latency under Windows. I wrote a DOS program which should have worked, but “DOS” under XP isn’t real DOS, so even in “DOS mode” it would lose data. The 16550 compatible UART ought to have an 8 or 16 byte FIFO, yet XP still couldn’t keep up.
Which means it is a matter of raw CPU power. Switching tasks is expensive (cache pollution, TLB flushes, not to mention the overheads of saving state and messing with memory mappings). Different processor families take different approaches to try to minimise the problem, however one could imagine that a typical context switch between different processes takes around 30µs on modern(ish) hardware, and considerably longer on older hardware.
Thinking about it, a thread switch is more like a “lightweight” process context switch. Lightweight in that you tend to use the same memory space between threads, so while you may suffer cache pollution, you probably won’t need to worry about the TLB or altering the memory mapping. Of course, we are chasing a moving target here: background activity that uses interrupts (USB, ticker, network, keyboard, video…) will mess with the cache and possibly the TLB. On more advanced systems that manage virtual memory by trapping page faults, there might be memory mapping involved as well. After all, a (single core) processor can only do one thing at a time, so…
How old are you? They fairly quickly burned through eight versions of DirectX before they had something that was actually usable. I remember each demo game in the Windows 95 days would come with a yet later release of DirectX.
By necessity. This, mind you, is a huge leap away from the UI task switching malarkey. You could run Quake in a window if you wanted, and on modern machines you might be able to run it at a decent speed with a resolution worth looking at. On older machines, it was painful. Because the power just wasn’t there and you just couldn’t get enough done in the task slice allocated.
What sort of interrupt? I would say a ticker interrupt can be used for pre-emption; however a “buffer needs data!” interrupt really ought to go directly to a low level driver that will copy data from a “big buffer” to hardware:

Application -> Big buffer -> Driver -> Hardware
I’m not sure these two things can be reconciled. The more task switching you have, the worse latency will become. The method of switching (CMT or PMT) doesn’t really matter here; every switch will suffer the overheads, and the more tasks that are being run by the machine, the slower everything will become.
If I was engineering the system, the application’s job would be to open the file, split it into the appropriate streams, and put some data (perhaps partially decoded depending upon specifics) into a large buffer. The only job the application would need to do aside from “the UI stuff” would be “is there space in the buffer? if so, fill it with something”.
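A rough sketch of that split in plain C follows; all names are invented for illustration. The application tops up the “big buffer” whenever it gets a turn, and the driver drains it at interrupt time. With a single producer and a single consumer, simple head/tail counters are enough.

    /* Illustrative "big buffer" between application and driver.
       One producer (application) and one consumer (interrupt code),
       so plain head/tail counters suffice.  Names are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    #define BUF_SIZE 65536u                       /* power of two */

    static volatile uint8_t  buffer[BUF_SIZE];
    static volatile uint32_t head;                /* advanced by the application */
    static volatile uint32_t tail;                /* advanced by the driver      */

    static uint32_t bytes_pending(void) { return head - tail; }
    static uint32_t space_free(void)    { return BUF_SIZE - 1 - bytes_pending(); }

    /* Application side: "is there space in the buffer? if so, fill it". */
    void app_top_up(size_t (*produce)(uint8_t *dst, size_t max))
    {
        while (space_free() > 0) {
            uint8_t chunk[4096];
            size_t  want = space_free() < sizeof chunk ? space_free() : sizeof chunk;
            size_t  got  = produce(chunk, want);  /* read/decode the next piece */
            if (got == 0)
                break;                            /* nothing more available yet */
            for (size_t i = 0; i < got; i++)
                buffer[(head + i) & (BUF_SIZE - 1)] = chunk[i];
            head += got;
        }
    }

    /* Driver side: called at interrupt time when the hardware wants data. */
    void irq_drain(void (*to_hardware)(uint8_t byte), size_t max_bytes)
    {
        while (max_bytes-- && bytes_pending() > 0) {
            to_hardware(buffer[tail & (BUF_SIZE - 1)]);
            tail++;
        }
    }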
Split off the things that need low latency from the things that don’t. Handle them specifically. Use buffers to deal with differences between the backend and the application. Or you could be like !DigitalCD and have the back end module deal with reading and decoding the data all by itself. [ http://www.riscos-digitalcd.net/digitalcd/player/intro.htm ]

If you handle interrupts by switching to the appropriate task to deal with the interrupt, not only is there the latency of the task switch, but there is also the risk of hijacking the system with a “favoured” task. My machine’s average latency is about 115µs. The maximum peaks at around 3500µs. Loading SMPlayer peaked at 17000µs before dropping to 450–1500µs during (lowish resolution) video playback. In short – you can’t really make assumptions about how fast the application level latency will be. So it is perhaps better to deal with the low level stuff in a specific “driver” module. If nothing else, it stops application code from directly messing with hardware.
Funny. My 800MHz ARM (Pi) can wipe the floor with what the PC (Atom slightly overclocked to 1.8GHz) can manage. Okay, it probably isn’t fair to compare a quad 1.5GHz processor with a faking-two-cores 1.6GHz processor; however the fact that the Pi can easily best the PC suggests that, really, Linux ain’t so bad…
See my previous rant pointing out that Windows software is more or less compatible with Windows across all sorts of different processor types and PC builds. Linux, on the other hand, is spectacularly bad in this respect and software either needs to be supplied as source, or it needs to be prebuilt for half a dozen “popular” versions of Linux – such as here: http://www.skype.com/en/download-skype/skype-for-linux/ |
h0bby1 (2567) 480 posts |
aaaaa |
Rick Murray (539) 13840 posts |
I would differ on that opinion. Looking at the datasheet of my PVR (since modern GPUs tend to be under NDA), I can see it deals with Huffman blocks in hardware, colourspace conversion in hardware, realtime image rescaling in hardware – and this isn’t a proper GPU, it’s a DSP with some image-specific parts added.
Ah. And here is the reason for the confusion. We’re talking about pre-emptive multitasking and interrupt pre-emption at the same time. When I refer to pre-emption, I’m referring to multitasking because, as you observe, there is not much point in using the word “pre-emption” with interrupts as they cannot sensibly work any other way.
Which is possibly why hardware drivers should be special modules that can respond rapidly via the interrupt dispatch, as opposed to anything running as a multitasked user application.
The big question here is – what was the signal system devised for? I rather suspect it was in the days before GUIs were commonplace, when system architecture was different to nowadays. So maybe expecting signal to work with X11 is about as hopeful as expecting BBC VDU calls to work under the Wimp (they can, but maybe not quite as a Beeb programmer would expect).
Isn’t that just the normal POSIX way?
Again, it perhaps works well for console tasks, maybe not so good for GUI ones.
RISC OS method: Write a module to receive data from the device at low latency and buffer it. When data is available, the module sets a “pollword”. The UI application can process slowly (say Wimp_PollIdle asking to be called once per second) and when data is available, the Wimp will call the application with the pollword event. The Wimp actually passes you the value of the pollword so you could embed information in it, like the low byte records how many bytes of data are available, or something like that. NEVER should the multitasked program attempt to directly interrogate the hardware. Not only is it bad form, but most OSs cannot provide guarantees as to how often the program will be called.
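For the application side of that scheme, a hedged sketch in C: it assumes the task has already called Wimp_Initialise and that the driver module has somehow handed us the address of a pollword held in the RMA. The SWI numbers, mask bit and reason code are quoted from memory, so check them against the PRM before relying on them.

    /* Sketch of a Wimp_PollIdle loop driven by a module-owned pollword.
       Mask bit 22 and reason code 13 are from memory - verify against
       the Wimp documentation for your OS version. */
    #include "kernel.h"
    #include "swis.h"

    #ifndef Wimp_PollIdle
    #define Wimp_PollIdle        0x400E1
    #endif
    #ifndef OS_ReadMonotonicTime
    #define OS_ReadMonotonicTime 0x42
    #endif
    #define PollWord_NonZero     13          /* Wimp_Poll reason code */

    void poll_loop(volatile int *pollword)
    {
        int block[64];                       /* 256-byte poll block */

        for (;;) {
            int reason, now;
            _swix(OS_ReadMonotonicTime, _OUT(0), &now);

            /* Sleep until now + 100cs unless the pollword goes non-zero;
               bit 22 of the mask enables pollword scanning via R3. */
            _swix(Wimp_PollIdle, _INR(0, 3) | _OUT(0),
                  1u << 22, block, now + 100, pollword, &reason);

            if (reason == PollWord_NonZero) {
                *pollword = 0;               /* acknowledge                   */
                /* ...fetch the buffered data through the module's own API... */
            } else if (reason == 17 || reason == 18) {
                if (block[4] == 0)           /* Message_Quit                  */
                    break;
            }
            /* ...handle redraws and other events here... */
        }
    }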
;-) Things like this are complicated. That’s why most programmers write fart apps these days… [look on any App Store for the sheer amount of dross]
My observation was that a mobile ARM based device can wipe the floor with my PC. So either Linux isn’t so bad at media recording and playback, or the ARM is so far ahead of x86 in terms of “getting actual stuff done” that it is unreal. Take your pick.
No, you don’t need to decode 10,000 frames (indeed decoding that lot would take a while) but you do need large enough buffers to smooth things out and make sure everything “just works”. How and what depends a lot on the type of system and its capabilities.
You DO realise, I hope, that most pre-emptive multitaskers use a timer interrupt to call the scheduler to switch tasks. Okay, some might be more capable and respond to other things too; but time-sliced multitasking is… well, there’s a clue in the name. As for your example – games have been written for RISC OS; I’m quite partial to Cannon Fodder, and SF3000 is pretty fluid on modern ARMs. So either we’ve all reinvented threads over and over, or there are other ways that work. ;-) Okay, time to go out and pretend it’s an interesting weekend… |
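As a generic illustration of that timer-driven switching – nothing below is RISC OS code, and the context save/restore helpers are assumed, architecture-specific routines – the tick handler’s job is roughly this:

    /* Generic shape of a time-sliced pre-emptive scheduler (illustration
       only).  The timer ISR decides whether the current slice is used up
       and, if so, saves the current task and resumes the next one. */
    #include <stdint.h>

    #define MAX_TASKS   8
    #define SLICE_TICKS 2                     /* pre-empt after 2 timer ticks */

    typedef struct {
        uint32_t regs[16];                    /* saved register file (hypothetical) */
        int      runnable;
    } task_t;

    static task_t tasks[MAX_TASKS];           /* task 0: always-runnable idle task */
    static int    current;
    static int    ticks_left = SLICE_TICKS;

    extern void context_save(uint32_t *regs);     /* assumed, arch-specific */
    extern void context_restore(uint32_t *regs);  /* assumed, arch-specific */

    /* Called from the timer interrupt. */
    void timer_tick(void)
    {
        if (--ticks_left > 0)
            return;                           /* slice not used up yet */
        ticks_left = SLICE_TICKS;

        context_save(tasks[current].regs);

        /* Simple round robin: the next runnable task gets the CPU. */
        do {
            current = (current + 1) % MAX_TASKS;
        } while (!tasks[current].runnable);

        context_restore(tasks[current].regs); /* resumes the chosen task */
    }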
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
h0bby1 (2567) 480 posts |
aaaaa |
Chris Mahoney (1684) 2165 posts |
I see that there’s a new Raspberry Pi 2… and it’s quad core :) |
andym (447) 473 posts |
Just bought one from RS (available here now!) and am wondering if I’ve wasted £27 that I could have instead spent on RISC OS hardware? Anyone from ROOL shed any light on whether RISC OS will run? |
Rick Murray (539) 13840 posts |
If it doesn’t today, it probably will by the time you receive it. Won’t do quad core mind you :-) though I think the barrier to support for and playing with multiple cores has just dropped through the floor. More or less the same price as the B+? How is that even possible? |
Tony Noble (1579) 62 posts |
According to the Pi folks, it’s fully binary compatible, so it should work. Of course, should doesn’t always translate to does… |
Neil Young (2592) 2 posts |
“More or less the same price as the B+? How is that even possible?” Raspis use a very cheap off-the-shelf SoC; the new one is probably nearly identical in price. Yay for economies of scale! |
Wouter Rademaker (458) 197 posts |
The ARM core is just a small part of the SoC. It sounds like the (small?) redesign needed is being done by Raspberry Pi Foundation members. |
Anthony Vaughan Bartram (2454) 458 posts |
Is there a branch in SVN/CVS where any tasks or elements relating to re-factoring etc. for multi-core are being prototyped? I’m a newbie to RISC OS (so am trying to avoid saying something way off target or naive), however I’ve worked on re-factoring personally and professionally over the years and I have written various threaded systems… I’ve been reading David Feugey’s posts on AMP and it sounds like an interesting way forward. Having recently got a Pi 2 (which is gathering a little dust whilst I try and get a computer game released), I am tempted to try and contribute to this effort a little.

1) I see this as a re-factoring problem. In order to achieve large scale re-factoring in the past, I have applied the adaptor and facade design patterns successfully.

2) Would rapid application development help? For example, I could try writing something using LDREX and STREX (or perhaps other opcodes depending on which RISC OS CPU target was being built, i.e. SWP for legacy), which I think are intended for semaphore/lock construction. I was going to try this to dip my toe into ARM coding…

I have wondered whether cooperative multi-tasking and multiple cores are not mutually exclusive, i.e. whether the scheduling method, whilst ideally pre-emptive, would still work on multiple cores if they simply cooperated. Threading SWIs could be built using the same paradigm, i.e. child cooperatively scheduled lightweight tasks.

I wonder whether resources or operating system features that suffer contention in a multi-core/multi-process environment could be single threaded and service a job request FIFO queue that is polled cooperatively, rather than requiring to be isolated with a new interface put on them; i.e. instead of changing interfaces, create an adaptor which the existing code calls, thus preventing that code being changed or made aware that the operating system feature or resource is no longer local to the core it is running on.

Could an SWI’s parameters be queued (via a ring buffer) and re-directed via a different SWI handler on one core to a master handler on the main core? Perhaps responses could be returned and managed via a ring buffer to avoid serialisation between cores and to improve asynchronous execution of code. |
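Regarding the LDREX/STREX idea above: here is a minimal sketch of the kind of lock those instructions make possible, assuming an ARMv7 target (the Pi 2’s Cortex-A7) and GCC-style inline assembly. This only illustrates the primitive; it is not an existing RISC OS interface.

    /* LDREX/STREX spinlock sketch - assumes ARMv7 and GCC inline asm. */
    typedef volatile unsigned int spinlock_t;

    static inline void spin_lock(spinlock_t *lock)
    {
        unsigned int tmp, failed;
        __asm__ volatile(
            "1:  ldrex   %0, [%2]      \n"  /* read lock, open exclusive monitor */
            "    teq     %0, #0        \n"  /* already held?                     */
            "    bne     1b            \n"  /*   yes: spin                       */
            "    mov     %0, #1        \n"
            "    strex   %1, %0, [%2]  \n"  /* try to store 1                    */
            "    teq     %1, #0        \n"  /* lost exclusivity to another core? */
            "    bne     1b            \n"  /*   yes: retry                      */
            "    dmb                   \n"  /* barrier before the critical section */
            : "=&r" (tmp), "=&r" (failed)
            : "r" (lock)
            : "cc", "memory");
    }

    static inline void spin_unlock(spinlock_t *lock)
    {
        __asm__ volatile("dmb" ::: "memory");   /* drain writes made under the lock */
        *lock = 0;                              /* plain store releases the lock    */
    }

A cross-core ring buffer of queued SWI parameters could then either be protected with a lock like this, or built lock-free from the same LDREX/STREX primitives.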