Thinking ahead: Supporting multicore CPUs
Jeffrey Lee (213) 6048 posts |
Except the microkernel approach is much less of an ugly hack :) From a program’s perspective they’d be pretty much the same (in the beginning, at least), but from an OS developer’s perspective the Hydra approach doesn’t aim to solve any of the fundamental problems that the OS is facing. And (initially, at least) I don’t think the microkernel approach is guaranteed to be significantly more complex than the Hydra approach. After all, with the Hydra approach you’ll still need some kind of microkernel to manage the extra cores – the only difference between the two is that with the Hydra approach the microkernel will be a child of RISC OS, and with the microkernel approach RISC OS will be a child of the microkernel. |
nemo (145) 2552 posts |
I wrote:
And got two silly answers, so it was obviously a silly question. :-) What I meant was “do we really want multiple applications responding to Poll returns simultaneously, or do we want a parallel form of ‘background’ processing?”. Eric suggested
You can’t make Wimp applications pre-emptive in general. Wimp2 sort of proved that (in that it was a fun exercise but not production quality and never could be). I’ve mentioned the Wimp protocols before but it’s really important… exactly how is the Clipboard Protocol going to work if the application holding the clipboard data is busy when you press ^V? How is the Filer supposed to handle a double-click if applications are printing, or recalculating, or connecting to a server? No. In short, the UI – the Wimp and applications using it – must be (or often must act as) a single thread. Malcolm worried:
Yes, it’s not only the prevailing model, it’s also a sensible generalisation – threads are applicable to single-core machines too. They are the programmer’s model – the programmer doesn’t want to have to write different code for multi-core processors. Ported code will be using threads of some kind, probably pthreads. I’ve had a lot of experience not only of writing and debugging threaded code, but also of converting a huge, monolithic and single-threaded application to be multithreaded to take advantage of multiple cores. Multithreading isn’t hard unless you make it so. Rick wrote:
I’m not sure what you meant by “!Printers stuff” as I’ve never been aware of !Printers tying up the machine. Perhaps you meant printing (which doesn’t involve !Printers). It has long been possible to print cooperatively, but it’s fiddly and the granularity is much too coarse – it’s the usual scalability problem: The printer drivers do the same thing with a 24MB sprite that they’d do with a 24KB one, and that’s no help to the application. Incidentally that problem also affected Wimp2 – all the pre-emption in the world didn’t help when you double-clicked on a 24MB sprite and had to wait for one very long OS_File,255 to complete. As an experiment I wrote a sprite editor which loaded and saved large files using GBPB under null polls (updating its window like a browser does). That was much more responsive, but requires cheating to get the desktop save protocol to work (which otherwise wouldn’t). Hence my concern that people might believe that pre-empting applications is trivial… it is, but you’ll break the Wimp. Jeffrey said
Making the OS thread-safe can be trivial – mutex the SWI interface from USR mode. That’s mostly it. There are a few calls that would need safe versions (just as there were a few that needed 32bit versions, and there will be many FS calls that will need 64bit versions), but probably not many. Even that can be avoided if one takes the Hydra approach, but that is a lot less useful, and wouldn’t in any way mitigate the ‘large file loading’ and ‘slow printing’ problems. They are ideal cases for multithreading, but that cannot be done independently of the Wimp because of the protocols – if you double-click a file in the Filer while an application is printing, the only sensible result is that the Filer should show an hourglass until the printing app Polls again. The Wimp’s protocols are co-operative, and cannot be pre-empted without breaking (or changing) them. Like it or not, this affects anything beyond the lowly ‘number crunching coprocessors’ aspiration. |
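As a concrete illustration of nemo’s “mutex the SWI interface from USR mode” idea, here is a minimal sketch in C. The thread_mutex type and its lock/unlock calls are hypothetical primitives from whatever threading layer would have to be added; _kernel_swi() is the existing RISC OS C veneer from kernel.h.

```c
#include <kernel.h>

/* Hypothetical mutex primitive from the (not yet existing) threading layer */
typedef struct thread_mutex thread_mutex_t;
extern void thread_mutex_lock(thread_mutex_t *m);
extern void thread_mutex_unlock(thread_mutex_t *m);

static thread_mutex_t *swi_lock;   /* one global lock for the whole OS */

/* Serialise all USR-mode SWI calls: the non-reentrant OS only ever
   sees one caller at a time, which is most of what "thread-safe"
   needs to mean in the first instance. */
_kernel_oserror *safe_swi(int number, _kernel_swi_regs *in,
                          _kernel_swi_regs *out)
{
    _kernel_oserror *err;

    thread_mutex_lock(swi_lock);
    err = _kernel_swi(number, in, out);
    thread_mutex_unlock(swi_lock);
    return err;
}
```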
Colin (478) 2433 posts |
I don’t have any experience of programming PMT systems, or their terminology for that matter, but from the wimp programmer’s perspective it seems to me that we have 2 problems:

1) Device blocking

So, at the moment, for: To this end the programmer either chops up his program so that it can continue on null events, or devises a simple threading system where the program is paused on a yield instruction and continued on a null event.

Just a thought – and taking an interest. |
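For anyone unfamiliar with the first pattern Colin describes, a minimal sketch in C: the long job is chopped into chunks and one chunk is done per null event, so the poll loop never blocks for long. The job structure and process_one_chunk() are hypothetical application code. (The yield-based variant is essentially what fibres, discussed later in the thread, provide.)

```c
/* Hypothetical application-specific worker: does a small, bounded
   amount of the long job and returns quickly. */
extern void process_one_chunk(int chunk);

typedef struct {
    int active;        /* is a long job in progress?    */
    int next_chunk;    /* where we got to last time     */
    int total_chunks;  /* how much work there is in all */
} job_state;

/* Called from the Wimp_Poll loop on each null event (reason 0). */
void on_null_event(job_state *job)
{
    if (!job->active)
        return;                          /* idle: nothing to resume */

    process_one_chunk(job->next_chunk++);

    if (job->next_chunk >= job->total_chunks)
        job->active = 0;                 /* done: can mask null events again */
}
```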
Andrew Daniel (376) 76 posts |
Nemo, I’m curious as to your thoughts on utilising the extra ARM cores in modern CPUs. Also, how fast would a 26-bit ARM emulator written in Thumb-2 run on one of those Cortex-M3 cores? |
Rick Murray (539) 13850 posts |
In a word, no. Sorry. The thing is, taskwindows work with singletasking linear applications where 99% of the time it doesn’t matter if or where you pause them. Now consider an application. It is busy-waiting on data from a slow GPS dongle, or somesuch. While the user is waiting, they click Menu over your iconbar icon. Or look in your window for a “Cancel” icon. Or… |
Martin Bazley (331) 379 posts |
Paul Fellows recently related to ROUGOL the story of how Arthur was originally developed, and it was very much like a more literal version of the ‘co-processor’ technique mentioned above – as in, the ARM evaluation system was plugged into the Tube, the BBC powered it, and over time responsibility for more and more system functions was migrated to ARM code, with the 6502 becoming little more than a bootstrap. Before anyone objects to the microkernel approach on the grounds that ‘it isn’t RISC OS’ to have the present kernel be the ‘child’ of some other entity, remember that exactly that has basically already happened with the addition of the HAL. |
Colin (478) 2433 posts |
If you are waiting for a slow device, the main poll loop of your program continues as it does now and you deal with inputs like you do now; you just get time slices for the PMT parts of your program in place of null events. E.g. mouse click → if not already reading device, start reading device. You do:

    mouse_click → if not already reading device
        multitask_start
            read_device
        multitask_stop

The multitasking would only happen when the system is idle, i.e. at a null event, so it doesn’t affect any other task – or at least it is no different to what we have now, is it? |
nemo (145) 2552 posts |
Well, modulo over-enthusiasm about Making The Wimp PMT!!! I’m broadly in agreement with all the other sensible contributors here – there are a number of choices which, as they get more attractive for the user, get much harder for the OS developer.
The thing is, cleverly written co-operative applications will appear much more responsive than poorly written pre-empted ones (compare RISC OS with any Windows for proof)… but a poorly written co-operative application can be far less responsive than a pre-empted one (i.e. the vast majority of non-trivial RISC OS programs – until fairly recently the clock stopped when dragging a window with panes for heaven’s sake!).
Ah, that’s a much easier question. I have absolutely no idea. |
Jeffrey Lee (213) 6048 posts |
AIUI the Cortex-M cores are designed purely for low-performance tasks, e.g. keeping your phone’s OS and comms hardware ticking over while the phone is in standby. So in short: a hell of a lot slower than the main CPU would be.
Which is a perfectly sensible way of developing the OS/hardware, and not really any different to how any games console manufacturer, mobile phone manufacturer, etc. would develop the hardware & OS for their latest devices, except instead of using the Tube they’d be using JTAG (and probably a few other assorted interfaces). Unfortunately this anecdote doesn’t help us much with making RISC OS multi-core friendly :) (Except perhaps as a reminder that JTAG is invaluable for many low-level debugging tasks) |
nemo (145) 2552 posts |
Actually that’s not really a problem at all – the user can accept that that program is busy – the message will queue up and the menu appear in a mo. No, far worse than that is the Filer problem – you’re interacting with a Filer window, and it’s all perfectly responsive. You’re not paying attention to whether that other program has finished recalculating or printing or whatever. You open the Filer menu, it works. You select a file, it works. You double-click a file… and nothing happens, so you double-click again… still nothing happens. You give up and drag it to a program. Some time later the application that was printing opens two copies of the file. Other scenarios: You press ^C in one program, switch to another and press ^V and nothing happens. You save a file into another program and nothing seems to happen, so you drag it to a Filer window instead. Seconds later your program crashes. There are many, many more. Even ^F12 could be broken by pre-emption. If you want to multithread the desktop (the UI – all the applications, basically) then the Wimp needs to be fully involved so that it can synchronise all the (wimp) threads at Wimp_Poll to allow certain Message protocols to work. Otherwise the desktop is completely broken. Quite how much synchronisation is necessary I’m not sure, but there are other difficulties including redrawing (which Wimp2 skirted somewhat) – if you pre-empt a task while it is redrawing you then allow other tasks to move their windows – this can then invalidate things the redrawing task knows, such as exactly which pixels to touch. That can only be fixed (in a multithreaded redrawing sense) by having tasks redraw to their own surfaces and not directly to the screen. That of course has a performance and responsiveness impact which is precisely what one is trying to avoid! Hence my previous point: The Wimp must be, or often appear to be, a single thread. |
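For reference, the loop nemo is talking about, sketched in C (the BASIC version uses the same SWIs): between Wimp_RedrawWindow and the final Wimp_GetRectangle, the task is working from clip rectangles and an origin the Wimp handed it, which is exactly the state that goes stale if another task is allowed to move windows part-way through. draw_rectangle() stands in for the application’s own drawing code.

```c
#include <kernel.h>

#define Wimp_RedrawWindow  0x400C8
#define Wimp_GetRectangle  0x400CA

extern void draw_rectangle(int *block);   /* application-specific drawing */

void redraw_window(int *block)            /* block[0] = window handle */
{
    _kernel_swi_regs r;
    int more;

    r.r[1] = (int)block;
    _kernel_swi(Wimp_RedrawWindow, &r, &r);
    more = r.r[0];

    while (more) {             /* pre-empting the task anywhere in here   */
        draw_rectangle(block); /* invalidates the clip/origin it was given */
        r.r[1] = (int)block;
        _kernel_swi(Wimp_GetRectangle, &r, &r);
        more = r.r[0];
    }
}
```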
Eric Rucker (325) 232 posts |
Actually, having tasks redraw to their own surfaces only lets you do interesting things regarding performance. First, the amount of redrawing that tasks have to do is significantly reduced. Second, if those surfaces are OpenGL surfaces, you can use the GPU to accelerate all UI operations. (This is what most modern operating systems do. Yes, RISC OS has a fast UI, but it can be made even faster that way. Yes, most modern operating systems have slow UIs, that’s because they’re using the GPU to display more shiny crap, and when you’re using integrated graphics, the performance sucks.) But, I believe PMT OSes that do use redrawing to display things properly just… tell the program to redraw again. Also, there is the whole Mac OS 8.6 thing again, where a program would keep all UI stuff in the WIMP, but spin things off to the threading system (which would be able to preempt the WIMP itself). |
Rick Murray (539) 13850 posts |
[quotes from all over the place – if you see yours, wave and say “hi!”]
Gee, thanks. :-)
Printing doesn’t involve !Printers? While they (printing and !Printers) are different things, they are tied up together, a sort of symbiotic relationship. Either way, printing does hairy things (try reporting an error without calling the AbortJob SWI and watch it all fall apart) which I would imagine would need serious reworking in a PMT environment.
Doesn’t that depend upon the Wimp? Surely a message is passed around to see if a task can handle the request. At the moment, I would imagine the message would be queued. However, if the Wimp is upgraded to understand blocking software, it probably shouldn’t bother trying to give a message it expects a reply from to a program that isn’t actively polling. Thus, the filer will get a NAK and it will start a new instance of the application…
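The documented protocol Rick is describing looks like this from the Filer’s side: Message_DataOpen is broadcast on the double-click, and only if it comes back unclaimed (Wimp_Poll reason 19, User_Message_Acknowledge) is a fresh instance launched. A sketch in C, with launch_app_for() as a hypothetical helper:

```c
#define Message_DataOpen      5   /* action code at word 4 of the block */
#define User_Msg_Acknowledge 19   /* our broadcast came back unclaimed  */

extern void launch_app_for(int *msg);  /* hypothetical: Alias$@RunType... */

void handle_poll(int reason, int *block)
{
    switch (reason) {
    case User_Msg_Acknowledge:
        if (block[4] == Message_DataOpen)
            launch_app_for(block);  /* no running task claimed the file */
        break;
    /* ... other reason codes ... */
    }
}
```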
Whoa! I’m not talking about porting RISC OS to the Cortex-M; I’m just thinking that it might be a possibility to task off some of the boring repetitive stuff to them – like the mechanism for the centisecond ticker and/or syncing the system clock? Stuff that happens routinely in the background, doesn’t require much oomph, and could be left to get on with it.
Ah, but how much work did you do vs how much the operating system assisted? We aren’t talking about porting an application, we’re talking about an operating system. Slightly different ballpark.
<cough!> I couldn’t. I’d like to be able to, but I at least have an idea of my capabilities (except that one where I marry a cute Japanese girl and live happily ever after with love bubbles and sparkles and crap), and writing a new OS “better than anything around now” isn’t one of them!
I don’t think many would object to that. I think we’d object to the “look and feel” not being RISC OS. In short, the API would change dramatically. How much of what arrives at the end will retain the spirit of RISC OS? You want a detailed description of a multitasking thread-capable OS? I can give it to you with bells on top. It’s called Minix, not only are sources available, but there’s a very very detailed book describing every aspect of how the system works. [ there are CHM and PDF versions floating around if you are so inclined; though I prefer paper… ] Now, if I was a little more clever and a little less lazy, it probably wouldn’t be overly hard to get the basics of that running on the ARM and build a sort of RISC OS layer on top. The thing is, I fear that the more the RISC OS layer would be created, the more I’d need to either bugger up existing APIs, or try to devise new concepts…until I reach the point where I realise I’ve long forgotten RISC OS and have just written YetAnotherDamnUnixClone. In short, the question isn’t “how simple is it to make pthreads” (why “p” prefix? this some sort of reverse polish notation thing?); but rather “how can we make RISC OS support multithreaded activity without completely buggering up what RISC OS is and how it works?” That said, the topic title talks about multiple core CPUs; which is not necessarily the same thing as multiple threads. ;-) |
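(For the record, the “p” is simply POSIX.) A minimal pthreads example of the model under discussion – standard POSIX C, not anything RISC OS offers today: the long job runs on a worker thread while the caller carries on.

```c
#include <pthread.h>
#include <stdio.h>

static void *long_job(void *arg)
{
    /* ... recalculate / print / load the 24MB sprite ... */
    printf("background job done\n");
    return NULL;
}

int main(void)
{
    pthread_t worker;

    pthread_create(&worker, NULL, long_job, NULL);
    /* foreground carries on (e.g. keeps servicing the UI) here */
    pthread_join(worker, NULL);   /* wait for completion */
    return 0;
}
```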
Eric Rucker (325) 232 posts |
The thing is, you can’t rely on the Cortex-M cores being there on any SoC, so relying on them is IMO unwise except for implementing VERY platform-specific stuff. And, like I said, look at Mac OS 8.6. APIs stayed the same (even the Multiprocessing Services ones, except the thread-safe APIs were extended, I believe), but GUI stuff still had to run as a task within the Mac OS process, and the APIs didn’t change at all there. And, if anything, Mac OS was in worse shape than RISC OS – at least RISC OS 2.0 broke backwards compatibility with single-tasking GUI stuff when it came out, whereas Mac OS was trying to graft full multitasking onto a system where, while a cooperative model was in place, it was only designed to be used to allow very limited applications to multitask alongside full applications. |
Malcolm Hussain-Gambles (1596) 811 posts |
Rick: Sorry, I think my point was missed! (Maybe that was my fault, I’ve just got over a bad cold.) I wasn’t saying I could actually do it, heck I’m just grasping the basics of RISC OS again! More if I had infinite time, after the 1,000,000th (or more) re-write, and didn’t find something else that distracted me over the billion years I’d probably need. The point was to reflect on: how much man-time is required to complete each suggested idea, and a possible likelihood of said idea based on willingness and availability. |
Eric Rucker (325) 232 posts |
There’s a third thing to consider, though: 3) Time saved in implementing future improvements to RISC OS once it’s done |
Rick Murray (539) 13850 posts |
What, you mean the three programs that actually used the Arthur GUI? (^_^) That said, I think “single-tasking GUI” makes about as much sense as “paid volunteer”. Slight aside: Back in those days, ARMBE (a single-tasking program) was the hot thing. Although I note that it is in Library of the RPi installation I have – so either it’s there for a joke or somebody actually still uses ARMBE! |
Malcolm Hussain-Gambles (1596) 811 posts |
Eric: That’s probably the most important point as well! |
Rick Murray (539) 13850 posts |
Malcolm: You too with the cold huh? I’m recovering from my second in as many months. Pffft! Is it valid to keep comparing against MacOS without knowing the ins and outs? Windows did the same sort of thing in the transition from Windows 3.1x to NT/95+ (in addition to some 16/32bit thunking that makes the brain ache). I am led to believe that Mac’s way of doing it is to run all of the co-operative programs as a single pre-empted thread. An interesting idea, certainly, though I’d hope this thread would have more priority/time the more tasks it is running. Anyway – in answer to your questions, I think a major determining factor is:

Of course, you do realise, I hope, that in any case, we’d need to build the OS in mostly pure assembler? There’s a reason RISC OS is blindingly nippy… so take any estimate you had in mind and double it. Twice. And once again for good measure. |
Rick Murray (539) 13850 posts |
Of course, there’s always fibers (sic, or “fibre” for us Brits). |
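By way of illustration, since RISC OS has no such API: fibres are cooperatively scheduled contexts that switch only when explicitly told to, which is much closer to the Wimp’s existing model than pre-emptive threads. A minimal POSIX sketch using <ucontext.h>:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fibre_ctx;

static void fibre_body(void)
{
    printf("fibre: part 1\n");
    swapcontext(&fibre_ctx, &main_ctx);   /* yield back to main */
    printf("fibre: part 2\n");
}   /* falling off the end resumes uc_link (main_ctx) */

int main(void)
{
    static char stack[16384];

    getcontext(&fibre_ctx);
    fibre_ctx.uc_stack.ss_sp   = stack;
    fibre_ctx.uc_stack.ss_size = sizeof stack;
    fibre_ctx.uc_link          = &main_ctx;
    makecontext(&fibre_ctx, fibre_body, 0);

    swapcontext(&main_ctx, &fibre_ctx);   /* run until the fibre yields */
    printf("main: fibre yielded\n");
    swapcontext(&main_ctx, &fibre_ctx);   /* resume it to completion */
    return 0;
}
```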
Jeffrey Lee (213) 6048 posts |
There’s also a reason why it’s such a PITA to maintain. Plus the OS being written in “mostly pure assembler” isn’t true anymore. Taking a typical OMAP3 ROM and totalling up the sizes of all the modules, I can see that the breakdown is 1313KB assembler, 1608KB C, and 57KB BASIC. I could sit here and rant for ages about how writing most of the OS in assembler isn’t the right way to go, but I’d sincerely hope that you’re smart enough to realise that fact for yourself. |
Eric Rucker (325) 232 posts |
Rick: That would be correct, that cooperative programs are run as a single pre-emptable thread. However, it’s the only thread that has access to unsafe APIs, much like the WIMP in a theoretical “RISC OS 5.6”, if you will, would. And, my understanding is that any user program must start as a cooperative stub, then call Multiprocessing Services and spin off any threads it wants to run (and those threads can call back into the cooperative stub as needed to access APIs that aren’t thread-safe). Also, a scheduler could be designed around the knowledge that there are multiple individual user-facing tasks running in that cooperative thread. If you wanted, the scheduler could even replace the WIMP scheduler in theory. |
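A sketch of that shape in pthreads terms (the Multiprocessing Services details differ, and the one-slot queue is a deliberate simplification): worker threads never call the unsafe API themselves, they post a request that the single cooperative stub services whenever it gets control.

```c
#include <pthread.h>

extern void unsafe_api_call(int request);  /* only legal on the stub thread */

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static int pending = 0, request;

void worker_needs_api(int req)      /* called from any worker thread */
{
    pthread_mutex_lock(&q_lock);
    request = req;
    pending = 1;
    pthread_mutex_unlock(&q_lock);
}

void stub_poll(void)                /* called from the cooperative stub */
{
    pthread_mutex_lock(&q_lock);
    if (pending) {
        unsafe_api_call(request);   /* safe: we ARE the cooperative thread */
        pending = 0;
    }
    pthread_mutex_unlock(&q_lock);
}
```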
GavinWraith (26) 1563 posts |
Could some of you knowledgeable ladies and gents give a few pointers about where to find information about multicore systems to someone like myself who has had no experience of them? Excuse these, no doubt naive, questions:

1) how do the cores communicate?
2) are cores provided with private RAM that other cores can read from but to which they cannot write?
3) is one core the sole owner of IO?
4) can one core trigger an interrupt in another (maybe part of question 1)? |
Jeffrey Lee (213) 6048 posts |
I’m sure there are many good textbooks on the subject, and there must be a few decent resources on the ‘net, but I can’t think of anything offhand.
A mixture of shared memory, hardware FIFOs/mailboxes, and interrupts (often tied to the FIFOs/mailboxes). Depending on how shared memory is mapped, various types of safeguards may be needed to make sure it’s accessed in a safe manner by all the different cores. Of course not all multi core/multi processor systems are the same, but we’re lucky in that modern ones try to make sharing memory as safe and easy as possible.
Each core has an independent set of page table pointers, so the cores are free to do whatever they want with memory (all private, all shared, partially shared, etc.)
The interrupt controller allows individual IRQs to be routed to individual cores. It’s also possible for the same interrupt to be routed to multiple processors – e.g. the timer used for thread scheduling would be a good choice for this. For other IRQs the OS should probably take into account the processing cost of handling each IRQ and distribute them appropriately between cores.
Yes. |
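A sketch of the shared-memory side of Jeffrey’s answer, using C11 atomics (whose acquire/release orderings compile down to the ARM barrier instructions): one core publishes a message and sets a flag, the other sees the flag and consumes it. In practice the sender would normally also raise a software-generated interrupt on the receiving core rather than having it spin.

```c
#include <stdatomic.h>

typedef struct {
    int        payload;          /* the message itself      */
    atomic_int full;             /* 0 = empty, 1 = has data */
} mailbox;

void send(mailbox *mb, int value)            /* runs on core A */
{
    while (atomic_load_explicit(&mb->full, memory_order_acquire))
        ;                                    /* wait for previous message */
    mb->payload = value;
    /* release: payload is visible before the flag is */
    atomic_store_explicit(&mb->full, 1, memory_order_release);
}

int receive(mailbox *mb)                     /* runs on core B */
{
    int v;
    while (!atomic_load_explicit(&mb->full, memory_order_acquire))
        ;                                    /* wait for a message */
    v = mb->payload;
    atomic_store_explicit(&mb->full, 0, memory_order_release);
    return v;
}
```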
Rick Murray (539) 13850 posts |
Oh, I don’t deny that.
Funny, most of the core I’ve been poking around recently (amusement, more than anything else, I’m a sick sick person…) has been assembler. I know there’s a big wodge of C, not just the built-in apps but I’d guess the networking/sockets stuff as well.
(^_^) There must be, I guess, another reason why a mobile phone twice as powerful as a Pi takes eight times longer to reach a usable state… Or why my Linux-based PVR boots in three minutes (but the custom microkernel1 PVR on an identical SoC boots in 27 seconds). I’ll stop here as this could easily turn into a “why Linux sucks on small devices” rant of my own.

1 From the little I could figure out of the semi-scrambled firmware update file. I ought to hook the JTAG to the DM320 and just copy out the raw firmware for examination. |
Eric Rucker (325) 232 posts |
And, those devices aren’t booting slowly because of C, they’re booting slowly because their kernels are doing a LOT. (And, the problem with hand-optimized assembler is, next CPU generation, your optimization is now crap. Not as big of a problem on an architecture like ARM, where micro-ops are almost never used (and before ARMv7, never used), but still…) |