Thinking ahead: Supporting multicore CPUs
Jess Hampshire (158) 865 posts |
A version of Linux that looks and acts like RISC OS and runs Linux programs would be good enough, if it runs some RISC OS stuff too then that would be icing on the cake. |
Malcolm Hussain-Gambles (1596) 811 posts |
From my perspective, the main difference between Linux desktops and RISC OS (apart from the massive bloat, sluggishness and instability of applications [YES, EVOLUTION, I MEAN YOU!]) is the file management. Just my 1.2 pence |
Simon Willcocks (1499) 513 posts |
Thanks for the plug, Bryan! I haven’t had a chance to read through the thread yet, but I’ll try to answer a few of the questions about ROLF.
Almost certainly not, yet.
It will allow you to write programs that make use of the multiple cores and multi-threading right now; you just have to write them using Linux libraries and keep the GUI interfacing to a single thread. It should be that redraws from multiple processes can progress simultaneously; at the moment, the window positions are locked until the last rectangle is redrawn.
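As a minimal sketch of the "GUI on a single thread" pattern above – plain POSIX, not a ROLF-specific API – a worker thread hands its result to the single GUI thread over a pipe, so only that thread ever touches the display (handle_result() and the window calls are hypothetical):

    /* Sketch only: worker threads compute; the one GUI thread polls
       a pipe alongside its window events and performs all redraws. */
    #include <pthread.h>
    #include <unistd.h>

    static int result_pipe[2];          /* worker -> GUI thread */

    static void *worker(void *arg)
    {
        int answer = 42;                /* stand-in for real computation */
        (void) arg;
        write(result_pipe[1], &answer, sizeof answer);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int answer;

        pipe(result_pipe);
        pthread_create(&t, NULL, worker, NULL);

        /* GUI loop: poll the pipe alongside window events; all
           redraws happen here, on this one thread. */
        read(result_pipe[0], &answer, sizeof answer);
        /* handle_result(answer); redraw window, etc. */

        pthread_join(t, NULL);
        return 0;
    }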
Not since 2007, and that has disappeared off the internet. I won’t have much time until next week to look at it, but I’ll try to come up with something ASAP. Maybe with an RO5 ROM image for the modules. |
Chris Evans (457) 1614 posts |
Sorry if I’m being thick, but does that mean that if you follow the above you could write something that would run on a standard RISC OS computer (e.g. Iyonix/Risc PC) and that it would also work under ROLF? And does “multiple processes simultaneously” mean multiple processes on the same core, or across multiple cores? |
Simon Willcocks (1499) 513 posts |
No, you would be writing a ROLF program, not a RISC OS one. That said, if you wrote a multi-threaded RISC OS program that used a Unix standard threading model, recompiling it for ROLF would probably give you multi-core use. |
Timothy Baldwin (184) 242 posts |
On second thoughts, my suggestion was overkill. However, simply locking around SWI calls is not sufficient:
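For example (a hypothetical sketch only – lock_swi()/unlock_swi() are assumed wrappers, and the OS_SpriteOp details are merely illustrative):

    /* Why a mutex around each individual SWI call is not enough:
       the invariant often spans several SWIs. OS_SpriteOp 60
       redirects VDU output to a sprite; the redirect, the drawing
       and the restore must be atomic as a group. */
    #include "kernel.h"
    #include "swis.h"

    extern void lock_swi(void);     /* assumed per-SWI mutex */
    extern void unlock_swi(void);

    void render_to_sprite(void *area, void *sprite)
    {
        _kernel_swi_regs r;

        r.r[0] = 60 + 512;          /* switch output to sprite (pointer in R2) */
        r.r[1] = (int) area;
        r.r[2] = (int) sprite;
        r.r[3] = 0;                 /* save area omitted in this sketch */
        lock_swi();
        _kernel_swi(OS_SpriteOp, &r, &r);
        unlock_swi();

        /* Another thread can run here and redirect output elsewhere,
           so the "protected" drawing below lands in the wrong place. */

        lock_swi();
        /* OS_Plot / Draw module calls would go here */
        unlock_swi();
    }

The invariant – redirect, draw, restore – spans several SWI calls, so serialising each call individually still leaves the sequence open to interleaving.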
|
Andrew Hodgkinson (6) 465 posts |
FWIW, Mac OS X puts the onus on the programmer to ensure they only do GUI updates in the main thread. There is robust API support to help other threads very easily “ask” the main thread to perform actions for background activities which need to update the UI. This works well provided the programmer avoids heavy blocking operations on the main thread, as that’d prevent GUI updates from being processed. Instead, the UI thread only does lightweight stuff and worker threads should be used for any heavyweight processing. There is a huge collection of very low to very high level mechanisms for doing different kinds of thread processing with this model in mind. Coupled with a preemptive scheduling model between processes, the worst that might happen is that an application’s UI becomes unresponsive but the rest of the system does not. Note the two problems being solved here, which it is important to keep distinct:
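For illustration, a minimal sketch of that “ask the main thread” pattern using GCD’s C API (update_progress() is a hypothetical UI routine; in a real application the main queue is drained by the app’s main event loop):

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void update_progress(double result)  /* hypothetical UI call */
    {
        printf("done: %f\n", result);
    }

    void start_background_job(void)
    {
        dispatch_queue_t bg =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        dispatch_async(bg, ^{
            double result = 0.0;
            for (int i = 0; i < 1000000; i++)   /* heavyweight work */
                result += i * 0.5;

            /* Ask the main thread to apply the result; only that
               thread ever touches UI state. */
            dispatch_async(dispatch_get_main_queue(), ^{
                update_progress(result);
            });
        });
    }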
Personally I’d approach this problem from the requirements first, then the API, then the implementation, and iterate if necessary. First work out what you want application programmers and the system to be able to do. Then work out how you’d like to present those facilities as an API. Finally, consider the implementation; it may be that certain requirements and/or APIs are not technically feasible, so you go around the loop again adjusting things. (As far as APIs and implementations go, since Grand Central Dispatch is open source and integrated into BSD these days, it would be an obvious place to start for inspiration when it comes to the low level side of multithreaded programming.)

Jumping from zero to 100% of all possible features isn’t necessary. Consider a roadmap. What things can be written in a self-contained way, that gradually build and work together to present the end goal functionality? That way, you have a series of testable, releasable changes which provide useful functional extensions to the OS at each step.

Right now I think we’re in danger of being mired in the minutiae while missing the big picture. |
Rick Murray (539) 13840 posts |
I think something to consider, unfortunately, is “what does Linux do?”. While it would be great to think of the possibilities that could be available, this perhaps needs to be tempered with some consideration for porting software from elsewhere: is it to our benefit if porting multi-threaded code to our future sexy multiprocessor implementation is as difficult as it is now? That said, I like the simplicity and directness of the RISC OS API, and I hope that continues… |
Colin (478) 2433 posts |
Forget about Linux; its licence makes it a non-starter. |
Rob Kendrick (86) 50 posts |
Colin: Would you like to expand on that statement a bit, or admit you’re wrong? |
Colin (478) 2433 posts |
If you can get source code used on Linux with a licence compatible with Castle’s then yes, you can use Linux code in the RISC OS ROM, but generally it’s a non-starter. If you think I’m wrong then I’m wrong. |
nemo (145) 2546 posts |
I’m most familiar with the POSIX threads model, so of course I would suggest that supporting that API directly would be highly advantageous. However, pthreads doesn’t make multithreading any easier to do correctly, so it may be better to adopt the kind of parallelisation model that JavaScript and OpenCL use – and that’s quite different from the expectation that every thread/core sees a complete and somehow autonomous RISC OS. Add to that off-screen rendering (for legacy apps, probably by intercepting where the screen seems to be) and you may be some way towards delivering the grunt of multiple processors while maintaining the existing GUI and Wimp idioms.
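As a rough sketch of that distinction – illustrative only, in plain C rather than a proposed API – the worker shares nothing with the main program and is driven purely by messages posted to its inbox:

    /* Illustrative only: a Web-Worker-like arrangement where the
       worker sees messages, never the caller's environment. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct { int kind; int payload; } message_t;

    static message_t inbox;             /* one-slot inbox for brevity */
    static int inbox_full;
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    void post_message(message_t msg)    /* main -> worker */
    {
        pthread_mutex_lock(&m);
        inbox = msg;
        inbox_full = 1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }

    static void *worker(void *unused)
    {
        (void) unused;
        for (;;) {
            pthread_mutex_lock(&m);
            while (!inbox_full) pthread_cond_wait(&cv, &m);
            message_t msg = inbox;
            inbox_full = 0;
            pthread_mutex_unlock(&m);

            /* The worker acts only on the message contents. */
            printf("worker received %d/%d\n", msg.kind, msg.payload);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        message_t hello = { 1, 99 };

        pthread_create(&t, NULL, worker, NULL);
        post_message(hello);
        sleep(1);                       /* let the worker run */
        return 0;
    } |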
Simon Willcocks (1499) 513 posts |
If you ran RISC OS code on top of the Linux kernel, it would just be running as processes (and, possibly, drivers) in Linux, and it’s fine to run closed-source code on Linux. The changes to RISC OS code to get it to run on Linux might possibly be against Castle’s licence, but I don’t see how the problem could be with the GPL. |
Rick Murray (539) 13840 posts |
??? I think you completely grabbed the wrong end of the stick. I was emphatically not suggesting that Linux code be brought into RISC OS – I respect what the GPL has done, but I really hate the GPL itself (it is compatible only with itself, and v3 is a joke – I wrote a long blog post on how the supposed freedoms are an illusion, so refer to that if you’re interested).

What I was referring to was considering “how Linux does it”, so that if we make a multithreaded/multicore arrangement, it might make it easier for people to port software written for Linux. Not port bits of Linux, not run RISC OS under Linux, just simply porting stuff.

Perhaps nemo’s suggestion is one to follow up? I don’t know enough about JavaScript (which I’ve mostly tried to avoid thanks to a history of slightly not-quite-the-same implementations) and I know zip about OpenCL. One could ask if it would be possible to support such a thing from within the Wimp too? I mean, the operating system totally lies to you already (no, we aren’t all running at &8000!), so while mucking around with where a task is in the memory map, it could also be assigned a core? |
Andrew Hodgkinson (6) 465 posts |
While having a Linux compatibility layer (pthreads etc.) on top of whatever solution gets implemented will be great for porting, it’s not a good idea in today’s world as your primary multithreading/multiprocessor model. It’s very low level and, really, just not very good; extremely hard to use correctly, extremely easy to make mistakes.

Beyond threads

The problem here is being mired in the idea of threads. That’s like programming via front panel flip switches and punch cards when it’s 2013 and we’re meant to be talking about high level languages. You’re a programmer; you have work to do; you want to think in terms of those tasks, those units of work, not in terms of the underlying hardware cores you’re going to try and use to implement them. Let the operating system choose whether to put your work units on one thread or many threads; on one CPU or many CPUs; on homogeneous or heterogeneous CPUs; or even distributed across multiple machines. The programmer shouldn’t need to care, at least for the general use case.

Grand Central Dispatch achieves this even at a low level, with a ‘C’ API that makes it crazy easy in the simple examples to turn e.g. a “for” loop that does some heavy lifting into a “for” loop whose heavy lifting the operating system assigns across available processing resources transparently. It’s not your job as a programmer to faff about worrying over spawning threads (and how many to spawn), what the thread API is, how you manage synchronisation and getting results back from them, what happens if a thread should get stuck/crash, what happens if your application is closed down during processing, and so on. It’s necessary to have this stuff available, but really it must be possible to get parallel processing done without it. IMHO it would be insanity to design, bottom-up, a system that intentionally enforced all that kind of baggage on the programmer, rather than merely making the baggage available for those who wanted/needed it while providing a far better, more robust, more managed approach for the lion’s share of programming.

Simple example

Concurrency isn’t something you should have to put much effort into. If you can iterate over an array linearly:

    for (size_t i = 0; i < array_length; i++)
    {
        do_complicated_things(array[i]);
    }

…then it should be easy to make that operation run in parallel – remember, you’re looking at one of the lowest level APIs here:

    queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_apply(array_length, queue, ^(size_t i) {
        do_complicated_things(array[i]);
    });

…the “^(size_t i) { … }” syntax introduces a “block”, the unit of work which dispatch_apply hands out across available processing resources. Now of course this is a really simplistic example and for “real work” things start getting more complicated pretty quickly. But surely you want the easiest possible API to start with, so that your brain can deal with the thorny, unavoidable, difficult problems rather than getting overloaded just dealing with naff APIs and pointless housekeeping.

Essential reading

The following is a must-read document for anyone considering designing or implementing concurrency related APIs in the 21st century ;-) – “About concurrency and application design”, from Apple’s Concurrency Programming Guide. You can find more useful examples in the guide’s material on concurrency APIs and pitfalls, but it has Objective C syntax all over it which some people might find confusing.

Understanding Objective C method calls in 10 seconds

“[object doSomething: parameter]” is, loosely speaking, the equivalent of “object->doSomething(parameter)”.

[ Edit: Before anyone Who Knows complains :) the above is entirely illustrative in terms of understanding the syntax’s intent at a high level. In fact Objective C messaging is just that – messages, not function calls.
In terms of what actually happens when the message is sent, it’s very like RISC OS SWI calls. You send a message (call a SWI) which the run-time dispatcher (the RISC OS SWI dispatcher) routes to the target object (the intended module, via its SWI chunk). The lack of “null” checking in some of the example code is misleading; the run-time includes a very sensible rule that any message sent to “null” immediately returns “null”. It means you can write a whole chain of code that might perform complex allocations, any one of which may fail, but only have to check for “null” right at the end, as any intermediate “null” results will just fall through cleanly. ] |
nemo (145) 2546 posts |
Indeed, unless of course the prospective author is most familiar with that model. I’m not suggesting that we implement pthreads, ThreadX or GCD compatibility layers on top of the actual implementation in the vain hope of getting somebody, somewhere, to write something… but I think pthreads is a no-brainer for a number of reasons (as one API, but not the only one).
Let us assume that we are discussing the subject from the point of view of implementation. Anyone who does not understand “threads” and an API and model like pthreads, or has no experience of the practicalities of multicore work, is unlikely to be able to contribute much to the implementation discussion. So “mired” is a little strong. The subject is necessarily complicated, which is why I think it should be wrapped up in a very simple model for the programmer.

In reality, spawning hundreds of threads is absolutely not the way to achieve high performance, especially on OSes with high context switch granularity such as Windows. Hence any implementation is necessarily going to be a form of thread pool system whose dimensions reflect the number of cores available and their context switching capability, amongst other things. Which is what GCD uses, happily. So having an abstraction of “work to do in parallel” (and, of course, serial) is very valuable, which is what the models I cited provide.

One of the reasons I mention Web Workers is that not only do they also work by message passing (or “events”), they also have the important distinction between the main program (which enjoys the whole familiar environment) and the Worker (or “block” in GCD parlance) which does not. I think that distinction is crucial from a pragmatic viewpoint for RISC OS.

Andrew’s example of automagically splitting a loop into multiple loops on multiple processors doesn’t generalise to the higher level actions (the words “processes” and “tasks” are both misleading here) that make up large applications, and in particular I think we may be more concerned about whether we can still move windows about while a spreadsheet recalculates than only about how quickly that spreadsheet recalculates.

I don’t think the C# syntax is helpful to a RISC OS audience, and it obscures the actual requirements and restrictions that would apply to our case. We might be better starting with Event or Vector Handlers and then expanding those more familiar concepts to the multithreaded world.

I make no apology for using “thread” here rather than “core” – the programmer should not need to care how many cores there are, physical or virtual, nor what their hyperthreading capabilities are – the API should work on an ARM2.
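To make the thread pool idea concrete, a rough sketch in plain POSIX – illustrative names only, not a proposed RISC OS API – of a pool dimensioned from the core count, consuming units of “work to do in parallel” from a queue:

    /* Sketch only: a pool of workers sized to the core count, each
       pulling units of work from a shared queue. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    typedef struct work {
        void (*fn)(void *);
        void *arg;
        struct work *next;
    } work_t;

    static work_t *head, *tail;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;

    static void *worker(void *unused)
    {
        (void) unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == NULL) pthread_cond_wait(&more, &lock);
            work_t *w = head;
            head = w->next;
            if (head == NULL) tail = NULL;
            pthread_mutex_unlock(&lock);
            w->fn(w->arg);              /* run one unit of work */
            free(w);
        }
        return NULL;
    }

    void pool_start(void)
    {
        /* Dimension the pool from the hardware, not the application. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        if (n < 1) n = 1;
        for (long i = 0; i < n; i++) {
            pthread_t t;
            pthread_create(&t, NULL, worker, NULL);
            pthread_detach(t);
        }
    }

    void pool_submit(void (*fn)(void *), void *arg)
    {
        work_t *w = malloc(sizeof *w);
        if (w == NULL) return;          /* sketch: no error handling */
        w->fn = fn; w->arg = arg; w->next = NULL;
        pthread_mutex_lock(&lock);
        if (tail) tail->next = w; else head = w;
        tail = w;
        pthread_cond_signal(&more);
        pthread_mutex_unlock(&lock);
    }

A real system would layer the “work to do in parallel” (and serial) abstraction on top of something like this, rather than exposing the threads themselves. |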
Rick Murray (539) 13840 posts |
I’m going to (mostly) duck out of the discussion as, while I can visualise ways this can (should?) work, I don’t have experience in programming such things. However, the above quote is crucially important: a working API should isolate the programmer from hardware specifics. The programmer should say they want to set up a new thread, and the OS deals with how this maps to cores and such. |
Andrew Hodgkinson (6) 465 posts |
IMHO the programmer should say they want “this unit of work run concurrently”, not “a new thread”.

nemo: C#? Spit |
nemo (145) 2546 posts |
There’s an extremely large body of work that disagrees with you there.
As you were. Can’t read. I blame a large quantity of opiates (long story). What I meant was that illustrating a language-specific way of making use of the underlying functionality does not necessarily make the underlying functionality clearer, and may in fact obscure it. For example:

Is the Worker/Block in the task slot? Does that imply you expect all workers/threads to have completed/joined before the owner calls Wimp_Poll? If so, how is that policed?

Are you suggesting a functionality that does NOT allow a Task’s Workers to continue while it is paged out (single core)?

When Workers are active on a multicore, is the whole application space paged in on multiple cores, or only pages containing Workers?

I understand the GCD model, but I’ve no idea how it is implemented on a BSD/Linux-based OS, nor therefore whether that translates meaningfully to RISC OS with its paged memory model. :-/ |
Rick Murray (539) 13840 posts |
Forgive my obvious stupidity here, but…
|
Andrew Hodgkinson (6) 465 posts |
To answer that, please read the “About concurrency and application design” document linked to in my post of September 2nd. |
Andrew Hodgkinson (6) 465 posts |
(…further reading, the “concurrency APIs and pitfalls” link in the same post goes into more details about why threads are usually a bad model for getting work done). |
Rick Murray (539) 13840 posts |
I suspect we might be getting hung up on the technical description of what one means by “thread”. |
Andrew Hodgkinson (6) 465 posts |
There is surely nothing upon which to get hung :-) – the definition is very clear. Again, please read the earlier references. They explain things much more clearly than I can.
Blocks might not. You don’t know nor need to know. You simply pass a unit of work to the OS and let it decide how to best schedule it given the available hardware resources and the software load from other processes at any particular instant.
That’s a separate issue, and if low latency is required then real-time thread priorities have to be built in (and, indeed, are). Unlike the current WIMP, where you can just block everything by not calling Wimp_Poll, in a multithreaded system with pre-emptive multitasking it is, by design, impossible to request 100% of the CPU and block everything else. That in turn means two processes which both require such a lock running concurrently will both fail to get the real-time performance they require. That’s an insoluble problem at the root; the user simply tried to do more concurrently than their hardware and software were capable of. |
(Again, please see the above links. All I can do is repeat or rephrase what they say, which kind of wastes everyone’s time.)

There’s no non-trivial way for a non-kernel piece of software to know what the “bringing the machine to its knees” threshold is. It is a combination of hardware resource availability, hardware and software context switch efficiency, latencies and efficiencies in communicating data between computing resources (e.g. consider the barrier to sending and receiving data to a graphics card when trying to use the GPU via OpenCL for some number crunching), the overhead of different thread types (as the OS will usually offer more than one), and the instantaneous, varying-by-the-microsecond load presented by software and external device interrupts. The only thing that stands a chance of getting close to figuring that out correctly and continuously reevaluating the situation – and it’s an extremely hard problem in and of itself – is the kernel.

That’s why applications shouldn’t be trying to use threads, since that presents them with the problem of figuring out how many they should have and what the priorities should be. Threads are a crude, low level resource. You should no more try to work in the domain of threads than you would try to low-level allocate specific pages of RAM – you let the kernel deal with the pages, instead just telling it what amount of memory you require and giving hints about what it is to be used for (at least in modern unified memory architecture systems where the file system, VM subsystem and the allocator are one and the same). That’s not the best ever analogy, but the point is you don’t say “I want 4 threads” because you (think you) have 4 CPU cores and run code manually on them. That’s crazy crude (you’re not the only thing running on the system, so what relevance have 4 CPU cores to 4 threads?). You say “I have this chunk of code that I want running in a parallel fashion”, with hints perhaps about your preferred or maximum level of concurrency and an indication of the kind of job priority you require (usually “normal”, occasionally “real time”, but only very rarely “background”, because of priority inversion risk).

Worked example: foobar2000. FB2K is a nice audio player for Windows. It includes a good quality built-in format converter. You can select a bunch of files in the playlist (potentially thousands of them – the most I’ve ever done in one go so far was over 19,000 files) and it’ll convert them from one format to another (in my most recent case, FLAC to low bitrate HE-AAC). It’s very good at preserving metadata while it does so. Now, this is clearly a CPU intensive task; CODECs are heavy. So the author took this approach: count the number of CPU cores; spawn one thread per core; process the queue of files on this thread pool; each time a thread finishes processing a file, give it another one.
Contrast with this approach: describe the whole conversion as a queue of independent units of work, one per file; hand that queue to the OS; and let the OS decide, moment to moment, how many units to run concurrently.
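In GCD terms, a minimal sketch of that might be (convert_file() standing in for the real CODEC work):

    #include <dispatch/dispatch.h>
    #include <stddef.h>

    extern void convert_file(const char *path);   /* the heavy CODEC work */

    void convert_all(const char **paths, size_t count)
    {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_group_t done = dispatch_group_create();

        for (size_t i = 0; i < count; i++) {
            const char *p = paths[i];
            /* One independent unit of work per file; the kernel
               decides how many run at once. */
            dispatch_group_async(done, q, ^{ convert_file(p); });
        }

        /* Wait for every conversion to finish. */
        dispatch_group_wait(done, DISPATCH_TIME_FOREVER);
        dispatch_release(done);
    }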
If the OS is dumb then the worst that happens is it does as badly as FB2K does when dealing with raw threads; but at least FB2K’s code would be far simpler to write and much easier to read. And in the best case, the OS has a far better idea of what can be done to process the concurrent queue and is far more likely to do so in a manner appropriate for the operating environment at any given time. Meanwhile the user is unlikely to sit and stare at a progress bar so they get on with other things – browsing the web, YouTube videos, probably playing music in the background, whatever. And since all of that involves units of processing managed by the kernel, we’ve a good chance that use of all the machine’s resources will be appropriately balanced from moment to moment. |
Trevor Johnson (329) 1645 posts |
Isn’t it possible to design in[1] retention of a legacy operation mode of cooperative multitasking (either current single core, and/or an interim implementation utilising multi-cores in other ways, should one be developed)? Engaging such an operation mode would only be done with explicit user confirmation (Style Guide should recommend standard ways of asking/setting config options, etc.). And the option to revert to pre-emptive operation[2] should also be offered (again, per Style Guide recommendations). Then anything which really needs/benefits from 100% CPU time[3], e.g. realtime applications, overnight file format conversion, or whatever, would be able to demand it.

[1] Even if only theoretically, i.e. potentially too much of an added complication to realistically expect any bounty to be sufficient to cover such work.

[2] Dependent on the nature of the single-tasking operation, the wait for a convenient point to switch over might possibly be such that the task is complete before it continues under reverted pre-emptive mode – in any case, perhaps an obvious (but not too processor intensive) visual cue could be employed, e.g. special hourglass only displayed when waiting to switch between modes.

[3] Or as close as practicably possible. |