Thinking ahead: Supporting multicore CPUs
Rick Murray (539) 13840 posts |
If you think about it, this makes sense in the context of a microkernel approach for the nth core, where the setup is a lot like the Beeb Tube: the software on the non-OS core gets the run of the processor, but when it needs interaction it calls a SWI, which causes the “slave” core to interrupt the “master” core – in essence, the program on the other core requesting something from RISC OS. Without the ability of the cores to interrupt each other, how would important events be signalled across the system? |
Jeffrey Lee (213) 6048 posts |
Bad code is bad code, no matter what language it’s written in. Take, for example, OS_ChangeDynamicArea or OS_SpriteOp (I’d like to link to the bug tracker for this, but it’s down; basically, IIRC, appending/merging sprite areas is currently implemented as O(N²), with memory copies at each step, resulting in a delay of several minutes to merge a few hundred sprites on an ARM7500). |
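As a rough illustration of why repeated appending goes quadratic (a hand-written sketch over an assumed data structure, not the actual OS_SpriteOp code): the first routine re-copies everything accumulated so far on every merge, whereas sizing the destination once keeps the work linear. Error checking is omitted for brevity.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { size_t size; const unsigned char *data; } sprite;  /* hypothetical record */

    /* O(N^2): grow-and-copy the destination for every sprite appended. */
    unsigned char *merge_slow(const sprite *s, int n, size_t *out_len)
    {
        unsigned char *dst = NULL;
        size_t used = 0;
        for (int i = 0; i < n; i++) {
            unsigned char *bigger = malloc(used + s[i].size);
            if (used) memcpy(bigger, dst, used);        /* re-copies everything so far */
            memcpy(bigger + used, s[i].data, s[i].size);
            free(dst);
            dst = bigger;
            used += s[i].size;
        }
        *out_len = used;
        return dst;
    }

    /* O(N): measure once, allocate once, copy each sprite exactly once. */
    unsigned char *merge_fast(const sprite *s, int n, size_t *out_len)
    {
        size_t total = 0;
        for (int i = 0; i < n; i++) total += s[i].size;
        unsigned char *dst = malloc(total), *p = dst;
        for (int i = 0; i < n; i++) {
            memcpy(p, s[i].data, s[i].size);
            p += s[i].size;
        }
        *out_len = total;
        return dst;
    }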
André Timmermans (100) 655 posts |
There are two things to consider, from my own experience with audio/video players:

- Porting code/libraries from other platforms: they are usually not designed for cooperative mode, as you need to split the work into small chunks to let other tasks run. Even then you always have the problem that file I/O blocks everything on RISC OS.
- In audio, I perform I/O and decoding from callbacks so that it is not blocked by “un-cooperative” tasks. I use separate callbacks for I/O and decoding so that it works fine when streaming radio, but network file I/O can be a problem: if it relies on callbacks, the FS call may never return, since its callbacks will only be started when mine has finished.
- In video, I’d like to have at least non-blocking I/O, or the ability to perform I/O in a separate thread, so that while it is stuck waiting for the I/O to complete I could continue decoding in the meantime.

In other words, I am not much concerned by CPU-intensive redraw, but more with PMTing/multithreading CPU-intensive non-visible work, and with avoiding the CPU being idle while waiting for I/O. So my idea for the moment is to leave the Wimp and existing applications as they are, but provide developers with an API to start tasks or threads, not requiring any display, that run in PMT mode.

This means:
step 1:
step 2:
step 3, if we are still around by then:
step 4, somewhere before armageddon: |
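A minimal sketch of that kind of non-display worker, done today with POSIX pthreads (as provided by UnixLib, for example); decode_chunk() is a hypothetical decoder, and the point is simply that the blocking read happens off the cooperative path:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical worker: the blocking file I/O and the decode happen
       here, so the polling/Wimp side of the program is never stalled. */
    static void *decode_worker(void *arg)
    {
        const char *path = arg;
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            unsigned char buf[4096];
            size_t n;
            while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
                /* decode_chunk(buf, n);   hypothetical decoder */
            }
            fclose(f);
        }
        return NULL;
    }

    /* The caller returns to its poll loop at once; a join (or a flag set
       by the worker) tells it when the decoded data is ready. */
    int start_background_decode(const char *path, pthread_t *out)
    {
        return pthread_create(out, NULL, decode_worker, (void *)path);
    }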
nemo (145) 2546 posts |
Rick said
Or thread pooling, as I suggested. Note that these are optimisations of some usages of, not replacements for, conventional threads. André wrote
Indeed – Callbacks are How One Does It on a single-threaded RISC OS, but for multiple cores we’d want a thread abstraction… and the point about competing Callbacks applies to threads too. Now, RISC OS is a long way from being an RTOS, but didn’t Acorn do something about Quality Of Service? I forget the details. Rick asked:
The p stands for POSIX, basically. One way of avoiding the buggeration is to prohibit threads from doing much more than number crunching. That’s obviously not very attractive. OTOH, PMT GUI OSes don’t allow arbitrary threads to interact with the UI; they rely on message passing between threads to, in effect, synchronise all UI actions in a single thread. This isn’t only for historical reasons. Eric suggested:
Actually it’s not that simple. That’s why I wrote “cleverly written co-operative applications will appear much more responsive than poorly written pre-empted ones” and “having tasks redraw to their own surfaces … has a performance and responsiveness impact which is precisely what one is trying to avoid”. The key word here is “appear”.

The problem with drawing to an off-screen buffer and then updating that to the screen is not that it is slower… but that it appears slower. For applications that do complex redrawing, such as a vector graphics application, it is not only the time taken to complete the redraw that is important, but also the time spent appearing to redraw the window. OSes such as Windows demonstrate the problem – drag something across a window and it gets redrawn white, and then some time later the contents appear. With co-operative direct-to-screen redraw the user sees the window updated immediately. With an off-screen buffer there’s a delay. Now, monitoring the ChangeBox of the buffer and periodically blitting it to the screen during the pre-empted redraw mitigates that… but is not often done.

Incidentally, it must be pointed out that the Wimp has always done a very bad job of coalescing dirty rectangles, though that was more of a performance problem in the low-colour days. I wrote a little routine to re-coalesce the rectangles to maximise pixel runs. It made quite a difference for 256 colour modes as it massively reduced the number of partial word accesses. I digress. |
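For what it’s worth, a toy version of that kind of re-coalescing might look like this (an illustrative sketch, not nemo’s routine): merge any dirty rectangles that cover the same vertical band, so each redraw touches fewer, wider horizontal runs of pixels.

    typedef struct { int x0, y0, x1, y1; } rect;   /* exclusive max edges */

    /* Merge rectangles that span the same rows and touch or overlap
       horizontally, maximising contiguous pixel runs per scanline.
       Returns the new rectangle count. */
    int coalesce(rect *r, int n)
    {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (r[i].y0 == r[j].y0 && r[i].y1 == r[j].y1 &&
                    r[i].x0 <= r[j].x1 && r[j].x0 <= r[i].x1) {
                    if (r[j].x0 < r[i].x0) r[i].x0 = r[j].x0;
                    if (r[j].x1 > r[i].x1) r[i].x1 = r[j].x1;
                    r[j] = r[--n];      /* drop j... */
                    j = i;              /* ...and re-scan against the grown rect */
                }
            }
        }
        return n;
    }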
Rick Murray (539) 13840 posts |
Looking, thinking (what? Me? Think?), it seems that something highish on the wishlist should be a new OS_File API that offers non-blocking file access. As has been said, the system shouldn’t be tied up for long durations loading big files… P for POSIX. I guess that makes sense, but do we want to be stuck with a situation where the accepted way to start a new process is to fork() the entire process into two copies? It’s a bit ridiculous to do that if said process is going to then do something completely different…
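Purely as a straw man for what non-blocking file access might look like to the caller – every name here (osfile_read_async, osfile_poll) is invented for illustration, not an existing SWI or function:

    /* Hypothetical asynchronous read: returns a handle immediately; the
       transfer completes in the background and the caller polls for it. */
    typedef int async_handle;

    async_handle osfile_read_async(const char *name, void *buffer, int max_len);
    int          osfile_poll(async_handle h, int *bytes_done);   /* 1 = finished */

    void load_without_blocking(const char *name, void *buf, int len)
    {
        async_handle h = osfile_read_async(name, buf, len);
        int got = 0;
        while (!osfile_poll(h, &got)) {
            /* keep decoding, keep calling Wimp_Poll, etc. */
        }
    }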
I think once we reach a certain point, the Wimp ought to just say “redraw it all”, and two small disparate redraws should probably be merged. Now that we’ve left the 8MHz ARM2 long behind, perhaps the strategy could be re-examined, for a one-off large redraw might be quicker than lots of partials? |
Eric Rucker (325) 232 posts |
nemo: And, in compositing window managers, that problem is completely avoided. Yes, when using XPDM or older (which includes Vista/7 with Aero turned off or in “Aero Basic”), dragging a window over another will cause the window to go white until it gets a chance to redraw. However, in a compositing window manager the program never even has to redraw – the GPU maintains the contents of all windows and does all the redrawing itself as needed.

Granted, a compositing WIMP is really, REALLY far down my wishlist for RISC OS, and relying on it would be crazy until we know that RISC OS can get access to the GPUs on ARM platforms reliably. Right now, I think the Raspberry Pi is the only SoC where RISC OS can get at the GPU, due to closed source drivers for everyone else (yes, I know, the secret sauce is still closed for the RPi too, but unlike other SoCs, where the secret sauce runs on the ARM, that doesn’t matter as far as using the GPU from within RISC OS is concerned). And then there’s IyonixMesa on the Nvidia desktop GPUs in the Iyonix. |
Steve Revill (20) 1361 posts |
Just to add something into the mix, I’m assuming everyone commenting here is aware of the RTSupport module and the DThreads library, both of which are in the ROOL CVS repository? I’m not saying they solve any/all of the problems that have been discussed (but I am pretty sure they do address a few of them) – all I’m saying here is if you’re contributing to a discussion on PMT, threads, et al, you should at least be aware of the current state of play (and UnixLib pthreads has already been mentioned). |
Jeffrey Lee (213) 6048 posts |
RTSupport: Yes. DThreads: No. I can see that they’re both designed for different things (RTSupport for code which doesn’t want to use non-reentrant SWIs and needs something better than IRQs and callbacks, and DThreads for code which does need reentrant SWIs). And I can also see that they’re a bit ugly (sorry!).

RTSupport is too dependent on the use of pollwords as mutexes, which means each thread is always in the “potentially runnable” pool, impacting performance as the number of extant threads increases. Plus there’s no “system idle thread”, so blocking on an event with IRQs disabled (while waiting for an IRQ process to occur and trigger said event) can result in failure if there aren’t any other (IRQ-enabled) threads in the runnable state – RTSupport will immediately return to your IRQ-disabled thread, which will then (presumably) check to see if the event has been flagged and then call back into RT_Yield. DThreads suffers from the obvious problem that it’ll only work while in the Wimp.

A fully-featured threading system is something we could really do with, since without it we look to be heading down a path of constantly reinventing the wheel. And since plenty of code is already making use of RTSupport, there’s proof that we don’t need to make the entire OS thread safe/aware in order for threading to be possible. |
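To spell out the failure mode being described, here is a schematic sketch only – rt_yield() stands in for the real RT_Yield call, and the pollword handling is simplified:

    volatile int event_flag = 0;   /* set by an IRQ handler when the event fires */

    extern void rt_yield(void);    /* stand-in wrapper around RT_Yield */

    /* With IRQs disabled and no other runnable (IRQ-enabled) thread to
       switch to, the scheduler hands control straight back to us; the
       interrupt that would set event_flag never runs, so this spins
       forever.  A system idle thread (or re-enabling IRQs before
       blocking) is what breaks the deadlock. */
    void wait_for_event_with_irqs_off(void)
    {
        while (!event_flag)
            rt_yield();
    }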
Steve Fryatt (216) 2105 posts |
Isn’t the problem there more to do with the Data Transfer Protocol getting upset if you try to multitask midway through the process? Keeping files open is already possible, although risky due to the possibility of something else doing a CLOSE#0 by mistake. |
Steve Revill (20) 1361 posts |
He, he. I didn’t say these are the answer – they are both little more than sticking plaster solutions to particular programming problems relating to threading. But at least now you know of something buried in CVS that you didn’t previously know about. |
nemo (145) 2546 posts |
Eric claimed:
Rubbish. Any editor will be updating its window throughout every interaction by the user. If that interaction requires many redraw actions (such as a vector graphics program) then there can be an appreciable delay between the start and the end of that redrawing. In CMT the redrawing starts immediately and proceeds visibly. In PMT the redrawing proceeds invisibly until it is complete, and then appears. It is this that can be mitigated by partial paints. |
nemo (145) 2546 posts |
Steve pointed out:
This is just one example of co-operative protocols that were not envisioned to span extended periods of time (or context switches, to be clear). However, it is possible to multitask during the DTP, as long as you don’t expect too much of either application during the process. If both are written appropriately though, it can be completely robust.

The wheeze is this: the DTP messages are sent Recorded, so they bounce if not replied to. An application calling Wimp_Poll (either directly, or indirectly via e.g. Wimp2) during the protocol will appear to have not replied – the sender of the message will then get a bounce and assume the transfer has failed. However, the recipient of the recorded message can instead Acknowledge it. That stops it from bouncing, but leaves the sender in an intermediate state. The recipient can later (having loaded or saved the file ‘slowly’) send the delayed reply (with the right reference) and the original sender will continue with the protocol (if the user hasn’t interfered with it in the meantime and it isn’t written very strangely indeed).

The awkward bit is that if, during the delay in the protocol, the recipient decides the protocol must fail, then the sender needs to get the bounce message. There isn’t a way to send a bounce message – that’s an Acknowledge! However, one can employ a PostFilter to mutate a special message into a bounce and hence complete the protocol ‘legally’.

If the wheeze of pausing the DTP like this is allowed for by both authors then nothing can go wrong. If not (most likely) then it will usually be absolutely fine as long as you don’t try to initiate another transfer involving the sender or, obviously, quit it.

Having said that, despite the DTP being fundamental to the RISC OS desktop experience, it’s astounding how few authors have managed to implement it correctly even as it stands. Expecting people to allow for the paused variant is asking a lot, frankly. :-(

At the risk of repeating myself, it’s DataRun which is the killer. |
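Very roughly, the shape of that wheeze in code (a sketch only: wimp_send_message() is a hypothetical wrapper around SWI Wimp_SendMessage, using the standard message header layout and reason codes 17/18/19; this is nowhere near a complete DTP implementation):

    /* Wimp message header: size, sender task, my_ref, your_ref, action,
       then up to 236 bytes of message data. */
    typedef struct { int size, sender, my_ref, your_ref, action, data[59]; } msg;

    /* Hypothetical wrapper around SWI Wimp_SendMessage.
       Reason codes: 17 = Message, 18 = MessageRecorded, 19 = MessageAcknowledge. */
    extern void wimp_send_message(int reason, msg *m, int dest_task);

    /* On receiving a recorded DataSave we can’t service yet: acknowledge it
       so it doesn’t bounce, and remember its my_ref for the delayed reply. */
    int park_datasave(msg *m)
    {
        int their_ref = m->my_ref;
        m->your_ref = their_ref;
        wimp_send_message(19, m, m->sender);   /* Acknowledge – no bounce */
        return their_ref;
    }

    /* Later, once the file has actually been dealt with, send the delayed
       DataSaveAck quoting the saved reference, and the sender carries on. */
    void resume_datasave(msg *reply, int their_ref, int sender)
    {
        reply->your_ref = their_ref;
        reply->action   = 2;                   /* Message_DataSaveAck */
        wimp_send_message(17, reply, sender);
    }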
nemo (145) 2546 posts |
Steve said:
Aha, I knew there was some kind of QoS thing somewhere.
Horrible. Never mention it again. ;-) |
Eric Rucker (325) 232 posts |
Derp. I should’ve said: in a compositing WM using the GPU, the program and CPU wouldn’t have to repaint when a window is dragged over the program’s window, because the compositing WM has the GPU keep the program’s current framebuffer in memory and redraws it immediately. (Obviously there will still be repaints by the program, but I was replying to how, on XPDM, moving a window over another can cause a situation where the need for a repaint is visible to the end user. On WDDM (Vista/7), that’s no longer the case.) |
Neil Fazakerley (464) 124 posts |
Could I add a still, small voice from the sidelines, in this high-level debate about the future direction of RISC OS? One of the great things RO still has going for it is that it’s a superb OS for robotics. This is because it is one of the few windowing systems left that comes close to having a ‘real time’ mode. BASIC V, single-tasking under RISC OS, is one of the quietest environments available right now for directly monitoring and controlling high-speed sensors and I/O. Any OS that works on an enforced, time-sliced basis (i.e. PMT) is useless for high-speed, real-time robotics or computer control. RISC OS, on the other hand, has retained its ability to ‘stop the clock’ when necessary and devote itself completely to a single task when required to do so. Please, please, please, whatever other innovations or changes may be incorporated in future RISC OS versions, please ensure that this almost unique ability to drop out of the desktop and /truly/ single-task is retained in any future iterations. |
Eric Rucker (325) 232 posts |
Keep in mind that there are plenty of real-time operating systems that are pre-emptive – “real time” essentially means making sure that certain events occur within a certain amount of time, and that’s the job of the scheduler (and interrupt handlers). |
Jeffrey Lee (213) 6048 posts |
That’s a perfectly valid request. I’m not sure how easy it would be to fulfil though. Stopping desktop tasks from interfering would be easy enough (we’re a long way away from a proper PMT Wimp, so just running a single-tasking app will be enough). But stopping system threads would be a bit trickier, since they’d generally be there for a good reason. E.g. there’s a bounty for updating the USB stack, and in order to update to the latest BSD sources we might find that we’re forced to use threads instead of the current callback-heavy system. Similarly with networking, the current BSD internet stack is likely to be a very different beast to our current stack (from 1994!). I’d hope that the code is written well enough that the background threads will be idle if nothing’s going on, but obviously if you’re reliant on either of the stacks for a robotics project you might find that your code has to deal with a bit more background noise than usual. Of course if we got as far as adding multi-core support it should be pretty trivial to give programs the power to take full control over one core by forcing all the other threads onto the other core(s). |
Eric Rucker (325) 232 posts |
That actually seems like it’s going towards the Propeller approach – rather than bother with making an effective PMT RTOS, throw more cores at the problem and use one core per task. (Except the Propeller is an MCU, not as big a system as we’re talking about here.) Gotta say, that approach is probably the easiest to program. There would have to be special support, though, for a “designate this core for a single-tasking application” mode, right? (Certainly easier than making an RTOS to run underneath RISC OS, though – although maybe an existing RTOS could be used?) |
Rick Murray (539) 13840 posts |
I like the use of the fancy term “compositing Window Manager” to describe something that redraws to an image instead of directly to the window. You know, VisualBasic offers this behaviour if the window is set to “autoredraw”. Perhaps even RISC OS could support it one day? |
Jeffrey Lee (213) 6048 posts |
Nothing more complex than a way to override the processor affinity mask for each thread. As long as no code is written in a way that will cause it to fall over if it can’t get a core it’s specifically requested, there shouldn’t be any issues with allowing user apps to override affinity masks. |
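For illustration only – thread_set_affinity() and its mask layout are invented for this sketch, not an existing RISC OS call – an affinity mask is just one bit per core, so “give this program a core to itself” is a couple of mask updates:

    #include <stdint.h>

    typedef int thread_id;

    /* Hypothetical call: bit n of 'mask' set => the thread may run on core n. */
    extern int thread_set_affinity(thread_id t, uint32_t mask);

    /* Pin one thread to core 3 and confine everything else to cores 0-2,
       giving the real-time task a core to itself. */
    int isolate_on_core3(thread_id rt_thread, const thread_id *others, int n)
    {
        if (thread_set_affinity(rt_thread, 1u << 3) != 0)
            return -1;
        for (int i = 0; i < n; i++)
            thread_set_affinity(others[i], (1u << 3) - 1);   /* cores 0..2 */
        return 0;
    }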
nemo (145) 2546 posts |
Rick puckishly penned:
Isn’t it amazing the things they think of?! The problem with trying to impose off-screen buffering on existing apps is that some do direct screen access (I know mine do!) and are most likely to have read the screen base address on a mode change message. Their model is of a screen-sized (not window-sized) buffer, so imposing one will entail considerable memory wastage or copying… neither of which is desirable when the application is likely to be using direct screen access for speed.* So such a thing can’t really be imposed. It can be selected, of course (but then, it always could).

*Although one might be tempted to suggest sending applications a dummy mode-change message immediately before the redraw, one must consider that apps can do other expensive things during such a message, including reading and analysing the palette and caching sprites. So that’s not a good idea either. |
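For context, this is roughly what such an application does when the mode change message arrives – a minimal sketch assuming the usual RISC OS C environment (kernel.h/swis.h and _swix), with VDU variable 148 being ScreenStart:

    #include "kernel.h"
    #include "swis.h"

    static void *screen_base;   /* re-cached on every Message_ModeChange */

    void cache_screen_base(void)
    {
        int vars[2] = { 148, -1 };   /* VDU variable list, terminated by -1 */
        int result[1];

        /* R0 = list of variable numbers, R1 = buffer for the results. */
        if (_swix(OS_ReadVduVariables, _INR(0,1), vars, result) == 0)
            screen_base = (void *)result[0];
    }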
Eric Rucker (325) 232 posts |
Alright, I’m gonna bump this… Jeffrey: How much work do you think the microkernel approach would take to get a baseline level going (with almost all existing code staying as-is, without the benefits of the microkernel, but the support being there for new code)? I know it’d be a lot, but do you think it’d be feasible for you to implement? And what (if anything) do you see as more important than that for RISC OS right now? I can only speak for myself, but I’d like to see this, and while I don’t want to distract any developers from more important things, if there are developers that want to focus on this… ;) (Myself, I see some things that are important – a port to a decent Cortex-A15 platform and wifi support being up there – but the microkernel approach may just be more important, given how many multicore platforms are coming out that would be held way back by the lack of multiprocessing support, and how many drivers and such could benefit from it. End-user programs could see some benefit more quickly, too, especially if UnixLib is extended to spin pthreads off onto other cores. I wish I had the knowledge and skills to contribute, but unfortunately I don’t.) |
Jeffrey Lee (213) 6048 posts |
It’ll be a fair amount of work, but nothing insurmountable, I’d hope. I think one of the big issues is finding a suitable microkernel to use – preferably (?) something open source (but not GPL), with mature support for all the ARM architectures we’re interested in. Which also raises the question of which architectures we’d be interested in! If we’re going to build some kind of compatibility layer into the OS then we only need to worry about a microkernel that supports recent architectures (I’d say ARMv7+). If we’re not building compatibility into the OS then we’d have to find a microkernel that works as far back as ARMv3 – or start dropping support for the old architectures. Support for old architectures is something we could potentially add to the microkernel ourselves.
I’d say it’s feasible for me to do it, yes. I’ve done enough kernel hacking by now to know what I’m up to in there.
Finding some more OS developers? :) There are far too many other things for us to do before we start spending serious time on frivolities like adding a microkernel which nothing will be able to use yet.
|
Eric Rucker (325) 232 posts |
What about the compromise of dropping ARMv3, but keeping ARMv4? That keeps the RiscPCs going, but drops the ones with ARM6/7 cards, and drops A7000s – computers that, to be honest, I’d be surprised if they’re running anything more than 4.02 (and I wouldn’t be surprised if the majority of ARMv3 machines still in use are running 3.6 or 3.7). That would also support a theoretical A9home port, as it’s ARMv4T (although I don’t see there being much point to it, as the A9home is the lowest-performing post-Acorn RISC OS machine, and the Raspberry Pi beats it in every way already). ARMv5, however, needs to be supported IMO – there are plenty of Iyonixes out there still in use. As far as microkernels go, I do think that if an existing kernel is used, it should be one that’s 64-bit clean, and not especially tied to the behaviour of existing ARM CPUs, due to the changes in AArch64. |
Rick Murray (539) 13840 posts |
I think before we do this, the burning question is “what exactly is the difference?”. For example, Ubuntu won’t work on the Pi due to it being “old”. But what is the difference at a technical level? I believe (off the top of my head) that the “documented” MMU system changed either between ARMv3 and ARMv4, or between ARMv4 and ARMv5. Then there’s VFP/VFPLite/NEON. But the problem I see here is that, the way ARM makes stuff, it is all a big bag of mix’n’match. Does the Beagle’s OMAP MMU work like it says in my decade-old ARM ARM tome? I tried to wade through all that L4 interconnect guff and got lost. Though, given how ARM chips work, it would be quite feasible for TI (or whoever) to toss away ARM’s MMU and bolt in their own. Or tweak it to fit. Or…

On the face of it, I would tend to agree that it’s no big deal to drop ARMv3. My RiscPC runs an ARM710, and once upon a time I’d have been miffed at the idea, but now you can buy an ARM board for like thirty quid that’ll blow the RiscPC (any incarnation) so far out of the water it’d rest in orbit for years… and install a version of RISC OS downloaded for free from right here… Seriously, if you’re annoyed at ARM6/ARM7 being dropped, you can buy an entire new board for less than the cost of a StrongARM upgrade. I consider my RiscPC to be end-of-line now. Why upgrade it when there’s so much happening with new hardware?

However – for the purposes of a microkernel, what technically are the differences between the two? If it is a completely different MMU and such, it would make sense to ditch it rather than clutter up a new microkernel with support for fifteen-year-old tech; but if the difference amounts to minor things and “helluva lot slower”, then supporting it is surely “no big deal”. Over to Jeffrey… ;-) |