Thinking ahead: Supporting multicore CPUs
Rick Murray (539) 13840 posts |
If you think about it, this makes sense in the context of a microkernel approach for the nth core, where the setup is a lot like the Beeb Tube: the software on the non-OS core gets the run of the processor, but when it needs interaction it calls a SWI, which causes the “slave” core to interrupt the “master” core – in essence, the program on the other core requesting something from RISC OS. Without the ability of the cores to interrupt each other, how would important events be signalled across the system? |
Jeffrey Lee (213) 6048 posts |
Bad code is bad code, no matter what language it’s written in. Take, for example, OS_ChangeDynamicArea or OS_SpriteOp (I’d like to link to the bug tracker for this, but it’s down; basically, IIRC, appending/merging sprite areas is currently implemented as O(N²), with memory copies at each step, resulting in a delay of several minutes to merge a few hundred sprites on an ARM7500). |
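As a rough illustration of why repeated appending goes quadratic (a hand-written sketch over an assumed data structure, not the actual OS_SpriteOp code): the first routine re-copies everything accumulated so far on every merge, whereas sizing the destination once keeps the work linear. Error checking is omitted for brevity.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { size_t size; const unsigned char *data; } sprite;  /* hypothetical record */

    /* O(N^2): grow-and-copy the destination for every sprite appended. */
    unsigned char *merge_slow(const sprite *s, int n, size_t *out_len)
    {
        unsigned char *dst = NULL;
        size_t used = 0;
        for (int i = 0; i < n; i++) {
            unsigned char *bigger = malloc(used + s[i].size);
            if (used) memcpy(bigger, dst, used);        /* re-copies everything so far */
            memcpy(bigger + used, s[i].data, s[i].size);
            free(dst);
            dst = bigger;
            used += s[i].size;
        }
        *out_len = used;
        return dst;
    }

    /* O(N): measure once, allocate once, copy each sprite exactly once. */
    unsigned char *merge_fast(const sprite *s, int n, size_t *out_len)
    {
        size_t total = 0;
        for (int i = 0; i < n; i++) total += s[i].size;
        unsigned char *dst = malloc(total), *p = dst;
        for (int i = 0; i < n; i++) {
            memcpy(p, s[i].data, s[i].size);
            p += s[i].size;
        }
        *out_len = total;
        return dst;
    }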
André Timmermans (100) 655 posts |
There are two things to consider, from my own experience with audio/video players:

- Porting code/libraries from other platforms: they are usually not designed for cooperative mode, as you need to split the work into small chunks to let other tasks run. Even then you always have the problem that file I/O blocks everything on RISC OS.
- In audio, I perform I/O and decoding from callbacks so that it is not blocked by “un-cooperative” tasks. I use separate callbacks for I/O and decoding so that it works fine when streaming radio, but network file I/O can be a problem: if it relies on callbacks, the FS call may never return, since its callbacks will only be started when mine has finished.
- In video, I’d like to have at least non-blocking I/O, or the ability to perform I/O in a separate thread, so that while it is stuck waiting for the I/O to complete I could continue decoding in the meantime.

In other words, I am not much concerned by CPU-intensive redraw, but more with PMTing/multithreading CPU-intensive non-visible work, and with avoiding the CPU being idle while waiting for I/O. So my idea for the moment is to leave the Wimp and existing applications as they are, but provide developers with an API to start tasks or threads, not requiring any display, that run in PMT mode.

This means:
step 1:
step 2:
step 3, if we are still around by then:
step 4, somewhere before armageddon: |
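A minimal sketch of that kind of non-display worker, done today with POSIX pthreads (as provided by UnixLib, for example); decode_chunk() is a hypothetical decoder, and the point is simply that the blocking read happens off the cooperative path:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical worker: the blocking file I/O and the decode happen
       here, so the polling/Wimp side of the program is never stalled. */
    static void *decode_worker(void *arg)
    {
        const char *path = arg;
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            unsigned char buf[4096];
            size_t n;
            while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
                /* decode_chunk(buf, n);   hypothetical decoder */
            }
            fclose(f);
        }
        return NULL;
    }

    /* The caller returns to its poll loop at once; a join (or a flag set
       by the worker) tells it when the decoded data is ready. */
    int start_background_decode(const char *path, pthread_t *out)
    {
        return pthread_create(out, NULL, decode_worker, (void *)path);
    }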
nemo (145) 2546 posts |
Rick said
Or thread pooling, as I suggested. Note that these are optimisations of some usages of, not replacements for, conventional threads. André wrote
Indeed – Callbacks are How One Does It on a single-threaded RISC OS, but for multiple cores we’d want a thread abstraction… and the point about competing Callbacks applies to threads too. Now, RISC OS is a long way from being an RTOS, but didn’t Acorn do something about Quality Of Service? I forget the details. Rick asked:
The p stands for POSIX, basically. One way of avoiding the buggeration is to prohibit threads from doing much more than number crunching. That’s obviously not very attractive. OTOH, PMT GUI OSes don’t allow arbitrary threads to interact with the UI; they rely on message passing between threads to, in effect, synchronise all UI actions in a single thread. This isn’t only for historical reasons. Eric suggested:
Actually it’s not that simple. That’s why I wrote “cleverly written co-operative applications will appear much more responsive than poorly written pre-empted ones” and “having tasks redraw to their own surfaces … has a performance and responsiveness impact which is precisely what one is trying to avoid”. The key word here is “appear”.

The problem with drawing to an off-screen buffer and then updating that to the screen is not that it is slower… but that it appears slower. For applications that do complex redrawing, such as a vector graphics application, it is not only the time taken to complete the redraw that is important, but also the time spent appearing to redraw the window. OSes such as Windows demonstrate the problem – drag something across a window and it gets redrawn white, and then some time later the contents appear. With co-operative direct-to-screen redraw the user sees the window updated immediately. With an off-screen buffer there’s a delay. Now, monitoring the ChangeBox of the buffer and periodically blitting it to the screen during the pre-empted redraw mitigates that… but is not often done.

Incidentally, it must be pointed out that the Wimp has always done a very bad job of coalescing dirty rectangles, though that was more of a performance problem in the low-colour days. I wrote a little routine to re-coalesce the rectangles to maximise pixel runs. It made quite a difference for 256 colour modes as it massively reduced the number of partial word accesses. I digress. |
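For what it’s worth, a toy version of that kind of re-coalescing might look like this (an illustrative sketch, not nemo’s routine): merge any dirty rectangles that cover the same vertical band, so each redraw touches fewer, wider horizontal runs of pixels.

    typedef struct { int x0, y0, x1, y1; } rect;   /* exclusive max edges */

    /* Merge rectangles that span the same rows and touch or overlap
       horizontally, maximising contiguous pixel runs per scanline.
       Returns the new rectangle count. */
    int coalesce(rect *r, int n)
    {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (r[i].y0 == r[j].y0 && r[i].y1 == r[j].y1 &&
                    r[i].x0 <= r[j].x1 && r[j].x0 <= r[i].x1) {
                    if (r[j].x0 < r[i].x0) r[i].x0 = r[j].x0;
                    if (r[j].x1 > r[i].x1) r[i].x1 = r[j].x1;
                    r[j] = r[--n];      /* drop j... */
                    j = i;              /* ...and re-scan against the grown rect */
                }
            }
        }
        return n;
    }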
Rick Murray (539) 13840 posts |
Looking, thinking (what? Me? Think?), it seems that something highish on the wishlist should be a new OS_File API that offers non-blocking file access. As has been said, the system shouldn’t be tied up for long durations loading big files… P for POSIX. I guess that makes sense, but do we want to be stuck with a situation where the accepted way to start a new process is to fork() the entire process into two copies? It’s a bit ridiculous to do that if said process is going to then do something completely different…
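Purely as a straw man for what non-blocking file access might look like to the caller – every name here (osfile_read_async, osfile_poll) is invented for illustration, not an existing SWI or function:

    /* Hypothetical asynchronous read: returns a handle immediately; the
       transfer completes in the background and the caller polls for it. */
    typedef int async_handle;

    async_handle osfile_read_async(const char *name, void *buffer, int max_len);
    int          osfile_poll(async_handle h, int *bytes_done);   /* 1 = finished */

    void load_without_blocking(const char *name, void *buf, int len)
    {
        async_handle h = osfile_read_async(name, buf, len);
        int got = 0;
        while (!osfile_poll(h, &got)) {
            /* keep decoding, keep calling Wimp_Poll, etc. */
        }
    }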
I think once we reach a certain point, the Wimp ought to just say “redraw it all”, and two small disparate redraws should probably be merged. Now that we’ve left the 8MHz ARM2 long behind, perhaps the strategy could be re-examined, for a one-off large redraw might be quicker than lots of partials? |
Eric Rucker (325) 232 posts |
nemo: And, in compositing window managers, that problem is completely avoided. Yes, when using XPDM or older (which includes Vista/7 with Aero turned off or in “Aero Basic”), dragging a window over another will cause the window to go white until it gets a chance to redraw. However, in a compositing window manager the program never even has to redraw – the GPU maintains the contents of all windows and does all the redrawing itself as needed.

Granted, a compositing WIMP is really, REALLY far down my wishlist for RISC OS, and relying on it would be crazy until we know that RISC OS can get access to the GPUs on ARM platforms reliably. Right now, I think the Raspberry Pi is the only SoC where RISC OS can get at the GPU, due to closed source drivers for everyone else (yes, I know, the secret sauce is still closed for the RPi too, but unlike other SoCs, where the secret sauce runs on the ARM, that doesn’t matter as far as using the GPU from within RISC OS is concerned). And then there’s IyonixMesa on the Nvidia desktop GPUs in the Iyonix. |
Steve Revill (20) 1361 posts |
Just to add something into the mix, I’m assuming everyone commenting here is aware of the RTSupport module and the DThreads library, both of which are in the ROOL CVS repository? I’m not saying they solve any/all of the problems that have been discussed (but I am pretty sure they do address a few of them) – all I’m saying here is if you’re contributing to a discussion on PMT, threads, et al, you should at least be aware of the current state of play (and UnixLib pthreads has already been mentioned). |
Jeffrey Lee (213) 6048 posts |
RTSupport: Yes. DThreads: No. I can see that they’re both designed for different things (RTSupport for code which doesn’t want to use non-reentrant SWIs and needs something better than IRQs and callbacks, and DThreads for code which does need reentrant SWIs). And I can also see that they’re a bit ugly (sorry!).

RTSupport is too dependent on the use of pollwords as mutexes, which means each thread is always in the “potentially runnable” pool, impacting performance as the number of extant threads increases. Plus there’s no “system idle thread”, so blocking on an event with IRQs disabled (while waiting for an IRQ process to occur and trigger said event) can result in failure if there aren’t any other (IRQ-enabled) threads in the runnable state – RTSupport will immediately return to your IRQ-disabled thread, which will then (presumably) check to see if the event has been flagged and then call back into RT_Yield. DThreads suffers from the obvious problem that it’ll only work while in the Wimp.

A fully-featured threading system is something we could really do with, since without it we look to be heading down a path of constantly reinventing the wheel. And since plenty of code is already making use of RTSupport, there’s proof that we don’t need to make the entire OS thread safe/aware in order for threading to be possible. |
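To spell out the failure mode being described, here is a schematic sketch only – rt_yield() stands in for the real RT_Yield call, and the pollword handling is simplified:

    volatile int event_flag = 0;   /* set by an IRQ handler when the event fires */

    extern void rt_yield(void);    /* stand-in wrapper around RT_Yield */

    /* With IRQs disabled and no other runnable (IRQ-enabled) thread to
       switch to, the scheduler hands control straight back to us; the
       interrupt that would set event_flag never runs, so this spins
       forever.  A system idle thread (or re-enabling IRQs before
       blocking) is what breaks the deadlock. */
    void wait_for_event_with_irqs_off(void)
    {
        while (!event_flag)
            rt_yield();
    }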
Steve Fryatt (216) 2105 posts |
Isn’t the problem there more to do with the Data Transfer Protocol getting upset if you try to multitask midway through the process? Keeping files open is already possible, although risky due to the possibility of something else doing a CLOSE#0 by mistake. |
Steve Revill (20) 1361 posts |
He, he. I didn’t say these are the answer – they are both little more than sticking plaster solutions to particular programming problems relating to threading. But at least now you know of something buried in CVS that you didn’t previously know about. |
nemo (145) 2546 posts |
Eric claimed:
Rubbish. Any editor will be updating its window throughout every interaction by the user. If that interaction requires many redraw actions (such as a vector graphics program) then there can be an appreciable delay between the start and the end of that redrawing. In CMT the redrawing starts immediately and proceeds visibly. In PMT the redrawing proceeds invisibly until it is complete, and then appears. It is this that can be mitigated by partial paints. |
nemo (145) 2546 posts |
Steve pointed out:
This is just one example of co-operative protocols that were not envisioned to span extended periods of time (or context switches, to be clear). However, it is possible to multitask during the DTP, as long as you don’t expect too much of either application during the process. If both are written appropriately though, it can be completely robust.

The wheeze is this: the DTP messages are sent Recorded, so they bounce if not replied to. An application calling Wimp_Poll (either directly, or indirectly via e.g. Wimp2) during the protocol will appear to have not replied – the sender of the message will then get a bounce and assume the transfer has failed. However, the recipient of the recorded message can instead Acknowledge it. That stops it from bouncing, but leaves the sender in an intermediate state. The recipient can later (having loaded or saved the file ‘slowly’) send the delayed reply (with the right reference) and the original sender will continue with the protocol (if the user hasn’t interfered with it in the meantime and it isn’t written very strangely indeed).

The awkward bit is that if, during the delay in the protocol, the recipient decides the protocol must fail, then the sender needs to get the bounce message. There isn’t a way to send a bounce message – that’s an Acknowledge! However, one can employ a PostFilter to mutate a special message into a bounce and hence complete the protocol ‘legally’.

If the wheeze of pausing the DTP like this is allowed for by both authors then nothing can go wrong. If not (most likely) then it will usually be absolutely fine as long as you don’t try to initiate another transfer involving the sender or, obviously, quit it.

Having said that, despite the DTP being fundamental to the RISC OS desktop experience, it’s astounding how few authors have managed to implement it correctly even as it stands. Expecting people to allow for the paused variant is asking a lot, frankly. :-(

At the risk of repeating myself, it’s DataRun which is the killer. |
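Very roughly, the shape of that wheeze in code (a sketch only: wimp_send_message() is a hypothetical wrapper around SWI Wimp_SendMessage, using the standard message header layout and reason codes 17/18/19; this is nowhere near a complete DTP implementation):

    /* Wimp message header: size, sender task, my_ref, your_ref, action,
       then up to 236 bytes of message data. */
    typedef struct { int size, sender, my_ref, your_ref, action, data[59]; } msg;

    /* Hypothetical wrapper around SWI Wimp_SendMessage.
       Reason codes: 17 = Message, 18 = MessageRecorded, 19 = MessageAcknowledge. */
    extern void wimp_send_message(int reason, msg *m, int dest_task);

    /* On receiving a recorded DataSave we can’t service yet: acknowledge it
       so it doesn’t bounce, and remember its my_ref for the delayed reply. */
    int park_datasave(msg *m)
    {
        int their_ref = m->my_ref;
        m->your_ref = their_ref;
        wimp_send_message(19, m, m->sender);   /* Acknowledge – no bounce */
        return their_ref;
    }

    /* Later, once the file has actually been dealt with, send the delayed
       DataSaveAck quoting the saved reference, and the sender carries on. */
    void resume_datasave(msg *reply, int their_ref, int sender)
    {
        reply->your_ref = their_ref;
        reply->action   = 2;                   /* Message_DataSaveAck */
        wimp_send_message(17, reply, sender);
    }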
nemo (145) 2546 posts |
Steve said:
Aha, I knew there was some kind of QoS thing somewhere.
Horrible. Never mention it again. ;-) |
Eric Rucker (325) 232 posts |
Derp. I should’ve said: in a compositing WM using the GPU, the program and CPU wouldn’t have to repaint when a window is dragged over the program’s window, because the compositing WM has the GPU keep the program’s current framebuffer in memory and redraws it immediately. (Obviously there will still be repaints by the program, but I was replying to how, on XPDM, moving a window over another can cause a situation where the need for a repaint is visible to the end user. On WDDM (Vista/7), that’s no longer the case.) |
Neil Fazakerley (464) 124 posts |
Could I add a still, small voice from the sidelines, in this high-level debate about the future direction of RISC OS? One of the great things RO still has going for it is that it’s a superb OS for robotics. This is because it is one of the few windowing systems left that comes close to having a ‘real time’ mode. BASIC V, single-tasking under RISC OS, is one of the quietest environments available right now for directly monitoring and controlling high-speed sensors and I/O. Any OS that works on an enforced, time-sliced basis (i.e. PMT) is useless for high-speed, real-time robotics or computer control. RISC OS, on the other hand, has retained its ability to ‘stop the clock’ when necessary and devote itself completely to a single task when required to do so. Please, please, please, whatever other innovations or changes may be incorporated in future RISC OS versions, please ensure that this almost unique ability to drop out of the desktop and /truly/ single-task is retained in any future iterations. |
Eric Rucker (325) 232 posts |
Keep in mind that there are plenty of real-time operating systems that are pre-emptive – “real time” essentially means making sure that certain events occur within a certain amount of time, and that’s the job of the scheduler (and interrupt handlers). |
Jeffrey Lee (213) 6048 posts |
That’s a perfectly valid request. I’m not sure how easy it would be to fulfil though. Stopping desktop tasks from interfering would be easy enough (we’re a long way away from a proper PMT Wimp, so just running a single-tasking app will be enough). But stopping system threads would be a bit trickier, since they’d generally be there for a good reason. E.g. there’s a bounty for updating the USB stack, and in order to update to the latest BSD sources we might find that we’re forced to use threads instead of the current callback-heavy system. Similarly with networking, the current BSD internet stack is likely to be a very different beast to our current stack (from 1994!). I’d hope that the code is written well enough that the background threads will be idle if nothing’s going on, but obviously if you’re reliant on either of the stacks for a robotics project you might find that your code has to deal with a bit more background noise than usual. Of course if we got as far as adding multi-core support it should be pretty trivial to give programs the power to take full control over one core by forcing all the other threads onto the other core(s). |
Eric Rucker (325) 232 posts |
That actually seems like it’s going towards the Propeller approach – rather than bother with making an effective PMT RTOS, throw more cores at the problem and use one core per task. (Except the Propeller is an MCU, not as big a system as we’re talking about here.) Gotta say, that approach is probably the easiest to program. There would have to be special support, though, for a “designate this core for a single-tasking application” mode, right? (Certainly easier than making an RTOS to run underneath RISC OS, though – although maybe an existing RTOS could be used?) |
Rick Murray (539) 13840 posts |
I like the use of the fancy term “compositing Window Manager” to describe something that redraws to an image instead of directly to the window. You know, VisualBasic offers this behaviour if the window is set to “autoredraw”. Perhaps even RISC OS could support it one day? |
Jeffrey Lee (213) 6048 posts |
Nothing more complex than a way to override the processor affinity mask for each thread. As long as no code is written in a way that will cause it to fall over if it can’t get a core it’s specifically requested, there shouldn’t be any issues with allowing user apps to override affinity masks. |
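For illustration only – thread_set_affinity() and its mask layout are invented for this sketch, not an existing RISC OS call – an affinity mask is just one bit per core, so “give this program a core to itself” is a couple of mask updates:

    #include <stdint.h>

    typedef int thread_id;

    /* Hypothetical call: bit n of 'mask' set => the thread may run on core n. */
    extern int thread_set_affinity(thread_id t, uint32_t mask);

    /* Pin one thread to core 3 and confine everything else to cores 0-2,
       giving the real-time task a core to itself. */
    int isolate_on_core3(thread_id rt_thread, const thread_id *others, int n)
    {
        if (thread_set_affinity(rt_thread, 1u << 3) != 0)
            return -1;
        for (int i = 0; i < n; i++)
            thread_set_affinity(others[i], (1u << 3) - 1);   /* cores 0..2 */
        return 0;
    }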
nemo (145) 2546 posts |
Rick puckishly penned:
Isn’t it amazing the things they think of?! The problem with trying to impose off-screen buffering on existing apps is that some do direct screen access (I know mine do!) and are most likely to have read the screen base address on a mode change message. Their model is of a screen-sized (not window-sized) buffer, so imposing one will entail considerable memory wastage or copying… neither of which is desirable when the application is likely to be using direct screen access for speed.* So such a thing can’t really be imposed. It can be selected, of course (but then, it always could).

*Although one might be tempted to suggest sending applications a dummy mode-change message immediately before the redraw, one must consider that apps can do other expensive things during such a message, including reading and analysing the palette and caching sprites. So that’s not a good idea either. |
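For context, this is roughly what such an application does when the mode change message arrives – a minimal sketch assuming the usual RISC OS C environment (kernel.h/swis.h and _swix), with VDU variable 148 being ScreenStart:

    #include "kernel.h"
    #include "swis.h"

    static void *screen_base;   /* re-cached on every Message_ModeChange */

    void cache_screen_base(void)
    {
        int vars[2] = { 148, -1 };   /* VDU variable list, terminated by -1 */
        int result[1];

        /* R0 = list of variable numbers, R1 = buffer for the results. */
        if (_swix(OS_ReadVduVariables, _INR(0,1), vars, result) == 0)
            screen_base = (void *)result[0];
    }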
Eric Rucker (325) 232 posts |
Alright, I’m gonna bump this… Jeffrey: How much work do you think the microkernel approach would take to get a baseline level going (with almost all existing code staying as-is, without the benefits of the microkernel, but the support being there for new code)? I know it’d be a lot, but do you think it’d be feasible for you to implement? And what (if anything) do you see as more important than that for RISC OS right now? I can only speak for myself, but I’d like to see this, and while I don’t want to distract any developers from more important things, if there are developers that want to focus on this… ;) (Myself, I see some things that are important – a port to a decent Cortex-A15 platform and wifi support being up there – but the microkernel approach may just be more important, given how many multicore platforms are coming out that would be held way back by the lack of multiprocessing support, and how many drivers and such could benefit from it. End-user programs could see some benefit more quickly, too, especially if UnixLib is extended to spin pthreads off onto other cores. I wish I had the knowledge and skills to contribute, but unfortunately I don’t.) |
Jeffrey Lee (213) 6048 posts |
It’ll be a fair amount of work, but nothing insurmountable, I’d hope. I think one of the big issues is finding a suitable microkernel to use – preferably (?) something open source (but not GPL), with mature support for all the ARM architectures we’re interested in. Which also raises the question of which architectures we’d be interested in! If we’re going to build some kind of compatibility layer into the OS then we only need to worry about a microkernel that supports recent architectures (I’d say ARMv7+). If we’re not building compatibility into the OS then we’d have to find a microkernel that works as far back as ARMv3 – or start dropping support for the old architectures. Support for old architectures is something we could potentially add to the microkernel ourselves.
I’d say it’s feasible for me to do it, yes. I’ve done enough kernel hacking by now to know what I’m up to in there.
Finding some more OS developers? :) There are far too many other things for us to do before we start spending serious time on frivolities like adding a microkernel which nothing will be able to use yet.
|
Eric Rucker (325) 232 posts |
What about the compromise of dropping ARMv3, but keeping ARMv4? That keeps the RiscPCs going, but drops the ones with ARM6/7 cards, and drops A7000s – computers that, to be honest, I’d be surprised if they’re running anything more than 4.02 (and I wouldn’t be surprised if the majority of ARMv3 machines still in use are running 3.6 or 3.7). That would also support a theoretical A9home port, as it’s ARMv4T (although I don’t see there being much point to it, as the A9home is the lowest-performing post-Acorn RISC OS machine, and the Raspberry Pi beats it in every way already). ARMv5, however, needs to be supported IMO – there are plenty of Iyonixes out there still in use. As far as microkernels go, I do think that if an existing kernel is used, it should be one that’s 64-bit clean, and not especially tied to the behaviour of existing ARM CPUs, due to the changes in AArch64. |
Rick Murray (539) 13840 posts |
I think before we do this, the burning question is “what exactly is the difference?”. For example, Ubuntu won’t work on the Pi due to it being “old”. But what is the difference at a technical level? I believe (off the top of my head) that the “documented” MMU system changed either between ARMv3 and ARMv4, or between ARMv4 and ARMv5. Then there’s VFP/VFPLite/NEON. But the problem I see here is that, the way ARM makes stuff, it is all a big bag of mix’n’match. Does the Beagle’s OMAP MMU work like it says in my decade-old ARM ARM tome? I tried to wade through all that L4 interconnect guff and got lost. Though, given how ARM chips work, it would be quite feasible for TI (or whoever) to toss away ARM’s MMU and bolt in their own. Or tweak it to fit. Or…

On the face of it, I would tend to agree that it’s no big deal to drop ARMv3. My RiscPC runs an ARM710, and once upon a time I’d have been miffed at the idea, but now you can buy an ARM board for like thirty quid that’ll blow the RiscPC (any incarnation) so far out of the water it’d rest in orbit for years… and install a version of RISC OS downloaded for free from right here… Seriously, if you’re annoyed at ARM6/ARM7 being dropped, you can buy an entire new board for less than the cost of a StrongARM upgrade. I consider my RiscPC to be end-of-line now. Why upgrade it when there’s so much happening with new hardware?

However – for the purposes of a microkernel, what technically are the differences between the two? If it is a completely different MMU and such, it would make sense to ditch it rather than clutter up a new microkernel with support for fifteen-year-old tech; but if the difference amounts to minor things and “helluva lot slower”, then supporting it is surely “no big deal”. Over to Jeffrey… ;-) |