Thinking ahead: Supporting multicore CPUs
Martin Avison (27) 1494 posts |
To add to what Chris said, it is also true that each additional processor consumes some processor time just to schedule tasks onto processors and keep track of them. There are also instances where a processor has to wait until another processor has finished something, which also adds to the ‘unused’ time. When IBM first introduced ‘Attached’ and then ‘Multi’ processor IBM/370 mainframe machines in 1976, the overheads were quite large, but they were slowly reduced over the years as the OS and hardware improved. I would expect any use of multiple cores in RISC OS to follow a similar pattern – probably small (but useful) gains to start with. |
David Feugey (2125) 2709 posts |
One solution is to use only one core, and to provide a ‘light threads’ library to use the power of the other cores from the main OS. So no need for an SMP kernel or a ‘giant lock’-free system, just a kind of multicore TaskWindow. IMHO, taskwindows could/should be the bridge to SMP and PMT. |
Rick Murray (539) 13840 posts |
The problem is…when one of your cores wants to read a big wodge of data from disc. Remember, if you move away from plain BASIC to more advanced code in C or assembler, you run the risk of crashing into all sorts of gotchas.

Example? Let’s say you want to periodically read some sensor and log the results. Well, a simple CallEvery will get the OS to prod you to read the sensor. But you can’t write the data anywhere. Other programs, and maybe bits of the OS, will be “threaded”, and if one of those bits is a filesystem operation, touching the filesystem yourself will result in a “FileCore in use” error.

If this is the fun of doing periodic events with disc activity on a single core, try to imagine the joy that multi-core activity would bring! ;-)

1 This issue was identified in the creation of my server. It listens on a port twice a second (CallEvery) and when something happens it takes the expected action; however, the action code must run as a CallBack or else things go boobies-in-the-air big time, as we crash into the very valid risk that every non-reentrant call will fail – and one that is absolutely guaranteed to fail is any filesystem access. So many FileCore in use errors that I almost thought I was in a time warp and running RISC OS 2! |
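A minimal sketch of that CallEvery-then-CallBack pattern, assuming a relocatable module built with CMHG: the veneer names (tick_entry, work_entry) and the omitted handler bodies and CMHG header are purely illustrative – check the OS_CallEvery and OS_AddCallBack documentation for the exact register usage.

```c
#include "kernel.h"
#include "swis.h"

extern void tick_entry(void);   /* CMHG veneer: entered on every CallEvery tick        */
extern void work_entry(void);   /* CMHG veneer: entered later, as a transient callback */

/* Ask the OS to call tick_entry twice a second.
   R0 = delay in centiseconds minus 1, R1 = code address, R2 = R12 value. */
_kernel_oserror *register_tick(void *pw)
{
    return _swix(OS_CallEvery, _INR(0,2), 49, tick_entry, pw);
}

/* Called from the tick handler.  Filing system SWIs are NOT safe here,
   so just queue a transient callback and return; work_entry will run
   later, when the OS is back in a state where non-reentrant SWIs (the
   actual logging to disc) can be used without a "FileCore in use" error. */
_kernel_oserror *defer_to_callback(void *pw)
{
    return _swix(OS_AddCallBack, _INR(0,1), work_entry, pw);
}
```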
Rick Murray (539) 13840 posts |
TaskWindows could “leverage” (!) the ability to offload code onto other cores, but I don’t think they should be the primary interface. Perhaps, as the very minimum, you might not even need much support on the slave cores: you could trap all SWIs and kick them over to the primary core. This implies that other cores won’t necessarily see much of a speed-up, but it’s a place to start. More autonomous support for each core can be added in time, little by little, and likely in tandem with work on the main OS to stop it freaking out…which task where? EEK! |
David Feugey (2125) 2709 posts |
Not directly. But a light threads API could use TaskWindows, as they are already PMT compliant (and could become SMP compliant too, without the need to rewrite all the OS). Of course, that’s just a first step.
Yep, it’s closer to what we had. Another argument is that we could keep the same tools for a heterogeneous multicore system – for example to get access to the two Cortex-M cores of the OMAP5, or to do some clustering over a network. SMP is not so helpful. It seems cool, but once you make multicore apps, you’ll run into problems when trying to extend your application to a heterogeneous multicore architecture. An SMP/PMT core limited to what a TaskWindow can currently do would be enough for most code. TaskWindows could even rely on it.

So the idea is to say: 1/ a multicore-compliant version of TaskWindow (good for making tests and launching some code), and 2/ a multicore engine separate from the shell. When a second core is available, a monitor runs on it, and the whole thing is referenced as a resource for the multicore engine. You can also attach other resources, for example a slave system available on the network.

- A bit like the sound management :) You attach resources, and can use them, not to play sound, but to play code.
- A bit like the FS too. You write interfaces for a generic compute resources provider. You could for example provide a tool to make a cluster with SSH-only computers (OK: not really efficient).

You could use this from a TaskWindow (to launch tasks on other cores with a specific exec command…), from the Basic ASM (with a ‘launch on resource x’ command) or from your other code (with a light threads library, or as a task, through the TaskWindow module). All the non-interactive parts of the code could use it: image renderers, multimedia, gaming, emulation, compression, etc. Specific DAs and SWIs could be used to exchange data and commands. Endless possibilities. |
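Purely to illustrate the shape of such a ‘compute resources provider’ (every name below is invented – nothing like this exists in RISC OS today), a C-level sketch might look like:

```c
#include <stddef.h>

/* Hypothetical API - invented names, illustrating the idea only. */
typedef int compute_resource;            /* a core, a PC card, an SSH host... */

int              compute_count(void);    /* how many providers have registered */
compute_resource compute_open(int index);
void             compute_close(compute_resource r);

/* Run a self-contained, non-interactive piece of work on the resource:
   either a blob of position-independent code plus its data, or a named
   executable (the "specific exec command" case). */
int compute_run_code(compute_resource r,
                     const void *code, size_t code_size,
                     const void *input, size_t input_size,
                     void *output, size_t output_size);
int compute_run_exec(compute_resource r, const char *command);
```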
David Feugey (2125) 2709 posts |
I used something similar a long time ago. With a small Basic library, I was able to launch tasks inside a TaskWindow, or on the PC Card. I used it to make accelerators for tools like Gzip. It was possible to launch Gzip on the ARM, with PMT support (TaskWindow), or on the second core, the 486. Exactly the same thing. Of course it was only tricks, not an optimised module, but it did work, and with impressive gains (even though I copied the data and launched the tasks with a simple mix of CLI and DOS commands). Some modern SWIs would be more efficient, but I did like this easy way to test my idea. |
Rick Murray (539) 13840 posts |
Off topic – David, do you still have the code to start up the x86? I’m wondering how that mechanism worked (and don’t fancy trying to make sense of !PC). |
David Feugey (2125) 2709 posts |
I used a CLI tool that was able to launch software on the PC Card. I can’t remember its name (it wasn’t mine). Very basic… just to validate the idea. For Gzip, I checked if the PC Card was present, and either (no) launched Gzip locally, or (yes) copied the data, launched Gzip on the PC Card and got the data back. The gain was significant with Gzip, despite the transfer of data to and from the PC partition. It would be much better to use the DDK: http://www.riscos.info/index.php/PC |
Rick Murray (539) 13840 posts |
Therein lies the problem. My use of TaskWindow (and I am a geek) is for compiling stuff, reading DADebug logs, and issuing random *commands, as it is 2015 and we ought to have moved beyond ShellCLI by now… I really don’t think we will see benefits from other cores until actual applications can run on them.

That said, RISC OS has a rather sanitised view of application handling (everything is “the only app and it starts at address &8000”) that might lend itself to running on other cores – nothing is supposed to make too many assumptions about the state of the system it is running on. I am not sure I’m hopeful, though. We still haven’t figured out the mess that is coherent UTF-8 support with non-UTF-8 apps, so I can’t imagine how one could sensibly invoke the Wimp_Message mechanism with multiple applications that can take arbitrary amounts of time to execute.

Maybe this is the time to consider my suggestion for pre-emption-lite (a pair of SWI calls that inform the Wimp that the following code does not poll but takes time, so feel free to pre-empt it). I’ve already discussed the idea. Anyway, on a single-core system, the Wimp could assign such code a time-slice and yank control away when the time is up. On a multi-core system, the Wimp could kick such an app over to another core to run in peace (and pre-empt between them if multiple such apps exist), while the primary core runs the primary apps as normal.

It would be nice to have the Wimp capable of dispersing apps across all ‘n’ cores, but for this to work we’d really need to take a long hard look at how protocols such as User_MessageRecorded work, not to mention the behaviour of other cores for tasks that would ordinarily block the system. It’s merely a pain in the ass when ChangeFSI takes hundreds of times longer than SwiftJPEG to render a JPEG; it is much more critical when the Printer Manager is in use (due to how the printing system hooks its tentacles everywhere). Soooo many questions.
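To make the pre-emption-lite idea concrete, usage might look something like this – the SWI names and numbers below are invented purely for illustration; nothing is allocated or implemented:

```c
#include "kernel.h"
#include "swis.h"

/* Invented SWI numbers - purely illustrative, not allocated anywhere. */
#define Wimp_StartSlice  0x5A580   /* "I'm about to run for a while without polling" */
#define Wimp_EndSlice    0x5A581   /* "done - back to normal co-operative polling"   */

void render_big_image(void)
{
    _swix(Wimp_StartSlice, 0);   /* the Wimp may now pre-empt this task,
                                    or park it on another core           */

    /* ...long, non-polling work, e.g. ChangeFSI-style rendering... */

    _swix(Wimp_EndSlice, 0);     /* resume normal Wimp_Poll behaviour */
}
```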
Ah, you are probably thinking of ARMEdit to pass data/commands to a DOS session. A bit higher level than I was hoping for. |
David Feugey (2125) 2709 posts |
I was not talking strictly of the TaskWindow itself, but of the module that lets you launch tasks from your code in PMT mode. Very useful for front-ends, calculation offloading, etc. From a developer’s point of view there are many ways to use it to separate non-interactive code from the Wimp code, then to launch it in PMT mode, or even on another core. That would be a start (and a good change for tools such as ChangeFSI).
Of course. But do we need lots of multicore apps to be happy? ArchiEmu and Mplayer would be almost enough for me :)
I can’t either. That’s why I said that multicore code should not be more complex and ‘interactive’ than the code that currently runs under a TaskWindow. Of course, no Wimp calls. I see multicore more as a way to offload tasks from an app that works on the main core – with the TaskWindow module, or with a light threads library, or even from BBC Basic ASM with a specific directive.
That’s what a task running in a TaskWindow is :) (does not poll, but takes time). So I agree. We need something to manage PMT tasks more globally. PMT tasks, by design (and because of the limits of RISC OS), will be CMT compliant… and close to light threads as seen on other systems (the task will get access to a limited set of APIs/SWIs).
Yes and no. Since these tasks will not be Wimp tasks, perhaps it’s better to do this at a lower level. That’s why I also suggested moving the TaskWindow module from the Wimp to the CLI (not an easy task).
That was a (working) demo. The worst case. Anyway, I think that both ways should be provided: code level and exe level. I note that some people have managed to use the second core for some tasks (GPIO?). RISC OS FR still has a private bounty reserved for this (around €500, depending on the availability of the code [ARMX6, Titanium, Pi2, etc.]). There is another one for Brandy (complete refresh; Windows, DOS, ROS, Linux; collect all patches; add a few things). This one will be financed by a private company. I should make an announcement about this. Anyway, anybody can contact me through the RISC OS FR mail. |
Colin (478) 2433 posts |
Taskwindows don’t do PMT; they do co-operative multitasking by sleeping when a SWI is called. You can test this very easily with a C program along these lines:
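(The particular SWI used here, OS_ReadMonotonicTime, is just an arbitrary harmless choice; any SWI would do for the demonstration.)

```c
#include "kernel.h"
#include "swis.h"

int main(void)
{
    int t;

    while (1)
    {
        /* Comment out this SWI call and the loop will hog the machine. */
        _swix(OS_ReadMonotonicTime, _OUT(0), &t);
    }

    return 0;
}
```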
If you run the program in a taskwindow it will multitask, but if you comment out the SWI it will lock the machine up – as it will if the SWI doesn’t return. |
Colin (478) 2433 posts |
How about… At the moment, all applications except the one that sends the Wimp message are stalled, i.e. they have called Wimp_Poll and it hasn’t returned. So the Wimp just has to iterate through the list of stalled applications, posting the messages. With a PMT multithreaded system the other applications may not be stalled, but when the Wimp iterates through the applications to post the message, it stalls until that application calls Wimp_Poll, and then posts the message. |
David Feugey (2125) 2709 posts |
True. When calling SWIs, all PMT is gone. |
Rick Murray (539) 13840 posts |
Wimp_MessageRecorded? DataSave? |
Colin (478) 2433 posts |
Must go down as your shortest post, Rick :-) I presume it was directed at my post. Where’s the problem with Wimp_MessageRecorded and DataSave? The Wimp stalling until an application has called Wimp_Poll makes any message work as it does now, doesn’t it? |
Rick Murray (539) 13840 posts |
Cough. I’m at work. Cough. ;-) Thing is, both of my examples refer to a task that is expecting a reply to its message, so the calling task will also need to stall until the recipient has been polled. |
Colin (478) 2433 posts |
Yes, but the calling task is stalled anyway – in a state where the Wimp can pass it messages – when it calls Wimp_Poll to get the reply. Wimp_Poll always stalls an application; a Wimp event for that application makes the application continue. |
David J. Ruck (33) 1635 posts |
Going back five and a half years to the original post, I have to say that realistically option 6 is the most likely, and option 2 is the only other one that is practical. Although we’d all love full symmetric multi-processing like grown-up OSes, even if the work was done there’s only a handful of applications still in active development which could be modified to take advantage of it; everything else would at best gain no advantage, or at worst break and never be fixed. Keep the co-op and use the additional cores for specific offload processing.

But if we are going to try to make real SMP use of multi-core ARM processors, I’d go for something between 3 and 4, and to be trendy call it RISC OS containers. As RISC OS is essentially a single-client OS with a co-operative window manager bolted on top, without a complete re-write to make it fully thread safe – i.e. so it can be used by multiple clients (tasks in different threads or on different processors) simultaneously – the easiest way is to run multiple copies of RISC OS.

Each application would be invoked in a separate thread running a copy of RISC OS. An underlying pre-emptive kernel would switch between the threads (and allocate them to multiple cores). A layer of glue logic would both virtualise the hardware, so each copy of RISC OS thinks it has exclusive access, and provide a legacy message-passing system so applications could communicate with each other despite running on different OS instances. This would allow applications to run in parallel when handling input events, performing screen redraws and null processing. At message-passing events, the current blocking semantics would be enforced, causing threads to stall until the message has been handled.

The evolution of this method would be to gradually move functionality from the separate RISC OS instances into the underlying fully thread-safe layer, eventually coming up with a new Wimp which would support a different pre-emption-compatible message-passing protocol. That would open up the way for a new class of application while still maintaining compatibility for legacy applications running in their own OS instances.

But pigs are more likely to fly. |
David Feugey (2125) 2709 posts |
Yep
That’s not SMP, but AMP. One session of RISC OS on each core (of course, only one will do the I/O). |
William Harden (2174) 244 posts |
Right – I’m not in a position to test this (mainly because my monitor is VGA and I’ve used a Pi1 with an X100 connected to it – I have a Pi3 ready to go but need HDMI availability to play with it properly). However, a read this evening would suggest that setting up the Pi for multicore is more straightforward than the Panda. The cores are all enabled on boot, and are basically sat waiting for the address of some code to run, which is supplied in a mailbox (the Panda’s, from what I read, needed turning on). It looks like the mailbox is at a physical address (0x4000008C + 0x10 * core_number). So is it possible to load some code into logical address space, get its physical address, then push that physical address into the mailbox address above? (Clearly we’re not talking ‘et voilà, useful multicore’ – just a question of whether it’s possible to demonstrate more than one core being in actual use at once, as a very first baby step.) |
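As a very rough sketch of that baby step (untested; it assumes you already have a logical mapping of the BCM2836 local-peripherals block at physical 0x40000000 – however that mapping is obtained – and that entry_phys is the physical address of some position-independent code for the core to execute):

```c
#include <stdint.h>

/* Core n's mailbox 3 write-set register, as an offset from the base of
   the local peripherals block (physical 0x40000000 on the Pi 2/3),
   i.e. physical 0x4000008C + 0x10 * core. */
#define MAILBOX3_SET(core)  (0x8C + 0x10 * (core))

static void start_core(volatile uint8_t *local_base, /* logical mapping of 0x40000000 */
                       unsigned core,                /* 1..3 */
                       uint32_t entry_phys)          /* physical address of code to run */
{
    volatile uint32_t *mbox =
        (volatile uint32_t *)(local_base + MAILBOX3_SET(core));

    *mbox = entry_phys;        /* the firmware's boot stub polls this mailbox */
    __asm volatile ("sev");    /* wake the core in case it is waiting in WFE  */
}
```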
Jeffrey Lee (213) 6048 posts |
In theory yes. There are at least a couple of threads which discuss this (1, 2, 3) but no success stories yet.
I thought the Panda was pretty straightforward – AIUI the boot ROM will put the second core into a similar sleep loop and all you need to do to wake it up is write to AUX_CORE_BOOT_1 (boot address), AUX_CORE_BOOT_0 (“jump to boot address on next event” flag), then send an event via SEV. Check the TRM! Edit: “the Panda’s from what I read needed turning on” – Ah yes, it’s possible there’s some power management setup which is needed. Ignore me :-) |
Jeffrey Lee (213) 6048 posts |
I’m starting to think that that’s an easily achievable goal, with useful short-term gains, and it can act as the foundation for a lot of future work.

Make the memory map for the slave cores contain just the ROM, the appslot for the executing task, and a small amount of kernel workspace. (Having the ROM available isn’t strictly necessary, but it will allow apps which use BASIC or CLib to function without being pushed onto the main core all the time.) ROM is readable from user mode, the appslot is read/write, and everything else is only accessible from privileged modes. Tasks are restricted from entering privileged modes.

If the task running on the slave core calls a SWI or triggers an abort (e.g. trying to access a dynamic area which isn’t mapped in, like the RMA) then suspend execution at that point and push the task into a queue of tasks which are waiting for the main core. Then, at a later point, the taskwindow module on the main core will switch to that task (as normal) and restart the aborting instruction. The task will run on the primary core for that timeslice, after which it will be put back into the pool of tasks which are available for the slave cores to run.

With tomorrow’s Pi ROM I think it would be possible to write it as a drop-in replacement for the current taskwindow module – no changes to the kernel or the rest of the OS required. Once the basic system is up and running we could start extending it with more capabilities, e.g.:
|
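A very rough sketch of the trap-and-queue flow described above, with entirely invented names (this is illustrative structure only, not existing code):

```c
/* Invented types and functions, illustrating the flow Jeffrey describes:
   a task running on a slave core hits a SWI or an abort, is suspended,
   and is queued for the main core's taskwindow module to resume. */
typedef struct task task_t;

extern task_t *queue_pop(void);                 /* tasks runnable on slave cores     */
extern void    queue_push_for_main(task_t *t);  /* tasks needing the main core       */
extern void    save_context(task_t *t);         /* saved at the faulting instruction */
extern void    restore_and_run(task_t *t);      /* map in appslot, resume in USR     */

/* Entered from the slave core's SWI vector / abort handlers. */
void slave_trap(task_t *current)
{
    save_context(current);          /* so the SWI / aborting instruction restarts */
    queue_push_for_main(current);   /* main core will run it for one timeslice    */
}

/* The slave core's idle/dispatch loop. */
void slave_dispatch(void)
{
    for (;;) {
        task_t *t = queue_pop();    /* a task the main core has finished with */
        if (t)
            restore_and_run(t);     /* runs until the next SWI or abort       */
        /* else: wait for work (e.g. WFE) */
    }
}
```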
David Feugey (2125) 2709 posts |
Good plan. Perhaps even better than a full SMP system. We have so many problems with locks and other things that it’s almost always better to say that core 0 should be the only one responsible for the GUI and other core functions. If we can use Basic, C or ASM code on other cores, that’s good. If we can have light threads, that’s good too. And if you have a solution to go back and forth to core 0 for system calls… hey, that’s perfect. The only problem is TaskWindows. It would be perfect if they could work without the Wimp, directly from a CLI-only system. |
David Feugey (2125) 2709 posts |
Note: that could be a bounty. |
John Williams (567) 768 posts |
[TaskWindows could “leverage” (!) the ability to offload code onto other cores] I hesitate to ask this, not really having much (any?) understanding of the matter, but does the above imply – to reach my wish of a RISC OS computer which could access a Linux browser, perhaps in a desktop window – that RISC OS could run on the primary core, and a (simplified?) version of Linux could run, say, Firefox on another, presenting it to the RISC OS desktop in that window? If that were possible, I suspect it would tick all my boxes! Is it just a totally misconceived idea that couldn’t possibly work? |