Thinking ahead: Supporting multicore CPUs
Jeffrey Lee (213) 6048 posts |
I think we’re all in agreement that it would be a good thing if RISC OS were able to take advantage of multicore CPUs. And I think we all understand that it isn’t likely to be easy to find/implement a solution which is both easy for programmers to make use of and allows the additional cores to be used to their fullest. And those of us who have been watching the CPU market for the last few years are probably in agreement that the days of the high-performance single-core CPU are numbered; sooner or later the only high-performance ARM CPUs available will be multicore-only. So if we’re serious about the long-term future of RISC OS we need to have some kind of discussion about what our options are with regards to supporting multicore CPUs, so that we can work out what steps we can take to ensure that one day we reach our goal. This thread is intended to be the place for that discussion to take place (plus somewhere for me to point people towards whenever they mention OMAP4 ;))

The problem

For the uninitiated, the problem with supporting multiple cores is communication. Whenever a core tries to access a shared resource it must make sure that that resource isn’t already in use by another core. This shared resource could be almost anything, from a physical device like a printer right down to the smallest resource possible, an individual bit of memory. But no matter what the resource is, the outcome is always the same if two programs/threads/cores fail to negotiate with each other for access: something bad will happen and your data will be corrupted or your programs will crash or malfunction.

At the moment the most basic method RISC OS uses for preventing concurrent access to a resource is to disable interrupts in the PSR. This ensures that, unless something unexpected like a data abort occurs, the current program has 100% control over what the CPU is doing. But in a multicore environment this simply won’t work – the other cores will continue to run as normal. This means that all the existing pieces of code that use this method will need updating to use new methods (there’s a sketch of one such method below).

The fact that the code that’s currently used for preventing concurrent accesses is insufficient, coupled with the fact that a shared resource could be almost anything, is the reason why adding multicore support isn’t just a simple case of updating the Wimp to allow two tasks to run at once. The Wimp has no way of knowing what resources are and aren’t shared, so as soon as a program tries calling a SWI or tries to access memory outside its application space it will have to be blocked until any other programs (which are currently accessing potentially shared resources) explicitly yield by calling Wimp_Poll.

The options

The way I see it, we have several options available with regards to getting RISC OS to run on a multicore CPU.
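Going back to the interrupt-disabling problem for a moment, here’s the kind of replacement primitive I mean – a minimal spinlock sketch. It assumes GCC-style atomic builtins (which compile down to the LDREX/STREX exclusives on ARMv6K and later); the names are illustrative, not an existing RISC OS API:

  /* Minimal spinlock sketch. Disabling IRQs only stops *this*
     core; a lock like this is needed to hold off the others too. */

  typedef struct { volatile int locked; } spinlock_t;

  static void spin_lock(spinlock_t *l)
  {
      /* Atomically swap in a 1; loop until we are the core that
         changed the flag from 0 to 1. */
      while (__sync_lock_test_and_set(&l->locked, 1) != 0)
      {
          /* Another core holds the lock; wait until it looks free
             before retrying the atomic operation. */
          while (l->locked)
              ;   /* ideally a WFE/yield hint here */
      }
  }

  static void spin_unlock(spinlock_t *l)
  {
      __sync_lock_release(&l->locked);   /* store 0, with barrier */
  }

On a single core, turning IRQs off is cheaper than this, which is part of why retro-fitting proper locking everywhere is such a big job.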
Other things that need considering

Any kind of message passing system – events, upcalls, vectors, broadcast wimp messages, etc. These may need to be changed completely in order to make sense in a multicore/multithreaded environment.

Global environment state, like system variables or the CSD – moving to a more process-based system where these state variables are replicated for each process may be required in order to ensure programs don’t conflict with each other.

See also
Any thoughts? |
Terje Slettebø (285) 275 posts |
Good initiative, Jeffrey. I agree with your analysis of the available options. The following are some thoughts from someone who’s mostly a novice when it comes to programming with concurrency (doing web application development, you largely get concurrency for “free” when using a “share nothing” architecture like PHP’s). As you touched on yourself, option 5 may be one of the most reasonable approaches. If it could be made to work, it would allow a gradual migration to concurrency, without having to wait for a full OS rewrite à la option 1. Furthermore, we wouldn’t have to deal with preemption, “only” synchronisation of access to any shared state – or alternatively, where it makes sense, unsharing state by making each task something like a process (with its own system variables, etc.), as you suggest. It should be possible to move from this to a fully preemptive system later, without a full rewrite, since the above things would be needed in such a system as well. |
Jeffrey Lee (213) 6048 posts |
Option 5 effectively is preemption. If a program’s currently running, and it isn’t trying to access the big lump of shared resources (or it’s currently waiting to be given access), then it can be pre-empted and replaced with any other runnable program. The only thing we’d need to add to make it fully preemptive would be a thread scheduler that switches threads/processes at arbitrary points in time instead of just on resource conflicts. If we’re aiming for a multicore/thread-aware OS, then how about the following list of things we can start doing now? Apart from helping bring us closer to the multicore/multithread goal, they’ll all result in improvements that will benefit us today.
|
Martin Bazley (331) 379 posts |
A few thoughts from someone who really doesn’t have a clue what he’s talking about:

We need to do something about non-thread-safe APIs, which extends beyond the sensible step of making them use their own buffers. The example given in the riscos.info article is Wimp_OpenTemplate, which does not have any kind of template ‘handle’ system at all, nor any provision for the implementation of one. If we were to fix all these APIs properly, we would instantly break backwards compatibility for every application ever written.

My proposal: according to the ARM ARM, on entry to the handler at address 0x8, R14_svc is set to the address of the instruction after the SWI call. Therefore, we know the address of the code which called it. Now, unless I’ve missed something, there are four different areas this could come from: application space, the RMA, another dynamic area, or the ROM. This breaks down further into the application currently paged in, any of the modules or utilities occupying the RMA, the dynamic area number, or one of the ROM modules (or possibly a different piece of ROM code). Now, presumably the Wimp must have a unique record somewhere of what it last paged into 0x8000 (in fact, are task handles allocated the moment an application is started, or only on a call to Wimp_Initialise?). I know there are ways and means of determining the base address of a RAM or ROM module given an address inside it (we have ‘where’, after all), and dynamic areas already come with their own unique handles. Given all this, we must be able to put together a number which will uniquely identify any given process to the routine being called.

After this, we simply need to implement a handle system in the non-thread-safe APIs using the unique value passed by the OS, which recognises the calling task and restores the previously stored state from when the two last communicated. To avoid conflicts of interest in SWIs which expect certain registers to have certain values (are there any?), we could store it in an easily retrieved location, such as at the top of the SVC stack, whence it could be loaded with “LDR Rd,[R13,#-4]”. I suspect it won’t be quite as simple as that… (A sketch of what the receiving end might look like follows.)
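Purely to make the idea concrete, here’s roughly what the receiving side could look like – a SWI implementation keeping one state block per caller, keyed on whatever unique process identifier the OS hands it. Everything here (the names, the fixed-size table, how the ID arrives) is hypothetical, not an existing API:

  /* Hypothetical per-caller state for a Wimp_OpenTemplate-style
     API. 'pid' is the unique process identifier described above;
     how it is delivered (register, SVC stack) is left open. */
  #include <stdio.h>

  typedef struct {
      unsigned pid;            /* 0 = slot free */
      FILE    *template_file;  /* this caller's open template file */
  } template_state;

  #define MAX_CLIENTS 32
  static template_state clients[MAX_CLIENTS];

  /* Find (or allocate) the state block belonging to this caller. */
  static template_state *state_for(unsigned pid)
  {
      template_state *free_slot = NULL;
      for (int i = 0; i < MAX_CLIENTS; i++) {
          if (clients[i].pid == pid) return &clients[i];
          if (clients[i].pid == 0 && !free_slot) free_slot = &clients[i];
      }
      if (free_slot) free_slot->pid = pid;
      return free_slot;   /* NULL = table full; real code must handle this */
  }

The existing SWI would then operate on state_for(pid)->template_file instead of a single static file handle, so two tasks opening templates at once would no longer trample on each other. |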
Terje Slettebø (285) 275 posts |
Ok, then I misunderstood. What I was thinking of, and thought you meant, was that it should still have cooperative multitasking, but more than one task would be able to run at the same time. In other words, each core would be running its own poll loop, fetching events from a shared queue and dispatching messages to the appropriate tasks through Wimp_Poll. (Something like the sketch below.)
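A minimal sketch of what that shared queue might look like – a lock-protected ring buffer that every core’s poll loop pops from. The event type is illustrative, and it reuses the spin_lock/spin_unlock primitives sketched earlier in the thread:

  /* Hypothetical shared event queue for per-core poll loops. */
  typedef struct { int task_handle; int event_code; } wimp_event;

  #define QSIZE 64
  static wimp_event queue[QSIZE];
  static int head, tail;            /* head == tail means empty */
  static spinlock_t queue_lock;

  /* Called by each core; returns 0 if no event was waiting. */
  int next_event(wimp_event *out)
  {
      int got = 0;
      spin_lock(&queue_lock);       /* one core at a time in here */
      if (head != tail) {
          *out = queue[tail];
          tail = (tail + 1) % QSIZE;
          got = 1;
      }
      spin_unlock(&queue_lock);
      return got;
  }

A real implementation would also want idle cores to sleep rather than spin when the queue is empty. |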
Jeffrey Lee (213) 6048 posts |
No, it won’t :)
Task handles are allocated on the call to Wimp_Initialise. But there are only two types of tasks – module tasks (which first have to be started by *RMRun) and application tasks (which must be started via Wimp_StartTask). So it shouldn’t be too hard to use those events as markers for when new processes start. Detecting when they stop is also quite easy, since they all must go via OS_Exit. Also I’m fairly certain that non-tasks can’t call Wimp_OpenTemplate, so for handling templates we only need to know the handle of the active task (which the Wimp will obviously know whenever one of the template SWIs gets called).
Well, SWIs obviously expect their parameters to be in the right registers on entry :) On entry to a SWI R0-R9 will be used for parameters, R10 is used for the SWI number (I think!), and R12+ are used for other data (module workspace pointer, stack, etc.). This should leave R11 free for a process identifier, or a pointer to a block of module-specific, process-specific memory (so modules won’t have to manually search for the block of memory they use to store all the data for a process).
Storing stuff at magic locations inside the stack is bad. Just because Acorn did it with the Shared C Library doesn’t mean that we should do it too!

A better approach would be a SWI which returns a pointer to the value. The value itself would be stored in a page of memory that contains other process-specific information. This page of memory then gets swapped out by the OS whenever it switches from one process to another (much like application space); therefore the module only needs to call the SWI once, when it initialises, in order to get the pointer to the data.

Of course this approach isn’t quite as good as passing the process ID in a register. Although I can’t think of a real example at the moment, imagine what would happen if a module schedules something that will complete at a later date (i.e. triggered by an interrupt). Interrupt handlers generally aren’t associated with processes, so when the completion interrupt occurs the value of the process ID in the magic page will just be the value of whatever the current process is, not the process that requested the interrupt-based activity to occur. So if the programmer hadn’t thought everything through when he wrote the module then he might try looking up the process ID from the magic pointer and end up associating the result of the interrupt with the wrong process. On the other hand, if the only way of getting the process ID was via a register on entry to a module’s SWI, then the programmer would be forced to keep proper track of which process is associated with each interrupt-based activity.

Unfortunately I’m not 100% sure that will work, since I think some naughty modules do like to skip the kernel SWI dispatcher and jump straight to the SWI handler inside the target module. These naughty modules would need updating to preserve the process ID in R11 (and it wouldn’t work at all if it was a pointer to module-specific, process-specific memory). So unless we want to potentially break lots of code, a SWI to return a pointer to a magic page might be the only solution.
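To illustrate the magic-page idea (the SWI name and number are invented – nothing like this exists yet):

  /* Sketch of a module using a hypothetical OS_ProcessData SWI
     that returns a pointer to this module's word in the per-process
     page. The page is remapped on every process switch, so the
     pointer stays valid but its contents follow the current process. */
  #include "kernel.h"

  #define OS_ProcessData 0x7F000   /* invented SWI number */

  static void **process_word;      /* our slot in the magic page */

  void module_init(void)
  {
      _kernel_swi_regs r;
      _kernel_swi(OS_ProcessData, &r, &r);
      process_word = (void **)r.r[0];  /* fetch once, at initialisation */
  }

  /* Later, on any SWI entry: *process_word is this module's
     process-specific data for whichever process is current. */

Note the caveat above, though: dereferencing this from an interrupt handler gives you whichever process happens to be current, not the one that started the operation. |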
James Peacock (318) 129 posts |
A lot of things to comment on here. I think gradually introducing more memory protection would be a good thing anyway. There is always the problem of backwards compatibility, but given that most compiled applications need to be rebuilt for new processors anyway, it may be a good time to start investigating locking things down a little bit more.

Although they are a popular example, I think Wimp_OpenTemplate and friends are one of the easier cases to deal with w.r.t. preemption: the Wimp can tell which task called those SWIs and could simply allocate task-specific buffers itself.

How much does the kernel know about different tasks or processes? Does the Wimp still manage processes, but delegate paging to the kernel? In which case I think the kernel would need to take over responsibility for managing processes, even if for the moment it is the Wimp which asks it to switch between them. How difficult would such a change be? I ask this as there is a whole host of kernel settings to do with the graphics system, language, alphabet, sprite redirection etc. which would need to be per-process.

It seems to me that there is so much to talk about on this topic that this discussion could become incoherent unless we are careful. What I’ve written above is probably |
Jeffrey Lee (213) 6048 posts |
Not as much as it should!
Yes.
If it’s just a case of relocating all the bits which the Wimp currently tracks then it should be fairly easy. Extending it to track new things might be trickier, though, simply because you run the risk of breaking compatibility with existing code. That said, any well-written program which makes use of shared state shouldn’t expect the settings to be preserved across calls to Wimp_Poll, so in theory the switch over to task-local state shouldn’t affect it at all. |
Jess Hampshire (158) 865 posts |
Questions (from a non-programmer perspective) about the options:

4. Since the “new version of RISC OS” isn’t compatible in this scenario, could it be a hybrid of the best bits of BSD (or similar – licence dependent) and RISC OS (e.g. like ROLF)?

5. “Initial performance poor” – by this do you mean similar to not using the extra cores a lot of the time? (i.e. /not/ slower than option 6)

Increasing memory protection: now this seems like a stunningly good idea, even if we were to stay stuck on one core. Ideally, however, it would be good to have user control over this (plus any other changes that break old applications). Perhaps a legacy option in Configure – (does RISC OS 6 have something like this?) It would also be nice to be able to control it on the fly, so if you have one app that breaks, you can turn off the protection for the duration, e.g. a utility that changes the option and displays the state on the icon bar. (This could be added to the run file of a bad app.) |
Jeffrey Lee (213) 6048 posts |
Yes, it could.
Yes.
Yes, it should be possible to have user control over the memory protection. I’m not sure how easy it would be to have backwards-compatibility options for every other change, though (although we’d obviously aim to break as little stuff as possible!) |
Steffen Huber (91) 1953 posts |
Hi Jeffrey, it is surely an interesting array of possibilities that you have presented. To make a sound decision, I think we first have to establish where we (or, based on past experience, you ;-)) want to take RISC OS in the future.
If we decide to keep RISC OS similar to what it is today, I think option 2 is the only viable solution. We keep maximum compatibility, and still have quite a lot of potential for using the 2nd core (see the earlier tests by Tematic running the IP stack on a second processor). At the end of the day, I don’t think that “proper” support for multiple cores should be a priority at all. My personal top priority would be to revamp the filing system stuff: support for large files (and consequently large ImageFSes), transferring hardware access away from FileCore clients into a proper device manager, making it easier to port foreign FSes… |
Jeffrey Lee (213) 6048 posts |
All very good questions! I think ROOL and Castle are the people best suited to answer most of them, since they are the ones ultimately in charge of the OS. All I’ll say is that I certainly don’t want RISC OS to turn into Yet Another Unix/Yet Another Linux (my skin crawls every time I see someone refer to a me-too Linux distribution as being a distinct OS!). RISC OS’s uniqueness is pretty much the only thing it’s got going for it. |
Adrian Smith (422) 2 posts |
By coincidence I was thinking about this the other day, and wrote up my thoughts here (basically option #2): http://www.databasesandlife.com/multi-core-risc-os-proposal/ |
Jess Hampshire (158) 865 posts |
Your suggestion (to me as a non-programmer) appears to sit somewhere between Jeffrey’s options 2 and 4. This leads me to the question: could a roadmap that leads to option 4 include option 2 as an earlier milestone? (Or even one that leads to option 5?) |
Martin Bazley (331) 379 posts |
Well, since we’re going to have to update a lot of code anyway, it shouldn’t be too hard to insert an STM and an LDM of R11 at the beginning and end of every misbehaving SWI. (I assume it would only ever be kernel modules which would be able to use branches.) Thinking further along these lines, once we have code to produce a unique process identifier it has potentially many applications – for example, whatever bit of FileSwitch (or possibly something else, I really can’t tell the difference between them) sets and returns the CSD. What I’m more worried about are system variables – surely we can’t reproduce every one for all processes? |
James Peacock (318) 129 posts |
I’ve been thinking along the same lines as Adrian – option 4 via option 2 – only using an existing microkernel such as one of the L4 variants or something similar. Would it be possible to slot RISC OS in on top of such a beast without an impractical amount of effort? I’m not sure, but it might be possible to adapt the microkernel to make it work. Once this is done, RISC OS-like system communication could be built using the IPC provided by the underlying microkernel, and modules gradually converted into microkernel processes or some sort of shared library/resource as appropriate. Existing applications could run in the old RISC OS, or as microkernel processes if some form of SWI mapping could be arranged. The big advantage of this approach is that microkernels already deal with processes and have highly efficient interprocess communication, meaning these things don’t need to be developed from scratch. Longer term it gives a mechanism to gradually transform RISC OS into a more modern system. |
Ben Avison (25) 445 posts |
Here’s another vote for using a per-thread memory allocation system as a way to patch up SWIs like Wimp_OpenTemplate that assume they can store data in what is (in ‘C’ terms) static data and have it preserved from one SWI to the next. In fact this is already a problem – try running a couple of taskwindows concurrently which do a lot of GSTrans processing or directory enumeration, for example. Yes, it would have been nice if these sorts of APIs had been designed correctly in the first place, but wholesale incompatible API changes will just break lots of software and alienate what developers are left – it’s particularly tricky for RISC OS where a lot of software is written in assembler, and just requiring an extra register here or there can be quite a headache to accommodate. In fact, you’d probably also want to store things like the CSD in per-thread storage too, maybe some categories of system variables as well. It’s a feature that’s seriously overdue.

The Cortex-A9 is probably the first chance RISC OS will ever have had to run on a truly symmetric multiprocessor system. The stuff we did at Tematic was on chips where the ARM cores were of radically different specifications (not all had MMUs, for example) and could see different peripherals, and is really not reusable. The Hydra IIRC used a scheme like Jeffrey’s option 2, and never really gathered any support from developers, even back in those relatively prosperous days. I can only really see a full SMP approach working, even though it’d require a hell of a lot of effort.

I’ve given the whole thing a fair bit of thought on and off for years. You’ve seen the threading branch of the Wimp – I’ve never been totally happy with that. I do feel that process management is too tightly tied to the GUI and too loosely tied to threading libraries and pre-empters like the TaskWindow module at the moment. A good scheduler will know about the comparative requirements of different processes, but also know that context switches between threads of the same process are cheaper than those between processes (especially if you can keep them on the same CPU), and I don’t think any of the solutions we have at the moment can really address that. Decoupling process management from the GUI would make it easier to write daemon processes, and certain things like the Internet stack which have to make heavy use of callbacks to achieve background operation should be easier to port and maintain. Writing a good scheduler appears to be rather a black art, and critical to making a good multiprocessor OS. Given the huge amount of effort involved in otherwise making RISC OS multicore-aware, I’d be sorely tempted to lift a scheduler from one of the BSDs or another permissively-licensed OS – it’s not like we haven’t borrowed from BSD heavily already!

One way of looking at the problem of addressing backward compatibility would be for the OS to, at least initially, break everything up into very coarse lumps protected by mutexes which the OS assigns on their behalf. For example, the USR mode parts of two different applications would rarely conflict with one another, and could probably run concurrently on two different cores. Things only get tricky when you introduce privileged mode code. So a first solution would be to mutex-protect all SWI and interrupt code so that all of it only ever runs on one core at a time (roughly sketched below).
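In Linux terms this is the “big kernel lock” approach. A rough sketch of the dispatch side, reusing the spinlock sketched earlier in the thread – it has to be recursive, because RISC OS SWIs frequently call other SWIs:

  /* Hypothetical giant lock around all privileged-mode entry points.
     Recursive: a SWI handler may itself issue SWIs. */
  typedef struct {
      spinlock_t lock;
      int        owner;   /* core ID holding the lock, -1 if free */
      int        depth;   /* recursion count */
  } kernel_lock_t;

  static kernel_lock_t biglock = { {0}, -1, 0 };

  void kernel_lock(int core)
  {
      if (biglock.owner == core) {   /* already ours: just nest */
          biglock.depth++;
          return;
      }
      spin_lock(&biglock.lock);
      biglock.owner = core;
      biglock.depth = 1;
  }

  void kernel_unlock(void)
  {
      if (--biglock.depth == 0) {    /* outermost release only */
          biglock.owner = -1;
          spin_unlock(&biglock.lock);
      }
  }

Every SWI dispatch and interrupt entry takes the lock and every exit releases it; USR-mode code runs unlocked on any core.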
A next step would be to break that granularity down to a per-module level, so the OS assigns one mutex per module which is claimed on all its entry points (including registered callbacks, interrupt handlers etc). This could remain in place indefinitely for legacy modules which are no longer under development, but others could signal their ability to do their own concurrency management, and work on each module could proceed in isolation.

One other thing: I’d like to appeal against the suggestion of splitting the RMA into multiple dynamic areas. In terms of RAM, the Beagleboard xM is already within a factor of 2 of a fully-expanded Iyonix, and RAM sizes are bound to grow – no doubt within a few years we’ll be looking at machines with nearly 4GB of RAM. As physical RAM size grows, allocation of task-independent logical address space for dynamic areas becomes more and more of a problem, and we end up having to pre-judge how the user wants to use their RAM, because we set a maximum possible size of each dynamic area as soon as we set the base address of the next dynamic area along. A better solution would be to retain one RMA, but have different page access permissions within the RMA depending upon whether the page contains code or data. Other dynamic areas, which don’t need to be accessed directly from all tasks, are probably better moved into application slots – the obvious one here is RAMFS. Ideally, the application slot should increase in size over time so that it is always larger than the amount of RAM fitted to any supported machine. |
Jeffrey Lee (213) 6048 posts |
Yes, very much so. A key part of all of those options would be to upgrade some or all of the kernel to provide threading features, either for the purpose of inter-process communication across the cores or just to provide the basic operating environment which code on the other cores will run under.
I don’t think there’d be a need to reproduce all the system variables. By default the system variables would be global, but with something like a *SetLocal command a process could create a new local system variable, or override the value of a global one. Things would get a bit more complicated when you take into account child processes inheriting the system variables of their parent, but ultimately the majority of the system variables would be global. (There’s a sketch of the lookup rule at the end of this post.)

“I’ve been thinking along the same lines as Adrian and option 4 via option 2, only to use an existing micro kernel such as one of the L4 variants or something similar. Would it be possible to slot RISC OS in on top of such a beast without an impractical amount of effort? I’m not sure, but it might be possible to adapt the micro kernel to make it possible. Once this is done RISC OS like system communication could be built using the IPC provided by the underlying micro kernel and modules gradually converted into micro kernel processes or some sort of shared library/resource as appropriate. Existing applications could run in the old RISC OS or as microkernel processes if some form of SWI mapping could be arranged.”

It would probably be quite easy to slot RISC OS on top of a microkernel, yes. But unless we go down the route of creating a new OS, we’d ultimately want all the appropriate bits of the microkernel to be merged into the existing RISC OS kernel in order to get as much performance as possible (e.g. minimising overhead in SWI dispatching and IRQ handling, which would likely have to go through the microkernel first before reaching RISC OS).

I’m also in agreement that it’s a sensible idea to take some threading code from elsewhere. Although I’ve got a few years’ experience with writing threaded code, I’m nowhere near good enough to write a good-quality thread scheduler and set of synchronisation primitives!
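Back to the system variables for a moment – to make the *SetLocal idea concrete, here’s a sketch of the lookup rule. The command name and structures are invented:

  /* Hypothetical per-process system variable lookup: check the
     current process's local table first, then fall back to the
     single global table. */
  #include <string.h>

  typedef struct sysvar {
      struct sysvar *next;
      const char    *name;
      const char    *value;
  } sysvar;

  static sysvar *global_vars;              /* shared by everyone */

  static sysvar *find(sysvar *list, const char *name)
  {
      for (; list; list = list->next)
          if (strcmp(list->name, name) == 0) return list;
      return NULL;
  }

  /* 'proc_vars' would live in the per-process page discussed above. */
  const char *getvar(sysvar *proc_vars, const char *name)
  {
      sysvar *v = find(proc_vars, name);   /* local override? */
      if (!v) v = find(global_vars, name); /* else the global value */
      return v ? v->value : NULL;
  }

A *SetLocal would insert into proc_vars; a plain *Set would keep writing to global_vars, so untouched variables would cost no extra memory per process. |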
That’s a good point – I’d mostly been thinking about making stuff safe for multiple processes, not for multiple threads. In that case we may need to provide three levels of storage for modules – global, process, and thread.
Yes, that’s basically how I envisioned option 5 would work.
Fair point. The fragmentation caused by using per-page access rights won’t be anywhere near as bad as the extra logical address space required for splitting the RMA. |
Adrian Smith (422) 2 posts |
I assume that, back then:
Whereas these days:
I’m not saying option #2 is definitely the best – only that the fact that the Hydra API wasn’t widely adopted doesn’t mean one should discount option #2 completely. |
Eric Rucker (325) 232 posts |
Option #2 is what other CMT OSes did to get multi-CPU support going. Multi-CPU Macs were pretty much just for Photoshop. However, you could use a hybrid of #2 and #4, I suspect – doesn’t Unixlib implement a threading model? Make Unixlib’s threading model take advantage of the SMP system, and I suspect most apps that use Unixlib could take advantage of it right away. I do have a suspicion that one processor will always have to be a master processor, with the others as slaves, though, with most apps running on the first processor, even if the OS goes to PMT later. But, I’m no developer, so… |
Jeffrey Lee (213) 6048 posts |
Here’s something interesting: ARMv7 has three CP15 registers that are dedicated to storing thread IDs (or process IDs, or whatever other 32-bit value you want). Of course the only trouble there is that the registers aren’t present in older versions of the architecture, which could cause problems if we want the multicore version of RISC OS (or even programs written for it) to run on older machines.

Also it’s worth pointing out the dual translation table base registers that are available in ARMv7. Basically they allow you to split the logical address space into two parts: one part uses one page table, the other part uses the other. Context switching is then just a case of updating one of the TTBR registers to point to the new page table (specifically it has to be TTBR0, which covers the lower part of the logical address range). This conveniently fits in with RISC OS’s memory map, where application space starts at the low end of the logical address space (disregarding the 32k of kernel space) and extends to a nice power-of-two 512MB limit. If we can update RISC OS to use this scheme (which is on my todo list, somewhere!) then there won’t be any lazy task swapping overhead, so task switching should be even faster than it currently is.

And how this relates to multi-core is that, in an ideal world, all cores could share the same page table for the upper part of the address space. This should help cut down on memory overhead (level 2 page tables can be several megabytes in size) and make cache/TLB maintenance simpler (I haven’t really looked at the docs much yet, but I believe that cache/TLB maintenance ops are broadcast to all the cores when they are running in SMP mode – so only the core that alters a shared page table would need to perform the cache maintenance operations). Each core can then make use of the TTBR0 register to switch between process-specific lower page tables as required.

If we went down the route of using ‘magic pages’ to point to process-specific information then, in order to make it possible to use the dual page tables on ARMv7, I’d envisage that the magic page would be located somewhere within TTBR0’s range, i.e. either in the lower 32k of RAM or as a page right at the end of application space. If each module is given one word of space within the page then it could use that word to point to process-specific data held elsewhere (i.e. in the RMA).

Of course there’s still the problem of how to handle thread-specific and core-specific information. If we want to have the ability to have the same process active on multiple cores at once (and to use the same page tables) then we can’t store the thread ID/core ID at a fixed location in memory. We’d either have to use ARMv7-specific features directly (stopping software from running on older machines), or wrap the ARMv7 features in SWIs (a good idea, but it could hurt performance for speed-critical software), or start storing the data in memory and use more page tables/RAM as a consequence.

I suppose there is a compromise solution available – share the L2 upper page table but have unique L1 upper page tables. The L1 page table is much smaller than the L2, and by having unique L1 page tables it will easily allow us to have core-specific stacks (and other core-specific data like the current thread ID) without wasting logical address space by making the same data visible to all cores.
Any modifications to the L2 page table would automatically be broadcast to the other cores, so it would only be L1 page table modifications which would need software intervention on all cores.

Also, I suspect most of you have no idea what I just said. Don’t worry, I’m sure I’d find it unintelligible myself if I were to go and read through it again ;) The main thing is that we try and keep an open mind when it comes to the possibility of taking advantage of new features offered by new architecture versions.
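For reference, the thread ID registers mentioned above are a one-instruction read on ARMv7. A sketch using GCC-style inline assembly (encodings per the ARMv7 ARM; wrapping this in a SWI would be the compatibility option):

  /* Read/write the ARMv7 software thread ID register TPIDRURO:
     readable from user mode, writable only in privileged modes -
     a natural home for a kernel-maintained thread/process ID. */
  static inline unsigned current_thread_id(void)
  {
      unsigned id;
      __asm__ volatile ("mrc p15, 0, %0, c13, c0, 3" : "=r" (id));
      return id;
  }

  /* Privileged-mode write, e.g. done by the scheduler on each switch. */
  static inline void set_thread_id(unsigned id)
  {
      __asm__ volatile ("mcr p15, 0, %0, c13, c0, 3" :: "r" (id));
  }

On pre-ARMv7 hardware these instructions are undefined, which is exactly the compatibility problem described above. |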
Jess Hampshire (158) 865 posts |
Would wrapping the features in SWIs prevent anyone who needed to get high performance (without compatibility) from using these features directly? |
Jeffrey Lee (213) 6048 posts |
No, not really. It would just be a bit neater and simpler if everything went through the same interface. |
Trevor Johnson (329) 1645 posts |
Would there be any merit in someone attending this Migrating to Multicore Architectures session at EmbeddedLIVE? Or would it just be repeating what’s already well understood? |
Kuemmel (439) 384 posts |
Some thoughts from me… I used to program a lot on RISC OS in assembly language, but recently also on x86, writing a multi-core Mandelbrot benchmark for Windows (http://www.mikusite.de/).

In the Windows/x86 world the use of multiple cores is quite nicely implemented, as far as I can see from that programming experience. Basically you detect the number of cores and then assign threads to the different cores. There are main functions like “CreateThread”, “SetThreadPriority”, “SetThreadAffinityMask”, “ResumeThread”, ... Furthermore, with a ‘lock’ assembler prefix you can also lock, for example, memory access for a single thread/core when writing to common memory.

So basically it’s a piece of cake to program… but I’ve got no idea how easily this can be ‘translated’ to RISC OS, and of course one would need to adjust all the old software… which then of course leads to the question of how much software is still actively developed, and in what way multithreading can be useful for a given program. For the Mandelbrot it’s perfect – almost 100% scaling for each additional core, as there’s almost no memory access. (Something like the sketch below would be the obvious shape for it.)

I for myself would love to have a dual-core ARM with that kind of implementation… even if it’s just for calculating fractals ;-)
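For what it’s worth, here is roughly what that work-splitting looks like in portable C – pthreads-style rather than the Win32 calls above (UnixLib does provide a pthreads implementation, though today it multiplexes its threads onto a single core):

  /* Split a Mandelbrot render across worker threads, one band of
     rows each - the embarrassingly parallel case described above.
     Assumes a pthreads-style API. */
  #include <pthread.h>

  #define WIDTH    640
  #define HEIGHT   480
  #define NTHREADS 2
  #define BAND     (HEIGHT / NTHREADS)

  static int image[HEIGHT][WIDTH];

  static int mandel_point(int px, int py)
  {
      double cr = (px - WIDTH / 2) * 4.0 / WIDTH;
      double ci = (py - HEIGHT / 2) * 4.0 / WIDTH;
      double zr = 0.0, zi = 0.0;
      int i;
      for (i = 0; i < 255 && zr * zr + zi * zi < 4.0; i++) {
          double t = zr * zr - zi * zi + cr;
          zi = 2.0 * zr * zi + ci;
          zr = t;
      }
      return i;   /* iteration count for this pixel */
  }

  static void *worker(void *arg)
  {
      int band = *(int *)arg;        /* which band of rows is ours */
      for (int y = band * BAND; y < (band + 1) * BAND; y++)
          for (int x = 0; x < WIDTH; x++)
              image[y][x] = mandel_point(x, y);   /* disjoint writes */
      return NULL;
  }

  void render(void)
  {
      pthread_t thread[NTHREADS];
      int       id[NTHREADS];
      for (int i = 0; i < NTHREADS; i++) {
          id[i] = i;
          pthread_create(&thread[i], NULL, worker, &id[i]);
      }
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(thread[i], NULL);   /* wait for every band */
  }

Since each thread writes a disjoint band of the image, no locking is needed – which is exactly why it scales so well. |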