Thinking ahead: Supporting multicore CPUs
Steffen Huber (91) 1953 posts |
FPEPC by WSS: http://wss.co.uk/products.html#FPEPC |
Jeffrey Lee (213) 6048 posts |
Over Christmas I was able to spend a bit more time working on SMP-related things. Compared to this post the status is that:
As things progress I’m starting to get a better idea of the challenges and design decisions that we’ll be facing when updating the OS. So at some point I’ll probably have to put together a doc which summarises all of the issues and potential solutions. I was hoping that I’d be able to get things to the point where I’d be able to submit the pending kernel + BCM HAL changes to CVS (i.e. I’d be confident that the changes are sensible and aren’t going to break things for regular users), but since there’s a lot of room for experimentation with different solutions to problems I suspect the better option would be to submit the changes to a new branch. |
Anthony Vaughan Bartram (2454) 458 posts |
This sounds really cool Jeffrey. Although I’d love to have a go at running/reviewing the test app in any case. Thanks, |
George T. Greenfield (154) 748 posts |
Very encouraging that you’re making progress in this key area Jeffrey. All being well I should have a ‘spare’ Pi2 available for use running test modules/ROMs if and when required in a few weeks. |
Kuemmel (439) 384 posts |
Sounds great, Jeffrey! Can you explain how you will deal with shared data or memory access between the different tasks running on different CPUs? For example, what if more than one task needs information from another one in the form of a changing variable or memory address? In the x86 world a “lock” prefix was invented to prevent any “leakage” in assembly language. After googling for something similar in the ARM world, it seems that there’s LDREX and STREX… is that the way to go? It would be nice to have some basic example code to learn from once you are ready. |
Jeffrey Lee (213) 6048 posts |
There are actually a bunch of instructions that you need to use.
Correct use of the above instructions is difficult – if you’ve got access to the ARMv7 ARM have a read through the “barrier litmus tests” chapter for examples of how things can go wrong. So “basic example code” is likely to take the form of “here, use this C library which implements high-level synchronisation primitives” rather than examples of how to use the underlying instructions directly :-) (e.g. SyncLib has spinlock and mutex implementations which contain all the necessary barriers to make them suitable for general-purpose use. But if you use the barrier or CPU event functions directly then extra care is necessary.) |
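As a rough illustration of the "use a library primitive rather than the raw instructions" advice, here is a minimal spinlock sketch written with C11 atomics. It is not SyncLib's API: the names demo_spinlock_t, demo_spin_lock, demo_spin_trylock and demo_spin_unlock are hypothetical, and it assumes a C11 compiler that provides <stdatomic.h> (for example a recent GCC). On ARMv7 such a compiler would typically turn these operations into exclusive load/store loops plus barriers, so the source never has to place them by hand.

```c
#include <stdatomic.h>

/* Hypothetical spinlock type for illustration; not SyncLib's API. */
typedef struct {
    atomic_flag locked;
} demo_spinlock_t;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock. Returns 1 on success, 0 if already held. */
int demo_spin_trylock(demo_spinlock_t *lock)
{
    /* Acquire ordering: accesses inside the critical section cannot be
       reordered to before the point where the lock is taken. */
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Spin until the lock is taken. */
void demo_spin_lock(demo_spinlock_t *lock)
{
    while (!demo_spin_trylock(lock)) {
        /* Busy-wait. A real implementation would probably idle the core
           here (e.g. with a CPU event wait) instead of spinning flat out. */
    }
}

/* Release the lock. */
void demo_spin_unlock(demo_spinlock_t *lock)
{
    /* Release ordering: writes made while holding the lock become
       visible before the lock is seen as free by another core. */
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

A caller would wrap each access to the shared variable in demo_spin_lock / demo_spin_unlock. The acquire/release pairing is what replaces the hand-written barriers; getting that pairing wrong is exactly the kind of mistake the litmus tests describe.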
Clive Semmens (2335) 3276 posts |
“Barrier Litmus Tests” – ooh, nice! That wasn’t in the original wot I was involved in writing. Had huge fun* with writing scraps of code to test my understanding of these instructions, just to be sure I was getting the instruction documentation right…and no, don’t rely on me as an expert on them after all these years…the silicon I was running them on wasn’t even production silicon… * according to my dictionary, this means “trouble” |
Jeffrey Lee (213) 6048 posts |
Over here is an archive containing the source for my current HAL+kernel changes and the SMP module. There’s also a prebuilt Pi ROM + SMP module so that people can experiment with it without needing to have a working ROM build environment, and a readme covering the functionality of the SMP module. Everything’s still very much in the prototype phase, but it has reached the point where there’s a useful microkernel running on the other cores:
Also there’s no sample code, so unless you’re planning on writing something yourself or reviewing the code there’s probably no point downloading the archive. I’m not sure yet what the next step from here is going to be – there are a lot of areas that could be worked on, and a lot of scope to have people working on things in parallel. So if you’re interested in getting involved, watch this space for future information – but also feel free to poke holes in the code I’ve posted above. |
Rick Murray (539) 13840 posts |
Well… damn! That’s all I can say. |
Anthony Vaughan Bartram (2454) 458 posts |
Well this is sort of like Hello World (assuming the use of the prototype RISCOS.IMG on a Pi 2, with the SMP module loaded beforehand via rmload smp). Probably not cleaning up but I need to read the readme again….

    #include <stdio.h>
    #include <time.h>
    #include "kernel.h"
    #include "swis.h"

    /* Threading example. Sets caller parameter to '1' if thread is executed. */

    typedef int thread_t;

    #define USR 16
    #define SMP_CreateThread 0x0c1242

    // n.b. This should really thunk through assembler to set up a stack etc.
    // But as the example entry point only sets a value & for simplicity
    // this is not implemented.
    //
    thread_t ThreadCreate(const char * name, void * entry, void * param)
    {
      thread_t tid = -1;
      _kernel_swi_regs regs;
      _kernel_oserror *err;

      regs.r[0] = (int)name;
      regs.r[1] = 0;          // Affinity mask
      regs.r[2] = 0;          // Pollword
      regs.r[3] = (int)param; // Parameter
      regs.r[4] = 0;
      regs.r[5] = (int)entry;
      regs.r[6] = USR;
      err = _kernel_swi(SMP_CreateThread, &regs, &regs);
      if (err == NULL)
        tid = regs.r[0];      // Only trust R0 if the SWI did not return an error
      return tid;
    }

    int ThreadEntry(void * param)
    {
      volatile int * out = (volatile int *)param;
      (*out) = 1;
      return 0;
    }

    int main(int argc, char * argv[])
    {
      time_t start;

      start = time(NULL);
      printf("Start Time %ld\n\n", (long)start);
      printf("Attempting to invoke thread\n");
      {
        volatile int rtn = 0;
        thread_t t = -1;

        t = ThreadCreate("first", (void*)&ThreadEntry, (void*)&rtn);
        while (time(NULL) < (start + 2)); // Crude wait of up to 2 seconds
        printf("tid = %x rtn = %d\n", t, rtn);
        if (rtn == 1)
          printf("Thread was called successfully\n");
      }
      return 0;
    }
|
Anthony Vaughan Bartram (2454) 458 posts |
Worth being careful with the parameter address. As documented in the readme, memory allocated in application space sits at the same addresses for every task in the WIMP… So anything derived from the example above should really be run outside the WIMP, otherwise you can get errors that are visible with SMPMetrics. |
Jeffrey Lee (213) 6048 posts |
HTML <pre> tags usually work best for me.
Yes, a call to SMP_DestroyThread is required to clean up a thread after it’s exited/terminated. A couple of other things to be wary of with that code:
|
Anthony Vaughan Bartram (2454) 458 posts |
Hi Jeffrey, I’ll correct the example above at least to include volatile and return an int (will add thread destruction later). I’ll try out the pre tag. Thanks, Tony |
Kuemmel (439) 384 posts |
Hi Anthony, would that C code also work if converted to BASIC? Any chance you could do that, with e.g. 2 tasks being created on 2 cores that each do something, like a small calculation or just printing some text individually… so we would have a nice Hello World for both main coding platforms on RISC OS that encourages people to start coding… including myself, being a lousy single-task coder… |
Anthony Vaughan Bartram (2454) 458 posts |
Hi Kuemmel, Tony |
Kuemmel (439) 384 posts |
@Jeffrey, @Anthony: Maybe BBC BASIC could start itself individually as a thread when code is started on each core, and run BASIC code… maybe it’s too insane to believe it would work. But how is that done when I start, let’s say, 2 or more WIMP BASIC programs under RISC OS now, without the support of multiple cores? Is each WIMP program starting a new BBC BASIC interpreter to run? If that is the case, I would guess this could be assigned to cores somehow? |
David Feugey (2125) 2709 posts |
Hum, or a cut down version of BBC Basic that could run on other cores? |
Rick Murray (539) 13840 posts |
No. The BASIC interpreter stores “state” in the area between LOMEM and PAGE, and I think some more up around HIMEM? I’m sure somebody will correct me here.

Anyway, what happens is exactly what happens with C programs. The “state” of the program is held within its own application space. As such, when the Wimp is polled the registers are saved and the entire application memory is paged out [1]. Another application is paged in [1] and the registers for that app are restored before the Wimp_Poll call of the new application returns, and everything just carries on from where it left off.

The view from the application: Calls Wimp_Poll, it’ll return with an event that needs to be handled.

The view from the Wimp: Application called Wimp_Poll. Shuffle it out of the way and find the next app. Repeat in a round-robin fashion, noting if the app is using PollIdle or doesn’t want specific events (like Null polls).

To consider: Between your app calling Wimp_Poll and the SWI returning to you, it is entirely possible that fifty other apps have polled fifty hundred thousand times and twenty minutes have gone by.
I think the problem that is going to rear its head the most with multi-core work is exactly how much of the internals of RISC OS are not re-entrant. You cannot access files on TickerV or CallAfter/CallEvery. You must use those to trigger a Callback (which is invoked when the system is next “not busy”, so could be any time really…) and then perform the file access.
To be honest, that would be kind of pointless, don’t you think? A better idea, and one that might be doable (just), is to have a version of BASIC that implements a TUBE-like interface to something (module?) on RISC OS Prime. The other-core program (and not just BASIC) will be stalled awaiting the host module doing its thing and sending data (if any) back to the other core. In this way, other cores can access the usual system facilities, they just might take a speed hit awaiting the opportunity. And, yes, this does mean RISC OS Prime will be doing work for the other cores. Trust me, you only want one thing talking to devices in a controlled manner. Anything else is going to be messy.

I believe Jeffrey mentioned a TaskWindow that uses other cores? If so, that’s sort of like what I’m thinking of, and could well be a good start. The RISC OS mini kernels on the other cores should, necessarily, do as much as they can for themselves (otherwise it kind of ruins the point of having multiple cores to use), but they should absolutely NOT be afraid of kicking more complicated things back to the host. There’s only one keyboard, only one screen, so be like TUBE and let the host deal with that and the co-pro (co-core?) programs deal with them via the host.

[1] Lazy task swapping uses some sort of mechanism to only page in partial bits of memory. I’m not sure of the exact method; suffice to say that while whole-appspace swapping was acceptable for older, slower machines with small apps (remember, the MEMC with 4MB installed would use 32K pages), modern ARMs with 4K pages repeatedly swapping 10-20MB appspace… was rather painful. |
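The TUBE-like arrangement described above could be sketched as a single-slot mailbox shared between one worker core and the host. This is only an illustration under stated assumptions: the names mailbox_t, mbox_init, mbox_post, mbox_service, mbox_collect, DEMO_OP_DOUBLE and demo_host_handler are all hypothetical, nothing here comes from the SMP module, and it assumes exactly one worker and one host using a C11 compiler with <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Slot states: who currently owns the mailbox contents. */
enum { MBOX_EMPTY, MBOX_REQUEST, MBOX_REPLY };

/* Hypothetical request code, standing in for a real host service
   such as "write this text to the screen". */
enum { DEMO_OP_DOUBLE = 1 };

typedef struct {
    atomic_int state;   /* MBOX_EMPTY, MBOX_REQUEST or MBOX_REPLY */
    uint32_t   op;      /* request code, written by the worker    */
    uint32_t   arg;     /* request argument, written by the worker */
    uint32_t   result;  /* reply value, written by the host        */
} mailbox_t;

void mbox_init(mailbox_t *m)
{
    atomic_init(&m->state, MBOX_EMPTY);
    m->op = m->arg = m->result = 0;
}

/* Worker core: publish a request. Returns 0 if the slot is busy. */
int mbox_post(mailbox_t *m, uint32_t op, uint32_t arg)
{
    if (atomic_load_explicit(&m->state, memory_order_acquire) != MBOX_EMPTY)
        return 0;
    m->op  = op;
    m->arg = arg;
    /* Release: the host must see op/arg before it sees MBOX_REQUEST. */
    atomic_store_explicit(&m->state, MBOX_REQUEST, memory_order_release);
    return 1;
}

/* Host: service one pending request, if any. Returns 1 if it did. */
int mbox_service(mailbox_t *m, uint32_t (*handler)(uint32_t, uint32_t))
{
    if (atomic_load_explicit(&m->state, memory_order_acquire) != MBOX_REQUEST)
        return 0;
    m->result = handler(m->op, m->arg);
    atomic_store_explicit(&m->state, MBOX_REPLY, memory_order_release);
    return 1;
}

/* Worker core: collect the reply, if ready. Returns 1 and frees the slot. */
int mbox_collect(mailbox_t *m, uint32_t *result)
{
    if (atomic_load_explicit(&m->state, memory_order_acquire) != MBOX_REPLY)
        return 0;
    *result = m->result;
    atomic_store_explicit(&m->state, MBOX_EMPTY, memory_order_release);
    return 1;
}

/* Example host-side handler: the only place that would touch devices. */
uint32_t demo_host_handler(uint32_t op, uint32_t arg)
{
    return (op == DEMO_OP_DOUBLE) ? arg * 2u : 0u;
}
```

The worker stalls (or does other work) between mbox_post and a successful mbox_collect, which is the "speed hit awaiting the opportunity" mentioned above. The host would call mbox_service at a point where it is safe to use the non-re-entrant parts of the OS.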
Anthony Vaughan Bartram (2454) 458 posts |
Hi Rick, David, Kuemmel et al, This thread is (and possibly has been through most of its life) an enjoyable list of speculative and exploratory ideas. However, I recommend re-reading Jeffrey’s post. It outlines a threading scheme with the option to list & permit safe SWI calls. Development tasks are being tracked outside of this thread in order to action a series of changes to incrementally build a multi-core capability. In its simplest form, launching threads to perform background tasks is useful. Ideally, auto-scheduling safe portions of a process on different cores is feasible. This is consistent with other OSes which abstract a process away from its execution core/processor. If you would like to help perform some of this work, please reach out to Jeffrey and myself and individual tasks could be assigned to you. Thanks, Tony |
David Feugey (2125) 2709 posts |
Perhaps, but in fact, what you suggest is what I was thinking about. |
rob andrews (112) 200 posts |
Hi Jeffrey, I see that you are online, so can I ask whether you are going to do an SMP module for OMAP5, or will the current one work with your changes to the kernel? |
Jeffrey Lee (213) 6048 posts |
The plan is to have it so that the same SMP module can work with all of the multi-core machines. But in order to support this, each machine will need to have some changes made to its HAL. At the moment I’m trying to get OMAP4 working, so I can test out a new way of managing interrupts. There are also a couple of DMA/memory management things that need fixing before multi-core code can safely be used on other machines (the Pi and OMAP4 are fine since they don’t make much use of DMA). Once that’s all done it shouldn’t take much effort to get the other machines working (the changes that are being made to the OMAP4 HAL will be almost identical to the changes that will need making to the other HALs, since they all use the same type of interrupt controller). |
rob andrews (112) 200 posts |
Good news. I look forward to testing it when it becomes available. |
Jeffrey Lee (213) 6048 posts |
Does anyone have any suggestions for some test code? I need something that I can use to detect any cache maintenance/memory management bugs – i.e. have a test harness which repeatedly runs the code on the other core(s) and watches for any failures while the main core messes with the system. Specifically, I’m looking for an algorithm which will touch a reasonably large amount of memory (lots of reading & writing, repeated reads & writes of the same location, etc.), will run for a reasonable amount of time (e.g. at least a second or two), and produces the same result when given the same input. Further constraints are:
|
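One possible shape for such a test kernel, offered only as a sketch: fill a buffer from a seeded pseudo-random generator, repeatedly read and rewrite pseudo-randomly chosen locations, then fold the whole buffer into a checksum. The same seed, size and round count always give the same checksum, so a harness could compare the value from another core against a reference computed on the main core. The names stress_next and stress_run are hypothetical, and since the further constraints are not listed here, this may not satisfy all of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32 generator: small, deterministic, no library state.
   The state must be non-zero. */
uint32_t stress_next(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill buf[0..words-1], perform `rounds` read-modify-write steps on
   pseudo-random locations, and return a checksum of the final contents.
   Identical arguments always produce an identical result. */
uint32_t stress_run(uint32_t *buf, size_t words, uint32_t seed,
                    uint32_t rounds)
{
    uint32_t state = seed ? seed : 0x9E3779B9u; /* avoid the zero state */
    uint32_t sum = 2166136261u;                 /* FNV-1a style start value */
    size_t i;

    assert(buf != NULL && words > 0);

    /* Phase 1: sequential writes over the whole buffer. */
    for (i = 0; i < words; i++)
        buf[i] = stress_next(&state);

    /* Phase 2: scattered reads and writes, often hitting the same
       locations again, so stale cache lines would change the result. */
    while (rounds--) {
        size_t a = stress_next(&state) % words;
        size_t b = stress_next(&state) % words;
        buf[a] = buf[a] * 1664525u + buf[b];
        buf[b] ^= buf[a] >> 7;
    }

    /* Phase 3: sequential reads, folding every word into the checksum. */
    for (i = 0; i < words; i++)
        sum = (sum ^ buf[i]) * 16777619u;

    return sum;
}
```

The buffer size and round count can be scaled until a run takes a second or two on the target. All arithmetic is unsigned, so the overflow behaviour is well defined and the result should not depend on compiler options.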
Alan Robertson (52) 420 posts |
Would a Fast Fourier Transform algorithm be of use here? |