Multi-core and modules with thread-local storage
Julie Stamp (8365) 474 posts |
I’ve got a function, which I’ll simplify a bit and call f. The function uses a global variable y as storage.
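Something like this, say (the body of f here is just a placeholder for the real work):

static int y;        /* global used by f as working storage */

int f(int x)
{
    y = x;           /* imagine some real work here that relies on y */
    return y + 1;
}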
This would work fine in an application, or in a library used by an app. The problem, though, would be if I introduced a second thread. To solve that I’d put a qualifier on y
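i.e. C11’s _Thread_local, so that each thread sees its own copy:

static _Thread_local int y;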
Now imagine that f is actually in a module. The original unqualified version would again work, even if f is called in the background during a foreground call to f, since the foreground only regains control once the background has restored y. It would be a problem, though, if there were tasks running on multiple cores, since they would be able to call f simultaneously. I feel that here I should be able to again qualify y
and fix the problem. In the context of a module, _Thread_local would then refer to OS threads (associated with separate supervisor stacks and running on separate cores). Any module with storage qualified as thread-local would see separate versions of that storage according to which OS thread it is being run in. Hope that makes sense and somebody has an idea about implementation – I saw on the wiki that a processor register might hold the thread number (OS thread rather than application thread???), so maybe that could be used to index into the usual area for static data in a module, but extended with a copy of the thread-local stuff for each core?
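Roughly the sort of address calculation I have in mind (purely illustrative – the function name, offsets and layout are all made up):

#include <stddef.h>

/* If the module's static data area were extended with one copy of the
 * thread-local block per OS thread, a _Thread_local variable could be
 * resolved something like this. thread_number would come from the
 * per-thread processor register mentioned above. */
void *thread_local_address(void *static_base, size_t tls_block_size,
                           size_t var_offset, unsigned thread_number)
{
    char *tls_base = (char *)static_base + thread_number * tls_block_size;
    return tls_base + var_offset;
}
|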
Dave Higton (1515) 3559 posts |
But if that didn’t work, surely nothing would work with multiple threads? |
Paolo Fabio Zaino (28) 1893 posts |
It’s a good idea. If nothing works, then you could use an array of pointers, with a thread id (or something) as the index, and then allocate storage memory for each thread. Assuming you’re using Jeffrey’s SMP (given you mentioned different cores):

#include <stdio.h>
#include <stdlib.h>
#include "kernel.h"
#include "swis.h"
#ifdef _GCC
#include <pthread.h>
#endif

#ifndef ModHandReason_Claim
#define ModHandReason_Claim 6   /* OS_Module reason codes */
#define ModHandReason_Free  7
#endif

#define MAX_THREADS 3

static void *module_pw;
static int **y;    /* per-thread storage pointers, allocated in main() */

void *rma_alloc(size_t s)
{
  void *mem;
  if (_swix(OS_Module, _IN(0)|_IN(3)|_OUT(2), ModHandReason_Claim, s, &mem))
  {
    return NULL;
  }
  return mem;
}

void rma_free(void *ptr)
{
  if (ptr)
  {
    _swix(OS_Module, _IN(0)|_IN(2), ModHandReason_Free, ptr);
  }
}

In a thread:

// Thread function
#ifdef _GCC
void *
#else
_kernel_oserror *
#endif
thread_function(void *arg)
{
  long thread_id = (long)arg;

  // Allocate memory for this thread's data
  y[thread_id] = (int *)rma_alloc(sizeof(int));
  if (y[thread_id] == NULL)
  {
    perror("Failed to allocate memory for thread data");
#ifdef _GCC
    pthread_exit(NULL);
#else
    return NULL;
#endif
  }

  // Example operation on the data
  *y[thread_id] = thread_id * 2;   // Just an example operation
  printf("Thread %ld, data = %d\n", thread_id, *y[thread_id]);

#ifdef _GCC
  pthread_exit(NULL);
#else
  return NULL;
#endif
}

And in your main or initialization function:

int main(void)
{
  y = (int **)rma_alloc(sizeof(int *) * MAX_THREADS);

#ifdef _GCC
  pthread_t threads[MAX_THREADS];
#else
  int threads[MAX_THREADS];
  for (int i = 0; i < MAX_THREADS; i++)
    threads[i] = -1;
#endif

  ...

  // Initialize the global array to NULL
  for (int i = 0; i < MAX_THREADS; ++i)
  {
    y[i] = NULL;
  }

  for (long i = 0; i < MAX_THREADS; ++i)
  {
#ifdef _GCC
    int rc = pthread_create(&threads[i], NULL, thread_function, (void *)i);
    if (rc)
    {
      // handle error
      exit(-1);
    }
#else
    char name[32];
    sprintf(name, "thread%ld", i);   // give the thread a name (contents arbitrary)
    _kernel_oserror *e = _swix(SMP_CreateThread, _INR(0,6)|_OUT(0),
                               name, 0, 0, &y[i], module_pw,
                               thread_function, 0x13, &threads[i]);
    if (e)
    {
      // handle error
      exit(-1);
    }
#endif
  }

#ifdef _GCC
  // Join threads
  for (int i = 0; i < MAX_THREADS; ++i)
  {
    pthread_join(threads[i], NULL);
  }
#endif

  ...

  for (long i = 0; i < MAX_THREADS; ++i)
    rma_free(y[i]);
  rma_free(y);

  return 0;
}

As a bonus, this would allow your code to work with both GCC and DDE, and if you need to “return” the storage content you can. Note: I haven’t tested this, so consider it some form of pseudo code! |
Julie Stamp (8365) 474 posts |
I found examples here and here of creating threads from a C application and a C module. I’m thinking about the case where I’m creating threads in an application and calling a SWI in a C module, which I’d want to be able to register with SMP_RegisterSWIChunk. If I use variables in static storage then I need a way to know what thread I’m in – that’s OK, use SMP_CurrentThread. In module initialisation I could create a copy of the thread-local storage for each thread, but I didn’t find an SMP_ call to count the threads. There would then be a service call to notify of new threads, to get another copy ready and initialised. Maybe another way is to test what thread we are in on entry to the SWI and then, if it’s not one we’ve seen before, get the storage ready (but this could leave copies of the area hanging around??? Would still want a service call for thread end to be able to clean those out?)
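A rough sketch of that second approach (I’m guessing at the SMP_CurrentThread interface – I’ve assumed it’s a SWI returning the current thread handle in R0 and that its number is defined somewhere – and thread_store / storage_for_current_thread are made-up names; rma_alloc is as in the post above):

#include <stddef.h>
#include "kernel.h"
#include "swis.h"

void *rma_alloc(size_t s);        /* RMA allocation helper, as above */

typedef struct thread_store
{
    unsigned int thread;          /* thread handle from SMP_CurrentThread */
    int y;                        /* this thread's copy of the static data */
    struct thread_store *next;
} thread_store;

static thread_store *stores;      /* copies created so far */

/* Called on entry to the SWI handler: find (or lazily create) the
   calling thread's copy of the storage. */
static int *storage_for_current_thread(void)
{
    unsigned int id;
    if (_swix(SMP_CurrentThread, _OUT(0), &id))
        return NULL;

    for (thread_store *s = stores; s != NULL; s = s->next)
        if (s->thread == id)
            return &s->y;

    thread_store *s = rma_alloc(sizeof(thread_store));
    if (s == NULL)
        return NULL;
    s->thread = id;
    s->y = 0;                     /* initialise the fresh copy */
    s->next = stores;             /* NB: needs a lock if two cores can be in   */
    stores = s;                   /* the SWI at once, and nothing here cleans  */
                                  /* up when a thread ends                     */
    return &s->y;
}
|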
Simon Willcocks (1499) 540 posts |
Is there documentation on this threading interface? There’s a function in https://github.com/Simon-Willcocks/SimpleAsPi/blob/main/Processor/processor.h called change_word_if_equal that can be used to create a spin lock between cores. |
Paolo Fabio Zaino (28) 1893 posts |
yup
In the multi-core support code I am writing for my UltimaVM, I count them from the thread initialization code, not inside the thread. The function is uber simple right now: In the module init:
Some support functions (slightly different implementations of Jeffrey’s support functions):
Function to count threads (it’s usually called by main(), but in my VM a thread can generate a new thread, so):

uvm_thread_id *gen_thread_id(void)
{
    static mtx_t mtx = NULL;
    if (!mtx) {
        _kernel_oserror *e = mtx_create(&mtx);   // assuming mtx_create fills in the handle
        // handle error
    }

    // Lock mtx
    while (mtx_lock(mtx) != thrd_success);

    static volatile uvm_thread_id *thread_id = NULL;
    if (!thread_id) {
        thread_id = rma_alloc(sizeof(uvm_thread_id));
        if (thread_id == NULL) {
            // Handle allocation failure (unlock before bailing out)
            while (mtx_unlock(mtx) != thrd_success);
            return NULL;
        }
        *thread_id = 0;   // Initialise the first thread ID
    }
    *thread_id += 1;

    // Unlock and return thread_id
    while (mtx_unlock(mtx) != thrd_success);
    return (uvm_thread_id *)thread_id;
}

This should ensure that thread counting is consistent, even when a new thread is created from a thread (I haven’t started to test SMP code yet, but the code works with pthreads on Linux, BSD, and macOS). I have removed the pthreads ifdefs to make it a bit more readable than the previous one. In my case thread_id is not the same concept as SMP_CurrentThread; thread_id here is shared storage. That value is a thread id within a virtual machine instance, which is used to control all the storage a VM context needs to have access to – the typical vm struct, basically – which in my particular case can spin up completely separate threads in which to execute a function call in a module (this to reinforce security). Context aside, hope it helps a bit with thread counting.
Yes that is what I’d do, but you probably want to do this check within the context of a locked mutex OR be sure your thread counting is thread safe (if you use an array, all you should need is a unique index).
You can clean your local storage areas at the end of your thread and set them back to NULL, so if the thread is restarted, you have that storage location NULL and know you need to reallocate RMA memory for it. In other words, the way I see this, you either use a mutex lock at your thread level, or you use a thread-safe way to calculate the array storage index for a given thread. If you use a separate index for each thread, you can have lock-free thread execution. Now, all of this works if the thread is the only one manipulating the data in the data storage. If instead you need to tell a thread that something else has put data there, you’ll need a condition variable – follow the examples you’ve shared. But if your algorithm is producer/consumer-like, then the producer may loop over the array locations that a thread has set back to NULL, with the producer responsible for allocating RMA space for that specific y[thread_id]; the thread would loop until y[self] is not NULL, process the data, and then set it back to NULL to wait for more data. The condition variable approach would make a thread sleep while waiting for new data, while the busy-loop approach above would make it spin waiting for more data. If you want to save power you’d prefer condition variables. HTH
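Very roughly, the busy-loop variant could look like this (produce and worker are just illustrative names; a real version would want proper memory barriers or atomics rather than plain volatile, and some way for the worker to exit its loop):

#include <stddef.h>

#define MAX_THREADS 3

/* RMA helpers as defined earlier in the thread */
void *rma_alloc(size_t s);
void rma_free(void *ptr);

/* One slot per thread: the producer fills a slot, the owning thread empties it. */
static int * volatile y[MAX_THREADS];

/* Producer side: wait for a thread's slot to be free, then hand over new data. */
void produce(int thread_id, int value)
{
    while (y[thread_id] != NULL)
        ;                              /* spin until the worker has consumed the last item */

    int *data = rma_alloc(sizeof(int));
    if (data == NULL)
        return;                        /* allocation failed */
    *data = value;
    y[thread_id] = data;               /* publishing the pointer hands the data over */
}

/* Worker side: each thread only touches its own slot, so no lock is needed. */
void worker(int self)
{
    for (;;) {
        while (y[self] == NULL)
            ;                          /* spin waiting for the producer */

        int value = *y[self];
        /* ... process value ... */

        rma_free(y[self]);
        y[self] = NULL;                /* signal "ready for more data" */
    }
}
|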
Paolo Fabio Zaino (28) 1893 posts |
@ Simon
I don’t think there is documentation for this yet; you need to look at Jeffrey’s SMP work on gitlab, here: https://gitlab.riscosopen.org/jlee/SMP I am writing my UltimaVM RO SMP code based on his examples and test code. I am sure Jeffrey will be happy to answer questions if you have some. |
Jeffrey Lee (213) 6048 posts |
There are three CPU registers which are intended to be used for thread/process IDs. The SMP module uses one of them itself; the other two are (for now at least) free for use by threads. https://gitlab.riscosopen.org/jlee/SMP/-/blob/SMPthread/docs/SMP#L35 Of course “free for use” doesn’t mean it’s a free-for-all; at some point there’ll have to be some more rules about their usage in order to prevent conflicts between different pieces of software (like a thread-aware SWI which gets called from a different thread-aware module or application). I don’t have a fully thought-through plan for how thread-local storage will work. But these are my current thoughts/ideas:
There’s also the question of whether the TPIDRURO and TPIDRURW registers should be used to store the current process & thread IDs, so that software which wants to know the current process/thread IDs can find out directly instead of having to make expensive function/SWI calls. We probably wouldn’t want to use both of them for that purpose, but maybe it would be worth reserving TPIDRURO for it, e.g. have it point to a small block of memory which has a documented format and contains those two IDs (there’s a sketch of reading such a block at the end of this post). It’s probably the kind of thing we’ll have to feel out as the threading support matures.

“In module initialisation I could create a copy of the thread-local storage for each thread, but I didn’t find an SMP_ call to count the threads. There would then be a service call to notify of new threads, to get another copy ready and initialised.”

At some point we might end up needing service calls (or similar) for thread creation/destruction. We’re already teetering close to that with FPA & VFP support in threads, since they both rely on something calling FPEmulator & VFPSupport to create and destroy the FP contexts. But I’m not sure yet what that should look like.
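For example, reading that hypothetical block from user code might look something like this (GCC inline assembly; the block layout is pure invention – nothing like it is defined yet – but the MRC encoding is the standard one for TPIDRURO, which user mode can read but not write):

#include <stdint.h>

/* Hypothetical layout of the block TPIDRURO might point at */
typedef struct
{
    uint32_t process_id;
    uint32_t thread_id;
} thread_info;

/* TPIDRURO is CP15 c13, c0, 3 */
static inline const thread_info *current_thread_info(void)
{
    const thread_info *info;
    __asm__ volatile ("mrc p15, 0, %0, c13, c0, 3" : "=r" (info));
    return info;
}
|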
Simon Willcocks (1499) 540 posts |
For what it’s worth, my vision of threads and cores is that the floating point stuff should be invisible to the user code; use an FP instruction and you get a storage area allocated and the FP context gets stored along with the integer registers when you’re switched out. |