Safeguarding the past, present and future of RISC OS for everyone

News | Downloads | Bugs | Bounties | Forums | Library

Forums → Code review →

Multi-core and modules with thread-local storage

9 posts, 5 voices

Jan 20, 2024 1:45pm Julie Stamp (8365) 480 posts	I’ve got a function, which I’ll simplify a bit and call f. The function uses a global variable y as storage. `static int y;` `int f(int x) { int result;` `int save = y; /* Save / y = x;` `y = y + 1; / Could call f here / y = y y` `result = y y = save; /* Restore */ }` This would work fine in an application or library in an app. The problem though would be if I introduced a second thread. To solve that I’d put a qualifier `static _Thread_local int y;` Now imagine that f is actually in a module. The original unqualified version would again work, even if f is called in the background during a foreground call to f, since the foreground only regains control once the background has restored y. It would be a problem though if there were tasks running on multiple cores, since they would be able to simultaenously call f. I feel that here I should be able to again qualify y `static _Thread_local int y;` and fix the problem. In the context of a module then _Thread_local would refer to OS threads (associated with separate supervisor stacks and running on separate cores). Any module with storage qualified as thread-local would see separate versions of that storage according to which OS thread it is being run it. Hope that makes sense and somebody has an idea about implementation – saw on the wiki that a processor register might hold the thread number (OS thread rather than application thread???) so maybe that could be used to index into the usual area for static data in a module but extended with a copy of the thread-local stuff for each core?

Jan 20, 2024 4:21pm Dave Higton (1515) 3592 posts	But if that didn’t work, surely nothing would work with multiple threads?

Jan 21, 2024 1:16am Paolo Fabio Zaino (28) 1933 posts	static _Thread_local int y; It’s a good idea. If nothing works, then you could use an array of pointers and as the index a thread id (or something) and then allocate storage memory for each thread. Assuming you’re using Jeffrey’s SMP (given you mentioned different cores): #define MAX_THREADS 3 static void module_pw; void rma_alloc(size_t s) { void mem; if (_swix(OS_Module, _IN(0)\|_IN(3)\|_OUT(2), ModHandReason_Claim, s, &mem)) { return NULL; } return mem; } void rma_free(void ptr) { if (ptr) { _swix(OS_Module, _IN(0)\|_IN(2), ModHandReason_Free, ptr); } } In a thread: // Thread function #if _GCC void * #else _kernel_oserror * #endif thread_function(void arg) { long thread_id = (long)arg; // Allocate memory for this thread's data y[thread_id] = (int )rma_alloc(sizeof(int)); if (y[thread_id] == NULL) { perror("Failed to allocate memory for thread data"); #ifdef _GCC pthread_exit(NULL); #else return NULL; #endif } // Example operation on the data y[thread_id] = thread_id 2; // Just an example operation printf("Thread %ld, data = %d\n", thread_id, y[thread_id]); #ifdef _GCC pthread_exit(NULL); #else return NULL; #endif } And in your main or initialization function: int main() { int y; y = (int )rma_alloc(sizeof(int)MAX_THREADS); #ifdef _GCC pthread_t threads[MAX_THREADS]; #else int threads[MAX_THREADS]; for (int I = 0; i < MAX_THREADS; I++) threads[I] = -1; #endif ... // Initialize the global array to NULL for (int i = 0; i < MAX_THREADS; ++i) { y[i] = NULL; } for (long i = 0; i < MAX_THREADS; ++i) { #ifdef _GCC int rc = pthread_create(&threads[i], NULL, thread_function, (void )i); if (rc) { // handle error exit(-1); } #else char name[32]; _kernel_oserror e = _swix(SMP_CreateThread, _INR(0,6)\|_OUT(0), name, 0, 0, &y[I], module_pw, thread_function, 0x13, &threads[i]); if (e) { // handle error exit(-1); } #endif } #if _GCC // Join threads for (int i = 0; i < MAX_THREADS; ++i) { pthread_join(threads[i], NULL); } #endif ... for (long i = 0; i < MAX_THREADS; ++i) rma_free(y[i]); rma_free(y); As a bonus, this would allow your code to work with both GCC and DDE and if you need to “return” the storage content you can. Note: I haven’t tested this, so consider it some form of pseudo code!

Jan 21, 2024 11:54am Julie Stamp (8365) 480 posts	I found examples here and here of creating threads from a C application and C module. I thinking about if I’m creating threads in an application and calling a SWI in a C module, wanting to be able to register it with SMP_RegisterSWIChunk. If I use variables in static storage then I need a way to know what thread I’m in – that’s ok use SMP_CurrentThread. In module initialisation I could create copy of thread-local storage for each thread, but I didn’t find an SMP_ call to count the threads. There would then be a service call to notify of new threads to get another copy ready an initialised. Maybe another way is to test what thread we are in on entry to the SWI and then if it’s not one we’ve seen before, get the storage ready (but this could leave copies of the area hanging around??? Would still want a service call for thread end to be able to clean those out?)

Jan 21, 2024 2:21pm Simon Willcocks (1499) 552 posts	Is there documentation on this threading interface? There’s a function in https://github.com/Simon-Willcocks/SimpleAsPi/blob/main/Processor/processor.h called change_word_if_equal that can be used to create a spin lock between cores.

Jan 21, 2024 3:33pm Paolo Fabio Zaino (28) 1933 posts	that’s ok use SMP_CurrentThread yup In module initialisation I could create copy of thread-local storage for each thread, but I didn’t find an SMP_ call to count the threads. In the multi-core support code I am writing for my UltimaVM, I count them from the thread initialization code, not inside the thread. The function is uber simple right now: In the module init: … module_pw = pw; synclib_init(); … Some support functions (slightly different implementations of Jeffrey’s support functions): kernel_oserror mtx_create(mixt mtx) { _kernel_oserror e = swix(SMPCreateMutex, IN\|OUT, 0, &mtx→mtx); return e; } kernel_oserror mtx_destroy(mtxt mtx) { _kernel_oserror e = swix(SMPDestroyMutex, _IN(0), mtx→mtx); return e; } int mtx_lock(mtx_t mtx) { // This is identical to Jeffrey’s code } int mtx_unlock(mtx_t mtx) { // Also this one is identical to Jeffrey’s code } Function to count threads (it’s usually called by main(), but in my VM a thread can generate a new thread, so: uvm_thread_id gen_thread_id() { static mix_t mtx = NULL; if (!mtx) { _kernel_oserror e = mtx_create(mtx); // handle error } // Lock mtx while (mtx_lock(mtx) != thrd_success); static volatile uvm_thread_id thread_id = NULL; if (!thread_id) { thread_id = rma_alloc(sizeof(uvm_thread_id)); if (thread_id == NULL) { // Handle allocation failure return 0; } thread_id = 0; // Initialize the first thread ID } *thread_id += 1; // Unlock and return thread_id while (mtx_unlock(mtx) != thrd_success); return thread_id; } This should ensure that thread counting is consistent, even when a new thread is created from a thread (I haven’t started to test SMP code yet, but the code works with pthreads on Linux, BSD, and macOS) I have removed pthreads ifdefs to make it a bit more readable than previous one. In my case thread_id, is not the same concept as SMP_CurrentThread, thread_id, in this case is a shared storage. that value is a thread id within a virtual machine instance, which is used to control all the storage a VM context need to have access to, so the typical vm struct basically, which in my particular case can spin completely separated threads in which to execute a function call in a module (this to reinforce security). Beside of the context , hope it helps a bit with thread counting. Maybe another way is to test what thread we are in on entry to the SWI and then if it’s not one we’ve seen before, get the storage ready (but this could leave copies of the area hanging around??? Yes that is what I’d do, but you probably want to do this check within the context of a locked mutex OR be sure your thread counting is thread safe (if you use an array, all you should need is a unique index). Would still want a service call for thread end to be able to clean those out? You can clean your local storage areas at the end of your thread and set it back to NULL, so if the thread is restarted, you have that storage location NULL and know you need to reallocate RMA memory for it. In other words, the way I see this, you either use a mutex lock at your thread level, or you use a thread safe way to calculate the array storage index for a given thread. if you use a separate index for each thread, you can have lock-free thread execution. Now all of this works if the thread is the only one manipulating the data in the data storage. If instead you need to tell a thread that something else has put data there, you’ll need a condition variable and follow the examples you’ve shared. But, if your algorithm is a producer/consumer like, then the producer may loop for the array locations that are set back to NULL from a thread and the producer would be responsible for allocating RMA space for that specific y[thread_id], and the thread would loop for having y[self] not NULL, process the data and then set it back to NULL to wait for more data. The condition variable approach would make a thread sleep while waiting for new data, while the busy loop approach above would make it spin waiting for more data. If you want to save power you prefer conditional variables. HTH

Jan 21, 2024 3:37pm Paolo Fabio Zaino (28) 1933 posts	@ Simon Is there documentation on this threading interface? I don’t think there is yet documentation for this, you need to look at Jeffrey’s SMP work on gitlab, here: https://gitlab.riscosopen.org/jlee/SMP I am writing my UltimaVM RO SMP code based on his examples and test code. I am sure Jeffrey will be happy to answer questions if you have some.

Jan 21, 2024 3:47pm Jeffrey Lee (213) 6048 posts	Hope that makes sense and somebody has an idea about implementation – saw on the wiki that a processor register might hold the thread number (OS thread rather than application thread???) so maybe that could be used to index into the usual area for static data in a module but extended with a copy of the thread-local stuff for each core? There are three CPU registers which are intended to be used for thread/process IDs. The SMP module uses one of them itself, the other two are (for now at least) free for use by threads. https://gitlab.riscosopen.org/jlee/SMP/-/blob/SMPthread/docs/SMP#L35 Of course “free for use” doesn’t mean it’s a free-for-all, at some point there’ll have to be some more rules about their usage in order to prevent conflicts between different pieces of software (like a thread-aware SWI which gets called from a different thread-aware module or application). I don’t have a fully thought-through plan for how thread-local storage will work. But these are my current thoughts/ideas: `Thread_local` will require compiler support, so any initial CLib implementation I come up with will be aiming for the C11 tss* functions instead I still need to check to make sure, but I think there are only a handful of places where application code can be called in to. Specifically, it’s rare for a SWI to directly call into application code. So it’s feasible to make a rule that TPIDRURW should be reserved for private use by the current application. This will give different application runtimes the freedom to provide optimal thread-local storage implementations. E.g.: The compiler/linker could place all `_Thread_local` variables into a single AREA / linker section When a thread is created the CLib/UnixLib/whatever runtime can allocate a block of memory large enough to store all the thread-locals and set TPIDRURW to point to it Whenever a `_Thread_local` variable is accessed the compiler can emit code which reads TPIDRURW and accesses the correct offset in the buffer Thread-local storage for modules is tricky because there are so many potential entry points from external code. Updating those entry points to replace the TPIDRURO / TPIDRURW registers with “internal” versions suitable for the module’s runtime would probably add a significant amount of overhead, especially since there’s no guarantee the code being called is actually going to need to make use of them. So instead it might be better for modules to rely on dynamic lookup of thread-local storage, managed by the SMP module. E.g. the SMP module will provide SWI equivalents of the C11 tss_create and tss_delete functions, which will allow for system-wide registration of thread-local storage, to be used with arbitrary OS-level threads. tss_get & tss_set can be handled either by SWIs or direct calls into the SMP module. There’s also the question of whether the TPIDRURO and TPIDRURW registers should be used to store the current process & thread IDs, so that software which wants to know the current process/thread IDs can find out directly instead of having to make expensive function/SWI calls. We probably wouldn’t want to use both of them for that purpose, but maybe it would be worth reserving TPIDRURO for it, e.g. have it point to a small block of memory which has a documented format and contains those two IDs. It’s probably the kind of thing we’ll have to feel out as the threading support matures. In module initialisation I could create copy of thread-local storage for each thread, but I didn’t find an SMP_ call to count the threads. There would then be a service call to notify of new threads to get another copy ready an initialised. Maybe another way is to test what thread we are in on entry to the SWI and then if it’s not one we’ve seen before, get the storage ready (but this could leave copies of the area hanging around??? Would still want a service call for thread end to be able to clean those out?) At some point we might end up needing service calls (or similar) for thread creation/destruction. We’re already teetering close to that with FPA & VFP support in threads, since they both rely on something calling FPEmulator & VFPSupport to create and destroy the FP contexts. But I’m not sure yet what that should look like.

Jan 22, 2024 10:32pm Simon Willcocks (1499) 552 posts	For what it’s worth, my vision of threads and cores is that the floating point stuff should be invisible to the user code; use an FP instruction and you get a storage area allocated and the FP context gets stored along with the integer registers when you’re switched out.

Reply

To post replies, please first log in.

Forums → Code review →

Search forums

Social

Follow us on

and

ROOL Store

Buy RISC OS Open merchandise here, including SD cards for Raspberry Pi and more.

Donate! Why?

Help ROOL make things happen – please consider donating!

RISC OS IPR

RISC OS is an Open Source operating system owned by RISC OS Developments Ltd and licensed primarily under the Apache 2.0 license.

Description

Developer peer review of proposed code alterations.

Voices

Options

Forums
Login

Contact Us | About Us

The RISC OS Open Beast theme is based on Beast's default layout
Site design © RISC OS Open Limited 2024 except where indicated

Hosted by Arachsys

Powered by Beast © 2006 Josh Goebel and Rick Olson
This site runs on Rails