C Kernel
Rick Murray (539) 13840 posts |
Do you know how Linux (for example) handles kernel modules and multiple cores? I’m just wondering how it is best to share one executable with multiple processors… Of course it’s great that modules can allocate their own (different) workspaces, plus there’s module instantiation already built into the system.
<smirk> Plus there’s Jeffrey’s remark:
I would hope they’re calling them the same thing! One shouldn’t have to learn new sets of terminology for each processor family. If it’s the same thing, call it the same thing.
It’s as easy as a short hop across the Irish Sea… :-/
Well that’s a pretty dumb reason. But given the ending of restrictions, “pretty dumb” seems to be the way things roll. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Not really, multiprocessing didn’t crop up much until this new-fangled millennium thingy. RISC OS is a very different beast.
What I’m expecting is that many modules will be available on every core (like DrawMod, the Font Manager, etc.) and a few will run on just one core, with the other cores running a proxy that synchronises with it in certain circumstances.
For the Wimp, I would have a wrapper module that pretends to be the Wimp, and programs are represented by lightweight tasks on the core running the real thing; these remember which areas of their windows are visible until any task performs an action that affects the window stack. While that’s processed, tasks updating their windows on other cores are told there’s nothing to do, but can continue working (playing the audio to go along with the video, for example). Similarly, a wrapped Font Manager might pause while a new font is installed, but happily continue rendering text on the screen in parallel with other cores most of the time.
There’s also the possibility of implementing multi-threading while filing systems are waiting for hardware to respond. I don’t know the overhead of a “core”, but it might be worthwhile having virtual cores that get swapped in and out when they get blocked. (It would be interesting to see that retro-fitted to old hardware!) |
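A very rough sketch of what that proxy/real-module split could look like in C, using a single shared mailbox and spin-waiting for brevity. Everything here (wimp_mailbox, proxy_call, real_wimp_swi) is hypothetical; a real implementation would queue requests and sleep rather than busy-wait.

```c
/* Rough sketch of the proxy idea: cores that don't host the "real" module
 * forward requests through a shared mailbox and wait for the owning core
 * to process them. All names are hypothetical.
 */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint state;        /* 0 = free, 1 = being filled, 2 = posted, 3 = done */
    uint32_t    swi_number;   /* which operation is being forwarded */
    uint32_t    regs[10];     /* in/out "registers" for the request */
} mailbox_t;

static mailbox_t wimp_mailbox;   /* one per proxied module, in shared memory */

extern void real_wimp_swi(uint32_t swi, uint32_t regs[10]);  /* hypothetical */

/* Called by the wrapper module on a core that only runs the proxy. */
void proxy_call(uint32_t swi, uint32_t regs[10])
{
    unsigned expected = 0;
    while (!atomic_compare_exchange_weak(&wimp_mailbox.state, &expected, 1))
        expected = 0;                                       /* claim the mailbox */

    wimp_mailbox.swi_number = swi;
    for (int i = 0; i < 10; i++) wimp_mailbox.regs[i] = regs[i];
    atomic_store_explicit(&wimp_mailbox.state, 2, memory_order_release);  /* post it */

    while (atomic_load_explicit(&wimp_mailbox.state, memory_order_acquire) != 3)
        ;                                                   /* owning core marks it done */
    for (int i = 0; i < 10; i++) regs[i] = wimp_mailbox.regs[i];
    atomic_store(&wimp_mailbox.state, 0);                   /* free the mailbox */
}

/* Polled (or interrupt-driven) on the core that hosts the real module. */
void service_mailbox(void)
{
    if (atomic_load_explicit(&wimp_mailbox.state, memory_order_acquire) == 2) {
        real_wimp_swi(wimp_mailbox.swi_number, wimp_mailbox.regs);
        atomic_store_explicit(&wimp_mailbox.state, 3, memory_order_release);
    }
}
```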
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
Marking memory as non-executable is the only way to fully prevent instruction fetches. I think it was the SDIO driver where we first ran into the issue – IIRC the CPU was speculatively fetching instructions from a pointer held in a register (presumably under the expectation that the code was about to use that register as a function pointer), but the pointer was actually pointing to the SD controller’s data FIFO. |
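For illustration, this is roughly what a non-executable device mapping looks like at the page-table level, using the ARMv7-A short-descriptor section format (XN is bit 4 of a section entry). A sketch only, not the OS’s actual page-table code.

```c
/* Sketch: an ARMv7 short-descriptor *section* entry for a memory-mapped
 * device, with the XN (execute-never) bit set so the CPU can never
 * speculatively fetch instructions from it. Illustrative only.
 */
#include <stdint.h>

#define L1_SECTION        (2u << 0)    /* descriptor type: 1MiB section */
#define L1_SECTION_XN     (1u << 4)    /* execute-never */
#define L1_SECTION_B      (1u << 2)    /* with TEX=000, C=0: Device memory */
#define L1_SECTION_AP_RW  (1u << 10)   /* AP[0]: privileged read/write */

static inline uint32_t device_section_entry(uint32_t phys_addr)
{
    /* Section entries map 1MiB, so only bits 31:20 of the address are used. */
    return (phys_addr & 0xFFF00000u)
         | L1_SECTION
         | L1_SECTION_XN
         | L1_SECTION_B
         | L1_SECTION_AP_RW;
}
```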
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
That’s just nasty! I have separate routines for mapping devices and normal memory, so it shouldn’t be a problem. (Although I see I’ve left in code to map MiB memory-mapped devices, and I should make it clear in general whether a mapping should be executable or not; TBD.) |
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
Another thing to be careful of: When making memory non-cacheable, there’ll be a period of time where the TLB says the memory is non-cacheable but the cache still contains data. If code accesses that memory, it’s implementation-defined whether the CPU will use the data that’s in the cache or ignore the cache and access memory directly. This essentially makes the memory access unpredictable, because reads might bypass the cache and see the out-of-date data that’s in main memory, and writes might bypass the cache and then get overwritten with older data when the cache line gets flushed. This is the main reason why OS_Memory 19 was introduced as an alternative to the OS_Memory 0 ‘enable/disable cacheability’ functionality. Also while re-reading the notes about that, I’ve been reminded that LDREX/STREX are only guaranteed to work on cacheable memory. |
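For context, LDREX/STREX are the sort of thing you’d use for a kernel spinlock, which is why the “cacheable memory only” restriction matters. A minimal ARMv7 sketch, assuming GCC-style inline assembly; illustrative only.

```c
/* A minimal LDREX/STREX spinlock – exactly the kind of code that must live
 * in (and operate on) cacheable memory per the note above.
 */
#include <stdint.h>

static inline void spin_lock(volatile uint32_t *lock)
{
    uint32_t tmp;
    do {
        __asm__ volatile ("ldrex %0, [%1]" : "=&r"(tmp) : "r"(lock));
        if (tmp != 0)
            continue;                        /* already held, try again */
        __asm__ volatile ("strex %0, %2, [%1]"
                          : "=&r"(tmp) : "r"(lock), "r"(1u) : "memory");
    } while (tmp != 0);                      /* STREX writes 0 on success */
    __asm__ volatile ("dmb" ::: "memory");   /* acquire barrier */
}

static inline void spin_unlock(volatile uint32_t *lock)
{
    __asm__ volatile ("dmb" ::: "memory");   /* release barrier */
    *lock = 0;
}
```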
||||||||||||||||||||||||
Colin Ferris (399) 1814 posts |
Nice to see things moving on :-) Can one of the cores run 64bit code whilst the others run 32bit? |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Yes, I found that one for myself.
Not yet, that would involve starting up in aarch64 EL3, and working down. But that’s what my other project does, so it could be made to work. |
||||||||||||||||||||||||
Theo Markettos (89) 919 posts |
Linux, macOS, etc. are heavily multithreaded at the kernel level. There are hundreds or thousands of threads that may be ready to do work, and the ones that aren’t blocked get scheduled across any of the available cores. AIUI, in general all the kernel threads live in the same address space (modules aren’t isolated), so ‘scheduling’ means setting the registers and PC to be the ones for the next thread and letting it run, with a timer to preempt the thread if it’s been running too long.
User-mode threading is different in that it requires setting up the MMU with the appropriate page table base beforehand (and, if there are changes, shooting down TLB entries from a previous mapping in other cores). There are a lot of tweaks the scheduler can do to get maximum performance – for example scheduling threads on colder cores so they can make best use of turbo clock headroom, and reducing the load on hot cores to let them cool down again – which means the load tends to jump from core to core to even up the heating.
Then frequently you have work queues. You want to do some I/O, so you put the job on a work queue and block the application. The device driver thread sees its queue is non-empty and becomes schedulable. It generates an I/O request and puts the job on another queue of pending requests. Some time later an interrupt comes back to say the device has completed a request. The interrupt handler works out which request from the pending queue is complete and notes that in the device driver’s work queue. The device driver then returns the data to the application and marks that the application can proceed. Next scheduling round, the application may be able to be scheduled and so make some progress. (This is a bit of a caricature since, as ever, real life is more complicated – many device driver stacks have multiple layers.) |
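As a toy illustration of that flow – the names and the pthread plumbing are purely illustrative, not anything from a real driver stack:

```c
/* Toy version of the work-queue flow described above: the application posts
 * a request and blocks; a driver thread picks it up and "submits" it; the
 * completion path wakes the application.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct io_request {
    struct io_request *next;
    void              *buffer;
    bool               done;
    pthread_cond_t     completed;
} io_request;

static pthread_mutex_t queue_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_nonempty = PTHREAD_COND_INITIALIZER;
static io_request     *work_queue;        /* requests the driver hasn't seen yet */

/* Application side: queue the job, then block until it completes. */
void blocking_read(void *buffer)
{
    io_request req = { .buffer = buffer, .done = false };
    pthread_cond_init(&req.completed, NULL);

    pthread_mutex_lock(&queue_lock);
    req.next = work_queue;
    work_queue = &req;
    pthread_cond_signal(&queue_nonempty);    /* driver thread becomes runnable */
    while (!req.done)
        pthread_cond_wait(&req.completed, &queue_lock);
    pthread_mutex_unlock(&queue_lock);
    pthread_cond_destroy(&req.completed);
}

/* Completion side (interrupt handler / driver): mark the request done. */
void complete_request(io_request *req)
{
    pthread_mutex_lock(&queue_lock);
    req->done = true;
    pthread_cond_signal(&req->completed);
    pthread_mutex_unlock(&queue_lock);
}

/* Driver thread: wait for work, pretend to do the I/O, complete it. */
void *driver_thread(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (work_queue == NULL)
            pthread_cond_wait(&queue_nonempty, &queue_lock);
        io_request *req = work_queue;
        work_queue = req->next;
        pthread_mutex_unlock(&queue_lock);

        /* ...submit to hardware; normally completion arrives later via IRQ... */
        complete_request(req);
    }
    return NULL;
}
```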
||||||||||||||||||||||||
Jon Abbott (1421) 2651 posts |
That would be fine as a quick workaround, but in reality you would want to set the video memory cacheable/bufferable for the OS and have a thread clean the screen range at VSync. Write-through can cause a large performance hit. How the screen is cleaned should be down to the task doing the writing. The Wimp, for example, where it’s writing to the visible buffer, might opt to clean the whole screen range by MVA at VSync, whereas a game might opt to use a back-buffer and clean that prior to swapping buffers. From the description of the demo, as it’s performing animation I think you’d want the four threads writing to a back-buffer and a fifth thread that cleans the cache and swaps screen buffers at VSync to hide the writes. And because the behaviour of buffer swapping is essentially “implementation defined”, you must use triple buffering to ensure you’re always writing to a back-buffer regardless of the GPU.
Is this not avoided by invalidating the TLB entry and performing a clean/invalidate on the range it covers? In my experience, ARM’s lack of atomic operations for combined cache/TLB operations does make cache/TLB management somewhat of a headache, as the core can speculatively go off and do the one thing you don’t want it to. I suppose you could perform every operation twice just to be sure! |
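A sketch of the back-buffer approach described above, assuming ARMv7, a 64-byte data cache line, and a hypothetical swap_to_buffer() that reprograms the display controller; illustrative only.

```c
/* Clean the back buffer by MVA, then swap at the next VSync.
 * frame_buffer, frame_bytes and swap_to_buffer are hypothetical.
 */
#include <stdint.h>

#define CACHE_LINE 64                        /* assumed line size */

extern void     swap_to_buffer(int index);   /* hypothetical: point scanout at buffer */
extern uint8_t *frame_buffer[3];             /* triple buffering, as suggested above */
extern uint32_t frame_bytes;

static inline void clean_dcache_range(void *start, uint32_t bytes)
{
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)start + bytes;
    for (; addr < end; addr += CACHE_LINE) {
        /* DCCMVAC: clean data cache line by MVA to point of coherency */
        __asm__ volatile ("mcr p15, 0, %0, c7, c10, 1" :: "r"(addr) : "memory");
    }
    __asm__ volatile ("dsb" ::: "memory");   /* make the cleans visible */
}

void present_frame(int back)
{
    clean_dcache_range(frame_buffer[back], frame_bytes);
    swap_to_buffer(back);                    /* takes effect at the next VSync */
}
```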
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
When making memory non-cacheable, there’ll be a period of time where the TLB says the memory is non-cacheable but the cache still contains data
No. Those operations do need to be performed, but there’s no way of doing them in an atomic manner, so there will always be a period of time where the cache/TLB are dangerously out of sync with the page tables. It’s only explicit memory accesses which will be adversely affected by the inconsistency, so if it’s known that nothing is going to explicitly access the memory until everything is in sync again, then there’s nothing to worry about. But if you can’t guarantee that nothing’s going to be accessing the memory, then the OS has to take matters into its own hands. This is what OS_Memory 0 does; it disables IRQs & FIQs to prevent all unexpected accesses, and it also detects when the request is trying to alter the cacheability of the SVC stack and will make that safe by switching to IRQ mode so that the routine uses the IRQ stack instead. However there are some bits which aren’t dealt with: altering cacheability of kernel workspace or the page tables, and for SMP it needs to make sure the other cores are put into a safe state where they’re prevented from accessing the page.
Performing everything twice won’t help. The key thing is to make sure the operations are performed in the right order, with synchronisation after each step to ensure the previous one has completed before you start the next. So when making memory non-cacheable, you have the following sequence:
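(Sketched here for ARMv7, for a single entry – an illustration of the ordering rather than the OS’s actual code: write the new non-cacheable entry first, sync, invalidate the TLB entry, sync, then clean and invalidate the cache, with a final sync. Real code loops over the affected range, one cache line at a time.)

```c
/* Illustrative ARMv7 sequence for making a mapping non-cacheable. */
#include <stdint.h>

static inline void make_uncacheable(volatile uint32_t *pte, uint32_t new_entry,
                                    uint32_t mva)
{
    *pte = new_entry;                          /* 1. write the new, non-cacheable entry */
    __asm__ volatile ("dsb" ::: "memory");     /* 2. sync: the table walker must see it */
    __asm__ volatile ("mcr p15, 0, %0, c8, c7, 1" :: "r"(mva));  /* 3. TLBIMVA */
    __asm__ volatile ("dsb" ::: "memory");     /* 4. sync: TLB invalidate complete */
    __asm__ volatile ("isb" ::: "memory");
    __asm__ volatile ("mcr p15, 0, %0, c7, c14, 1" :: "r"(mva) : "memory"); /* 5. DCCIMVAC */
    __asm__ volatile ("dsb" ::: "memory");     /* 6. final sync before anything touches it */
}
```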
Although according to ARM, even that isn’t perfect; the ARM ARM talks of using a “break before make” strategy where before you write the new page table entry, you overwrite the old entries with faulting entries (and do an extra sync/flush + TLB invalidate). This avoids the situation where some cores will be using the old values while others will be using the new values, but I’ll admit I don’t fully understand why it’s needed or whether it has any other impact on the sequences (and the OS doesn’t yet use this break-before-make strategy). |
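For concreteness, a break-before-make version of the same update might look like this – again an illustrative ARMv7 sketch, not what the OS currently does: the entry is first replaced by a faulting one and flushed from the TLB, so no observer can ever see old and new translations mixed, and only then is the new entry installed.

```c
/* Illustrative ARMv7 break-before-make update of a single entry. */
#include <stdint.h>

static inline void break_before_make(volatile uint32_t *pte, uint32_t new_entry,
                                     uint32_t mva)
{
    *pte = 0;                                   /* "break": a fault-generating entry */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("mcr p15, 0, %0, c8, c7, 1" :: "r"(mva));  /* drop the old TLB entry */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("isb" ::: "memory");

    *pte = new_entry;                           /* "make": install the new mapping */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("isb" ::: "memory");
}
```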
||||||||||||||||||||||||
Jon Abbott (1421) 2651 posts |
Quite. The lack of atomic operations means most cache/write buffer/TLB operations require a sync immediately after. It’s best to simply assume all need a sync.
Didn’t the break-before-make guidance come about because of the way Linux maps its kernel memory to random addresses for each task? I believe it’s to ensure you don’t end up with two TLB entries for the same memory…so more security related than an actual sync issue. That would be my guess at any rate. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
I think I’m beginning to see a pattern: (1) they introduce a feature, (2) after a few versions/years, they assume the feature is being used. The OS doesn’t use ASIDs yet, AFAICS. It might be worthwhile. |
||||||||||||||||||||||||
Jon Abbott (1421) 2651 posts |
I agree. It doesn’t solve the atomicity issue, but it would improve task switching time. Chalk it up on the “let’s drop old CPU support” board. Although I’ve never read up on it in detail, if I understand correctly you change the TTBR pointer, invalidate the TLB and sync several times, instead of modifying the global page table when changing context. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Yes, I think you clear out the translation table which refers to the Wimp task’s memory, then switch the ASID in TTBR0. ASIDs are 16-bit these days, so there can easily be one per task. If you’re feeling lucky, ISTR there’s a bit you can set that will trigger an event on translation table walks, and do the clearing out then (it might have a positive effect on switching to small polling tasks). (Work on the project is stalled for a few weeks, I’m afraid.) |
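For what it’s worth, on AArch64 (where the 16-bit ASIDs live, assuming TCR_EL1.AS is set) the switch itself can be as small as a single TTBR0_EL1 write, since TLB entries are tagged with the ASID and no flush is needed. The task structure below is hypothetical; this is a sketch, not the C kernel’s actual code.

```c
/* AArch64 sketch: ASID-tagged address-space switch via TTBR0_EL1. */
#include <stdint.h>

struct task {
    uint64_t l1_table_pa;   /* physical address of this task's translation table */
    uint16_t asid;          /* 16-bit ASID, assuming TCR_EL1.AS = 1 */
};

static inline void switch_address_space(const struct task *t)
{
    /* TTBR0_EL1[63:48] = ASID, lower bits = table base address. */
    uint64_t ttbr0 = ((uint64_t)t->asid << 48)
                   | (t->l1_table_pa & 0x0000FFFFFFFFFFFFull);
    __asm__ volatile ("msr ttbr0_el1, %0\n\tisb" :: "r"(ttbr0) : "memory");
}
```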
||||||||||||||||||||||||
Jake Hamby (8915) 21 posts |
That sounds about right. I spent some time looking at the KVM module on PowerPC while playing with QEMU-KVM on my PowerMac G5 Quad (Linux/ppc has an emulated hypervisor module called KVM-PR that handles the supervisor mode of the emulated PowerPC and runs the guest VM in user mode). I learned two clever algorithms from this adventure.
First, the Linux kernel extensively uses a data structure called RCU (read-copy-update) for synchronized access to read-mostly data structures in an SMP environment. It eliminates blocking for all readers, and shortens the length of time that a writer may have to wait for readers within the RCU area to finish. Lockless data structures are hot these days, and while they don’t work for everything, RCU is a good example of a mostly lockless algorithm.
Related to that is io_uring, which is a way for user programs to do asynchronous I/O efficiently without having to perform system calls unless you want to wait for something to arrive. It’s implemented with two ring buffers, one in each direction, again designed for lockless communication. Feel free to use either or both of those algorithms in your OS, since I believe neither is patented.
I’m a big fan of async I/O, something that UNIX has historically been bad at (until Linux io_uring, which has yet to be widely adopted, although QEMU is starting to use it). I think UNIX and Java went in the wrong direction in the 1990s with the paradigm of creating thread pools (or process pools in the case of Apache using fork()/exec() on UNIX) to handle multiple connections, with each thread blocking on reads from a single socket. At the very least, you waste 1MB or so per thread of virtual address space for thread stacks, and at least 4K per thread of physical memory for those stacks.
Async I/O is more challenging to write code for because it’s so heavily callback-driven, and that’s why we’re starting to see it built directly into the language, first with C# and now with Swift (except you’ll need the version of macOS/iOS that hasn’t shipped yet if you want to use Apple’s version). Kotlin’s coroutines and RxJava are similar attempts to make lambda / coroutine callbacks easier to reason about (although I found RxJava to be immensely confusing when I used it at a previous job). |
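For reference, the single-producer/single-consumer ring idea behind io_uring can be sketched in a few lines of C11: the producer only ever writes the head index and the consumer only ever writes the tail, so no locks are needed. This is a generic illustration, not io_uring’s actual layout.

```c
/* Minimal single-producer / single-consumer lockless ring buffer. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 256                     /* must be a power of two */

typedef struct {
    _Atomic uint32_t head;                   /* written only by the producer */
    _Atomic uint32_t tail;                   /* written only by the consumer */
    uint64_t         slots[RING_ENTRIES];    /* e.g. completion tokens */
} spsc_ring;

static bool ring_push(spsc_ring *r, uint64_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_ENTRIES)
        return false;                        /* ring is full */
    r->slots[head & (RING_ENTRIES - 1)] = value;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(spsc_ring *r, uint64_t *value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;                        /* ring is empty */
    *value = r->slots[tail & (RING_ENTRIES - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```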
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Sorry to have disappeared for so long. I’ve got the cog demo working, albeit with major hacks to bring up the screen, find modules in the ROM, etc. I’ve decided, for the time being, to link in all but the bottom 64k of an existing ROM, so that I can be reasonably sure that I haven’t broken any existing code. The replacement kernel goes where the HAL was. I want to get a few more basic modules started before I start working on accessing the hardware properly.
One thing I’ve been wondering about is whether it would make sense to create a whitelist of SWIs that may be used by an interrupt routine. The Draw module, for example, checks the value of IRQsema and generates an error if it’s called while an interrupt is being handled, which seems ridiculous. |
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
A whitelist probably won’t be sufficient – there are likely to be a number of SWIs which are semi-functional when called from IRQ handlers. E.g. the SWI might allow you to read the state of something, but not write/modify the state. So the alternative would be to add a SWI to the kernel which reports the current execution context (foreground, IRQ, abort handler, etc.) – Draw can then be modified to call that SWI instead of peeking IRQsema. Since multiple states could occur at the same time (e.g. an abort within an IRQ handler), it might make sense for the SWI to return a series of flags (“in IRQ”, “in abort”, etc.) rather than trying to boil the states down to a simple priority order. |
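As an illustration of what that could look like from C – the SWI itself, its flag bits, and the read_execution_context() veneer below are all hypothetical:

```c
/* Sketch of the flags a "report execution context" SWI might return.
 * No such SWI exists yet; names and bit assignments are hypothetical.
 */
#include <stdint.h>

enum {
    CONTEXT_IRQ      = 1u << 0,   /* inside an IRQ handler */
    CONTEXT_FIQ      = 1u << 1,   /* inside a FIQ handler */
    CONTEXT_ABORT    = 1u << 2,   /* inside an abort handler */
    CONTEXT_CALLBACK = 1u << 3,   /* inside a transient callback */
};

extern uint32_t read_execution_context(void);   /* hypothetical veneer over the SWI */

int draw_can_run_now(void)
{
    /* Draw would test the relevant flags instead of peeking IRQsema. */
    return (read_execution_context()
            & (CONTEXT_IRQ | CONTEXT_FIQ | CONTEXT_ABORT)) == 0;
}
```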
||||||||||||||||||||||||
Stuart Swales (8827) 1357 posts |
In this case it’s really wanting to check whether ScratchSpace is available. That should be separated away from the “in IRQ” state. |
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
I haven’t looked into the “rules” over who gets to use ScratchSpace when, but my gut feeling is that it’s a horrible thing that’s now completely irrelevant and we should try to get rid of it. I know we won’t be able to completely get rid of it (there are likely to be old programs which use it), but if its use can be weeded out of the OS then that should make things a lot easier. Otherwise it’s likely to start causing problems as we work on threads & multi-core support. |
||||||||||||||||||||||||
Stuart Swales (8827) 1357 posts |
Absolutely. It was only really used because we had 16KB otherwise kicking around going to waste on very-small-memory systems. |
||||||||||||||||||||||||
Jan Rinze (235) 368 posts |
Very interesting results. |
||||||||||||||||||||||||
Jan Rinze (235) 368 posts |
Simon has helped out with building the kernel from the GitHub sources. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Per the PRMs: “During initialisation, your module is not on the active module list, and so you cannot call SWIs in your own SWI chunk. Instead you must directly enter your own code.” What I’m noticing is that the BufferManager, during its initialisation, is sending out a Service_BufferStarting, and three places seem to respond with a Buffer_Register, which fail because the module is not yet initialised. Is this:
|
||||||||||||||||||||||||
Julie Stamp (8365) 474 posts |
4. It doesn’t send it out during its initialisation, it sends it out in a callback just after its initialisation. See here |
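For illustration, deferring the service call with a transient callback might look roughly like this in the DDE. OS_AddCallBack takes the callback address in R0 and an R12 value in R1; the module-specific names here are hypothetical, and in real code the callback entry would be a CMHG-generated veneer rather than a plain C function.

```c
/* Sketch: queue a transient callback from a module's initialisation entry,
 * so the service call is issued after initialisation has completed.
 */
#include "kernel.h"
#include "swis.h"

extern void buffer_starting_callback(void);   /* hypothetical; really a CMHG veneer
                                                  that issues Service_BufferStarting */

_kernel_oserror *module_init(const char *cmd_tail, int podule_base, void *pw)
{
    (void)cmd_tail; (void)podule_base;
    /* The module isn't on the active list yet, so don't issue the service
     * call here; queue a callback to run once the system is back in a safe
     * state after initialisation. */
    return _swix(OS_AddCallBack, _INR(0,1), buffer_starting_callback, pw);
}
```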
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Ah, so my kernel is calling callbacks before it ought to. In this case, before the initialisation routine returns. Is there a rule for when callbacks get called? Is it only when returning to USR mode or when idling? (Now I type that, it seems to ring a bell!) I was calling them on return from any SWI. Thank you! Edit: fixed it, tried it, it seems to work much better, thanks again. |