C Kernel
Rick Murray (539) 13840 posts |
Do you know how Linux (for example) handles kernel modules and multiple cores? I’m just wondering how it is best to share one executable with multiple processors… Of course it’s great that modules can allocate their own (different) workspaces, plus there’s module instantiation already built into the system.
<smirk> Plus there’s Jeffrey’s remark:
I would hope they’re calling them the same thing! One shouldn’t have to learn new sets of terminology for each processor family. If it’s the same thing, call it the same thing.
It’s as easy as a short hop across the Irish Sea… :-/
Well that’s a pretty dumb reason. But given the ending of restrictions, “pretty dumb” seems to be the way things roll. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Not really, multiprocessing didn’t crop up much until this new-fangled millennium thingy. RISC OS is a very different beast.
What I’m expecting is that many modules will be available on every core (like DrawMod, the Font Manager, etc.) and a few will run on just one core, with the other cores running a proxy that synchronises with it in certain circumstances.
For the Wimp, I would have a wrapper module that pretends to be the Wimp, and programs are represented by lightweight tasks on the core running the real thing; these remember which areas of their windows are visible until any task performs an action that affects the window stack. While that’s processed, tasks updating their windows on other cores are told there’s nothing to do, but can continue working (playing the audio to go along with the video, for example). Similarly, a wrapped Font Manager might pause while a new font is installed, but happily continue rendering text on the screen in parallel with other cores most of the time.
There’s also the possibility of implementing multi-threading while filing systems are waiting for hardware to respond. I don’t know the overhead of a “core”, but it might be worthwhile having virtual cores that get swapped in and out when they get blocked. (It would be interesting to see that retro-fitted to old hardware!) |
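A very rough sketch of what that proxy/real-module split could look like in C, using a single shared mailbox and spin-waiting for brevity. Everything here (wimp_mailbox, proxy_call, real_wimp_swi) is hypothetical; a real implementation would queue requests and sleep rather than busy-wait.

```c
/* Rough sketch of the proxy idea: cores that don't host the "real" module
 * forward requests through a shared mailbox and wait for the owning core
 * to process them. All names are hypothetical.
 */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint state;        /* 0 = free, 1 = being filled, 2 = posted, 3 = done */
    uint32_t    swi_number;   /* which operation is being forwarded */
    uint32_t    regs[10];     /* in/out "registers" for the request */
} mailbox_t;

static mailbox_t wimp_mailbox;   /* one per proxied module, in shared memory */

extern void real_wimp_swi(uint32_t swi, uint32_t regs[10]);  /* hypothetical */

/* Called by the wrapper module on a core that only runs the proxy. */
void proxy_call(uint32_t swi, uint32_t regs[10])
{
    unsigned expected = 0;
    while (!atomic_compare_exchange_weak(&wimp_mailbox.state, &expected, 1))
        expected = 0;                                       /* claim the mailbox */

    wimp_mailbox.swi_number = swi;
    for (int i = 0; i < 10; i++) wimp_mailbox.regs[i] = regs[i];
    atomic_store_explicit(&wimp_mailbox.state, 2, memory_order_release);  /* post it */

    while (atomic_load_explicit(&wimp_mailbox.state, memory_order_acquire) != 3)
        ;                                                   /* owning core marks it done */
    for (int i = 0; i < 10; i++) regs[i] = wimp_mailbox.regs[i];
    atomic_store(&wimp_mailbox.state, 0);                   /* free the mailbox */
}

/* Polled (or interrupt-driven) on the core that hosts the real module. */
void service_mailbox(void)
{
    if (atomic_load_explicit(&wimp_mailbox.state, memory_order_acquire) == 2) {
        real_wimp_swi(wimp_mailbox.swi_number, wimp_mailbox.regs);
        atomic_store_explicit(&wimp_mailbox.state, 3, memory_order_release);
    }
}
```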
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
Marking memory as non-executable is the only way to fully prevent instruction fetches. I think it was the SDIO driver where we first ran into the issue – IIRC the CPU was speculatively fetching instructions from a pointer held in a register (presumably under the expectation that the code was about to use that register as a function pointer), but the pointer was actually pointing to the SD controller’s data FIFO. |
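For illustration, this is roughly what a non-executable device mapping looks like at the page-table level, using the ARMv7-A short-descriptor section format (XN is bit 4 of a section entry). A sketch only, not the OS’s actual page-table code.

```c
/* Sketch: an ARMv7 short-descriptor *section* entry for a memory-mapped
 * device, with the XN (execute-never) bit set so the CPU can never
 * speculatively fetch instructions from it. Illustrative only.
 */
#include <stdint.h>

#define L1_SECTION        (2u << 0)    /* descriptor type: 1MiB section */
#define L1_SECTION_XN     (1u << 4)    /* execute-never */
#define L1_SECTION_B      (1u << 2)    /* with TEX=000, C=0: Device memory */
#define L1_SECTION_AP_RW  (1u << 10)   /* AP[0]: privileged read/write */

static inline uint32_t device_section_entry(uint32_t phys_addr)
{
    /* Section entries map 1MiB, so only bits 31:20 of the address are used. */
    return (phys_addr & 0xFFF00000u)
         | L1_SECTION
         | L1_SECTION_XN
         | L1_SECTION_B
         | L1_SECTION_AP_RW;
}
```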
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
That’s just nasty! I have separate routines for mapping devices and normal memory, so it shouldn’t be a problem. (Although I see I’ve left in code to map MiB memory-mapped devices, and I should make it clear in general whether a mapping should be executable or not; TBD.) |
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
Another thing to be careful of: When making memory non-cacheable, there’ll be a period of time where the TLB says the memory is non-cacheable but the cache still contains data. If code accesses that memory, it’s implementation-defined whether the CPU will use the data that’s in the cache or ignore the cache and access memory directly. This essentially makes the memory access unpredictable, because reads might bypass the cache and see the out-of-date data that’s in main memory, and writes might bypass the cache and then get overwritten with older data when the cache line gets flushed. This is the main reason why OS_Memory 19 was introduced as an alternative to the OS_Memory 0 ‘enable/disable cacheability’ functionality. Also while re-reading the notes about that, I’ve been reminded that LDREX/STREX are only guaranteed to work on cacheable memory. |
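For context, LDREX/STREX are the sort of thing you’d use for a kernel spinlock, which is why the “cacheable memory only” restriction matters. A minimal ARMv7 sketch, assuming GCC-style inline assembly; illustrative only.

```c
/* A minimal LDREX/STREX spinlock – exactly the kind of code that must live
 * in (and operate on) cacheable memory per the note above.
 */
#include <stdint.h>

static inline void spin_lock(volatile uint32_t *lock)
{
    uint32_t tmp;
    do {
        __asm__ volatile ("ldrex %0, [%1]" : "=&r"(tmp) : "r"(lock));
        if (tmp != 0)
            continue;                        /* already held, try again */
        __asm__ volatile ("strex %0, %2, [%1]"
                          : "=&r"(tmp) : "r"(lock), "r"(1u) : "memory");
    } while (tmp != 0);                      /* STREX writes 0 on success */
    __asm__ volatile ("dmb" ::: "memory");   /* acquire barrier */
}

static inline void spin_unlock(volatile uint32_t *lock)
{
    __asm__ volatile ("dmb" ::: "memory");   /* release barrier */
    *lock = 0;
}
```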
||||||||||||||||||||||||
Colin Ferris (399) 1814 posts |
Nice to see things moving on :-) Can one of the cores run 64bit code whilst the others run 32bit? |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Yes, I found that one for myself.
Not yet, that would involve starting up in aarch64 EL3, and working down. But that’s what my other project does, so it could be made to work. |
||||||||||||||||||||||||
Theo Markettos (89) 919 posts |
Linux, macOS, etc. are heavily multithreaded at the kernel level. There are hundreds or thousands of threads that may be ready to do work, and the ones that aren’t blocked get scheduled across any of the available cores. AIUI, in general all the kernel threads live in the same address space (modules aren’t isolated), so ‘scheduling’ means setting the registers and PC to be the ones for the next thread and letting it run, with a timer to preempt the thread if it’s been running too long.
User-mode threading is different in that it requires setting up the MMU with the appropriate page table base beforehand (and, if there are changes, shooting down TLB entries from a previous mapping in other cores). There are a lot of tweaks the scheduler can do to get maximum performance – for example scheduling threads on colder cores so they can make best use of turbo clock headroom, and reducing the load on hot cores to let them cool down again – which means the load tends to jump from core to core to even up the heating.
Then frequently you have work queues. You want to do some I/O, so you put the job on a work queue and block the application. The device driver thread sees its queue is non-empty and becomes schedulable. It generates an I/O request and puts the job on another queue of pending requests. Some time later an interrupt comes back to say the device has completed a request. The interrupt handler works out which request from the pending queue is complete and notes that in the device driver’s work queue. The device driver then returns the data to the application and marks that the application can proceed. Next scheduling round, the application may be able to be scheduled and so make some progress. (This is a bit of a caricature since, as ever, real life is more complicated – many device driver stacks have multiple layers.) |
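As a toy illustration of that flow – the names and the pthread plumbing are purely illustrative, not anything from a real driver stack:

```c
/* Toy version of the work-queue flow described above: the application posts
 * a request and blocks; a driver thread picks it up and "submits" it; the
 * completion path wakes the application.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct io_request {
    struct io_request *next;
    void              *buffer;
    bool               done;
    pthread_cond_t     completed;
} io_request;

static pthread_mutex_t queue_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_nonempty = PTHREAD_COND_INITIALIZER;
static io_request     *work_queue;        /* requests the driver hasn't seen yet */

/* Application side: queue the job, then block until it completes. */
void blocking_read(void *buffer)
{
    io_request req = { .buffer = buffer, .done = false };
    pthread_cond_init(&req.completed, NULL);

    pthread_mutex_lock(&queue_lock);
    req.next = work_queue;
    work_queue = &req;
    pthread_cond_signal(&queue_nonempty);    /* driver thread becomes runnable */
    while (!req.done)
        pthread_cond_wait(&req.completed, &queue_lock);
    pthread_mutex_unlock(&queue_lock);
    pthread_cond_destroy(&req.completed);
}

/* Completion side (interrupt handler / driver): mark the request done. */
void complete_request(io_request *req)
{
    pthread_mutex_lock(&queue_lock);
    req->done = true;
    pthread_cond_signal(&req->completed);
    pthread_mutex_unlock(&queue_lock);
}

/* Driver thread: wait for work, pretend to do the I/O, complete it. */
void *driver_thread(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (work_queue == NULL)
            pthread_cond_wait(&queue_nonempty, &queue_lock);
        io_request *req = work_queue;
        work_queue = req->next;
        pthread_mutex_unlock(&queue_lock);

        /* ...submit to hardware; normally completion arrives later via IRQ... */
        complete_request(req);
    }
    return NULL;
}
```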
||||||||||||||||||||||||
Jon Abbott (1421) 2651 posts |
That would be fine as a quick workaround, but in reality you would want to set the video memory cacheable/bufferable for the OS and have a thread clean the screen range at VSync. Write-through can cause a large performance hit. How the screen is cleaned should be down to the task doing the writing. The Wimp, for example, where it’s writing to the visible buffer, might opt to clean the whole screen range by MVA at VSync, whereas a game might opt to use a back-buffer and clean that prior to swapping buffers. From the description of the demo, as it’s performing animation I think you’d want the four threads writing to a back-buffer and a fifth thread that cleans the cache and swaps screen buffers at VSync to hide the writes. And because the behaviour of buffer swapping is essentially “implementation defined”, you must use triple buffering to ensure you’re always writing to a back-buffer regardless of the GPU.
Is this not avoided by invalidating the TLB entry and performing a clean/invalidate on the range it covers? In my experience, ARM’s lack of atomic operations for combined cache/TLB operations does make cache/TLB management somewhat of a headache, as the core can speculatively go off and do the one thing you don’t want it to. I suppose you could perform every operation twice just to be sure! |
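A sketch of the back-buffer approach described above, assuming ARMv7, a 64-byte data cache line, and a hypothetical swap_to_buffer() that reprograms the display controller; illustrative only.

```c
/* Clean the back buffer by MVA, then swap at the next VSync.
 * frame_buffer, frame_bytes and swap_to_buffer are hypothetical.
 */
#include <stdint.h>

#define CACHE_LINE 64                        /* assumed line size */

extern void     swap_to_buffer(int index);   /* hypothetical: point scanout at buffer */
extern uint8_t *frame_buffer[3];             /* triple buffering, as suggested above */
extern uint32_t frame_bytes;

static inline void clean_dcache_range(void *start, uint32_t bytes)
{
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)start + bytes;
    for (; addr < end; addr += CACHE_LINE) {
        /* DCCMVAC: clean data cache line by MVA to point of coherency */
        __asm__ volatile ("mcr p15, 0, %0, c7, c10, 1" :: "r"(addr) : "memory");
    }
    __asm__ volatile ("dsb" ::: "memory");   /* make the cleans visible */
}

void present_frame(int back)
{
    clean_dcache_range(frame_buffer[back], frame_bytes);
    swap_to_buffer(back);                    /* takes effect at the next VSync */
}
```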
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
When making memory non-cacheable, there’ll be a period of time where the TLB says the memory is non-cacheable but the cache still contains data
No. Those operations do need to be performed, but there’s no way of doing them in an atomic manner, so there will always be a period of time where the cache/TLB are dangerously out of sync with the page tables. It’s only explicit memory accesses which will be adversely affected by the inconsistency, so if it’s known that nothing is going to explicitly access the memory until everything is in sync again, then there’s nothing to worry about. But if you can’t guarantee that nothing’s going to be accessing the memory, then the OS has to take matters into its own hands. This is what OS_Memory 0 does; it disables IRQs & FIQs to prevent all unexpected accesses, and it also detects when the request is trying to alter the cacheability of the SVC stack and will make that safe by switching to IRQ mode so that the routine uses the IRQ stack instead. However there are some bits which aren’t dealt with: altering cacheability of kernel workspace or the page tables, and for SMP it needs to make sure the other cores are put into a safe state where they’re prevented from accessing the page.
Performing everything twice won’t help. The key thing is to make sure the operations are performed in the right order, with synchronisation after each step to ensure the previous one has completed before you start the next. So when making memory non-cacheable, you have the following sequence:
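(Sketched here for ARMv7, for a single entry – an illustration of the ordering rather than the OS’s actual code: write the new non-cacheable entry first, sync, invalidate the TLB entry, sync, then clean and invalidate the cache, with a final sync. Real code loops over the affected range, one cache line at a time.)

```c
/* Illustrative ARMv7 sequence for making a mapping non-cacheable. */
#include <stdint.h>

static inline void make_uncacheable(volatile uint32_t *pte, uint32_t new_entry,
                                    uint32_t mva)
{
    *pte = new_entry;                          /* 1. write the new, non-cacheable entry */
    __asm__ volatile ("dsb" ::: "memory");     /* 2. sync: the table walker must see it */
    __asm__ volatile ("mcr p15, 0, %0, c8, c7, 1" :: "r"(mva));  /* 3. TLBIMVA */
    __asm__ volatile ("dsb" ::: "memory");     /* 4. sync: TLB invalidate complete */
    __asm__ volatile ("isb" ::: "memory");
    __asm__ volatile ("mcr p15, 0, %0, c7, c14, 1" :: "r"(mva) : "memory"); /* 5. DCCIMVAC */
    __asm__ volatile ("dsb" ::: "memory");     /* 6. final sync before anything touches it */
}
```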
Although according to ARM, even that isn’t perfect; the ARM ARM talks of using a “break before make” strategy where before you write the new page table entry, you overwrite the old entries with faulting entries (and do an extra sync/flush + TLB invalidate). This avoids the situation where some cores will be using the old values while others will be using the new values, but I’ll admit I don’t fully understand why it’s needed or whether it has any other impact on the sequences (and the OS doesn’t yet use this break-before-make strategy). |
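For concreteness, a break-before-make version of the same update might look like this – again an illustrative ARMv7 sketch, not what the OS currently does: the entry is first replaced by a faulting one and flushed from the TLB, so no observer can ever see old and new translations mixed, and only then is the new entry installed.

```c
/* Illustrative ARMv7 break-before-make update of a single entry. */
#include <stdint.h>

static inline void break_before_make(volatile uint32_t *pte, uint32_t new_entry,
                                     uint32_t mva)
{
    *pte = 0;                                   /* "break": a fault-generating entry */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("mcr p15, 0, %0, c8, c7, 1" :: "r"(mva));  /* drop the old TLB entry */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("isb" ::: "memory");

    *pte = new_entry;                           /* "make": install the new mapping */
    __asm__ volatile ("dsb" ::: "memory");
    __asm__ volatile ("isb" ::: "memory");
}
```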
||||||||||||||||||||||||
Jon Abbott (1421) 2651 posts |
Quite. The lack of atomic operations means most cache/write buffer/TLB operations require a sync immediately after. It’s best to simply assume all need a sync.
Didn’t the break-before-make guidance come about because of the way Linux maps its kernel memory to random addresses for each task? I believe it’s to ensure you don’t end up with two TLB entries for the same memory…so more security related than an actual sync issue. That would be my guess at any rate. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
I think I’m beginning to see a pattern: (1) they introduce a feature, (2) after a few versions/years, they assume the feature is being used. The OS doesn’t use ASIDs yet, AFAICS. It might be worthwhile. |
||||||||||||||||||||||||
Jon Abbott (1421) 2651 posts |
I agree. It doesn’t solve the atomicity issue, but it would improve task switching time. Chalk it up on the “let’s drop old CPU support” board. Although I’ve never read up on it in detail, if I understand correctly you change the TTBR pointer, invalidate the TLB and sync several times, instead of modifying the global page table when changing context. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Yes, I think you clear out the translation table which refers to the Wimp task’s memory, then switch the ASID in TTBR0. ASIDs are 16-bit these days, so there can easily be one per task. If you’re feeling lucky, ISTR there’s a bit you can set that will trigger an event on translation table walks, and do the clearing out then (it might have a positive effect on switching to small polling tasks). (Work on the project is stalled for a few weeks, I’m afraid.) |
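For what it’s worth, on AArch64 (where the 16-bit ASIDs live, assuming TCR_EL1.AS is set) the switch itself can be as small as a single TTBR0_EL1 write, since TLB entries are tagged with the ASID and no flush is needed. The task structure below is hypothetical; this is a sketch, not the C kernel’s actual code.

```c
/* AArch64 sketch: ASID-tagged address-space switch via TTBR0_EL1. */
#include <stdint.h>

struct task {
    uint64_t l1_table_pa;   /* physical address of this task's translation table */
    uint16_t asid;          /* 16-bit ASID, assuming TCR_EL1.AS = 1 */
};

static inline void switch_address_space(const struct task *t)
{
    /* TTBR0_EL1[63:48] = ASID, lower bits = table base address. */
    uint64_t ttbr0 = ((uint64_t)t->asid << 48)
                   | (t->l1_table_pa & 0x0000FFFFFFFFFFFFull);
    __asm__ volatile ("msr ttbr0_el1, %0\n\tisb" :: "r"(ttbr0) : "memory");
}
```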
||||||||||||||||||||||||
Jake Hamby (8915) 21 posts |
That sounds about right. I spent some time looking at the KVM module on PowerPC while playing with QEMU-KVM on my PowerMac G5 Quad (Linux/ppc has an emulated hypervisor module called KVM-PR that handles the supervisor mode of the emulated PowerPC and runs the guest VM in user mode). I learned two clever algorithms from this adventure.
First, the Linux kernel extensively uses a data structure called RCU (read-copy-update) for synchronized access to read-mostly data structures in an SMP environment. It eliminates blocking for all readers, and shortens the length of time that a writer may have to wait for readers within the RCU area to finish. Lockless data structures are hot these days, and while they don’t work for everything, RCU is a good example of a mostly lockless algorithm.
Related to that is io_uring, which is a way for user programs to do asynchronous I/O efficiently without having to perform system calls unless you want to wait for something to arrive. It’s implemented with two ring buffers, one in each direction, again designed for lockless communication. Feel free to use either or both of those algorithms in your OS, since I believe neither is patented.
I’m a big fan of async I/O, something that UNIX has historically been bad at (until Linux io_uring, which has yet to be widely adopted, although QEMU is starting to use it). I think UNIX and Java went in the wrong direction in the 1990s with the paradigm of creating thread pools (or process pools in the case of Apache using fork()/exec() on UNIX) to handle multiple connections, with each thread blocking on reads from a single socket. At the very least, you waste 1MB or so per thread of virtual address space for thread stacks, and at least 4K per thread of physical memory for those stacks.
Async I/O is more challenging to write code for because it’s so heavily callback-driven, and that’s why we’re starting to see it built directly into the language, first with C# and now with Swift (except you’ll need the version of macOS/iOS that hasn’t shipped yet if you want to use Apple’s version). Kotlin’s coroutines and RxJava are similar attempts to make lambda / coroutine callbacks easier to reason about (although I found RxJava to be immensely confusing when I used it at a previous job). |
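For reference, the single-producer/single-consumer ring idea behind io_uring can be sketched in a few lines of C11: the producer only ever writes the head index and the consumer only ever writes the tail, so no locks are needed. This is a generic illustration, not io_uring’s actual layout.

```c
/* Minimal single-producer / single-consumer lockless ring buffer. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 256                     /* must be a power of two */

typedef struct {
    _Atomic uint32_t head;                   /* written only by the producer */
    _Atomic uint32_t tail;                   /* written only by the consumer */
    uint64_t         slots[RING_ENTRIES];    /* e.g. completion tokens */
} spsc_ring;

static bool ring_push(spsc_ring *r, uint64_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_ENTRIES)
        return false;                        /* ring is full */
    r->slots[head & (RING_ENTRIES - 1)] = value;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(spsc_ring *r, uint64_t *value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;                        /* ring is empty */
    *value = r->slots[tail & (RING_ENTRIES - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```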
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Sorry to have disappeared for so long. I’ve got the cog demo working, albeit with major hacks to bring up the screen, find modules in the ROM, etc. I’ve decided, for the time being, to link in all but the bottom 64k of an existing ROM, so that I can be reasonably sure that I haven’t broken any existing code. The replacement kernel goes where the HAL was. I want to get a few more basic modules started before I start working on accessing the hardware properly.
One thing I’ve been wondering about is whether it would make sense to create a whitelist of SWIs that may be used by an interrupt routine. The Draw module, for example, checks the value of IRQsema and generates an error if it’s called while an interrupt is being handled, which seems ridiculous. |
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
A whitelist probably won’t be sufficient – there are likely to be a number of SWIs which are semi-functional when called from IRQ handlers. E.g. the SWI might allow you to read the state of something, but not write/modify the state. So the alternative would be to add a SWI to the kernel which reports the current execution context (foreground, IRQ, abort handler, etc.) – Draw can then be modified to call that SWI instead of peeking IRQsema. Since multiple states could occur at the same time (e.g. an abort within an IRQ handler), it might make sense for the SWI to return a series of flags (“in IRQ”, “in abort”, etc.) rather than trying to boil the states down to a simple priority order. |
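As an illustration of what that could look like from C – the SWI itself, its flag bits, and the read_execution_context() veneer below are all hypothetical:

```c
/* Sketch of the flags a "report execution context" SWI might return.
 * No such SWI exists yet; names and bit assignments are hypothetical.
 */
#include <stdint.h>

enum {
    CONTEXT_IRQ      = 1u << 0,   /* inside an IRQ handler */
    CONTEXT_FIQ      = 1u << 1,   /* inside a FIQ handler */
    CONTEXT_ABORT    = 1u << 2,   /* inside an abort handler */
    CONTEXT_CALLBACK = 1u << 3,   /* inside a transient callback */
};

extern uint32_t read_execution_context(void);   /* hypothetical veneer over the SWI */

int draw_can_run_now(void)
{
    /* Draw would test the relevant flags instead of peeking IRQsema. */
    return (read_execution_context()
            & (CONTEXT_IRQ | CONTEXT_FIQ | CONTEXT_ABORT)) == 0;
}
```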
||||||||||||||||||||||||
Stuart Swales (8827) 1357 posts |
In this case it’s really wanting to check whether ScratchSpace is available. That should be separated away from the “in IRQ” state. |
||||||||||||||||||||||||
Jeffrey Lee (213) 6048 posts |
I haven’t looked into the “rules” over who gets to use ScratchSpace when, but my gut feeling is that it’s a horrible thing that’s now completely irrelevant and we should try to get rid of it. I know we won’t be able to completely get rid of it (there are likely to be old programs which use it), but if its use can be weeded out of the OS then that should make things a lot easier. Otherwise it’s likely to start causing problems as we work on threads & multi-core support. |
||||||||||||||||||||||||
Stuart Swales (8827) 1357 posts |
Absolutely. It was only really used because we had 16KB otherwise kicking around going to waste on very-small-memory systems. |
||||||||||||||||||||||||
Jan Rinze (235) 368 posts |
Very interesting results. |
||||||||||||||||||||||||
Jan Rinze (235) 368 posts |
Simon has helped out with building the kernel from the GitHub sources. |
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Per the PRMs: “During initialisation, your module is not on the active module list, and so you cannot call SWIs in your own SWI chunk. Instead you must directly enter your own code.” What I’m noticing is that the BufferManager, during its initialisation, is sending out a Service_BufferStarting, and three places seem to respond with a Buffer_Register, which fail because the module is not yet initialised. Is this:
|
||||||||||||||||||||||||
Julie Stamp (8365) 474 posts |
4. It doesn’t send it out during its initialisation, it sends it out in a callback just after its initialisation. See here |
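For illustration, deferring the service call with a transient callback might look roughly like this in the DDE. OS_AddCallBack takes the callback address in R0 and an R12 value in R1; the module-specific names here are hypothetical, and in real code the callback entry would be a CMHG-generated veneer rather than a plain C function.

```c
/* Sketch: queue a transient callback from a module's initialisation entry,
 * so the service call is issued after initialisation has completed.
 */
#include "kernel.h"
#include "swis.h"

extern void buffer_starting_callback(void);   /* hypothetical; really a CMHG veneer
                                                  that issues Service_BufferStarting */

_kernel_oserror *module_init(const char *cmd_tail, int podule_base, void *pw)
{
    (void)cmd_tail; (void)podule_base;
    /* The module isn't on the active list yet, so don't issue the service
     * call here; queue a callback to run once the system is back in a safe
     * state after initialisation. */
    return _swix(OS_AddCallBack, _INR(0,1), buffer_starting_callback, pw);
}
```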
||||||||||||||||||||||||
Simon Willcocks (1499) 513 posts |
Ah, so my kernel is calling callbacks before it ought to. In this case, before the initialisation routine returns. Is there a rule for when callbacks get called? Is it only when returning to USR mode or when idling? (Now I type that, it seems to ring a bell!) I was calling them on return from any SWI. Thank you! Edit: fixed it, tried it, it seems to work much better, thanks again. |