Hollowing out a Module
Colin Ferris (399) 1814 posts |
Is there a standard way of ‘hollowing out a module’, i.e. taking an ARM assembler module apart and building it back up as a ‘C’ module, so that the SWIs etc. can be updated with ‘C’ code one SWI at a time? |
Dave Higton (1515) 3526 posts |
Interesting idea. If you have the source, then I believe it should be possible, within some limits. For example, I’d expect it to be more difficult if the module handles service calls or other “frills”. The easiest would be a module that does nothing but implement SWIs. Do you have a particular module in mind? |
Colin Ferris (399) 1814 posts |
The SoundDMA module seems to drive the hardware, but it only makes use of 16bit samples even if the hardware accepts 24bit. A test case could be made with a program running in a TaskWindow, reading a 24bit sound file and feeding the Sound system. |
Dave Higton (1515) 3526 posts |
I wonder if it could be done by patching the real module to give it a different SWI base number and name, and writing a C module that simply passes on every call to the patched module. Then you can fill in each SWI one at a time with equivalent functionality. That only deals with the SWIs; I see that SoundDMA has a service call handler too, and I don’t know at what point the handler will cease to function correctly as the functionality is moved over; clearly the workspaces of the two modules would be separate, so neither can see the other’s data. I find assembly language impenetrable at this level of complexity and sophistication. The authors really have woven a tangled web. |
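To make the pass-through idea concrete, here is a minimal sketch of the kind of C module Dave describes, using the standard CMHG swi-handler-code entry and the _kernel_swi call from kernel.h. The chunk number, the handler name and the assumption that the patched copy of SoundDMA has been moved to a spare SWI chunk are all illustrative, not anything from the real sources:

    /* Hypothetical chunk given to the renamed, patched copy of the module. */
    #include "kernel.h"

    #define PATCHED_SWI_BASE 0x5A000

    /* CMHG "swi-handler-code" entry: swi_offset is the offset from this
     * module's own chunk base, r holds the caller's registers. */
    _kernel_oserror *swi_handler(int swi_offset, _kernel_swi_regs *r, void *pw)
    {
        (void) pw;                 /* private word not needed while forwarding */

        switch (swi_offset)
        {
            /* case 0:  return my_c_implementation(r);   -- filled in later */
            default:
                /* _kernel_swi uses the error-returning X form, so any error
                 * from the patched module propagates back to the caller. */
                return _kernel_swi(PATCHED_SWI_BASE + swi_offset, r, r);
        }
    }

Each SWI can then be migrated by adding a case that calls a C implementation instead of falling through to the forwarder.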
Jeffrey Lee (213) 6048 posts |
I think the way I’d tackle piecemeal rewriting of a module in C would be as follows:
Once you’ve got that set up, the module should be able to function as normal, and you can start replacing code with C. If you want to be able to call into C from assembler then you’ll need to be aware of the usual rules when adding C to assembler code (making sure you use APCS calling conventions, and making sure the relocation offsets are on the stack – although if the stack trace is C → assembler → C then the relocation offsets should already be there, so in most cases you can probably avoid setting them up again). |
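The C side of that mixing is just ordinary APCS functions; a small sketch with made-up names (none of these are real SoundDMA symbols, and the assembler routine is only declared here, not defined):

    /* Still implemented in assembler, exported so converted C code can call
     * it with ordinary APCS arguments in r0-r2. */
    extern void asm_fill_dma_buffer(void *buffer, int bytes, void *workspace);

    /* Already converted to C; the remaining assembler calls it with
     * r0 = channel and r1 = sample, and must have the relocation offsets on
     * the stack (as noted above, they will usually already be there if the
     * call chain started in C). */
    int scale_sample(int channel, int sample)
    {
        (void) channel;
        return sample;             /* placeholder for the real scaling code */
    }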
Rick Murray (539) 13840 posts |
In a word, yes. |
Dave Higton (1515) 3526 posts |
The good news, in the case of SoundDMA, is that its workspace is tiny – if Zap’s “Get module workspace” has got it right. |
Jeffrey Lee (213) 6048 posts |
I guess you can also be selective with which bits of workspace you define in C. E.g. you could re-arrange the assembler workspace so that all the bits which need to be accessed by C are at the start, so the C version of the structure can be much shorter. And/or you can remove entries from the workspace struct and convert them to standard global/static C variables as you convert each section of code (however this does make it harder for assembler to get at the values). |
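To picture that, a hedged sketch with invented field names (the real layout is whatever the assembler source defines); the C struct only describes the prefix of the assembler workspace that C code actually touches:

    typedef struct sound_workspace
    {
        /* Fields the C code needs, placed first and in the same order as the
         * assembler workspace so the offsets agree on both sides. */
        unsigned int sample_rate;
        unsigned int channel_count;
        void        *dma_buffer;

        /* Assembler-only fields follow in memory but are deliberately not
         * declared here, so the C struct can stay short. */
    } sound_workspace;

    /* A value converted wholly to C can later leave the struct and become a
     * static variable instead (at the cost of being awkward for the
     * remaining assembler to reach). */
    static int current_volume;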
Rick Murray (539) 13840 posts |
Ah, be careful there. What Zap sees (and what you will see) is the module’s private workspace pointer that is in R12 on entry to the module. Often, this is a fixed-size chunk of memory that is claimed when the module first starts up, with R12 then being set to point to it. Any further memory claims will have at least one anchor within the private workspace (depending on whether it is a linked list or what have you), but these cannot be determined by the OS or by Zap, so they will appear to be “invisible”. |
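As a concrete picture of that, a minimal sketch using the standard CMHG initialisation entry (module_init and the workspace layout are invented, not SoundDMA’s). The fixed-size block claimed here is what R12 points at, and therefore what Zap’s “Get module workspace” reports; anything claimed later only hangs off pointers inside it, so its size is invisible from outside:

    #include "kernel.h"
    #include "swis.h"

    typedef struct workspace
    {
        void *extra_buffer;     /* later RMA claims are anchored here...      */
        int   extra_size;       /* ...so the OS and Zap can't total them up   */
    } workspace;

    _kernel_oserror *module_init(const char *cmd_tail, int podule_base, void *pw)
    {
        workspace *ws;
        _kernel_oserror *e;

        (void) cmd_tail;
        (void) podule_base;

        /* OS_Module 6 (Claim) takes the size in R3 and returns the block in R2. */
        e = _swix(OS_Module, _IN(0) | _IN(3) | _OUT(2),
                  6, sizeof(workspace), &ws);
        if (e != NULL)
            return e;

        ws->extra_buffer = NULL;
        ws->extra_size   = 0;

        /* Storing the pointer in the private word is what makes it arrive in
         * R12 on every subsequent entry to the module. */
        *(void **)pw = ws;
        return NULL;
    }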
Jon Abbott (1421) 2651 posts |
SoundDMA doesn’t use a lot of workspace; you can view the definitions for the IOMD and HAL builds in the source code. There are a few variables and a chunk of code that’s dynamically created. |
Dave Higton (1515) 3526 posts |
Wow. |
Jeffrey Lee (213) 6048 posts |
The dynamic code generation isn’t that scary (compared to e.g. SpriteExtend). It’s used for converting the 8bit mu-law audio to 16bit. You can quite easily write a generic algorithm for converting the audio, but that would involve a few multiplies, which were still a bit slow on ARMv3. So for increased performance SoundDMA glues together a few different code fragments to produce a routine that’s customised for the current channel count, channel stereo positions, and volume.

There’s a NEON version of the same audio conversion code, but since it needs to operate on the samples in parallel there’s much less scope for optimisation via dynamic code generation. So instead it’s just a handful of static routines (different routines depending on the channel count), with the stereo positioning and volume scaling catered for via a vector multiply.

There are some performance stats comparing the two here, but since we’re now making better use of memory attributes I expect those stats to be a bit out-of-date (e.g. the DMA buffer should now be correctly flagged as bufferable memory, whereas before I think it would have been non-bufferable, so the ARM version shouldn’t suffer so much from the way that it writes out one word at a time). |
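For anyone curious what “gluing fragments together” looks like in outline, here is a rough, heavily simplified sketch. All the names are hypothetical, the fragments themselves would be hand-written assembler (only declared here), and the real SoundDMA also patches in stereo positions and volume rather than simply repeating one fragment per channel:

    #include <string.h>
    #include "kernel.h"
    #include "swis.h"

    #ifndef OS_SynchroniseCodeAreas
    #define OS_SynchroniseCodeAreas 0x6E
    #endif

    /* Pre-assembled ARM code fragments, each written so the next fragment
     * can follow straight on. */
    typedef struct { const unsigned int *code; int words; } fragment;
    extern const fragment frag_prologue, frag_per_channel, frag_epilogue;

    typedef void (*fill_routine)(void *dst, const void *src, int samples);

    static unsigned int *emit(unsigned int *out, const fragment *f)
    {
        memcpy(out, f->code, (size_t) f->words * sizeof(unsigned int));
        return out + f->words;
    }

    /* Glue a routine together for the current channel count, then ask the OS
     * to synchronise caches so the freshly written instructions can run. */
    static fill_routine build_fill_routine(unsigned int *buffer, int channels)
    {
        unsigned int *p = emit(buffer, &frag_prologue);
        int c;

        for (c = 0; c < channels; c++)
            p = emit(p, &frag_per_channel);

        p = emit(p, &frag_epilogue);

        /* R0 bit 0 set means "address range in R1-R2 (inclusive)". */
        _swix(OS_SynchroniseCodeAreas, _INR(0,2), 1, buffer, p - 1);

        return (fill_routine) buffer;
    }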
Dave Higton (1515) 3526 posts |
Conversion from mu-law to 16 bit linear should be done by a table: 256 entries, 2 bytes per entry, 512 bytes in the table. You’d have to be crazy to do it any other way, surely? Conversion the other way is a little more involved. |
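The table Dave describes can be sketched directly; a hedged example where build_mulaw_table and expand are illustrative names, and the expansion formula shown is the common G.711 one rather than anything lifted from SoundDMA:

    #include <stdint.h>

    static int16_t mulaw_to_lin[256];

    /* Build the 256-entry table once.  (SoundDMA's own 8-bit log format may
     * differ in detail, e.g. where the sign bit sits, but the table approach
     * is the same.) */
    static void build_mulaw_table(void)
    {
        int i;
        for (i = 0; i < 256; i++)
        {
            int byte     = ~i & 0xFF;          /* G.711 stores codes inverted */
            int sign     = byte & 0x80;
            int exponent = (byte >> 4) & 0x07;
            int mantissa = byte & 0x0F;
            int value    = (((mantissa << 3) + 0x84) << exponent) - 0x84;

            mulaw_to_lin[i] = (int16_t)(sign ? -value : value);
        }
    }

    /* Conversion of each 8-bit sample is then a single indexed load. */
    static int16_t expand(uint8_t sample)
    {
        return mulaw_to_lin[sample];
    }

That is 512 bytes of table, as Dave says, and one load per sample at conversion time.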
Jeffrey Lee (213) 6048 posts |
That’s what the ARM version of the routine does. But unless you want to have a lot of lookup tables (could end up thrashing the cache on older machines), you’ll need to perform the (per-channel) stereo positioning and the (global) volume scaling using some other method.
A lookup table makes sense for the ARM version because it processes one 8 bit sample at a time. But the NEON version processes many samples in parallel, and there aren’t any scatter-gather load/store instructions (to allow multiple LUT lookups in parallel), so it’s actually quicker to do a manual conversion in parallel than to use a LUT in serial. At peak throughput, the ARM version loads 8 samples using LDMIA, and then performs the lookup to (mono, unscaled) 16bit with a further 16 instructions. So 2.125 instructions per sample. The NEON version (always) loads 16 samples into a quadword register using VLDM and converts them to (mono, unscaled) 16bit with a further 18 instructions. That’s 1.1875 instructions per sample. |
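The multiplies Jeffrey refers to can be pictured like this; a sketch with assumed names and an assumed 16.16 fixed-point gain format, not SoundDMA’s actual internals. After the LUT expansion, each sample is scaled by its channel’s stereo position and the global volume before being accumulated into the output, which is the part a single lookup table cannot fold in without one table per position/volume combination:

    #include <stdint.h>

    /* left_gain, right_gain and volume are assumed to be 16.16 fixed point. */
    static void position_and_scale(int16_t mono,
                                   int32_t left_gain, int32_t right_gain,
                                   int32_t volume,
                                   int32_t *left_acc, int32_t *right_acc)
    {
        int32_t scaled = (int32_t)(((int64_t)mono * volume) >> 16);

        *left_acc  += (int32_t)(((int64_t)scaled * left_gain)  >> 16);
        *right_acc += (int32_t)(((int64_t)scaled * right_gain) >> 16);
    }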
Rick Murray (539) 13840 posts |
Jeffrey said:
So it may be that with all of these things taken into account, the code version was seen as an acceptable compromise? |
Dave Higton (1515) 3526 posts |
I sit corrected. |
Colin Ferris (399) 1814 posts |
Is there a way of adding support for 24bit samples – for one of the new ARM machines? |
Jon Abbott (1421) 2651 posts |
You could switch SoundDMA to internally use 32bit and downscale for the output device. Longer term, you’d also want to change SharedSound internally to 32bit, with an extension to allow clients to specify their bit depth, 16bit being the default. Bit depth is only half the issue, though; you’d also want to add support for sample rate correction, as many of the newer machines don’t support many of the legacy sample rates. It’s fairly trivial to implement, provided you don’t get all fancy about how you do it. |
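A hedged sketch of the two pieces Jon mentions, with invented helper names and deliberately of the “don’t get fancy” kind: the top 16 bits of a full-range 32-bit internal sample become the device sample, and sample rate correction is plain linear interpolation on a mono stream (exactly the sort of cheap approach Dave’s caveat below applies to):

    #include <stdint.h>

    /* Downscale a full-range 32-bit internal sample to the 16-bit output
     * device (a clipping version would be needed instead if the internal
     * format keeps headroom rather than using the top bits). */
    static int16_t downscale_32_to_16(int32_t s)
    {
        return (int16_t)(s >> 16);
    }

    /* Very plain sample rate correction: step through the input at a 16.16
     * fixed-point ratio and linearly interpolate between neighbouring
     * samples.  Returns the number of output samples produced. */
    static int resample_linear(const int32_t *in, int in_len, int in_rate,
                               int32_t *out, int out_max, int out_rate)
    {
        uint32_t step = (uint32_t)(((uint64_t)in_rate << 16) / (uint32_t)out_rate);
        uint32_t pos = 0;
        int produced = 0;

        while ((int)(pos >> 16) < in_len - 1 && produced < out_max)
        {
            int     idx  = (int)(pos >> 16);
            int64_t frac = pos & 0xFFFF;
            int64_t a    = in[idx];
            int64_t b    = in[idx + 1];

            out[produced++] = (int32_t)(a + (((b - a) * frac) >> 16));
            pos += step;
        }
        return produced;
    }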
Dave Higton (1515) 3526 posts |
It’s fairly trivial to implement badly. If you want to implement it with high audio quality (is this what you mean by “fancy”?), you may find you rapidly run out of signal processing capacity. |