Hollowing out a Module
Colin Ferris (399) 1814 posts |
Is there a standard way of ‘hollowing out a module’, i.e. taking an ARM assembler module apart and building it back up as a ‘C’ module, so that the SWIs etc. can be updated with ‘C’ code one SWI at a time? |
Dave Higton (1515) 3526 posts |
Interesting idea. If you have the source, then I believe it should be possible, within some limits. For example, I’d expect it to be more difficult if the module handles service calls or other “frills”. The easiest would be a module that does nothing but implement SWIs. Do you have a particular module in mind? |
Colin Ferris (399) 1814 posts |
The SoundDMA module seems to drive the hardware, but it only makes use of 16bit samples even if the hardware accepts 24bit. A test case could be made with a program running in a TaskWindow, reading a 24bit sound file and feeding the Sound system. |
Dave Higton (1515) 3526 posts |
I wonder if it could be done by patching the real module to give it a different SWI base number and name, and writing a C module that simply passes on every call to the patched module. Then you can fill in each SWI one at a time with equivalent functionality. That only deals with the SWIs; I see that SoundDMA has a service call handler too, and I don’t know at what point the handler will cease to function correctly as the functionality is moved over; clearly the workspaces of the two modules would be separate, so neither can see the other’s data. I find assembly language impenetrable at this level of complexity and sophistication. The authors really have woven a tangled web. |
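To make the pass-through idea concrete, here is a minimal sketch of the kind of C module Dave describes, using the standard CMHG swi-handler-code entry and the _kernel_swi call from kernel.h. The chunk number, the handler name and the assumption that the patched copy of SoundDMA has been moved to a spare SWI chunk are all illustrative, not anything from the real sources:

    /* Hypothetical chunk given to the renamed, patched copy of the module. */
    #include "kernel.h"

    #define PATCHED_SWI_BASE 0x5A000

    /* CMHG "swi-handler-code" entry: swi_offset is the offset from this
     * module's own chunk base, r holds the caller's registers. */
    _kernel_oserror *swi_handler(int swi_offset, _kernel_swi_regs *r, void *pw)
    {
        (void) pw;                 /* private word not needed while forwarding */

        switch (swi_offset)
        {
            /* case 0:  return my_c_implementation(r);   -- filled in later */
            default:
                /* _kernel_swi uses the error-returning X form, so any error
                 * from the patched module propagates back to the caller. */
                return _kernel_swi(PATCHED_SWI_BASE + swi_offset, r, r);
        }
    }

Each SWI can then be migrated by adding a case that calls a C implementation instead of falling through to the forwarder.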
Jeffrey Lee (213) 6048 posts |
I think the way I’d tackle piecemeal rewriting of a module in C would be as follows:
Once you’ve got that set up, the module should be able to function as normal, and you can start replacing code with C. If you want to be able to call into C from assembler then you’ll need to be aware of the usual rules when adding C to assembler code (making sure you use APCS calling conventions, and making sure the relocation offsets are on the stack – although if the stack trace is C → assembler → C then the relocation offsets should already be there, so in most cases you can probably avoid setting them up again). |
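The C side of that mixing is just ordinary APCS functions; a small sketch with made-up names (none of these are real SoundDMA symbols, and the assembler routine is only declared here, not defined):

    /* Still implemented in assembler, exported so converted C code can call
     * it with ordinary APCS arguments in r0-r2. */
    extern void asm_fill_dma_buffer(void *buffer, int bytes, void *workspace);

    /* Already converted to C; the remaining assembler calls it with
     * r0 = channel and r1 = sample, and must have the relocation offsets on
     * the stack (as noted above, they will usually already be there if the
     * call chain started in C). */
    int scale_sample(int channel, int sample)
    {
        (void) channel;
        return sample;             /* placeholder for the real scaling code */
    }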
Rick Murray (539) 13840 posts |
In a word, yes. |
Dave Higton (1515) 3526 posts |
The good news, in the case of SoundDMA, is that its workspace is tiny – if Zap’s “Get module workspace” has got it right. |
Jeffrey Lee (213) 6048 posts |
I guess you can also be selective with which bits of workspace you define in C. E.g. you could re-arrange the assembler workspace so that all the bits which need to be accessed by C are at the start, so the C version of the structure can be much shorter. And/or you can remove entries from the workspace struct and convert them to standard global/static C variables as you convert each section of code (however this does make it harder for assembler to get at the values). |
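To picture that, a hedged sketch with invented field names (the real layout is whatever the assembler source defines); the C struct only describes the prefix of the assembler workspace that C code actually touches:

    typedef struct sound_workspace
    {
        /* Fields the C code needs, placed first and in the same order as the
         * assembler workspace so the offsets agree on both sides. */
        unsigned int sample_rate;
        unsigned int channel_count;
        void        *dma_buffer;

        /* Assembler-only fields follow in memory but are deliberately not
         * declared here, so the C struct can stay short. */
    } sound_workspace;

    /* A value converted wholly to C can later leave the struct and become a
     * static variable instead (at the cost of being awkward for the
     * remaining assembler to reach). */
    static int current_volume;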
Rick Murray (539) 13840 posts |
Ah, be careful there. What Zap sees (and what you will see) is the module’s private workspace pointer that is in R12 on entry to the module. Often, this is a fixed-size chunk of memory that is claimed when the module first starts up, with R12 then being set to point to it. Any further memory claims will have at least one anchor within the private workspace (depending on whether it is a linked list or what have you), but these cannot be determined by the OS or by Zap, so they will appear to be “invisible”. |
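As a concrete picture of that, a minimal sketch using the standard CMHG initialisation entry (module_init and the workspace layout are invented, not SoundDMA’s). The fixed-size block claimed here is what R12 points at, and therefore what Zap’s “Get module workspace” reports; anything claimed later only hangs off pointers inside it, so its size is invisible from outside:

    #include "kernel.h"
    #include "swis.h"

    typedef struct workspace
    {
        void *extra_buffer;     /* later RMA claims are anchored here...      */
        int   extra_size;       /* ...so the OS and Zap can't total them up   */
    } workspace;

    _kernel_oserror *module_init(const char *cmd_tail, int podule_base, void *pw)
    {
        workspace *ws;
        _kernel_oserror *e;

        (void) cmd_tail;
        (void) podule_base;

        /* OS_Module 6 (Claim) takes the size in R3 and returns the block in R2. */
        e = _swix(OS_Module, _IN(0) | _IN(3) | _OUT(2),
                  6, sizeof(workspace), &ws);
        if (e != NULL)
            return e;

        ws->extra_buffer = NULL;
        ws->extra_size   = 0;

        /* Storing the pointer in the private word is what makes it arrive in
         * R12 on every subsequent entry to the module. */
        *(void **)pw = ws;
        return NULL;
    }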
Jon Abbott (1421) 2651 posts |
SoundDMA doesn’t use a lot of workspace; you can view the definitions for the IOMD and HAL builds in the source code. There are a few variables and a chunk of code that’s dynamically created. |
Dave Higton (1515) 3526 posts |
Wow. |
Jeffrey Lee (213) 6048 posts |
The dynamic code generation isn’t that scary (compared to e.g. SpriteExtend). It’s used for converting the 8bit mu-law audio to 16bit. You can quite easily write a generic algorithm for converting the audio, but that would involve a few multiplies, which were still a bit slow on ARMv3. So for increased performance SoundDMA glues together a few different code fragments to produce a routine that’s customised for the current channel count, channel stereo positions, and volume.

There’s a NEON version of the same audio conversion code, but since it needs to operate on the samples in parallel there’s much less scope for optimisation via dynamic code generation. So instead it’s just a handful of static routines (different routines depending on the channel count), with the stereo positioning and volume scaling catered for via a vector multiply.

There are some performance stats comparing the two here, but since we’re now making better use of memory attributes I expect those stats to be a bit out-of-date (e.g. the DMA buffer should now be correctly flagged as bufferable memory, whereas before I think it would have been non-bufferable, so the ARM version shouldn’t suffer so much from the way that it writes out one word at a time). |
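For anyone curious what “gluing fragments together” looks like in outline, here is a rough, heavily simplified sketch. All the names are hypothetical, the fragments themselves would be hand-written assembler (only declared here), and the real SoundDMA also patches in stereo positions and volume rather than simply repeating one fragment per channel:

    #include <string.h>
    #include "kernel.h"
    #include "swis.h"

    #ifndef OS_SynchroniseCodeAreas
    #define OS_SynchroniseCodeAreas 0x6E
    #endif

    /* Pre-assembled ARM code fragments, each written so the next fragment
     * can follow straight on. */
    typedef struct { const unsigned int *code; int words; } fragment;
    extern const fragment frag_prologue, frag_per_channel, frag_epilogue;

    typedef void (*fill_routine)(void *dst, const void *src, int samples);

    static unsigned int *emit(unsigned int *out, const fragment *f)
    {
        memcpy(out, f->code, (size_t) f->words * sizeof(unsigned int));
        return out + f->words;
    }

    /* Glue a routine together for the current channel count, then ask the OS
     * to synchronise caches so the freshly written instructions can run. */
    static fill_routine build_fill_routine(unsigned int *buffer, int channels)
    {
        unsigned int *p = emit(buffer, &frag_prologue);
        int c;

        for (c = 0; c < channels; c++)
            p = emit(p, &frag_per_channel);

        p = emit(p, &frag_epilogue);

        /* R0 bit 0 set means "address range in R1-R2 (inclusive)". */
        _swix(OS_SynchroniseCodeAreas, _INR(0,2), 1, buffer, p - 1);

        return (fill_routine) buffer;
    }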
Dave Higton (1515) 3526 posts |
Conversion from mu-law to 16 bit linear should be done by a table: 256 entries, 2 bytes per entry, 512 bytes in the table. You’d have to be crazy to do it any other way, surely? Conversion the other way is a little more involved. |
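The table Dave describes can be sketched directly; a hedged example where build_mulaw_table and expand are illustrative names, and the expansion formula shown is the common G.711 one rather than anything lifted from SoundDMA:

    #include <stdint.h>

    static int16_t mulaw_to_lin[256];

    /* Build the 256-entry table once.  (SoundDMA's own 8-bit log format may
     * differ in detail, e.g. where the sign bit sits, but the table approach
     * is the same.) */
    static void build_mulaw_table(void)
    {
        int i;
        for (i = 0; i < 256; i++)
        {
            int byte     = ~i & 0xFF;          /* G.711 stores codes inverted */
            int sign     = byte & 0x80;
            int exponent = (byte >> 4) & 0x07;
            int mantissa = byte & 0x0F;
            int value    = (((mantissa << 3) + 0x84) << exponent) - 0x84;

            mulaw_to_lin[i] = (int16_t)(sign ? -value : value);
        }
    }

    /* Conversion of each 8-bit sample is then a single indexed load. */
    static int16_t expand(uint8_t sample)
    {
        return mulaw_to_lin[sample];
    }

That is 512 bytes of table, as Dave says, and one load per sample at conversion time.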
Jeffrey Lee (213) 6048 posts |
That’s what the ARM version of the routine does. But unless you want to have a lot of lookup tables (could end up thrashing the cache on older machines), you’ll need to perform the (per-channel) stereo positioning and the (global) volume scaling using some other method.
A lookup table makes sense for the ARM version because it processes one 8 bit sample at a time. But the NEON version processes many samples in parallel, and there aren’t any scatter-gather load/store instructions (to allow multiple LUT lookups in parallel), so it’s actually quicker to do a manual conversion in parallel than to use a LUT in serial. At peak throughput, the ARM version loads 8 samples using LDMIA, and then performs the lookup to (mono, unscaled) 16bit with a further 16 instructions. So 2.125 instructions per sample. The NEON version (always) loads 16 samples into a quadword register using VLDM and converts them to (mono, unscaled) 16bit with a further 18 instructions. That’s 1.1875 instructions per sample. |
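The multiplies Jeffrey refers to can be pictured like this; a sketch with assumed names and an assumed 16.16 fixed-point gain format, not SoundDMA’s actual internals. After the LUT expansion, each sample is scaled by its channel’s stereo position and the global volume before being accumulated into the output, which is the part a single lookup table cannot fold in without one table per position/volume combination:

    #include <stdint.h>

    /* left_gain, right_gain and volume are assumed to be 16.16 fixed point. */
    static void position_and_scale(int16_t mono,
                                   int32_t left_gain, int32_t right_gain,
                                   int32_t volume,
                                   int32_t *left_acc, int32_t *right_acc)
    {
        int32_t scaled = (int32_t)(((int64_t)mono * volume) >> 16);

        *left_acc  += (int32_t)(((int64_t)scaled * left_gain)  >> 16);
        *right_acc += (int32_t)(((int64_t)scaled * right_gain) >> 16);
    }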
Rick Murray (539) 13840 posts |
Jeffrey said:
So it may be that with all of these things taken into account, the code version was seen as an acceptable compromise? |
Dave Higton (1515) 3526 posts |
I sit corrected. |
Colin Ferris (399) 1814 posts |
Is there a way of adding support for 24bit samples – for one of the new ARM machines? |
Jon Abbott (1421) 2651 posts |
You could switch SoundDMA to internally use 32bit and downscale for the output device. Longer term, you’d also want to change SharedSound internally to 32bit, with an extension to allow clients to specify their bit depth, 16bit being the default. Bit depth is only half the issue, though; you’d also want to add support for sample rate correction, as many of the newer machines don’t support many of the legacy sample rates. It’s fairly trivial to implement, provided you don’t get all fancy about how you do it. |
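A hedged sketch of the two pieces Jon mentions, with invented helper names and deliberately of the “don’t get fancy” kind: the top 16 bits of a full-range 32-bit internal sample become the device sample, and sample rate correction is plain linear interpolation on a mono stream (exactly the sort of cheap approach Dave’s caveat below applies to):

    #include <stdint.h>

    /* Downscale a full-range 32-bit internal sample to the 16-bit output
     * device (a clipping version would be needed instead if the internal
     * format keeps headroom rather than using the top bits). */
    static int16_t downscale_32_to_16(int32_t s)
    {
        return (int16_t)(s >> 16);
    }

    /* Very plain sample rate correction: step through the input at a 16.16
     * fixed-point ratio and linearly interpolate between neighbouring
     * samples.  Returns the number of output samples produced. */
    static int resample_linear(const int32_t *in, int in_len, int in_rate,
                               int32_t *out, int out_max, int out_rate)
    {
        uint32_t step = (uint32_t)(((uint64_t)in_rate << 16) / (uint32_t)out_rate);
        uint32_t pos = 0;
        int produced = 0;

        while ((int)(pos >> 16) < in_len - 1 && produced < out_max)
        {
            int     idx  = (int)(pos >> 16);
            int64_t frac = pos & 0xFFFF;
            int64_t a    = in[idx];
            int64_t b    = in[idx + 1];

            out[produced++] = (int32_t)(a + (((b - a) * frac) >> 16));
            pos += step;
        }
        return produced;
    }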
Dave Higton (1515) 3526 posts |
It’s fairly trivial to implement badly. If you want to implement it with high audio quality (is this what you mean by “fancy”?), you may find you rapidly run out of signal processing capacity. |