Audio Recording API
Colin (478) 2433 posts |
I should add that the recording API should be identical to the playback API; just the direction of the stream changes. So where you would set a sample rate to play, you can set a sample rate to record. |
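A minimal sketch of the symmetry Colin describes: one stream-opening call, with only the direction changing. Everything here (names, types, the function itself) is invented for illustration; it is not part of any published proposal.

```c
#include <stdint.h>

typedef enum { STREAM_PLAYBACK, STREAM_RECORD } stream_dir;

typedef struct {
    uint32_t sample_rate;    /* e.g. 48000 */
    uint8_t  channels;       /* e.g. 2 for stereo */
    uint8_t  sample_bytes;   /* bytes per sample on the wire */
} stream_format;

/* Stub standing in for the real sound system. */
static int audio_stream_open(stream_dir dir, const stream_format *fmt)
{
    (void)dir; (void)fmt;
    return 0;   /* stream handle */
}

int main(void)
{
    stream_format fmt = { 48000, 2, 4 };
    int play = audio_stream_open(STREAM_PLAYBACK, &fmt);  /* set a rate to play... */
    int rec  = audio_stream_open(STREAM_RECORD,   &fmt);  /* ...or the same to record */
    (void)play; (void)rec;
    return 0;
}
```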
Rick Murray (539) 13840 posts |
From the document:
Why?
Whoa, so poetic!
Simple. That which is currently playing determines the sample rate. It might be an idea to have a very simple rubbish quality resampler available so that a 22kHz beep can be overlaid onto a 48kHz AAC. The premise being that if the user is actually intending to play something with two different sample rates at the same time, they should use an application that can do it; this is only “good enough” for beeps, chimes, tones, etc to be mixed into the audio stream (and since what’s currently playing is treated as the primary, it should be able to mix it in without the sample rate changing).
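For illustration, a deliberately crude nearest-neighbour resampler of the "good enough for beeps" sort Rick means: it stretches a low-rate tone so it can be mixed into a 48kHz stream. A sketch only, and intentionally rubbish quality.

```c
#include <stddef.h>
#include <stdint.h>

/* Map each output sample back to the nearest input sample.
   Cheap and nasty: fine for beeps, not for music. */
void resample_nearest(const int32_t *in, size_t in_len, uint32_t in_rate,
                      int32_t *out, size_t out_len, uint32_t out_rate)
{
    for (size_t i = 0; i < out_len; i++) {
        size_t j = (size_t)((uint64_t)i * in_rate / out_rate);
        out[i] = (j < in_len) ? in[j] : 0;
    }
}
```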
I note that my phone, a Samsung S9 so not a slouch, is still incapable of doing this properly.
This is probably best handled with some sort of configuration option, with mixing being the default (as it’ll be what most normal non-pro users expect).
It probably will. I would imagine the system will be put together and “proven” using only a subset of the intended codecs and sample rates, with those to be fleshed out once the core is running satisfactorily.
I think the point is that between the software generating the sound, and the playback device, the sound system should use the best format that isn’t going to result in potential data loss down the line. You know, like how the current 16 bit system can’t cope with 24 bit audio. If this change is done internally and is invisible to either end, then it’s really no big deal. It’s a sound equivalent to how floating point units internally promote FP values to extended precision prior to doing a calculation, in order that the result have the best accuracy.
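A minimal sketch of that "promote internally, convert at the edges" idea, assuming a 32-bit working format: 16-bit and 24-bit samples are widened on entry so nothing is lost mid-chain. The helper names are invented.

```c
#include <stdint.h>

/* Shift into the top of the word so full scale stays full scale. */
static inline int32_t widen16(int16_t s)
{
    return (int32_t)s << 16;
}

/* 's24' is a sign-extended 24-bit value held in an int32_t. */
static inline int32_t widen24(int32_t s24)
{
    return s24 << 8;
}
```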
I agree with Colin in that that’s a big deficiency in audio handling, and when done properly (like this), audio input is not only complementary to output, it can share a fair few bits of the mechanics of it.
Also, given that the main use case is going to be playing and not recording… ;-)
I wonder if, in the future (as an addition) it would be possible to have input and output be represented by DeviceFS objects. That way an application can “open a file” (using special fields to define the required format) and then simply read or write data to the device (along with a large enough buffer that normal multitasking delays won’t cause stuttering). |
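A sketch of how that DeviceFS idea might look from an application, assuming a hypothetical “Audio” device; the device name and special-field names are invented, and the syntax only loosely follows RISC OS special-field conventions.

```c
#include <stdio.h>

int main(void)
{
    /* Special fields between '#' and ':' pick the format and buffering. */
    FILE *out = fopen("Audio#rate=48000;channels=2;buffer=65536:$.Out", "wb");
    if (out == NULL) return 1;

    short silence[2] = { 0, 0 };
    fwrite(silence, sizeof silence, 1, out);  /* write one stereo frame */
    fclose(out);
    return 0;
}
```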
Dave Higton (1515) 3526 posts |
For real time communication applications (video calls, VoIP voice calls), the latency MUST be short. The most common block length is 20 milliseconds. You don’t want to introduce delays by requiring big buffers to be filled too much before anything happens. For my iyoPhone app, I got Christian Ludlam to do me a special version of the audio in app with a very short buffer. |
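The arithmetic behind that 20 millisecond figure, for anyone wanting the frame counts: the buffers involved are small.

```c
#include <stdio.h>

int main(void)
{
    const unsigned rates[] = { 8000, 16000, 48000 };
    for (unsigned i = 0; i < 3; i++) {
        unsigned frames = rates[i] * 20 / 1000;  /* frames in one 20ms block */
        printf("%u Hz -> %u frames per 20ms block\n", rates[i], frames);
    }
    return 0;  /* prints 160, 320 and 960 respectively */
}
```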
Rick Murray (539) 13840 posts |
One can imagine (would hope) that those sorts of applications would be talking directly to the sound system, not using the DeviceFS “file” approach. That being said, no reason why a special field option couldn’t define the amount of buffering required? |
Colin (478) 2433 posts |
Unlike USB – have we learned lessons from that? – the new API would be as closely coupled to the device as possible via a callback interface (not the OS callback), which for our USB driver is a latency of about 3 days. The DeviceFS is just an addition to enable some use in application space. Buffer size won’t matter much: you won’t be able to do low latency stuff in application space – unless of course you single-task the system, in which case you can use callbacks. |
jim lesurf (2082) 1438 posts |
Two pretty basic points:

1) It seems a no-brainer obvious point to me that 32 bit values for transfers make sense, because it is routine for people to play both 16 and 24 bit samples, etc. Hence standardising on 32 bit for buffers and transfers makes coping with 16, 24, and 32 all use the same route and buffers, and saves having to arrange different ‘modes’ or ‘settings’ for each type.

2) As soon as you add a ‘mixer’ it is unavoidable that you will need levels to ‘duck’ when more than one source is played (or recorded) at a time. This is because when you sum the samples, some of the results will (try to) go above 0dBFS if you don’t! i.e. you will get clipping distortion. Hence mixers and fiddling about with levels go together. So again, avoiding mixers whenever possible is easier and better. |
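A minimal sketch of Jim’s point 2: summing two full-scale streams overflows 0dBFS, so a mixer must either duck the levels or saturate. Illustration only, assuming 32-bit samples.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum in a wider type and clip: this hard limit is exactly the
   distortion Jim warns about if levels are not ducked. */
static inline int32_t sat_add(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Duck both sources by 6dB before summing: the result can never clip. */
void mix_ducked(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] / 2 + b[i] / 2;
}
```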
Colin (478) 2433 posts |
My point is: suppose you have an audio file with a 2-byte sample size and a device that takes 3 bytes. Why waste CPU time converting up to 4 bytes and then down to 3 bytes? If the audio is headed for a mixer stream then converting all streams to 4 bytes makes sense, but the change only needs to happen at the input to the mixer, just as sample size changing only needs to happen at the input to an audio device. In that situation, if you have a mixer, the sample change happens twice: once on entry to the mixer and once on entry to the device. If you take out the mixer, the sample change happens once, on entry to the device. It is likely that once you get into high sample rates, a mixer/processor will be too slow to process the data, so taking out unnecessary processing is helpful. |
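A sketch of that “convert once, at the edge” route: 2-byte samples straight to a 3-byte device format, with no intermediate 4-byte stage. Illustration only; the byte order shown assumes a little-endian device.

```c
#include <stddef.h>
#include <stdint.h>

void s16_to_s24(const int16_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t s = (int32_t)in[i] << 8;    /* 16 bits into the top of 24 */
        out[3 * i]     = (uint8_t)s;        /* pack as 3 little-endian bytes */
        out[3 * i + 1] = (uint8_t)(s >> 8);
        out[3 * i + 2] = (uint8_t)(s >> 16);
    }
}
```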
Chris Hughes (2123) 336 posts |
The more I read this thread, the more I think some people are trying to run before we can walk. To me it seems we need to get the basics working before worrying about all the nice extra features. Otherwise this is going to generate too much work for one person. |
Steve Pampling (1551) 8170 posts |
To me it seems like people are trying to agree the API so that it gets done once.1
1 I’m a big believer in spending longer planning things, and maybe a bit extra implementing the basic “shell”, in order to avoid rushing a basic “does the existing job” version that has to be binned the moment anything needs alteration. |
Steve Fryatt (216) 2105 posts |
Not just the API, but the architecture, too. There’s no point implementing the input stuff first without thinking about how you’re going to connect to it and use it. Otherwise you just end up doing it all again when you subsequently find out that, in fact, “phase 2” needed a slightly different implementation. More haste, less speed.
Careful. That sounds dangerously close to planning things. |
Colin (478) 2433 posts |
In the document, the format descriptor needs an entry for sample size separate from bit resolution. Though thinking about it, do we need bit resolution? Sample size tells you the number of bytes in the sample; if the resolution is smaller it doesn’t matter – USB supplies both. I note you’re referencing Class 1 Audio. Class 2 does things slightly differently, though I think it’s mainly an implementation change and extending formats. Note that USB descriptors include an interface type ID, which may help the user recognise a device from a list. |
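A sketch of the distinction Colin wants captured, mirroring the USB Audio Class 1 pair bSubframeSize/bBitResolution: the container size and the valid bits within it are separate fields. The field names here are invented.

```c
#include <stdint.h>

typedef struct {
    uint32_t sample_rate;     /* Hz */
    uint8_t  channels;
    uint8_t  sample_bytes;    /* container: 2, 3 or 4 bytes per sample */
    uint8_t  bit_resolution;  /* valid bits within it, e.g. 24 of 32 */
} audio_format;

/* e.g. 24-bit audio carried in 4-byte slots: */
static const audio_format example = { 48000, 2, 4, 24 };
```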
Steve Pampling (1551) 8170 posts |
I frequently comment at work1 that some people think it is a place in China (Pla Ning). I might be socially inept but I don’t normally need to do things more than once, so I’m tolerated :)
1 Bit more eyesight rebuild work and I can go back and annoy them again :^) |
Clive Semmens (2335) 3276 posts |
I think Pla Ning is actually in Vietnam… 8~D |
Steve Pampling (1551) 8170 posts |
Thus proving how wrong they are. |
Jason Tribbeck (8508) 21 posts |
(Document’s been updated [no point in downloading it again; the differences are very subtle], and I’m trying to catch up – I may have quoted things out of order!)
True – the system beep is important. But it is just a beep – it’s not music, and it could be easily replaced with an arbitrary sample.
In my original plan, there was a degree of sharing of functionality between input and output; the new API is just a bit of an extension of the earlier approach, removing some of the requirements of having to have both 16-bit and 32-bit merge code – and probably more sharing of functionality.
While I’ve made USB devices, and written USB firmware, USB audio isn’t something I’ve had a lot of experience with from a programming perspective – so I’m learning as I go along :) And I know that even though I was only asked about audio input, I2S devices will interfere with the sound output, so either sound output would need to be paused while recording occurs (so the code can gain exclusive access to the hardware registers), or sound output also needs to be considered.
As am I. I also like writing these documents, because I can have a conversation with myself to try to see if I’ve covered all angles. At work, I was prototyping a multi-computer, multi-instance communications architecture for a project and I started writing code to do it. But about 20% of the way through, I realised that I didn’t actually know how it would end up and I was getting confused with the message passing. So I took a day to document how it should work, and coded to that – and it didn’t need to change that much. I’m also a believer in the Agile approach (I’ve been doing it for 20ish years), so I wouldn’t be worrying about USB right this minute. But having an architecture which can’t cope with USB (and Bluetooth) doesn’t make sense – we already have one of those! I really want to get this right. |
Rick Murray (539) 13840 posts |
First reason – because it simplifies things. Let’s just “assume” for the moment that the sound system consists of three parts: an input, something to route the data around, and an output. That’s the sound system, so the input is something like an MP3 decoder, not a microphone. And, besides, it’s probably actually faster: you have your two bytes – a simple MOV for one byte, a rotated ORR to merge in the second – and then you word-write the entire value instead of doing two byte writes. Likewise, the far end can pluck a word at a time and fiddle the data around in registers to fit its three-byte preference.
Processing is fast, especially if the conversion fits into cache and the registers are used sensibly. Memory access. That’s what’s slow.
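A C rendering of the word-at-a-time trick Rick sketches in assembler: pack two 16-bit samples into one 32-bit store rather than issuing separate narrow writes, since the memory traffic is the slow part. Names invented.

```c
#include <stddef.h>
#include <stdint.h>

void pack_pairs(const int16_t *in, uint32_t *out, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        uint32_t lo = (uint16_t)in[2 * i];
        uint32_t hi = (uint16_t)in[2 * i + 1];
        out[i] = lo | (hi << 16);   /* one word write instead of two narrow ones */
    }
}
```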
And is future expandable without horrid hacks like sticking “TASK” or “TRUE” into certain registers.
Depending upon something USB provides might bite you in the ass if the input comes from something that isn’t USB.
For a short while, my A5000 would exclaim “Oh s**t!” when there was an error. I guess we could troll Windows users by loading up “critical stop.wav” as the sample? |
Steve Pampling (1551) 8170 posts |
Which would be a nice bonus – incorporating a feature of various third party utilities from the 26 bit era. That wouldn’t be a design requirement, but it might be a consequence of the implementation. |
Jason Tribbeck (8508) 21 posts |
Pretty much what I was thinking. I did consider that we could use more than 32-bits as the intermediate level, but we’d almost certainly never need higher than 32-bit precision, the code would be a lot more complicated, and the memory would be oddly arranged (unless it’s 64-bit, but that’s a massive jump in precision!) |
Colin (478) 2433 posts |
How can it be faster than no processing of 4-byte aligned data by the processing unit – which is doing more than just copying data? The device then converts the data to its sample size, which may be the same, so it would be a word read and write. You only need to arrange for the processing units to mandate a 4-byte sample size; if the application uses one, then either it or the OS promotes the sample data to 4 bytes at the beginning of the chain of processing. That has to happen anyway even if direct access to the device is not allowed – which makes this all academic anyway, as Jason’s proposal is not to have direct access to devices. I don’t think we need to get bogged down in details: if you think you can optimise flow through processing units where it won’t make a difference then I’m happy with that. I was just voicing my concern that high sample rate audio will need all the CPU bandwidth it can get. |
Peter Howkins (211) 236 posts |
This document may be of interest/inspiration. It’s the spec that Acorn wrote for Phoebe’s mixer and audio input capabilities. |
Rick Murray (539) 13840 posts |
Just to throw in a little something here, it’s worth pointing out that consumer grade line levels typically peak at ±0.447V (which is sometimes written as −10 dBV). This means we have a swing of just under 0.9V possible in our signal. A 16 bit recording system can resolve 65,536 steps, which makes it 0.000014 volts per step. A 24 bit recording, however, has a much greater range of possible values, 16,777,216 steps, which equates to 0.000000054 (I think; Google actually said 5.36441803 × 10⁻⁸). 16 bit is ~14µV per step.

The ATX power supply spec specifies a maximum ripple of 50mV for the 5V supply (which will likely be what is present on the USB socket).

So while it is useful to have data be available as 32 bit values to allow for future expansion and to benefit from the native word size of the processor, it is worth remembering that just because a format exists, it doesn’t mean that anything is actually capable of either recording or playing back with anything even remotely near to that degree of precision.

Think about this – in both formats (16 bit and 24 bit), how many of the least significant bits are essentially useless? That is to say, if you performed a recording using the sorts of domestic consumer grade devices that we normal people might buy, and set up a recording with no input connected, how many of the least significant bits are simply random due to noise inherent in the design, and fluctuating due to other design issues? |
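The arithmetic behind those figures, for anyone who wants to reproduce it:

```c
#include <stdio.h>

int main(void)
{
    const double vpp = 2 * 0.447;  /* ~0.894V swing at -10dBV consumer line level */
    printf("16-bit: %.3g V/step\n", vpp / 65536.0);     /* ~1.4e-5 V (~14uV) */
    printf("24-bit: %.3g V/step\n", vpp / 16777216.0);  /* ~5.3e-8 V (~53nV) */
    return 0;
}
```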
Steve Pampling (1551) 8170 posts |
Fixed that for you. BTW, check out differential inputs, and in that general enquiry follow the links for lots of cheap (and not-so-cheap) single-chip amplifiers in noise-cancelling circuits. Not saying the current hardware has any prospect of such things, but perhaps a Titanium with a special sound board? |
Colin (478) 2433 posts |
What about this idea: just mimic USB.

If you look at USB you have terminals and units. Terminals are at either end of a chain of units, so the sound starts at the input terminal, the stream passes through each unit (which does something), and you get the sound out at the output terminal. Each terminal and unit has a control interface, and all are linked through the stream interface.

For us an input terminal would be a line in or a program sourcing sound, and an output terminal would be a line out or a program sinking sound. All codecs, mixers, selectors etc are units.

If you want to play some audio, the programmer:

1) Selects the output terminal of choice and stores a context for the terminal. You end up with a chain of processing units between you and the output terminal. Each unit knows the unit/device it is outputting to, so it can communicate with it over the streaming interface. I now have a context for the terminal and the units in the chain, which I can use to access each unit’s control interface – if it has one. As you are holding context for units, the units are not holding state and can be used more than once.

2) Sets up the output terminal they have chosen – line out – registers a buffer fill function with it, and sets up any units used.

I’ll concentrate on a PCM stream, but it works with any. Playing then starts with a negotiation:

1) You ask the top of the chain to set up the buffer. You tell it a) the format you are sending it, and b) the minimum buffer size you would need – in case you can only read in chunks; it would generally be 0 for a PCM stream.

2) The first unit in the chain looks at what is sent and asks its output unit the same question, i.e. tells its output unit.

3) This continues down the chain until it reaches the output terminal – the line out device. The output terminal has been told the format it will receive and the minimum buffer size required. The device works out the maximum number of samples it would require in one chunk and multiplies that by the maximum of its own sample size and the input format sample size. Then it creates a buffer which is the maximum of the buffer size it calculated and the minimum buffer size required.

You start the output terminal. It calls the buffer fill function you registered, telling you how many samples to put in the buffer and the buffer address. The buffer is filled from the top in the sample size of the source, and the first unit in the chain is informed that the buffer is filled with x samples. The unit knows what sample size to expect – it was in the buffer request – and if it wants to work in and output 4-byte samples it expands the input; the buffer size has been negotiated to enable this. Once that unit has done its stuff it tells its output unit the buffer is full, and so on down the chain.

The input terminal can be anything your program can source data from: a file, the output from SoundDMA, a line in device etc. With a line in device you register with it and hold its context. You can use its control interface to set it up how you want. It will tell you the buffer size it requires, and you use that as the minimum buffer size when requesting a buffer size from units/terminals. If you make the output terminal a line out, you tie input to output.

You don’t need any units, or you can use what you want. If you have a system sound unit, all programs can include that. You just mix and match units to suit. If you have an mp3 unit it can convert your mp3 input stream to PCM, and you choose the output terminal you want: file, line-out etc.
I’ve probably missed some points but you get the gist. I think it offers great flexibility. |
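A minimal sketch (not Colin’s code) of the negotiation he outlines: each unit forwards the format/minimum-buffer question to its output, widening the sample size where it works in a bigger one, and the terminal at the end sizes the shared buffer. All types and names are invented.

```c
#include <stddef.h>

typedef struct unit unit;

struct unit {
    unit  *out;                    /* next unit/terminal in the chain */
    size_t sample_bytes;           /* the sample size this stage works in */
    size_t (*negotiate)(unit *u, size_t in_bytes, size_t min_buf);
};

/* An ordinary unit: passes the question down, widening if needed. */
size_t unit_negotiate(unit *u, size_t in_bytes, size_t min_buf)
{
    size_t bytes = (u->sample_bytes > in_bytes) ? u->sample_bytes : in_bytes;
    return u->out->negotiate(u->out, bytes, min_buf);
}

/* The output terminal: actually sizes the buffer. */
size_t terminal_negotiate(unit *t, size_t in_bytes, size_t min_buf)
{
    size_t chunk = 256;   /* max samples the device wants in one go (made up) */
    size_t bytes = (t->sample_bytes > in_bytes) ? t->sample_bytes : in_bytes;
    size_t need  = chunk * bytes;
    return (need > min_buf) ? need : min_buf;  /* max of calculated and minimum */
}
```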
jim lesurf (2082) 1438 posts |
In my practical experience USB Audio Class 2 devices (usual for modern kit) use 4 bytes-per-value transfers, so 16 and 24 bit sample values have to be given to USB as 4-byte – i.e. 32 bit – chunks per value. Class 1 tended to use 2 bytes per value. Actually transferring 3 bytes per value seems to have been a bodge for squeezing 24 bit samples through older kit that was Class 1, so far as I’m aware. Using 4 bytes per value throughout makes sense, as you then only need a conversion at the input or output of the entire OS chain. And someone else has confirmed my suspicion that using ‘words’ makes the actual process easier anyway, IIUC.

In practice 24 bit values from good ADCs may have a noise level below -96 dBFS, hence need 24 bits to ensure dithering. Most home recordings won’t be that good. But some pro ones will be. And in reality most users will be playing pro recordings they have purchased. So the system does need to be able to transfer 24 bit values unmolested. FWIW I’m looking at the spec of my Benchmark ADC and this gives an unweighted noise of -119 dB even for a 192k sample rate (i.e. nominal audio bandwidth of 96kHz). This is nominally studio / testbench quality, but some large companies will have better kit than this! Of course, many consumer releases are crapped up, but this is the kind of level the digital path needs to have to cope with.

So, Rick, yes, many real recordings aren’t really ‘24bit’ in anything more than name. But some are. If curious, have a look at http://www.audiomisc.co.uk/MQA/cool/bitfreezing.html where I came at this from a different angle for another reason. The example recording with a spectrum was made using DSD (sigh) so has an enormous ‘noise hill’ in the ultrasonic. And 3/10th of SFA in terms of music up there.

The check people can make is as follows. If you have a ‘high rez’ recording in flac format, convert it into a noise-shaped one at 48k/16bit. Compare the sizes of the files. The change is due mainly to removing over-specified noise in the lowest bits. However, although usually this indicates 16 bits (noise shaped) is fine, it won’t always be so. And that means someone/something somewhere would need to shape down – which implies adding it to the sound system or having the user keep doing it.

So, bottom line: using 4 bytes per value from end-to-end is the simplest course for getting reliable pass-the-parcel performance with no risk of confusion or damage to the result. Let the DAC dither it down to low bit. That’s its job. Just as some ADCs make 1 bit captures and then convert them to 24 bit for storage. 8-]

(BTW I hate DSD just as much as I hate MQA and HDCD. But that’s another story… :→ ) |
Rick Murray (539) 13840 posts |
This is in response to Jim: https://www.mojo-audio.com/blog/the-24bit-delusion/ ;-) [I love the bit that says that commercial 24 bit stuff is just 16 bit data stuck into 24 bit space because people think 24 bit is necessary despite (as I suspected) the playback capabilities not having a hope in hell of being able to actually use the additional information] |