Warning: huge post -- I'm just thinking out loud.
Here's my perspective as an image processing engineer... The same issues of bit depth and sampling rate (resolution) obviously come up in image processing. (I have also built a lot of audio processing circuits using Pure Data.)
First, the difference between 44, 48 and 96k sampling rates. This is analogous to image resolution. In image processing, fine details are referred to as "high frequency," and this corresponds to high frequency audio. The higher the sampling rate, the better you're able to reproduce high frequency detail. 44k has about 0.92 times the resolution of 48k, and 96k has twice the resolution of 48k.
Here's 44k:
And 48k:
And 96k:
You can see how small the difference is between 44k and 48k. Per Nyquist, each can represent frequencies up to half its sample rate (22.05k and 24k respectively), so both have enough resolution to cover the range of normal human hearing, which tops out at ~20k. There are some fine details you can see better in the 48k image, but you're squinting.
The 96k image is obviously larger, and there are more details, but... uh, unless you pitch shift that audio, it's just resolving details you can't hear. It would be like using a 20 megapixel camera when your lens could only resolve 10 MP of information. Just my opinion.
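The Nyquist arithmetic behind that comparison is simple enough to sketch (plain Python, just the numbers from the paragraphs above):

```python
# Nyquist: a sample rate of f can only represent frequencies up to f/2.
# Hearing tops out around 20 kHz; compare each rate's margin above that.
for rate in (44_100, 48_000, 96_000):
    nyquist = rate / 2
    print(f"{rate} Hz -> Nyquist {nyquist:.0f} Hz "
          f"({nyquist - 20_000:+.0f} Hz vs. the 20 kHz hearing limit)")
```

44k and 48k leave margins of about 2 and 4 kHz over the hearing limit; 96k leaves 28 kHz of margin that only matters if you pitch shift.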
Bit depth is another issue. Bit depth is not "resolution." It simply defines how many steps of amplitude there are between 0 and 1. (1 usually being 0db). Here's the 24bit color image (8 bit per color channel, commonly but confusingly called "8 bit"):
Here's the same image resampled to 4 bit (16 amplitude values), so you can see the steps clearly:
Obviously this is why no one records at 4 bit.
However, here's the same image in 4 bit, but dithered using the algorithm generally accepted as "best" -- it's called error diffusion, or just "diffusion" for short:
Clearly, that is much, much better.
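The image version of error diffusion (Floyd-Steinberg) is 2-D, but the 1-D audio analogue is easy to sketch: quantize each sample, then carry the rounding error into the next sample. This is hypothetical illustration code, not any particular device's dither, with values normalized to 0..1:

```python
# One-dimensional error diffusion: quantize each sample to 4 bits and push
# the rounding error into the next sample. The running average of the output
# then tracks the input far better than hard quantization does.
def quantize_with_diffusion(signal, bits=4):
    levels = 2 ** bits - 1            # 15 steps between 0 and 1
    out, err = [], 0.0
    for x in signal:
        target = x + err              # add the error carried over
        q = round(target * levels) / levels   # snap to the nearest level
        err = target - q              # remember what we got wrong
        out.append(q)
    return out

ramp = [i / 99 for i in range(100)]           # smooth ramp, like a gradient
plain = [round(x * 15) / 15 for x in ramp]    # hard 4 bit quantization
dithered = quantize_with_diffusion(ramp)
print(len(set(plain)), "distinct levels without dithering")  # visible steps
```

The plain version collapses to 16 flat steps; the dithered version toggles between adjacent levels so the average follows the ramp, which is exactly the effect in the image above.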
Here's a less impressive dithering technique ("pattern dither").
This demonstrates that the dithering style does make a difference -- pattern dither introduces distracting artifacts. IMO, for best results, use gear/software with error diffusion dithering.
Some audio gear actually just takes the bottom 16 bits of a 24bit signal (someone said Quicktime does this?). This is disastrous: the low 16 bits top out at 1/256 of full scale (about -48db), so anything that reaches above roughly 2/3 of the way up the total range, in db terms, will be unceremoniously clipped off. That would look like this:
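In numbers (a hypothetical sketch of the arithmetic, assuming the device saturates rather than wraps):

```python
import math

# A hypothetical device keeping only the low 16 bits of a 24bit sample.
# Signed 24bit full scale is 2**23 - 1; the low 16 bits top out at
# 2**15 - 1, i.e. 1/256 of full scale. Here we clip (saturate) at that point.
def low16_of_24bit(sample):
    return min(sample, 2**15 - 1)

threshold_db = 20 * math.log10((2**15 - 1) / (2**23 - 1))
print(f"everything above ~{threshold_db:.1f} db gets flattened")
```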
24 bit vs 16 bit

Okay, so how big of a difference is there between 16 and 24 bit? Well, in order to make the difference clearer, I'm going to filter out the high frequency information by blurring the image. Here's a 24bit image:
And here's the 16bit version (resampled to 24bit jpg for viewing on the web):
Wow, okay, the 24 bit version looks loads better! Well, that 16bit conversion didn't use any dithering. If we were converting from 24 > 16bit and used diffusion dithering, you would be hard pressed to spot the difference.
There's actually another factor here related to dithering. Error diffusion works by introducing minute errors (noise) to the signal. But your signal already has noise if it came from a mic, went through a pre, and passed through A/D. So even if you're not working with gear that dithers, odds are that your signal is doing the dithering for you.
Here's the same 24bit image with a small amount of noise added to emulate the amount of noise in a nice, clean recording:
And the signal with the same amount of noise at 16 bit:
It looks very close to the 24bit image. The 24bit is maybe a tiny bit nicer, but you have to be looking for it.
This is why Sound Devices says that at full signal level, 24 and 16 bit sound "largely identical." They should know.
Headroom

Of course, as SD points out, the world is not perfect, and you can't always be kissing 0db. So how does 24bit stack up against 16bit if you record with your levels down, and normalize (multiply) them later? Let's say you leave enough headroom that most of your audio peaks at -18db. You want to be sure that you won't clip if the dude next to you yells. That means you're only using 1/8th of the available levels -- in 16bit, 8192 levels; in 24bit, 2,097,152 levels. In other words, now your 16bit audio is really 13bit, and your 24bit is really 21bit.
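That "lost bits" arithmetic generalizes: every ~6.02db of unused headroom throws away one bit. A quick sketch:

```python
import math

# Each ~6.02 db of unused headroom discards one bit of resolution,
# because halving the amplitude halves the number of levels in use.
def effective_bits(bit_depth, headroom_db):
    return bit_depth - headroom_db / (20 * math.log10(2))

for depth in (16, 24):
    levels = 2**depth / 8   # peaks at -18db ~ 1/8 of full amplitude
    print(f"{depth}bit with -18db peaks: {levels:,.0f} levels in use, "
          f"~{effective_bits(depth, 18):.0f} effective bits")
```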
The real issue, however, isn't the loss of amplitude fidelity. It's the fact that we're amplifying the ADC noise. Let's imagine that our ADC introduces an amount of noise that equals about 2 amplitude levels. So out of the 8192 levels used in our 16bit file, 2 of those are noise. And of the 2,097,152 levels used in our 24bit file, 2 of those are noise. See the issue? In the 16bit file, 2/8192 = 0.024% of the signal is noise, versus 0.0001% noise for the 24 bit file. Relative to the signal, the 16bit file has exactly 256 times more digital noise than the 24bit version.
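The noise comparison is just two divisions (the "2 levels of ADC noise" figure is the hypothetical from the paragraph above, not a measured spec):

```python
# Hypothetical ADC whose noise spans ~2 amplitude levels, recording with
# peaks at -18db so only 1/8 of the available levels are in use.
noise_levels = 2
used_16 = 2**16 // 8          # 8,192 levels in use at 16bit
used_24 = 2**24 // 8          # 2,097,152 levels in use at 24bit
frac_16 = noise_levels / used_16
frac_24 = noise_levels / used_24
print(f"16bit: {frac_16:.4%} noise; 24bit: {frac_24:.6%} noise; "
      f"ratio {frac_16 / frac_24:.0f}x")
```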
That said, ADCs are all different, and will have different "characters," so if you do plan on running with a lot of overhead, it's probably best to try a few out and see what sounds good to you.
Dynamic Range

So, what about dynamic range? Dynamic range is one of the most misunderstood terms in both digital photography and digital audio. That's partially because DR figures are almost always given in logarithmic scale (stops in photography, db in audio), and people always get tripped up with log numbers. It's also because people aren't sure what affects DR.
There are only two things that can affect DR: noise and bit depth. Noise is simple: your mic has a self-noise, your pre adds some noise, and your ADC adds some noise. Whatever you have left between the noise floor and 0db is your raw dynamic range. Bit depth affects DR, because audio is typically encoded linearly. So in 16bit, you have 65536 levels. Half of those levels (32768) cover 0db to -6db, half of the remaining levels (16384) cover -6 to -12, half of that (8192) cover -12 to -18, etc. By the time you get to the range between -66db and -72db, there are only 16 levels of amplitude to describe the waveform -- pretty gritty. By the time you get to the range between -84db and -90db, there are only two levels -- a square wave, either on or off. Of course, noise takes over long before we get to that point.
So the theoretical dynamic range of 16bit is 90db, but the last 30db or so are pretty rough. This is why some people think very soft sounds start to sound bad in 16bit -- for example, the oft-cited "end of the decay of a cymbal." There are plenty of microphones that have over 66db of dynamic range, so they can expose the limits of 16bit.
With 24bit, you start off with more levels. There are over 8 million levels to describe the amplitude between 0 and -6db!!! Obviously massive overkill. But the result is that between -66db and -72db, you have 4096 levels available vs 16 levels in 16bit. It takes 24bit a bit longer to get clipped to 16 levels -- you have to get down to the range between -114db and -120db. The theoretical limit to the DR in 24bit is 138db, because the range between -132 and -138db gives us only two levels.
Luckily, no microphone is capable of capturing that. Even microphone/pre/ADC combos capable of reaching 80db of DR will still have 1024 levels with which to describe their noise floor in 24bit.