Is it correct to assume that floating-point samples in a WAV or AIFF file will be normalized?

Question

Say I have a program that reads a .WAV or .AIFF file, and the file's audio is encoded as floating-point sample-values. Is it correct for my program to assume that any well-formed (floating-point-based) .WAV or .AIFF file will contain sample values only in the range [-1.0f,+1.0f]? I couldn't find anything in the WAV or AIFF specifications that addresses this point.

And if that is not a valid assumption, how can one know what the full dynamic range of the audio in the file was intended to be? (I could read the entire file and find its actual minimum and maximum sample values, but there are two problems with that: (1) it would be a slow/expensive operation if the file is very large, and (2) it would lose information: if the file's creator had intended the file to have some "headroom" so as not to play at 0 dBFS at its loudest point, my program would not be able to detect that.)

Answer 1:

As you state, the publicly available documentation does not go into detail about the range used for floating point. However, from industry practice over the last several years, and from actual data existing as floating-point files, I would say it is a valid assumption.

There are practical reasons for this, and [-1, 1] is also a very common normalization range for high-precision data in general, whether color, audio, 3D, etc.

The main reason for the range to be in the interval [-1, 1] is that it is fast and easy to scale/convert to the target bit-range. You only need to supply the target range and multiply.

For example:

If you want to play it at 16-bit you would do (pseudocode, assuming the result is rounded to a signed integer):

sample = in < 0 ? in * 0x8000 : in * 0x7fff;

or 24-bit:

sample = in < 0 ? in * 0x800000 : in * 0x7fffff;

or 8-bit:

sample = in < 0 ? in * 0x80 : in * 0x7f;

etc. without having to adjust the original input value in any way. -1 and 1 would represent min/max value when converted to target (1x = x).
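
The conversions above can be collected into one helper. What follows is a minimal Python sketch, not taken from any library: the name `float_to_pcm` is hypothetical, and the clamp to [-1, 1] is an added assumption so that out-of-range input saturates instead of overflowing the target width.

```python
def float_to_pcm(sample: float, bits: int = 16) -> int:
    """Scale a float sample (nominally in [-1.0, 1.0]) to a signed integer."""
    neg_scale = 1 << (bits - 1)   # 0x8000 for 16-bit, 0x80 for 8-bit
    pos_scale = neg_scale - 1     # 0x7fff for 16-bit, 0x7f for 8-bit
    # Assumption: clamp first, so values beyond full scale saturate.
    clamped = max(-1.0, min(1.0, sample))
    scale = neg_scale if clamped < 0 else pos_scale
    return int(round(clamped * scale))


print(float_to_pcm(1.0))        # 32767
print(float_to_pcm(-1.0))       # -32768
print(float_to_pcm(-1.0, 24))   # -8388608
```

Only the `bits` argument changes between target widths; the input sample itself is never rescaled.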

If you used a range of [-0.5, 0.5] you would first (or at some point) have to adjust the input value, so a conversion to, for example, 16-bit would need an extra step. This has an extra cost, not only for the step itself but also because it is done in the floating-point domain, which is heavier to compute (the latter is perhaps a bit of a legacy concern, as floating-point processing is pretty fast nowadays, but still):

in = in * 2;
sample = in < 0 ? in * 0x8000 : in * 0x7fff;

Keeping it in the [-1, 1] range rather than some pre-scaled range (for example [-32768, 32767]) also allows use of more bits for precision (using the IEEE 754 representation).

UPDATE 2017/07

Tests

Based on questions in the comments I decided to triple-check by making a test using three files, each containing a 1-second sine wave:

A) Floating point clipped
B) Floating point max 0dB, and
C) integer clipped (converted from A)

The files were then scanned for values <= -1.0 and >= 1.0, starting right after the data chunk's ID and size fields so that the min/max values reflect the actual values found in the audio data.
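
A scan of this kind might look like the following Python sketch. It is not the author's test script: the function names are hypothetical, the file is assumed to hold little-endian 32-bit float samples (the `fmt ` chunk is not checked), and the whole file is read into memory, which would not suit very large files.

```python
import struct


def float_wav_min_max(data: bytes):
    """Return (min, max) of the float32 samples in a RIFF/WAVE byte string."""
    riff, _size, wave_id = struct.unpack_from("<4sI4s", data, 0)
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        pos += 8  # step past the chunk ID and size field
        if chunk_id == b"data":
            count = min(chunk_size, len(data) - pos) // 4
            samples = struct.unpack_from("<%df" % count, data, pos)
            return min(samples), max(samples)
        # Chunks are word-aligned: odd-sized chunks carry one pad byte.
        pos += chunk_size + (chunk_size & 1)
    raise ValueError("no data chunk found")


def make_float_wav(samples, rate=44100):
    """Build a tiny mono float32 WAV in memory (for demonstration only)."""
    payload = struct.pack("<%df" % len(samples), *samples)
    # fmt chunk: tag 3 = IEEE float, 1 channel, rate, byte rate, block align, bits
    fmt = struct.pack("<HHIIHH", 3, 1, rate, rate * 4, 4, 32)
    body = (b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", len(payload)) + payload)
    return b"RIFF" + struct.pack("<I", len(body)) + body


print(float_wav_min_max(make_float_wav([0.0, 0.5, -2.0, 1.5])))  # (-2.0, 1.5)
```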

The test results confirm that the values are indeed in the inclusive [-1, 1] range when the signal is not clipping (that is, when it stays at or below 0 dB).

But it also revealed another aspect -

WAV files saved as floating point do allow values exceeding the 0 dB range. This means the range is actually beyond [-1, 1] for values that normally would clip.

One explanation for this is that floating-point formats are intended for intermediate use in production setups, because they lose very little dynamic range: later processing (gain staging, compression, limiting, etc.) can bring the values back, without loss, to well within the normal final range of -0.2 to 0 dB. The format therefore preserves the values as-is.
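
To relate sample values to these dB figures, a peak amplitude can be converted with the usual 20·log10 rule, taking 1.0 as full scale (0 dB). This is a small illustrative sketch; the function name `peak_dbfs` is hypothetical, and a peak of exactly 0 is not handled (it raises `ValueError`).

```python
import math


def peak_dbfs(peak: float) -> float:
    """Level of a peak amplitude in dB relative to full scale (1.0)."""
    return 20.0 * math.log10(abs(peak))


print(round(peak_dbfs(1.0), 2))   # 0.0
print(round(peak_dbfs(2.0), 2))   # 6.02
print(round(peak_dbfs(4.0), 2))   # 12.04
print(round(peak_dbfs(0.5), 2))   # -6.02
```

So a float file whose peaks sit around 2.0 is roughly 6 dB over full scale, and one with peaks around 4.0 is roughly 12 dB over.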

In conclusion

WAV files using floating point will save out values in the [-1, 1] range when not clipping (<= 0 dB), but do allow for values that are considered clipped.

But when converted to an integer format these values will clip to the equivalent of the [-1, 1] range scaled by the bit-range of the integer format, regardless. This is natural given the limited range each bit width can hold.

It will therefore be up to the player/DAW/editing software to handle clipped floating-point values, by either normalizing the data or simply clipping back to [-1, 1].
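
The two options can be sketched in Python as follows. This is only an illustration under simple assumptions: the names `clip` and `normalize` are hypothetical, samples are plain Python floats in a list, and normalization is a single peak-based gain applied to the whole buffer.

```python
def clip(samples):
    """Hard-clip every sample back into [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s)) for s in samples]


def normalize(samples):
    """Scale the buffer down so its peak is 1.0; leave it alone if already in range."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= 1.0:
        return list(samples)
    return [s / peak for s in samples]


over = [0.5, 2.0, -1.0]
print(clip(over))        # [0.5, 1.0, -1.0]
print(normalize(over))   # [0.25, 1.0, -0.5]
```

Clipping discards the part of the waveform above full scale, while normalizing keeps its shape but lowers the overall level.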

Figure notes (images not reproduced):

Note: Max values for all files are measured directly from the sample data.

Note: Produced as clipped float (+6 dB), then converted to signed 16-bit and back to float.

Note: Clipped to +6 dB.

Note: Clipped to +12 dB.

Simple test script and files can be found here.

Answer 2:

I know the question was not specific to a given programming language or framework, but I could not find the answer in any specification. What I can say for sure is that the NAudio library that is widely used to handle .WAV files in applications written for the .NET framework assumes that the float samples are in the range [-1.0,+1.0].

Here is the relevant code from its source:

namespace NAudio.Wave
{
    public class WaveFileReader : WaveStream
    {
        ...
        /// <summary>
        /// Attempts to read the next sample or group of samples as floating point normalised into the range -1.0f to 1.0f
        /// </summary>
        /// <returns>An array of samples, 1 for mono, 2 for stereo etc. Null indicates end of file reached
        /// </returns>
        public float[] ReadNextSampleFrame()
        {
            ...
            var sampleFrame = new float[waveFormat.Channels];
            int bytesToRead = waveFormat.Channels*(waveFormat.BitsPerSample/8);
            ...
            for (int channel = 0; channel < waveFormat.Channels; channel++)
            {
                if (waveFormat.BitsPerSample == 16)
                ...
                else if (waveFormat.BitsPerSample == 32 && waveFormat.Encoding == WaveFormatEncoding.IeeeFloat)
                {
                    sampleFrame[channel] = BitConverter.ToSingle(raw, offset);
                    offset += 4;
                }
                ...
            }
            return sampleFrame;
        }
        ...
    }
}

So it just copies the float into the array without doing any transformations on it and promises it to be in the given range.

Answer 3:

Yes.

Audio file formats act as carriers for one or more channels of audio data. That audio data has been encoded using a particular audio coding format. Each coding format uses an encoder algorithm. The algorithm is the important part. We can hand wave away the value of the file and coding formats.

AIFF and WAV both use Pulse-Code Modulation (PCM) or its descendants. (If you check out this Oracle doc, you'll notice that "Encoding/CompressionType" lists PCM-based algorithms.) PCM works by sampling the audio sine wave at fixed time intervals and choosing the nearest digital representation. The important point here is "sine wave".

Sine waves modulate between -1 and 1, thus all PCM-derived encodings will operate on this principle. Consider the mu-law implementation: notice in its defining equation the range is required to be -1 to 1.
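
For reference, the mu-law compression curve can be written out as a short Python sketch. The function name is hypothetical, and mu = 255 is assumed here as the commonly used value; the input is required to lie in [-1, 1], matching the range stated for the equation.

```python
import math


def mu_law_compress(x: float, mu: float = 255.0) -> float:
    """mu-law companding: F(x) = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("mu-law is defined for -1 <= x <= 1")
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


print(mu_law_compress(1.0))    # 1.0
print(mu_law_compress(-1.0))   # -1.0
print(mu_law_compress(0.0))    # 0.0
```

The output also stays within [-1, 1], with small input values boosted relative to large ones.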

I am doing a lot of hand-waving to answer this in brief. Sometimes we must necessarily lie to the kids. If you want to dig deeper into floating-point vs. fixed-point, importance of bit-depth to errors, etc. check out a good book on DSP. To get you started:

The Scientist and Engineer's Guide to Digital Signal Processing
Cisco Systems Waveform Coding Techniques

Source: https://stackoverflow.com/questions/29761331/is-it-correct-to-assume-that-floating-point-samples-in-a-wav-or-aiff-file-will-b

Tags

floating-point

normalization

wav

aiff