How is audio represented with numbers in computers?

野的像风 2020-11-29 16:50

I like thinking about how everything can be and is represented by numbers. For example, plaintext is represented by a code like ASCII, and images are represented by RGB values. So how is audio represented with numbers?

10 Answers
  • 2020-11-29 17:34

    There are 2 steps involved in converting actual analogous audio into a digital form.

    1. Sampling
    2. Quantization

    Sampling

    The rate at which a continuous waveform (in this case, audio) is sampled is called the sampling rate. The frequency range perceived by humans is 20 - 20,000 Hz. Following the Nyquist sampling theorem, CDs use a sampling rate of 44,100 Hz, which covers frequencies in the range 0 - 22,050 Hz.

    Quantization

    The discrete set of values received from the sampling phase now needs to be converted into a finite number of levels. An 8-bit quantization provides 256 possible values, while a 16-bit quantization provides up to 65,536 values.
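
    A minimal sketch of both steps in C: it samples a hypothetical 1 kHz analog tone at 44.1 kHz and quantizes each sample to 8-bit and 16-bit integers, so you can see the resolution difference. The tone and the printed sample count are made up for illustration; compile with -std=c99 and -lm as in the other answers in this thread.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    
    int main(void) {
        const double PI2 = 2 * acos(-1.0);
        const double SAMPLE_FREQ = 44100.0;   /* sampling rate in Hz */
        const double TONE_FREQ = 1000.0;      /* frequency of the "analog" tone */
        for (unsigned int n = 0; n < 5; ++n) {
            /* Sampling: read the continuous waveform at discrete instants n / fs. */
            double sample = sin(PI2 * TONE_FREQ * n / SAMPLE_FREQ);
            /* Quantization: map the [-1, 1] range onto a finite set of integers. */
            int8_t  q8  = (int8_t)lrint(sample * 127.0);     /* 8-bit: 256 levels available */
            int16_t q16 = (int16_t)lrint(sample * 32767.0);  /* 16-bit: 65,536 levels available */
            printf("n=%u  analog=% .6f  8-bit=%4d  16-bit=%6d\n", n, sample, q8, q16);
        }
        return 0;
    }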

  • 2020-11-29 17:35

    Minimal C audio generation example

    The example below generates a pure 1000 Hz sine wave in raw format. At the common 44.1 kHz sampling rate, it lasts 4 seconds.

    main.c:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    
    int main(void) {
        FILE *f;
        const double PI2 = 2 * acos(-1.0);
        const double SAMPLE_FREQ = 44100;
        const unsigned int NSAMPLES = 4 * SAMPLE_FREQ;
        uint16_t ampl;
        uint8_t bytes[2];
        unsigned int t;
    
        f = fopen("out.raw", "wb");
        for (t = 0; t < NSAMPLES; ++t) {
            /* Map the sine from [-1, 1] onto the full unsigned 16-bit range. */
            ampl = UINT16_MAX * 0.5 * (1.0 + sin(PI2 * t * 1000.0 / SAMPLE_FREQ));
            /* Write the sample big-endian (u16be): most significant byte first. */
            bytes[0] = ampl >> 8;
            bytes[1] = ampl & 0xFF;
            fwrite(bytes, 2, sizeof(uint8_t), f);
        }
        }
        fclose(f);
        return EXIT_SUCCESS;
    }
    

    GitHub upstream.

    Generate out.raw:

    gcc -std=c99 -o main main.c -lm
    ./main
    

    Play out.raw directly:

    sudo apt-get install ffmpeg
    ffplay -autoexit -f u16be -ar 44100 -ac 1 out.raw
    

    or convert to a more common audio format and then play with a more common audio player:

    ffmpeg -f u16be -ar 44100 -ac 1 -i out.raw out.flac
    vlc out.flac
    

    Generated FLAC file: https://github.com/cirosantilli/media/blob/master/canon.flac

    Parameters explained at: https://superuser.com/questions/76665/how-to-play-a-pcm-file-on-an-unix-system/1063230#1063230

    Tested on Ubuntu 18.04.

    Canon in D in C

    Here is a more interesting synthesis example.

    Outcome: https://www.youtube.com/watch?v=JISozfHATms

    main.c

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    typedef uint16_t point_type_t;
    
    double PI2;
    
    void write_ampl(FILE *f, point_type_t ampl) {
        uint8_t bytes[2];
        bytes[0] = ampl >> 8;
        bytes[1] = ampl & 0xFF;
        fwrite(bytes, 2, sizeof(uint8_t), f);
    }
    
    /* https://en.wikipedia.org/wiki/Piano_key_frequencies */
    double piano_freq(unsigned int i) {
        return 440.0 * pow(2, (i - 49.0) / 12.0);
    }
    
    /* Chord formed by the nth note of the piano. */
    point_type_t piano_sum(unsigned int max_ampl, unsigned int time,
            double sample_freq, unsigned int nargs, unsigned int *notes) {
        unsigned int i;
        double sum = 0;
        for (i = 0 ; i < nargs; ++i)
            sum += sin(PI2 * time * piano_freq(notes[i]) / sample_freq);
        return max_ampl * 0.5 * (nargs + sum) / nargs;
    }
    
    enum notes {
        A0 = 1, AS0, B0,
        C1, C1S, D1, D1S, E1, F1, F1S, G1, G1S, A1, A1S, B1,
        C2, C2S, D2, D2S, E2, F2, F2S, G2, G2S, A2, A2S, B2,
        C3, C3S, D3, D3S, E3, F3, F3S, G3, G3S, A3, A3S, B3,
        C4, C4S, D4, D4S, E4, F4, F4S, G4, G4S, A4, A4S, B4,
        C5, C5S, D5, D5S, E5, F5, F5S, G5, G5S, A5, A5S, B5,
        C6, C6S, D6, D6S, E6, F6, F6S, G6, G6S, A6, A6S, B6,
        C7, C7S, D7, D7S, E7, F7, F7S, G7, G7S, A7, A7S, B7,
        C8,
    };
    
    int main(void) {
        FILE *f;
        PI2 = 2 * acos(-1.0);
        const double SAMPLE_FREQ = 44100;
        point_type_t ampl;
        point_type_t max_ampl = UINT16_MAX;
        unsigned int t, i;
        unsigned int samples_per_unit = SAMPLE_FREQ * 0.375;
        unsigned int *ip[] = {
            (unsigned int[]){4, 2, C3, E4},
            (unsigned int[]){4, 2, G3, D4},
            (unsigned int[]){4, 2, A3, C4},
            (unsigned int[]){4, 2, E3, B3},
    
            (unsigned int[]){4, 2, F3, A3},
            (unsigned int[]){4, 2, C3, G3},
            (unsigned int[]){4, 2, F3, A3},
            (unsigned int[]){4, 2, G3, B3},
    
            (unsigned int[]){4, 3, C3, G4, E5},
            (unsigned int[]){4, 3, G3, B4, D5},
            (unsigned int[]){4, 2, A3,     C5},
            (unsigned int[]){4, 3, E3, G4, B4},
    
            (unsigned int[]){4, 3, F3, C4, A4},
            (unsigned int[]){4, 3, C3, G4, G4},
            (unsigned int[]){4, 3, F3, F4, A4},
            (unsigned int[]){4, 3, G3, D4, B4},
    
            (unsigned int[]){2, 3, C4, E4, C5},
            (unsigned int[]){2, 3, C4, E4, C5},
            (unsigned int[]){2, 3, G3, D4, D5},
            (unsigned int[]){2, 3, G3, D4, B4},
    
            (unsigned int[]){2, 3, A3, C4, C5},
            (unsigned int[]){2, 3, A3, C4, E5},
            (unsigned int[]){2, 2, E3,     G5},
            (unsigned int[]){2, 2, E3,     G4},
    
            (unsigned int[]){2, 3, F3, A3, A4},
            (unsigned int[]){2, 3, F3, A3, F4},
            (unsigned int[]){2, 3, C3,     E4},
            (unsigned int[]){2, 3, C3,     G4},
    
            (unsigned int[]){2, 3, F3, A3, F4},
            (unsigned int[]){2, 3, F3, A3, C5},
            (unsigned int[]){2, 3, G3, B3, B4},
            (unsigned int[]){2, 3, G3, B3, G4},
    
            (unsigned int[]){2, 3, C4, E4, C5},
            (unsigned int[]){1, 3, C4, E4, E5},
            (unsigned int[]){1, 3, C4, E4, G5},
            (unsigned int[]){1, 2, G3,     G5},
            (unsigned int[]){1, 2, G3,     A5},
            (unsigned int[]){1, 2, G3,     G5},
            (unsigned int[]){1, 2, G3,     F5},
    
            (unsigned int[]){3, 3, A3, C4, E5},
            (unsigned int[]){1, 3, A3, C4, E5},
            (unsigned int[]){1, 3, E3, G3, E5},
            (unsigned int[]){1, 3, E3, G3, F5},
            (unsigned int[]){1, 3, E3, G3, E5},
            (unsigned int[]){1, 3, E3, G3, D5},
        };
        f = fopen("canon.raw", "wb");
        for (i = 0; i < sizeof(ip) / sizeof(int*); ++i) {
            unsigned int *cur = ip[i];
            unsigned int total = samples_per_unit * cur[0];
            for (t = 0; t < total; ++t) {
                ampl = piano_sum(max_ampl, t, SAMPLE_FREQ, cur[1], &cur[2]);
                write_ampl(f, ampl);
            }
        }
        fclose(f);
        return EXIT_SUCCESS;
    }
    

    GitHub upstream.

    For YouTube, I prepared it as:

    wget -O canon.png https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/The_C_Programming_Language_logo.svg/564px-The_C_Programming_Language_logo.svg.png
    ffmpeg -loop 1 -y -i canon.png -i canon.flac -shortest -acodec copy -vcodec vp9 canon.mkv
    

    as explained at: https://superuser.com/questions/700419/how-to-convert-mp3-to-youtube-allowed-video-format/1472572#1472572

    Tested on Ubuntu 18.04.

    Physics

    Audio is encoded as a single number for every moment in time. Compare that to a video, which needs WIDTH * HEIGHT numbers per moment in time.

    This number is then converted to the linear displacement of the diaphragm of your speaker:

    |   /
    |  /
    |-/
    | | A   I   R
    |-\
    |  \
    |   \
    <-> displacement
    
    |     /
    |    /
    |---/
    |   | A I   R
    |---\
    |    \
    |     \
    <---> displacement
    
    |       /
    |      /
    |-----/
    |     | A I R
    |-----\
    |      \
    |       \
    <-----> displacement
    

    The displacement pushes air backwards and forwards, creating pressure differences, which travel through air as P-waves.

    Only changes in displacement matter: a constant signal, even a maximal one, produces no sound, because the diaphragm just stays at a fixed position.

    The sampling frequency determines how often per second the displacement value is updated.

    44.1 kHz is a common sampling frequency because humans can hear up to about 20 kHz, and because the Nyquist–Shannon sampling theorem requires the sampling rate to be more than twice the highest frequency to be reproduced (2 × 20 kHz = 40 kHz, plus some margin).
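
    A small sketch of why "more than twice" matters: at a 44.1 kHz sampling rate, a tone above fs/2 produces exactly the same samples (up to sign) as a tone below fs/2, so the two are indistinguishable after sampling. This is aliasing. The 1 kHz / 43.1 kHz pair below is just an illustrative choice.

    #include <math.h>
    #include <stdio.h>
    
    int main(void) {
        const double PI2 = 2 * acos(-1.0);
        const double SAMPLE_FREQ = 44100.0;
        const double F_LOW = 1000.0;                 /* below fs/2 = 22,050 Hz */
        const double F_HIGH = SAMPLE_FREQ - F_LOW;   /* 43,100 Hz, above fs/2 */
        for (unsigned int t = 0; t < 5; ++t) {
            double low  = sin(PI2 * t * F_LOW  / SAMPLE_FREQ);
            double high = sin(PI2 * t * F_HIGH / SAMPLE_FREQ);
            /* high is -low (up to rounding): the sampled 43.1 kHz tone carries
             * no more information than the 1 kHz tone; it has "folded back". */
            printf("t=%u  1 kHz: % .6f  43.1 kHz: % .6f\n", t, low, high);
        }
        return 0;
    }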

    The sampling frequency is analogous to the FPS of video, although its value is much higher than the 24 (cinema) to 144 (hardcore gaming monitors) FPS range we commonly see for video.

    Formats

    Uncompressed:

    • .raw is an underspecified format that contains just the amplitude bytes, and no metadata.

      We have to pass a few metadata parameters, like the sampling frequency, on the command line because the format does not contain that data.

    • .wav is another popular uncompressed format which contains all the needed metadata (a minimal sketch of its header follows this list): WAV File Synthesis From Scratch - C

    • MIDI (.mid): https://en.wikipedia.org/wiki/MIDI

      This format represents keystrokes of an instrument. It is what a basic digital keyboard will output to a computer. File sizes can be very small as a result, but it can't represent arbitrary sounds, only notes and similar events.

      Conversion to MP3: https://softwarerecs.stackexchange.com/questions/10915/automatically-turn-midi-files-into-wav-or-mp3/76955#76955
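
    Since .wav is just a small self-describing header followed by the PCM samples, here is a minimal sketch of the canonical 44-byte header for uncompressed PCM. The function and helper names (write_wav_header, write_le16, write_le32) are hypothetical. Note that 16-bit WAV samples are signed little-endian (s16le), unlike the unsigned big-endian samples in the raw examples above.

    #include <stdint.h>
    #include <stdio.h>
    
    /* Little-endian helpers: WAV stores multi-byte fields least significant byte first. */
    static void write_le16(FILE *f, uint16_t v) {
        uint8_t b[2] = { (uint8_t)v, (uint8_t)(v >> 8) };
        fwrite(b, 1, 2, f);
    }
    static void write_le32(FILE *f, uint32_t v) {
        uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8), (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
        fwrite(b, 1, 4, f);
    }
    
    /* Canonical 44-byte header for uncompressed PCM; the raw samples follow it. */
    void write_wav_header(FILE *f, uint32_t sample_rate, uint16_t channels,
            uint16_t bits_per_sample, uint32_t data_bytes) {
        fwrite("RIFF", 1, 4, f);
        write_le32(f, 36 + data_bytes);   /* size of everything after this field */
        fwrite("WAVE", 1, 4, f);
        fwrite("fmt ", 1, 4, f);
        write_le32(f, 16);                /* size of the fmt chunk */
        write_le16(f, 1);                 /* 1 = PCM (uncompressed) */
        write_le16(f, channels);
        write_le32(f, sample_rate);
        write_le32(f, sample_rate * channels * bits_per_sample / 8);  /* byte rate */
        write_le16(f, channels * bits_per_sample / 8);                /* block align */
        write_le16(f, bits_per_sample);
        fwrite("data", 1, 4, f);
        write_le32(f, data_bytes);        /* followed by the samples themselves */
    }

    A file written this way, header first and then the sample bytes, should open in common players without the -f / -ar / -ac flags that raw files need.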

    In practice, most people deal exclusively with compressed formats, which make files much smaller to store and stream. Some of those formats take into account characteristics of the human ear to further compress the audio in a lossy way. The most popular royalty-free formats as of 2019 appear to be:

    • lossless: FLAC
    • lossy: Vorbis

    Biology

    Humans perceive sound mostly through its frequency decomposition (AKA the Fourier transform).

    I think this is because the inner ear has parts which resonate at different frequencies (TODO confirm).

    Therefore, when synthesizing music, we think more in terms of adding up frequencies than in terms of points in time. This is illustrated in the Canon synthesis example above.

    This leads to thinking in terms of a 1D vector between 20 Hz and 20 kHz for each point in time.

    The mathematical Fourier transform loses the notion of time, so what we do when synthesizing is to take groups of points, sum up the frequencies we want for each group, and take the Fourier transform there.

    Luckily, the Fourier transform is linear, so we can just add up and normalize displacements directly.

    The size of each group of points leads to a time-frequency precision tradeoff, mediated by the same mathematics as Heisenberg's uncertainty principle.

    Wavelets may be a more precise mathematical description of this intermediary time-frequency representation.
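
    To make the "groups of points" idea concrete, here is a small sketch that takes one block of 1024 samples of the 1 kHz sine from the first example and computes a naive DFT (no FFT library), printing only the bins that hold most of the energy. The bin spacing fs / N is the frequency resolution of that block, which is the time-frequency tradeoff mentioned above; the block size and threshold are arbitrary illustration choices.

    #include <math.h>
    #include <stdio.h>
    
    #define N 1024  /* size of the analysis block ("group of points") */
    
    int main(void) {
        const double PI2 = 2 * acos(-1.0);
        const double SAMPLE_FREQ = 44100.0;
        double x[N];
        /* One block of samples: the 1 kHz sine from the first example. */
        for (int n = 0; n < N; ++n)
            x[n] = sin(PI2 * 1000.0 * n / SAMPLE_FREQ);
        /* Naive DFT: how much of frequency k * fs / N is present in this block. */
        for (int k = 0; k < N / 2; ++k) {
            double re = 0, im = 0;
            for (int n = 0; n < N; ++n) {
                re += x[n] * cos(PI2 * k * n / N);
                im -= x[n] * sin(PI2 * k * n / N);
            }
            double mag = sqrt(re * re + im * im) / N;
            if (mag > 0.1)  /* only print bins that clearly contain energy */
                printf("bin %d (~%.0f Hz): %.3f\n", k, k * SAMPLE_FREQ / N, mag);
        }
        return 0;
    }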

    Quick ways to generate common tones out of the box

    The amazing FFmpeg library covers several of them: Linux sine wave audio generator

    sudo apt-get install ffmpeg
    ffmpeg -f lavfi -i "sine=frequency=1000:duration=5" out.wav
    

    Python pyo

    https://github.com/belangeo/pyo

    Python sound library.

    Got it to work after a bit of frustration: Pyo server.boot() fails with pyolib._core.PyoServerStateException on Ubuntu 14.04

    Csound

    https://en.wikipedia.org/wiki/Csound

    https://github.com/csound/csound

    Program that reads a custom XML format that allows you to create some very funky sounds.

    sudo apt install csound
    

    Here's a really cool and advanced demo: https://github.com/csound/csound/blob/b319c336d31d942af2d279b636339df83dc9f9f9/examples/xanadu.csd rendered at: https://www.youtube.com/watch?v=7fXhVMDCfaA

    abcmidi

    Nice project that converts MIDI to the ABC notation and vice versa, allowing you to edit a MIDI file in your text editor: https://sound.stackexchange.com/questions/39457/how-to-open-midi-file-in-text-editor/50058#50058

    MusicXML

    https://en.wikipedia.org/wiki/MusicXML

    An attempt to standardize music sheet representation.

    I can't easily find how to convert it to an audio format from the command line, however: Convert musicxml to wav?

    MuseScore

    https://github.com/musescore/MuseScore

    The best FOSS scorewriter GUI I've seen so far. You can really compose for an orchestra with this.

    Other high level out-of-box open source synthesizers for Linux

    If you are going down this road, you might as well have a look at the big boys to learn about common synthesis techniques:

    • https://www.youtube.com/watch?v=cXCwH9n3M-c Zyn-Fusion + Ardour synthesis music composition tutorial "from scratch" by unfa, big kudos to him
    • how to get ZynAddSubFX running on Ubuntu: https://askubuntu.com/questions/340204/zynaddsubfx-fails-to-initialize-with-error-message-default-i-o-did-not-initiali/1238546#1238546
    • http://linuxsynths.com/ quick review of ALL of them with generated audio samples
  • 2020-11-29 17:35

    Look up things like analog-digital conversion. That should get you started. These devices can convert an audio signal (e.g., sine waves) into a digital representation. So, a 16-bit ADC can represent samples with values between -32768 and 32767. This is in fixed-point. It is also possible to do it in floating-point (usually not recommended for performance reasons, but it may be needed for range reasons). The opposite (digital-analog conversion) happens when we convert numbers back to sine waves. This is handled by something called a DAC.
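
    A tiny sketch of that fixed-point idea: the same [-1, 1] "analog" value stored as a 16-bit integer (the ADC direction) and scaled back (the DAC direction), showing the rounding error that the finite resolution introduces. The test value is made up; compile with -lm for lrint.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    
    int main(void) {
        double analog = 0.123456789;                        /* value in [-1, 1] */
        int16_t fixed = (int16_t)lrint(analog * 32767.0);   /* "ADC": to 16-bit fixed-point */
        double back = fixed / 32767.0;                      /* "DAC": back to the analog range */
        printf("analog        = % .9f\n", analog);
        printf("16-bit sample = %d\n", fixed);
        printf("reconstructed = % .9f\n", back);
        printf("rounding error= % .9f\n", analog - back);
        return 0;
    }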

  • 2020-11-29 17:37

    I think samples of the waveform at a specific sample frequency would be the most basic representation.

  • 2020-11-29 17:39

    I think a good way to start playing with audio would be with Processing and Minim. This program will draw the frequency spectrum of sound from your microphone!

    import ddf.minim.*;
    import ddf.minim.analysis.*;
    
    AudioInput in;
    FFT fft;
    
    void setup()
    {
      size(1024, 600);
      noSmooth();
      Minim.start(this);
      in = Minim.getLineIn();
      fft = new FFT(in.bufferSize(), in.sampleRate());
    }
    
    void draw()
    {
      background(0);
      // Compute the FFT of the current input buffer (the left/right mix).
      fft.forward(in.mix);
      stroke(255);
      // Draw one vertical line per frequency bin, scaled by its amplitude.
      for(int i = 0; i < fft.specSize(); i++)
        line(i*2+1, height, i*2+1, height - fft.getBand(i)*10);
    }
    
    void stop()
    {
      in.close();
      Minim.stop();
      super.stop();
    }
    
  • 2020-11-29 17:40

    Physically, as you probably know, audio is a vibration. Typically, we're talking about vibrations of air between approximately 20 Hz and 20,000 Hz. That means the air is moving back and forth 20 to 20,000 times per second. For this discussion, imagine a pure tone: a single vibration at one frequency.

    If you measure that vibration and convert it to an electrical signal (say, using a microphone), you'll get an electrical signal with the voltage varying in the same waveform as the sound. In our pure-tone hypothetical, that waveform will match that of the sine function.

    Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.

    Arbitrarily, we'll change the scale on our volt meter: we'll multiply the volts by 32767, so -1 V now reads as -32767 and +1 V reads as 32767. Oh, and it'll round to the nearest integer.

    Now, we hook our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD.

    This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.
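
    A small sketch of exactly that data layout, assuming a little-endian machine (so the file comes out as s16le) and two made-up test tones: two "volt meter" readings per instant, each scaled to a signed 16-bit integer and interleaved left, right, left, right, 44,100 frames per second.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    int main(void) {
        const double PI2 = 2 * acos(-1.0);
        const double SAMPLE_FREQ = 44100.0;
        FILE *f = fopen("stereo.raw", "wb");
        if (!f)
            return EXIT_FAILURE;
        for (unsigned int t = 0; t < 2 * 44100; ++t) {  /* 2 seconds of audio */
            /* The two "volt meters": 440 Hz on the left, ~554 Hz on the right. */
            double left  = sin(PI2 * t * 440.0  / SAMPLE_FREQ);
            double right = sin(PI2 * t * 554.37 / SAMPLE_FREQ);
            /* Scale [-1, 1] to signed 16-bit and interleave one frame: left, right. */
            int16_t frame[2] = {
                (int16_t)lrint(left  * 32767.0),
                (int16_t)lrint(right * 32767.0),
            };
            fwrite(frame, sizeof(int16_t), 2, f);  /* native byte order: s16le on x86 */
        }
        fclose(f);
        return EXIT_SUCCESS;
    }

    Play it back with, for example:

    ffplay -autoexit -f s16le -ar 44100 -ac 2 stereo.raw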
