How is audio represented with numbers in computers?

后端未结


I like thinking about how everything can be and is represented by numbers. For example, plaintext is represented by a code like ASCII, and images are represented by RGB valu


孤独总比滥情好

2020-11-29 17:34
There are two steps involved in converting analog audio into digital form:
1. Sampling
2. Quantization
Sampling

The rate at which a continuous waveform (in this case, audio) is sampled is called the sampling rate. Humans perceive frequencies from about 20 Hz to 20,000 Hz. By the Nyquist sampling theorem, a signal must be sampled at more than twice its highest frequency to be reconstructed, so the CD sampling rate of 44,100 Hz covers frequencies in the range 0 - 22,050 Hz.

Quantization

Each value produced by the sampling phase must then be mapped to one of a finite number of levels. 8-bit quantization provides 256 possible values, while 16-bit quantization provides up to 65,536.
终归单人心

2020-11-29 17:35
Minimal C audio generation example

The example below generates a pure 1000 Hz sine wave in raw format. At the common 44.1 kHz sampling rate, it lasts 4 seconds.

main.c:
```
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    FILE *f;
    const double PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    const unsigned int NSAMPLES = 4 * SAMPLE_FREQ;
    uint16_t ampl;
    uint8_t bytes[2];
    unsigned int t;

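    f = fopen("out.raw", "wb");
    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    for (t = 0; t < NSAMPLES; ++t) {
        /* Map the sine from [-1, 1] to the unsigned 16-bit range [0, 65535]. */
        ampl = UINT16_MAX * 0.5 * (1.0 + sin(PI2 * t * 1000.0 / SAMPLE_FREQ));
        /* Store big-endian: most significant byte first (u16be for FFmpeg). */
        bytes[0] = ampl >> 8;
        bytes[1] = ampl & 0xFF;
        /* fwrite takes (pointer, element size, element count, stream). */
        fwrite(bytes, sizeof(uint8_t), 2, f);
    }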
    fclose(f);
    return EXIT_SUCCESS;
}
```
GitHub upstream.

Generate out.raw:
```
gcc -std=c99 -o main main.c -lm
./main
```
Play out.raw directly:
```
sudo apt-get install ffmpeg
ffplay -autoexit -f u16be -ar 44100 -ac 1 out.raw
```
or convert to a more common audio format and then play with a more common audio player:
```
ffmpeg -f u16be -ar 44100 -ac 1 -i out.raw out.flac
vlc out.flac
```
Generated FLAC file: https://github.com/cirosantilli/media/blob/master/canon.flac

Parameters explained at: https://superuser.com/questions/76665/how-to-play-a-pcm-file-on-an-unix-system/1063230#1063230

Tested on Ubuntu 18.04.

Canon in D in C

Here is a more interesting synthesis example.

Outcome: https://www.youtube.com/watch?v=JISozfHATms

main.c
```
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint16_t point_type_t;

double PI2;

void write_ampl(FILE *f, point_type_t ampl) {
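    uint8_t bytes[2];
    /* Store big-endian: most significant byte first (u16be for FFmpeg). */
    bytes[0] = ampl >> 8;
    bytes[1] = ampl & 0xFF;
    /* fwrite takes (pointer, element size, element count, stream). */
    fwrite(bytes, sizeof(uint8_t), 2, f);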
}

/* https://en.wikipedia.org/wiki/Piano_key_frequencies */
double piano_freq(unsigned int i) {
    return 440.0 * pow(2, (i - 49.0) / 12.0);
}

/* Sample at `time` of a chord made of the `nargs` piano keys in `notes`,
 * scaled to the range [0, max_ampl]. */
point_type_t piano_sum(unsigned int max_ampl, unsigned int time,
        double sample_freq, unsigned int nargs, unsigned int *notes) {
    unsigned int i;
    double sum = 0;
    for (i = 0 ; i < nargs; ++i)
        sum += sin(PI2 * time * piano_freq(notes[i]) / sample_freq);
    return max_ampl * 0.5 * (nargs + sum) / nargs;
}

enum notes {
    A0 = 1, AS0, B0,
    C1, C1S, D1, D1S, E1, F1, F1S, G1, G1S, A1, A1S, B1,
    C2, C2S, D2, D2S, E2, F2, F2S, G2, G2S, A2, A2S, B2,
    C3, C3S, D3, D3S, E3, F3, F3S, G3, G3S, A3, A3S, B3,
    C4, C4S, D4, D4S, E4, F4, F4S, G4, G4S, A4, A4S, B4,
    C5, C5S, D5, D5S, E5, F5, F5S, G5, G5S, A5, A5S, B5,
    C6, C6S, D6, D6S, E6, F6, F6S, G6, G6S, A6, A6S, B6,
    C7, C7S, D7, D7S, E7, F7, F7S, G7, G7S, A7, A7S, B7,
    C8,
};

int main(void) {
    FILE *f;
    PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    point_type_t ampl;
    point_type_t max_ampl = UINT16_MAX;
    unsigned int t, i;
    unsigned int samples_per_unit = SAMPLE_FREQ * 0.375;
    unsigned int *ip[] = {
        (unsigned int[]){4, 2, C3, E4},
        (unsigned int[]){4, 2, G3, D4},
        (unsigned int[]){4, 2, A3, C4},
        (unsigned int[]){4, 2, E3, B3},

        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, C3, G3},
        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, G3, B3},

        (unsigned int[]){4, 3, C3, G4, E5},
        (unsigned int[]){4, 3, G3, B4, D5},
        (unsigned int[]){4, 2, A3,     C5},
        (unsigned int[]){4, 3, E3, G4, B4},

        (unsigned int[]){4, 3, F3, C4, A4},
        (unsigned int[]){4, 3, C3, G4, G4},
        (unsigned int[]){4, 3, F3, F4, A4},
        (unsigned int[]){4, 3, G3, D4, B4},

        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, G3, D4, D5},
        (unsigned int[]){2, 3, G3, D4, B4},

        (unsigned int[]){2, 3, A3, C4, C5},
        (unsigned int[]){2, 3, A3, C4, E5},
        (unsigned int[]){2, 2, E3,     G5},
        (unsigned int[]){2, 2, E3,     G4},

        (unsigned int[]){2, 3, F3, A3, A4},
        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 2, C3,     E4},
        (unsigned int[]){2, 2, C3,     G4},

        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 3, F3, A3, C5},
        (unsigned int[]){2, 3, G3, B3, B4},
        (unsigned int[]){2, 3, G3, B3, G4},

        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){1, 3, C4, E4, E5},
        (unsigned int[]){1, 3, C4, E4, G5},
        (unsigned int[]){1, 2, G3,     G5},
        (unsigned int[]){1, 2, G3,     A5},
        (unsigned int[]){1, 2, G3,     G5},
        (unsigned int[]){1, 2, G3,     F5},

        (unsigned int[]){3, 3, A3, C4, E5},
        (unsigned int[]){1, 3, A3, C4, E5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, F5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, D5},
    };
    f = fopen("canon.raw", "wb");
    for (i = 0; i < sizeof(ip) / sizeof(int*); ++i) {
        unsigned int *cur = ip[i];
        unsigned int total = samples_per_unit * cur[0];
        for (t = 0; t < total; ++t) {
            ampl = piano_sum(max_ampl, t, SAMPLE_FREQ, cur[1], &cur[2]);
            write_ampl(f, ampl);
        }
    }
    fclose(f);
    return EXIT_SUCCESS;
}
```
GitHub upstream.

For YouTube, I prepared it as:
```
wget -O canon.png https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/The_C_Programming_Language_logo.svg/564px-The_C_Programming_Language_logo.svg.png
ffmpeg -loop 1 -y -i canon.png -i canon.flac -shortest -acodec copy -vcodec vp9 canon.mkv
```
as explained at: https://superuser.com/questions/700419/how-to-convert-mp3-to-youtube-allowed-video-format/1472572#1472572

Tested on Ubuntu 18.04.

Physics

Audio is encoded as a single number for every moment in time. Compare that to a video, which needs WIDTH * HEIGHT numbers per moment in time.

This number is then converted to the linear displacement of the diaphragm of your speaker:
```
|   /
|  /
|-/
| | A   I   R
|-\
|  \
|   \
<-> displacement

|     /
|    /
|---/
|   | A I   R
|---\
|    \
|     \
<---> displacement

|       /
|      /
|-----/
|     | A I R
|-----\
|      \
|       \
<-----> displacement
```
The displacement pushes air backwards and forwards, creating pressure differences, which travel through air as P-waves.

Only displacement matters: a constant signal, even if maximal, produces no sound: the diaphragm just stays at a fixed position.

The sampling frequency determines how fast the displacements should be done.

44.1 kHz is a common sampling frequency because humans can hear up to about 20 kHz and, by the Nyquist–Shannon sampling theorem, the sampling frequency must be more than twice the highest frequency to be captured.

The sampling frequency is analogous to the FPS for video, although it has a much higher value compared to the 24 (cinema) - 144 (hardcore gaming monitors) range we commonly see for video.

Formats

Uncompressed:
- .raw is an underspecified format that contains just the amplitude bytes, and no metadata.
  
  We have to pass a few metadata parameters on the command line, such as the sampling frequency, because the format does not contain that data.
- .wav is another popular uncompressed format, which contains all needed metadata: WAV File Synthesis From Scratch - C
- MIDI (.mid): https://en.wikipedia.org/wiki/MIDI
  
  This format represents keystrokes of an instrument. It is what a basic digital keyboard will output to a computer. File sizes can be very small as a result, but it can't necessarily represent "arbitrary sounds", more like notes.
  
  Conversion to MP3: https://softwarerecs.stackexchange.com/questions/10915/automatically-turn-midi-files-into-wav-or-mp3/76955#76955
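To illustrate the difference between .raw and .wav described in the list above, here is a minimal sketch that writes the canonical 44-byte header used by uncompressed PCM WAV files. The helper names `put_le` and `write_wav_header` are hypothetical, and the sketch only covers the plain PCM case, not the extended header variants.

```c
#include <stdint.h>
#include <stdio.h>

/* Write the low `n` bytes of `v` in little-endian order, as RIFF requires. */
static void put_le(FILE *f, uint32_t v, int n) {
    int i;
    for (i = 0; i < n; ++i)
        fputc((int)((v >> (8 * i)) & 0xFF), f);
}

/* Write a 44-byte PCM WAV header. `nframes` is the number of samples per
 * channel that will follow the header. */
void write_wav_header(FILE *f, uint32_t sample_rate, uint16_t channels,
        uint16_t bits_per_sample, uint32_t nframes) {
    uint32_t block_align = channels * (bits_per_sample / 8u);
    uint32_t data_size = nframes * block_align;
    fputs("RIFF", f);
    put_le(f, 36 + data_size, 4);            /* bytes remaining after this field */
    fputs("WAVE", f);
    fputs("fmt ", f);
    put_le(f, 16, 4);                        /* size of the fmt chunk */
    put_le(f, 1, 2);                         /* audio format 1 = linear PCM */
    put_le(f, channels, 2);
    put_le(f, sample_rate, 4);
    put_le(f, sample_rate * block_align, 4); /* bytes per second */
    put_le(f, block_align, 2);               /* bytes per frame */
    put_le(f, bits_per_sample, 2);
    fputs("data", f);
    put_le(f, data_size, 4);                 /* size of the sample data */
}
```

Note that 16-bit WAV samples are signed and little-endian, while the raw examples in this answer write unsigned big-endian samples, so those would need converting (subtract 32768 and swap the byte order) before being appended after this header.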
In practice, most people deal exclusively with compressed formats, which make files and streams much smaller. Some of those formats take into account characteristics of the human ear to further compress the audio in a lossy way. The most popular royalty-free formats as of 2019 appear to be:
- lossless: FLAC
- lossy: Vorbis
Biology

Humans perceive sound mostly by its frequency decomposition (AKA Fourier transform).

I think this is because the inner ear has parts which resonate to different frequencies (TODO confirm).

Therefore, when synthesizing music, we think more in terms of adding up frequencies instead of points in time. This is illustrated in this example.

This leads to thinking in terms of a 1D vector between 20Hz and 20kHz for each point in time.

The mathematical Fourier transform loses the notion of time, so what we do when synthesizing is to take groups of points, and sum up frequencies for that group, and take the Fourier transform there.

Luckily, the Fourier transform is linear, so we can just add up and normalize displacements directly.

The size of each group of points leads to a time - frequency precision tradeoff, mediated by the same mathematics as Heisenberg's uncertainty principle.

Wavelets may be a more precise mathematical description of this intermediary time - frequency description.

Quick ways to generate common tones out of the box

The amazing FFmpeg library covers several of them: Linux sine wave audio generator
```
sudo apt-get install ffmpeg
ffmpeg -f lavfi -i "sine=frequency=1000:duration=5" out.wav
```
Python pyo

https://github.com/belangeo/pyo

Python sound library.

Got it to work after a bit of frustration: Pyo server.boot() fails with pyolib._core.PyoServerStateException on Ubuntu 14.04

Csound

https://en.wikipedia.org/wiki/Csound

https://github.com/csound/csound

Program that reads a custom XML format that allows you to create some very funky sounds.
```
sudo apt install csound
```
Here's a really cool and advanced demo: https://github.com/csound/csound/blob/b319c336d31d942af2d279b636339df83dc9f9f9/examples/xanadu.csd rendered at: https://www.youtube.com/watch?v=7fXhVMDCfaA

abcmidi

Nice project that converts MIDI to the ABC notation and vice versa, allowing you to edit a MIDI file in your text editor: https://sound.stackexchange.com/questions/39457/how-to-open-midi-file-in-text-editor/50058#50058

MusicXML

https://en.wikipedia.org/wiki/MusicXML

An attempt to standardize music sheet representation.

I can't easily find how to convert it to an audio format from the command line, however: Convert musicxml to wav?

MuseScore

https://github.com/musescore/MuseScore

The best FOSS scorewriter GUI I've seen so far. You can really compose for an orchestra with this.

Other high level out-of-box open source synthesizers for Linux

If you are going down this road, you might as well have a look at the big boys to learn about common synthesis techniques:
- https://www.youtube.com/watch?v=cXCwH9n3M-c Zyn-Fusion + Ardour synthesis music composition tutorial "from scratch" by unfa, big kudos to him
- how to get ZynAddSubFX running on Ubuntu: https://askubuntu.com/questions/340204/zynaddsubfx-fails-to-initialize-with-error-message-default-i-o-did-not-initiali/1238546#1238546
- http://linuxsynths.com/ quick review of ALL of them with generated audio samples
太阳男子

2020-11-29 17:35

Look up things like analog-to-digital conversion. That should get you started. These devices convert an audio signal (sine waves) into a digital representation. So, a 16-bit ADC would be able to represent a sine wave with integer values from -32768 to 32767. This is fixed-point. It is also possible to do it in floating-point (though not recommended for performance reasons, it may be needed for range reasons). The opposite (digital-to-analog conversion) happens when we convert numbers back to sine waves. This is handled by something called a DAC.

南旧

2020-11-29 17:37

I think samples of the waveform at a specific sample frequency would be the most basic representation.


有刺的猬

2020-11-29 17:39

I think a good way to start playing with audio would be with Processing and Minim. This program will draw the frequency spectrum of sound from your microphone!

```
import ddf.minim.*;
import ddf.minim.analysis.*;

AudioInput in;
FFT fft;

void setup()
{
  size(1024, 600);
  noSmooth();
  Minim.start(this);
  in = Minim.getLineIn();
  fft = new FFT(in.bufferSize(), in.sampleRate());
}

void draw()
{
  background(0);
  fft.forward(in.mix);
  stroke(255);
  for(int i = 0; i < fft.specSize(); i++)
    line(i*2+1, height, i*2+1, height - fft.getBand(i)*10);
}

void stop()
{
  in.close();
  Minim.stop();
  super.stop();
}
```


臣服心动

2020-11-29 17:40

Physically, as you probably know, audio is a vibration. Typically, we're talking about vibrations of air between approximately 20 Hz and 20,000 Hz. That means the air is moving back and forth 20 to 20,000 times per second.

If you measure that vibration and convert it to an electrical signal (say, using a microphone), you'll get an electrical signal with the voltage varying in the same waveform as the sound. In our pure-tone hypothetical, that waveform will match that of the sine function.

Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.

Arbitrarily, we'll change the scale on our volt meter. We'll multiply the volts by 32767. It now calls -1V -32767 and +1V 32767. Oh, and it'll round to the nearest integer.

Now, we hook our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD.

This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.
