I like thinking about how everything can be and is represented by numbers. For example, plaintext is represented by a code like ASCII, and images are represented by RGB values.
There are two steps involved in converting actual analog audio into a digital form.
Sampling
The rate at which a continuous waveform (in this case, audio) is sampled is called the sampling rate. The frequency range perceived by humans is roughly 20 - 20,000 Hz. By the Nyquist sampling theorem, the sampling rate must be at least twice the highest frequency you want to capture, so the 44,100 Hz rate used by CDs covers frequencies in the range 0 - 22,050 Hz.
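To make the Nyquist limit concrete, here is a minimal sketch in C. The function names `nyquist_limit` and `aliased_freq` are made up for illustration, and frequencies are whole numbers of Hz to keep the arithmetic exact. It shows what frequency a pure tone appears to have once sampled: anything above half the sampling rate folds back into the representable range.

```c
/* Highest frequency (Hz) that a given sampling rate can represent. */
unsigned long nyquist_limit(unsigned long sample_rate) {
    return sample_rate / 2;
}

/* Apparent frequency (Hz) of a pure tone after sampling at sample_rate.
 * Tones above the Nyquist limit fold back ("alias") into 0..sample_rate/2. */
unsigned long aliased_freq(unsigned long freq, unsigned long sample_rate) {
    /* The sampled spectrum repeats every sample_rate Hz. */
    unsigned long folded = freq % sample_rate;
    /* The upper half of each period mirrors around the Nyquist limit. */
    if (2 * folded > sample_rate)
        folded = sample_rate - folded;
    return folded;
}
```

For example, at 44,100 Hz a 1,000 Hz tone is preserved as is, while a 30,000 Hz tone would show up as 14,100 Hz, which is why the input has to be low-pass filtered before sampling.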
Quantization
The samples produced by the 'Sampling' phase are discrete in time but can still take any amplitude, so each one now needs to be mapped to one of a finite number of values. An 8-bit quantization provides 256 possible values, while a 16-bit quantization provides 65,536 values.
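As a rough sketch of what quantization does, assuming the input sample has already been normalized to the range [-1.0, 1.0] (the helper names `quant_levels` and `quantize` are hypothetical):

```c
#include <stdint.h>

/* Number of distinct levels available with `bits` bits (valid for 1..31). */
uint32_t quant_levels(unsigned int bits) {
    return (uint32_t)1 << bits;
}

/* Map a sample in [-1.0, 1.0] to the nearest of 2^bits unsigned levels.
 * Out-of-range input is clamped. */
uint32_t quantize(double sample, unsigned int bits) {
    uint32_t max_level = quant_levels(bits) - 1;
    if (sample < -1.0) sample = -1.0;
    if (sample > 1.0) sample = 1.0;
    /* Shift to [0, 1], scale to [0, max_level], round to nearest. */
    return (uint32_t)((sample + 1.0) * 0.5 * max_level + 0.5);
}
```

With 8 bits, -1.0 maps to 0 and 1.0 maps to 255. More bits means smaller steps between levels, and therefore less rounding error (quantization noise).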
Minimal C audio generation example
The example below generates a pure 1000 Hz sine wave in raw format. At the common 44.1 kHz sampling rate, it lasts 4 seconds.
main.c:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(void) {
    FILE *f;
    const double PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    const unsigned int NSAMPLES = 4 * SAMPLE_FREQ;
    uint16_t ampl;
    uint8_t bytes[2];
    unsigned int t;

    f = fopen("out.raw", "wb");
    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    for (t = 0; t < NSAMPLES; ++t) {
        /* Map the sine from [-1, 1] to the full unsigned 16-bit range. */
        ampl = UINT16_MAX * 0.5 * (1.0 + sin(PI2 * t * 1000.0 / SAMPLE_FREQ));
        /* Write the sample big-endian, matching the u16be format used below. */
        bytes[0] = ampl >> 8;
        bytes[1] = ampl & 0xFF;
        fwrite(bytes, sizeof(uint8_t), 2, f);
    }
    fclose(f);
    return EXIT_SUCCESS;
}
GitHub upstream.
Generate out.raw:
gcc -std=c99 -o main main.c -lm
./main
Play out.raw directly:
sudo apt-get install ffmpeg
ffplay -autoexit -f u16be -ar 44100 -ac 1 out.raw
or convert to a more common audio format and then play with a more common audio player:
ffmpeg -f u16be -ar 44100 -ac 1 -i out.raw out.flac
vlc out.flac
Generated FLAC file: https://github.com/cirosantilli/media/blob/master/canon.flac
Parameters explained at: https://superuser.com/questions/76665/how-to-play-a-pcm-file-on-an-unix-system/1063230#1063230
Tested on Ubuntu 18.04.
Canon in D in C
Here is a more interesting synthesis example.
Outcome: https://www.youtube.com/watch?v=JISozfHATms
main.c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
typedef uint16_t point_type_t;
double PI2;
/* Write one 16-bit sample in big-endian byte order. */
void write_ampl(FILE *f, point_type_t ampl) {
    uint8_t bytes[2];
    bytes[0] = ampl >> 8;
    bytes[1] = ampl & 0xFF;
    fwrite(bytes, sizeof(uint8_t), 2, f);
}
/* https://en.wikipedia.org/wiki/Piano_key_frequencies */
double piano_freq(unsigned int i) {
    return 440.0 * pow(2, (i - 49.0) / 12.0);
}
/* Sample at the given time of the chord formed by the given piano key
 * numbers, scaled to the range [0, max_ampl]. */
point_type_t piano_sum(unsigned int max_ampl, unsigned int time,
        double sample_freq, unsigned int nargs, unsigned int *notes) {
    unsigned int i;
    double sum = 0;
    for (i = 0 ; i < nargs; ++i)
        sum += sin(PI2 * time * piano_freq(notes[i]) / sample_freq);
    return max_ampl * 0.5 * (nargs + sum) / nargs;
}
enum notes {
    A0 = 1, AS0, B0,
    C1, C1S, D1, D1S, E1, F1, F1S, G1, G1S, A1, A1S, B1,
    C2, C2S, D2, D2S, E2, F2, F2S, G2, G2S, A2, A2S, B2,
    C3, C3S, D3, D3S, E3, F3, F3S, G3, G3S, A3, A3S, B3,
    C4, C4S, D4, D4S, E4, F4, F4S, G4, G4S, A4, A4S, B4,
    C5, C5S, D5, D5S, E5, F5, F5S, G5, G5S, A5, A5S, B5,
    C6, C6S, D6, D6S, E6, F6, F6S, G6, G6S, A6, A6S, B6,
    C7, C7S, D7, D7S, E7, F7, F7S, G7, G7S, A7, A7S, B7,
    C8,
};
int main(void) {
    FILE *f;
    PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    point_type_t ampl;
    point_type_t max_ampl = UINT16_MAX;
    unsigned int t, i;
    unsigned int samples_per_unit = SAMPLE_FREQ * 0.375;
    /* Each entry: {duration in units, number of notes, notes...}. */
    unsigned int *ip[] = {
        (unsigned int[]){4, 2, C3, E4},
        (unsigned int[]){4, 2, G3, D4},
        (unsigned int[]){4, 2, A3, C4},
        (unsigned int[]){4, 2, E3, B3},
        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, C3, G3},
        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, G3, B3},
        (unsigned int[]){4, 3, C3, G4, E5},
        (unsigned int[]){4, 3, G3, B4, D5},
        (unsigned int[]){4, 2, A3, C5},
        (unsigned int[]){4, 3, E3, G4, B4},
        (unsigned int[]){4, 3, F3, C4, A4},
        (unsigned int[]){4, 3, C3, G4, G4},
        (unsigned int[]){4, 3, F3, F4, A4},
        (unsigned int[]){4, 3, G3, D4, B4},
        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, G3, D4, D5},
        (unsigned int[]){2, 3, G3, D4, B4},
        (unsigned int[]){2, 3, A3, C4, C5},
        (unsigned int[]){2, 3, A3, C4, E5},
        (unsigned int[]){2, 2, E3, G5},
        (unsigned int[]){2, 2, E3, G4},
        (unsigned int[]){2, 3, F3, A3, A4},
        (unsigned int[]){2, 3, F3, A3, F4},
        /* The next two entries list only two notes, so the count must be 2;
         * a count of 3 would read past the end of the array. */
        (unsigned int[]){2, 2, C3, E4},
        (unsigned int[]){2, 2, C3, G4},
        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 3, F3, A3, C5},
        (unsigned int[]){2, 3, G3, B3, B4},
        (unsigned int[]){2, 3, G3, B3, G4},
        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){1, 3, C4, E4, E5},
        (unsigned int[]){1, 3, C4, E4, G5},
        (unsigned int[]){1, 2, G3, G5},
        (unsigned int[]){1, 2, G3, A5},
        (unsigned int[]){1, 2, G3, G5},
        (unsigned int[]){1, 2, G3, F5},
        (unsigned int[]){3, 3, A3, C4, E5},
        (unsigned int[]){1, 3, A3, C4, E5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, F5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, D5},
    };
    f = fopen("canon.raw", "wb");
    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    for (i = 0; i < sizeof(ip) / sizeof(ip[0]); ++i) {
        unsigned int *cur = ip[i];
        unsigned int total = samples_per_unit * cur[0];
        for (t = 0; t < total; ++t) {
            ampl = piano_sum(max_ampl, t, SAMPLE_FREQ, cur[1], &cur[2]);
            write_ampl(f, ampl);
        }
    }
    fclose(f);
    return EXIT_SUCCESS;
}
GitHub upstream.
For YouTube, I prepared it as:
wget -O canon.png https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/The_C_Programming_Language_logo.svg/564px-The_C_Programming_Language_logo.svg.png
ffmpeg -loop 1 -y -i canon.png -i canon.flac -shortest -acodec copy -vcodec vp9 canon.mkv
as explained at: https://superuser.com/questions/700419/how-to-convert-mp3-to-youtube-allowed-video-format/1472572#1472572
Tested on Ubuntu 18.04.
Physics
Audio is encoded as a single number for every moment in time. Compare that to a video, which needs WIDTH * HEIGHT numbers per moment in time.
This number is then converted to the linear displacement of the diaphragm of your speaker:
| /
| /
|-/
| | A I R
|-\
| \
| \
<-> displacement
| /
| /
|---/
| | A I R
|---\
| \
| \
<---> displacement
| /
| /
|-----/
| | A I R
|-----\
| \
| \
<-----> displacement
The displacement pushes air backwards and forwards, creating pressure differences, which travel through air as P-waves.
Only changes in displacement matter: a constant signal, even if maximal, produces no sound, because the diaphragm just stays at a fixed position.
The sampling frequency determines how fast the displacements should be done.
44.1 kHz is a common sampling frequency because humans can hear up to about 20 kHz, and the Nyquist–Shannon sampling theorem says that you must sample at more than twice the highest frequency you want to reproduce.
The sampling frequency is analogous to the FPS for video, although it has a much higher value compared to the 25 (cinema) - 144 (hardcore gaming monitors) range we commonly see for video.
Formats
Uncompressed:
.raw is an underspecified format that contains just the amplitude bytes, and no metadata.
We have to pass a few metadata parameters on the command line, like the sampling frequency, because the format does not contain that data.
.wav is another popular uncompressed format, which contains all needed metadata: WAV File Synthesis From Scratch - C
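To give an idea of what that metadata looks like, here is a sketch that fills in the canonical 44-byte header of an uncompressed PCM WAV file. The function name `wav_header` and the helpers are made up for this example. Note that 16-bit WAV samples are signed and little-endian, so the unsigned big-endian samples from the raw examples above would need converting before being appended after this header.

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v) {
    p[0] = v & 0xFF;
    p[1] = (v >> 8) & 0xFF;
    p[2] = (v >> 16) & 0xFF;
    p[3] = (v >> 24) & 0xFF;
}

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v) {
    p[0] = v & 0xFF;
    p[1] = (v >> 8) & 0xFF;
}

/* Fill the 44-byte header of an uncompressed PCM WAV file.
 * data_bytes is the size of the sample data that follows the header. */
void wav_header(uint8_t hdr[44], uint32_t sample_rate, uint16_t channels,
                uint16_t bits_per_sample, uint32_t data_bytes) {
    uint16_t block_align = (uint16_t)(channels * (bits_per_sample / 8));
    memcpy(hdr, "RIFF", 4);
    put_le32(hdr + 4, 36 + data_bytes);            /* file size minus 8 */
    memcpy(hdr + 8, "WAVE", 4);
    memcpy(hdr + 12, "fmt ", 4);
    put_le32(hdr + 16, 16);                        /* fmt chunk size for PCM */
    put_le16(hdr + 20, 1);                         /* format 1 = linear PCM */
    put_le16(hdr + 22, channels);
    put_le32(hdr + 24, sample_rate);
    put_le32(hdr + 28, sample_rate * block_align); /* bytes per second */
    put_le16(hdr + 32, block_align);               /* bytes per sample frame */
    put_le16(hdr + 34, bits_per_sample);
    memcpy(hdr + 36, "data", 4);
    put_le32(hdr + 40, data_bytes);
}
```

Everything that had to be passed on the command line for the .raw file (sampling frequency, channel count, sample size) is stored in those 44 bytes, which is why players can open a .wav file without extra parameters.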
MIDI (.mid): https://en.wikipedia.org/wiki/MIDI
This format represents keystrokes of an instrument. It is what a basic digital keyboard will output to a computer. File sizes can be very small as a result, but it can't necessarily represent "arbitrary sounds", more like notes.
Conversion to MP3: https://softwarerecs.stackexchange.com/questions/10915/automatically-turn-midi-files-into-wav-or-mp3/76955#76955
In practice, most people deal exclusively with compressed formats, which make files and streams much smaller. Some of those formats take into account characteristics of the human ear to further compress the audio in a lossy way. The most popular royalty-free formats as of 2019 appear to be:
Biology
Humans perceive sound mostly by its frequency decomposition (AKA Fourier transform).
I think this is because the inner ear has parts which resonate to different frequencies (TODO confirm).
Therefore, when synthesizing music, we think more in terms of adding up frequencies instead of points in time. This is illustrated in this example.
This leads to thinking in terms of a 1D vector between 20Hz and 20kHz for each point in time.
The mathematical Fourier transform loses the notion of time, so what we do when synthesizing is to take groups of points, and sum up frequencies for that group, and take the Fourier transform there.
Luckily, the Fourier transform is linear, so we can just add up and normalize displacements directly.
The size of each group of points leads to a time-frequency precision tradeoff, mediated by the same mathematics as Heisenberg's uncertainty principle.
Wavelets may be a more precise mathematical description of this intermediary time-frequency description.
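The time-frequency tradeoff can be put in numbers with a small sketch. It assumes the groups of points are analyzed with a plain discrete Fourier transform, in which case a group of n samples spans n / sample_rate seconds and can only tell apart frequencies that are sample_rate / n Hz apart. The function names are hypothetical.

```c
/* Duration in seconds covered by a group of n samples. */
double window_seconds(double sample_rate, unsigned int n) {
    return n / sample_rate;
}

/* Spacing in Hz between adjacent frequency bins of an n-point DFT. */
double bin_width_hz(double sample_rate, unsigned int n) {
    return sample_rate / n;
}
```

At 44,100 Hz, a group of 11,025 samples gives 4 Hz frequency resolution but smears everything over 0.25 s, while halving the group size halves the time span and doubles the bin width. The product of the two is always 1, so you cannot improve both at once.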
Quick ways to generate common tones out of the box
The amazing FFmpeg library covers several of them: Linux sine wave audio generator
sudo apt-get install ffmpeg
ffmpeg -f lavfi -i "sine=frequency=1000:duration=5" out.wav
Python pyo
https://github.com/belangeo/pyo
Python sound library.
Got it to work after a bit of frustration: Pyo server.boot() fails with pyolib._core.PyoServerStateException on Ubuntu 14.04
Csound
https://en.wikipedia.org/wiki/Csound
https://github.com/csound/csound
Program that reads a custom XML format that allows you to create some very funky sounds.
sudo apt install csound
Here's a really cool and advanced demo: https://github.com/csound/csound/blob/b319c336d31d942af2d279b636339df83dc9f9f9/examples/xanadu.csd rendered at: https://www.youtube.com/watch?v=7fXhVMDCfaA
abcmidi
Nice project that converts MIDI to the ABC notation and vice versa, allowing you to edit a MIDI file in your text editor: https://sound.stackexchange.com/questions/39457/how-to-open-midi-file-in-text-editor/50058#50058
MusicXML
https://en.wikipedia.org/wiki/MusicXML
An attempt to standardize music sheet representation.
I can't find easily how to convert it to an audio format from the command line however... Convert musicxml to wav?
MuseScore
https://github.com/musescore/MuseScore
The best FOSS scorewriter GUI I've seen so far. You can really compose for an orchestra with this.
Other high level out-of-box open source synthesizers for Linux
If you are going down this road, you might as well have a look at the big boys to learn about common synthesis techniques:
Look up things like analog-digital conversion. That should get you started. These devices convert an audio signal (sine waves) into a digital representation. So, a 16-bit ADC would be able to represent a sine with values from -32768 to 32767. This is in fixed-point. It is also possible to do it in floating-point (not recommended for performance reasons, but it may be needed for range reasons). The opposite (digital-analog conversion) happens when we convert numbers back to sine waves. This is handled by something called a DAC.
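As a sketch of how the fixed-point and floating-point representations relate, the following C functions convert between a float sample in [-1.0, 1.0] and a signed 16-bit value. The names and the choice of scale factors (32767 one way, 32768 the other) are my assumptions; real libraries differ on these details.

```c
#include <stdint.h>

/* Convert a floating-point sample in [-1.0, 1.0] to signed 16-bit
 * fixed point, clamping out-of-range input and rounding to nearest. */
int16_t float_to_s16(double x) {
    if (x > 1.0) x = 1.0;
    if (x < -1.0) x = -1.0;
    double scaled = x * 32767.0;
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert a signed 16-bit sample back to floating point in [-1.0, 1.0). */
double s16_to_float(int16_t s) {
    return s / 32768.0;
}
```

The fixed-point form has a hard ceiling, so anything louder than full scale is clipped, whereas the floating-point form can temporarily exceed 1.0 during processing and be scaled back down later.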
I think samples of the waveform at a specific sample frequency would be the most basic representation.
I think a good way to start playing with audio would be with Processing and Minim. This program will draw the frequency spectrum of sound from your microphone!
import ddf.minim.*;
import ddf.minim.analysis.*;
AudioInput in;
FFT fft;
void setup()
{
size(1024, 600);
noSmooth();
Minim.start(this);
in = Minim.getLineIn();
fft = new FFT(in.bufferSize(), in.sampleRate());
}
void draw()
{
background(0);
fft.forward(in.mix);
stroke(255);
for(int i = 0; i < fft.specSize(); i++)
line(i*2+1, height, i*2+1, height - fft.getBand(i)*10);
}
void stop()
{
in.close();
Minim.stop();
super.stop();
}
Physically, as you probably know, audio is a vibration. Typically, we're talking about vibrations of air between approximately 20Hz and 20,000Hz. That means the air is moving back and forth 20 to 20,000 times per second.
If you measure that vibration and convert it to an electrical signal (say, using a microphone), you'll get an electrical signal with the voltage varying in the same waveform as the sound. In our pure-tone hypothetical, that waveform will match that of the sine function.
Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.
Arbitrarily, we'll change the scale on our volt meter. We'll multiply the volts by 32767. It now calls -1V -32767 and +1V 32767. Oh, and it'll round to the nearest integer.
Now, we hook our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD.
This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.
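To make "a bunch of voltage measurements" concrete, here is a sketch of how the two channels could be laid out as one byte stream, and how much data that is per second. The function names are hypothetical, and the byte order is an assumption: little-endian, as used by WAV files.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes per second of linear PCM: rate * channels * bytes per sample. */
uint32_t pcm_bytes_per_second(uint32_t sample_rate, uint32_t channels,
                              uint32_t bits_per_sample) {
    return sample_rate * channels * (bits_per_sample / 8);
}

/* Interleave left/right 16-bit samples as L0 R0 L1 R1 ..., each sample
 * stored little-endian. out must have room for 4 * n bytes. */
void interleave_stereo_s16le(const int16_t *left, const int16_t *right,
                             size_t n, uint8_t *out) {
    for (size_t i = 0; i < n; ++i) {
        uint16_t l = (uint16_t)left[i];
        uint16_t r = (uint16_t)right[i];
        out[4 * i + 0] = l & 0xFF;
        out[4 * i + 1] = l >> 8;
        out[4 * i + 2] = r & 0xFF;
        out[4 * i + 3] = r >> 8;
    }
}
```

For stereo 44,100 Hz 16-bit audio this works out to 176,400 bytes per second, or roughly 10 MB per minute.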