Question
I'm really confused here. I am an AI programmer working on a game that is designed to detect beats in songs, among other things. I have no previous knowledge of audio and am just reading through whatever material I can find. While I got the FFT working, I simply don't understand how samples are transformed into different frequencies.
Question 1: what does each frequency stand for? With the algorithm I have, I can transform, for example, 1024 samples into 512 outputs. So are they a description of the strength of each part of the spectrum at the current second? That doesn't really make sense to me, since what I remember is that there are 20,000 Hz in a 44.1 kHz audio recording. So how do 512 spectrum values explain what is happening at that moment?
Question 2: from what I read, a sample is a number that represents the sound wave at that moment. However, I also read that if you square both the left channel and the right channel and add them together, you get the current power level. These two seem inconsistent with my understanding, and I am really baffled, so please explain.
Answer 1:
DFT output
The output is a complex representation (Re, Im) of the phasor of each basis function (usually a sine wave) at its frequency. The first item is the DC offset, so skip it. All the others are multiples of the same fundamental frequency (sampling rate/N). If the input is real-only, the output is symmetric (mirrored), so use just the first half of the results. Often only the amplitude is used:
Amplitude = sqrt(Re^2 + Im^2)
which is the amplitude of the basis function (its square, Re^2 + Im^2, gives the power spectrum). If the phase is needed, then:
phase = atan2(Im, Re)
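To make the above concrete, here is a minimal Python sketch (the language and the names `dft` and `amplitude_and_phase` are my own choice for illustration). It uses a naive O(N^2) DFT rather than a real FFT library, feeds it a real signal made of a DC offset plus one cosine, and prints amplitude and phase for the first half of the bins.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) DFT: X[k] = sum over n of x[n] * e^(-2*pi*i*k*n/N)."""
    n_total = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_total) for n, x in enumerate(samples))
        for k in range(n_total)
    ]


def amplitude_and_phase(X):
    """Amplitude = sqrt(Re^2 + Im^2) and phase = atan2(Im, Re) of one DFT bin."""
    return math.hypot(X.real, X.imag), math.atan2(X.imag, X.real)


# Real test signal: DC offset 0.5 plus a cosine that completes 2 cycles in N = 8 samples.
N = 8
signal = [0.5 + math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
spectrum = dft(signal)

# Only the first half is printed; for real input X[N-k] is the complex conjugate of X[k].
for k, X in enumerate(spectrum[: N // 2]):
    amp, ph = amplitude_and_phase(X)
    print(f"bin {k}: amplitude={amp:.3f} phase={ph:.3f}")
```

Bin 0 holds the DC term (0.5 * N) and bin 2 holds the cosine (N/2 for a unit amplitude). To get back the physical amplitude of a sinusoid, scale |X[k]| by 2/N (and the DC term by 1/N).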
Beware: DFT results depend strongly on the input signal's shape, frequency, and phase shift relative to your basis functions. When a frequency does not fall exactly on a bin, the output oscillates around the correct value and produces wide peaks instead of sharp ones for single frequencies (this is known as spectral leakage), not to mention aliasing.
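The wide-peak effect can be sketched like this (a hypothetical demo, using a naive DFT for clarity): one sine completes exactly 8 cycles in a 64-sample block and lands on a single bin, while a sine with 8.5 cycles falls between two bins and spreads over many. The 10% threshold in `significant_bins` is an arbitrary choice for the demo.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """|X[k]| of a naive DFT for the first half of the bins, k = 0..N/2-1."""
    n_total = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_total) for n, x in enumerate(samples)))
        for k in range(n_total // 2)
    ]


def significant_bins(mags, fraction=0.1):
    """Count bins whose magnitude exceeds `fraction` of the largest one (arbitrary threshold)."""
    peak = max(mags)
    return sum(1 for m in mags if m > fraction * peak)


N = 64
on_bin = [math.sin(2 * math.pi * 8.0 * n / N) for n in range(N)]   # exactly 8 cycles per block
off_bin = [math.sin(2 * math.pi * 8.5 * n / N) for n in range(N)]  # 8.5 cycles: between bins 8 and 9

print("on-bin tone, significant bins:", significant_bins(magnitude_spectrum(on_bin)))
print("off-bin tone, significant bins:", significant_bins(magnitude_spectrum(off_bin)))
```

In practice a window function (Hann, Hamming, ...) is usually applied before the transform to reduce this spreading, at the cost of a somewhat wider main peak.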
frequencies
If you have 44100 Hz, then the maximum output frequency is half of it, which means the biggest frequency present in the data is 22050 Hz. The DFFT, however, does not contain this frequency if you ignore the mirrored second half of the results:
- for 4 samples, the DFT output frequencies are { -, 11025 } Hz
- for 8 samples, the frequencies are { -, 5512.5, 11025, 16537.5 } Hz
where "-" is the skipped DC offset.
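The lists above follow from f = i * samplerate / N. A small Python sketch (the helper name `bin_frequencies` is hypothetical) reproduces them, with bin 0 being the DC entry shown as "-":

```python
def bin_frequencies(sample_rate, n):
    """Frequencies (Hz) of the first N/2 DFT bins: f = i * sample_rate / N, i = 0..N/2-1.

    Bin 0 is the DC offset; the last listed bin stays below sample_rate / 2.
    """
    return [i * sample_rate / n for i in range(n // 2)]


print(bin_frequencies(44100, 4))  # [0.0, 11025.0]
print(bin_frequencies(44100, 8))  # [0.0, 5512.5, 11025.0, 16537.5]
```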
The output frequency is linear in its index from the start, so if you have N = 512 samples:
- do a DFFT on them
- take the first N/2 = 256 results
- the i-th result represents frequency f = i*samplerate/N Hz, where i = {1, ..., (N/2)-1}, skipping i = 0 (DC)
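Here is a Python sketch of those steps, assuming a 44100 Hz sample rate and a single 512-sample block; `dominant_frequency` is a hypothetical helper, and the naive DFT stands in for a real FFT library. It finds the strongest bin (skipping DC) and converts its index to Hz.

```python
import cmath
import math

SAMPLE_RATE = 44100  # Hz (assumed)
N = 512              # block size from the steps above


def magnitudes_first_half(samples):
    """|X[i]| for i = 0..N/2-1 using a naive O(N^2) DFT (an FFT library is far faster)."""
    n_total = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * i * n / n_total) for n, x in enumerate(samples)))
        for i in range(n_total // 2)
    ]


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest bin, skipping i = 0 (DC): f = i * sample_rate / N."""
    mags = magnitudes_first_half(samples)
    i = max(range(1, len(mags)), key=lambda k: mags[k])
    return i * sample_rate / len(samples)


# Test tone placed exactly on bin 20, i.e. 20 * 44100 / 512 Hz.
tone_hz = 20 * SAMPLE_RATE / N
block = [math.sin(2 * math.pi * tone_hz * n / SAMPLE_RATE) for n in range(N)]
print(f"bin spacing: {SAMPLE_RATE / N:.2f} Hz")
print(f"dominant frequency: {dominant_frequency(block, SAMPLE_RATE):.2f} Hz")
```

With these numbers, each bin is about 86 Hz wide, so a tone that is not exactly on a bin (for example 1000 Hz) is reported at the nearest bin frequency rather than its true value.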
The image shows one of my utility apps, which ties together:
- a 2-channel sound generator (top left)
- a 2-channel oscilloscope (top right)
- a 2-channel spectral analyzer (bottom), switched to a linear frequency scale to make obvious what I mean in the text above
Zoom in on the image to see the settings; I made it as close to the real devices as I could.
[Figure: DCT and DFT comparison]
[Figure: dependence of the DFT output on the input signal frequency, and aliasing by the sampling rate]
more channels
Summing the power of the channels is safer. If you just add the channels, you could lose data. For example, let the left channel play a 1 kHz sine wave and the right channel the exact opposite: if you just sum them, the result is zero, yet you can hear the sound (unless you are exactly in the middle between the speakers). If you analyze each channel independently, you need to compute a DFFT for each channel, but if you use the power sum of the channels (or the sum of absolute values), you can analyze all channels with a single DFFT. Of course, you then need to scale the amplitudes. Keep in mind that squaring or rectifying is nonlinear and moves energy to other frequencies (a squared 1 kHz sine becomes a 2 kHz component plus DC), so this combined signal is best suited to level and beat detection rather than exact pitch analysis.
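The cancellation example can be checked with a short Python sketch. It assumes a 44100 Hz sample rate and a 441-sample (10 ms) block, which holds exactly 10 cycles of 1 kHz; the variable names are illustrative. The plain sum of the inverted channels is silent, while the power sum (left^2 + right^2, the "power level" from Question 2) keeps the tone's energy.

```python
import math

SAMPLE_RATE = 44100  # Hz (assumed)
N = 441              # 10 ms block: exactly 10 cycles of a 1 kHz tone

left = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(N)]
right = [-x for x in left]  # the exact opposite (inverted) signal

plain_sum = [l + r for l, r in zip(left, right)]      # cancels sample by sample
power_sum = [l * l + r * r for l, r in zip(left, right)]  # instantaneous power of both channels


def mean_power(block):
    """Average of x^2 over the block."""
    return sum(x * x for x in block) / len(block)


print("plain sum, mean power:", mean_power(plain_sum))             # 0.0: the tone vanished
print("power sum, mean value:", sum(power_sum) / len(power_sum))   # about 1.0: the tone is still there
```

Each channel is a unit-amplitude sine with mean power 0.5, so the power sum averages to about 1.0 regardless of the relative phase of the two channels.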
[Notes]
The bigger N is, the nicer the result (finer frequency resolution, fewer aliasing artifacts, and the last bin gets closer to the maximum frequency). For detecting specific frequencies, FIR filter detectors are more precise and faster.
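As one possible illustration of detecting a specific frequency, here is a Python sketch that correlates a block with a fixed cosine and a fixed sine at the target frequency. Each correlation is equivalent to evaluating an N-tap FIR filter (with time-reversed cosine/sine coefficients) once at the end of the block. This is my own minimal example, not necessarily the detector the answer has in mind; the helper `tone_level`, the 8000 Hz rate, and the 440 Hz test tone are all hypothetical. The Goertzel algorithm is a well-known recursive way to compute the same single-bin value.

```python
import math


def tone_level(block, freq_hz, sample_rate):
    """Estimate the amplitude of `freq_hz` in `block` using two FIR correlators.

    The coefficients are one block of cosine and sine at the target frequency,
    so only that frequency is measured, not the whole spectrum.
    """
    n_total = len(block)
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(x * math.cos(w * n) for n, x in enumerate(block))
    im = sum(x * math.sin(w * n) for n, x in enumerate(block))
    return 2 * math.hypot(re, im) / n_total  # 2/N scales back to the sine amplitude


SAMPLE_RATE = 8000  # Hz (assumed)
N = 400             # 50 ms block; 440 Hz fits exactly 22 cycles
block = [0.8 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(N)]
print(f"440 Hz level: {tone_level(block, 440, SAMPLE_RATE):.3f}")
print(f"1000 Hz level: {tone_level(block, 1000, SAMPLE_RATE):.3f}")
```

Only two correlations are needed per target frequency, which is why such a detector can be cheaper than a full transform when you care about just a few frequencies.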
I strongly recommend reading "DFT" and all the sublinks there, and also "plotting real time Data on (qwt) Oscillocope".
Source: https://stackoverflow.com/questions/28674724/i-dont-really-understand-fft-and-sample-rates