FFT Pitch Detection - Melody Extraction [closed]

Submitted by 房东的猫 on 2019-11-30 02:31:14

It depends greatly on the musical content you want to work with - extracting the pitch of a monophonic recording (i.e. single instrument or voice) is not the same as extracting the pitch of a single instrument from a polyphonic mixture (e.g. extracting the pitch of the melody from a polyphonic recording).

For monophonic pitch extraction there are various algorithms you could try to implement, in both the time domain and the frequency domain. Two examples are YIN (time domain) and HPS, the harmonic product spectrum (frequency domain); further details on both can be found on Wikipedia:
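As an illustration of the frequency-domain approach, here is a minimal sketch of HPS on a single analysis frame (not production code: it assumes a mono NumPy float signal, uses an arbitrary 50 Hz lower bound, and picks the raw maximum of the harmonic product):

```python
import numpy as np

def hps_pitch(signal, sr, num_harmonics=4):
    """Estimate the fundamental frequency of one frame via the Harmonic
    Product Spectrum: downsample the magnitude spectrum by 2, 3, ... and
    multiply, so the harmonics all line up at f0."""
    n = len(signal)
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(n)))
    hps = spectrum.copy()
    for h in range(2, num_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated
    lo = int(50 * n / sr)  # ignore bins below ~50 Hz
    peak = lo + np.argmax(hps[lo:len(spectrum) // num_harmonics])
    return peak * sr / n   # bin index -> Hz

# Usage: a 220 Hz tone with decaying harmonics
sr = 44100
t = np.arange(sr) / sr
sig = sum(0.5 ** k * np.sin(2 * np.pi * 220 * k * t) for k in range(1, 5))
print(hps_pitch(sig, sr))  # → 220.0
```

On clean harmonic material like this it works; the answers below explain why it degrades on real recordings.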

However, neither will work well if you want to extract the melody from polyphonic material. Melody extraction from polyphonic music is still a research problem, and there isn't a simple set of steps you can follow. There are some tools out there provided by the research community that you can try out (for non-commercial use only though), namely:

As a final note, when synthesizing your output I'd recommend synthesizing the continuous pitch curve that you extract. The easiest way to do this is to estimate the pitch every X ms (e.g. every 10 ms) and synthesize a sine wave that changes frequency at every estimate while keeping the phase continuous. This will make your result sound a lot more natural, and you avoid the extra error involved in quantizing a continuous pitch curve into discrete notes (which is another problem in its own right).
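A minimal sketch of that synthesis step, assuming you already have one pitch estimate per 10 ms hop (the function name and the zero-means-unvoiced convention are illustrative choices, not part of any standard API):

```python
import numpy as np

def synthesize_pitch_track(f0_track, sr, hop_ms=10):
    """Render a pitch track (one f0 in Hz per hop, 0 = silence) as a
    sine wave whose frequency changes every hop with continuous phase."""
    hop = int(sr * hop_ms / 1000)
    out = np.zeros(len(f0_track) * hop)
    phase = 0.0
    for i, f0 in enumerate(f0_track):
        # Accumulate phase across hops so there is no discontinuity
        # (audible click) when the frequency changes at a boundary.
        phases = phase + (2 * np.pi * f0 / sr) * np.arange(1, hop + 1)
        if f0 > 0:
            out[i * hop:(i + 1) * hop] = np.sin(phases)
        phase = phases[-1] % (2 * np.pi)
    return out

# Usage: a one-second glide from 200 Hz to 400 Hz
sr = 16000
track = np.linspace(200, 400, 100)  # 100 hops of 10 ms each
audio = synthesize_pitch_track(track, sr)
```

The phase accumulator is the whole point: restarting each sine at phase zero per hop is what produces the "series of beeps" effect.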

Jeremy Salwen

You probably don't want to be picking peaks from an FFT to calculate the pitch. You probably want to use autocorrelation instead. I wrote up a long answer to a very similar question here: Cepstral Analysis for pitch detection
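For reference, a bare-bones autocorrelation pitch estimator over one frame (a sketch only: real implementations such as YIN add normalization and peak interpolation, and the 50–500 Hz search band here is an arbitrary choice):

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate pitch as the lag of the strongest autocorrelation
    peak within the plausible pitch range [fmin, fmax]."""
    frame = frame - np.mean(frame)
    n = len(frame)
    # Linear autocorrelation via FFT (zero-padded to avoid wraparound)
    spec = np.fft.rfft(frame, 2 * n)
    ac = np.fft.irfft(spec * np.conj(spec))[:n]
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# Usage: 110 Hz fundamental plus one harmonic
sr = 8000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 110 * t) + 0.5 * np.sin(2 * np.pi * 220 * t)
print(autocorr_pitch(frame, sr))  # close to 110 Hz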

Your method might work for synthetic music whose notes are synchronized to fit your FFT frame timing and length, and which uses only note sounds whose complete spectrum is compatible with your HPS pitch estimator. Neither of those is true for common music.

For the more general case, automatic music transcription still seems to be a research problem, with no simple five-step solution. Pitch is a human psycho-acoustic phenomenon. People will hear notes that may or may not be present in the local spectrum. The HPS pitch estimation algorithm is much more reliable than picking the FFT peak, but it can still fail for many kinds of musical sounds. Also, the FFT of any frame that crosses note boundaries or transients may contain no clear single pitch to estimate.

Your approach will not work for any general musical example, for the following reasons:

  1. Music by its very nature is dynamic. Meaning that every sound present in music is modulated by distinct periods of silence, attack, sustain, decay, and again silence, otherwise known as the envelope of the sound.

  2. Musical instrument notes and human vocal notes cannot be properly synthesized by a single tone. These notes must be synthesized by a fundamental tone and many harmonics.

  3. However, it is not sufficient to synthesize only the fundamental tone and the harmonics of a musical instrument note or vocal note; one must also synthesize the envelope of the note, as described in item 1 above.

  4. Furthermore, to synthesize a melodic passage in music, whether instrumental or vocal, one must synthesize items 1-3 above, for every note in the passage, and one must also synthesize the timing of every note relative to the beginning of the passage.

  5. Analytically extracting individual instruments or human voices from a final mix recording is a very difficult problem, and your approach doesn't address that problem, so your approach cannot properly address issues 1-4.
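To make items 1–3 concrete, here is a toy sketch of what synthesizing even a single note entails (the helper names, the piecewise-linear ADSR envelope, and the four fixed harmonic amplitudes are all arbitrary illustrative choices), assuming NumPy:

```python
import numpy as np

def adsr(n, sr, attack=0.02, decay=0.05, sustain=0.7, release=0.1):
    """Piecewise-linear attack/decay/sustain/release envelope, n samples."""
    a, d, r = int(attack * sr), int(decay * sr), int(release * sr)
    s = max(n - a - d - r, 0)
    return np.concatenate([
        np.linspace(0, 1, a, endpoint=False),        # attack: rise to peak
        np.linspace(1, sustain, d, endpoint=False),  # decay to sustain level
        np.full(s, sustain),                         # sustain
        np.linspace(sustain, 0, r),                  # release back to silence
    ])[:n]

def synth_note(f0, dur, sr, harmonic_amps=(1.0, 0.5, 0.25, 0.125)):
    """Additive synthesis: fundamental plus harmonics, shaped by an envelope."""
    t = np.arange(int(dur * sr)) / sr
    tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harmonic_amps))
    return adsr(len(t), sr) * tone

# Usage: half a second of A3
sr = 22050
note = synth_note(220.0, 0.5, sr)
```

Even this toy note requires an envelope and several partials; recovering those parameters for every note in a final mix is exactly what the FFT-peak approach cannot do.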

In short, any approach that attempts to extract a near perfect musical transcription from the final mix of a musical recording, by using strict analytical methods, is at worst almost certainly doomed to failure, and at best falls in the realm of advanced research.

How to proceed from this impasse depends on the purpose of the work, something the OP didn't mention.

Will this work be used in a commercial product, or is it a hobby project?

If it is a commercial work, various further approaches are warranted (costly or very costly ones), but the details of those approaches depend on the goals of the work.

As a closing note, your synthesis sounds like random beeps due to the following:

  1. Your fundamental tone detector is tied to the timing of your rolling FFT frames, which in effect generates a probably fake fundamental tone at the start-time of each and every rolling FFT frame.

  2. Why are the detected fundamental tones probably fake? Because you're arbitrarily clipping the musical sample into (FFT) frames, and are therefore probably truncating many concurrently sounding notes somewhere mid-note, thereby distorting the spectral signatures of the notes.

  3. You're not trying to synthesize the envelopes of the detected notes, nor can you, because there's no way to obtain envelope information based on your analysis.

  4. Therefore, the synthesized result is probably a series of pure sine chirps, spaced in time by the rolling FFT frame's delta-t. Each chirp may be of a different frequency, with a different envelope magnitude, and with envelopes that are probably rectangular in shape.

To see the complex nature of musical notes, take a look at these references:

Musical instrument spectra to 102.4 kHz

Musical instrument note spectra and their time-domain envelopes

Observe in particular the many pure tones that make up each note, and the complex shape of the time-domain envelope of each note. The variable timing of multiple notes relative to each other is an additional essential aspect of music, as is polyphony (multiple voices sounding concurrently) in typical music.

All of these elements of music conspire to make the strict analytical approach to autonomous musical transcription extremely challenging.
