Your approach will not work for any general musical example, for the following reasons:
1. Music by its very nature is dynamic: every sound present in music is shaped by distinct periods of silence, attack, sustain, decay, and again silence, otherwise known as the envelope of the sound.
2. Musical instrument notes and human vocal notes cannot be properly synthesized by a single tone. These notes must be synthesized by a fundamental tone and many harmonics.
3. However, it is not sufficient to synthesize the fundamental tone and the harmonics of a musical instrument note or vocal note. One must also synthesize the envelope of the note, as described in 1 above.
4. Furthermore, to synthesize a melodic passage in music, whether instrumental or vocal, one must synthesize items 1-3 above for every note in the passage, and one must also synthesize the timing of every note relative to the beginning of the passage.
5. Analytically extracting individual instruments or human voices from a final mix recording is a very difficult problem. Your approach doesn't address that problem, so it cannot properly address issues 1-4.
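To make the ingredients listed above concrete (fundamental plus harmonics, an envelope, and per-note timing), here is a minimal NumPy sketch. It is only an illustration, not the OP's code: the names `adsr_envelope`, `synth_note`, and `synth_passage`, the harmonic amplitudes, the piecewise-linear envelope, and the 44.1 kHz sample rate are all assumptions chosen for the example, and real instruments are far more complex than this.

```python
import numpy as np

SAMPLE_RATE = 44100  # Hz; assumed for this sketch


def adsr_envelope(n_samples, attack=0.02, decay=0.05, sustain_level=0.7,
                  release=0.1, sample_rate=SAMPLE_RATE):
    """Piecewise-linear attack/decay/sustain/release envelope (times in seconds)."""
    a = int(attack * sample_rate)
    d = int(decay * sample_rate)
    r = int(release * sample_rate)
    s = n_samples - (a + d + r)  # whatever is left is the sustain segment
    if s < 0:
        raise ValueError("note is too short for the chosen envelope")
    return np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),            # attack
        np.linspace(1.0, sustain_level, d, endpoint=False),  # decay
        np.full(s, sustain_level),                           # sustain
        np.linspace(sustain_level, 0.0, r),                  # release
    ])


def synth_note(f0, duration, harmonic_amps=(1.0, 0.5, 0.25, 0.125),
               sample_rate=SAMPLE_RATE):
    """Fundamental f0 plus harmonics at 2*f0, 3*f0, ..., shaped by the envelope."""
    n = int(duration * sample_rate)
    t = np.arange(n) / sample_rate
    tone = sum(amp * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, amp in enumerate(harmonic_amps))
    tone = tone / sum(harmonic_amps)  # keep the peak amplitude at or below 1
    return tone * adsr_envelope(n, sample_rate=sample_rate)


def synth_passage(notes, sample_rate=SAMPLE_RATE):
    """Mix notes given as (onset_seconds, f0_hz, duration_seconds) tuples."""
    total = max(onset + dur for onset, _, dur in notes)
    out = np.zeros(int(round(total * sample_rate)) + 1)
    for onset, f0, dur in notes:
        note = synth_note(f0, dur, sample_rate=sample_rate)
        start = int(round(onset * sample_rate))
        out[start:start + len(note)] += note  # overlapping notes simply add
    return out


# Three consecutive notes, each with its own onset time.
passage = synth_passage([(0.0, 440.0, 0.5), (0.5, 494.0, 0.5), (1.0, 523.0, 0.5)])
```

Note that every value this sketch needs (onset, duration, harmonic amplitudes, envelope shape) would have to be recovered from the recording for each note, which is exactly the information a single FFT peak per frame does not provide.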
In short, any approach that attempts to extract a near perfect musical transcription from the final mix of a musical recording, by using strict analytical methods, is at worst almost certainly doomed to failure, and at best falls in the realm of advanced research.
How to proceed from this impasse depends on the purpose of the work, which the OP didn't mention.
Will this work be used in a commercial product, or is it a hobby project?
If it is a commercial work, various further approaches are warranted (costly or very costly ones), but the details of those approaches depend on the goals of the work.
As a closing note, your synthesis sounds like random beeps due to the following:
Your fundamental tone detector is tied to the timing of your rolling FFT frames, which in effect generates a probably fake fundamental tone at the start-time of each and every rolling FFT frame.
Why are the detected fundamental tones probably fake? Because you're arbitrarily clipping the musical sample into (FFT) frames, and are therefore probably truncating many concurrently sounding notes somewhere mid-note, thereby distorting the spectral signatures of the notes.
You're not trying to synthesize the envelopes of the detected notes, nor can you, because there's no way to obtain envelope information based on your analysis.
Therefore, the synthesized result is probably a series of pure sine bursts (beeps), spaced in time by the rolling FFT frame's delta-t. Each burst may have a different frequency and a different envelope magnitude, and its envelope is probably rectangular in shape.
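The effect described above can be reproduced with a small NumPy sketch. This is a hypothetical reconstruction of a "one spectral peak per FFT frame" pipeline, not the OP's actual code: the function names `frame_peaks` and `beep_resynthesis`, the 8 kHz sample rate, the 1024-sample non-overlapping frames, and the two-note test signal are all assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz; assumed for this sketch
FRAME = 1024        # samples per FFT frame; assumed


def frame_peaks(signal, frame=FRAME, sample_rate=SAMPLE_RATE):
    """Return the frequency of the strongest FFT bin in each non-overlapping frame."""
    peaks = []
    for start in range(0, len(signal) - frame + 1, frame):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame]))
        peaks.append(np.argmax(spectrum) * sample_rate / frame)
    return np.array(peaks)


def beep_resynthesis(peaks, frame=FRAME, sample_rate=SAMPLE_RATE):
    """One constant-amplitude sine per frame, i.e. rectangular-envelope beeps."""
    t = np.arange(frame) / sample_rate
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in peaks])


def pure_tone(f0, duration, sample_rate=SAMPLE_RATE):
    t = np.arange(int(duration * sample_rate)) / sample_rate
    return np.sin(2 * np.pi * f0 * t)


# Two notes of 0.3 s each. The note change falls inside the third frame,
# so that frame contains the tail of one note and the start of the next.
signal = np.concatenate([pure_tone(440.0, 0.3), pure_tone(660.0, 0.3)])
peaks = frame_peaks(signal)
beeps = beep_resynthesis(peaks)

for i, f in enumerate(peaks):
    print(f"frame {i}: peak at {f:.2f} Hz")
```

Even in this idealized case, the detected frequencies are quantized to the bin spacing (`SAMPLE_RATE / FRAME`), the onset of the second note is reported only at a frame boundary, and the resynthesis carries no envelope information at all. With real polyphonic music, each frame contains many partial notes at once, which makes the per-frame peak far less meaningful.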
To see the complex nature of musical notes, take a look at these references:
Musical instrument spectra to 102.4 KHz
Musical instrument note spectra and their time-domain envelopes
Observe in particular the many pure tones that make up each note, and the complex shape of the time-domain envelope of each note. The variable timing of multiple notes relative to each other is an additional essential aspect of music, as is polyphony (multiple voices sounding concurrently) in typical music.
All of these elements of music conspire to make a strictly analytical approach to autonomous musical transcription extremely challenging.