Music transcription means creating music notation from sound (or audio data). While accomplished musicians and especially composers are able to do this, it's an extremely difficult task to do with a machine, and as far as i know, there has been little success so far - mostly academic experiments.
Basically, to recognize notes, you want to know where they start, where they end, and what is their pitch. Fourier transform is the most basic way to turn time domain (audio data) to frequency domain (pitches) - in principle. In practice, musical instruments generate lots of harmonics (overtones) and if we have polyphony (many F0s) added, it's a mess.
You could try feeding something like 50 millisecond sequential slices of the audio data to the FFT. This way you would get the spectrum of each slice, then detect the strongest peaks in each slice, and infer the rhythm from what happens between successive slices.
Sorry, I couldn't help much... But just wanted to point out that what you're trying to do is extremely difficult, seriously. Perhaps you should start from something simpler, like detecting one-note sine wave melodies. Good luck!