My undergraduate project dealt with transcribing notes from a WAV file to a MIDI file. We handled only the simple case of a single instrument, possibly playing more than one note at a time (a piano, for instance). Our research before we started showed that even this single-instrument case is considered non-trivial. Basically, the problem breaks down as follows:
- Find which frequencies are present at any given time. This can be done with a DFT/FFT of short windows of the signal, one window at a time.
- Use some heuristic to guess which frequencies are harmonics of the same note and which belong to different notes. This may be easy if you know which instrument is playing, but it's hard in the general case, because the relative magnitudes of the harmonics differ from instrument to instrument. For instance, the same set of peaks might be two Cs one octave apart played on one instrument, or a single C played on a different instrument whose second harmonic (the C an octave up) is strong.
- After you know which notes are playing at each time, you have to guess where the breaks between notes are: the same run of frames could be one long note or a series of short notes. The window size you chose for the initial DFT affects the result here, since shorter windows give finer time resolution but coarser frequency resolution.
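The three steps above can be sketched roughly as follows. This is not the code from our project; it is a minimal illustration under stated assumptions. The FFT is written in pure Python only so the example needs no libraries (in practice you'd use a library FFT). The 256-sample window at 8 kHz, the 10% peak threshold, the 3% harmonic tolerance, and the "lowest unclaimed peak is a fundamental" rule are all hypothetical choices. That last rule is exactly where the octave ambiguity bites: a real C played an octave above another C would be discarded as a harmonic.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_peaks(samples, rate, win=256, rel_threshold=0.1):
    """Step 1: FFT each non-overlapping, Hann-windowed frame and return the
    local-maximum frequencies (Hz) above rel_threshold * the frame's peak."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / win) for i in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, win):
        seg = [s * w for s, w in zip(samples[start:start + win], hann)]
        mags = [abs(c) for c in fft(seg)[: win // 2]]
        top = max(mags)
        frames.append([
            k * rate / win
            for k in range(1, win // 2 - 1)
            if top > 0 and mags[k] > rel_threshold * top
            and mags[k] >= mags[k - 1] and mags[k] > mags[k + 1]
        ])
    return frames


def group_harmonics(peaks, tolerance=0.03):
    """Step 2 (naive, hypothetical heuristic): take the lowest unclaimed peak
    as a fundamental and discard every peak near an integer multiple of it."""
    notes, remaining = [], sorted(peaks)
    while remaining:
        f0 = remaining.pop(0)
        notes.append(f0)
        remaining = [f for f in remaining
                     if abs(f / f0 - round(f / f0)) > tolerance]
    return notes


def segment_notes(frame_notes, hop_seconds, tolerance=0.03):
    """Step 3: merge runs of consecutive frames containing the same
    fundamental into (onset_s, offset_s, freq_hz) events."""
    events, active = [], {}  # active: frequency -> index of its first frame
    for i, notes in enumerate(frame_notes + [[]]):  # empty sentinel closes notes
        for f, start in list(active.items()):
            if not any(abs(f - g) <= tolerance * f for g in notes):
                events.append((start * hop_seconds, i * hop_seconds, f))
                del active[f]
        for g in notes:
            if not any(abs(f - g) <= tolerance * f for f in active):
                active[g] = i
    return sorted(events)


# Synthetic test: a 500 Hz note with 2 harmonics, silence, then a 750 Hz note.
rate, win = 8000, 256  # 31.25 Hz frequency bins, 32 ms frames


def tone(freqs, n):
    """Sum of sines; the h-th listed frequency gets amplitude 1 / (h + 1)."""
    return [sum(math.sin(2 * math.pi * f * t / rate) / (h + 1)
                for h, f in enumerate(freqs)) for t in range(n)]


signal = tone([500, 1000, 1500], 4 * win) + [0.0] * (2 * win) + tone([750, 1500], 2 * win)
frame_notes = [group_harmonics(p) for p in frame_peaks(signal, rate, win)]
events = segment_notes(frame_notes, hop_seconds=win / rate)
print(events)  # about (0.0, 0.128, 500.0) and (0.192, 0.256, 750.0)
```

With 32 ms frames, two repeated notes separated by less than one frame of silence would be merged into one long note, which is the window-size trade-off from the last step.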
Now, if you have more than one instrument at a time, and no two of them are playing the same notes (or harmonics thereof) at the same time, you might be able to tell the instruments apart using some heuristic on the magnitudes of the harmonics or on the sequences of notes they play. Most likely, though, there will be times when two instruments play the same note. Then you have no real way to decide whether there is (a) one instrument playing the note, (b) two instruments playing it at the same volume, (c) one playing soft and the other loud, or (d) any combination thereof.
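A toy calculation makes this ambiguity concrete. The harmonic profiles A to D and the loudness values below are made up for illustration, and the sketch assumes harmonic magnitudes simply add, ignoring phase interference, which real signals only approximately satisfy. Four different explanations, matching cases (a) through (d), produce exactly the same observed harmonic magnitudes, so the spectrum at that moment cannot pick among them.

```python
from fractions import Fraction as F

# Hypothetical harmonic profiles: relative magnitudes of harmonics 1, 2, 3.
PROFILES = {
    "A": [F(1), F(6, 10), F(3, 10)],
    "B": [F(1), F(8, 10), F(4, 10)],
    "C": [F(1), F(4, 10), F(2, 10)],
    "D": [F(1), F(12, 10), F(6, 10)],
}


def mix(parts):
    """Harmonic magnitudes observed for {instrument: loudness}, assuming
    magnitudes simply add (phase interference ignored)."""
    out = [F(0)] * 3
    for name, loudness in parts.items():
        for h, m in enumerate(PROFILES[name]):
            out[h] += loudness * m
    return out


explanations = {
    "(a) one instrument": {"A": F(1)},
    "(b) two, equal volume": {"B": F(1, 2), "C": F(1, 2)},
    "(c) one soft, one loud": {"D": F(1, 4), "C": F(3, 4)},
    "(d) a combination": {"A": F(1, 2), "B": F(1, 4), "C": F(1, 4)},
}
observed = mix(explanations["(a) one instrument"])
all_match = all(mix(p) == observed for p in explanations.values())
print([float(m) for m in observed], all_match)  # [1.0, 0.6, 0.3] True
```

Here the profiles were chosen to be linearly dependent, which is what makes the decomposition non-unique; with unknown profiles and noisy magnitudes you run into the same kind of underdetermined problem.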
Anyway, that's the short list of problems to solve. I don't know of any algorithm that solves this in the general case, and I don't think the problem has been solved yet.
Edit: My project presentation can be found at http://www-sipl.technion.ac.il/new/Archive/Special_Events/sipl2004/Projects_PowerPoint/WAV-to-MIDI.pdf