I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don't want to listen to 20 hours of tape to find it.
Two ideas:
I wrote a blog article a while ago about using Windows speech recognition, with a basic tutorial on converting audio files to text in C#. You can check it out here.
If you are familiar with Java, you could try feeding the audio files through Minim and calculating some FFT spectra. Silence could be detected by defining a minimum level for the amplitude of the samples (to rule out noise). To separate speech from music, the FFT spectrum of a time window can be used: speech concentrates its energy in a few very distinct frequency bands called formants, especially for vowels, whereas music is distributed more evenly across the frequency spectrum. A Python sketch of this approach appears after the links below.
You probably won't get a 100% separation of the speech/music blocks, but it should be good enough to tag the files and only listen to the interesting parts.
http://code.compartmental.net/tools/minim/
http://en.wikipedia.org/wiki/Formant
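To make the idea concrete, here is a minimal sketch of the same two heuristics (amplitude threshold for silence, formant-band energy for speech vs. music) in Python/NumPy rather than Java/Minim. The window length, the thresholds and the 300-3000 Hz formant band are illustrative assumptions, not tuned values:

import numpy as np
from scipy.io import wavfile

def classify_windows(path, win_s=0.5, silence_rms=0.01, speech_ratio=0.5):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)                   # mix stereo down to mono
    samples = samples / (np.abs(samples).max() or 1.0)   # normalize to [-1, 1]
    win = int(rate * win_s)
    labels = []
    for start in range(0, len(samples) - win, win):
        chunk = samples[start:start + win]
        rms = np.sqrt(np.mean(chunk ** 2))
        if rms < silence_rms:                            # amplitude floor: silence
            labels.append("silence")
            continue
        spectrum = np.abs(np.fft.rfft(chunk)) ** 2
        freqs = np.fft.rfftfreq(win, d=1.0 / rate)
        formant = spectrum[(freqs > 300) & (freqs < 3000)].sum()
        # speech concentrates energy in the formant band; music spreads out
        labels.append("speech" if formant / spectrum.sum() > speech_ratio else "music")
    return labels

The windows tagged "speech" then become the candidate regions to audition first.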
The best option would be to find an open source module that does voice recognition or speaker identification (not speech recognition). Speaker identification is used to identify a particular speaker, whereas speech recognition converts spoken audio to text. There may be open source speaker identification packages; you could try searching something like SourceForge.net for "speaker identification" or "voice AND biometrics". Since I have not used one myself, I can't recommend anything.
If you can't find anything and you are interested in rolling your own, there are plenty of open source FFT libraries for any popular language; the technique would be along the lines of the FFT/formant approach described above, and a crude sketch follows.
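For a feel of what rolling your own might look like, here is a crude speaker-matching sketch (nowhere near a real speaker-identification system): it compares the mean MFCC vector of one-second windows against a short reference clip of the target voice. It assumes librosa is available and that you have a reference recording of the voice from some other source; the file names, window size and distance threshold are all illustrative:

import numpy as np
import librosa

def mean_mfcc(y, sr):
    # summarize a clip as the average of its MFCC frames
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

ref, sr = librosa.load("grandmother_reference.wav", sr=16000)  # hypothetical file
ref_vec = mean_mfcc(ref, sr)

tape, _ = librosa.load("tape01.wav", sr=16000)
win = sr  # one-second windows
for start in range(0, len(tape) - win, win):
    dist = np.linalg.norm(mean_mfcc(tape[start:start + win], sr) - ref_vec)
    if dist < 25.0:  # illustrative threshold; calibrate on known segments
        print("possible match at %.1fs (distance %.1f)" % (start / sr, dist))

Mean MFCCs throw away a lot of speaker information, so expect false positives; the point is only to rank windows for listening, not to decide automatically.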
Note: the number of hours needed to complete this project could easily exceed the 20 hours of listening to the recordings manually. But it would be a lot more fun than grinding through 20 hours of audio, and you can reuse the software you build in the future.
Of course, if the audio is not sensitive from a privacy standpoint, you could outsource the audio auditioning task to something like Amazon's Mechanical Turk.
What will probably save you the most time is speaker diarization. This works by annotating the recording with speaker IDs, which you can then manually map to real people with very little effort. Error rates are typically around 10-15% of recording length, which sounds awful, but that includes detecting too many speakers and mapping two IDs to the same person, which isn't that hard to mend.
One good tool is the SHoUT toolkit (C++), even though it's a bit picky about input format. See the author's usage notes for this tool. It outputs both voice/speech activity detection (VAD/SAD) metadata AND speaker diarization, so you get silence/speech detection plus a bit extra, since it annotates when the same speaker is active in a recording.
The other useful tool is LIUM spkdiarization (Java), which does basically the same thing, except I haven't yet put in enough effort to figure out how to get the VAD metadata. It features a nice ready-to-use downloadable package; a sketch of driving it from Python appears below.
With a little bit of compiling, this should work in under an hour.
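Here is a rough sketch of driving LIUM from Python and turning its .seg output into (speaker, start, end) turns. The jar name, the command-line flags and the .seg column layout (times in 10 ms frames, speaker label in the last column) are from my memory of the LIUM wiki, so verify them against your download:

import subprocess

def diarize(show, wav_path, jar="lium_spkdiarization-8.4.1.jar"):
    subprocess.run(
        ["java", "-Xmx2g", "-jar", jar,
         "--fInputMask=" + wav_path,
         "--sOutputMask=" + show + ".seg",
         "--doCEClustering", show],
        check=True)
    turns = []
    with open(show + ".seg") as f:
        for line in f:
            if line.startswith(";;"):   # skip comment lines
                continue
            fields = line.split()
            start, length = int(fields[2]), int(fields[3])  # 10 ms frames
            turns.append((fields[-1], start / 100.0, (start + length) / 100.0))
    return turns  # list of (speaker ID, start in seconds, end in seconds)

Each distinct speaker ID then corresponds to one voice you have to identify by ear once, instead of listening through everything.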
You could also try pyAudioAnalysis to remove the silent parts and keep only the segments that contain sound:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# read the WAV file (Fs = sampling rate, x = raw signal)
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
# 20 ms window and step; smoothWindow and Weight control how aggressively
# low-energy regions are discarded
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow=1.0, Weight=0.3, plot=True)
segments contains the endpoints (in seconds) of the non-silence segments.
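pyAudioAnalysis also ships a speaker diarization function, which connects back to the diarization suggestion above. A minimal sketch, assuming the same old-style API naming as the snippet above; the function name and the per-window label array it returns are as I recall them from the pyAudioAnalysis wiki, and the speaker count of 2 is a guess for a two-person conversation:

import numpy as np
from pyAudioAnalysis import audioSegmentation as aS

# one speaker label per analysis window across the recording
labels = aS.speakerDiarization("data/recording1.wav", 2)
# window indices where the active speaker changes; multiply by the mid-term
# step (0.2 s by default, if I remember the docs right) to get seconds
print(np.where(np.diff(labels) != 0)[0])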