Open source code for voice detection and discrimination

囚心锁ツ 2021-01-31 17:49

I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don

8 answers
  •  南方客 (original poster)
    2021-01-31 18:35

    The best option would be to find an open source module that does voice recognition or speaker identification (not speech recognition). Speaker identification identifies a particular speaker, whereas speech recognition converts spoken audio to text. There may be open source speaker identification packages; you could try searching a site like SourceForge.net for "speaker identification" or "voice AND biometrics". Since I have not used one myself, I can't recommend anything.

    If you can't find anything but you are interested in rolling one of your own, then there are plenty of open source FFT libraries for any popular language. The technique would be:

    • Get a typical recording of you talking normally and your grandmother talking normally in digital form, something with as little background noise as possible
    • Take the FFT of every second or so of audio in these reference recordings
    • From the array of FFT profiles you have created, filter out any below a certain average energy threshold, since they are most likely noise
    • Build a master FFT profile by averaging the FFT snapshots that survived the filter
    • Then repeat the FFT sampling technique above on the digitized target audio (the 20 hours of stuff)
    • Flag any areas in the target audio files where the FFT snapshot at any time index is similar to your master FFT profile for you and your grandmother talking. You will need to play with the similarity threshold so that you don't get too many false positives. Also note that you may have to limit your FFT frequency bin comparison to only those frequency bins in your master FFT profile that have energy. Otherwise, if the target audio of you and your grandmother talking contains significant background noise, it will throw off your similarity function.
    • Crank out a list of time indices for manual inspection
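
    The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the audio is already decoded into plain lists of float samples, the window length is a power of two, cosine similarity stands in for the unspecified "similarity function", and every name and threshold here (`master_profile`, `flag_matches`, `energy_floor`, `bin_floor`, `threshold`) is made up for the example. The small pure-Python `fft` is there only to keep the sketch dependency-free; for 20 hours of audio you would swap in a real FFT library.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT (illustrative only); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def snapshots(samples, window):
    """Magnitude spectrum of each consecutive, non-overlapping window."""
    specs = []
    for start in range(0, len(samples) - window + 1, window):
        spectrum = fft(samples[start:start + window])
        specs.append([abs(c) for c in spectrum[:window // 2]])
    return specs


def mean_energy(values):
    """Average squared magnitude of a list of bins."""
    return sum(v * v for v in values) / len(values)


def master_profile(reference, window, energy_floor):
    """Average the reference snapshots whose energy clears the floor."""
    kept = [s for s in snapshots(reference, window)
            if mean_energy(s) >= energy_floor]
    if not kept:
        raise ValueError("no reference window exceeds energy_floor")
    return [sum(column) / len(kept) for column in zip(*kept)]


def cosine(a, b):
    """Cosine similarity; an assumed choice, the answer does not name one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_matches(target, profile, window, sample_rate,
                 energy_floor, bin_floor, threshold):
    """Start times (seconds) of target windows that resemble the profile."""
    # Compare only the bins where the master profile actually has energy.
    bins = [i for i, p in enumerate(profile) if p >= bin_floor]
    if not bins:
        return []
    wanted = [profile[i] for i in bins]
    hits = []
    for n, spec in enumerate(snapshots(target, window)):
        seen = [spec[i] for i in bins]
        if mean_energy(seen) < energy_floor:
            continue  # too quiet in the bins that matter: treat as noise
        if cosine(seen, wanted) >= threshold:
            hits.append(n * window / sample_rate)
    return hits
```

    You would call `master_profile` once on the clean reference recording and then `flag_matches` on each digitized tape, tuning `threshold` until the list of time indices is short enough to audition by hand. At an 8 kHz sample rate, a window of 8192 samples is about one second. In practice a library routine such as `numpy.fft.rfft` would replace the toy `fft` here, and the samples could be read with something like Python's `wave` module.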

    Note that the number of hours needed to complete this project could easily exceed the 20 hours it would take to listen to the recordings manually. But it will be a lot more fun than grinding through 20 hours of audio, and you can use the software you build again in the future.

    Of course, if the audio is not sensitive from a privacy viewpoint, you could outsource the audio auditioning task to something like Amazon's Mechanical Turk.
