Detect human voice from audio file input

廉价感情. Posted on 2019-12-02 15:50:44
msh

Voice detection is not that simple. There are several algorithms, some of them published, for example GSM VAD. Several open source VAD libraries are available as well; some of them are discussed here.

If you want a clean recording, you can:

1. Filter the noise out of the voice signal. You can use an FFT for that and apply filters such as low-pass, high-pass and band-pass filters.

2. After filtering, the noise is reduced and you can pass the result to a voice recognition API.

The more you filter, the less noise remains and the better the recognition gets, but be careful: aggressive filtering can remove the voice together with the noise.
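The FFT filtering step above can be sketched roughly as follows. This is only a minimal illustration, not a production filter: the 300-3400 Hz cutoffs are an assumed default (the conventional telephone voice band), the function names are made up for this example, and the simple radix-2 FFT needs a power-of-two number of samples.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT computed with the conjugation trick."""
    n = len(spectrum)
    conjugated = fft([c.conjugate() for c in spectrum])
    return [c.conjugate() / n for c in conjugated]


def bandpass(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Zero every FFT bin whose frequency lies outside [low_hz, high_hz]."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two number of samples")
    spectrum = fft(samples)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate / n
        if not low_hz <= freq <= high_hz:
            spectrum[k] = 0j
    return [c.real for c in ifft(spectrum)]


# Demo: a 1000 Hz tone (kept) mixed with 62.5 Hz hum (removed).
RATE, N = 8000, 1024
tone = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]
hum = [math.sin(2 * math.pi * 62.5 * t / RATE) for t in range(N)]
filtered = bandpass([a + b for a, b in zip(tone, hum)], RATE)
```

Zeroing bins like this is the bluntest possible band-pass. On real recordings you would normally process overlapping windowed frames, or use a proper filter design from a DSP library, to avoid audible artifacts.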

Also read more about the FFT: Fast Fourier Transform of Human Voice

Hope this helps :)

For voice detection, try an FFT algorithm.

For noise, try the Speex library.

The way to process the input is to use a specialised library which removes noise.

For example, Audacity (http://audacity.sourceforge.net) does noise removal.

So long as you have characterised the main types of noise, you should have only speech remaining.

It would be worthwhile collecting sampling data before the capture from the user, and after the user ended the capture, as this would provide at-the-time samples of noise in the environment. This is useful if each user faces unique background noise challenges.
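One way to use those before-and-after samples, sketched under some assumptions: treat them as pure ambient noise, average their RMS level into a noise floor, and flag any frame of the capture that is clearly louder. The 160-sample frame (20 ms at 8 kHz), the 3x margin and all function names are illustrative choices, not values from any particular library.

```python
import math
import random


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_frames(samples, frame_len):
    """Cut samples into consecutive full frames, dropping any short tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def noise_floor(pre_capture, post_capture, frame_len=160):
    """Average frame RMS of the ambient noise recorded around the capture."""
    frames = (split_frames(pre_capture, frame_len)
              + split_frames(post_capture, frame_len))
    if not frames:
        raise ValueError("need at least one full frame of ambient noise")
    return sum(rms(f) for f in frames) / len(frames)


def flag_speech(capture, floor, frame_len=160, margin=3.0):
    """One boolean per frame: True when the frame is well above the floor."""
    return [rms(f) > margin * floor for f in split_frames(capture, frame_len)]


# Demo with synthetic data: quiet ambient noise and a louder 200 Hz "voice".
rng = random.Random(0)


def ambient(n):
    return [rng.uniform(-0.01, 0.01) for _ in range(n)]


pre, post = ambient(800), ambient(800)
voiced = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) + x
          for t, x in zip(range(320), ambient(320))]
capture = ambient(320) + voiced + ambient(160)
flags = flag_speech(capture, noise_floor(pre, post))
```

Because the floor is measured per session, a user in a noisy room automatically gets a higher threshold than a user in a quiet one.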

What exactly are you looking for? Do you just want to filter out the human speech in the audio or do you actually want to know what the person has said?

Filtering the human speech is done by nearly every smartphone: a second microphone at the back of the device records the background noise, and the two signals are subtracted. But to be honest, I haven't seen any Android API where you can directly access the two signals.
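If you did have both signals, a plain sample-by-sample subtraction would rarely line up, because the noise reaches the two microphones with different delay and gain. A common substitute is an adaptive filter. The following is a minimal LMS noise canceller sketch; it is my illustration of the idea, not what any phone vendor is known to ship, and the tap count, step size and names are assumptions.

```python
import math
import random


def lms_cancel(primary, reference, taps=8, mu=0.01):
    """Subtract an adaptively filtered copy of `reference` from `primary`.

    primary:   voice microphone (speech plus noise)
    reference: noise microphone (noise only, ideally)
    Returns the error signal, which approximates the speech.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    output = []
    for desired, ref in zip(primary, reference):
        history = [ref] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = desired - estimate
        # LMS update: nudge the weights along the error gradient.
        weights = [w + 2 * mu * error * x for w, x in zip(weights, history)]
        output.append(error)
    return output


def power(seq):
    return sum(s * s for s in seq) / len(seq)


# Demo: the voice mic hears the noise one sample late and at half volume.
rng = random.Random(1)
N = 4000
noise = [rng.uniform(-1.0, 1.0) for _ in range(N)]
speech = [0.1 * math.sin(2 * math.pi * 300 * t / 8000) for t in range(N)]
primary = [speech[t] + 0.5 * (noise[t - 1] if t > 0 else 0.0)
           for t in range(N)]
cleaned = lms_cancel(primary, noise)

# Compare the leftover noise after the filter has had time to converge.
tail = slice(3000, None)
noise_before = power([p - s for p, s in zip(primary[tail], speech[tail])])
noise_after = power([c - s for c, s in zip(cleaned[tail], speech[tail])])
```

The filter learns the delay and gain on its own, which is why it copes where a fixed subtraction would not. It only works if the reference microphone picks up little of the speech itself.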

If you want to do speech to text conversion, then have a look at Sphinx4 and Praat. Both do this job but again, I haven't seen an implementation for Android. Sphinx4 claims to be fully written in Java, so it should be possible to embed it in an Android App.

Have you considered using Microsoft's Speech Recognition API? You can use a spoken key phrase to begin recording, like how they say "computer" before asking the computer something in Star Trek. Use ISpRecognizer::CreateRecoContext to load your recognition grammar and start recognition. Then implement a check with ISpPhrase to see whether you should begin recording or not.

In the completely general case, this is an unsolved problem. In the practical sense...

The first step is to get as noise-free a recording as possible. As others have noted, that starts with a directional microphone focused as tightly as possible on the sound you want to keep.

Second step is filtering. As noted previously, the telephone company did a lot of work on which frequency ranges are actually needed by humans for speech comprehension. Filtering out frequencies outside that range will make the voice sound like... well, a telephone... but will get rid of more of the background noise.

If you want to go beyond that, things can get really complicated. There are some algorithms which, if you can show them a sample of what you consider noise on that particular recording, will analyse it and try to subtract it out without damaging the sound you want to keep too much. This is not simple programming; if I were you I'd seriously consider buying it from someone who has already gotten it right rather than trying to reinvent/reimplement it. I don't know whether any of them are available for Android or whether the typical Android box has enough computing power to execute them in anything like realtime. (I've used SoundSoap in the studio to remove A/C noise, and it works very well.)
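To give a feel for what such "show it a noise sample" tools do, here is a bare-bones spectral subtraction sketch. It is an assumption on my part that this is the general family of technique involved; commercial tools are far more refined. The frame size, the non-overlapping framing and all names are illustrative.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even, odd = fft(values[0::2]), fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT computed with the conjugation trick."""
    n = len(spectrum)
    return [c.conjugate() / n
            for c in fft([c.conjugate() for c in spectrum])]


def noise_profile(noise, frame_len):
    """Average magnitude spectrum over the full frames of a noise sample."""
    starts = range(0, len(noise) - frame_len + 1, frame_len)
    if len(starts) == 0:
        raise ValueError("noise sample is shorter than one frame")
    profile = [0.0] * frame_len
    for i in starts:
        for k, c in enumerate(fft(noise[i:i + frame_len])):
            profile[k] += abs(c) / len(starts)
    return profile


def spectral_subtract(samples, profile):
    """Subtract the noise magnitude per bin, keeping the original phase."""
    frame_len = len(profile)
    out = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        spectrum = fft(samples[i:i + frame_len])
        for k, c in enumerate(spectrum):
            mag = abs(c)
            reduced = max(mag - profile[k], 0.0)  # never go negative
            spectrum[k] = c * (reduced / mag) if mag > 0.0 else 0j
        out.extend(c.real for c in ifft(spectrum))
    return out


# Demo: a 1000 Hz tone buried in steady 250 Hz hum.
RATE, FRAME = 8000, 256
hum = [0.2 * math.sin(2 * math.pi * 250 * t / RATE) for t in range(1024)]
tone = [0.5 * math.sin(2 * math.pi * 1000 * t / RATE) for t in range(512)]
recording = [a + b for a, b in zip(tone, hum)]
cleaned = spectral_subtract(recording, noise_profile(hum, FRAME))
```

The demo is deliberately easy: the hum is perfectly steady and sits in its own frequency bin. Real noise overlaps the voice and changes over time, which is where the "without damaging the sound too much" part becomes hard.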

In fact, my own inclination would be to simplify the problem to a solved one: use the most directional and closest mic I could get, let Android do the recording... but then do the signal processing to clean it up later, using off-the-shelf tools. But I admit I'm biased because I have already invested in the latter.

I tried to solve a similar problem on Windows. One thing I learned fast: simple frequency analysis with a fast Fourier transform is not enough. Lots of noises fall within the frequency range of the human voice, from simple taps on the microphone to clapping hands. Even fairly sophisticated filtering won't do it. I've found the easiest way is to send the audio to a cloud API and ask it to transcribe the speech. If the cloud API can transcribe it to a string of reasonable length, I continue recording; otherwise, I stop recording. This does require that you sample some of the audio and send it to a cloud provider.
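The decision logic is tiny once the transcription call exists. In this sketch `transcribe` is a hypothetical callable that you would wrap around whichever cloud speech-to-text service you use; it is not a real API, and the 8-character minimum is an arbitrary assumption.

```python
def keep_recording(audio_chunk, transcribe, min_chars=8):
    """Return True when the transcript is long enough to look like speech.

    transcribe: hypothetical callable taking raw audio and returning text.
    """
    text = transcribe(audio_chunk)
    return len(text.strip()) >= min_chars


# Stand-in for a real cloud call, so the sketch runs offline.
def fake_transcribe(chunk):
    canned = {b"speech": "turn on the lights", b"clap": ""}
    return canned.get(chunk, "")


speech_result = keep_recording(b"speech", fake_transcribe)
clap_result = keep_recording(b"clap", fake_transcribe)
```

In practice you would also want a timeout and some handling for network errors, since every check is a round trip to the provider.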

Most of the other answers have misunderstood the question, and their replies solve problems different from yours.

You should parse the audio in your buffer, searching for frequencies in the human voice range. As soon as you detect them, someone has started talking and you can start recording (don't forget to include the buffer too, as it contains the first part of the speech).

Search for routines that print the list of frequencies in a raw audio stream.
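A rough sketch of that idea: keep the last few frames in a small ring buffer, check each new frame for energy in the voice band, and when it triggers, start the recording with the buffered frames so the first syllable is not lost. The 300-3400 Hz band, the thresholds and the names are assumptions for illustration, and as another answer points out, a frequency check alone will also fire on claps and taps.

```python
import cmath
import math
from collections import deque


def rms(frame):
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def voice_band_ratio(frame, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Share of spectral power (DC excluded) that falls inside the band."""
    n = len(frame)
    total = in_band = 0.0
    for k in range(1, n // 2 + 1):
        # Plain DFT of one bin; fine for short frames, slow for long ones.
        coeff = sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, s in enumerate(frame))
        bin_power = abs(coeff) ** 2
        total += bin_power
        if low_hz <= k * sample_rate / n <= high_hz:
            in_band += bin_power
    return in_band / total if total > 0.0 else 0.0


def capture_from_first_voice(frames, sample_rate, pre_roll=2,
                             min_ratio=0.6, min_rms=0.05):
    """Return the samples from `pre_roll` frames before the first voice-like frame."""
    history = deque(maxlen=pre_roll)  # ring buffer of the latest frames
    recording = None
    for frame in frames:
        if recording is not None:
            recording.extend(frame)
        elif (rms(frame) >= min_rms
              and voice_band_ratio(frame, sample_rate) >= min_ratio):
            # Start with the buffered frames so the speech onset is kept.
            recording = [s for old in history for s in old]
            recording.extend(frame)
        else:
            history.append(frame)
    return recording if recording is not None else []


# Demo: three frames of 62.5 Hz hum, then two frames of a 1000 Hz tone.
RATE, FRAME = 8000, 128


def sine(freq, amp):
    return [amp * math.sin(2 * math.pi * freq * t / RATE)
            for t in range(FRAME)]


frames = [sine(62.5, 0.5)] * 3 + [sine(1000, 0.5)] * 2
recording = capture_from_first_voice(frames, RATE)
```

Here the recording ends up holding two hum frames of pre-roll followed by the two tone frames. Raising `pre_roll` keeps more lead-in audio at the cost of a little more memory.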
