Split speech audio file on words in Python

无人及你 2020-12-23 10:16

I feel like this is a fairly common problem, but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to break on words, which ca

4 Answers
  • 2020-12-23 10:36

    You could look at Audiolab. It provides a decent API for converting the voice samples into NumPy arrays. The Audiolab module uses the libsndfile C library to do the heavy lifting.

    You can then scan the arrays for runs of low-amplitude values to find the pauses.
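
    A minimal sketch of that scan, assuming the samples are already in a one-dimensional NumPy array (however they were loaded). The function name find_pauses, the 20 ms frame, the 10% threshold and the 200 ms minimum pause are illustrative choices, not part of Audiolab:

```python
import numpy as np


def find_pauses(samples, sample_rate, frame_ms=20, threshold_ratio=0.1, min_pause_ms=200):
    """Return (start_sec, end_sec) spans where the frame RMS stays below a threshold."""
    samples = np.asarray(samples, dtype=np.float64)
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return []
    # Cut the signal into fixed-size frames and measure the energy of each one.
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Assumption: a frame is "quiet" if it is below a fraction of the loudest frame.
    quiet = rms < threshold_ratio * rms.max()
    min_frames = max(1, int(round(min_pause_ms / frame_ms)))

    pauses = []
    start = None
    # The appended False acts as a sentinel that closes a pause at the end of the file.
    for i, is_quiet in enumerate(np.append(quiet, False)):
        if is_quiet and start is None:
            start = i
        elif not is_quiet and start is not None:
            if i - start >= min_frames:
                pauses.append((start * frame_len / sample_rate, i * frame_len / sample_rate))
            start = None
    return pauses


# Synthetic check: 0.5 s tone, 0.5 s silence, 0.5 s tone at 8 kHz.
rate = 8000
t = np.arange(rate // 2) / rate
tone = np.sin(2 * np.pi * 440 * t)
signal = np.concatenate([tone, np.zeros(rate // 2), tone])
pauses = find_pauses(signal, rate)
print(pauses)  # → [(0.5, 1.0)]
```

    On real recordings the threshold and minimum pause length will need tuning, and pauses between words in fluent speech are often too short to detect this way.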

  • 2020-12-23 10:39

    Use IBM Speech to Text (STT). With timestamps=true you get the word breakdown along with the times at which the system detected each word being spoken.

    There are other useful features, such as word_alternatives_threshold to get alternative hypotheses for words and word_confidence to get the confidence with which the system predicts each word. Set word_alternatives_threshold to a value between 0.01 and 0.1 to get a useful set of alternatives.

    This requires signing up, after which you can use the generated username and password.

    IBM STT is already part of the speechrecognition module mentioned, but to get the word timestamps you will need to modify the function.

    An extracted and modified form looks like:

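    import base64
    import json
    from urllib.error import HTTPError, URLError
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    import speech_recognition as sr

    # Placeholders: replace with the credentials generated for your IBM account.
    IBM_USERNAME = "your-username"
    IBM_PASSWORD = "your-password"

    def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD, language="en-US", show_all=False, timestamps=False,
                                    word_confidence=False, word_alternatives_threshold=0.1):
        # Pass show_all=True to get the raw JSON (including timestamps) instead of only the transcript.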
        assert isinstance(username, str), "``username`` must be a string"
        assert isinstance(password, str), "``password`` must be a string"
    
        flac_data = audio_data.get_flac_data(
            convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
            convert_width=None if audio_data.sample_width >= 2 else 2  # audio samples should be at least 16-bit
        )
        url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
            "profanity_filter": "false",
            "continuous": "true",
            "model": "{}_BroadbandModel".format(language),
            "timestamps": "{}".format(str(timestamps).lower()),
            "word_confidence": "{}".format(str(word_confidence).lower()),
            "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
        }))
        request = Request(url, data=flac_data, headers={
            "Content-Type": "audio/x-flac",
            "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
        })
        authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
        request.add_header("Authorization", "Basic {}".format(authorization_value))
    
        try:
            response = urlopen(request, timeout=None)
        except HTTPError as e:
            raise sr.RequestError("recognition request failed: {}".format(e.reason))
        except URLError as e:
            raise sr.RequestError("recognition connection failed: {}".format(e.reason))
        response_text = response.read().decode("utf-8")
        result = json.loads(response_text)
    
        # return results
        if show_all: return result
        if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
            raise Exception("Unknown Value Exception")
    
        transcription = []
        for utterance in result["results"]:
            if "alternatives" not in utterance:
                raise Exception("Unknown Value Exception. No Alternatives returned")
            for hypothesis in utterance["alternatives"]:
                if "transcript" in hypothesis:
                    transcription.append(hypothesis["transcript"])
        return "\n".join(transcription)
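
    To actually get at the word timings, call the function with timestamps=True and show_all=True and read them out of the returned JSON. The sketch below assumes each alternative carries a "timestamps" list of [word, start_seconds, end_seconds] entries; the helper name extract_word_timestamps and the sample dictionary are made up for illustration and are not real service output:

```python
def extract_word_timestamps(result):
    """Collect (word, start_sec, end_sec) from the first alternative of each result."""
    words = []
    for utterance in result.get("results", []):
        alternatives = utterance.get("alternatives", [])
        if not alternatives:
            continue
        # Assumption: the first alternative is the one to keep.
        for word, start, end in alternatives[0].get("timestamps", []):
            words.append((word, float(start), float(end)))
    return words


# Hand-written sample in the assumed response shape.
sample = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "hello world ",
                    "timestamps": [["hello", 0.15, 0.52], ["world", 0.58, 1.01]],
                }
            ],
            "final": True,
        }
    ]
}

for word, start, end in extract_word_timestamps(sample):
    print("{0}: {1:.2f}s to {2:.2f}s".format(word, start, end))
```

    The resulting start and end times can then be used to cut the original audio into one clip per word with whatever audio library you prefer.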
    
  • 2020-12-23 10:41

    pyAudioAnalysis can segment an audio file if the words are clearly separated (this is rarely the case in natural speech). The package is relatively easy to use:

    python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3
    

    More details on my blog.

  • 2020-12-23 10:52

    An easier way to do this is with the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting the silence threshold and the minimum silence length, and simplify the code significantly compared with the other methods mentioned.

    Here is a demo implementation, with inspiration from here.

    Setup:

    I had an audio file with spoken English letters from A to Z in the file "a-z.wav". I created a sub-directory splitAudio in the current working directory. Upon executing the demo code, the audio was split into 26 separate files, each storing one syllable.

    Observations: Some of the syllables were cut off, so the following parameters may need adjusting:
    min_silence_len=500
    silence_thresh=-16

    One may want to tune these to one's own requirement.

    Demo Code:

    from pydub import AudioSegment
    from pydub.silence import split_on_silence
    
    sound_file = AudioSegment.from_wav("a-z.wav")
    audio_chunks = split_on_silence(sound_file, 
        # must be silent for at least half a second
        min_silence_len=500,
    
        # consider it silent if quieter than -16 dBFS
        silence_thresh=-16
    )
    
    for i, chunk in enumerate(audio_chunks):
    
        out_file = ".//splitAudio//chunk{0}.wav".format(i)
        print("exporting {0}".format(out_file))
        chunk.export(out_file, format="wav")
    

    Output:

    Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> ================================ RESTART ================================
    >>> 
    exporting .//splitAudio//chunk0.wav
    exporting .//splitAudio//chunk1.wav
    exporting .//splitAudio//chunk2.wav
    exporting .//splitAudio//chunk3.wav
    exporting .//splitAudio//chunk4.wav
    exporting .//splitAudio//chunk5.wav
    exporting .//splitAudio//chunk6.wav
    exporting .//splitAudio//chunk7.wav
    exporting .//splitAudio//chunk8.wav
    exporting .//splitAudio//chunk9.wav
    exporting .//splitAudio//chunk10.wav
    exporting .//splitAudio//chunk11.wav
    exporting .//splitAudio//chunk12.wav
    exporting .//splitAudio//chunk13.wav
    exporting .//splitAudio//chunk14.wav
    exporting .//splitAudio//chunk15.wav
    exporting .//splitAudio//chunk16.wav
    exporting .//splitAudio//chunk17.wav
    exporting .//splitAudio//chunk18.wav
    exporting .//splitAudio//chunk19.wav
    exporting .//splitAudio//chunk20.wav
    exporting .//splitAudio//chunk21.wav
    exporting .//splitAudio//chunk22.wav
    exporting .//splitAudio//chunk23.wav
    exporting .//splitAudio//chunk24.wav
    exporting .//splitAudio//chunk25.wav
    exporting .//splitAudio//chunk26.wav
    >>> 
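
    Since a fixed silence_thresh of -16 dBFS will not suit every recording, one possible way to tune it is to place the threshold a fixed margin below the clip's own average loudness, which pydub exposes as AudioSegment.dBFS. This is only a sketch: the helpers relative_silence_thresh and split_words and the 16 dB margin are illustrative assumptions, while keep_silence is the split_on_silence parameter that leaves some padding around each chunk so that syllables are less likely to be clipped:

```python
import os


def relative_silence_thresh(average_dbfs, margin_db=16.0):
    """Threshold in dBFS placed margin_db below the clip's average loudness."""
    return average_dbfs - margin_db


def split_words(wav_path, out_dir, min_silence_len=500, margin_db=16.0, keep_silence=100):
    """Split a WAV file on silence and export the chunks; returns the output paths."""
    # Imported here so the helper above can be used without pydub installed.
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    sound = AudioSegment.from_wav(wav_path)
    chunks = split_on_silence(
        sound,
        min_silence_len=min_silence_len,  # minimum pause length in ms
        silence_thresh=relative_silence_thresh(sound.dBFS, margin_db),
        keep_silence=keep_silence,  # ms of silence kept around each chunk
    )

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(out_dir, "chunk{0}.wav".format(i))
        chunk.export(path, format="wav")
        paths.append(path)
    return paths
```

    For example, split_words("a-z.wav", "splitAudio") would write the chunks into the splitAudio directory. Lowering min_silence_len splits on shorter pauses, and a larger margin_db treats only very quiet passages as silence.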
    