Question
It's possible to use Google's Speech Recognition API to get a transcription of an audio file (WAV, MP3, etc.) by sending a request to http://www.google.com/speech-api/v2/recognize?...
Example: I said "one two three four five" in a WAV file. The Google API gives me this:
{
    u'alternative':
    [
        {u'transcript': u'12345'},
        {u'transcript': u'1 2 3 4 5'},
        {u'transcript': u'one two three four five'}
    ],
    u'final': True
}
Question: is it possible to get the time (in seconds) at which each word was said?
With my example:
['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.
i.e. the word "one" was said between 00:00:00.23 and 00:00:00.80, and
the word "two" was said between 00:00:01.03 and 00:00:01.45.
PS: I'm looking for an API that supports languages other than English, especially French.
Answer 1:
I believe the other answer is now out of date. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets
Answer 2:
It is not possible with the Google API.
If you want word timestamps, you can use other APIs, for example:
CMUSphinx - free offline speech recognition API
SpeechMatics SaaS speech recognition API
Speech Recognition API from IBM
Answer 3:
Yes, it is very much possible. All you need to do is set enable_word_time_offsets=True in the config:
    config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)
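For comparison, here is a rough sketch of the same setting expressed as a JSON request body, as it might be sent to the REST interface instead of through the Python client. It assumes the REST field names languageCode and enableWordTimeOffsets and a Cloud Storage URI for the audio. The helper name build_recognize_request and the bucket path are made up for illustration, and "fr-FR" is used only because the question asks about French.

```python
import json


def build_recognize_request(gcs_uri, language_code="fr-FR"):
    """Build a JSON-serializable request body that asks for word time offsets.

    Hypothetical helper: the field names are assumed to match the REST API,
    and only the fields relevant to word timestamps are shown.
    """
    return {
        "config": {
            "languageCode": language_code,
            "enableWordTimeOffsets": True,
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and file name, for illustration only.
body = build_recognize_request("gs://my-bucket/audio.wav")
print(json.dumps(body, indent=2))
```

Other settings (encoding, sample rate, and so on) are left out here on purpose; they depend on the audio file and are not covered by this answer.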
Then, for each word in the alternative, you can print its start time and end time as in this code:
    # `response` is assumed to be the object returned by the recognition call
    for result in response.results:
        # The first alternative is the most likely transcription
        alternative = result.alternatives[0]
        print(u'Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}'.format(alternative.confidence))
        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            # Offsets come as seconds plus nanoseconds; combine them into float seconds
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9))
This would give you output in the following format:
Transcript: Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7
Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets
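To get from the per-word loop above to the [word, start, end] list the question asks for, something like the following sketch could work. It is not part of the original answer. The helper names to_seconds and word_timings are hypothetical, and the sample data is a hand-made stand-in rather than real API output. It also assumes that, depending on the client library version, an offset may arrive either as an object with .seconds and .nanos (as in the code above) or as a datetime.timedelta.

```python
from datetime import timedelta
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a time offset to float seconds.

    Accepts a datetime.timedelta, or a Duration-like object exposing
    .seconds and .nanos (the form used in the loop above).
    """
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def word_timings(words):
    """Return [[word, start_seconds, end_seconds], ...], rounded to 10 ms."""
    return [
        [w.word,
         round(to_seconds(w.start_time), 2),
         round(to_seconds(w.end_time), 2)]
        for w in words
    ]


# Stand-in objects mimicking alternative.words (made-up values, not API output).
fake_words = [
    SimpleNamespace(word="one",
                    start_time=SimpleNamespace(seconds=0, nanos=230000000),
                    end_time=SimpleNamespace(seconds=0, nanos=800000000)),
    SimpleNamespace(word="two",
                    start_time=timedelta(seconds=1.03),
                    end_time=timedelta(seconds=1.45)),
]

print(word_timings(fake_words))
```

With a real response, word_timings(alternative.words) would be called inside the loop over the results instead of on the stand-in list.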
Source: https://stackoverflow.com/questions/34086379/google-speech-recognition-api-timestamp-for-each-word