Question
I am trying to set up a server that receives audio from a client browser using SocketIO, processes it through Google Speech-to-Text, and finally replies to the client with the text.
Originally, and ideally, I wanted it to work somewhat like the tool on this page: https://cloud.google.com/speech-to-text/
I tried using getUserMedia and streaming the result through SocketIO-Stream, but I couldn't figure out how to 'pipe' a MediaStream.
Instead, I've now decided to use MediaRecorder on the client side and send the data all together as a blob (as seen in this example).
On the server I then apply toString('base64') to the blob and pass the result to google-cloud/speech's client.recognize().
Client side (I'm using VueJS):
new Vue({
  el: '#app',
  data: function () {
    return ({
      msgs: [],
      socket: null,
      recorder: null,
      chunks: []
    })
  },
  mounted: function () {
    this.socket = io.connect('localhost:3000/user');
    console.log('Connected!')
    // Arrow function so that `this` still refers to the Vue instance
    this.socket.on('text', (text) => {
      this.msgs.push(text)
    })
  },
  methods: {
    startRecording: function () {
      if (this.recorder && this.recorder.state == 'recording') {
        console.log("Stopping!")
        this.recorder.stop()
      } else {
        console.log("Starting!")
        navigator.mediaDevices.getUserMedia({ audio: true, video: false })
          .then(this.handleSuccess);
      }
    },
    handleSuccess: function (stream) {
      this.recorder = new MediaRecorder(stream)
      this.recorder.start(10000)
      this.recorder.ondataavailable = (e) => {
        this.chunks.push(e.data)
        console.log(e.data)
      }
      this.recorder.onstop = (e) => {
        const blob = new Blob(this.chunks, { 'type': 'audio/webm; codecs=opus' })
        this.socket.emit('audio', blob)
      }
    }
  }
})
Server side:
const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();
const io = require('socket.io').listen(3000)
const ss = require('socket.io-stream')
const encoding = 'LINEAR16';
const sampleRateHertz = 16000;
const languageCode = 'en-US';
const audio = {
  content: null
}
const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
}
async function main() {
  const [response] = await client.recognize({
    audio: audio,
    config: config
  })
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log(`Transcription: ${transcription}`);
}
io.of('/user').on('connection', function (socket) {
  console.log('Connection made!')
  socket.on('audio', function (data) {
    audio.content = data.toString('base64')
    main().catch(console.error)
  });
});
The log from the main() function on the server side is always "Transcription: ", which is empty!
It should contain the text from the audio sent. Thank you in advance!
Answer 1:
Your nodejs application asks for the processing of raw audio data, recorded as an array of 16-bit signed integers ('LINEAR16') at a rate of 16k samples/sec (16000). This sort of audio representation is known as pulse-code modulation (PCM) for reasons lost in ancient telephony lore.
But the Blob you send from your client-side code is not that. It's a media object with the content-type audio/webm; codecs=opus. That means the audio track is compressed using the Opus codec and boxed (multiplexed) in the webm (Matroska, EBML) container format. The cloud speech-to-text code tries to interpret that as raw audio data, fails, throws up its hands and returns an empty transcription string. It's analogous to trying to view a zip file in a text editor: it's just gibberish.
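One way to confirm this on the server is to look at the first bytes of what arrives. The following is only a minimal sketch, assuming the payload reaches the 'audio' handler as a Node.js Buffer; `looksLikeWebm` is a hypothetical helper name, not part of any library. WebM/Matroska files begin with the four-byte EBML header ID 0x1A 0x45 0xDF 0xA3, whereas raw PCM has no header at all.

```javascript
// EBML header ID that opens every WebM/Matroska file.
const EBML_MAGIC = Buffer.from([0x1a, 0x45, 0xdf, 0xa3]);

// Hypothetical helper: true when the buffer starts with the EBML magic bytes,
// i.e. the payload is a container file rather than raw PCM samples.
function looksLikeWebm(buffer) {
  return buffer.length >= EBML_MAGIC.length &&
    buffer.subarray(0, EBML_MAGIC.length).equals(EBML_MAGIC);
}
```

Logging `looksLikeWebm(data)` inside the `socket.on('audio', ...)` handler should show whether the bytes being passed to recognize() are a container file instead of the LINEAR16 samples the config promises.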
To get speech-to-text to work with a media object, you have to extract the PCM audio from it first. This is a notorious pain in the neck to set up on a server; you have to use ffmpeg. There's a tutorial on it in the speech-to-text documentation. The tutorial mentions scraping the audio out of video files. Your Blob is, basically, a video file with no video track in it, so the same techniques work.
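As a rough illustration of that extraction step, here is a sketch that pipes the received Buffer through an ffmpeg child process. It assumes an `ffmpeg` binary is installed and on the PATH, and that ffmpeg can read this particular WebM from a pipe; `buildFfmpegArgs` and `webmToLinear16` are hypothetical helper names. It is not the method from Google's tutorial.

```javascript
const { spawn } = require('child_process');

// ffmpeg arguments: read a media file from stdin and write headerless,
// 16-bit little-endian, mono PCM at the requested sample rate to stdout.
function buildFfmpegArgs(sampleRateHertz) {
  return [
    '-i', 'pipe:0',
    '-f', 's16le',
    '-acodec', 'pcm_s16le',
    '-ac', '1',
    '-ar', String(sampleRateHertz),
    'pipe:1',
  ];
}

// Hypothetical helper: resolves with a Buffer of raw PCM (LINEAR16) samples.
function webmToLinear16(webmBuffer, sampleRateHertz = 16000) {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn('ffmpeg', buildFfmpegArgs(sampleRateHertz));
    const chunks = [];
    ffmpeg.stdout.on('data', (chunk) => chunks.push(chunk));
    ffmpeg.stderr.resume(); // discard ffmpeg's log output so the pipe cannot fill up
    ffmpeg.on('error', reject); // e.g. ffmpeg is not installed
    ffmpeg.on('close', (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
    ffmpeg.stdin.on('error', () => {}); // ignore EPIPE if ffmpeg exits early
    ffmpeg.stdin.end(webmBuffer);
  });
}
```

In the question's handler this could be used roughly as `audio.content = (await webmToLinear16(data)).toString('base64')`, keeping the LINEAR16 / 16000 config unchanged.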
But you'll be much better off returning to your first approach, using the MediaStream browser JavaScript APIs. In particular, your browser code should use elements of the Web Audio API to intercept the raw PCM audio data and send it to your server, or directly from your browser to speech-to-text.
Explaining all this is way beyond the scope of a StackOverflow answer. For some hints, see the question "How to use web audio api to get raw pcm audio?"
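One small piece of that approach can be sketched here. The Web Audio API hands out samples as a Float32Array of values between -1 and 1 (for example from AudioBuffer.getChannelData()), while LINEAR16 expects 16-bit signed integers. The conversion below is only a sketch; `floatTo16BitPCM` is a hypothetical helper name, and it does not resample, so sampleRateHertz in the server config would have to match the AudioContext's sampleRate.

```javascript
// Convert Web Audio samples (floats in [-1, 1]) to 16-bit signed PCM.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

The underlying bytes (`floatTo16BitPCM(samples).buffer`) are what would be sent over the socket in place of the WebM blob.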
Answer 2:
The Google Speech-To-Text v1p1beta1 API endpoint supports MP3 files now. As O.Jones says above, the MediaRecorder API is a good option, but now you can just get MP3 instead of raw PCM data. I found it difficult to implement the RecordRTC library with the intention of getting raw PCM because I ran into sound quality and cross-browser issues.
My solution: I used the mimeType audio/mp3 when creating my blob, like this: const blob = new Blob(chunks, { 'type' : 'audio/mp3' });
Then I converted the blob to a base64 string, as in this SO example. When you send an API call to Google's Speech-To-Text API, you have to specify the v1p1beta1 beta endpoint and set the config as I have done in the cURL request below. Note that the default sampling rate for MediaRecorder is 16000Hz. An example cURL call could be the following (you must specify your API key):
curl --location --request POST 'https://speech.googleapis.com/v1p1beta1/speech:recognize?key=yourkey' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "config": {
      "encoding":"MP3",
      "sampleRateHertz": 16000,
      "languageCode": "en-US"
    },
    "audio": {
      "content":""
    }
  }'
Also, this is working for me on Chrome, Firefox and Safari, but for Safari you must enable MediaRecorder under Develop -> Experimental Features -> Media Recorder.
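For a Node.js server, the cURL call above could be written roughly as follows. This is a sketch under assumptions: it relies on the global fetch available in Node 18 and later, and `buildRecognizeBody` and `recognizeMp3` are hypothetical helper names. The endpoint and config values are copied from the cURL example, and the response is read the same way as in the question's main() function.

```javascript
// Build the same JSON body as the cURL example, with the audio filled in.
function buildRecognizeBody(base64Audio) {
  return {
    config: { encoding: 'MP3', sampleRateHertz: 16000, languageCode: 'en-US' },
    audio: { content: base64Audio },
  };
}

// Hypothetical helper: POST the audio Buffer and return the transcript text.
async function recognizeMp3(audioBuffer, apiKey) {
  const url = 'https://speech.googleapis.com/v1p1beta1/speech:recognize?key=' +
    encodeURIComponent(apiKey);
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildRecognizeBody(audioBuffer.toString('base64'))),
  });
  if (!res.ok) throw new Error(`Speech API returned HTTP ${res.status}`);
  const json = await res.json();
  // `results` may be absent when nothing was recognized.
  return (json.results || [])
    .map((result) => result.alternatives[0].transcript)
    .join('\n');
}
```

Inside the question's 'audio' handler this might be called as `recognizeMp3(data, yourKey).then(text => socket.emit('text', text))`, where `yourKey` stands in for your own API key.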
Source: https://stackoverflow.com/questions/56453937/how-to-google-speech-to-text-using-blob-sent-from-browser-to-nodejs-server