Question
I am trying to send a continuous audio stream from a microphone directly to the IBM Watson SpeechToText Web service using the Java SDK. One of the examples provided with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a file in .WAV format to the service. However, .WAV files require that their length be specified ahead of time, so the naive approach of just appending to the file one buffer at a time is not feasible.
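To illustrate the constraint: the RIFF container stores the overall file size and the data-chunk size as fixed 32-bit little-endian fields in the header, so appending samples without rewriting those fields leaves the file inconsistent. A minimal sketch that just reads those two fields (assuming a canonical 44-byte header and a hypothetical file name sample.wav):

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: print the two size fields a WAV header bakes in up front
// (assumes the canonical 44-byte header with no extra chunks).
public class WavHeaderPeek {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile wav = new RandomAccessFile("sample.wav", "r")) {
            wav.seek(4);                                        // RIFF chunk size field
            int riffSize = Integer.reverseBytes(wav.readInt()); // stored little-endian
            wav.seek(40);                                       // data chunk size field
            int dataSize = Integer.reverseBytes(wav.readInt()); // stored little-endian
            System.out.println("RIFF chunk size: " + riffSize);
            System.out.println("data chunk size: " + dataSize);
        }
    }
}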
It appears that SpeechToText.recognizeUsingWebSocket can take a stream, but feeding it an instance of AudioInputStream does not seem to do the trick: the connection is established, but no transcripts are returned, even though RecognizeOptions.interimResults(true) is set.
import java.io.FileNotFoundException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

import com.ibm.watson.developer_cloud.http.HttpMediaType;
import com.ibm.watson.developer_cloud.speech_to_text.v1.SpeechToText;
import com.ibm.watson.developer_cloud.speech_to_text.v1.model.RecognizeOptions;
import com.ibm.watson.developer_cloud.speech_to_text.v1.model.SpeechResults;
import com.ibm.watson.developer_cloud.speech_to_text.v1.websocket.BaseRecognizeCallback;

public class RecognizeUsingWebSocketsExample {
  private static CountDownLatch lock = new CountDownLatch(1);

  public static void main(String[] args) throws FileNotFoundException, InterruptedException {
    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    // Capture 16 kHz, 16-bit, mono, signed little-endian PCM from the default microphone
    AudioInputStream audio = null;
    try {
      final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
      DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
      TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
      line.open(format);
      line.start();
      audio = new AudioInputStream(line);
    } catch (LineUnavailableException e) {
      e.printStackTrace();
    }

    RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(true)
        .contentType(HttpMediaType.AUDIO_WAV)
        .build();

    service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
      @Override
      public void onTranscription(SpeechResults speechResults) {
        System.out.println(speechResults);
        if (speechResults.isFinal())
          lock.countDown();
      }
    });

    lock.await(1, TimeUnit.MINUTES);
  }
}
Any help would be greatly appreciated.
-rg
Here's an update based on German's comment below (thanks for that).
I was able to use javaFlacEncode to convert the WAV stream arriving from the mic into a FLAC stream and save it into a temporary file. Unlike a WAV audio file, whose size is fixed at creation, the FLAC file can be appended to easily.
WAV_audioInputStream = new AudioInputStream(line);
FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);
StreamConfiguration streamConfiguration = new StreamConfiguration();
streamConfiguration.setSampleRate(16000);
streamConfiguration.setBitsPerSample(8);
streamConfiguration.setChannelCount(1);
flacEncoder = new FLACEncoder();
flacOutputStream = new FLACFileOutputStream(tempFile); // write to temp disk file
flacEncoder.setStreamConfiguration(streamConfiguration);
flacEncoder.setOutputStream(flacOutputStream);
flacEncoder.openFLACStream();
...
// convert data
int frameLength = 16000;
int[] intBuffer = new int[frameLength];
byte[] byteBuffer = new byte[frameLength];
while (true) {
    int count = WAV_audioInputStream.read(byteBuffer, 0, frameLength);
    for (int j1 = 0; j1 < count; j1++)
        intBuffer[j1] = byteBuffer[j1];
    flacEncoder.addSamples(intBuffer, count);
    flacEncoder.encodeSamples(count, false); // 'false' means non-final frame
}
flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true); // final frame
WAV_audioInputStream.close();
flacOutputStream.close();
FLAC_audioInputStream.close();
The resulting file can be analyzed (using curl or recognizeUsingWebSocket()) without any problems after adding an arbitrary number of frames. However, recognizeUsingWebSocket() returns the final result as soon as it reaches the end of the FLAC file, even though the file's last frame may not be final (i.e., it was written with encodeSamples(count, false)).
I would expect recognizeUsingWebSocket() to block until the final frame is written to the file. In practice, this means the analysis stops after the first frame: analyzing the first frame takes less time than collecting the second, so by the time the results are returned the end of the file has been reached.
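One possible workaround (only a sketch; BlockingFileInputStream is a hypothetical helper, not part of the SDK or javaFlacEncode) would be to wrap the growing FLAC file in a stream whose read() blocks at end-of-file until the recording is explicitly marked as finished, so the consumer never sees a premature end of stream:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: block on EOF until finish() is called, so a consumer
// reading the still-growing FLAC file never sees a premature end of stream.
// A real implementation would also override read(byte[], int, int).
class BlockingFileInputStream extends InputStream {
    private final FileInputStream in;
    private volatile boolean finished = false;

    BlockingFileInputStream(FileInputStream in) {
        this.in = in;
    }

    void finish() {
        finished = true;
    }

    @Override
    public int read() throws IOException {
        int b;
        while ((b = in.read()) < 0 && !finished) {
            try {
                Thread.sleep(50); // wait for the encoder to append more data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return b;
    }
}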
Is this the right way to implement streaming audio from a mic in Java? Seems like a common use case.
Here's a modification of RecognizeUsingWebSocketsExample, incorporating some of Daniel's suggestions below. It uses the PCM content type (passed as a String, together with a frame size) and attempts to signal the end of the audio stream, albeit not very successfully.
As before, the connection is made, but the recognize callback is never called. Closing the stream does not seem to be interpreted as an end of audio either. I must be misunderstanding something here...
public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {
  final PipedOutputStream output = new PipedOutputStream();
  final PipedInputStream input = new PipedInputStream(output);

  final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
  DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
  final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
  line.open(format);
  line.start();

  Thread thread1 = new Thread(new Runnable() {
    @Override
    public void run() {
      try {
        final int MAX_FRAMES = 2;
        byte buffer[] = new byte[16000];
        for (int j1 = 0; j1 < MAX_FRAMES; j1++) { // read two frames from the microphone
          int count = line.read(buffer, 0, buffer.length);
          System.out.println("Read audio frame from line: " + count);
          output.write(buffer, 0, buffer.length);
          System.out.println("Written audio frame to pipe: " + count);
        }
        /** no need to fake end-of-audio; StopMessage will be sent
         *  automatically by the SDK once the pipe is drained (see WebSocketManager)
        // signal end of audio; based on WebSocketUploader.stop() source
        byte[] stopData = new byte[0];
        output.write(stopData);
        **/
      } catch (IOException e) {
      }
    }
  });
  thread1.start();

  final CountDownLatch lock = new CountDownLatch(1);

  SpeechToText service = new SpeechToText();
  service.setUsernameAndPassword("<username>", "<password>");

  RecognizeOptions options = new RecognizeOptions.Builder()
      .continuous(true)
      .interimResults(false)
      .contentType("audio/pcm; rate=16000")
      .build();

  service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
    @Override
    public void onConnected() {
      System.out.println("Connected.");
    }

    @Override
    public void onTranscription(SpeechResults speechResults) {
      System.out.println("Received results.");
      System.out.println(speechResults);
      if (speechResults.isFinal())
        lock.countDown();
    }
  });

  System.out.println("Waiting for STT callback ... ");
  lock.await(5, TimeUnit.SECONDS);
  line.stop();
  System.out.println("Done waiting for STT callback.");
}
Dani, I instrumented the source for WebSocketManager (which comes with the SDK) and replaced a call to sendMessage() with an explicit StopMessage payload as follows:
/**
 * Send input steam.
 *
 * @param inputStream the input stream
 * @throws IOException Signals that an I/O exception has occurred.
 */
private void sendInputSteam(InputStream inputStream) throws IOException {
  int cumulative = 0;
  byte[] buffer = new byte[FOUR_KB];
  int read;
  while ((read = inputStream.read(buffer)) > 0) {
    cumulative += read;
    if (read == FOUR_KB) {
      socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
    } else {
      System.out.println("completed sending " + cumulative / 16000 + " frames over socket");
      socket.sendMessage(RequestBody.create(WebSocket.BINARY, Arrays.copyOfRange(buffer, 0, read))); // partial buffer write
      System.out.println("signaling end of audio");
      socket.sendMessage(RequestBody.create(WebSocket.TEXT, buildStopMessage().toString())); // end of audio signal
    }
  }
  inputStream.close();
}
Neither of the sendMessage() options (sending zero-length binary content or sending the stop text message) seems to work. The caller code is unchanged from above. The resulting output is:
Waiting for STT callback ...
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error
REVISED: actually, the end-of-audio call is never reached. The exception is thrown while writing the last (partial) buffer to the socket.
Why is the connection aborted? That typically happens when the peer closes the connection.
As for point 2): would either of these matter at this stage? It appears that the recognition process is not being started at all... The audio is valid (I wrote the stream out to disk and was able to recognize it by streaming it from a file, as I pointed out above).
Also, on further review of the WebSocketManager source code, onMessage() already sends a StopMessage immediately upon return from sendInputSteam() (i.e., when the audio stream, or the pipe in the example above, drains), so there is no need to call it explicitly. The problem is definitely occurring before the audio data transmission completes. The behavior is the same regardless of whether a PipedInputStream or an AudioInputStream is passed as input; the exception is thrown while sending binary data in both cases.
Answer 1:
The Java SDK has an example and supports this.
Update your pom.xml with:
<dependency>
  <groupId>com.ibm.watson.developer_cloud</groupId>
  <artifactId>java-sdk</artifactId>
  <version>3.3.1</version>
</dependency>
Here is an example of how to listen to your microphone.
SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

// Signed PCM AudioFormat with 16kHz, 16 bit sample size, mono
int sampleRate = 16000;
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

if (!AudioSystem.isLineSupported(info)) {
  System.out.println("Line not supported");
  System.exit(0);
}

TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

AudioInputStream audio = new AudioInputStream(line);

RecognizeOptions options = new RecognizeOptions.Builder()
  .continuous(true)
  .interimResults(true)
  .timestamps(true)
  .wordConfidence(true)
  //.inactivityTimeout(5) // use this to stop listening when the speaker pauses, i.e. for 5s
  .contentType(HttpMediaType.AUDIO_RAW + "; rate=" + sampleRate)
  .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
  @Override
  public void onTranscription(SpeechResults speechResults) {
    System.out.println(speechResults);
  }
});

System.out.println("Listening to your voice for the next 30s...");
Thread.sleep(30 * 1000);

// closing the WebSockets underlying InputStream will close the WebSocket itself.
line.stop();
line.close();

System.out.println("Fin.");
Answer 2:
What you need to do is feed the audio to the STT service not as a file, but as a headerless stream of audio samples. You just feed the samples that you capture from the microphone over a WebSocket. You need to set the content type to "audio/pcm; rate=16000", where 16000 is the sampling rate in Hz. If your sampling rate is different, which depends on how the microphone is encoding the audio, replace 16000 with your value, for example 44100, 48000, etc.
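For example, a minimal sketch of building the options for a microphone opened at 44.1 kHz (the rate value here is an assumption; it must match whatever format the TargetDataLine was actually opened with):

// Sketch: the rate in the content type must match the capture format.
int sampleRate = 44100; // assumed capture rate; use the rate your line was opened with
RecognizeOptions options = new RecognizeOptions.Builder()
    .interimResults(true)
    .contentType("audio/pcm; rate=" + sampleRate)
    .build();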
When feeding PCM audio, the STT service won't stop recognizing until you signal the end of audio by sending an empty binary message over the WebSocket.
Dani
Looking at the new version of your code, I see some issues:
1) Signaling end of audio can be done by sending an empty binary message through the WebSocket, but that is not what you are doing. The lines
// signal end of audio; based on WebSocketUploader.stop() source
byte[] stopData = new byte[0];
output.write(stopData);
are not doing anything, since they won't result in an empty WebSocket message being sent. Can you please call the method "WebSocketUploader.stop()" instead?
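For illustration, here is a rough sketch of what the empty end-of-audio frame could look like at the WebSocket level, reusing the okhttp-ws call style from the instrumented sendInputSteam() above (the method name sendEndOfAudio and its placement are assumptions, not the SDK's actual stop path):

// Sketch (assumes the same 'socket' field and okhttp-ws imports as the
// instrumented sendInputSteam() above): end of audio is an empty binary frame.
private void sendEndOfAudio() throws IOException {
    socket.sendMessage(RequestBody.create(WebSocket.BINARY, new byte[0]));
}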
2) You are capturing audio at 8 bits per sample; you should use 16 bits for adequate quality. Also, you are only feeding a couple of seconds of audio, which is not ideal for testing. Can you please write whatever audio you push to STT to a file and then open it with Audacity (using the import feature)? This way you can make sure that what you are feeding to STT is good audio.
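For example, a minimal standalone sketch for that check (the file name mic-dump.pcm and the 5-second duration are arbitrary choices): it records raw 16-bit PCM from the microphone to disk, which Audacity can open via File > Import > Raw Data (signed 16-bit little-endian, mono, 16000 Hz).

import java.io.FileOutputStream;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

// Sketch: dump a few seconds of raw PCM from the microphone to "mic-dump.pcm"
// so the captured audio can be sanity-checked in Audacity before sending it to STT.
public class MicDump {
    public static void main(String[] args) throws Exception {
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(
                new DataLine.Info(TargetDataLine.class, format));
        line.open(format);
        line.start();
        try (FileOutputStream dump = new FileOutputStream("mic-dump.pcm")) {
            byte[] buffer = new byte[16000];
            long end = System.currentTimeMillis() + 5000; // ~5 seconds of audio
            while (System.currentTimeMillis() < end) {
                int count = line.read(buffer, 0, buffer.length);
                dump.write(buffer, 0, count);
            }
        }
        line.stop();
        line.close();
    }
}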
Source: https://stackoverflow.com/questions/37232560/stream-audio-from-mic-to-ibm-watson-speechtotext-web-service-using-java-sdk