Google Cloud Text-to-Speech Interface Confusion (How do I download the mp3 files?)

坚强是说给别人听的谎言 posted on 2019-12-06 11:34:27

All of Google's ML-related tools have a pretty poor 'general user' experience and are designed very specifically for programmatic usage. If you're just looking for a basic tool with a reasonably nice interface, GCP probably isn't it at the moment.

Given that, the samples aren't that difficult to turn into something more if you're willing to struggle a little at the beginning. I'd suggest using the command line described here.

I'm going to add some initial steps.
1) Download and set up the gcloud SDK tools.
2) In a terminal, run gcloud auth application-default login. This opens a browser; log in as you would to the GCP Console.
3) They provide a sample request to generate a file:

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'input':{
      'text':'Android is a mobile operating system developed by Google,
         based on the Linux kernel and designed primarily for
         touchscreen mobile devices such as smartphones and tablets.'
    },
    'voice':{
      'languageCode':'en-gb',
      'name':'en-GB-Standard-A',
      'ssmlGender':'FEMALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt

This is what I meant about poor experience. The > synthesize-text.txt at the end of the command writes the result of the text-to-speech operation to synthesize-text.txt, and inside that file is your MP3. But wait: they expect you to use it programmatically, so the MP3 isn't returned as a plain file, since you might want to do something else with it. Instead it's returned in an encoding called Base64, which makes it easier to send binary data over HTTP (where text is most common). So instead of an MP3 you get a JSON file, like:

{ "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.." }

That text starting with // IS your audio. But because you're doing this manually, you need to copy everything inside the quotes (it'll be a really long string of characters starting with //; keep the // characters) into a new file named whatever you want; they named it synthesize-output-base64.txt. Then run:

base64 synthesize-output-base64.txt --decode > synthesized-audio.mp3
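If you'd rather not copy the string by hand, the same step can be approximated with sed. This is only a minimal sketch: it assumes the response was saved to a file (synthesize-text.txt in the request above) and that the "audioContent" value sits on a single line, as in the sample response. The function name extract_audio is made up for illustration, and the demo uses a tiny stand-in response rather than real audio.

```shell
#!/bin/sh
# extract_audio RESPONSE_FILE OUTPUT_FILE  (hypothetical helper)
# Pulls the Base64 string out of the "audioContent" field and decodes it.
# Assumes the whole value is on one line, as in the sample response.
extract_audio() {
  sed -n 's/.*"audioContent"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" \
    | base64 --decode > "$2"
}

# Demo with a stand-in response; "aGVsbG8=" is Base64 for "hello".
demo_dir=$(mktemp -d)
printf '{ "audioContent": "aGVsbG8=" }\n' > "$demo_dir/response.json"
extract_audio "$demo_dir/response.json" "$demo_dir/output.bin"
cat "$demo_dir/output.bin"   # prints: hello
```

With the real response, the call would be extract_audio synthesize-text.txt synthesized-audio.mp3.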

And you're done. The original request lets you specify the text, voice, etc. But realistically, if you're looking for casual text-to-speech with a pretty UI, GCP isn't there yet.

Or this one-liner (requires jq):

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'input':{
      'text':'Hello Android is a mobile operating system developed by Google,
         based on the Linux kernel and designed primarily for
         touchscreen mobile devices such as smartphones and tablets.'
    },
    'voice':{
      'languageCode':'en-gb',
      'name':'en-GB-Standard-A',
      'ssmlGender':'FEMALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" | jq --raw-output '.audioContent' | base64 --decode > synthesized-audio.mp3
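
Since the request lets you choose the text and voice, the one-liner can be wrapped in a small function so those become arguments. The sketch below is not part of Google's samples: build_request and synthesize are hypothetical names, the defaults are simply the values used above, ssmlGender is left out on the assumption that the voice name is enough, and the text is assumed to contain no double quotes, backslashes, or newlines because no JSON escaping is done.

```shell
#!/bin/sh
# build_request TEXT [LANGUAGE_CODE] [VOICE_NAME]  (hypothetical helper)
# Prints a JSON request body for the text:synthesize endpoint.
# Assumes TEXT has no double quotes, backslashes, or newlines,
# since no JSON escaping is done here.
build_request() {
  printf '{"input":{"text":"%s"},"voice":{"languageCode":"%s","name":"%s"},"audioConfig":{"audioEncoding":"MP3"}}\n' \
    "$1" "${2:-en-gb}" "${3:-en-GB-Standard-A}"
}

# synthesize TEXT OUTPUT_FILE  (hypothetical helper)
# Sends the request and decodes the audio; needs gcloud, curl, and jq.
synthesize() {
  build_request "$1" \
    | curl -s -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        -H "Content-Type: application/json; charset=utf-8" \
        --data @- "https://texttospeech.googleapis.com/v1/text:synthesize" \
    | jq --raw-output '.audioContent' | base64 --decode > "$2"
}

# Only the body builder runs here; it needs no network access.
build_request "Hello world"
```

A call such as synthesize "Good morning" morning.mp3 would then write the decoded audio to morning.mp3, provided the login step above has been completed.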