Python Requests URL with Unicode Parameters

闹比i 2021-01-06 07:25

I'm currently trying to hit the Google TTS URL, http://translate.google.com/translate_tts, with Japanese characters and phrases in Python using the requests library.

3 Answers
  • 2021-01-06 08:01

    The user agent can be part of the problem; however, it is not in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, wget, and possibly others. That is why you are seeing an HTTP 403 response when using urllib2.urlopen(): it sets the user agent to Python-urllib/2.7 (the version might vary).
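
    As a rough illustration of the default user agent described above, the sketch below inspects what the standard library would send and builds a request with an explicit header, without contacting the service. It assumes Python 3, where urllib2 became urllib.request; the helper names default_user_agent and browser_like_request are hypothetical and not part of the original answer.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that a default urllib opener would send (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples applied to every request.
    return dict(opener.addheaders)["User-agent"]


def browser_like_request(url, user_agent="Mozilla/5.0"):
    """Build, but do not send, a request carrying an explicit User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


url = "http://translate.google.com/translate_tts"
req = browser_like_request(url)

print(default_user_agent())          # starts with "Python-urllib/"
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))  # Mozilla/5.0
```

    Nothing is sent over the network here, so this only shows which header value would accompany the request, not how the service responds to it.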

    You found that setting the user agent to Mozilla/5.0 fixed the problem, but that may only work because the API appears to assume a particular encoding based on the user agent.

    What you actually should do is to explicitly specify the URL character encoding with the ie field. Your URL request should look like this:

    http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
    

    Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec does state that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
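
    To see where the percent-encoded q value in that URL comes from, here is a minimal sketch using only the standard library, assuming Python 3. The Shift_JIS line is included purely to show that the same text yields different bytes under another encoding; it is an illustrative assumption, not a claim about what the service expects.

```python
from urllib.parse import quote, urlencode

text = "\u3072\u3068\u3064"  # the same string as `one` in the answer's code

# quote() percent-encodes the bytes of the string in the given encoding.
encoded_q = quote(text, encoding="utf-8")

# The same text under a different encoding gives a different q value,
# which is why the service needs to be told which encoding was used.
sjis_q = quote(text, encoding="shift_jis")

# urlencode() builds the whole query string, including ie=UTF-8.
query = urlencode({"ie": "UTF-8", "tl": "ja", "q": text}, encoding="utf-8")
full_url = "http://translate.google.com/translate_tts?" + query

print(encoded_q)  # %E3%81%B2%E3%81%A8%E3%81%A4
print(full_url)
```

    The printed URL should match the one shown above, since each of the three characters encodes to three UTF-8 bytes.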

    The API supports kanji, hiragana, and katakana (possibly others?). In the code below, the kanji, hiragana, and katakana strings all produce "nihongo", although the audio produced for hiragana input has a slightly different inflection from the others.

    import requests
    
    one = u'\u3072\u3068\u3064'
    kanji = u'\u65e5\u672c\u8a9e'
    hiragana = u'\u306b\u307b\u3093\u3054'
    katakana = u'\u30cb\u30db\u30f3\u30b4'
    url = 'http://translate.google.com/translate_tts'
    
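    for text in one, kanji, hiragana, katakana:
        # requests percent-encodes the params as UTF-8; ie=UTF-8 tells the service to decode them the same way
        r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
        print(u"{} -> {}".format(text, r.url))
        # Save the returned audio; the with block closes the file promptly
        with open(u'/tmp/{}.mp3'.format(text), 'wb') as f:
            f.write(r.content)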
    
  • 2021-01-06 08:10

    I made this little method before to help me with UTF-8 encoding. I was having issues printing Cyrillic and CJK languages to CSV files, and this did the trick.

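    def assist(unicode_string):
        # Python 2 only: the 'string_escape' codec does not exist in Python 3
        utf8 = unicode_string.encode('utf-8')   # unicode -> UTF-8 byte string
        read = utf8.decode('string_escape')     # interpret backslash escapes in the bytes
    
        return read   # UTF-8 encoded byte string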
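
    The helper above depends on Python 2 string handling. As a hedged sketch of the same goal (getting Cyrillic and CJK text into a CSV file) on Python 3, one option is to let open() do the encoding. The sample rows and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

# Sample data, invented for this sketch.
rows = [
    ["language", "sample"],
    ["Russian", "\u043f\u0440\u0438\u0432\u0435\u0442"],
    ["Japanese", "\u65e5\u672c\u8a9e"],
]

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "samples.csv")

# encoding="utf-8" makes the file layer encode the text,
# so the rows can stay as ordinary str objects.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(path, "r", encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```

    With this approach no manual encode or decode step is needed, because the text stays as str until it reaches the file.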
    

    Also, make sure you have these two lines at the beginning of your .py.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    

    The first line is just a good Python habit: it specifies which interpreter to use to run the .py file (really only useful if you have more than one version of Python installed on your machine). The second line declares the encoding of the Python source file. A slightly longer answer for this is given here.

  • 2021-01-06 08:18

    Setting the User-Agent to Mozilla/5.0 fixes this issue.

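    from io import BytesIO
    import requests
    
    __author__ = 'jacob'
    
    # Maps a human-readable language name to the code passed in the tl parameter
    langs = {'japanese': 'ja',
             'english': 'en'}
    
    def get_sound_file_for_text(text, download=False, lang='japanese'):
        """Return an in-memory file with the audio if download is True, else the base URL."""
        glang = langs[lang]
        # Strip characters that should not be sent to the service
        for unwanted in ('*', '/', 'x'):
            text = text.replace(unwanted, '')
        url = 'http://translate.google.com/translate_tts'
        if download:
            # Send a browser-like User-Agent instead of the library default
            result = requests.get(url, params={'tl': glang, 'q': text}, headers={'User-Agent': 'Mozilla/5.0'})
            r = BytesIO(result.content)  # binary buffer, positioned at the start
            return r
        else:
            # Note: this is the bare endpoint, without the query parameters
            return url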
    