How to remove accents in Python 3.5 and get a string with unicodedata or other solutions?

清酒与你 2021-01-11 16:36

I am trying to get a string to use with the Google Geocoding API. I've checked a lot of threads but I am still facing a problem and I don't understand how to solve it.

I nee

4 answers
  •  时光说笑
    2021-01-11 16:47

    addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')

    You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
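To make the difference concrete, here is a minimal sketch comparing the two encodings on a decomposed string. The sample text and variable names are illustrative assumptions, not taken from the question.

```python
import unicodedata

text = "Crème brûlée"  # sample input, assumed for illustration

# NFKD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFKD", text)

# UTF-8 can represent every character, so "ignore" has nothing to drop.
utf8_roundtrip = decomposed.encode("utf-8", "ignore").decode("utf-8")

# ASCII cannot represent the combining marks, so "ignore" drops them.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")

print(utf8_roundtrip == decomposed)  # the UTF-8 round trip changes nothing
print(ascii_only)
```

Note that the ASCII version also silently discards any character that has no ASCII base letter at all, which is why it is a blunt tool.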

    is there a better solution?

    It depends what you are trying to do.

    If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
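The category-based approach described above could look roughly like this. The function name strip_diacritics is a hypothetical helper, and this is a sketch rather than a complete solution for every script.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks while keeping other non-ASCII characters."""
    # Decompose accented letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Categories starting with "M" (Mn, Mc, Me) are the mark categories.
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")]
    return "".join(kept)


print(strip_diacritics("Crème brûlée"))
print(strip_diacritics("東京 café"))
```

Unlike encoding to ASCII with 'ignore', this leaves characters such as 東京 or ß in place, because they are not base-letter-plus-mark combinations.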

    If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
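As a rough illustration of what a language-specific replacement might look like, here is a sketch for German only. The table GERMAN_REPLACEMENTS and the function transliterate_german are hypothetical names, and the table is an assumed, illustrative set of conventional German substitutions, not an exhaustive or authoritative one.

```python
import unicodedata

# Assumed German-only replacement table (illustrative, not exhaustive).
GERMAN_REPLACEMENTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text):
    # Compose first so that "o" + combining diaeresis becomes a single "ö"
    # that the table can match.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(GERMAN_REPLACEMENTS)


print(transliterate_german("Köln"))
```

A Swedish address would need a different table, which is why there is no single correct general-purpose function for this.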

    If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.

    The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.

    ETA:

    url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
    

    Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode: the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:

    # Python 2: encode the text to UTF-8 bytes before URL-encoding it
    params = dict(
        address=addresse1.encode('utf-8'),
        key=googlekey
    )
    url2 = '...?' + urllib.urlencode(params)
    

    (In Python 3 these are urllib.parse.quote and urllib.parse.urlencode, and they automatically choose UTF-8, so you don't have to encode manually there.)
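For Python 3, building the query string might look like the sketch below. The address text and the "YOUR_API_KEY" placeholder are assumptions for illustration, and the https:// scheme on the base URL is added here because the snippet in the question omits it.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Endpoint from the question, with the scheme added (assumption).
BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

params = {
    "address": "Allée des Chênes, Orléans",  # sample address, assumed
    "key": "YOUR_API_KEY",                   # placeholder, not a real key
}

# urlencode percent-encodes each value as UTF-8; no manual .encode() needed.
url = BASE_URL + "?" + urlencode(params)

# quote() does the same job for a single string.
single = quote("è")  # "%C3%A8"

# Decoding the query string recovers the original values unchanged.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

The resulting URL is pure ASCII, so the accented characters survive the trip without being stripped from the address.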

    data2 = urllib.request.urlopen(url2).read().decode('utf-8')
    data3=json.loads(data2)
    

    From Python 3.6 onward json.loads accepts byte strings, so you can omit the UTF-8 decode there; on Python 3.5 it only accepts str, so keep the decode. json.load reads directly from a file-like object, so on 3.6 and later you don't have to load the data into a string at all:

    data3 = json.load(urllib.request.urlopen(url2))
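If json.load rejects the byte stream (it only accepts text on Python 3.5), one option is to wrap the response in a UTF-8 reader. The sketch below uses io.BytesIO as a stand-in for the object returned by urllib.request.urlopen, and the JSON payload is made up for illustration.

```python
import codecs
import io
import json

# Stand-in for the response from urllib.request.urlopen(url2);
# the payload is invented for this example.
fake_response = io.BytesIO('{"name": "Orléans"}'.encode("utf-8"))

# Wrap the byte stream so that read() returns decoded text (str).
reader = codecs.getreader("utf-8")(fake_response)

data3 = json.load(reader)
print(data3["name"])
```

This keeps the streaming style of json.load while working the same way on both older and newer Python 3 versions.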
    
