How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

清酒与你 2021-01-11 16:36

I am trying to build a string to use with the Google Geocoding API. I've checked a lot of threads, but I am still facing a problem and I don't understand how to solve it.

I nee

4 answers
  • 2021-01-11 16:47

    addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')

    You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.

    is there a better solution?

    It depends what you are trying to do.

    If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
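
A minimal sketch of that category-based approach, assuming Python 3. The helper name `strip_marks` is made up for illustration, and the final NFC step is an extra choice, not something the answer prescribes:

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (categories Mn, Mc, Me) and keep everything else."""
    # NFKD splits "è" into "e" followed by U+0300 COMBINING GRAVE ACCENT.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")]
    # Recompose what is left so untouched characters keep their usual form.
    return unicodedata.normalize("NFC", "".join(kept))

print(strip_marks("32 rue d'Athènes Paris France"))
```

Characters that have no decomposition, such as ß, pass through unchanged.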

    If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
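
As a rough illustration of why this is language-specific, the sketch below applies an optional per-language table before stripping the remaining marks. The `GERMAN` table and the `transliterate` helper are hypothetical names and cover only a handful of letters:

```python
import unicodedata

# Hypothetical German table; other languages would need their own.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def transliterate(text, table=None):
    """Apply language-specific replacements, then drop leftover marks."""
    # Compose first so precomposed keys such as "ö" match the table.
    text = unicodedata.normalize("NFC", text)
    if table is not None:
        text = text.translate(table)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

print(transliterate("Köln", GERMAN))  # German convention: ö becomes oe
print(transliterate("Malmö"))         # no table: the mark is simply dropped
```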

    If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.

    The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.

    ETA:

    url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
    

    Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode: the above will break for many ASCII punctuation symbols too. In Python 2, use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:

    # Python 2: encode the text to UTF-8 bytes before passing it to urlencode
    params = dict(
        address=addresse1.encode('utf-8'),
        key=googlekey
    )
    url2 = '...?' + urllib.urlencode(params)
    

    (In Python 3 these are urllib.parse.quote and urllib.parse.urlencode, and they use UTF-8 by default, so you don't have to encode manually there.)
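
A small Python 3 sketch of building the request URL this way. The `https://` scheme, the helper name `build_geocode_url` and the placeholder key are assumptions for illustration; a list of pairs is used so the parameter order is fixed on older Python versions too:

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the https:// scheme is assumed.
BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, key):
    """Return the request URL with every parameter percent-encoded."""
    params = [("address", address), ("key", key)]
    # urlencode encodes str values as UTF-8 and escapes unsafe characters.
    return BASE_URL + "?" + urlencode(params)

url2 = build_geocode_url("32 rue d'Athènes Paris France", "YOUR_API_KEY")
print(url2)
```

The accented address goes in unchanged, so no accent stripping is needed for the request itself.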

    data2 = urllib.request.urlopen(url2).read().decode('utf-8')
    data3=json.loads(data2)
    

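    From Python 3.6 onward, json.loads accepts byte strings (UTF-8, UTF-16 or UTF-32), so you can omit the UTF-8 decode there; on Python 3.5 it only accepts str, so keep the decode. On 3.6+, json.load will also read directly from a file-like object, so you don't have to load the data into a string at all: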

    data3 = json.load(urllib.request.urlopen(url2))
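
If the code has to run on Python 3.5 as well, decoding explicitly is the version-independent route. A sketch, with a made-up payload standing in for the response body (it is not real API output) and a hypothetical helper name:

```python
import json

def parse_json_bytes(raw, encoding="utf-8"):
    """Decode the bytes first, then parse; works on any Python 3 version."""
    return json.loads(raw.decode(encoding))

# Made-up bytes standing in for urlopen(url2).read().
sample = '{"status": "OK", "city": "Athènes"}'.encode("utf-8")
data3 = parse_json_bytes(sample)
print(data3["city"])
```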
    
  • You can use the str.translate() method. Here's an example adapted from tutorialspoint.com for Python 3:

    #!/usr/bin/python3

    # In Python 3, maketrans is a static method of str (string.maketrans is gone).
    intab = "aeiou"
    outtab = "12345"
    trantab = str.maketrans(intab, outtab)

    text = "this is string example....wow!!!"
    print(text.translate(trantab))
    

    This outputs:

    th3s 3s str3ng 2x1mpl2....w4w!!!

    So you can define which characters you wish to replace more easily than with repeated replace() calls.

  • 2021-01-11 16:58

    Generally, there are two approaches: (1) regular expressions and (2) str.translate.

    1) regular expressions

    Decompose the string, then remove the characters in the Unicode block \u0300-\u036f (Combining Diacritical Marks):

    import unicodedata
    import re
    word = unicodedata.normalize("NFD", word)
    word = re.sub("[\u0300-\u036f]", "", word)
    

    It removes accents, circumflex, diaeresis, and so on:

    pingüino > pinguino
    εἴκοσι εἶσι > εικοσι εισι
    

    For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.

    2) str.translate

    First create a replacement table (it is case-sensitive), then apply it.

    repl = str.maketrans(
        "áéúíó",
        "aeuio"
    )
    word.translate(repl)
    

    Multi-character replacements are made as follows:

    repl = {
        ord("æ"): "ae",
        ord("œ"): "oe",
    }
    word.translate(repl)
    
  • 2021-01-11 17:02

    With the third-party package unidecode:

    >>> import unidecode
    >>> unidecode.unidecode("32 rue d'Athènes Paris France")
    "32 rue d'Athenes Paris France"
    