How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?


I am trying to get a string to use in the Google Geocoding API. I've checked a lot of threads, but I am still facing problems and I don't understand how to solve them.

I nee

4 answers

时光说笑

2021-01-11 16:47
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')

You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.

is there a better solution?

It depends what you are trying to do.

If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
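A minimal sketch of that category-based approach, assuming Python 3. The helper name `strip_marks` is made up for illustration and is not part of unicodedata:

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (categories Mn, Mc, Me) after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

print(strip_marks("32 rue d'Athènes Paris France"))  # 32 rue d'Athenes Paris France
```

Letters that have no decomposition, such as ø or ß, pass through unchanged, so this does not guarantee an ASCII-only result.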

If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
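To illustrate why this is language-specific, here is a small sketch with two hand-written tables. The table contents and the names `GERMAN`, `SWEDISH` and `transliterate` are assumptions for the example; they cover lowercase letters only and are far from complete transliteration rules:

```python
# Hypothetical per-language replacement tables (lowercase only, incomplete).
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
SWEDISH = str.maketrans({"å": "a", "ä": "a", "ö": "o"})

def transliterate(text, table):
    """Apply one language's replacement table to the text."""
    return text.translate(table)

print(transliterate("Köln", GERMAN))    # Koeln
print(transliterate("Malmö", SWEDISH))  # Malmo
```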

If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.

The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.

ETA:
```
url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
```
Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode: the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:
```
params = dict(
    address=addresse1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)
```
(In Python 3 these are urllib.parse.quote and urllib.parse.urlencode, and they encode as UTF-8 by default, so you don't have to encode manually there.)
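A Python 3 sketch of the same URL construction. The `https://` scheme and the placeholder key are assumptions; the snippet only builds the URL and does not send a request:

```python
from urllib.parse import quote, urlencode

addresse1 = "32 rue d'Athènes Paris France"
googlekey = "YOUR_API_KEY"  # placeholder, not a real key

# urlencode percent-encodes each value as UTF-8 and joins the pairs with '&'.
params = {"address": addresse1, "key": googlekey}
url2 = "https://maps.googleapis.com/maps/api/geocode/json?" + urlencode(params)
print(url2)

# quote() encodes a single string; spaces become %20 rather than '+'.
print(quote(addresse1))
```

The accented è ends up as `%C3%A8` in both cases, so the address never has to be stripped of accents.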
```
data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)
```
json.loads accepts byte strings in Python 2, and in Python 3 from version 3.6 on, so there you can omit the UTF-8 decode (Python 3.5 still needs it). Likewise, json.load reads directly from a file-like object, so on those versions you don't have to load the data into a string at all:
```
data3 = json.load(urllib.request.urlopen(url2))
```
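On Python 3 versions before 3.6, where json.load does not accept a bytes stream, one option is to wrap the response in a text reader. The sketch below uses an in-memory `io.BytesIO` as a stand-in for the object returned by urlopen, and the JSON payload is made up for the example rather than real API output:

```python
import io
import json

# Stand-in for urllib.request.urlopen(url2); the payload is invented.
body = '{"status": "OK", "note": "Athènes"}'.encode("utf-8")
response = io.BytesIO(body)

# TextIOWrapper decodes the byte stream as UTF-8 while json.load reads it.
data3 = json.load(io.TextIOWrapper(response, encoding="utf-8"))
print(data3["note"])
```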
不要未来只要你来

2021-01-11 16:48
You can use the str.translate() method. Here's an example adapted from tutorialspoint.com (the original was written for Python 2; this version uses Python 3's str.maketrans and print()):
```
#!/usr/bin/python3

intab = "aeiou"
outtab = "12345"
# Map each character of intab to the character at the same position in outtab.
trantab = str.maketrans(intab, outtab)

text = "this is string example....wow!!!"
print(text.translate(trantab))
```
This outputs:

th3s 3s str3ng 2x1mpl2....w4w!!!

So you can define which characters you wish to replace more easily than with chained replace() calls.
走了就别回头了

2021-01-11 16:58
Generally, there are two approaches: (1) regular expressions and (2) str.translate.

1) regular expressions

Decompose the string and remove the characters in the Unicode block \u0300-\u036f (Combining Diacritical Marks):
```
import unicodedata
import re
word = unicodedata.normalize("NFD", word)
word = re.sub("[\u0300-\u036f]", "", word)
```
It removes accents, circumflex, diaeresis, and so on:
```
pingüino > pinguino
εἴκοσι εἶσι > εικοσι εισι
```
For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.

2) str.translate

First, create a replacement table (it is case-sensitive), then apply it.
```
repl = str.maketrans(
    "áéúíó",
    "aeuio"
)
word.translate(repl)
```
Multi-character replacements are made as follows:
```
repl = {
    ord("æ"): "ae",
    ord("œ"): "oe",
}
word.translate(repl)
```
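The two approaches can also be combined: translate the characters that have no decomposition first, then strip the combining marks. This is only a sketch; the function name `simplify` and the two-entry table are illustrative choices:

```python
import re
import unicodedata

# Characters that NFD does not decompose need explicit replacements.
LIGATURES = {ord("æ"): "ae", ord("œ"): "oe"}
# Combining Diacritical Marks block.
MARKS = re.compile("[\u0300-\u036f]")

def simplify(word):
    """Replace ligatures, then remove combining marks left by NFD."""
    word = word.translate(LIGATURES)
    word = unicodedata.normalize("NFD", word)
    return MARKS.sub("", word)

print(simplify("cœur brisé"))  # coeur brise
```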
悲哀的现实

2021-01-11 17:02
With the third-party package unidecode:
```
>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
```