Maintaining the consistency of strings before and after converting to ASCII


I have many Unicode strings, such as carbon copolymers—III\n12- Géotechnique\n, and many more containing a variety of Unicode characters, in a string variable n


1 Answer

滥情空心

2021-01-26 17:31

Use a negated \w character class (with the Unicode flag) to replace non-word characters with spaces before applying the NFKD decomposition trick:

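# coding: utf8
from __future__ import unicode_literals, print_function
import re
import unicodedata as ud

txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
# Replace everything except word characters and newlines with a space (Unicode-aware, so 'é' is kept).
txtWords = re.sub(r'[^\w\n]', ' ', txtWords.lower(), flags=re.U)
# Decompose accented characters (NFKD), then drop whatever is not ASCII.
txtWords = ud.normalize('NFKD', txtWords).encode('ascii', 'ignore').decode()
print(txtWords)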

Output (Python 2 and 3):

carbon copolymers iii
12  geotechnique
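
If the same cleanup has to run over many strings, it may help to wrap the steps in a function. The following is only a sketch for Python 3; the name to_ascii_words and the second sample string are made up for illustration and are not part of the original answer. It also shows a limitation worth knowing about: characters that have no NFKD decomposition (such as 'ø') are dropped outright instead of being mapped to an ASCII letter.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Lowercase, replace non-word characters with spaces, then fold to ASCII."""
    # For str patterns in Python 3, \w is Unicode-aware by default,
    # so accented letters such as 'é' survive this step.
    cleaned = re.sub(r'[^\w\n]', ' ', text.lower())
    # NFKD splits 'é' into 'e' plus a combining accent; encoding with
    # errors='ignore' then discards the accent and any other non-ASCII character.
    decomposed = unicodedata.normalize('NFKD', cleaned)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


samples = ['carbon copolymers—III\n12- Géotechnique\n', 'Smørrebrød']
for s in samples:
    print(repr(to_ascii_words(s)))
# 'carbon copolymers iii\n12  geotechnique\n'
# 'smrrebrd'
```

If silently losing letters like 'ø' is not acceptable, an explicit replacement table applied before the encode step would be one way to handle them.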
