Maintaining the consistency of strings before and after converting to ASCII

粉色の甜心 2021-01-26 17:10

I have many Unicode strings, such as 'carbon copolymers—III\\n12- Géotechnique\\n', and many more containing a variety of Unicode characters, in a string variable n

1 Answer
  •  滥情空心
    2021-01-26 17:31

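    First replace the non-alphanumeric characters with spaces using a negated \w character class (keeping newlines), then apply the decomposition trick: normalize to NFKD and encode to ASCII, ignoring whatever cannot be encoded: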

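    # coding: utf8
    from __future__ import unicode_literals, print_function
    import re
    import unicodedata as ud
    txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
    # Lowercase, then replace every character that is neither a word character nor a newline with a space.
    txtWords = re.sub(r'[^\w\n]', r' ', txtWords.lower(), flags=re.U)
    # NFKD splits accented letters into base letter + combining mark; encoding to ASCII with 'ignore' drops the marks.
    txtWords = ud.normalize('NFKD', txtWords).encode('ascii', 'ignore').decode()
    print(txtWords)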
    

    Output (Python 2 and 3):

    carbon copolymers iii
    12  geotechnique
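
To apply the same steps to many strings, they can be wrapped in a helper. The following is only a sketch for Python 3: the name to_ascii_words and the final space-collapsing step are assumptions added for illustration and are not part of the answer above.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Lowercase, replace non-word characters with spaces, then fold to ASCII.

    Hypothetical helper name; the steps mirror the snippet above.
    """
    # Replace anything that is neither a Unicode word character nor a newline.
    text = re.sub(r'[^\w\n]', ' ', text.lower(), flags=re.U)
    # NFKD splits accented letters into base letter + combining mark;
    # encoding with 'ignore' then drops every non-ASCII code point.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Extra step (an assumption, not in the original answer): collapse runs of spaces.
    return re.sub(r' +', ' ', text)


print(to_ascii_words('carbon copolymers—III\n12- Géotechnique\n'))
```

One caveat worth checking against your data: letters that have no NFKD decomposition, such as 'ß', pass the \w filter but are then dropped silently by the 'ignore' error handler, so 'Straße' comes out as 'strae'. If such characters occur in your strings, they would need an explicit replacement table applied before the encoding step.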
    
