Replace non-ASCII characters with a single space

死守一世寂寞 2020-11-22 16:17

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following func

7 answers
  •  失恋的感觉
    2020-11-22 16:29

    For character processing, use Unicode strings:

    PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
    >>> s='ABC马克def'
    >>> import re
    >>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
    'ABC  def'
    >>> b = s.encode('utf8')
    >>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
    b'ABC      def'
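The session above can be wrapped in a small helper. This is only a sketch: the names `replace_non_ascii` and `collapse_non_ascii` are hypothetical, and the second function assumes that "a single space" in the title means one space per run of non-ASCII characters rather than one space per character.

```python
import re

# Matches one code point outside the ASCII range.
_NON_ASCII = re.compile(r'[^\x00-\x7f]')
# Matches a whole run of consecutive non-ASCII code points.
_NON_ASCII_RUN = re.compile(r'[^\x00-\x7f]+')


def replace_non_ascii(text, repl=' '):
    """Replace every non-ASCII code point in a str with repl."""
    return _NON_ASCII.sub(repl, text)


def collapse_non_ascii(text, repl=' '):
    """Replace each run of non-ASCII code points with a single repl."""
    return _NON_ASCII_RUN.sub(repl, text)


if __name__ == '__main__':
    print(repr(replace_non_ascii('ABC马克def')))
    print(repr(collapse_non_ascii('ABC马克def')))
```

Both helpers expect a decoded `str`. Running the same pattern over UTF-8 `bytes` replaces every byte of a multi-byte sequence, as the second half of the session shows.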
    

    But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

    >>> s = 'mañana'
    >>> len(s)
    6
    >>> import unicodedata as ud
    >>> n=ud.normalize('NFD',s)
    >>> n
    'mañana'
    >>> len(n)
    7
    >>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
    'ma ana'
    >>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
    'man ana'
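One possible way around the decomposed case is to normalize to NFC before substituting, so that a base letter and its combining mark are first composed into a single code point where a precomposed form exists. The helper name `replace_non_ascii_nfc` is hypothetical, and this is a sketch rather than a complete fix: sequences with no precomposed form stay decomposed after NFC, and for those only the combining mark is replaced.

```python
import re
import unicodedata as ud

_NON_ASCII = re.compile(r'[^\x00-\x7f]')


def replace_non_ascii_nfc(text, repl=' '):
    """Compose with NFC first, then replace each non-ASCII code point."""
    composed = ud.normalize('NFC', text)
    return _NON_ASCII.sub(repl, composed)


if __name__ == '__main__':
    decomposed = 'man\u0303ana'  # 'n' followed by U+0303 COMBINING TILDE
    print(repr(replace_non_ascii_nfc(decomposed)))
```

With this step, the composed and decomposed spellings of the same word give the same result.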
    
