Replace non-ASCII characters with a single space

死守一世寂寞 2020-11-22 16:17

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following func

7 Answers
  • 2020-11-22 16:29

    For character processing, use Unicode strings:

    PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
    >>> s='ABC马克def'
    >>> import re
    >>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
    'ABC  def'
    >>> b = s.encode('utf8')
    >>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
    b'ABC      def'
    

    But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

    >>> s = 'mañana'
    >>> len(s)
    6
    >>> import unicodedata as ud
    >>> n=ud.normalize('NFD',s)
    >>> n
    'mañana'
    >>> len(n)
    7
    >>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
    'ma ana'
    >>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
    'man ana'
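
    One way to sidestep the decomposed-character case above is to compose the string before substituting. This is only a sketch: the name replace_non_ascii is made up for illustration, and it assumes every accented character in your data has a precomposed (NFC) form, which is not true of every base-plus-mark sequence.

```python
import re
import unicodedata


def replace_non_ascii(text: str) -> str:
    """Replace each non-ASCII code point with one space (hypothetical helper)."""
    # Compose base letter + combining mark pairs into single code points where
    # a precomposed form exists, so each pair is replaced as one unit.
    composed = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\x00-\x7f]", " ", composed)


decomposed = unicodedata.normalize("NFD", "mañana")  # 7 code points
print(replace_non_ascii(decomposed))  # prints: ma ana
```

    Sequences with no precomposed form would still leave the base letter in place and turn only the combining mark into a space.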
    
  • 2020-11-22 16:29

    What about this one?

    def replace_trash(unicode_string):
        chars = list(unicode_string)  # str is immutable, so work on a list
        for i, ch in enumerate(chars):
            try:
                ch.encode("ascii")
            except UnicodeEncodeError:
                # Encoding failed, so this character is non-ASCII
                chars[i] = " "  # replace it with a single space
        return "".join(chars)
    
  • 2020-11-22 16:33

    Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

    return ''.join([i if ord(i) < 128 else ' ' for i in text])
    

    This handles characters one by one and would still use one space per character replaced.

    Your regular expression should just replace consecutive non-ASCII characters with a space:

    re.sub(r'[^\x00-\x7F]+',' ', text)
    

    Note the + there.
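
    To make the difference between the two approaches concrete, here is a small side-by-side sketch. The function names space_per_char and space_per_run are made up for this illustration.

```python
import re


def space_per_char(text: str) -> str:
    # One space for every non-ASCII character.
    return ''.join(c if ord(c) < 128 else ' ' for c in text)


def space_per_run(text: str) -> str:
    # One space for every run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)


sample = 'ABC马克def'
print(repr(space_per_char(sample)))  # prints: 'ABC  def'
print(repr(space_per_run(sample)))   # prints: 'ABC def'
```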

  • 2020-11-22 16:35

    To get the closest ASCII representation of your original string, I recommend the unidecode module. Note that it transliterates characters rather than replacing them with spaces:

    from unidecode import unidecode
    def remove_non_ascii(text):
        # Python 3 str is already Unicode; on Python 2, pass
        # unicode(text, encoding="utf-8") to unidecode() instead.
        return unidecode(text)
    

    Then you can call it on a string:

    remove_non_ascii("Ceñía")
    Cenia
    
  • 2020-11-22 16:35

    If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

    """Test the performance of different non-ASCII replacement methods."""
    
    
    import re
    from timeit import timeit
    
    
    # 10_000 is typical in the project that I'm working on and most of the text
    # is going to be non-ASCII.
    text = 'Æ' * 10_000
    
    
    print(timeit(
        """
    result = ''.join([c if ord(c) < 128 else '?' for c in text])
        """,
        number=1000,
        globals=globals(),
    ))
    
    print(timeit(
        """
    result = text.encode('ascii', 'replace').decode()
        """,
        number=1000,
        globals=globals(),
    ))
    

    Results:

    0.7208260721400134
    0.009975979187503592
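
    If you do need a space rather than '?', one option that keeps the encode/decode approach is a custom codec error handler. This is a sketch, not something I have timed: the handler name "space_replace" and the function names are made up, and only codecs.register_error itself comes from the standard library.

```python
import codecs


def _space_errors(exc):
    """Error handler: emit one space per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start:exc.end is the slice of the input that could not be encoded.
    return (" " * (exc.end - exc.start), exc.end)


# "space_replace" is a hypothetical handler name registered for this sketch.
codecs.register_error("space_replace", _space_errors)


def to_ascii_spaces(text: str) -> str:
    return text.encode("ascii", "space_replace").decode("ascii")


print(to_ascii_spaces("ABC马克def?"))  # prints: ABC  def?
```

    Unlike calling .replace('?', ' ') on the result of the 'replace' handler, this leaves question marks that were already in the text untouched.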
    
  • 2020-11-22 16:36

    This may belong to a different question, but here is my version of @Alvero's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. remove whitespace characters at the beginning and end, and then replace any other whitespace-like characters with a "regular" space, i.e. to go from

    "Ceñíaㅤmañanaㅤㅤㅤㅤ"
    

    to

    "Ceñía mañana"
    

    using this function:

    from unidecode import unidecode

    def safely_stripped(s: str) -> str:
        return ' '.join(
            stripped for stripped in
            (bit.strip() for bit in
             ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
            if stripped)
    

    We first replace every character that unidecode maps to an empty string (such as the ㅤ filler in the example) with a regular space, and join the result back together,

    ''.join((c if unidecode(c) else ' ') for c in s)
    

    And then we split that again, with Python's normal split, and strip each "bit",

    (bit.strip() for bit in s.split())
    

    And lastly we join the bits back together with single spaces, keeping only the ones that are non-empty,

    ' '.join(stripped for stripped in s if stripped)
    

    And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.
