I need to replace all non-ASCII (outside the range \x00-\x7F) characters with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i) < 128)

and this regex replaces each non-ASCII character with one space apiece rather than a single space overall:

re.sub(r'[^\x00-\x7F]', ' ', text)

How can I replace all non-ASCII characters with a single space?
For character processing, use Unicode strings:
PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # Each char is a Unicode codepoint, replaced one-for-one.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # A bytes pattern matches byte by byte; each of these chars is 3 UTF-8 bytes.
b'ABC      def'
But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):
>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
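One way around that, as a minimal sketch: normalize back to composed form (NFC) before substituting, so a base letter and its combining mark are replaced as a single codepoint:
>>> c = ud.normalize('NFC', n)
>>> len(c)
6
>>> re.sub(r'[^\x00-\x7f]', r' ', c) # ñ handled as one unit again
'ma ana'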
What about this one?
def replace_trash(unicode_string):
    chars = []
    for char in unicode_string:
        try:
            char.encode("ascii")
        except UnicodeEncodeError:
            # Means it's non-ASCII: replace it with a single space.
            char = " "
        chars.append(char)
    return "".join(chars)
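For example, with the corrected version above, each non-ASCII character still becomes its own space:
>>> replace_trash('ABC马克def')
'ABC  def'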
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there.
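A quick check of the difference, reusing the sample string from the first answer:
>>> import re
>>> re.sub(r'[^\x00-\x7F]', ' ', 'ABC马克def')  # one space per character
'ABC  def'
>>> re.sub(r'[^\x00-\x7F]+', ' ', 'ABC马克def') # one space per run
'ABC def'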
To get the closest ASCII representation of your original string, I recommend the unidecode module:
from unidecode import unidecode

def remove_non_ascii(text):
    # Transliterate each character to its closest ASCII equivalent.
    return unidecode(text)
Then you can use it on a string:
>>> remove_non_ascii("Ceñía")
'Cenia'
If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():
"""Test the performance of different non-ASCII replacement methods."""
import re
from timeit import timeit
# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000
print(timeit(
"""
result = ''.join([c if ord(c) < 128 else '?' for c in text])
""",
number=1000,
globals=globals(),
))
print(timeit(
"""
result = text.encode('ascii', 'replace').decode()
""",
number=1000,
globals=globals(),
))
Results:
0.7208260721400134
0.009975979187503592
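The gap is largely because the encode/decode round trip runs in C while the comprehension loops character by character in Python. Note that the 'replace' error handler puts a '?' in place of each unencodable character:
>>> 'ABC马克def'.encode('ascii', 'replace').decode()
'ABC??def'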
Potentially for a different question, but I'm providing my version of @Alvero's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. remove whitespace from the beginning and end of the string, and then replace only other whitespace characters with a "regular" space, i.e. turn

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

into

"Ceñía mañana":
from unidecode import unidecode

def safely_stripped(s: str) -> str:
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)
We first replace every character that unidecode can't transliterate to ASCII (such as the ㅤ filler above) with a regular space (and join it back again),
''.join((c if unidecode(c) else ' ') for c in s)
And then we split that again, with Python's normal split, and strip each "bit",
(bit.strip() for bit in s.split())
And lastly we join those back again, but only if the string passes an if test,
' '.join(stripped for stripped in s if stripped)
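To see why the unidecode(c) test works: the trick relies on unidecode() returning an empty (falsy) string for characters it has no ASCII transliteration for, such as the ㅤ filler used here:
>>> from unidecode import unidecode
>>> unidecode('ㅤ') # U+3164 HANGUL FILLER
''
>>> unidecode('ñ')
'n'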
And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.