Question
I want to remove characters with encodings larger than 3 bytes, because when I upload my CSV data to the Amazon Mechanical Turk system, it asks me to:
Your CSV file needs to be UTF-8 encoded and cannot contain characters with encodings larger than 3 bytes. For example, some non-English characters are not allowed (learn more).
To overcome this problem, I want to make a filter_max3bytes function to remove those characters in Python 3.
x = 'below ð\x9f~\x83,'
y = remove_max3byes(x) # y=="below ~,"
Then I will apply the function before saving it to a CSV file, which is UTF-8 encoded.
This post is related to my problem, but it uses Python 2 and its solution did not work for me.
Thank you!
Answer 1:
None of the characters in your string seem to take 3 bytes in UTF-8:
x = 'below ð\x9f~\x83,'
Anyway, if there were any, the way to remove them would be:
filtered_x = ''.join(char for char in x if len(char.encode('utf-8')) < 3)
For example (with such characters):
>>> x = 'abcd漢字efg'
>>> ''.join(char for char in x if len(char.encode('utf-8')) < 3)
'abcdefg'
BTW, you can verify that your original string does not have 3-byte encodings by doing the following:
>>> for char in 'below ð\x9f~\x83,':
... print(char, [hex(b) for b in char.encode('utf-8')])
...
b ['0x62']
e ['0x65']
l ['0x6c']
o ['0x6f']
w ['0x77']
['0x20']
ð ['0xc3', '0xb0']
['0xc2', '0x9f']
~ ['0x7e']
['0xc2', '0x83']
, ['0x2c']
EDIT: A wild guess
I believe the OP is asking the wrong question, and the real question is whether the character is printable. I'll assume anything Python displays as \x<number> is not printable, so this solution should work:
x = 'below ð\x9f~\x83,'
filtered_x = ''.join(char for char in x if not repr(char).startswith("'\\x"))
Result:
'below ð~,'
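If printability really is the criterion, the same result can be obtained with Python's built-in str.isprintable() instead of inspecting repr() output (a sketch; this method is my suggestion, not something the answer above uses):

```python
x = 'below ð\x9f~\x83,'
# str.isprintable() is False for control characters such as U+009F and
# U+0083, and True for ordinary letters like 'ð' and for the space.
filtered_x = ''.join(char for char in x if char.isprintable())
print(filtered_x)  # 'below ð~,'
```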
Answer 2:
Although it is only stated indirectly, the website allows only characters from the Basic Multilingual Plane (BMP), i.e. Unicode code points U+0000 through U+FFFF. In UTF-8, anything above U+FFFF takes four bytes to encode:
>>> '\uffff'.encode('utf8')
b'\xef\xbf\xbf'
>>> '\U00010000'.encode('utf8')
b'\xf0\x90\x80\x80'
This filters out Unicode code points above U+FFFF:
>>> test_string = 'abc马克😀' # emoticon is U+1F600
>>> ''.join(c for c in test_string if ord(c) < 0x10000)
'abc马克'
When encoded (note three bytes for each Chinese character):
>>> ''.join(c for c in test_string if ord(c) < 0x10000).encode('utf8')
b'abc\xe9\xa9\xac\xe5\x85\x8b'
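Putting this together as the function the question asks for (a sketch; the name filter_max3bytes follows the question, and the byte-length test is equivalent to the ord(c) < 0x10000 check above):

```python
def filter_max3bytes(s):
    """Keep only characters whose UTF-8 encoding is at most 3 bytes,
    i.e. BMP code points U+0000..U+FFFF."""
    return ''.join(c for c in s if len(c.encode('utf-8')) <= 3)

print(filter_max3bytes('abc马克😀'))  # 'abc马克' (the emoji is dropped)
```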
Answer 3:
According to the UTF-8 standard, characters with Unicode code points below U+0800 use at most two bytes in the encoding. So just remove any character at or above U+0800. This code keeps every character that takes at most two bytes and leaves out the rest.
def remove_max3byes(x):
    return ''.join(c for c in x if ord(c) < 0x800)
As a comment pointed out, your example string has no characters that take more than two bytes. But this command at the REPL
remove_max3byes(chr(0x07ff))
gives
'\u07ff'
and this command
remove_max3byes(chr(0x0800))
gives
''
Both are as wanted.
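The U+0800 boundary can also be checked directly from the encoded lengths (a quick sketch):

```python
# U+07FF is the last code point with a 2-byte UTF-8 encoding;
# U+0800 is the first that needs 3 bytes.
print(len(chr(0x07FF).encode('utf-8')))  # 2
print(len(chr(0x0800).encode('utf-8')))  # 3
```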
Source: https://stackoverflow.com/questions/50730466/remove-characters-with-encodings-larger-than-3-bytes-using-python-3