How to find/replace non printable / non-ascii characters using Python 3?

问题

I have a file, some lines in a .csv file that are jamming up a database import because of funky characters in some field in the line.

I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.

When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.

I don't know what the ^I represents, possibly a tab.

I have tried this function to no avail:

def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))

    return new_text

Is there some tool that will show me exactly what this offending character is, and a then find a method to replace it?

I am opening the file like so (the .csv was saved as UTF-8)

f_csv_in = open(csv_in, "r", encoding="utf-8")

Below are two lines that should be one with the problem non-ascii characters visible.

These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.

Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.

37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 @ 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$

回答1:

In the case of non-printable characters, the built-in string module has some ways of filtering out non-printable or non-ascii characters, eg. with the isprintable() functionality.
A concise way of filtering the whole string at once is presented below

>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'

This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing, since you can use anything from Regular Expressions to str.translate()

Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.

回答2:

A simple way to remove non-ascii chars could be doing:

new_text = "".join([c for c in text if c.isascii()])

NB: If you are reading this text from a file, make sure you read it with the correct encoding

回答3:

It looks as if you have a csv file that contains quoted values, that is values such as embedded commas or newlines which have to be surrounded with quotes so that csv readers handle them correctly.

If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.

The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.

It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.

This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.

with open('lines.csv') as f, open('fixed.csv', 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)

来源：https://stackoverflow.com/questions/54590466/how-to-find-replace-non-printable-non-ascii-characters-using-python-3

标签

python

python-3.x

csv

non-ascii-characters