Question
I have the following program that reads a file word by word and writes each word to another file, but without the non-ASCII characters from the first file.
import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')
for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")
infile.close()
outfile.close()
The only problem is that this code does not write a newline to the second file (d_parsed.txt). Any clues?
Answer 1:
codecs.open() doesn't support universal newlines, e.g., it doesn't translate \r\n to \n while reading on Windows.
Use io.open() instead:
#!/usr/bin/env python
from __future__ import print_function
import io
with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)
By the way, if you want to remove non-ASCII characters, you should use ascii instead of utf-8 as the output encoding.
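To see what the ascii codec does together with errors='ignore', here is a minimal in-memory sketch. The helper name strip_non_ascii is hypothetical and only used for illustration; it is not part of the answer's code.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    # Encoding with errors='ignore' silently skips characters that have
    # no ASCII representation; decoding turns the bytes back into a string.
    return text.encode('ascii', errors='ignore').decode('ascii')


print(strip_non_ascii(u'caf\u00e9 na\u00efve'))
```

Opening the output file with encoding='ascii' and errors='ignore' applies the same encoding step to everything written to it.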
If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:
#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))
Unlike the first code example, this version doesn't normalize whitespace.
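The byte-level approach can be tried on an in-memory sample, which also shows that whitespace and line endings pass through untouched. This is a sketch assuming UTF-8 input, where every byte of a multi-byte character is 0x80 or above; the names NON_ASCII and drop_non_ascii_bytes are hypothetical.

```python
# Every byte value outside the ASCII range (0x80..0xFF).
NON_ASCII = bytes(bytearray(range(0x80, 0x100)))


def drop_non_ascii_bytes(data):
    """Delete all bytes >= 0x80 from a bytes object."""
    # With table=None, translate() performs no mapping and only
    # deletes the bytes listed in the second argument.
    return data.translate(None, NON_ASCII)


# Two spaces and a Windows line ending, to show they are preserved.
sample = u'caf\u00e9  na\u00efve\r\n'.encode('utf-8')
cleaned = drop_non_ascii_bytes(sample)
print(repr(cleaned))
```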
Answer 2:
From the docs for codecs.open:
Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.
I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode will do the conversion from \n to \r\n automatically, but that doesn't happen with codecs.open.
Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
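As an alternative to writing "\r\n" by hand, a text-mode file opened with io.open() can do the translation itself. The sketch below passes newline='\r\n' explicitly so that the result is the same on every platform (the default, newline=None, uses the platform's own line separator). The helper names write_lines and read_raw and the temporary path are assumptions made for illustration.

```python
import io
import os
import shutil
import tempfile


def write_lines(path, lines, newline=None):
    """Write each line followed by '\\n' to a text-mode file."""
    # In text mode, every '\n' written is translated according to `newline`.
    with io.open(path, 'w', encoding='ascii', errors='ignore',
                 newline=newline) as f:
        for line in lines:
            f.write(line + u'\n')


def read_raw(path):
    """Return the file's bytes exactly as stored on disk."""
    with open(path, 'rb') as f:
        return f.read()


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'd_parsed.txt')
write_lines(path, [u'first line', u'second line'], newline='\r\n')
raw = read_raw(path)
shutil.rmtree(tmpdir)  # remove the temporary directory again
print(repr(raw))
```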
Answer 3:
Use codecs to open the CSV file with the ascii encoding and errors='ignore'; non-ASCII characters are then dropped as the file is read.
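import codecs
# errors='ignore' drops any bytes that cannot be decoded as ASCII
reader = codecs.open("example.csv", 'r', encoding='ascii', errors='ignore')
for reading in reader:
    print(reading)  # print the current line, not the reader object
reader.close()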
Source: https://stackoverflow.com/questions/26369051/python-read-from-file-and-remove-non-ascii-characters