Python read from file and remove non-ascii characters

Posted by 时间秒杀一切 on 2020-01-01 17:11:10

Question


I have the following program, which reads a file word by word and writes each word to another file, but without the non-ASCII characters from the first file.

import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')


for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem is that this code does not write line breaks to the second file (d_parsed.txt). Any clues?


Answer 1:


codecs.open() doesn't support universal newlines; for example, it doesn't translate \r\n to \n while reading on Windows.

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

By the way, if you want to remove non-ASCII characters, you should use ascii instead of utf-8 as the output encoding.
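The same idea can be tried in memory, without any files. This is only a minimal sketch: the helper name strip_non_ascii is hypothetical and not part of the original answer. Encoding to ASCII with errors='ignore' drops every character that has no ASCII representation.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character dropped."""
    # encode() with errors='ignore' silently skips characters outside ASCII;
    # decoding the result gives back a str that contains only ASCII characters.
    return text.encode('ascii', errors='ignore').decode('ascii')


print(strip_non_ascii(u'caf\u00e9 na\u00efve'))  # prints: caf nave
```

Note that the characters are removed, not transliterated: the accented letters disappear instead of becoming "e" and "i".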

If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, this one does not normalize whitespace.
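To see the binary approach on a small sample, here is a sketch that applies bytes.translate() to an in-memory byte string instead of a file. The helper name strip_high_bytes and the sample data are assumptions made for illustration; Python 3 is assumed.

```python
# Every byte value from 0x80 to 0xFF, i.e. everything outside ASCII.
NONASCII = bytes(range(0x80, 0x100))


def strip_high_bytes(data):
    """Delete all bytes >= 0x80; the None table leaves other bytes unchanged."""
    return data.translate(None, NONASCII)


# u'\u00e9' is encoded in UTF-8 as two bytes (0xC3 0xA9); both are deleted.
sample = u'caf\u00e9  au lait\r\n'.encode('utf-8')
print(strip_high_bytes(sample))  # prints: b'caf  au lait\r\n'
```

The double space and the \r\n line ending pass through untouched, because this approach only deletes bytes and never splits or rejoins the text.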




Answer 2:


From the docs for codecs.open:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode will do the conversion from \n to \r\n automatically, but that doesn't happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
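An alternative to hard-coding "\r\n" is to open the output with io.open() and pass its newline parameter, which makes every "\n" written come out as the given sequence on any platform. This is a sketch of a different technique from the one this answer proposes; the helper name write_lines and the temporary file path are assumptions made for illustration.

```python
import io
import os
import tempfile


def write_lines(path, lines, newline=u'\r\n'):
    """Write each line followed by '\n', which io.open translates to newline."""
    with io.open(path, 'w', encoding='ascii', errors='ignore',
                 newline=newline) as outfile:
        for line in lines:
            outfile.write(line + u'\n')


# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'd_parsed.txt')
write_lines(path, [u'first line', u'second line'])

# Read back in binary mode to inspect the exact bytes on disk.
with open(path, 'rb') as check:
    print(check.read())  # prints: b'first line\r\nsecond line\r\n'
```

Passing newline=u'\n' instead would keep bare "\n" line endings, even on Windows.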




Answer 3:


Use codecs to open the CSV file with the ascii encoding and errors='ignore'; the non-ASCII characters are then dropped as the file is read:

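import codecs

# Decode as ASCII and silently drop any bytes that are not valid ASCII
reader = codecs.open("example.csv", 'r', encoding='ascii', errors='ignore')
for reading in reader:
    print(reading)  # print each line, not the reader object
reader.close()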


Source: https://stackoverflow.com/questions/26369051/python-read-from-file-and-remove-non-ascii-characters
