Re-encode Unicode stream as ASCII, ignoring errors


一生所求 2021-01-24 15:12

I'm trying to take a Unicode file stream, which contains odd characters, and wrap it with a stream reader that will convert it to ASCII, ignoring or replacing all characters th

2 Answers

旧时难觅i (original poster)

2021-01-24 16:10
You're mixing up the encode and decode sides.

For decoding, you're doing fine. You open it as binary data, chardet the first 1K, then reopen in text mode using the detected encoding.

But then you're trying to further decode that already-decoded data as ASCII, by using codecs.getreader. That function returns a StreamReader, which decodes data from a stream. That isn't going to work. You need to encode that data to ASCII.
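To make the two directions concrete, here is a minimal sketch, assuming Python 3 and a made-up sample string: decoding turns bytes into text, and encoding turns text back into bytes.

```python
# Hypothetical sample data ("café, naïve"), chosen only because it contains
# non-ASCII characters.
raw = "caf\u00e9, na\u00efve".encode("utf-8")   # bytes, as they would sit in the file

text = raw.decode("utf-8")                      # decode: bytes -> text
ascii_bytes = text.encode("ascii", "ignore")    # encode: text -> bytes, dropping non-ASCII

print(ascii_bytes)  # b'caf, nave'
```

Wrapping an already-decoded text stream in a second decoder skips the step on the last line, which is the one that actually produces ASCII.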

But it's not clear why you're using a codecs stream decoder or encoder in the first place, when all you want to do is encode a single chunk of text in one go so you can log it. Why not just call the encode method?
```
log(csv_file.read().encode('ascii', 'ignore'))
```
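Since the question mentions both ignoring and replacing, here is a small sketch of how the two error handlers differ. It assumes Python 3, and the sample string is invented for illustration.

```python
# Hypothetical sample text: "naïve → café"
text = "na\u00efve \u2192 caf\u00e9"

ignored = text.encode("ascii", "ignore")     # drop characters ASCII cannot represent
replaced = text.encode("ascii", "replace")   # substitute "?" for each such character

print(ignored)   # b'nave  caf'
print(replaced)  # b'na?ve ? caf?'
```

On Python 3 both results are bytes objects; if the logger expects text, calling `.decode('ascii')` on the result would turn it back into a str.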
If you want something that you can use as a lazy iterable of lines, you could build something fully general, but it's a lot simpler to just do something like the UTF8Recoder example in the Python 2 csv docs:
```
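import codecs

class AsciiRecoder:
    """Iterator that decodes a byte stream and re-encodes each line as ASCII."""
    def __init__(self, f, encoding):
        # f must be opened in binary mode; the StreamReader does the decoding.
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def __next__(self):
        # Drop any character that has no ASCII equivalent.
        return next(self.reader).encode("ascii", "ignore")
    next = __next__  # Python 2 spelling of the same method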
```
Or, even more simply:
```
with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
```
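One thing to watch with the generator expression above: it is lazy, so it has to be consumed while the file is still open, that is, inside the `with` block. A possible way around that, sketched here with a hypothetical helper named `ascii_lines`, is a generator function that owns the file itself:

```python
import io

def ascii_lines(path, encoding):
    """Yield each line of a text file re-encoded as ASCII bytes.

    Hypothetical helper, not part of the original answer.
    """
    # The file stays open for as long as the generator is being iterated
    # and is closed when iteration finishes.
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.encode("ascii", "ignore")
```

A caller could then write `for line in ascii_lines(csv_path, detected_encoding): log(line)`, where `csv_path` and `detected_encoding` stand in for the values from the question.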