I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encoding is declared at the start of the file.
You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII, the code below should work as-is.
If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o, or the reverse, depending on the byte order mark.
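For the byte-order-mark case, detection can be done before any regex work by checking the first two bytes against the BOM constants in the codecs module. This is an illustrative sketch only (the helper name sniff_bom is my own, not part of the decoder below):

```python
import codecs

def sniff_bom(data):
    # Check the leading bytes against the UTF-16 byte order marks.
    # codecs.BOM_UTF16_LE is b'\xff\xfe'; codecs.BOM_UTF16_BE is b'\xfe\xff'.
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    # No BOM found; fall back to whatever 8-bit detection you use.
    return None
```

With a BOM present, you know the byte order and can skip the `\x00c\x00o` pattern matching entirely.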
First, generate a few test files which advertise their encoding:
import codecs

for encoding in ('utf-8', 'cp1252'):
    out = codecs.open('%s.txt' % encoding, 'w', encoding)
    out.write('# coding = %s\n' % encoding)
    out.write(u'\u201chello se\u00f1nor\u201d')
    out.close()
Then write the decoder:
import codecs, re

def open_detect(path):
    fin = open(path, 'rb')
    prefix = fin.read(80)
    encs = re.findall(r'#\s*coding\s*=\s*([\w\-]+)\s+', prefix)
    encoding = encs[0] if encs else 'utf-8'
    fin.seek(0)
    return codecs.EncodedFile(fin, 'utf-8', encoding)
for path in ('utf-8.txt', 'cp1252.txt'):
    fin = open_detect(path)
    print repr(fin.readlines())
Output:
['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
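The code above is Python 2. On Python 3, where codecs.EncodedFile is awkward with text, the same idea can be sketched with the built-in open() and its encoding argument. This open_detect is my own hypothetical reworking, not the original answer's code:

```python
import re

def open_detect(path):
    # Probe a small binary prefix for an advertised coding declaration.
    with open(path, 'rb') as probe:
        prefix = probe.read(80)
    m = re.search(rb'#\s*coding\s*=\s*([\w\-]+)', prefix)
    # Fall back to UTF-8 when no declaration is found.
    encoding = m.group(1).decode('ascii') if m else 'utf-8'
    # Reopen as a text file decoded with the detected encoding.
    return open(path, encoding=encoding)
```

This returns ordinary str objects rather than transcoded UTF-8 bytes, which is usually what you want in Python 3.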