Read a Unicode file in Python which declares its encoding in the same way as Python source

情书的邮戳 2021-02-14 19:07

I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encodin

3 answers
  •  北荒 (original poster)
    2021-02-14 19:31

    You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings that are supersets of ASCII, the code below should work as-is.

    If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark.

    First, generate a few test files which advertise their encoding:

    import codecs
    # Write one small test file per encoding; the first line declares the encoding.
    for encoding in ('utf-8', 'cp1252'):
        out = codecs.open('%s.txt' % encoding, 'w', encoding)
        out.write(u'# coding = %s\n' % encoding)
        out.write(u'\u201chello se\u00f1nor\u201d')
        out.close()
    

    Then write the decoder:

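    import codecs, re
    
    def open_detect(path):
        # Sniff the first 80 bytes for a "# coding = <name>" declaration.
        fin = open(path, 'rb')
        prefix = fin.read(80)
        encs = re.findall(r'#\s*coding\s*=\s*([\w\-]+)\s+', prefix)
        # Fall back to UTF-8 when no declaration is found.
        encoding = encs[0] if encs else 'utf-8'
        fin.seek(0)
        # Transcode from the declared encoding to UTF-8 byte strings on read.
        return codecs.EncodedFile(fin, 'utf-8', encoding)
    
    for path in ('utf-8.txt', 'cp1252.txt'):
        fin = open_detect(path)
        print repr(fin.readlines())  # Python 2 print statement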
    

    Output:

    ['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
    ['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
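
    The code above is written for Python 2 (note the print statement). For Python 3, a rough sketch along the same lines might look like the following. The names guess_encoding, CODING_RE, BOMS and the sniff parameter are made up for illustration. Instead of augmenting the byte pattern for UTF-16 as suggested earlier, this version takes a different route: it checks for a byte order mark first and only falls back to the "# coding = ..." pattern when there is none. It assumes a UTF-16 or UTF-32 file actually starts with a BOM; one without a BOM would still need the augmented pattern.

```python
import codecs
import re

# Matches a declaration such as "# coding = cp1252" in the raw bytes.
CODING_RE = re.compile(rb'#\s*coding\s*=\s*([\w\-]+)')

# BOMs are checked in order; UTF-32 LE comes before UTF-16 LE because
# its BOM begins with the same two bytes.
BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)

def guess_encoding(prefix, default='utf-8'):
    """Guess an encoding from the first bytes of a file (hypothetical helper)."""
    for bom, name in BOMS:
        if prefix.startswith(bom):
            return name
    match = CODING_RE.search(prefix)
    return match.group(1).decode('ascii') if match else default

def open_detect(path, default='utf-8', sniff=80):
    """Open path in text mode, decoding with the guessed encoding."""
    with open(path, 'rb') as fin:
        prefix = fin.read(sniff)
    return open(path, 'r', encoding=guess_encoding(prefix, default))
```

    Here open_detect returns an ordinary text-mode file object, so readlines() yields str rather than UTF-8 byte strings. If the file declares an encoding name that Python does not know, the final open() call raises LookupError.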
    
