I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encoding is declared at the start of the file.
You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII, the code below should work as-is.
If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o, or the reverse, depending on the byte order mark.
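For the byte-order-mark case, detection can be done before any regex work by checking the first two bytes against the BOM constants in the codecs module. This is an illustrative sketch only (the helper name sniff_bom is my own, not part of the decoder below):

```python
import codecs

def sniff_bom(data):
    # Check the leading bytes against the UTF-16 byte order marks.
    # codecs.BOM_UTF16_LE is b'\xff\xfe'; codecs.BOM_UTF16_BE is b'\xfe\xff'.
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    # No BOM found; fall back to whatever 8-bit detection you use.
    return None
```

With a BOM present, you know the byte order and can skip the `\x00c\x00o` pattern matching entirely.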
First, generate a few test files which advertise their encoding:
import codecs

for encoding in ('utf-8', 'cp1252'):
    out = codecs.open('%s.txt' % encoding, 'w', encoding)
    out.write('# coding = %s\n' % encoding)
    out.write(u'\u201chello se\u00f1nor\u201d')
    out.close()
Then write the decoder:
import codecs, re

def open_detect(path):
    fin = open(path, 'rb')
    prefix = fin.read(80)
    encs = re.findall(r'#\s*coding\s*=\s*([\w\-]+)\s+', prefix)
    encoding = encs[0] if encs else 'utf-8'
    fin.seek(0)
    return codecs.EncodedFile(fin, 'utf-8', encoding)
for path in ('utf-8.txt', 'cp1252.txt'):
    fin = open_detect(path)
    print repr(fin.readlines())
Output:
['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
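The code above is Python 2. On Python 3, where codecs.EncodedFile is awkward with text, the same idea can be sketched with the built-in open() and its encoding argument. This open_detect is my own hypothetical reworking, not the original answer's code:

```python
import re

def open_detect(path):
    # Probe a small binary prefix for an advertised coding declaration.
    with open(path, 'rb') as probe:
        prefix = probe.read(80)
    m = re.search(rb'#\s*coding\s*=\s*([\w\-]+)', prefix)
    # Fall back to UTF-8 when no declaration is found.
    encoding = m.group(1).decode('ascii') if m else 'utf-8'
    # Reopen as a text file decoded with the detected encoding.
    return open(path, encoding=encoding)
```

This returns ordinary str objects rather than transcoded UTF-8 bytes, which is usually what you want in Python 3.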