I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encoding will be explicitly declared at the beginning of the file.
I examined the source of tokenizer.c (thanks to @Ninefingers for suggesting this in another answer and giving a link to the source browser). It seems that the exact algorithm used by Python is (equivalent to) the following. In various places I'll describe the algorithm as reading byte by byte; obviously one wants to do something buffered in practice, but it's easier to describe this way. The initial part of the file is processed as follows:
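In outline, per PEP 263: check for a UTF-8 BOM, then look for a coding declaration matching coding[:=]\s*([-\w.]+) in a comment on one of the first two lines. Here is a minimal sketch of that phase (the helper name and return convention are mine, not Python's):

import re

CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')  # the pattern given in PEP 263

def detect_declared_encoding(raw):
    # raw: the initial bytes of the file.
    # Returns (bom_found, declared_encoding_or_None).
    bom = raw.startswith(b'\xef\xbb\xbf')
    if bom:
        raw = raw[3:]  # skip the UTF-8 byte order mark
    for line in raw.split(b'\n')[:2]:
        stripped = line.strip()
        if stripped and not stripped.startswith(b'#'):
            break  # a declaration must precede any non-comment line
        match = CODING_RE.search(line.decode('ascii', 'replace'))
        if match:
            return bom, match.group(1)
    return bom, None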
Now the rules for what to do when 'I found a coding declaration' (the two 'in some form' tests used below are made concrete in the sketch after this list):

- If a BOM was found, the declared encoding must be 'utf-8' in some form; anything else is an error. ('utf-8' in some form means anything which, after converting to lower case and converting underscores to hyphens, is either the literal string 'utf-8', or something beginning with 'utf-8-'.)
- If the encoding is anything other than 'utf-8' in some form or 'latin-1' in some form, the rest of the file is read with a reader for that encoding obtained from the codecs module. In particular, the division of the rest of the bytes in the file into lines is the job of the new encoding, via that reader from the codecs module.
- On the other hand, if the encoding is 'utf-8' in some form or 'latin-1' in some form, transform lines ending '\r' or '\r\n' into lines ending '\n'. (''utf-8' in some form' means the same as before. ''latin-1' in some form' means anything which, after converting to lower case and converting underscores to hyphens, is one of the literal strings 'latin-1', 'iso-latin-1' or 'iso-8859-1', or any string beginning with one of 'latin-1-', 'iso-latin-1-' or 'iso-8859-1-'.)
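To make the two 'in some form' tests concrete, here's a small sketch (the helper names are mine, not Python's):

def _normalize(name):
    # Lower-case the name and convert underscores to hyphens, as described above.
    return name.lower().replace('_', '-')

def _is_utf8_in_some_form(name):
    name = _normalize(name)
    return name == 'utf-8' or name.startswith('utf-8-')

def _is_latin1_in_some_form(name):
    name = _normalize(name)
    bases = ('latin-1', 'iso-latin-1', 'iso-8859-1')
    return name in bases or name.startswith(tuple(b + '-' for b in bases))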
For what I'm doing, fidelity to Python's behaviour is important. My plan is to roll my own implementation of the algorithm above in Python, and use that. Thanks to everyone who answered!
You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII, the code below should work as-is.
If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark.
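For example, patterns along these lines (my sketch, not the answer's code) would catch the word 'coding' in either UTF-16 byte order, where each ASCII character gains a NUL byte on one side or the other:

import re

CODING_UTF16_BE = re.compile(b'\x00c\x00o\x00d\x00i\x00n\x00g')  # big-endian
CODING_UTF16_LE = re.compile(b'c\x00o\x00d\x00i\x00n\x00g\x00')  # little-endian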
First, generate a few test files which advertise their encoding:
import codecs

for encoding in ('utf-8', 'cp1252'):
    out = codecs.open('%s.txt' % encoding, 'w', encoding)
    out.write('# coding = %s\n' % encoding)
    out.write(u'\u201chello se\u00f1nor\u201d')
    out.close()
Then write the decoder:
import codecs, re

def open_detect(path):
    # Peek at the start of the file for a '# coding = name' declaration.
    fin = open(path, 'rb')
    prefix = fin.read(80)
    encs = re.findall(r'#\s*coding\s*=\s*([\w\d\-]+)\s+', prefix)
    encoding = encs[0] if encs else 'utf-8'  # default to UTF-8 if none found
    fin.seek(0)
    # EncodedFile transcodes from the detected encoding to UTF-8 on read.
    return codecs.EncodedFile(fin, 'utf-8', encoding)

for path in ('utf-8.txt', 'cp1252.txt'):
    fin = open_detect(path)
    print repr(fin.readlines())
Output:
['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
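Note that the quoted payload line comes back byte-for-byte identical in both cases: codecs.EncodedFile transcodes each file from its detected encoding to UTF-8 as it is read.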