I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encodin
I examined the sources of tokenizer.c (thanks to @Ninefingers for suggesting this in another answer and giving a link to the source browser). It seems that the exact algorithm used by Python is (equivalent to) the following. In various places I'll describe the algorithm as reading byte by byte; obviously one wants to do something buffered in practice, but it's easier to describe this way. The initial part of the file is processed as follows:
Now the rules for what to do when 'I found a coding declaration':
'utf-8', or something beginning with 'utf-8-'.)
codecs module. In particular, the division of the rest of the bytes in the file into lines is the job of the new encoding.
codecs module. On the other hand, if the encoding is 'utf-8' in some form or 'latin-1' in some form, transform lines ending '\r' or '\r\n' into lines ending '\n'. (''utf-8' in some form' means the same as before. ''latin-1' in some form' means anything which, after converting to lower case and converting underscores to hyphens, is one of the literal strings 'latin-1', 'iso-latin-1' or 'iso-8859-1', or any string beginning with one of 'latin-1-', 'iso-latin-1-' or 'iso-8859-1-'.)
For what I'm doing, fidelity to Python's behaviour is important. My plan is to roll an implementation of the algorithm above in Python, and use this. Thanks to everyone who answered!
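The two 'in some form' tests and the newline rule described above are easy to get subtly wrong, so here is a minimal Python 3 sketch of them. The function names are my own, not taken from tokenizer.c, and the sketch assumes the 'utf-8' test uses the same lower-casing and underscore-to-hyphen step that is spelled out for 'latin-1'.

```python
# Hypothetical helper names; only the rules themselves come from the description above.
_LATIN1_NAMES = ('latin-1', 'iso-latin-1', 'iso-8859-1')


def _normalise(name):
    # Lower-case the name and turn underscores into hyphens.
    return name.lower().replace('_', '-')


def is_utf8_form(name):
    # Assumption: the same normalisation applies here as for latin-1.
    n = _normalise(name)
    return n == 'utf-8' or n.startswith('utf-8-')


def is_latin1_form(name):
    n = _normalise(name)
    return any(n == base or n.startswith(base + '-') for base in _LATIN1_NAMES)


def fix_line_ending(line, encoding):
    # Only utf-8 and latin-1 forms get their '\r' / '\r\n' endings rewritten to '\n'.
    if not (is_utf8_form(encoding) or is_latin1_form(encoding)):
        return line
    if line.endswith('\r\n'):
        return line[:-2] + '\n'
    if line.endswith('\r'):
        return line[:-1] + '\n'
    return line
```

Note that under these rules a name such as 'utf8' (no hyphen) does not count as 'utf-8' in some form, so lines read with it would be passed through untouched.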
You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII the code below should work as-is.
If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark.
First, generate a few test files which advertise their encoding:
import codecs

# Write one sample file per encoding; line 1 of each file declares its encoding.
for encoding in ('utf-8', 'cp1252'):
    out = codecs.open('%s.txt' % encoding, 'w', encoding)
    out.write(u'# coding = %s\n' % encoding)
    out.write(u'\u201chello se\u00f1nor\u201d')
    out.close()
Then write the decoder:
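import codecs, re

def open_detect(path):
    fin = open(path, 'rb')
    # Look for a "# coding = NAME" declaration in the first 80 bytes.
    prefix = fin.read(80)
    encs = re.findall(r'#\s*coding\s*=\s*([\w\d\-]+)\s+', prefix)
    encoding = encs[0] if encs else 'utf-8'
    fin.seek(0)
    # Reads from the wrapper return UTF-8 bytes, whatever the file's own encoding.
    return codecs.EncodedFile(fin, 'utf-8', encoding)

# Python 2 syntax, matching the output shown below.
for path in ('utf-8.txt', 'cp1252.txt'):
    fin = open_detect(path)
    print repr(fin.readlines())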
Output:
['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
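For the UTF-16 case mentioned earlier, one alternative to matching \x00c\x00o byte patterns by hand is to decode the prefix according to its byte order mark first and then run the ordinary pattern on text. The following is only a sketch in Python 3 under that assumption; `sniff_encoding` is a hypothetical helper, and the declaration syntax is the "# coding = NAME" form used by the test files above, not Python's own.

```python
import codecs
import re

# Assumed declaration syntax, as written by the test files: "# coding = NAME".
_DECL = re.compile(r'#\s*coding\s*=\s*([\w\-]+)')


def sniff_encoding(prefix, default='utf-8'):
    # Guess the encoding from the first bytes of a file.
    if prefix.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec uses the BOM to pick the byte order; 'ignore'
        # tolerates a prefix that was cut in the middle of a code unit.
        text = prefix.decode('utf-16', errors='ignore')
    else:
        # ASCII-superset case: latin-1 maps every byte to a character,
        # so this decode can never fail.
        text = prefix.decode('latin-1')
    match = _DECL.search(text)
    return match.group(1) if match else default
```

The returned name could then be passed to `codecs.EncodedFile` or `io.open` in place of the `encs[0]` lookup. A file without a BOM and without a declaration falls back to the default.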
From said PEP (0263):
Python's tokenizer/compiler combo will need to be updated to work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
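To make the decode and re-encode steps of that pipeline concrete, here is a small Python 3 sketch. It only illustrates the byte-level round trip; the function names are made up for this example and nothing here is the interpreter's actual code.

```python
def source_to_utf8(raw_bytes, file_encoding):
    # Decode into Unicode using the single per-file encoding...
    text = raw_bytes.decode(file_encoding)
    # ...then convert to the UTF-8 byte string that the tokenizer works on.
    return text.encode('utf-8')


def literal_to_file_encoding(utf8_literal, file_encoding):
    # For 8-bit string literals: re-encode the UTF-8 data back into the
    # file's declared encoding.
    return utf8_literal.decode('utf-8').encode(file_encoding)


raw = 'se\u00f1or'.encode('cp1252')       # bytes as they would sit in a cp1252 file
utf8 = source_to_utf8(raw, 'cp1252')      # what the tokenizer would see
back = literal_to_file_encoding(utf8, 'cp1252')
```

Here `back` equals `raw`, which is the point of the last step: an 8-bit string literal ends up with the same bytes the author typed into the file.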
Indeed, if you check Parser/tokenizer.c in the Python source you'll find functions get_coding_spec and check_coding_spec, which are responsible for finding this information on a line being examined in decoding_fgets.
It doesn't look like this capability is being exposed anywhere to you as a Python API (at least these specific functions aren't Py-prefixed), so your options are a third-party library and/or re-purposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
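On Python 3 the situation looks better: the `tokenize` module has a `detect_encoding` function that takes a `readline` callable and returns the detected encoding name plus the lines it consumed. A minimal sketch, with `detect_source_encoding` being a hypothetical wrapper of my own:

```python
import io
import tokenize


def detect_source_encoding(data):
    # tokenize.detect_encoding wants a readline callable that yields bytes;
    # it returns (encoding_name, list_of_lines_it_read).
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding
```

It looks for a UTF-8 BOM and a PEP 263 style declaration such as `# -*- coding: latin-1 -*-`, and falls back to 'utf-8' when it finds neither. It expects 'coding' to be followed directly by ':' or '=', so a line written as '# coding = cp1252' (with a space before the '=') would not be recognised.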