Read a Unicode file in Python which declares its encoding in the same way as Python source

情书的邮戳 2021-02-14 19:07

I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encodin

3 Answers
  • 2021-02-14 19:21

    I examined the sources of tokenizer.c (thanks to @Ninefingers for suggesting this in another answer and giving a link to the source browser). It seems that the exact algorithm used by Python is (equivalent to) the following. In various places I'll describe the algorithm as reading byte by byte; obviously one wants to do something buffered in practice, but it's easier to describe this way. The initial part of the file is processed as follows:

    1. Upon opening a file, attempt to recognize the UTF-8 BOM at the beginning of the file. If you see it, eat it and make a note of the fact you saw it. Do not recognize the UTF-16 byte order mark.
    2. Read 'a line' of text from the file. 'A line' is defined as follows: you keep reading bytes until you see one of the strings '\n', '\r' or '\r\n', trying to match as long a string as possible (this means that if you see '\r' you have to speculatively read the next character, and if it's not a '\n', put it back). The terminator is included in the line, as is usual Python practice.
    3. Decode this string using the UTF-8 codec. Unless you have seen the UTF-8 BOM, generate an error message if you see any non-ASCII characters (i.e. any characters above 127). (Python 3.0 does not, of course, generate an error here.) Pass this decoded line on to the user for processing.
    4. Attempt to interpret this line as a comment containing a coding declaration, using the regexp in PEP 0263. If you find a coding declaration, skip to the instructions below for 'I found a coding declaration'.
    5. OK, so you didn't find a coding declaration. Read another line from the input, using the same rules as in step 2 above.
    6. Decode it, using the same rules as step 3, and pass it on to the user for processing.
    7. Attempt again to interpret this line as a coding declaration comment, as in step 4. If you find one, skip to the instructions below for 'I found a coding declaration'.
    8. OK. We've now checked the first two lines. According to PEP 0263, if there was going to be a coding declaration, it would have been on the first two lines, so we now know we're not going to see one. We now read the rest of the file using the same reading instructions as we used to read the first two lines: we read the lines using the rules in step 2, decode using the rules in step 3 (making an error if we see non-ASCII bytes unless we saw a BOM).
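
    A minimal Python 3 sketch of steps 1 to 8, working on a bytes object already read into memory rather than byte by byte. The names `read_line` and `sniff_encoding` are made up for illustration, the regexp is the one PEP 0263 currently gives, and the non-ASCII check from step 3 is left out, so treat it as an approximation rather than a faithful port of tokenizer.c:

```python
import codecs
import re

# Regexp from PEP 0263: a comment containing "coding:" or "coding=".
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def read_line(buf, pos):
    # Return (line, new_pos). The line ends at LF, CR or CRLF, and the
    # terminator is kept. A CR followed by LF is consumed as one terminator.
    i = pos
    while i < len(buf):
        if buf[i:i + 1] == b"\n":
            return buf[pos:i + 1], i + 1
        if buf[i:i + 1] == b"\r":
            end = i + 2 if buf[i + 1:i + 2] == b"\n" else i + 1
            return buf[pos:end], end
        i += 1
    return buf[pos:], len(buf)


def sniff_encoding(data):
    # Return (declared_encoding_or_None, saw_bom) from the first two lines.
    saw_bom = data.startswith(codecs.BOM_UTF8)  # the UTF-16 BOM is ignored
    pos = len(codecs.BOM_UTF8) if saw_bom else 0
    for _ in range(2):
        line, pos = read_line(data, pos)
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii"), saw_bom
    return None, saw_bom
```

    A caller would then decode the rest of the data with the returned encoding, or with UTF-8 when the result is `None`.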

    Now the rules for what to do when 'I found a coding declaration':

    1. If we previously saw a UTF-8 BOM, check that the coding declaration says 'utf-8' in some form. Throw an error otherwise. ("'utf-8' in some form" means anything which, after converting to lower case and converting underscores to hyphens, is either the literal string 'utf-8', or something beginning with 'utf-8-'.)
    2. Read the rest of the file using the decoder associated to the given encoding in the Python codecs module. In particular, the division of the rest of the bytes in the file into lines is the job of the new encoding.
    3. One final wrinkle: universal newline type stuff. The rules here are as follows. If the encoding is anything except 'utf-8' in some form or 'latin-1' in some form, do no universal-newline stuff at all; just pass out lines exactly as they come from the decoder in the codecs module. On the other hand, if the encoding is 'utf-8' in some form or 'latin-1' in some form, transform lines ending '\r' or '\r\n' into lines ending '\n'. ("'utf-8' in some form" means the same as before. "'latin-1' in some form" means anything which, after converting to lower case and converting underscores to hyphens, is one of the literal strings 'latin-1', 'iso-latin-1' or 'iso-8859-1', or any string beginning with one of 'latin-1-', 'iso-latin-1-' or 'iso-8859-1-'.)
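
    The name-matching and newline rules above can be sketched roughly as follows. All function names here are hypothetical, and the error message is my own wording rather than the one Python prints:

```python
def normalize_encoding_name(name):
    # Lower-case the name and turn underscores into hyphens.
    return name.lower().replace("_", "-")


def _matches(name, stems):
    # True if the normalized name equals a stem or starts with "<stem>-".
    norm = normalize_encoding_name(name)
    return any(norm == s or norm.startswith(s + "-") for s in stems)


def is_utf8_form(name):
    return _matches(name, ("utf-8",))


def is_latin1_form(name):
    return _matches(name, ("latin-1", "iso-latin-1", "iso-8859-1"))


def check_bom_consistency(name, saw_bom):
    # Rule 1: a UTF-8 BOM only goes together with a 'utf-8' declaration.
    if saw_bom and not is_utf8_form(name):
        raise SyntaxError("encoding %r conflicts with the UTF-8 BOM" % name)


def wants_universal_newlines(name):
    # Rule 3: only utf-8 and latin-1 style names get newline translation.
    return is_utf8_form(name) or is_latin1_form(name)


def translate_newlines(line):
    # Turn a trailing CRLF or CR into a single LF.
    if line.endswith("\r\n"):
        return line[:-2] + "\n"
    if line.endswith("\r"):
        return line[:-1] + "\n"
    return line
```

    Note that under these rules 'utf8' (no hyphen) does not count as 'utf-8' in some form, and 'iso-8859-15' does not count as 'latin-1' in some form.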

    For what I'm doing, fidelity to Python's behaviour is important. My plan is to roll an implementation of the algorithm above in Python, and use this. Thanks to everyone who answered!

  • 2021-02-14 19:31

    You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII the code below should work as-is.

    If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark. First, generate a few test files which advertise their encoding:

    import codecs

    # Write one sample file per encoding; each declares its encoding on line 1.
    for encoding in ('utf-8', 'cp1252'):
        out = codecs.open('%s.txt' % encoding, 'w', encoding)
        out.write(u'# coding = %s\n' % encoding)
        out.write(u'\u201chello se\u00f1nor\u201d')
        out.close()
    

    Then write the decoder:

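    import codecs, re

    def open_detect(path):
        # Look for a '# coding = NAME' declaration in the first 80 bytes.
        fin = open(path, 'rb')
        prefix = fin.read(80)
        encs = re.findall(br'#\s*coding\s*=\s*([\w\-]+)\s+', prefix)
        # Fall back to UTF-8 when no declaration is found.
        encoding = encs[0].decode('ascii') if encs else 'utf-8'
        fin.seek(0)
        # Bytes read are decoded from `encoding` and re-encoded as UTF-8.
        return codecs.EncodedFile(fin, 'utf-8', encoding)

    for path in ('utf-8.txt', 'cp1252.txt'):
        fin = open_detect(path)
        # Under Python 3 each string in the repr carries a b prefix.
        print(repr(fin.readlines()))
        fin.close()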
    

    Output:

    ['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
    ['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
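
    For the UTF-16 case mentioned earlier, one option (a rough Python 3 sketch, not something the code above does) is to decode the prefix according to the byte order mark and then reuse the same pattern on text, instead of writing a pattern with interleaved NUL bytes. The helper name `sniff_prefix` is made up for illustration:

```python
import codecs
import re

DECL_RE = re.compile(r"#\s*coding\s*=\s*([\w\-]+)")


def sniff_prefix(prefix):
    # Pick a byte order from the BOM. Without a UTF-16 BOM, decode as latin-1,
    # which maps every byte to a character and so never fails.
    if prefix.startswith(codecs.BOM_UTF16_LE):
        body, codec = prefix[2:], "utf-16-le"
    elif prefix.startswith(codecs.BOM_UTF16_BE):
        body, codec = prefix[2:], "utf-16-be"
    else:
        body, codec = prefix, "latin-1"
    if codec != "latin-1":
        body = body[:len(body) // 2 * 2]  # drop a trailing half code unit
    text = body.decode(codec, errors="replace")
    match = DECL_RE.search(text)
    return match.group(1) if match else None
```

    It returns the declared name, or `None` when the prefix contains no declaration; the caller still has to decide what to do when a UTF-16 BOM and the declaration disagree.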
    
  • 2021-02-14 19:40

    From said PEP (0263):

    Python's tokenizer/compiler combo will need to be updated to work as follows:

    1. read the file

    2. decode it into Unicode assuming a fixed per-file encoding

    3. convert it into a UTF-8 byte string

    4. tokenize the UTF-8 content

    5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding

    Indeed, if you check Parser/tokenizer.c in the Python source you'll find functions get_coding_spec and check_coding_spec which are responsible for finding this information on a line being examined in decoding_fgets.

    It doesn't look like this capability is exposed anywhere as a Python API (at least these specific functions aren't Py-prefixed), so your options are a third-party library and/or repurposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
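
    For readers on Python 3: the standard library's `tokenize` module does provide this, through `tokenize.detect_encoding` and `tokenize.open`. A short sketch, where the wrapper names `detect_source_style_encoding` and `read_text` are my own:

```python
import io
import tokenize


def detect_source_style_encoding(data):
    # detect_encoding takes a readline callable that yields bytes and returns
    # (encoding_name, list_of_lines_it_consumed).
    return tokenize.detect_encoding(io.BytesIO(data).readline)


def read_text(path):
    # tokenize.open detects the encoding the same way and returns a
    # text-mode file object opened with that encoding.
    with tokenize.open(path) as f:
        return f.read()
```

    Whether this matches the tokenizer's behaviour in every corner case (newline handling in particular) is something I would check before relying on it.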
