codecs.open(utf-8) fails to read plain ASCII file

一向 2021-01-12 21:29

I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can

1 answer
  • 2021-01-12 22:03

    Found your problem:

    When passed an encoding, codecs.open returns a StreamReaderWriter, which is just a wrapper around a StreamReader and a StreamWriter (composition, not inheritance; it is not a subclass of either). The problem is:

    1. StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
    2. It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, not a limit. The second argument, chars, is a strict limit, but StreamReaderWriter never passes it along (its read method doesn't accept it)
    3. When size is given as a hint but the result isn't capped with chars, and StreamReader already has buffered data large enough to satisfy the hint, StreamReader.read blindly returns the whole buffer rather than trimming it to the size hint (after all, only chars imposes a maximum return size)
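
The relationship described in the list above can be checked directly. This is a minimal sketch, not a reproduction of the asker's script: it builds a StreamReaderWriter by hand over an in-memory io.BytesIO (an assumption made so the example needs no file; the contents are made up) and only exercises the documented size/chars arguments. Whether a single-argument read(1) over-returns may depend on the Python version, so that part is deliberately not asserted here.

```python
import codecs
import io

# In-memory stand-in for the asker's file (contents are made up for this sketch).
raw = io.BytesIO(b"hello world")

# Build the wrapper by hand from the UTF-8 reader and writer classes.
wrapper = codecs.StreamReaderWriter(
    raw, codecs.getreader("utf-8"), codecs.getwriter("utf-8")
)

# Composition, not inheritance: the wrapper holds a StreamReader in .reader.
is_subclass = isinstance(wrapper, codecs.StreamReader)
holds_reader = isinstance(wrapper.reader, codecs.StreamReader)

# size says how many bytes to pull per underlying read; chars caps the result.
first = wrapper.reader.read(100, 1)   # pulls up to 100 bytes, returns 1 char
second = wrapper.reader.read(100, 2)  # served from the reader's own buffer
third = wrapper.reader.read(1, 3)     # also served from the buffer

print(is_subclass, holds_reader, repr(first), repr(second), repr(third))
```

The first call decodes everything it pulled into the reader's buffer, which is why the later calls return data without touching the underlying stream again.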

    The API of StreamReader.read and the meaning of size/chars are the only documented things here. The fact that codecs.open returns a StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader; I just used ipython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter; it's all Python level, so it's easy).

    The best solution is to switch to io.open, which is faster and more correct in every standard case. (codecs.open also supports the unusual codecs that don't convert between bytes [Py2 str] and str [Py2 unicode] but instead handle str-to-str or bytes-to-bytes encodings. That's an incredibly limited use case; most of the time, you're converting between bytes and str.) All you need to do is import io instead of codecs, and change the codecs.open line to:

    f = io.open("test.py", encoding="utf-8")
    

    The rest of your code can remain unchanged (and will likely run faster to boot).
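
For illustration, here is a small self-contained sketch of reading a file one character at a time through io.open. The helper name read_chars and the temporary file are hypothetical; they stand in for the asker's test.py, whose contents aren't shown.

```python
import io
import os
import tempfile


def read_chars(path):
    """Yield the file's characters one at a time using io.open."""
    with io.open(path, encoding="utf-8") as f:
        while True:
            c = f.read(1)  # text-mode read(1) returns at most one character
            if not c:      # empty string means end of file
                break
            yield c


# Hypothetical plain-ASCII file, created only for this sketch.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"abc\n")

chars = list(read_chars(path))
os.remove(path)
print(chars)
```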

    As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:

    c = f.read(1)
    

    to:

    # Pass second, character limiting argument after size hint
    c = f.reader.read(6, 1)  # 6 is sort of arbitrary; should ensure a full char read in one go
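
A rough sketch of that workaround wrapped in a loop is below. It assumes f is a StreamReaderWriter, the type codecs.open is described as returning above; here one is built by hand over an in-memory buffer so the example doesn't need a file. The helper name iter_chars and the sample data are made up for illustration.

```python
import codecs
import io


def iter_chars(f, size_hint=6):
    """Yield one character at a time from a codecs StreamReaderWriter."""
    while True:
        # First argument: byte-count hint. Second argument: hard cap of 1 char.
        c = f.reader.read(size_hint, 1)
        if not c:  # empty string means the stream is exhausted
            break
        yield c


data = u"a\u00e9z".encode("utf-8")  # U+00E9 takes two bytes in UTF-8
f = codecs.StreamReaderWriter(
    io.BytesIO(data), codecs.getreader("utf-8"), codecs.getwriter("utf-8")
)
chars = list(iter_chars(f))
print(chars)
```

Because this reaches into the wrapper's reader attribute, which the answer notes is not a contractual part of the API, it is best treated as a stopgap.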
    

    I suspect Python Bug #8260, which covers intermingling readline and read on file objects created by codecs.open, applies here. Officially it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will still be able to break it.

    Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.
