Question
A memory-mapped file is an efficient way to run regexes over, or otherwise manipulate, large binary files.
If I have a large text file (~1 GB), is it possible to work with an encoding-aware mapped file?
A regex like [\u1234-\u5678] won't work on bytes objects, and encoding the pattern will not work either ("[\u1234-\u5678]".encode("utf-32"), for example, does not preserve the range correctly).
Searching might work if I convert the search pattern from str to bytes using .encode(), but that is still somewhat limited, and there should be a simpler way than decoding and encoding all day.
I have tried wrapping it with io.TextIOWrapper inside an io.BufferedRandom, but to no avail:
AttributeError: 'mmap.mmap' object has no attribute 'seekable'
Creating a wrapper (using inheritance) and setting the methods seekable, readable and writable to return True did not work either.
Regarding encoding, a fixed-length encoding may be assumed: utf-32, code points, or the lower BMP of utf-16 (if it is even possible to refer to just that part).
A solution for any Python version is welcome.
Answer 1:
You can't do this without essentially reinventing the wheel from scratch (writing all-new versions of the re module, the mmap module, etc.), or writing extraordinarily complex regexes that can't use niceties like true Unicode character ranges. To express [\u1234-\u5678] you would need an alternation between three different byte patterns, something like (?:\x12[\x34-\xff]|[\x13-\x55].|\x56[\x00-\x78]), which assumes big-endian UCS-2 and the re.DOTALL flag so that . also matches the byte 0x0A.
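As a rough illustration of that byte-level alternation, the sketch below compiles it for big-endian UTF-16 restricted to the BMP (an assumption; the two bytes of each unit would be swapped for little-endian) and discards hits that start on an odd byte offset. The names RANGE_UTF16BE and find_in_range are hypothetical, and [\x00-\xff] stands in for . so that no re.DOTALL flag is needed.

```python
import re

# Byte-level stand-in for [\u1234-\u5678], assuming UTF-16 big-endian,
# BMP-only text and no BOM at the start of the buffer.
RANGE_UTF16BE = re.compile(
    rb"\x12[\x34-\xff]"           # U+1234..U+12FF
    rb"|[\x13-\x55][\x00-\xff]"   # U+1300..U+55FF
    rb"|\x56[\x00-\x78]"          # U+5600..U+5678
)

def find_in_range(buf, pos=0):
    """Return the byte offset of the first 2-byte-aligned match, or -1."""
    while True:
        m = RANGE_UTF16BE.search(buf, pos)
        if m is None:
            return -1
        if m.start() % 2 == 0:
            return m.start()
        # The hit straddles two characters; resume one byte later.
        pos = m.start() + 1

sample = "ab\u4e2dcd".encode("utf-16-be")
print(find_in_range(sample))  # 4, i.e. character index 2
```

Because a compiled bytes pattern accepts any bytes-like object, an mmap.mmap instance can be passed as buf in place of sample. Dividing the returned offset by 2 gives the character index under the same BMP-only assumption.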
Basically, re patterns work either with str or with bytes-like objects, and you can't work around that with memoryviews and casts, because re still treats the buffer as bytes, not as larger types.
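A small demonstration of that restriction, using a made-up six-byte sample: a str pattern applied to bytes raises TypeError, while the same pattern matches once the data has been decoded. The helper try_search is hypothetical and exists only to capture the outcome.

```python
import re

PATTERN = "[\u1234-\u5678]"              # str pattern with a real code point range
data = "x\u4e2dy".encode("utf-16-be")    # hypothetical sample data

def try_search(pattern, subject):
    """Return the match start, -1 for no match, or the TypeError raised by re."""
    try:
        m = re.search(pattern, subject)
    except TypeError as exc:
        return exc
    return m.start() if m else -1

# str pattern on a bytes subject: re refuses to mix the two types.
mixed = try_search(PATTERN, data)

# str pattern on a str subject: the range behaves as expected.
decoded = try_search(PATTERN, data.decode("utf-16-be"))

print(type(mixed).__name__, decoded)  # TypeError 1
```

Decoding the whole buffer, as done here for the tiny sample, is exactly what becomes impractical for a mapped file of around 1 GB.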
For simple searches, you could try using mmap.find after encoding the string you are searching for, but that is still prone to subtle bugs. For UCS-2 or UTF-32, you'd need to check that the return value from find is aligned on a two- or four-byte boundary respectively, to ensure you didn't mistake the end of one character and the beginning of the next for a completely different character. If the alignment test fails, you'd have to repeat the search with a start offset of the last return value + 1 until you either get a hit or find returns -1. It's just not a reasonable thing to do in the general case.
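The retry loop described here can be sketched as follows. This is a minimal illustration, not a general solution: find_aligned and its parameters are hypothetical names, and it assumes a fixed-width encoding with an explicit byte order (such as utf-16-le or utf-32-le) so that .encode() does not prepend a BOM to the needle.

```python
def find_aligned(buf, text, encoding="utf-16-le", width=2, start=0):
    """Return the first offset of text in buf that falls on a width-byte
    boundary, or -1. buf needs a find(sub, start) method, which both
    bytes and mmap.mmap provide."""
    needle = text.encode(encoding)
    pos = buf.find(needle, start)
    # A hit at an unaligned offset spans the tail of one character and
    # the head of the next, so skip it and search again one byte later.
    while pos != -1 and pos % width:
        pos = buf.find(needle, pos + 1)
    return pos

# "\u6100\u6200" encodes to 00 61 00 62, which contains the bytes of "a"
# (61 00) at offset 1 even though the text has no "a" in it.
haystack = "\u6100\u6200a".encode("utf-16-le")
print(haystack.find("a".encode("utf-16-le")))  # 1, a false hit
print(find_aligned(haystack, "a"))             # 4, the real "a"
```

For UTF-32 the same function would be called with encoding="utf-32-le" and width=4. It covers only literal substring search, which is why the answer does not consider it a reasonable substitute for regex support.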
Source: https://stackoverflow.com/questions/36229717/opening-memory-mapped-file-with-encoding