Opening memory-mapped file with encoding

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-05 12:55:58

You can't do this without essentially reinventing the wheel (writing all-new versions of the re module, the mmap module, etc.), or writing extraordinarily complex byte-level regexes that can't use niceties like true Unicode character ranges. For example, to match [\u1234-\u5678] against big-endian UCS-2 data you'd need an alternation between three different patterns, something like (?:\x12[\x34-\xff]|[\x13-\x55].|\x56[\x00-\x78]), compiled with re.DOTALL so that . also matches a 0x0a byte.
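
To make the three-way alternation concrete, here is a minimal sketch that compares it with the ordinary str range. It assumes big-endian two-byte code units (utf-16-be restricted to the BMP); the names BYTE_RANGE, STR_RANGE, in_range_bytes and in_range_str are made up for illustration.

```python
import re

# Byte-level equivalent of the str pattern [\u1234-\u5678] for big-endian
# two-byte code units. re.DOTALL lets "." match the byte 0x0a as well.
BYTE_RANGE = re.compile(
    rb"(?:\x12[\x34-\xff]|[\x13-\x55].|\x56[\x00-\x78])", re.DOTALL
)
STR_RANGE = re.compile("[\u1234-\u5678]")


def in_range_bytes(ch: str) -> bool:
    """Check one BMP character via its encoded bytes (hypothetical helper)."""
    return BYTE_RANGE.fullmatch(ch.encode("utf-16-be")) is not None


def in_range_str(ch: str) -> bool:
    """Check the same character with the plain str pattern."""
    return STR_RANGE.fullmatch(ch) is not None


# Boundary cases plus U+4E0A, whose low byte is 0x0a (a newline byte).
for ch in ("\u1233", "\u1234", "\u4e0a", "\u5678", "\u5679"):
    print(f"U+{ord(ch):04X}", in_range_bytes(ch), in_range_str(ch))
```

Note that fullmatch is applied to exactly one encoded character here. Searching a whole buffer with this pattern would still need an alignment check, since a match could start on the second byte of a character.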

Basically, re patterns work only with str or with bytes-like objects. You can't work around that with memoryviews and casts, because re still treats the buffer as individual bytes, not as larger units.

For simple searches, you could try using mmap.find after encoding the search string, but that's still prone to subtle bugs. For UCS-2 or UTF-32, you'd need to check that the return value from find is aligned on a two- or four-byte boundary respectively, to ensure you didn't mistake the end of one character and the beginning of the next for a completely different character. If the alignment test fails, you'd have to repeat the search with a start offset of the last return value + 1 until you either get an aligned hit or find returns -1. It's just not a reasonable thing to do in the general case.
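
The retry loop described above could be sketched roughly as follows. This is only an illustration for fixed-width encodings: find_aligned is a hypothetical helper, and an anonymous mmap stands in for a real mapped file.

```python
import mmap


def find_aligned(buf, needle: bytes, width: int, start: int = 0) -> int:
    """Return the first offset >= start at which needle occurs on a
    multiple of width bytes, or -1 if there is none (hypothetical helper)."""
    pos = buf.find(needle, start)
    while pos != -1 and pos % width != 0:
        # Misaligned hit: it straddles two characters, so search again
        # starting one byte after it.
        pos = buf.find(needle, pos + 1)
    return pos


# U+4142 U+4344 U+4243 encodes to the bytes 41 42 43 44 42 43, so the
# bytes of U+4243 (42 43) also appear across the first two characters.
data = "\u4142\u4344\u4243".encode("utf-16-be")
needle = "\u4243".encode("utf-16-be")

mm = mmap.mmap(-1, len(data))  # anonymous map, used in place of a file
mm[:] = data

raw_hit = mm.find(needle, 0)                # 1: straddles two characters
aligned_hit = find_aligned(mm, needle, 2)   # 4: the real U+4243
char_index = aligned_hit // 2               # 2
mm.close()

print(raw_hit, aligned_hit, char_index)
```

The start offset is passed to mm.find explicitly because mmap.find otherwise begins at the map's current position rather than at offset 0. None of this helps with variable-width encodings such as UTF-8 or UTF-16 with surrogate pairs.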
