I\'m currently writing something that needs to handle very large text files (a few GiB at least). What\'s needed here (and this is fixed) is:
It's very difficult to maintain a 1:1 mapping between a sequence of Java chars (which are effectively UTF-16) and bytes which could be anything depending on your file encoding. Even with UTF-8, the "obvious" mapping of 1 byte to 1 char only works for ASCII. Neither UTF-16 nor UTF-8 guarantees that a unicode character can be stored in a single machine char
or byte
.
I would maintain my window into the file as a byte buffer, not a char buffer. Then to find line endings in the byte buffer, I'd encode the Java string "\r\n"
(or possibly just "\n"
) as a byte sequence using the same encoding as the file is in. I'd then use that byte sequence to search for line endings in the byte buffer. The position of a line ending in the buffer + the offset of the buffer from the start of the file maps exactly to the byte position in the file of the line ending.
Appending lines is just a case of seeking to the end of the file and adding your new lines. Changing lines is more tricky. I think I would maintain a list or map of byte positions of changed lines and what the change is. When ready to write the changes: