How to deal with a very large text file?

鱼传尺愫 2021-02-07 14:40

I'm currently writing something that needs to handle very large text files (a few GiB at least). What's needed here (and this is fixed) is:

  • CSV-based, following
7 answers
  •  太阳男子
    2021-02-07 15:46

    It's very difficult to maintain a 1:1 mapping between a sequence of Java chars (which are effectively UTF-16 code units) and bytes, which could be anything depending on your file encoding. Even with UTF-8, the "obvious" mapping of 1 byte to 1 char only works for ASCII. Neither UTF-16 nor UTF-8 guarantees that a Unicode character fits in a single Java char or a single byte.
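
To make the mismatch concrete, here is a minimal sketch that counts the same text three ways. The class and method names are made up for illustration; only the standard `String` and `StandardCharsets` APIs are used.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteLength {

    /** Number of bytes the text occupies once encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Java chars, i.e. UTF-16 code units. */
    static int charLength(String text) {
        return text.length();
    }

    /** Number of Unicode code points (what a reader would call characters). */
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }
}
```

For `"a"` all three counts are 1. For `"\u00e9"` (e with acute accent) the UTF-8 length is 2 while the char length is 1. For `"\uD83D\uDE00"` (an emoji outside the Basic Multilingual Plane) the UTF-8 length is 4, the char length is 2, and the code point count is 1.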

    I would maintain my window into the file as a byte buffer, not a char buffer. Then, to find line endings in the byte buffer, I'd encode the Java string "\r\n" (or possibly just "\n") as a byte sequence using the same encoding as the file, and search the byte buffer for that sequence. The position of a line ending in the buffer plus the offset of the buffer from the start of the file maps exactly to the byte position of that line ending in the file.
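
A rough sketch of that search follows. `LineEndingScanner` and `findLineEndings` are hypothetical names, and the code assumes an ASCII-compatible encoding such as UTF-8, where the terminator's bytes can never appear inside a multi-byte character. For UTF-16 you would have to pick `UTF_16LE` or `UTF_16BE` explicitly (plain `UTF_16` makes `getBytes` prepend a byte order mark) and only test even offsets. A terminator that straddles two windows is not handled here.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class LineEndingScanner {

    /**
     * Scans the first `length` bytes of `window` for the encoded line ending and
     * returns the absolute file position of each match.
     * `windowOffset` is the file position of window[0].
     */
    static List<Long> findLineEndings(byte[] window, int length, long windowOffset,
                                      String lineEnding, Charset charset) {
        // Encode the terminator with the file's own encoding.
        byte[] needle = lineEnding.getBytes(charset);
        if (needle.length == 0) {
            throw new IllegalArgumentException("line ending must not be empty");
        }
        List<Long> positions = new ArrayList<>();
        int i = 0;
        while (i + needle.length <= length) {
            if (matchesAt(window, i, needle)) {
                // Buffer index + buffer offset = byte position in the file.
                positions.add(windowOffset + i);
                i += needle.length;
            } else {
                i++;
            }
        }
        return positions;
    }

    /** True if `needle` occurs in `haystack` starting at index `start`. */
    private static boolean matchesAt(byte[] haystack, int start, byte[] needle) {
        for (int j = 0; j < needle.length; j++) {
            if (haystack[start + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }
}
```

With a window holding the UTF-8 bytes of `"a,b\r\nc,d\r\n"` that starts at file offset 100, searching for `"\r\n"` yields the positions 103 and 108.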

    Appending lines is just a case of seeking to the end of the file and adding your new lines. Changing lines is trickier. I think I would maintain a list or map of the byte positions of changed lines and what each change is. When ready to write the changes:

    1. Sort the list of changes by byte position.
    2. Read the original file up to the next change and write it to a temporary file.
    3. Write the changed line to the temporary file.
    4. Skip the changed line in the original file.
    5. Go back to step 2 unless you have reached the end of the original file.
    6. Move the temp file over the original file.
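
The six steps above could be sketched roughly as follows. Everything named here (`LineRewriter`, `Change`, `applyChanges`) is invented for the example. The sketch assumes the changes do not overlap, that each `Change` records the byte length of the old line including its terminator, and that the replacement bytes are already encoded in the file's encoding. A `TreeMap` keyed by byte position takes care of the sorting in step 1.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.TreeMap;

public class LineRewriter {

    /** One pending edit (hypothetical helper type). */
    static final class Change {
        final long oldLength;      // bytes used by the original line, terminator included
        final byte[] replacement;  // new line, already encoded, terminator included
        Change(long oldLength, byte[] replacement) {
            this.oldLength = oldLength;
            this.replacement = replacement;
        }
    }

    static void applyChanges(Path file, TreeMap<Long, Change> changes) throws IOException {
        // Temp file in the same directory, so the final move stays on one file system.
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "rewrite", ".tmp");
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(temp))) {
            long position = 0;
            byte[] buffer = new byte[8192];
            for (Map.Entry<Long, Change> e : changes.entrySet()) {      // step 1: sorted keys
                long toCopy = e.getKey() - position;
                while (toCopy > 0) {                                    // step 2: copy up to the change
                    int n = in.read(buffer, 0, (int) Math.min(buffer.length, toCopy));
                    if (n < 0) throw new EOFException("change lies past end of file");
                    out.write(buffer, 0, n);
                    toCopy -= n;
                }
                out.write(e.getValue().replacement);                    // step 3: write the new line
                long toSkip = e.getValue().oldLength;
                while (toSkip > 0) {                                    // step 4: skip the old line
                    long n = in.skip(toSkip);
                    if (n <= 0) {
                        if (in.read() < 0) throw new EOFException("old line runs past end of file");
                        n = 1;
                    }
                    toSkip -= n;
                }
                position = e.getKey() + e.getValue().oldLength;         // step 5: next change
            }
            int n;
            while ((n = in.read(buffer)) > 0) {                         // copy the unchanged tail
                out.write(buffer, 0, n);
            }
        }
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);    // step 6
    }
}
```

Because only one buffer's worth of data is in memory at a time, the approach does not depend on the file size. If an exception is thrown, this sketch leaves the temp file behind; real code would probably want to delete it.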
