How to deal with a very large text file?

I'm currently writing something that needs to handle very large text files (a few GiB at least). What's needed here (and this is fixed) is:

CSV-based, following

7 Answers

遥遥无期

2021-02-07 15:22

If the column count is fixed, I'd split the file logically and/or physically into columns and implement some wrappers/adapters for I/O tasks and for managing the file as a whole.

天涯浪人

2021-02-07 15:28

If you had fixed-width lines, then using a RandomAccessFile might solve a lot of your problems. I realise that your lines are probably not fixed width, but you could artificially impose this by adding an end-of-line indicator and then padding lines (e.g. with spaces).

This obviously works best if your file currently has a fairly uniform distribution of line lengths and doesn't have some lines that are very, very long. The downside is that this will artificially increase the size of your file.
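As a rough sketch of the padding idea: the class below stores every line in a record of exactly RECORD_SIZE bytes (content, space padding, then '\n'), so line n always starts at byte n * RECORD_SIZE. The class name, the record size of 32, and the restriction to ASCII content are assumptions made for illustration, not something from the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: each record is content + space padding + '\n',
// always exactly RECORD_SIZE bytes long.
class FixedWidthLines {
    static final int RECORD_SIZE = 32; // assumed record length, including the '\n'

    // Overwrites (or creates) the record for the given zero-based line number.
    static void writeLine(RandomAccessFile file, long lineNo, String line) throws IOException {
        byte[] content = line.getBytes(StandardCharsets.US_ASCII); // assumes ASCII-only data
        if (content.length > RECORD_SIZE - 1) {
            throw new IllegalArgumentException("line too long for a fixed-width record");
        }
        byte[] record = new byte[RECORD_SIZE];
        Arrays.fill(record, (byte) ' ');
        System.arraycopy(content, 0, record, 0, content.length);
        record[RECORD_SIZE - 1] = '\n';
        file.seek(lineNo * RECORD_SIZE);
        file.write(record);
    }

    // Reads the record for the given line number and strips the padding.
    static String readLine(RandomAccessFile file, long lineNo) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        file.seek(lineNo * RECORD_SIZE);
        file.readFully(record);
        int end = RECORD_SIZE - 1; // ignore the trailing '\n'
        while (end > 0 && record[end - 1] == ' ') {
            end--; // drop padding spaces
        }
        return new String(record, 0, end, StandardCharsets.US_ASCII);
    }
}
```

Note that this sketch cannot tell padding apart from genuine trailing spaces in a line, so a real format would need a separate end-of-content marker if trailing spaces matter.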

别那么骄傲

2021-02-07 15:34
It's very difficult to maintain a 1:1 mapping between a sequence of Java chars (which are effectively UTF-16 code units) and bytes, which could be anything depending on your file encoding. Even with UTF-8, the "obvious" mapping of 1 byte to 1 char only works for ASCII. Neither UTF-16 nor UTF-8 guarantees that a Unicode character can be stored in a single Java char or a single byte.
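To make the mismatch concrete, here is a tiny sketch that counts the same string three ways. The class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing three different "lengths" of one string.
class CharVsByte {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Java chars, i.e. UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (what most people mean by "characters").
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For "\u00e9" (é) the UTF-8 length is 2 bytes but there is only 1 char, and for a character outside the Basic Multilingual Plane such as "\uD83D\uDE00" there are 4 UTF-8 bytes, 2 chars, and 1 code point.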

I would maintain my window into the file as a byte buffer, not a char buffer. Then to find line endings in the byte buffer, I'd encode the Java string "\r\n" (or possibly just "\n") as a byte sequence using the same encoding as the file is in. I'd then use that byte sequence to search for line endings in the byte buffer. The position of a line ending in the buffer + the offset of the buffer from the start of the file maps exactly to the byte position in the file of the line ending.
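A minimal sketch of that search, assuming the window is a ByteBuffer whose index 0 corresponds to byte windowOffset in the file. The class and method names are hypothetical. Two caveats: for UTF-16 you would pass StandardCharsets.UTF_16LE or UTF_16BE rather than UTF_16 (which prepends a byte order mark when encoding) and should only accept matches at even offsets, and a line ending that straddles two windows is not handled here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans a byte window for an encoded line terminator.
class LineEndingScanner {
    // Returns the absolute file offsets at which the encoded newline starts.
    // windowOffset is the file position of buffer index 0.
    static List<Long> findLineEndings(ByteBuffer window, long windowOffset,
                                      String newline, Charset charset) {
        byte[] pattern = newline.getBytes(charset);
        List<Long> offsets = new ArrayList<>();
        int last = window.limit() - pattern.length;
        for (int i = window.position(); i <= last; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (window.get(i + j) != pattern[j]) { // absolute get, position unchanged
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add(windowOffset + i);
                i += pattern.length - 1; // continue after the terminator
            }
        }
        return offsets;
    }
}
```

The same method works on a MappedByteBuffer obtained from FileChannel.map, since it only uses absolute reads.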

Appending lines is just a case of seeking to the end of the file and adding your new lines. Changing lines is trickier. I think I would maintain a list or map of the byte positions of changed lines and what each change is. When ready to write the changes:
1. Sort the list of changes by byte position.
2. Read the original file up to the next change and write it to a temporary file.
3. Write the changed line to the temporary file.
4. Skip the changed line in the original file.
5. Go back to step 2 until every change has been written, then copy the rest of the original file to the temporary file.
6. Move the temporary file over the original file.
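
The steps above can be sketched roughly as follows. The class LineRewriter and its Change type are hypothetical names; a SortedMap keyed by byte position takes care of the sorting, and each change is assumed to carry the byte length of the old line (terminator included) plus the replacement bytes. Changes are assumed not to overlap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical sketch of the "rewrite through a temporary file" approach.
class LineRewriter {
    // One pending edit: how many bytes the old line occupies and what replaces it.
    static final class Change {
        final int oldLength;
        final byte[] replacement;
        Change(int oldLength, byte[] replacement) {
            this.oldLength = oldLength;
            this.replacement = replacement;
        }
    }

    // changes maps the byte position of each changed line to its Change (step 1: sorted).
    static void apply(Path original, SortedMap<Long, Change> changes) throws IOException {
        Path temp = Files.createTempFile(original.toAbsolutePath().getParent(), "rewrite", ".tmp");
        try (InputStream in = Files.newInputStream(original);
             OutputStream out = Files.newOutputStream(temp)) {
            long position = 0;
            for (Map.Entry<Long, Change> e : changes.entrySet()) {
                transfer(in, out, e.getKey() - position);   // step 2: copy unchanged bytes
                out.write(e.getValue().replacement);        // step 3: write the new line
                transfer(in, null, e.getValue().oldLength); // step 4: skip the old line
                position = e.getKey() + e.getValue().oldLength;
            }
            transfer(in, out, Long.MAX_VALUE);              // copy the remainder
        }
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING); // step 6
    }

    // Reads up to count bytes from in; writes them to out unless out is null.
    private static void transfer(InputStream in, OutputStream out, long count) throws IOException {
        byte[] buf = new byte[8192];
        while (count > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, count));
            if (n < 0) {
                break; // end of the original file
            }
            if (out != null) {
                out.write(buf, 0, n);
            }
            count -= n;
        }
    }
}
```

Creating the temporary file in the same directory as the original keeps the final move on one file system, which is usually what you want for a cheap rename.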

傲寒

2021-02-07 15:35

How about a table of offsets at somewhat regular intervals in the file, so you can restart parsing somewhere near the spot you are looking for?

The idea would be that these would be byte offsets where the encoding is in its initial state (i.e. if the data was ISO-2022 encoded, then this spot would be in the ASCII-compatible mode). Any index into the data would then consist of a pointer into this table plus whatever is required to find the actual row. If you place the restart points such that each area between two points fits into the mmap window, then you can omit the check/remap/restart code from the parsing layer and use a parser that assumes that the data is sequentially mapped.
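
Here is a minimal sketch of such an offset table, under simplifying assumptions of my own: the file uses a stateless, ASCII-compatible encoding such as UTF-8 (so the byte 0x0A always means a line feed), and a restart point is recorded every `interval` lines rather than every so many bytes. The class name SparseLineIndex is hypothetical, and the byte-at-a-time reads are for clarity, not speed.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sparse index: byte offsets of lines 0, interval, 2*interval, ...
class SparseLineIndex {
    private final List<Long> restartOffsets = new ArrayList<>();
    private final int interval;

    // Builds the table with one sequential pass over the file.
    SparseLineIndex(Path file, int interval) throws IOException {
        this.interval = interval;
        restartOffsets.add(0L);
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long offset = 0;
            long line = 0;
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n' && ++line % interval == 0) {
                    restartOffsets.add(offset); // start of the line after this '\n'
                }
            }
        }
    }

    // Returns the byte offset where the given line starts, or -1 if there is no such line.
    long offsetOfLine(RandomAccessFile file, long lineNo) throws IOException {
        int slot = (int) (lineNo / interval);
        if (slot >= restartOffsets.size()) {
            return -1;
        }
        long offset = restartOffsets.get(slot);
        file.seek(offset); // restart parsing at the nearest earlier restart point
        long remaining = lineNo % interval;
        while (remaining > 0) {
            int b = file.read();
            if (b == -1) {
                return -1;
            }
            offset++;
            if (b == '\n') {
                remaining--;
            }
        }
        return offset < file.length() ? offset : -1;
    }
}
```

With an interval chosen so that the bytes between two restart points fit the mapped window, the scan from a restart point never has to cross a window boundary.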

渐次进展

2021-02-07 15:38

CharBuffer assumes all characters are UTF-16 or UCS-2 (perhaps someone knows the difference)

The problem with using a proper text format is that you need to read every byte to know where the n-th character or the n-th line is. I use multi-GB text files but assume 7-bit ASCII data, and I only read/write sequentially.

If you want random access on an unindexed text file, you can't expect it to be performant.

If you are willing to buy a new server, you can get one with 24 GB of memory for around £1,800 and one with 64 GB for around £4,200. These would allow you to load even multi-GB files into memory.

星月不相逢

2021-02-07 15:40
- Finding the start of a line:
Stick with UTF-8 and \n as the end-of-line marker, and this should not be a problem. Alternatively you can allow UTF-16 and recognize the data: it has to be quoted (for instance), has N commas (or semicolons), and then another end of line. You can read the header to learn how many columns the structure has.
- Inserting into the middle of the file:
This can be achieved by reserving some space at the end/beginning of each line.
- Appending lines at the end:
That's trivial as long as the file is locked (as for any other modification).
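
For the appending case, a minimal sketch could look like this, assuming UTF-8 data with \n line endings. The class name LockedAppender is made up. Keep in mind that FileChannel locks are held on behalf of the whole JVM and are only advisory on many platforms, so they coordinate cooperating processes rather than protect against arbitrary writers.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends one line while holding an exclusive file lock.
class LockedAppender {
    static void appendLine(Path file, String line) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
             FileLock lock = channel.lock()) { // blocks until the exclusive lock is granted
            ByteBuffer bytes = ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8));
            while (bytes.hasRemaining()) {
                channel.write(bytes); // APPEND mode writes at the current end of file
            }
        } // lock is released first, then the channel is closed
    }
}
```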