Access File through multiple threads

天涯浪人 2021-01-31 10:52

I want to read a large file (anywhere from 30 MB to 1 GB) using 10 threads, process each line in the file, and write the results to another file, also using 10 threads.

10 answers

无人共我 (OP)

2021-01-31 11:22
Be aware that the ideal number of threads is limited by the hardware architecture and other factors (you could use a thread pool and let its sizing guide the number of threads). Assuming that "10" is a good number, we proceed. =)
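To make the thread-count point concrete, here is a minimal sketch of capping the requested number by what the machine offers. The helper name `pick_worker_count` and the "4 threads per core for I/O-bound work" multiplier are assumptions for illustration, not a rule; only `os.cpu_count()` is a real standard-library call.

```python
import os

def pick_worker_count(requested: int = 10, io_bound: bool = True) -> int:
    """Hypothetical helper: cap the requested thread count by the core count."""
    cores = os.cpu_count() or 1          # cpu_count() may return None
    # Assumption: I/O-bound work tolerates a few threads per core,
    # CPU-bound work rarely benefits from more threads than cores.
    cap = cores * 4 if io_bound else cores
    return max(1, min(requested, cap))

print(pick_worker_count(10))
```

Measuring with the real workload is still the only reliable way to settle on a number.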

If you are looking for performance, you could do the following:
- Read the file with the threads you have and process each line according to your business rule. Keep one control variable that holds the number of the next line expected in the output file.
- If the line that just finished processing is the next expected one, append it to a buffer (a Queue). (Writing directly to the output file would be ideal, but you would run into locking problems.) Otherwise, store this "future" line in a binary search tree ordered by line position. A balanced binary search tree gives you "O(log n)" search and insertion, which is fast enough for this context. Keep filling the tree until the next expected line is done processing.
- Start a thread that is responsible for opening the output file, consuming the buffer periodically, and writing the lines to the file.
- Also keep track of the smallest line number stored in the BST. You can use it to check whether the next expected line is in the tree before searching it.
- When the next expected line is done processing, insert it into the Queue and check whether the line after it is already in the binary search tree. If it is, remove that node from the tree, append its content to the Queue, and repeat the check for the following line.
- Repeat this procedure until all lines are done processing, the tree is empty, and the Queue is empty.
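The reordering steps above could look roughly like the sketch below. Python's standard library has no binary search tree, so a `heapq` min-heap is swapped in here: it also gives O(log n) insert and remove, and its first element is always the smallest pending line number, which is the "minimum node" check described above. The class name `ReorderBuffer` and its method names are made up for this example.

```python
import heapq
import threading
from collections import deque

class ReorderBuffer:
    """Collects (line_no, text) results in any order, releases them in order."""

    def __init__(self):
        self.next_expected = 0        # control variable: next line to output
        self.pending = []             # min-heap of early "future" lines
        self.ready = deque()          # the Queue the writer would consume
        self._lock = threading.Lock() # submit() may be called from many threads

    def submit(self, line_no: int, text: str) -> None:
        with self._lock:
            if line_no != self.next_expected:
                # Arrived early: park it until its turn comes.
                heapq.heappush(self.pending, (line_no, text))
                return
            self.ready.append(text)
            self.next_expected += 1
            # Drain every parked line that is now next in sequence.
            while self.pending and self.pending[0][0] == self.next_expected:
                _, parked = heapq.heappop(self.pending)
                self.ready.append(parked)
                self.next_expected += 1

buf = ReorderBuffer()
for no, text in [(2, "c"), (0, "a"), (3, "d"), (1, "b")]:
    buf.submit(no, text)
print(list(buf.ready))  # ['a', 'b', 'c', 'd']
```

Since only the exact next line number is ever looked up, a plain dict keyed by line number would work as well; the heap is used here because it stays closest to the tree-based description.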
This approach uses:
- O(n) to read the file (but parallelized)
- O(1) to insert each ordered line into the Queue
- O(log n) for each insertion into and each removal from the binary search tree
- O(n) to write the new file

plus the costs of your business rule and I/O operations.
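For what it is worth, the whole pipeline with a dedicated writer thread might be wired together as in the sketch below. It departs from the description in two ways that should be read as assumptions: a single reader (the main thread) feeds the workers through a bounded queue instead of 10 threads reading the file themselves, and the writer blocks on its queue rather than polling periodically. `process` is a placeholder business rule, and error handling is left out (a worker that raises would leave the writer waiting).

```python
import heapq
import queue
import threading

STOP = object()  # sentinel: no more items will arrive on this queue

def process(line: str) -> str:
    """Hypothetical business rule; replace with the real per-line work."""
    return line.upper()

def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    while True:
        item = tasks.get()
        if item is STOP:
            break
        line_no, line = item
        results.put((line_no, process(line)))

def writer(results: queue.Queue, out_path: str) -> None:
    next_expected = 0
    pending = []  # min-heap of (line_no, text) that arrived early
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            item = results.get()
            if item is STOP:
                break
            heapq.heappush(pending, item)
            # Write every line that is now next in sequence.
            while pending and pending[0][0] == next_expected:
                out.write(heapq.heappop(pending)[1])
                next_expected += 1

def run(in_path: str, out_path: str, n_workers: int = 10) -> None:
    tasks = queue.Queue(maxsize=10_000)  # bounded, so a 1 GB input is not held in memory
    results = queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    writer_thread = threading.Thread(target=writer, args=(results, out_path))
    for t in workers + [writer_thread]:
        t.start()
    with open(in_path, encoding="utf-8") as src:
        for numbered_line in enumerate(src):  # (line_no, line) pairs
            tasks.put(numbered_line)
    for _ in workers:
        tasks.put(STOP)                       # one sentinel per worker
    for t in workers:
        t.join()
    results.put(STOP)                         # all results are queued by now
    writer_thread.join()
```

Calling `run("input.txt", "output.txt")` (both paths are placeholders) would produce the output file with lines in their original order.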

Hope it helps.