Efficient substring search in a large text file containing 100 million strings (no duplicates)

有刺的猬 2021-02-10 07:50

I have a large text file (1.5 GB) containing 100 million strings (no duplicates), arranged one per line. I want to build a web application in which a user enters a keyword and every line containing it is returned within a few seconds.

4 Answers
  •  忘掉有多难
    2021-02-10 08:07

    Is there expected to be a lot of overlap in your keywords? If so, you might be able to store a hash map from keyword (String) to file locations (an ArrayList of byte offsets). You can't store all the lines themselves in memory, though, given the per-object overhead.
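
    A minimal sketch of that index-building pass, assuming whitespace-separated keywords, a single-byte encoding such as ASCII, and an illustrative class name KeywordOffsetIndex:

    ```java
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: map each keyword to the byte offsets of the lines containing it.
    public class KeywordOffsetIndex {
        public static Map<String, List<Long>> build(String path) throws IOException {
            Map<String, List<Long>> index = new HashMap<>();
            long offset = 0;
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    for (String keyword : line.split("\\s+")) {
                        index.computeIfAbsent(keyword, k -> new ArrayList<>()).add(offset);
                    }
                    // +1 for the '\n'; assumes a single-byte encoding so
                    // char count equals byte count.
                    offset += line.length() + 1;
                }
            }
            return index;
        }
    }
    ```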

    Once you have a file location, you can seek to it in the text file, look nearby for the enclosing newline characters, and return the line. That will definitely take less than 4 seconds. Here is a little info on that. If this is just a little exercise, that would work fine.
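
    A sketch of that lookup step using java.io.RandomAccessFile: seek near the recorded offset, scan backwards to the previous newline, then read the enclosing line. The LineFetcher name is illustrative, and it again assumes a single-byte encoding:

    ```java
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: given a byte offset somewhere inside a line, find the line's
    // start by scanning backwards for a newline, then read the whole line.
    public class LineFetcher {
        public static String lineAt(RandomAccessFile file, long offset) throws IOException {
            long start = offset;
            while (start > 0) {
                file.seek(start - 1);
                if (file.read() == '\n') break; // previous line ends here
                start--;
            }
            file.seek(start);
            return file.readLine(); // reads up to the next newline
        }
    }
    ```

    Opening the file once in read-only mode (new RandomAccessFile(path, "r")) and reusing the handle across lookups avoids paying the open cost per query; the byte-at-a-time backward scan is fine for a sketch but could be buffered in a real implementation.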

    A better solution, though, would be a two-tiered index: one tier mapping keywords to line numbers, and another mapping line numbers to line text. That will not fit in memory on your machine either, but there are good disk-based key-value stores that would work well. If this is anything beyond a toy problem, go the Redis route.
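
    The two-tier layout might look like the following using the Jedis client for Redis. The kw:<keyword> and line:<n> key names are just one possible scheme, not anything Redis prescribes:

    ```java
    import java.util.HashSet;
    import java.util.Set;
    import redis.clients.jedis.Jedis;

    // Sketch of the two-tier index in Redis:
    //   tier 1: set "kw:<keyword>"  -> line numbers containing the keyword
    //   tier 2: string "line:<n>"   -> the text of line n
    public class TwoTierIndex {
        private final Jedis jedis = new Jedis("localhost", 6379);

        public void indexLine(long lineNo, String text) {
            jedis.set("line:" + lineNo, text); // tier 2
            for (String keyword : text.split("\\s+")) {
                jedis.sadd("kw:" + keyword, String.valueOf(lineNo)); // tier 1
            }
        }

        public Set<String> search(String keyword) {
            Set<String> lines = new HashSet<>();
            for (String lineNo : jedis.smembers("kw:" + keyword)) {
                lines.add(jedis.get("line:" + lineNo));
            }
            return lines;
        }
    }
    ```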
