Efficient substring search in a large text file containing 100 millions strings(no duplicate string)

后端 未结 4 1745
有刺的猬
有刺的猬 2021-02-10 07:50

I have a large text file(1.5 Gb) having 100 millions Strings(no duplicate String) and all the Strings are arranged line by line in the file . i want to make a wepapplication in

相关标签:
4条回答
  • 2021-02-10 08:03

    Try to use hash tables. One more thing that can be done is any method similar to MAP-REDUCE. What i want to say is that you can try to use inverted index. Google uses the same technique. All you can create a file of stopwords where you can put words that can be ignored e.g. I, am, the, a, an, in, on etc.

    this is the only thing which i suppose is possible. I read somewhere that for searching, u can arrays.

    0 讨论(0)
  • 2021-02-10 08:06

    You could build a directory structure based on the first few letters of each word. For example:

    /A
    /A/AA
    /A/AB
    /A/AC
    ...
    /Z/ZU
    

    Under that structure, you can keep a file containing all the strings with the first characters matching the folder name. The first characters in your search term will narrow the selection down to a folder with a small fraction of your overall list. From there, you do can do a full search of just that file. If it's too slow, increase the depth of your directory tree to cover more letters.

    0 讨论(0)
  • 2021-02-10 08:07

    Is there expected to be a lot of overlap in your keywords? If so, you might be able to store a hash map from keyword (String) to file locations (ArrayList). You can not store all the lines in memory though with the object overhead.

    Once you have the file location, you can seek to that location in the text file and then look nearby to get the enclosing newline characters, returning the line. That will definitely be less than 4 seconds. Here is a little info on that. If this is just for a little exercise, that would work fine.

    A better solution though would be a two tiered index, one mapping keywords to line numbers, and then another mapping line numbers to line text. This will not fit in memory on your machine. There are great disk based key-value stores though that would work well. If this is anything beyond a toy problem, go with the Reddis route.

    0 讨论(0)
  • 2021-02-10 08:12

    Since you have more RAM than the size of the file, you might be able to store the entire data as a structure in the RAM and search it very quickly. A trie might be a good data structure to use; it does have fast prefix finding, but not sure how it performs for substrings.

    0 讨论(0)
提交回复
热议问题