I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.
I have a list of about 20,000 strings that I need to find in this file. Scanning the whole file once per string has already taken about 4 days, so I am looking for a faster approach.
The problem you describe looks more like a problem with the selected algorithm than with the technology of choice. 20,000 full scans of 20 GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20 GB file and a single pass over the 20K search strings.
Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.
Rather than searching 20,000 times for each string separately, you can tokenize the input and look each token up in a std::set holding the strings to be found; that will be much faster. This assumes your strings are simple identifiers, but something similar can be implemented for strings that are sentences: keep a set of the first word of each sentence and, after a successful match, verify with string::find that it really is the beginning of the whole sentence.
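A minimal sketch of that approach, assuming the search strings are single whitespace-delimited tokens; the file names patterns.txt and huge.txt are placeholders:

#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main() {
    // Load the ~20K search strings, one per line (patterns.txt is a placeholder name).
    std::set<std::string> patterns;
    std::ifstream pattern_file("patterns.txt");
    std::string pattern;
    while (std::getline(pattern_file, pattern))
        patterns.insert(pattern);

    // Single pass over the big file: tokenize each line and look every
    // token up in the set (one O(log n) lookup per token).
    std::ifstream huge_file("huge.txt");   // placeholder name
    std::string line;
    while (std::getline(huge_file, line)) {
        std::istringstream tokens(line);
        std::string token;
        while (tokens >> token)
            if (patterns.count(token))
                std::cout << "match: " << token << '\n';
    }
}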
Algorithmically, I think the best way to approach this problem would be to use a tree (a trie) that stores the strings you want to search for, one character at a time. For example, if you have the following patterns you would like to look for:
hand, has, have, foot, file
The resulting tree would look something like this:
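(root)
 |
 +-- h -- a --+-- n -- d      (hand)
 |            +-- s           (has)
 |            +-- v -- e      (have)
 |
 +-- f --+-- o -- o -- t      (foot)
         +-- i -- l -- e      (file)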
The generation of the tree is worst case O(n), where n is the combined length of the patterns, and the shared prefixes generally give it a sub-linear memory footprint.
Using this structure, you can process your huge file by reading it one character at a time and walking the tree.
This technique would result in linear time O(n) to check for matches, and the huge 20 GB file is scanned only once.
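A sketch of how such a tree could be built in C++; the Node layout below is just one possible representation:

#include <map>
#include <memory>
#include <string>
#include <vector>

// One node per character; is_end marks the last character of a pattern.
struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool is_end = false;
};

// Insert each pattern character by character; the total work is
// proportional to the combined length of all patterns.
Node construct_tree(const std::vector<std::string>& patterns) {
    Node root;
    for (const std::string& p : patterns) {
        Node* current = &root;
        for (char c : p) {
            auto& child = current->children[c];
            if (!child)
                child = std::make_unique<Node>();
            current = child.get();
        }
        current->is_end = true;
    }
    return root;
}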
The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode for the complete version of the algorithm:
tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])

# Keeps track of where we currently are in the tree
nodes = []
for character in huge_file:
    next_nodes = []
    for node in nodes:
        if node.has_child(character):
            child = node.get_child(character)
            if child.is_leaf():
                # You found a match!!
            next_nodes.append(child)
    # A new match can also start at the current character
    if tree.has_child(character):
        next_nodes.append(tree.get_child(character))
    nodes = next_nodes
Note that the list of nodes that has to be checked at each step contains at most as many entries as the length of the longest search word, so it should not add much complexity.
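Filled in as compilable C++, the whole approach might look roughly like this. It repeats the Node/construct_tree sketch from above so it can stand alone, uses huge.txt as a placeholder file name, and marks the end of each pattern with an is_end flag instead of a strict leaf test, which also covers search terms that are prefixes of other terms:

#include <fstream>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Same trie layout as in the construction sketch above.
struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool is_end = false;   // set on the last character of a pattern
};

Node construct_tree(const std::vector<std::string>& patterns) {
    Node root;
    for (const std::string& p : patterns) {
        Node* current = &root;
        for (char c : p) {
            auto& child = current->children[c];
            if (!child)
                child = std::make_unique<Node>();
            current = child.get();
        }
        current->is_end = true;
    }
    return root;
}

int main() {
    Node tree = construct_tree({"hand", "has", "have", "foot", "file"});

    // Active cursors: one per partial match currently in progress.
    std::vector<const Node*> nodes;

    std::ifstream huge_file("huge.txt");   // placeholder file name
    long long position = 0;
    char character;
    while (huge_file.get(character)) {
        std::vector<const Node*> next_nodes;

        // Advance every active cursor by one character; cursors with no
        // matching edge are simply not carried over.
        for (const Node* node : nodes) {
            auto it = node->children.find(character);
            if (it != node->children.end()) {
                const Node* child = it->second.get();
                if (child->is_end)
                    std::cout << "match ending at byte " << position << '\n';
                next_nodes.push_back(child);
            }
        }

        // A new match can also start at the current character.
        auto root_it = tree.children.find(character);
        if (root_it != tree.children.end()) {
            const Node* child = root_it->second.get();
            if (child->is_end)
                std::cout << "match ending at byte " << position << '\n';
            next_nodes.push_back(child);
        }

        nodes = std::move(next_nodes);
        ++position;
    }
}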