Search Large Text File for Thousands of strings

无人共我 2021-01-15 03:31

I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.

I hav

3 answers
  • 2021-01-15 04:15

    The problem you describe looks more like a problem with the chosen algorithm than with the technology. 20,000 full scans of 20 GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20 GB file and a single scan of the 20K words.

    Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.
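As a rough illustration of the Aho–Corasick idea mentioned above, here is a minimal Python sketch. The function names `build_automaton` and `find_all` are made up for this example, and the code assumes non-empty patterns and text that can be fed in one line at a time. It is not tuned for a 20 GB input; it only shows how one pass over the text can report every occurrence of every pattern.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton as three parallel lists indexed by state."""
    goto = [{}]    # goto[state][char] -> next state (state 0 is the root)
    fail = [0]     # fail[state] -> state for the longest proper suffix
    output = [[]]  # output[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].append(pattern)

    # Breadth-first pass: failure links of shallow states are known
    # before they are needed for deeper states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, child in goto[state].items():
            queue.append(child)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(char, 0)
            # Inherit the patterns that end at the suffix state.
            output[child] = output[child] + output[fail[child]]
    return goto, fail, output


def find_all(text, automaton):
    """Yield (end_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, output = automaton
    state = 0
    for index, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in output[state]:
            yield index, pattern
```

For the file in the question, one would presumably build the automaton once from the 20,000 strings and then call `find_all` on each line while iterating over the file, so the file is read a single time.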

  • 2021-01-15 04:17

    Rather than searching for each of the 20,000 strings separately, you can tokenize the input and look each token up in a std::set holding the strings to be found; this will be much faster. This assumes your strings are simple identifiers, but something similar can be implemented for strings that are sentences. In that case you would keep a set of the first words of each sentence and, after a successful match, verify with string::find that it really is the beginning of the whole sentence.
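The tokenize-and-look-up idea can be sketched as follows. This is Python rather than C++, with a built-in `set` and `dict` standing in for `std::set`. The names `TOKEN`, `find_identifiers` and `find_sentences` are hypothetical, and the sketch assumes that tokens are runs of word characters and that every sentence to be found starts with a word character.

```python
import re

# Assumption: a token is a run of word characters.
TOKEN = re.compile(r"\w+")


def find_identifiers(lines, wanted):
    """Yield (line_number, token) for each token that is one of the wanted strings."""
    wanted = set(wanted)
    for line_number, line in enumerate(lines, start=1):
        for token in TOKEN.findall(line):
            if token in wanted:
                yield line_number, token


def find_sentences(lines, sentences):
    """Yield (line_number, sentence) for each sentence found in a line."""
    # Index the sentences by their first word.
    by_first_word = {}
    for sentence in sentences:
        first = TOKEN.search(sentence)
        if first:
            by_first_word.setdefault(first.group(), []).append(sentence)
    for line_number, line in enumerate(lines, start=1):
        for match in TOKEN.finditer(line):
            # Cheap lookup on the first word, then verify the whole sentence.
            for sentence in by_first_word.get(match.group(), ()):
                if line.startswith(sentence, match.start()):
                    yield line_number, sentence
```

Both functions accept any iterable of lines, so an open file object can be passed directly and the file is read only once.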

  • 2021-01-15 04:20

    Algorithmically, I think the best way to approach this problem is to use a tree (a trie) that stores the lines you want to search for one character at a time. For example, suppose you have the following patterns you would like to look for:

    hand, has, have, foot, file
    

    The resulting tree would look something like this:

    [Figure: tree generated by the list of search terms]

    Generating the tree is worst case O(n) in the total number of characters in the search terms, and its memory footprint is generally sub-linear, because terms with a common prefix share nodes.

    Using this structure, you can begin processing your file by reading one character at a time from your huge file and walking the tree.

    • If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
    • If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.

    This technique checks for matches in linear time, O(n), and scans the huge 20 GB file only once.
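The walk described above can be sketched in Python, with the tree stored as nested dictionaries. The names `build_trie` and `lines_starting_with_term` are made up for this example. As in the description, a line is only checked from its first character, and a match is reported when a leaf is reached.

```python
def build_trie(terms):
    """Store the terms one character per level; a leaf is an empty dict."""
    root = {}
    for term in terms:
        node = root
        for char in term:
            node = node.setdefault(char, {})
    return root


def lines_starting_with_term(lines, trie):
    """Yield (line_number, term) for each line that begins with a search term."""
    for line_number, line in enumerate(lines, start=1):
        node = trie
        for position, char in enumerate(line):
            if char not in node:
                break  # no child for this letter: discard the line
            node = node[char]
            if not node:  # leaf reached: the characters read so far are a term
                yield line_number, line[:position + 1]
                break
```

With the five example terms, the root has only the children `h` and `f`, and `hand`, `has` and `have` share the nodes for `h` and `a`.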

    Edit

    The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode of the complete version of the algorithm:

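    def construct_tree(words):
        # Trie as nested dicts: one level per character; a leaf is an empty dict
        root = {}
        for word in words:
            node = root
            for character in word:
                node = node.setdefault(character, {})
        return root

    tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
    # Keeps track of where I currently am in the tree, one entry per partial match
    nodes = []
    for character in huge_file:  # huge_file is assumed to be an iterable of characters
        # Advance each partial match; those without a matching child are dropped
        nodes = [node[character] for node in nodes if character in node]
        # A new match may start at this character
        if character in tree:
            nodes.append(tree[character])
        for node in nodes:
            if not node:  # leaf reached
                print('You found a match!!')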
    

    Note that the list of nodes that has to be checked each time is at most as long as the longest word being searched for, so it should not add much complexity.
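The restriction on terms with common roots, such as go and gone, comes from treating only leaves as matches. One possible way around it, which is not part of the answer above, is to mark the end of every term explicitly. The sketch below assumes a hypothetical marker key `END` and hypothetical function names `build_marked_trie` and `scan`.

```python
END = None  # hypothetical marker key: present on nodes where a term ends


def build_marked_trie(terms):
    """Nested-dict trie in which node[END] holds the term that ends at that node."""
    root = {}
    for term in terms:
        node = root
        for char in term:
            node = node.setdefault(char, {})
        node[END] = term
    return root


def scan(text, trie):
    """Yield (end_index, term) for every occurrence of every term in text."""
    active = []  # one node per partial match currently in progress
    for index, char in enumerate(text):
        # Advance the partial matches; drop those that cannot continue.
        active = [node[char] for node in active if char in node]
        # A new match may start at this character.
        if char in trie:
            active.append(trie[char])
        for node in active:
            if END in node:
                yield index, node[END]
```

Because each entry in `active` corresponds to a different starting position, the list never grows beyond the length of the longest term, which matches the bound stated above.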
