Efficient algorithm for finding all keywords in a text

后端 未结 5 1146
傲寒
傲寒 2021-02-08 21:08

I have lots of strings containing text in lots of different spellings. I am tokenizing these strings by searching for keywords and if a keyword is found I use an assoicated text

5条回答
  •  攒了一身酷
    2021-02-08 21:55

    This seems to fit "Algorithms using finite set of patterns"

    The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).

    The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).

提交回复
热议问题