I have lots of strings containing text in many different spellings. I am tokenizing these strings by searching for keywords, and if a keyword is found I use an associated text.
I would use precompiled regular expressions, one for each group of keywords to match. Behind the scenes these are "compiled" into finite automata, so they are fast at recognizing the pattern in your string and much faster than a Contains check for each of the possible strings.
using System.Text.RegularExpressions;
In your example:
new Regex(@"schw(a?\.|arz)", RegexOptions.Compiled)
Further documentation available here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions(v=VS.90).aspx
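As a sketch of the idea (in Python rather than C#, since the regex approach itself is language-agnostic): one precompiled pattern with a named group per keyword group maps every spelling back to a canonical key. The keyword spellings below are made-up examples, not from the question.

```python
import re

# One precompiled pattern with a named group per keyword group.
# Inner groups are non-capturing so that Match.lastgroup reports
# the named group that matched.
canonical = {
    "schwarz": r"schw(?:a?\.|arz)",   # schwarz / schw. / schwa.
    "gruen": r"gr(?:ü|ue?)n",         # grün / gruen / grun
}

pattern = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in canonical.items()),
    re.IGNORECASE,
)

def find_keywords(text):
    # lastgroup is the name of the group that produced each match.
    return [m.lastgroup for m in pattern.finditer(text)]

print(find_keywords("Schwarz, schw. und grün"))
# ['schwarz', 'schwarz', 'gruen']
```

The same structure carries over to .NET: build one alternation of named groups, compile it once with RegexOptions.Compiled, and inspect which group matched.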
I suggest two approaches:
1) Tokenise using string.Split
and match the tokens against a Dictionary of the keys you have
2) Implement the tokeniser yourself: a reader with a ReadToken()
method that adds characters to a buffer until it finds a split character (Split could be doing that) and outputs that as a token. Then you check each token against your dictionary.
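Both approaches can be sketched briefly (in Python for compactness; the keywords and replacement texts are made-up placeholders):

```python
import io
import re

keywords = {"schwarz": "black", "weiss": "white"}

# Approach 1: split the text on runs of non-word characters and look
# each token up in the dictionary.
def tokenize_and_match(text):
    tokens = re.split(r"\W+", text.lower())
    return [(t, keywords[t]) for t in tokens if t in keywords]

# Approach 2: a ReadToken()-style reader that buffers characters until
# it hits a separator, then emits the buffered token.
def read_tokens(reader, separators=" ,.;\n"):
    buf = []
    while True:
        ch = reader.read(1)
        if not ch:                 # end of input
            break
        if ch in separators:
            if buf:
                yield "".join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:
        yield "".join(buf)

print(tokenize_and_match("Der Hund ist schwarz, die Katze weiss."))
# [('schwarz', 'black'), ('weiss', 'white')]
print(list(read_tokens(io.StringIO("schwarz, weiss."))))
# ['schwarz', 'weiss']
```

Dictionary lookup per token is O(1) on average, so either way the total cost is roughly linear in the length of the input.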
This seems to fit "Algorithms using a finite set of patterns":
The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
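A compact sketch of the construction described above, in Python (a trie over the patterns, failure links filled in by BFS, then a single scan of the text; not production code):

```python
from collections import deque

def build_automaton(patterns):
    # One entry per trie node: transitions, failure link, output set.
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # BFS from the root's children to fill in failure links.
    q = deque(goto[0].values())
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit suffix matches
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

# The quadratic case mentioned above: 4 + 3 + 2 + 1 = 10 matches.
print(len(search("aaaa", ["a", "aa", "aaa", "aaaa"])))  # 10
```

Note how the output sets inherited along failure links are exactly what makes all patterns match "at once" in one pass.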
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).
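A minimal Rabin–Karp sketch in Python for patterns of one shared length (in general you run one pass per distinct pattern length); the base and modulus are arbitrary choices for illustration:

```python
def rabin_karp(text, patterns):
    # Assumes all patterns share the same length m.
    m = len(patterns[0])
    base, mod = 256, (1 << 61) - 1

    def h(s):
        v = 0
        for ch in s:
            v = (v * base + ord(ch)) % mod
        return v

    # Hash every pattern once; group patterns by hash value.
    targets = {}
    for p in patterns:
        targets.setdefault(h(p), set()).add(p)

    if len(text) < m:
        return []
    power = pow(base, m - 1, mod)   # weight of the leading character
    hv = h(text[:m])
    hits = []
    for i in range(len(text) - m + 1):
        # Verify the substring to rule out hash collisions.
        if hv in targets and text[i:i + m] in targets[hv]:
            hits.append((i, text[i:i + m]))
        if i + m < len(text):
            # Roll the hash: drop text[i], append text[i + m].
            hv = ((hv - ord(text[i]) * power) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("abracadabra", ["abr", "cad"]))
# [(0, 'abr'), (4, 'cad'), (7, 'abr')]
```

The rolling hash is why adding more patterns of the same length costs almost nothing per window: one hash lookup covers the whole set.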
Maybe it's a little overpowered, but you should definitely take a look at ANTLR.
If you have a fixed set of keywords, you can use (f)lex, re2c or ragel.