I have lots of strings containing text in many different spellings. I am tokenizing these strings by searching for keywords, and if a keyword is found I use an associated text.
I would use precompiled regular expressions, one for each group of keywords to match. Behind the scenes these are "compiled" into finite automata, so they are fast at recognizing the pattern in your string and much faster than a Contains check for each of the possible strings.
using System.Text.RegularExpressions;
In your example:
new Regex(@"schw(a?\.|arz)", RegexOptions.Compiled)
Further documentation available here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions(v=VS.90).aspx
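As a sketch of the idea (in Python rather than C#, since the regex approach itself is language-agnostic): one precompiled pattern with a named group per keyword group maps every spelling back to a canonical key. The keyword spellings below are made-up examples, not from the question.

```python
import re

# One precompiled pattern with a named group per keyword group.
# Inner groups are non-capturing so that Match.lastgroup reports
# the named group that matched.
canonical = {
    "schwarz": r"schw(?:a?\.|arz)",   # schwarz / schw. / schwa.
    "gruen": r"gr(?:ü|ue?)n",         # grün / gruen / grun
}

pattern = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in canonical.items()),
    re.IGNORECASE,
)

def find_keywords(text):
    # lastgroup is the name of the group that produced each match.
    return [m.lastgroup for m in pattern.finditer(text)]

print(find_keywords("Schwarz, schw. und grün"))
# ['schwarz', 'schwarz', 'gruen']
```

The same structure carries over to .NET: build one alternation of named groups, compile it once with RegexOptions.Compiled, and inspect which group matched.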
I suggest two approaches:
1) Tokenise using string.Split
and match the tokens against a Dictionary of the keys you have
2) Implement the tokeniser yourself: a reader with a ReadToken()
method that adds characters to a buffer until it finds a split character (Split could be doing that) and outputs that as a token. Then you check each token against your dictionary.
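Both approaches can be sketched briefly (in Python for compactness; the keywords and replacement texts are made-up placeholders):

```python
import io
import re

keywords = {"schwarz": "black", "weiss": "white"}

# Approach 1: split the text on runs of non-word characters and look
# each token up in the dictionary.
def tokenize_and_match(text):
    tokens = re.split(r"\W+", text.lower())
    return [(t, keywords[t]) for t in tokens if t in keywords]

# Approach 2: a ReadToken()-style reader that buffers characters until
# it hits a separator, then emits the buffered token.
def read_tokens(reader, separators=" ,.;\n"):
    buf = []
    while True:
        ch = reader.read(1)
        if not ch:                 # end of input
            break
        if ch in separators:
            if buf:
                yield "".join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:
        yield "".join(buf)

print(tokenize_and_match("Der Hund ist schwarz, die Katze weiss."))
# [('schwarz', 'black'), ('weiss', 'white')]
print(list(read_tokens(io.StringIO("schwarz, weiss."))))
# ['schwarz', 'weiss']
```

Dictionary lookup per token is O(1) on average, so either way the total cost is roughly linear in the length of the input.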
This seems to fit "Algorithms using a finite set of patterns":
The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
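A compact sketch of the construction described above, in Python (a trie over the patterns, failure links filled in by BFS, then a single scan of the text; not production code):

```python
from collections import deque

def build_automaton(patterns):
    # One entry per trie node: transitions, failure link, output set.
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # BFS from the root's children to fill in failure links.
    q = deque(goto[0].values())
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit suffix matches
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

# The quadratic case mentioned above: 4 + 3 + 2 + 1 = 10 matches.
print(len(search("aaaa", ["a", "aa", "aaa", "aaaa"])))  # 10
```

Note how the output sets inherited along failure links are exactly what makes all patterns match "at once" in one pass.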
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).
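A minimal Rabin–Karp sketch in Python for patterns of one shared length (in general you run one pass per distinct pattern length); the base and modulus are arbitrary choices for illustration:

```python
def rabin_karp(text, patterns):
    # Assumes all patterns share the same length m.
    m = len(patterns[0])
    base, mod = 256, (1 << 61) - 1

    def h(s):
        v = 0
        for ch in s:
            v = (v * base + ord(ch)) % mod
        return v

    # Hash every pattern once; group patterns by hash value.
    targets = {}
    for p in patterns:
        targets.setdefault(h(p), set()).add(p)

    if len(text) < m:
        return []
    power = pow(base, m - 1, mod)   # weight of the leading character
    hv = h(text[:m])
    hits = []
    for i in range(len(text) - m + 1):
        # Verify the substring to rule out hash collisions.
        if hv in targets and text[i:i + m] in targets[hv]:
            hits.append((i, text[i:i + m]))
        if i + m < len(text):
            # Roll the hash: drop text[i], append text[i + m].
            hv = ((hv - ord(text[i]) * power) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("abracadabra", ["abr", "cad"]))
# [(0, 'abr'), (4, 'cad'), (7, 'abr')]
```

The rolling hash is why adding more patterns of the same length costs almost nothing per window: one hash lookup covers the whole set.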
Maybe it's a little overpowered, but you should definitely take a look at ANTLR.
If you have a fixed set of keywords, you can use (f)lex, re2c or ragel.