I\'m trying to figure out if there\'s a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the
It really depends on what these regexes look like. If you don't have a lot regexes that will match almost anything like '.*
' or '\d+
', and instead you have regexes that contains mostly words and phrases or any fixed patterns longer than 4 characters (e.g.'a*b*c
' in ^\d+a\*b\*c:\s+\w+
) , as in your examples. You can do this common trick that scales well to millions of regexes:
Build a inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). Then at matching time, using Rabin-Karp hashing to compute sliding hashes and look up the inverted index, advancing one character at a time. You now have O(1) look-up for inverted-index non-matches and a reasonable O(k) time for matches, k is the average length of the lists of regexes in the inverted index. k can be quite small (less than 10) for many applications. The quality (false positive means bigger k, false negative means missed matches) of the inverted index depends on how well the indexer understands the regex syntax. If the regexes are generated by human experts, they can provide hints for contained fixed patterns as well.