Hashtable/dictionary/map lookup with regular expressions

难免孤独 2021-02-01 05:36

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the

19 answers
  •  清歌不尽
    2021-02-01 06:14

    It really depends on what these regexes look like. If few of them match almost anything (like '.*' or '\d+'), and most instead contain words, phrases, or other fixed patterns longer than 4 characters (e.g. 'a*b*c' in ^\d+a\*b\*c:\s+\w+), as in your examples, you can use this common trick, which scales well to millions of regexes:

    Build an inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). Then, at matching time, use Rabin-Karp hashing to compute sliding hashes over the input and look each one up in the inverted index, advancing one character at a time. This gives you O(1) lookup at positions where the index has no entry, and a reasonable O(k) at positions where it does, where k is the average length of the regex lists in the inverted index. k can be quite small (less than 10) for many applications. The quality of the inverted index depends on how well the indexer understands the regex syntax: a false positive means a bigger k, and a false negative means a missed match. If the regexes are written by human experts, they can also provide hints about the fixed patterns each regex contains.
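
The idea above can be sketched in Python roughly as follows. This is a minimal illustration and not a tuned implementation. Several things here are assumptions that do not come from the answer: the class name `RegexIndex`, the constants `WINDOW`, `BASE` and `MOD`, the choice to index only the first `WINDOW` characters of each fixed pattern (so that one sliding window length covers every entry), and the use of `re.search` for the final confirmation. The fixed pattern is passed in by the caller as a hint, since extracting it automatically requires parsing the regex syntax.

```python
import re
from collections import defaultdict

WINDOW = 4             # assumed window length; every hint must be at least this long
BASE = 257             # assumed polynomial base for the rolling hash
MOD = (1 << 61) - 1    # assumed large prime modulus


def rk_hash(s):
    """Polynomial (Rabin-Karp style) hash of a whole string."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


class RegexIndex:
    """Hypothetical inverted index: hash of a fixed substring -> regexes."""

    def __init__(self):
        self.buckets = defaultdict(list)
        # Weight of the character that leaves the window on each slide.
        self.top = pow(BASE, WINDOW - 1, MOD)
        self.count = 0

    def add(self, pattern, literal, value):
        """Register a regex with a caller-supplied fixed-pattern hint."""
        if len(literal) < WINDOW:
            raise ValueError("hint must be at least WINDOW characters long")
        key = rk_hash(literal[:WINDOW])
        self.buckets[key].append((self.count, re.compile(pattern), value))
        self.count += 1

    def lookup(self, text):
        """Return the values of all registered regexes that match text."""
        results = []
        if len(text) < WINDOW:
            return results
        tried = set()
        h = rk_hash(text[:WINDOW])
        for i in range(len(text) - WINDOW + 1):
            if i > 0:
                # Slide the window: drop text[i-1], append text[i+WINDOW-1].
                h = ((h - ord(text[i - 1]) * self.top) * BASE
                     + ord(text[i + WINDOW - 1])) % MOD
            # Most positions miss the index entirely: one dict probe each.
            for entry_id, regex, value in self.buckets.get(h, ()):
                if entry_id in tried:
                    continue
                tried.add(entry_id)
                # A hash hit only nominates a candidate; the regex decides.
                if regex.search(text):
                    results.append(value)
        return results


index = RegexIndex()
index.add(r"^\d+a\*b\*c:\s+\w+", "a*b*c", "rule-1")
index.add(r"error code \d+", "error code", "rule-2")
print(index.lookup("123a*b*c:  hello"))
print(index.lookup("fatal: error code 42"))
```

Regexes with no usable fixed pattern (such as '.*' or '\d+') cannot go into an index like this and would have to be tried separately on every lookup, which is why the approach works best when such regexes are rare.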
