I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the
What happens if you have a dictionary such as:
regex_dict = { re.compile("foo.*"): 5, re.compile("f.*"): 6 }
In this case, regex_dict["food"] could legitimately return either 5 or 6.
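To make the ambiguity concrete, here is a minimal sketch of the naive linear scan. The helper name all_matches is made up for illustration, and it assumes a key "matches" only when the pattern covers the whole input string.

```python
import re

regex_dict = {re.compile("foo.*"): 5, re.compile("f.*"): 6}


def all_matches(regex_dict, text):
    """Return the value of every pattern that matches the whole of `text`.

    Hypothetical helper: a plain O(n) scan over all keys.
    """
    return [
        value
        for pattern, value in regex_dict.items()
        if pattern.fullmatch(text)
    ]


# Both keys match "food", so a plain regex_dict["food"] has no single answer.
matches = all_matches(regex_dict, "food")
```

Any real lookup has to pick one of those values by some rule (first inserted, most specific, explicit priority, ...).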
Even ignoring that problem, there's probably no way to do this efficiently with the regex module. Instead, what you'd need is an internal directed graph or tree structure.
The problem has nothing to do with regular expressions - you'd have the same problem with a dictionary whose keys are functions or lambdas. So the problem you face is whether there is a way of classifying your functions to work out which ones will return true for a given input, and that isn't a search problem, because f(x) is not known in general beforehand.
Distributed programming, or caching answer sets (assuming there are common values of x), may help.
-- DM
A special case of this problem came up in the 70s AI languages oriented around deductive databases. The keys in these databases could be patterns with variables -- like regular expressions without the * or | operators. They tended to use fancy extensions of trie structures for indexes. See krep*.lisp in Norvig's Paradigms of AI Programming for the general idea.
If you have a small set of possible inputs, you can cache the matches as they appear in a second dict and get O(1) for the cached values.
If the set of possible inputs is too big to cache but not infinite, either, you can just keep the last N matches in the cache (check Google for "LRU maps" - least recently used).
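A rough sketch of the caching idea, using functools.lru_cache from the standard library. The PATTERNS list and the lookup function are invented for the example, and the cache is only valid as long as the set of patterns does not change.

```python
import re
from functools import lru_cache

# Hypothetical pattern table: (compiled regex, value) pairs, tried in order.
PATTERNS = [
    (re.compile(r"foo.*"), 5),
    (re.compile(r"ba+r"), 7),
]


@lru_cache(maxsize=1024)  # keep only the 1024 most recently used inputs
def lookup(text):
    """Slow path: linear scan over all patterns; results are memoized."""
    for pattern, value in PATTERNS:
        if pattern.fullmatch(text):
            return value
    return None
```

With maxsize=None the cache is unbounded, which corresponds to the "small set of possible inputs" case. Repeated lookups of the same string then cost one hash lookup instead of a scan over every regex.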
If you can't do this, you can try to chop down the number of regexps you have to try by checking a prefix or somesuch.
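One way the prefix check might look, as a sketch only. It assumes every regex is known to require a fixed literal first character, and that the caller supplies that character when adding the entry. The class name PrefixedRegexDict is made up.

```python
import re
from collections import defaultdict


class PrefixedRegexDict:
    """Buckets regexes by the literal first character they require."""

    def __init__(self):
        # first character -> list of (compiled pattern, value)
        self._buckets = defaultdict(list)

    def add(self, first_char, pattern, value):
        # Assumption: `pattern` can only match strings starting with
        # `first_char`; this is not verified here.
        self._buckets[first_char].append((re.compile(pattern), value))

    def get(self, text, default=None):
        # Only the regexes in the matching bucket are tried.
        for pattern, value in self._buckets.get(text[:1], ()):
            if pattern.fullmatch(text):
                return value
        return default
```

Patterns without a fixed first character (for example ".*foo") would need a catch-all bucket that is always scanned, which this sketch leaves out.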
I created this exact data structure for a project once. I implemented it naively, as you suggested. I did make two immensely helpful optimizations, which may or may not be feasible for you, depending on the size of your data:
To avoid the problem of multiple keys matching the input, I gave each regex key a priority and the highest priority was used.
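The priority idea could be sketched roughly like this. It is not the original project's code; the class name and the convention that a larger number means higher priority are assumptions.

```python
import re


class PriorityRegexDict:
    """Regex-keyed mapping where the highest-priority matching key wins."""

    def __init__(self):
        self._entries = []  # (priority, compiled pattern, value)

    def add(self, pattern, value, priority):
        self._entries.append((priority, re.compile(pattern), value))
        # Keep entries sorted so the first match found is the best one.
        self._entries.sort(key=lambda entry: entry[0], reverse=True)

    def __getitem__(self, text):
        for _priority, pattern, value in self._entries:
            if pattern.fullmatch(text):
                return value
        raise KeyError(text)
```

With "foo.*" added at priority 2 and "f.*" at priority 1, looking up "food" gives the value stored under "foo.*", and "fad" falls through to "f.*".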
The fundamental assumption is flawed, I think. You can't map hashes to regular expressions.