Hashtable/dictionary/map lookup with regular expressions

难免孤独 2021-02-01 05:36

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the

19 answers
  • 2021-02-01 06:27

    Here's an efficient way to do it: combine the keys into a single compiled regexp, so that no looping over key patterns is required. It abuses match.lastindex to find out which key matched. (It's a shame regexp libraries don't let you tag the terminal state of the DFA that a regexp is compiled to, or this would be less of a hack.)

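    The expression is compiled once, and the search then happens inside the regexp engine rather than in a Python loop over patterns. In an engine that compiles to a DFA, common prefixes are compiled together, so each character in the key is matched once, not many times, unlike some of the other suggested solutions. Python's re module is a backtracking engine, so there the alternatives may still be tried in turn, just without the per-pattern call overhead. You're effectively compiling a mini lexer for your keyspace.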

    This map isn't extensible (can't define new keys) without recompiling the regexp, but it can be handy for some situations.

    # Regular expression map
    # Abuses match.lastindex to figure out which key was matched
    # (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
    # Mostly for amusement.
    # Richard Brooksby, Ravenbrook Limited, 2013-06-01
    
    import re
    
    class ReMap(object):
    
        def __init__(self, items):
            if not items:
                items = [(r'epsilon^', None)] # Match nothing
            key_patterns = []
            self.lookup = {}
            index = 1
            for key, value in items:
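                # Turn plain capturing parens in the key into non-capturing
                # ones, because extra groups would mess up match.lastindex.
                # Escaped parens and (?...) constructs are left alone; named
                # groups and parens inside [...] are not handled.
                key_patterns.append('(' + re.sub(r'(?<!\\)\((?!\?)', '(?:', key) + ')')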
                self.lookup[index] = value
                index += 1
            self.keys_re = re.compile('|'.join(key_patterns))
    
        def __getitem__(self, key):
            m = self.keys_re.match(key)
            if m:
                return self.lookup[m.lastindex]
            raise KeyError(key)
    
    if __name__ == '__main__':
        remap = ReMap([(r'foo.', 12), (r'FileN.*', 35)])
        print(remap['food'])
        print(remap['foot in my mouth'])
        print(remap['FileNotFoundException: file.x does not exist'])
    
  • 2021-02-01 06:28

    In the general case, what you need is a lexer generator. It takes a bunch of regular expressions and compiles them into a recognizer. "lex" will work if you are using C. I have never used a lexer generator in Python, but there seem to be a few to choose from. Google shows PLY, PyGgy and PyLexer.

    If the regular expressions all resemble each other in some way, then you may be able to take some shortcuts. We would need to know more about the ultimate problem that you are trying to solve in order to come up with any suggestions. Can you share some sample regular expressions and some sample data?

    Also, how many regular expressions are you dealing with here? Are you sure that the naive approach won't work? As Rob Pike once said, "Fancy algorithms are slow when n is small, and n is usually small." Unless you have thousands of regular expressions, and thousands of things to match against them, and this is an interactive application where a user is waiting for you, you may be best off just doing it the easy way and looping through the regular expressions.
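    For reference, the easy way mentioned above might look like the following minimal sketch. The pattern table and the `lookup` helper are made up for illustration, and the sketch assumes the first pattern that matches the whole key should win.

```python
import re

# Hypothetical pattern table for illustration: (compiled pattern, value) pairs.
PATTERNS = [
    (re.compile(r"foo."), 12),
    (re.compile(r"FileN.*"), 35),
]


def lookup(key, patterns=PATTERNS):
    """Return the value of the first pattern that matches the whole key."""
    for regex, value in patterns:
        if regex.fullmatch(key):
            return value
    raise KeyError(key)
```

    The cost grows linearly with the number of patterns, which is usually fine for a few dozen of them.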

  • 2021-02-01 06:28

    As other respondents have pointed out, it's not possible to do this with a hash table in constant time.

    One approximation that might help is to use a technique called "n-grams". Create an inverted index from n-character chunks of a word to the entire word. When given a pattern, split it into n-character chunks, and use the index to compute a scored list of matching words.

    Even if you can't accept an approximation, in most cases this would still provide an accurate filtering mechanism so that you don't have to apply the regex to every key.
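    A rough sketch of the n-gram filter described above, under one simplifying assumption: rather than extracting chunks from the regex automatically, the caller passes a literal substring that every match is known to contain. The `NGramIndex` class and its method names are hypothetical.

```python
import re
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of n-character chunks of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramIndex:
    """Inverted index from n-grams to the keys that contain them."""

    def __init__(self, keys, n=3):
        self.n = n
        self.keys = list(keys)
        self.index = defaultdict(set)
        for key in self.keys:
            for gram in ngrams(key, n):
                self.index[gram].add(key)

    def candidates(self, literal):
        """Keys containing every n-gram of literal (a superset of the true hits)."""
        grams = ngrams(literal, self.n)
        if not grams:
            # Literal shorter than n: the index cannot narrow anything down.
            return set(self.keys)
        posting_lists = [self.index.get(gram, set()) for gram in grams]
        return set.intersection(*posting_lists)

    def search(self, pattern, literal):
        """Apply pattern only to the keys that pass the n-gram filter.

        literal is an assumption of this sketch: a substring that the caller
        knows every match of pattern must contain.
        """
        regex = re.compile(pattern)
        return sorted(k for k in self.candidates(literal) if regex.search(k))
```

    The index only narrows the candidate set; the regex still decides the final result, so the filter never changes which keys match.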

  • 2021-02-01 06:29

    This is definitely possible, as long as you're using 'real' regular expressions. A textbook regular expression is something that can be recognized by a deterministic finite state machine, which primarily means you can't have back-references in there.

    There's a property of regular languages that "the union of two regular languages is regular", meaning that you can recognize an arbitrary number of regular expressions at once with a single state machine. The state machine runs in O(1) time with respect to the number of expressions (it runs in O(n) time with respect to the length of the input string, but hash tables do too).

    Once the state machine completes you'll know which expressions matched, and from there it's easy to look up values in O(1) time.

  • 2021-02-01 06:34

    What about the following:

    import re

    class redict(dict):
        def __init__(self, d):
            dict.__init__(self, d)

        def __getitem__(self, regex):
            # Yield the value of every key that the regex matches
            r = re.compile(regex)
            for key in filter(r.match, self.keys()):
                yield dict.__getitem__(self, key)
    

    It's basically a subclass of the dict type in Python. With this you can supply a regular expression as a key, and the values of all keys that match this regex are returned in an iterable fashion using yield.

    With this you can do the following:

    >>> keys = ["a", "b", "c", "ab", "ce", "de"]
    >>> vals = range(0,len(keys))
    >>> red = redict(zip(keys, vals))
    >>> for i in red[r"^.e$"]:
    ...     print(i)
    ... 
    4
    5
    >>>
    
  • 2021-02-01 06:37

    It may be possible to get the regex compiler to do most of the work for you by concatenating the search expressions into one big regexp, separated by "|". A clever regex compiler might search for commonalities in the alternatives in such a case, and devise a more efficient search strategy than simply checking each one in turn. But I have no idea whether there are compilers which will do that.
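    In Python, one way to try this is to wrap each alternative in a named group and read `match.lastgroup` to see which one matched. This is only a sketch: the patterns, the generated group names `g0`, `g1`, ... and the `lookup` helper are made up for illustration, and it says nothing about how cleverly the engine handles the alternation internally.

```python
import re

# Hypothetical (pattern, value) pairs for illustration.
PATTERNS = [(r"foo.", 12), (r"FileN.*", 35)]

# One big regexp: each alternative sits in its own named group g0, g1, ...
combined = re.compile(
    "|".join("(?P<g%d>%s)" % (i, pattern) for i, (pattern, _) in enumerate(PATTERNS))
)
values = {"g%d" % i: value for i, (_, value) in enumerate(PATTERNS)}


def lookup(key):
    """Return the value whose pattern matches the whole key."""
    m = combined.fullmatch(key)
    if m is None:
        raise KeyError(key)
    # lastgroup is the name of the alternative that matched.
    return values[m.lastgroup]
```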
