Hashtable/dictionary/map lookup with regular expressions

难免孤独 2021-02-01 05:36

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the

19 Answers
  • 2021-02-01 06:14

    I don't think it's even theoretically possible. What happens if someone passes in a string that matches more than one regular expression?

    For example, what would happen if someone did:

    >>> regex_dict['FileNfoo']
    

    How can something like that possibly be O(1)?
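
The ambiguity this answer points at can be made concrete with a minimal sketch. The two patterns below are hypothetical (the question's actual keys are not shown here), and the linear scan is only a baseline for comparison, not a proposed solution:

```python
import re

# Hypothetical patterns, chosen so that one key matches both of them.
regex_dict = {
    r'FileN.*': 'file-related error',
    r'.*foo': 'ends in foo',
}

def all_matches(key):
    """Return every value whose pattern matches the whole key (O(n) scan)."""
    return [value for pattern, value in regex_dict.items()
            if re.fullmatch(pattern, key)]

matches = all_matches('FileNfoo')  # both patterns match this key
```

Any regex-keyed lookup has to decide what to do in this case: return the first match, return all of them, or treat it as an error.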

  • 2021-02-01 06:14

    It really depends on what these regexes look like. If you don't have many regexes that match almost anything, like '.*' or '\d+', and instead have regexes that mostly contain words, phrases, or other fixed patterns longer than 4 characters (e.g. 'a*b*c' in ^\d+a\*b\*c:\s+\w+), as in your examples, you can use this common trick, which scales well to millions of regexes:

    Build an inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). Then, at matching time, use Rabin-Karp hashing to compute sliding hashes and look them up in the inverted index, advancing one character at a time. You now have O(1) lookup for inverted-index non-matches and a reasonable O(k) time for matches, where k is the average length of the lists of regexes in the inverted index. k can be quite small (less than 10) for many applications. The quality of the inverted index (a false positive means a bigger k, a false negative means missed matches) depends on how well the indexer understands the regex syntax. If the regexes are written by human experts, they can also provide hints about the fixed patterns each regex contains.
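
A rough Python sketch of the inverted-index idea described above. Several things here are assumptions rather than part of the answer: the class name `RegexIndex`, the fixed window of 4 characters, the hash constants, and the requirement that each rule comes with a hand-supplied literal hint that every match must contain. Only the first `WINDOW` characters of the hint are indexed, and each candidate is confirmed with the real regex.

```python
import re
from collections import defaultdict

WINDOW = 4              # indexed literal length (assumed value)
BASE = 257              # polynomial hash base (assumed value)
MOD = (1 << 61) - 1     # large prime modulus (assumed value)

def poly_hash(s):
    """Polynomial hash of a short string, as used by Rabin-Karp."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

class RegexIndex:
    def __init__(self, rules):
        # rules: iterable of (pattern, literal_hint, value); the hint must be
        # a fixed substring of at least WINDOW characters that any match contains.
        self.index = defaultdict(list)
        for pattern, literal, value in rules:
            if len(literal) < WINDOW:
                raise ValueError('literal hint too short: %r' % literal)
            self.index[poly_hash(literal[:WINDOW])].append(
                (re.compile(pattern), value))

    def lookup(self, text):
        """Return values of all rules that match somewhere in text."""
        results, seen = [], set()
        if len(text) < WINDOW:
            return results
        top = pow(BASE, WINDOW - 1, MOD)
        h = poly_hash(text[:WINDOW])
        for i in range(len(text) - WINDOW + 1):
            if i > 0:
                # Slide the window: drop text[i-1], append text[i+WINDOW-1].
                h = ((h - ord(text[i - 1]) * top) * BASE
                     + ord(text[i + WINDOW - 1])) % MOD
            for rx, value in self.index.get(h, ()):
                # Hash hits are only candidates; confirm with the real regex.
                if id(rx) not in seen and rx.search(text):
                    seen.add(id(rx))
                    results.append(value)
        return results

idx = RegexIndex([
    (r'^\d+a\*b\*c:\s+\w+', 'a*b*c', 'rule1'),
    (r'FileNotFound.*', 'FileNotFound', 'rule2'),
])
```

With this setup, `idx.lookup('12a*b*c: hello')` runs only the first regex, because only its hint hash appears among the sliding windows of the input.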

  • 2021-02-01 06:15

    OK, I have very similar requirements. I have a lot of lines with different syntaxes: basically remark lines, lines with codes used in a smart-card formatting process, and descriptor lines for keys and secret codes. In every case, I think the pattern/action "model" is the best approach for recognizing and processing a lot of lines.
    I'm using C++/CLI to develop my assembly, named LanguageProcessor.dll. The core of this library is a lex_rule class that basically contains:

    • a Regex member
    • an event member

    The constructor loads the regex string and calls the code needed to build the event on the fly using DynamicMethod, Emit and Reflection. The assembly also contains other classes, such as meta and object, that construct and instantiate objects from the simple names of the publisher and receiver classes; the receiver class provides the action handlers for each matched rule.

    Later, I have a class named fasterlex_engine that builds a Dictionary<Regex, action_delegate>, loading the definitions from an array in order to run them.

    The project is at an advanced stage, but I'm still building it today. I will try to improve the running performance by replacing the sequential pass over every pair for each input line with some mechanism that looks up the dictionary directly using the regexp, like:

    map_rule[gcnew Regex("[a-zA-Z]")];
    

    Here are some segments of my code:

    public ref class lex_rule: ILexRule
    {
    private:
        Exception           ^m_exception;
        Regex               ^m_pattern;
    
        // Backing-storage delegates (something I learned the hard way while building huella.net)
        yy_lexical_action   ^m_yy_lexical_action; 
        yy_user_action      ^m_yy_user_action;
    
    public: 
        virtual property    String ^short_id; 
    private:
        void init(String ^_short_id, String ^well_formed_regex);
    public:
    
        lex_rule();
        lex_rule(String ^_short_id,String ^well_formed_regex);
        virtual event    yy_lexical_action ^YY_RULE_MATCHED
        {
            virtual void add(yy_lexical_action ^_delegateHandle)
            {
                if(nullptr==m_yy_lexical_action)
                    m_yy_lexical_action=_delegateHandle;
            }
            virtual void remove(yy_lexical_action ^)
            {
                m_yy_lexical_action=nullptr;
            }
    
            virtual long raise(String ^id_rule, String ^input_string, String ^match_string, int index) 
            {
                long lReturn=-1L;
                if(m_yy_lexical_action)
                    lReturn=m_yy_lexical_action(id_rule,input_string, match_string, index);
                return lReturn;
            }
        }
    };
    

    Now the fasterlex_engine class, which executes a lot of pattern/action pairs:

    public ref class fasterlex_engine 
    {
    private: 
        Dictionary<String^,ILexRule^> ^m_map_rules;
    public:
        fasterlex_engine();
        fasterlex_engine(array<String ^,2>^defs);
        Dictionary<String ^,Exception ^> ^load_definitions(array<String ^,2> ^defs);
        void run();
    };
    

    And to round out this topic, some code from my .cpp file.

    This code creates a constructor invoker from the parameter signature:

    inline Exception ^object::builder(ConstructorInfo ^target, array<Type^> ^args)
    {
    try
    {
        DynamicMethod ^dm=gcnew DynamicMethod(
            "dyna_method_by_totem_motorist",
            Object::typeid,
            args,
            target->DeclaringType);
        ILGenerator ^il=dm->GetILGenerator();
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Call,Object::typeid->GetConstructor(Type::EmptyTypes)); // invokes the base constructor
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Ldarg_1);
        il->Emit(OpCodes::Newobj, target); // Newobj creates the object and invokes the constructor defined in target
        il->Emit(OpCodes::Ret);
        method_handler=(method_invoker ^) dm->CreateDelegate(method_invoker::typeid);
    }
    catch (Exception ^e)
    {
        return  e;
    }
    return nullptr;
    }

    This code attaches any handler function (static or not) to deal with the callback raised when an input string matches:

    Delegate ^connection_point::hook(String ^receiver_namespace,String ^receiver_class_name, String ^handler_name)
    {
    Delegate ^d=nullptr;
    if(connection_point::waitfor_hook<=m_state) // if it is 0, 1, 2 or more => try to hook
    { 
        try 
        {
            Type ^tmp=meta::_class(receiver_namespace+"."+receiver_class_name);
            m_handler=tmp->GetMethod(handler_name);
            m_receiver_object=Activator::CreateInstance(tmp,false); 
    
            d=m_handler->IsStatic?
                Delegate::CreateDelegate(m_tdelegate,m_handler):
                Delegate::CreateDelegate(m_tdelegate,m_receiver_object,m_handler);
    
            m_add_handler=m_connection_point->GetAddMethod();
            array<Object^> ^add_handler_args={d};
            m_add_handler->Invoke(m_publisher_object, add_handler_args);
            ++m_state;
            m_exception_flag=false;
        }
        catch(Exception ^e)
        {
            m_exception_flag=true;
            throw gcnew Exception(e->ToString()) ;
        }
    }
    return d;       
    }
    

    Finally, the code that calls the lexer engine:

    array<String ^,2> ^defs=gcnew array<String^,2>  {/*   shortID    pattern         namespace  class           function */
                                                        {"LETRAS",  "[A-Za-z]+"     ,"prueba",  "manejador",    "procesa_directriz"},
                                                        {"INTS",    "[0-9]+"        ,"prueba",  "manejador",    "procesa_comentario"},
                                                        {"REM",     "--[^\\n]*"     ,"prueba",  "manejador",    "nullptr"}
                                                    }; //[3,5]
    
    // I use the special identifier "nullptr" so that the system assigns the event handling to a default that does nothing
    fasterlex_engine ^lex=gcnew fasterlex_engine();
    Dictionary<String ^,Exception ^> ^map_error_list=lex->load_definitions(defs);
    lex->run();
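
For readers who do not use C++/CLI, the sequential pattern/action loop that this answer wants to speed up can be sketched in Python. This is not a port of fasterlex_engine, whose run() method is not shown above; `make_engine` and its handler signature are hypothetical, and only the three patterns are taken from the defs array. As with the "nullptr" entry, a missing handler falls back to one that does nothing.

```python
import re

def make_engine(defs):
    """defs: list of (short_id, pattern, handler or None)."""
    def do_nothing(*args):
        return None

    rules = [(short_id, re.compile(pattern), handler or do_nothing)
             for short_id, pattern, handler in defs]

    def run(lines):
        fired = []
        for line in lines:
            # Sequential scan: try every rule in order for each input line.
            for short_id, rx, handler in rules:
                m = rx.match(line)
                if m:
                    handler(short_id, line, m.group(0), m.start())
                    fired.append((short_id, m.group(0)))
                    break
        return fired

    return run

run = make_engine([
    ('LETRAS', r'[A-Za-z]+', None),
    ('INTS', r'[0-9]+', None),
    ('REM', r'--[^\n]*', None),
])
fired = run(['abc 12', '42', '-- remark', '?'])
```

The inner loop costs one regex attempt per rule per line, which is the cost a direct regex-keyed lookup would be meant to avoid.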
    
  • 2021-02-01 06:17

    What you want to do is very similar to what is supported by xrdb. It only supports a fairly minimal notion of globbing, however.

    Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.

    • single characters just become trie nodes.
    • .'s become wildcard insertions covering all children of the current trie node.
    • *'s become back links in the trie to the node at the start of the previous item.
    • [a-z] ranges insert the same subsequent child nodes repeatedly under each of the characters in the range. With care, while inserts/updates may be somewhat expensive, the search can be linear in the size of the string. With some placeholder stuff, the common combinatorial explosion cases can be kept under control.
    • (foo)|(bar) nodes become multiple insertions

    This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
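
A minimal Python sketch of the trie idea above, restricted to literal characters and '.' so that it stays short. The names `TrieNode` and `PatternTrie` are made up for this sketch. As a simplification, '.' is stored as a dedicated wildcard edge instead of being copied under every child, and '*' back links and [a-z] ranges are left out. Alternation is handled as the answer suggests, by inserting each alternative separately.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # literal character -> TrieNode
        self.wildcard = None   # node reached by '.', if any
        self.values = []       # values of patterns that end at this node

class PatternTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, pattern, value):
        """Insert a pattern made only of literal characters and '.'."""
        node = self.root
        for ch in pattern:
            if ch == '.':
                if node.wildcard is None:
                    node.wildcard = TrieNode()
                node = node.wildcard
            else:
                node = node.children.setdefault(ch, TrieNode())
        node.values.append(value)

    def lookup(self, text):
        """Return values of all patterns that match the whole text."""
        frontier = [self.root]
        for ch in text:
            next_frontier = []
            for node in frontier:
                child = node.children.get(ch)
                if child is not None:
                    next_frontier.append(child)
                if node.wildcard is not None:
                    next_frontier.append(node.wildcard)
            if not next_frontier:
                return []
            frontier = next_frontier
        return [v for node in frontier for v in node.values]

trie = PatternTrie()
trie.insert('foo.', 12)
trie.insert('f.od', 7)
# (bar)|(baz) becomes two insertions that share a value.
trie.insert('bar', 3)
trie.insert('baz', 3)
```

The lookup walks the input once while tracking a set of live nodes, so the cost is the input length times the number of nodes alive at each step.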

    Perl has a couple of Text::Trie-like modules you can raid for ideas. (Heck, I think I even wrote one of them way back when.)

  • 2021-02-01 06:17

    There is a Perl module that does just this: Tie::Hash::Regex.

    use Tie::Hash::Regex;
    my %h;
    
    tie %h, 'Tie::Hash::Regex';
    
    $h{key}   = 'value';
    $h{key2}  = 'another value';
    $h{stuff} = 'something else';
    
    print $h{key};  # prints 'value'
    print $h{2};    # prints 'another value'
    print $h{'^s'}; # prints 'something else'
    
    print tied(%h)->FETCH(k); # prints 'value' and 'another value'
    
    delete $h{k};   # deletes $h{key} and $h{key2};
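
For comparison, a rough Python analogue of the behaviour shown in the Perl snippet. This is a sketch, not a port of Tie::Hash::Regex: `RegexKeyDict` is a made-up name, and it only mirrors what the example above demonstrates, namely that an exact key wins and that otherwise the lookup key is treated as a regex and tried against the stored keys.

```python
import re

class RegexKeyDict(dict):
    def __missing__(self, key):
        # Called only when there is no exact key: treat the key as a regex
        # and return the value of the first stored key it matches.
        rx = re.compile(key)
        for stored, value in self.items():
            if rx.search(stored):
                return value
        raise KeyError(key)

    def fetch_all(self, key):
        """Return the values of every stored key the regex matches."""
        rx = re.compile(key)
        return [value for stored, value in self.items() if rx.search(stored)]

h = RegexKeyDict()
h['key'] = 'value'
h['key2'] = 'another value'
h['stuff'] = 'something else'
```

Note the direction: the lookup key is the regex and the stored keys are plain strings, which is the reverse of storing regexes and looking up plain strings. The fallback is still a linear scan over all stored keys.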
    
  • 2021-02-01 06:18

    @rptb1 you don't have to avoid capturing groups, because you can use re.groups to count them. Like this:

    # Regular expression map
    # Abuses match.lastindex to figure out which key was matched
    # (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
    # Mostly for amusement.
    # Richard Brooksby, Ravenbrook Limited, 2013-06-01
    
    import re
    
    class ReMap(object):
        def __init__(self, items):
            if not items:
                items = [(r'epsilon^', None)] # Match nothing
            self.re = re.compile('|'.join('('+k+')' for (k,v) in items))
            self.lookup = {}
            index = 1
            for key, value in items:
                self.lookup[index] = value
                index += re.compile(key).groups + 1
    
        def __getitem__(self, key):
            m = self.re.match(key)
            if m:
                return self.lookup[m.lastindex]
            raise KeyError(key)
    
    def test():
        remap = ReMap([(r'foo.', 12),
                       (r'.*([0-9]+)', 99),
                       (r'FileN.*', 35),
                       ])
        print(remap['food'])
        print(remap['foot in my mouth'])
        print(remap['FileNotFoundException: file.x does not exist'])
        print(remap['there were 99 trombones'])
        print(remap['food costs $18'])
        print(remap['bar'])  # raises KeyError: no pattern matches 'bar'
    
    if __name__ == '__main__':
        test()
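
The group-counting step is the subtle part of the code above, so here it is in isolation as a small sketch that reuses the same three patterns. The helper name `which_pattern` is made up. Each key is wrapped in one extra group, so the next key's wrapper group starts `groups + 1` positions later, and `lastindex` reports the wrapper because an outer group closes after the groups nested inside it.

```python
import re

patterns = [r'foo.', r'.*([0-9]+)', r'FileN.*']

# Group number of the wrapper group around each pattern.
offsets = []
index = 1
for p in patterns:
    offsets.append(index)
    index += re.compile(p).groups + 1   # skip the pattern's own groups

combined = re.compile('|'.join('(' + p + ')' for p in patterns))

def which_pattern(text):
    """Return the position of the alternative that matched, or None."""
    m = combined.match(text)
    if m is None:
        return None
    return offsets.index(m.lastindex)
```

Here `offsets` comes out as `[1, 2, 4]`: the second pattern owns groups 2 and 3, so the third pattern's wrapper is group 4.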
    

    Sadly very few RE engines actually compile the regexps down to machine code, although it's not especially hard to do. I suspect there's an order of magnitude performance improvement waiting for someone to make a really good RE JIT library.
