I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the
I don't think it's even theoretically possible. What happens if someone passes in a string that matches more than one regular expression?
For example, what would happen if someone did:
>>> regex_dict['FileNfoo']
How can something like that possibly be O(1)?
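To make the ambiguity concrete, here is a minimal Python sketch. The pattern table is hypothetical (it is not the asker's dictionary), and lookup_all and lookup_first are names invented for this illustration. It only shows that a plain scan has to pick a policy when several patterns match, and that the scan costs one regex test per pattern.

```python
import re

# Hypothetical pattern table, for illustration only.
PATTERNS = [
    (re.compile(r"FileN.*"), "file error"),
    (re.compile(r".*foo"), "ends with foo"),
    (re.compile(r"\d+"), "number"),
]

def lookup_all(key):
    """Return the values of every pattern that matches the whole key."""
    return [value for pattern, value in PATTERNS if pattern.fullmatch(key)]

def lookup_first(key):
    """Return the value of the first matching pattern, or raise KeyError."""
    for pattern, value in PATTERNS:
        if pattern.fullmatch(key):
            return value
    raise KeyError(key)
```

With this table, 'FileNfoo' matches two patterns, so lookup_all returns two values while lookup_first silently picks the one listed first.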
It really depends on what these regexes look like. If you don't have many regexes that match almost anything, like '.*' or '\d+', and instead have regexes that contain mostly words and phrases or fixed patterns longer than 4 characters (e.g. 'a*b*c' in ^\d+a\*b\*c:\s+\w+), as in your examples, you can use this common trick, which scales well to millions of regexes:
Build an inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). Then, at matching time, use Rabin-Karp hashing to compute sliding hashes and look them up in the inverted index, advancing one character at a time. You now have O(1) look-up for inverted-index non-matches and a reasonable O(k) time for matches, where k is the average length of the lists of regexes in the inverted index. k can be quite small (less than 10) for many applications. The quality of the inverted index (a false positive means a bigger k, a false negative means missed matches) depends on how well the indexer understands the regex syntax. If the regexes are written by human experts, they can also provide hints for the fixed patterns they contain.
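The inverted-index idea could be sketched in Python roughly as follows. This is a simplified illustration under several assumptions, not a tuned implementation: each regex is registered together with a hand-supplied literal hint, only the first WINDOW characters of the hint are hashed so that one rolling window length suffices, and RegexIndex, window_hash, WINDOW, BASE and MOD are all names invented here.

```python
import re
from collections import defaultdict

WINDOW = 4               # assumed window length; hints must be at least this long
BASE = 257
MOD = (1 << 61) - 1

def window_hash(chars):
    """Polynomial hash of a short string."""
    h = 0
    for ch in chars:
        h = (h * BASE + ord(ch)) % MOD
    return h

class RegexIndex:
    def __init__(self):
        self.entries = []                 # (literal, compiled regex, value)
        self.index = defaultdict(list)    # hash of literal[:WINDOW] -> entry numbers

    def add(self, pattern, literal, value):
        """Register a regex with a literal substring every match must contain."""
        if len(literal) < WINDOW:
            raise ValueError("literal hint is shorter than WINDOW")
        self.index[window_hash(literal[:WINDOW])].append(len(self.entries))
        self.entries.append((literal, re.compile(pattern), value))

    def lookup(self, text):
        """Return the values of all indexed regexes found in text."""
        results, checked = [], set()
        if len(text) < WINDOW:
            return results
        top = pow(BASE, WINDOW - 1, MOD)
        h = window_hash(text[:WINDOW])
        for start in range(len(text) - WINDOW + 1):
            if start:
                # Roll the hash: drop the leftmost character, append the next one.
                h = (h - ord(text[start - 1]) * top) % MOD
                h = (h * BASE + ord(text[start + WINDOW - 1])) % MOD
            for n in self.index.get(h, ()):
                literal, regex, value = self.entries[n]
                # Verify the literal to rule out hash collisions.
                if n in checked or not text.startswith(literal, start):
                    continue
                checked.add(n)
                if regex.search(text):    # run the real regex only on candidates
                    results.append(value)
        return results
```

Regexes with no usable literal (such as '.*' or '\d+') cannot be indexed this way and would need a separate fallback list that is always tried.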
OK, I have very similar requirements. I have many lines with different syntaxes: basically remark lines and lines with codes used in a smart-card formatting process, plus descriptor lines for keys and secret codes. In every case, I think the pattern/action "model" is the best approach for recognizing and processing a lot of lines.
I'm using C++/CLI to develop my assembly, named LanguageProcessor.dll. The core of this library is a lex_rule class that basically contains:
The constructor loads the regex string and calls the code needed to build the event on the fly using DynamicMethod, Emit and Reflection. The assembly also contains other classes, such as meta and object, that construct and instantiate objects from the simple names of the publisher and receiver classes; the receiver class provides the action handlers for each rule matched.
Later, I have a class named fasterlex_engine that builds a Dictionary<Regex, action_delegate> and loads the definitions from an array in order to run.
The project is at an advanced stage, but I'm still building it today. I will try to improve run-time performance by avoiding sequential access to every pair for each input line, using some mechanism to look up the dictionary directly by the regexp, like:
map_rule[gcnew Regex("[a-zA-Z]")];
Here are some segments of my code:
public ref class lex_rule: ILexRule
{
private:
    Exception ^m_exception;
    Regex ^m_pattern;
    // Backing-storage delegates; I learned this trick while building that cr*ppy huella.net, haha
    yy_lexical_action ^m_yy_lexical_action;
    yy_user_action ^m_yy_user_action;
public:
    virtual property String ^short_id;
private:
    void init(String ^_short_id, String ^well_formed_regex);
public:
    lex_rule();
    lex_rule(String ^_short_id, String ^well_formed_regex);
    virtual event yy_lexical_action ^YY_RULE_MATCHED
    {
        virtual void add(yy_lexical_action ^_delegateHandle)
        {
            if(nullptr==m_yy_lexical_action)
                m_yy_lexical_action=_delegateHandle;
        }
        virtual void remove(yy_lexical_action ^)
        {
            m_yy_lexical_action=nullptr;
        }
        virtual long raise(String ^id_rule, String ^input_string, String ^match_string, int index)
        {
            long lReturn=-1L;
            if(m_yy_lexical_action)
                lReturn=m_yy_lexical_action(id_rule, input_string, match_string, index);
            return lReturn;
        }
    }
};
Now the fasterlex_engine class, which executes a lot of pattern/action pairs:
public ref class fasterlex_engine
{
private:
    Dictionary<String^,ILexRule^> ^m_map_rules;
public:
    fasterlex_engine();
    fasterlex_engine(array<String ^,2> ^defs);
    Dictionary<String ^,Exception ^> ^load_definitions(array<String ^,2> ^defs);
    void run();
};
And to round out this topic, some code from my .cpp file.
This code creates a constructor invoker from the parameter signature:
inline Exception ^object::builder(ConstructorInfo ^target, array<Type^> ^args)
{
    try
    {
        DynamicMethod ^dm=gcnew DynamicMethod(
            "dyna_method_by_totem_motorist",
            Object::typeid,
            args,
            target->DeclaringType);
        ILGenerator ^il=dm->GetILGenerator();
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Call,Object::typeid->GetConstructor(Type::EmptyTypes)); // invoke the base constructor
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Ldarg_1);
        il->Emit(OpCodes::Newobj, target); // Newobj creates the object and invokes the constructor defined in target
        il->Emit(OpCodes::Ret);
        method_handler=(method_invoker ^) dm->CreateDelegate(method_invoker::typeid);
    }
    catch (Exception ^e)
    {
        return e;
    }
    return nullptr;
}
This code attaches any handler function (static or not) to deal with the callback raised when an input string is matched:
Delegate ^connection_point::hook(String ^receiver_namespace, String ^receiver_class_name, String ^handler_name)
{
    Delegate ^d=nullptr;
    if(connection_point::waitfor_hook<=m_state) // if the state is 0, 1, 2 or more => try to hook
    {
        try
        {
            Type ^tmp=meta::_class(receiver_namespace+"."+receiver_class_name);
            m_handler=tmp->GetMethod(handler_name);
            m_receiver_object=Activator::CreateInstance(tmp,false);
            d=m_handler->IsStatic?
                Delegate::CreateDelegate(m_tdelegate,m_handler):
                Delegate::CreateDelegate(m_tdelegate,m_receiver_object,m_handler);
            m_add_handler=m_connection_point->GetAddMethod();
            array<Object^> ^add_handler_args={d};
            m_add_handler->Invoke(m_publisher_object, add_handler_args);
            ++m_state;
            m_exception_flag=false;
        }
        catch(Exception ^e)
        {
            m_exception_flag=true;
            throw gcnew Exception(e->ToString());
        }
    }
    return d;
}
Finally, the code that calls the lexer engine:
array<String ^,2> ^defs=gcnew array<String^,2> { /* shortID, pattern, namespace, class, function */
    {"LETRAS", "[A-Za-z]+", "prueba", "manejador", "procesa_directriz"},
    {"INTS",   "[0-9]+",    "prueba", "manejador", "procesa_comentario"},
    {"REM",    "--[^\\n]*", "prueba", "manejador", "nullptr"}
}; //[3,5]
// The special identifier "nullptr" tells the system to assign the event to a default handler that does nothing
fasterlex_engine ^lex=gcnew fasterlex_engine();
Dictionary<String ^,Exception ^> ^map_error_list=lex->load_definitions(defs);
lex->run();
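For readers who don't use C++/CLI, the pattern/action table in this answer can be approximated in Python. This is only a sketch of the sequential dispatch described here, not a port of LanguageProcessor.dll: make_engine, run and report are hypothetical names, and None stands in for the special "nullptr" handler.

```python
import re

def make_engine(defs):
    """Compile (short_id, pattern, handler) rows into a line processor."""
    rules = [(short_id, re.compile(pattern), handler)
             for short_id, pattern, handler in defs]

    def run(lines):
        results = []
        for line in lines:
            # Sequential access: try each rule in order; the first match wins.
            for short_id, regex, handler in rules:
                m = regex.match(line)
                if m is None:
                    continue
                if handler is not None:   # None plays the role of "nullptr"
                    results.append(handler(short_id, line, m.group(0), m.start()))
                break
        return results

    return run

def report(id_rule, input_string, match_string, index):
    """Hypothetical handler; uses the same argument order as raise() above."""
    return (id_rule, match_string, index)

run = make_engine([
    ("LETRAS", r"[A-Za-z]+", report),
    ("INTS",   r"[0-9]+",    report),
    ("REM",    r"--[^\n]*",  None),
])
```

The inner loop is exactly the per-line sequential scan that this answer hopes to replace with a direct lookup.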
What you want to do is very similar to what is supported by xrdb. They only support a fairly minimal notion of globbing however.
Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.
This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
Perl has a couple of Text::Trie-like modules you can raid for ideas. (Heck, I think I even wrote one of them way back when.)
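As a rough illustration of the trie idea, here is a Python sketch that stores glob-style patterns character by character and walks all viable branches during lookup. It assumes only two wildcards, '*' (any run of characters) and '?' (any single character), which is far less than full regular expressions; GlobTrie and _Node are names made up for this example and have nothing to do with how xrdb is implemented.

```python
class _Node:
    def __init__(self, is_star=False):
        self.children = {}      # character -> _Node
        self.values = []        # values of patterns that end at this node
        self.is_star = is_star  # True if this node was reached through '*'

class GlobTrie:
    def __init__(self):
        self.root = _Node()

    def add(self, pattern, value):
        """Insert a glob pattern, one trie node per pattern character."""
        node = self.root
        for ch in pattern:
            if ch not in node.children:
                node.children[ch] = _Node(is_star=(ch == "*"))
            node = node.children[ch]
        node.values.append(value)

    def lookup(self, text):
        """Return the values of every stored pattern that matches all of text."""
        results, seen = [], set()
        stack = [(self.root, 0)]
        while stack:
            node, pos = stack.pop()
            if (id(node), pos) in seen:
                continue
            seen.add((id(node), pos))
            if pos == len(text):
                results.extend(node.values)
            else:
                if node.is_star:
                    stack.append((node, pos + 1))   # '*' absorbs one more character
                for key in (text[pos], "?"):        # literal edge, then single-char wildcard
                    child = node.children.get(key)
                    if child is not None:
                        stack.append((child, pos + 1))
            star = node.children.get("*")
            if star is not None:
                stack.append((star, pos))           # enter '*' matching the empty string
        return results
```

Patterns that share a prefix share trie nodes, so the common prefix is examined once per lookup rather than once per pattern.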
There is a Perl module that does just this: Tie::Hash::Regex.
use Tie::Hash::Regex;
my %h;
tie %h, 'Tie::Hash::Regex';
$h{key} = 'value';
$h{key2} = 'another value';
$h{stuff} = 'something else';
print $h{key}; # prints 'value'
print $h{2}; # prints 'another value'
print $h{'^s'}; # prints 'something else'
print tied(%h)->FETCH(k); # prints 'value' and 'another value'
delete $h{k}; # deletes $h{key} and $h{key2};
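For comparison, the behaviour shown in the Perl example could be imitated in Python along these lines. This is a hypothetical analogue written for this thread, not a description of how Tie::Hash::Regex is implemented: RegexKeyDict and fetch_all are invented names, an exact key is tried first, and a missing key is then treated as a regex searched against the stored keys in insertion order.

```python
import re

class RegexKeyDict(dict):
    def __missing__(self, key):
        # Called by dict only when there is no exact key: treat key as a regex.
        pattern = re.compile(key)
        for stored_key in self:
            if pattern.search(stored_key):
                return self[stored_key]
        raise KeyError(key)

    def fetch_all(self, key):
        """Return the values of all stored keys that the regex matches."""
        pattern = re.compile(key)
        return [value for stored_key, value in self.items()
                if pattern.search(stored_key)]

h = RegexKeyDict(key="value", key2="another value", stuff="something else")
```

Note that the fallback scans every stored key, so a miss on the exact lookup costs time proportional to the size of the dictionary.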
@rptb1 you don't have to avoid capturing groups, because you can use re.groups to count them. Like this:
# Regular expression map
# Abuses match.lastindex to figure out which key was matched
# (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
# Mostly for amusement.
# Richard Brooksby, Ravenbrook Limited, 2013-06-01

import re

class ReMap(object):

    def __init__(self, items):
        if not items:
            items = [(r'epsilon^', None)] # Match nothing
        # Wrap each key in its own group and join them into one alternation
        self.re = re.compile('|'.join('('+k+')' for (k,v) in items))
        self.lookup = {}
        index = 1
        for key, value in items:
            self.lookup[index] = value
            # Skip over the wrapper group plus any groups inside the key
            index += re.compile(key).groups + 1

    def __getitem__(self, key):
        m = self.re.match(key)
        if m:
            return self.lookup[m.lastindex]
        raise KeyError(key)

def test():
    remap = ReMap([(r'foo.', 12),
                   (r'.*([0-9]+)', 99),
                   (r'FileN.*', 35),
                   ])
    print(remap['food'])
    print(remap['foot in my mouth'])
    print(remap['FileNotFoundException: file.x does not exist'])
    print(remap['there were 99 trombones'])
    print(remap['food costs $18'])
    print(remap['bar'])  # no pattern matches, so this raises KeyError

if __name__ == '__main__':
    test()
Sadly very few RE engines actually compile the regexps down to machine code, although it's not especially hard to do. I suspect there's an order of magnitude performance improvement waiting for someone to make a really good RE JIT library.