Create set of all possible matches for a given regex

Submitted by 夙愿已清 on 2019-12-02 22:16:12

The transformation from a regex to a DFA is pretty straightforward. The issue you'll run into, though, is that the generated DFA can contain loops (e.g., for * or +), which make it impossible to expand fully. Additionally, counted repetition such as {n,m} can't be represented cleanly in a DFA: a DFA has no "memory" of how many times it has looped, so the repeated part has to be unrolled into separate copies of its states.

A solution to this problem boils down to building a function that tokenizes and parses a regular expression, then returns an array of all possible matches. Recursion will help you a lot here.

A starting point, in pseudocode, might look like:

to GenerateSolutionsFor(regex):
    solutions = [""]
    for token in TokenizeRegex(regex):
        if token.isConstantString:
            # extend every partial solution with the literal text
            solutions = [sol + token.string for sol in solutions]
        else if token.isLeftParen:
            subregex = get content until matching right paren
            subsols = GenerateSolutionsFor(subregex)
            # cross product: every partial solution with every sub-solution
            solutions = [sol + subsol for sol in solutions for subsol in subsols]
        else if token.isVerticalBar:
            # everything after a top-level "|" is an alternative to what came before
            return solutions + GenerateSolutionsFor(rest of the regex)
        else if token.isLeftBrace:
            ...
    return solutions
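A runnable Python sketch of the same idea, under stated assumptions: it supports only single-character literals, (...) groups and |, and it swaps the token loop for a small recursive-descent parser. The names generate, _alternation and _sequence are hypothetical, chosen for this illustration.

```python
def generate(regex):
    """Return the set of strings matched by a regex made of literals, groups and |."""
    matches, pos = _alternation(regex, 0)
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at position {pos}")
    return matches


def _alternation(regex, pos):
    # alternation := sequence ('|' sequence)*
    result, pos = _sequence(regex, pos)
    while pos < len(regex) and regex[pos] == "|":
        branch, pos = _sequence(regex, pos + 1)
        result |= branch
    return result, pos


def _sequence(regex, pos):
    # sequence := (literal | '(' alternation ')')*
    solutions = {""}
    while pos < len(regex) and regex[pos] not in "|)":
        if regex[pos] == "(":
            sub, pos = _alternation(regex, pos + 1)
            if pos >= len(regex) or regex[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1  # skip the ')'
        else:
            sub, pos = {regex[pos]}, pos + 1
        # Cross product: every partial solution extended by every sub-match.
        solutions = {s + t for s in solutions for t in sub}
    return solutions, pos


print(sorted(generate("a(b|c)d")))      # ['abd', 'acd']
print(sorted(generate("ab|(b|c)a")))    # ['ab', 'ba', 'ca']
```

Returning a set rather than a list removes duplicates such as those produced by (a|a). Quantifiers, character classes and anchors are deliberately left out.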

I'm wondering how to find a set of all matches to a given regex with a finite number of matches.

Because you're only considering regular expressions denoting finite languages, you're actually considering a subset of the regular expressions over an alphabet. In particular, you're not dealing with regular expressions constructed using the Kleene star operator. This suggests a simple recursive algorithm for constructing the set of strings denoted by the regular expressions without Kleene star over an alphabet Σ.

LANG(a)     = {a} for all a ∈ Σ
LANG(x ∪ y) = LANG(x) ∪ LANG(y)
LANG(xy)    = {vw : v ∈ LANG(x) ∧ w ∈ LANG(y)}

Consider a regular expression such as a(b ∪ c)d. This is precisely the structure of your cats and dogs example. Executing the algorithm will correctly determine the language denoted by the regular expression:

LANG(a((b ∪ c)d)) = {xy : x ∈ LANG(a) ∧ y ∈ LANG((b ∪ c)d)}
                  = {xy : x ∈ {a} ∧ y ∈ {vw : v ∈ LANG(b ∪ c) ∧ w ∈ LANG(d)}}
                  = {ay : y ∈ {vw : v ∈ (LANG(b) ∪ LANG(c)) ∧ w ∈ {d}}}
                  = {ay : y ∈ {vd : v ∈ {b} ∪ {c}}}
                  = {ay : y ∈ {vd : v ∈ {b,c}}}
                  = {ay : y ∈ {bd, cd}}
                  = {abd, acd}
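The three LANG rules translate almost directly into code. The sketch below is a minimal illustration, assuming the expression has already been parsed into nested tuples; the tuple tags "union" and "concat" and the function name lang are made up for this example.

```python
def lang(expr):
    """Compute the finite language denoted by a star-free expression.

    expr is either a single symbol (a str), ("union", x, y) or ("concat", x, y).
    """
    if isinstance(expr, str):
        return {expr}                                      # LANG(a) = {a}
    op, x, y = expr
    if op == "union":
        return lang(x) | lang(y)                           # LANG(x ∪ y)
    if op == "concat":
        return {v + w for v in lang(x) for w in lang(y)}   # LANG(xy)
    raise ValueError(f"unknown operator: {op!r}")


# a((b ∪ c)d), the expression from the derivation above
expr = ("concat", "a", ("concat", ("union", "b", "c"), "d"))
print(sorted(lang(expr)))  # ['abd', 'acd']
```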

You also ask whether there is an algorithm that determines whether a regular language is finite. The algorithm consists of constructing the deterministic finite automaton accepting the language, then checking whether the transition graph contains a walk from the start state to a final state that passes through a cycle; the language is infinite exactly when such a walk exists. Note that every regular expression constructed without Kleene star denotes a finite language. This follows by easy induction, because the union and the concatenation of finite sets are finite.
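The finiteness check can be sketched as follows. This is one possible implementation, not a reference one, and it assumes the DFA is given as a dict mapping each state to a {symbol: next_state} dict; the function name is_finite is hypothetical. It keeps only the states that are reachable from the start state and can still reach a final state, then looks for a cycle among them by repeatedly peeling off states with no incoming edges.

```python
def is_finite(transitions, start, finals):
    """Return True if the DFA accepts only finitely many strings."""
    def reach(sources, edges):
        # Plain graph search: all states reachable from `sources` via `edges`.
        seen, stack = set(sources), list(sources)
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    forward = {s: set(t.values()) for s, t in transitions.items()}
    backward = {}
    for src, targets in forward.items():
        for dst in targets:
            backward.setdefault(dst, set()).add(src)

    # States that lie on at least one walk from the start to a final state.
    useful = reach([start], forward) & reach(finals, backward)

    # Topological peeling: if some useful state can never be removed,
    # it sits on (or behind) a cycle, so arbitrarily long strings are accepted.
    indegree = {s: 0 for s in useful}
    for s in useful:
        for nxt in forward.get(s, ()):
            if nxt in useful:
                indegree[nxt] += 1
    queue = [s for s, d in indegree.items() if d == 0]
    removed = 0
    while queue:
        s = queue.pop()
        removed += 1
        for nxt in forward.get(s, ()):
            if nxt in useful:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    queue.append(nxt)
    return removed == len(useful)


# a(b|c)d: no cycle, so the language is finite
acyclic = {0: {"a": 1}, 1: {"b": 2, "c": 2}, 2: {"d": 3}}
print(is_finite(acyclic, 0, {3}))   # True

# ab*: the loop on state 1 makes the language infinite
looping = {0: {"a": 1}, 1: {"b": 1}}
print(is_finite(looping, 0, {1}))   # False
```

Restricting the search to useful states matters: a cycle inside a dead state (one from which no final state is reachable) does not make the language infinite.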

You might want to look at this regex library, which parses a regex syntax (albeit a bit different from the Perl standard) and can construct a DFA from it: http://www.brics.dk/automaton/

I have begun working on a solution on GitHub. It can already lex most examples and give the solution set for finite regexes.

It currently passes the following unit tests.

<?php

class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
{

    function dataProviderForTestSimpleRead()
    {
        return array(
            array( "^ab$", array( "ab" ) ),
            array( "^(ab)$", array( "ab" ) ),
            array( "^(ab|ba)$", array( "ab", "ba" ) ),
            array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
            array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
            array( "^hello?$", array( "hell", "hello" ) ),
            array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
            array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
            array( '^\n$', array( "\n" ) ),
            array( '^\r$', array( "\r" ) ),
            array( '^\t$', array( "\t" ) ),
            array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
            array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
        );
    }

    /**
     * @dataProvider dataProviderForTestSimpleRead
     */

    function testSimpleRead( $regex_string, $expected_matches_array )
    {
        $lexer = new RegexCompiler_Lexer();
        $actual_matches_array = $lexer->lex( $regex_string )->getMatches();
        sort( $actual_matches_array );
        sort( $expected_matches_array );
        $this->assertSame( $expected_matches_array, $actual_matches_array );
    }

}

?>
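The quantifier cases in these tests ({0,2}, {1,2}, {2} and ?) all reduce to concatenating the sub-expression's matches with themselves a bounded number of times. The Python sketch below shows that expansion on its own; it is an illustration only and is not taken from the RegexCompiler_Lexer code, and the helper name repeat is hypothetical.

```python
def repeat(sub_matches, low, high):
    """Return every concatenation of between `low` and `high` items of sub_matches."""
    result = set()
    current = {""}  # all concatenations of exactly `count` items, starting at 0
    for count in range(high + 1):
        if count >= low:
            result |= current
        # Move from exactly `count` repetitions to exactly `count + 1`.
        current = {s + t for s in current for t in sub_matches}
    return result


# (ab|ba){0,2}
print(sorted(repeat({"ab", "ba"}, 0, 2)))
# ['', 'ab', 'abab', 'abba', 'ba', 'baab', 'baba']

# hello? is "hell" followed by o{0,1}
print(sorted("hell" + tail for tail in repeat({"o"}, 0, 1)))
# ['hell', 'hello']
```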

I would like to build a MatchIterator class that could handle infinite lists, as well as one that would randomly generate matches from the regex. I'd also like to look into building a regex from a match set as a way of optimizing lookups or compressing data.
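For the infinite case, one way such an iterator might work is to yield matches lazily in order of repetition count, so a caller can take as many as needed. The sketch below does this for a starred group only, in Python; iter_star is a hypothetical name, and the sketch assumes sub_matches contains at least one non-empty string (otherwise it would loop without yielding anything new).

```python
from itertools import count, islice


def iter_star(sub_matches):
    """Lazily yield every match of (sub)* ordered by number of repetitions."""
    seen = set()      # avoids yielding the same string twice
    current = {""}    # concatenations of exactly k items, starting at k = 0
    for _ in count():
        for s in sorted(current):
            if s not in seen:
                seen.add(s)
                yield s
        current = {s + t for s in current for t in sub_matches}


# First seven matches of (ab|ba)*
print(list(islice(iter_star({"ab", "ba"}), 7)))
# ['', 'ab', 'ba', 'abab', 'abba', 'baab', 'baba']
```

The seen set grows without bound, so this is only suitable for taking a limited prefix of the language.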

aurora

This probably doesn't answer all your questions / needs, but maybe it's a good starting point. A while ago I was searching for a way to auto-generate data that matches a regexp, and I found the Perl module Parse::RandGen (with Parse::RandGen::RegExp), which worked impressively well for my needs:

http://metacpan.org/pod/Parse::RandGen
