I'm wondering how to find a set of all matches to a given regex with a finite number of matches.
For example:
All of these example you can assume they start with ^
and end with $
`hello?` -> (hell, hello)
`[1-9][0-9]{0,3}` -> (1,2,3 ..., 9998, 9999)
`My (cat|dog) is awesome!` -> (My cat is awesome!, My dog is awesome!)
`1{1,10}` -> (1,11, ..., 111111111, 1111111111)
`1*` -> //error
`1+` -> //error
`(1|11){2}` -> (1,11,111,1111) //notice how it doesn't repeat any of the possibilities
I'd also be interested if there was a way of retrieving count a unique a solutions to the regex or if there is a way to determine if the regex has a finite solutions.
It would be nice if the algorithm could parse any regex, but a powerful enough subset of the regex would be fine.
I'm interested in a PHP solution to this problem, but other languages would also be fine.
EDIT:
I've learned in my Formal Theory class about DFA which can be used to implement regex (and other regular languages). If I could transform the regex into a DFA the solution seems fairly straight forward to me, but that transformation seems rather tricky to me.
EDIT 2:
Thanks for all the suggestions, see my post about the public github project I'm working on to "answer" this question.
The transformation from a regex to a DFA is pretty straightforward. The issue you'll run into there, though, is that the DFA generated can contain loops (e.g, for *
or +
), which will make it impossible to expand fully. Additionally, {n,n}
can't be represented cleanly in a DFA, as a DFA has no "memory" of how many times it's looped.
What a solution to this problem will boil down to is building a function which tokenizes and parses a regular expression, then returns an array of all possible matches. Using recursion here will help you a lot.
A starting point, in pseudocode, might look like:
to GenerateSolutionsFor(regex):
solutions = [""]
for token in TokenizeRegex(regex):
if token.isConstantString:
for sol in solutions: sol.append(token.string)
else if token.isLeftParen:
subregex = get content until matching right paren
subsols = GenerateSolutionsFor(subregex)
for sol in solutions:
for subsol in subsols:
sol.append(subsol)
else if token.isVerticalBar:
solutions.add(GenerateSolutionsFor(rest of the regex))
else if token.isLeftBrace:
...
I'm wondering how to find a set of all matches to a given regex with a finite number of matches.
Because you're only considering regular expressions denoting finite languages, you're actually considering a subset of the regular expressions over an alphabet. In particular, you're not dealing with regular expressions constructed using the Kleene star operator. This suggests a simple recursive algorithm for constructing the set of strings denoted by the regular expressions without Kleene star over an alphabet Σ.
LANG(a) = {a} for all a ∈ Σ
LANG(x ∪ y) = LANG(x) ∪ LANG(y)
LANG(xy) = {vw : v ∈ LANG(x) ∧ w ∈ LANG(y)}
Consider a regular expression such as a(b ∪ c)d
. This is precisely the structure of your cats and dogs example. Executing the algorithm will correctly determine the language denoted by the regular expression:
LANG(a((b ∪ c)d)) = {xy : x ∈ LANG(a) ∧ y ∈ LANG((b ∪ c)d)}
= {xy : x ∈ {a} ∧ y ∈ {vw : v ∈ LANG(b ∪ c) ∧ w ∈ LANG{d}}}
= {ay : y ∈ {vw : v ∈ (LANG(b) ∪ LANG(c)) ∧ w ∈ {d}}}
= {ay : y ∈ {vd : v ∈ {b} ∪ {c}}
= {ay : y ∈ {vd : v ∈ {b,c}}}
= {ay : y ∈ {bd, cd}}
= {abd, acd}
You also ask whether there is an algorithm that determines whether a regular language is finite. The algorithm consists in constructing the deterministic finite automaton accepting the language, then determining whether the transition graph contains a walk from the start state to a final state containing a cycle. Note that the subset of regular expressions constructed without Kleene star denote finite languages. Because the union and concatenation of finite sets is finite, this follows by easy induction.
You might want to look at this Regex library, which parses a RegEx syntax (albeit a bit different from the perl standard) and can construct a DFA from it: http://www.brics.dk/automaton/
I have begun working on a solution on Github. It can already lex most examples and give the solution set for finite regex.
It currently passes the following unit tests.
<?php
class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
{
function dataProviderForTestSimpleRead()
{
return array(
array( "^ab$", array( "ab" ) ),
array( "^(ab)$", array( "ab" ) ),
array( "^(ab|ba)$", array( "ab", "ba" ) ),
array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
array( "^hello?$", array( "hell", "hello" ) ),
array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
array( '^\n$', array( "\n" ) ),
array( '^\r$', array( "\r" ) ),
array( '^\t$', array( "\t" ) ),
array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
);
}
/**
* @dataProvider dataProviderForTestSimpleRead
*/
function testSimpleRead( $regex_string, $expected_matches_array )
{
$lexer = new RegexCompiler_Lexer();
$actualy_matches_array = $lexer->lex( $regex_string )->getMatches();
sort( $actualy_matches_array );
sort( $expected_matches_array );
$this->assertSame( $expected_matches_array, $actualy_matches_array );
}
}
?>
I would like to build an MatchIterator
class that could handle infinite lists as well as one that would randomly generate matches from the regex. I'd also like to look into building regex from a match set as a way of optimizing lookups or compressing data.
This probably doesn't answer all your questions / needs, but maybe it's a good starting point. I was searching for a solution for auto-generating data that matches a regexp a while ago, and i found this perl module Parse::RandGen, Parse::RandGen::RegExp, which worked quite impressivly good for my needs:
来源:https://stackoverflow.com/questions/7614962/create-set-of-all-possible-matches-for-a-given-regex