I\'m just wondering if it\'s possible to use one regular expression to match another, that is some sort of:
[\'a-z\'].match([\'b-x\'])
True
[\'m-n\'].match(
Some clarification is required, I think:
.
rA = re.compile('(?<! )[a-z52]+')
'(?<! )[a-z52]+'
is a pattern
rA
is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* .
Personally I use the term regex exclusively for this kind of objects, not for the pattern .
Note that rB = re.compile(rA)
is also possible, it produces the same object (id(rA) == id(rB) equals to True)
ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch)
# or
y = rA.search(ch)
x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*
.
That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.
I don't think such a way exists, whatever theoretically or practically.
I think — in theory — to tell whether regexp A
matches a subset of what regexp B
matches, an algorithm could:
B
and also of the "union" A|B
.However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.
But what are you trying to do? Perhaps there's an easier approach...?
REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]
REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]
[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]
{A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
{B} -----------------epsilon --------> {E}
{C} <--- epsilon ------ {E}
{A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
{A} --------------epsilon -----------> {D}
{B} <--- epsilon ------ {D}
NFA:
{state A} ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
{state C} ---epsilon --> {state D}
{state C} <---epsilon -- {state D}
{state A} ---epsilon --> {state E} ---5--> {state F}
{state E} ---epsilon --> {state F}
{state E} <---epsilon -- {state F}
NFA -> DFA:
| 5 | epsilon*
----+--------------+--------
A | B,C,E,F,G | A,C,E,F
B | C,D,E,F | B,C,E,F
c | C,D,E,F | C
D | C,D,E,F,G | C,D,E,F
E | C,D,E,F,G | C,E,F
F | C,E,F,G | F
G | C,D,E,G | C,E,F,G
5(epsilon*)
-------------+---------------------
A | B,C,E,F,G
B,C,E,F,G | C,D,E,F,G
C,D,E,F,G | C,D,E,F,G
Finally the DFA for (REGEX_A|REGEX_B) is:
{A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
{C,D,E,F,G}---5--> {C,D,E,F,G}
Note: {A} is start state and {C,D,E,F,G} is accepting state.
| 5 | epsilon*
----+--------+--------
A | B,C,E | A
B | C,D,E | B,C,E
C | C,D,E | C
D | C,D,E | C,D,E
E | C,D,E | C,E
5(epsilon*)
-------+---------------------
A | B,C,E
B,C,E | C,D,E
C,D,E | C,D,E
{A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
{C,D,E}--5--->{C,D,E}
Note: {A} is start state and {C,D,E} is accepting state
| 5 | epsilon*
----+--------+--------
A | B,C,D | A,B,D
B | B,C,D | B
C | B,C,D | B,C,D
D | B,C,D | B,D
5(epsilon*)
-------+---------------------
A | B,C,D
B,C,D | B,C,D
{A} ---- 5 -----> {B,C,D}
{B,C,D} --- 5 ---> {B,C,D}
Note: {A} is start state and {B,C,D} is accepting state
DFA of REGX_A|REGX_B identical to DFA of REGX_A
-- implies REGEX_A is subset of REGEX_B
DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B
-- cannot infer about either gerexes.
It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile
. I don't see what use this would be, though. Also, it takes a different syntax.
Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re
module doesn't do it.
You should do something along these lines:
re.match("\[[b-x]-[b-x]\]", "[a-z]")
The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.
If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.
pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False
Also tested successfully with Python 2. See also how to do it with a different library.
Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip
and:
from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n