Is it possible to check if two groups are equal?

遥遥无期 2021-01-21 10:16

If I have some HTML like this:

 123

And the following regex:

 \\<[^\\>\\/]+\\>(.         


        
2 Answers
  • 2021-01-21 10:33

    Is there a way to do this?

    Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.

    Numbered Captures

    Pretending for the nonce that HTML <i> and <b> tags are always devoid of attributes, and moreover, neither overlap nor nest, we have this simple solution:

    #!/usr/bin/env perl
    #
    # solution A: numbered captures
    #
    use v5.10;
    while (<>) {
        say "$1: $2" while m{
              < ( [ib] ) >
              (
                  (?:
                      (?!  < /? \1  > ) .
                  ) *
              )
              </ \1  >
        }gsix;
    }
    

    Which when run, produces this:

    $ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
    i: foo
    b: bar
    

    Named Captures

    It would be better to use named captures, which leads to this equivalent solution:

    #!/usr/bin/env perl
    #
    # Solution B: named captures
    #
    use v5.10;
    while (<>) {
        say "$+{name}: $+{contents}" while m{      
              < (?<name> [ib] ) >
              (?<contents>
                  (?:
                      (?!  < /? \k<name>  > ) .
                  ) *
              )
              </ \k<name>  >
        }gsix;
    }
    

    Recursive Captures

    Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it requires a recursive pattern to solve. Remember that the trivial pattern to parse nested parens recursively is simply:

    ( \( (?: [^()]++ | (?-1) )*+ \) )
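
A minimal sketch of that paren pattern in use, assuming Perl 5.10 or later. The helper name `first_balanced` and the sample strings are made up for illustration:

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# The recursive paren pattern from above. (?-1) re-enters the most
# recently opened capture group, which here is group 1 itself.
my $balanced = qr{ ( \( (?: [^()]++ | (?-1) )*+ \) ) }x;

# Hypothetical helper: return the first balanced parenthesized chunk,
# or undef when the text contains none.
sub first_balanced {
    my ($text) = @_;
    return $text =~ $balanced ? $1 : undef;
}

my $nested   = first_balanced('f(a, (b + c), d) + g(e)');
my $unclosed = first_balanced('no closing (paren');

say $nested   // 'no match';   # the whole first group, inner parens included
say $unclosed // 'no match';   # an unclosed paren never matches
```

The possessive quantifiers (`++` and `*+`) keep the engine from backtracking into text it has already consumed, so unbalanced input fails quickly.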
    

    I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.

    #!/usr/bin/perl
    use v5.10;
    # Solution C: recursive captures, plus bonus iteration 
    while (my $line = <>) {
        my @input = ( $line );
        while (@input) { 
            my $cur = shift @input;
            while ($cur =~ m{      
                              < (?<name> [ib] ) >
                              (?<contents>
                                  (?:
                                        [^<]++
                                      | (?0)
                                      | (?!  </ \k<name>  > )
                                         .
                                  ) *+
                              )
                              </ \k<name>  >
                   }gsix)
            {
                say "$+{name}: $+{contents}";
                push @input, $+{contents};
            } 
        }
    }
    

    Which when demo’d produces this:

    $ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
    i: foo <i>nested</i> and <b>bar</b> bits
    i: nested
    b: bar
    

    That’s still fairly simple, so if it works on your data, go for it.

    Grammatical Patterns

    However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.

    As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.

    For example, this knows the attributes germane to the <i> (or <b>) tag. Here we define regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs but now for regexes:

    (?(DEFINE)   # begin regex subroutine defs for grammatical regex
    
        (?<i_tag_end> < / i > )
    
        (?<i_tag_start> < i (?&attributes) > )
    
        (?<attributes> (?: \s* (?&one_attribute) ) *)
    
        (?<one_attribute>
            \b
            (?&legal_attribute)
            \s* = \s* 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )
    
        (?<legal_attribute> 
              (?&standard_attribute) 
            | (?&event_attribute)
        )
    
        (?<standard_attribute>
              class
            | dir
            | ltr
            | id
            | lang
            | style
            | title
            | xml:lang
        )
    
        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.
    
        (?<event_attribute>
              on click
            | on dbl   click
            | on mouse down
            | on mouse move
            | on mouse out
            | on mouse over
            | on mouse up
            | on key   down
            | on key   press
            | on key   up
        )
    
        (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
        (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
        (?<equals>          (?&might_white)  = (?&might_white)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
        (?<unquoted_value>  [\w\-] *                              )
        (?<might_white>     \s *                                  )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )
        (?<start_tag>  < (?&might_white) )
        (?<end_tag>          
            (?&might_white)
            (?: (?&html_end_tag) 
              | (?&xhtml_end_tag) 
             )
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    
    )
    

    Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
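
As a rough sketch of how such definitions get called, here is a cut-down grammar, assuming Perl 5.10 or later. It reuses a few subroutine names from the definitions above, but their bodies are simplified for illustration (any attribute name is accepted), and the helper `italic_contents` is hypothetical:

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# The main pattern comes first and calls the subroutines with (?&name).
# The (?(DEFINE)...) block matches nothing by itself; it only declares them.
my $i_element = qr{
    (?&i_tag_start)
    (?<contents> (?: (?! (?&i_tag_end) ) . ) * )
    (?&i_tag_end)

    (?(DEFINE)
        (?<i_tag_start>    < i (?&attributes) \s* > )
        (?<i_tag_end>      < / i > )
        (?<attributes>     (?: \s+ (?&one_attribute) ) * )
        (?<one_attribute>  [\w\-]+ \s* = \s* (?: (?&quoted_value) | (?&unquoted_value) ) )
        (?<quoted_value>   " [^"]* " | ' [^']* ' )
        (?<unquoted_value> [\w\-]+ )
    )
}six;

# Hypothetical helper: contents of the first <i> element, or undef.
sub italic_contents {
    my ($html) = @_;
    return $html =~ $i_element ? $+{contents} : undef;
}

my $found = italic_contents(q{see <i class="note" id=x1>fine print</i> here});
say $found // 'no match';
```

This version does not recurse, so nested <i> elements would still need the `(?0)` treatment from Solution C.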

    However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.

    SUMMARY

    I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:

    1. You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
    2. You might find such concepts as recursive and grammatical patterns too complicated to easily understand.
    3. You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.

    Any one or more of those might well apply. In which case, don’t do it this way.

    For simple canned examples, this route is easy. The more robustly you want this to work on things you’ve never seen before, the harder this route becomes.

    Certainly you can’t do the recursive or grammatical parts of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or, even worse, JavaScript: their built-in engines offer backreferences but neither recursion nor (?(DEFINE)...) subroutines. In that respect they are barely any better than the Unix grep program, and in some ways are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.

    But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.

    Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.


    This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so reasonably well.

  • 2021-01-21 10:34

    You probably don't want to use regular expressions with HTML.

    But if you still want to do this you need to take a look at backreferences.

    Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
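
A minimal sketch of a backreference at work, in Perl. The helper name `has_matching_pair` and the sample inputs are made up for illustration, and tags are assumed to carry no attributes:

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: returns 1 when $html holds an <i> or <b> element
# whose closing tag names the same element as its opening tag, else 0.
sub has_matching_pair {
    my ($html) = @_;
    # ([ib]) captures the tag name; \1 must then match that same text.
    return $html =~ m{ < ([ib]) > .*? </ \1 > }six ? 1 : 0;
}

say has_matching_pair('<b>123</b>');   # opening and closing names agree
say has_matching_pair('<b>123</i>');   # names differ, so no match
```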


    Related issues:

    • RegEx match open tags except XHTML self-contained tags