Match regex across multiple lines in bash

Question

I want to match all patterns that start with [% and end with %] in a file.

I've tried multiple tools such as awk, sed, pcregrep and none of them seem to work, although they are suggested as top answers on similar questions.

[% FOREACH selection = selections -%]
      case SELECTION_ID_[% SELECTION_NAME %]: {
        const [% selectionType %]& source = this->[% selectionName %]();
        rc = bcem_AggregateUtil::toAggregate(result,
                                             d_selectionId,
                                             source);
      } break;
[% END -%]

[% foo ]

[% INCLUDE attributeSearchBlock

    tree=attributeSearchTree depth=0

    visit='ReturnAttributeInfo' name='name' nameLength='nameLength' -%]

For the code above, I expect the following result:

[% FOREACH selection = selections -%]
      case SELECTION_ID_[% SELECTION_NAME %]: {
        const [% selectionType %]& source = this->[% selectionName %]();
[% END -%]
[% INCLUDE attributeSearchBlock

    tree=attributeSearchTree depth=0

    visit='ReturnAttributeInfo' name='name' nameLength='nameLength' -%]

But I am getting all the lines matched instead.

What am I doing wrong?

LATER EDIT:

If it's on multiple lines, it should also be matched. For example:

[% foo
bar -%]

LATER EDIT 2: None of the answers seems to work, so I did the whole thing manually using the following:

        hasPatternStarted=false
        # IFS= and -r keep leading whitespace and backslashes intact
        while IFS= read -r line; do
            if [[ $line =~ '[%' ]]; then
                hasPatternStarted=true
            fi
            if [[ $line =~ '%]' ]]; then
                hasPatternStarted=false
                printf '%s\n' "$line"
            fi
            if [ "$hasPatternStarted" = true ]; then
                printf '%s\n' "$line"
            fi
        done < "$filename"

It works fine, but if anyone has a one-liner to solve this problem (using sed, awk, pcregrep, perl, grep, anything), please say so.

Answer 1:

Taken literally, what you ask for gives only two lines, since only two lines contain both [% and -%]:

awk '/\[%.*-%\]/' file
[% FOREACH selection = selections -%]
[% END -%]

To get every line that contains a [% followed by a %], use:

awk '/\[%.*%\]/' file
[% FOREACH selection = selections -%]
      case SELECTION_ID_[% SELECTION_NAME %]: {
        const [% selectionType %]& source = this->[% selectionName %]();
[% END -%]
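
Note that this pattern is tested one line at a time, so it cannot match a tag that is split across lines, as in the question's later edit. A minimal sketch of that limitation, assuming a made-up four-line input (the `sample` and `matched` variables are illustrative only):

```shell
# Hypothetical input: a one-line tag, an unterminated tag,
# and a tag split across two lines.
sample='[% END -%]
[% foo ]
[% foo
bar -%]'

# The regex needs "[%" and "%]" on the same line to match.
matched=$(printf '%s\n' "$sample" | awk '/\[%.*%\]/')

printf '%s\n' "$matched"
# prints: [% END -%]
```

Only the one-line tag is printed; the two-line tag is skipped because neither of its lines contains both delimiters.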

Answer 2:

This is one way using GNU awk for multi-char RS and RT:

$ awk -v RS='%]' -v ORS= '{print gensub(/.*(\n[^\n]*\[%)/,"\\1",1) RT}' file
[% FOREACH selection = selections -%]
      case SELECTION_ID_[% SELECTION_NAME %]
        const [% selectionType %]& source = this->[% selectionName %]
[% END -%]
[% INCLUDE attributeSearchBlock

    tree=attributeSearchTree depth=0

    visit='ReturnAttributeInfo' name='name' nameLength='nameLength' -%]

and here's another with multi-char RS and FPAT:

$ cat tst.awk
BEGIN {
    RS = "^$"
    FPAT = "[^\n]*{[^{}]*}"
}
{
    gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C")
    gsub(/\[%/,"{")
    gsub(/%\]/,"}")
    for (i=1; i<=NF; i++) {
        str = $i
        gsub(/}/,"%]",str)
        gsub(/{/,"[%",str)
        gsub(/@C/,"}",str); gsub(/@B/,"{",str); gsub(/@A/,"@",str)
        print str
    }
}

$ awk -f tst.awk file
[% FOREACH selection = selections -%]
      case SELECTION_ID_[% SELECTION_NAME %]
        const [% selectionType %]& source = this->[% selectionName %]
[% END -%]
[% INCLUDE attributeSearchBlock

    tree=attributeSearchTree depth=0

    visit='ReturnAttributeInfo' name='name' nameLength='nameLength' -%]

The 2nd script demonstrates a common idiom for tools like awk or sed that only support greedy matches: when you need to match text between multi-character delimiter strings, convert those delimiters to single characters so that you can use a negated character class between them.

So in the above with:

gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C")

I convert every @ to @A to free up the @ character, then convert every { to @B (a string that we KNOW cannot occur in the input, since we just put an A after every @), and then convert every } to @C. That ensures there are no { or } characters left in the input, which frees them up for use as the regexp delimiters. I can now do:

gsub(/\[%/,"{")
gsub(/%\]/,"}")

to convert your real delimiter strings to characters so that I can use the negation of them in a regexp to match the string between those delimiters:

FPAT = "{[^{}]*}"

In GNU awk assigning FPAT like that automatically saves the matching strings in $1, $2, etc. so then I just have to unwind the above replacements before printing each field, hence:

gsub(/}/,"%]",str)
gsub(/{/,"[%",str)
gsub(/@C/,"}",str); gsub(/@B/,"{",str); gsub(/@A/,"@",str)
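
The same steps can be tried on a single line with plain sed, which also has only greedy matching. This is a rough sketch rather than part of the scripts above: the input line and the `<TAG>` placeholder are invented for illustration, and it replaces each tag instead of printing it.

```shell
# Hypothetical one-line input that already contains "@", "{" and "}".
line='a@b {c} [% x %] mid [% y %] end'

# Greedy match: spans from the first "[%" to the last "%]".
# (A bare "]" outside a bracket expression is a literal character.)
greedy=$(printf '%s\n' "$line" | sed 's/\[%.*%]/<TAG>/')

# Idiom: escape "@", "{", "}"; turn the real delimiters into "{" and "}";
# match with a negated character class; then undo the escaping in reverse.
per_tag=$(printf '%s\n' "$line" | sed \
    -e 's/@/@A/g' -e 's/{/@B/g' -e 's/}/@C/g' \
    -e 's/\[%/{/g' -e 's/%]/}/g' \
    -e 's/{[^{}]*}/<TAG>/g' \
    -e 's/@C/}/g' -e 's/@B/{/g' -e 's/@A/@/g')

printf '%s\n' "$greedy" "$per_tag"
# prints:
# a@b {c} <TAG> end
# a@b {c} <TAG> mid <TAG> end
```

The greedy version swallows everything between the first and last delimiter, while the converted version handles each tag separately and leaves the original @, { and } characters untouched.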

The equivalent to the 2nd script above for any POSIX awk is:

$ cat tst.awk
{ rec = (NR>1 ? rec ORS : "") $0 }
END {
    $0 = rec
    FPAT = "[^\n]*[{][^{}]*[}]"
    gsub(/@/,"@A"); gsub(/[{]/,"@B"); gsub(/[}]/,"@C")
    gsub(/\[%/,"{")
    gsub(/%\]/,"}")
    while ( match($0,FPAT) ) {
        str = substr($0,RSTART,RLENGTH)
        $0 = substr($0,RSTART+RLENGTH)
        gsub(/[}]/,"%]",str)
        gsub(/[{]/,"[%",str)
        gsub(/@C/,"}",str); gsub(/@B/,"{",str); gsub(/@A/,"@",str)
        print str
    }
}

$ awk -f tst.awk file
[% FOREACH selection = selections -%]
      case SELECTION_ID_[% SELECTION_NAME %]
        const [% selectionType %]& source = this->[% selectionName %]
[% END -%]
[% INCLUDE attributeSearchBlock

    tree=attributeSearchTree depth=0

    visit='ReturnAttributeInfo' name='name' nameLength='nameLength' -%]

Answer 3:

TL;DR: perl -ne 'print if /\[%/../%\]/' file

You'd think you could do this: sed -n '/\[%/,/%\]/p' but the range doesn't terminate properly when [% and %] are on the same line.

So you can convert the above to perl: perl -ne 'print if /\[%/.../%\]/' and that has the same problem because of the ... operator.

Perl, though, has an operator to save the day here: perl -ne 'print if /\[%/../%\]/'

As perlop says:

In scalar context, ".." returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated. It can test the right operand and become false on the same evaluation it became true (as in awk), but it still returns true once. If you don't want it to test the right operand until the next evaluation, as in sed, just use three dots ("..." ) instead of two. In all other regards, "..." behaves just like ".." does.

All that to say: for line-range operations, perl lets you have it both ways, with .. (like awk) and ... (like sed).

Source: https://stackoverflow.com/questions/56658151/match-regex-across-multiple-lines-in-bash

Tags

bash

awk

sed

scripting

grep