How can I use a look after to match either a single or a double quote?

天涯浪人 2021-01-26 20:27

I have a series of strings I want to extract:

hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')
other stuff

What I

3 Answers
  • 2021-01-26 20:40

    Note: The sed command at the bottom of this answer works only as long as your strings are well-behaved, like

    "foo"
    

    or

    'bar'
    

    As soon as your strings start to misbehave :) like:

    "hello \"world\""
    

    it won't work any more.

    Your input looks like source code. For a robust solution, I recommend using a parser for that language to extract the strings.


    For trivial use cases:

    You can use sed. This should work on any POSIX platform, in contrast to grep -oP, which only works with GNU grep:

    sed -n 's/hello\.this_is(\(["'\'']\)\([^"]*\)\(["'\'']\).*/\2/gp' file
    #                                    ^^^^^^^^              ^^
    #                                          capture group 2 ^
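
    A possible variant, offered as a sketch rather than as part of the original answer: POSIX basic regular expressions support back-references, so the closing quote can be required to be the same character as the opening one with \1. The command above stops at the first " inside the string, so a single-quoted string that contains a double quote gets cut short; the variant below avoids that, but it still does not handle escaped quotes. The function name extract_quoted is made up for illustration.

```shell
# Hypothetical helper (the name is chosen for illustration only).
# Reads lines on stdin and prints the text between matching quotes.
extract_quoted() {
  # Group 1 is the opening quote (double or single), group 2 is the text
  # we want, and \1 followed by a closing parenthesis requires the same
  # quote character again.
  sed -n 's/hello\.this_is(\(["'\'']\)\(.*\)\1).*/\2/p'
}

printf '%s\n' \
  'hello.this_is("bla bla bla")' \
  'some random text' \
  "hello.this_is('hello hello')" \
  'other stuff' | extract_quoted
# prints:
# bla bla bla
# hello hello
```

    Because group 2 is now a greedy .*, a line with several calls is matched up to the last closing quote that is followed by ).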
    
  • 2021-01-26 20:44

    Use a capturing group and look for its content like the following:

    grep -Po 'hello\.this_is\(([\047"])((?!\1).|\\.)*\1\)' file
    

    This also handles escaped characters, e.g. hello.this_is("bla b\"la bla")


    If the output should be only the text between the quotes, use both \K and a positive lookahead:

    grep -Po 'hello\.this_is\(([\047"])\K((?!\1).|\\.)*(?=\1\))' file
    

    Outputs:

    bla bla bla
    hello hello
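
    To check the escaped-quote case mentioned above, here is a small sketch that wraps the second command in a function and feeds it one line with an escaped double quote. It assumes GNU grep built with PCRE support; the function name extract_pcre is hypothetical.

```shell
# Hypothetical wrapper around the \K / lookahead command from this answer.
# Needs GNU grep with PCRE support (-P); POSIX grep does not have it.
extract_pcre() {
  grep -Po 'hello\.this_is\(([\047"])\K((?!\1).|\\.)*(?=\1\))'
}

# printf with %s passes the backslash through unchanged.
printf '%s\n' 'hello.this_is("bla b\"la bla")' | extract_pcre
# prints: bla b\"la bla
```

    The backslash stays in the output because the pattern only skips over escaped characters; it does not unescape them.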
    
  • 2021-01-26 20:51

    Based on revo's and hek2mgl's excellent answers, I ended up using grep like this:

    grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)' file
    

    Which can be explained as:

    • grep
    • -Po use the Perl-compatible regex engine (-P) and print only the matched parts (-o)
    • '(?<=hello\.this_is\((["'\''])).*(?=\1)' the expression
      • (?<=hello\.this_is\((["'\''])) look-behind: match only text preceded by "hello.this_is(" followed by either ' or ". This last character is also captured for later use.
      • .* match everything...
      • (?=\1) until the captured character (that is, either ' or ") appears again. Since .* is greedy, this is its last occurrence on the line.
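
    One caveat worth illustrating, as a sketch that is not part of the original answer: when a line contains more than one call, the greedy .* runs to the last occurrence of the captured quote. Assuming GNU grep with PCRE support, a lazy .*? stops at the first one instead. The function names and the sample line are made up for illustration.

```shell
# Both helpers read lines on stdin. They need GNU grep with PCRE support (-P).
match_greedy() {
  # The command from this answer, unchanged.
  grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)'
}
match_lazy() {
  # Same pattern with a lazy quantifier (an illustrative variation).
  grep -Po '(?<=hello\.this_is\((["'\''])).*?(?=\1)'
}

line='hello.this_is("a") + hello.this_is("b")'
printf '%s\n' "$line" | match_greedy   # prints: a") + hello.this_is("b
printf '%s\n' "$line" | match_lazy     # prints: a, then b, on separate lines
```
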

    The key here was to use ["'\''] to indicate either ' or ". With '\'' we close the single-quoted string, insert a literal ' (which has to be escaped), and open the single-quoted string again.
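
    The quoting trick can be checked on its own, without grep. This is a minimal sketch; the variable name bracket is chosen for illustration.

```shell
# The shell joins three adjacent pieces into a single word:
#   1. a single-quoted string holding an opening bracket and a double quote
#   2. a backslash-escaped single quote, outside any quoting
#   3. a single-quoted string holding the closing bracket
bracket='["'\'']'

printf '%s\n' "$bracket"
# prints: ["']
```

    The second answer sidesteps the issue by writing the single quote as the octal escape \047 inside the bracket expression.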
