awk: fatal: Invalid regular expression when setting multiple field separators

耶瑟儿~ 2021-01-07 12:42

I was trying to solve "Grep regex to select only 10 character" using awk. The question consists of a string XXXXXX[YYYYY--ZZZZZ and the OP wants to p

3 Answers
  •  一生所求
    2021-01-07 13:24

    IMHO this is best explained if we start by looking at a regexp being used by the split() command since that explicitly shows what is happening when a string is split into fields using a literal vs dynamic regexp and then we can relate that to Field Separators.

    This uses a literal regexp (delimited by /s):

    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/\[|--/); print f[2]}'
    YYYYY
    

    and so requires the [ to be escaped so it is taken literally since [ is a regexp metacharacter.

    These use a dynamic regexp (one stored as a string):

    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,"\\[|--"); print f[2]}'
    YYYYY
    
    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk 'BEGIN{re="\\[|--"} {split($0,f,re); print f[2]}'
    YYYYY
    
    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re='\\[|--' '{split($0,f,re); print f[2]}'
    YYYYY
    

    and so require the [ to be escaped 2 times since awk has to convert the string holding the regexp (a variable named re in the last 2 examples) to a regexp (which uses up one backslash) before it's used as the separator in the split() call (which uses up the second backslash).
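
    To see the two layers separately, here is a small sketch (the shell variable names stage1 and field are only illustrative, they are not part of the original commands). The first command prints the string as it looks after awk's string parser has processed it, and the second uses the same string as the split() separator:

```shell
# Layer 1: awk's string parser turns "\\[" into the two characters \[
stage1=$(awk 'BEGIN { re = "\\[|--"; print re }')
printf '%s\n' "$stage1"    # prints \[|--

# Layer 2: that string is compiled as a regexp, where \[ matches a literal [
field=$(echo "XXXXXXX[YYYYY--ZZZZ" | awk '{ n = split($0, f, "\\[|--"); print n ":" f[2] }')
printf '%s\n' "$field"     # prints 3:YYYYY
```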

    This:

    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re="\\\[|--" '{split($0,f,re); print f[2]}'
    YYYYY
    

    exposes the variable contents to the shell for its evaluation and so requires the [ to be escaped 3 times, since the shell parses the double-quoted string first to try to expand shell variables etc. (which uses up one backslash) and then awk has to convert the string holding the regexp to a regexp (which uses up a second backslash) before it's used as the separator in the split() call (which uses up the third backslash).
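
    A quick way to check what the shell actually hands to awk, as a sketch (the variable names single, double and out are made up for this illustration):

```shell
# Single quotes: the shell passes the text through untouched.
single=$(printf '%s' '\\[|--')
# Double quotes: the shell turns \\ into \ and leaves \[ alone.
double=$(printf '%s' "\\\[|--")

printf '%s\n' "$single" "$double"    # both lines read \\[|--

# So either spelling gives awk the same -v value.
out=$(echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re="$double" '{split($0,f,re); print f[2]}')
printf '%s\n' "$out"                 # prints YYYYY
```

    In other words, the double-quoted form with three backslashes and the single-quoted form with two are the same string by the time awk sees them.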

    A Field Separator is just a regexp stored as a variable named FS (like re above) with some extra semantics, so all of the above applies to it too, hence:

    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '\\[|--' '{print $2}'
    YYYYY
    
    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "\\\[|--" '{print $2}'
    YYYYY
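
    Since FS is an ordinary variable, the same separator can presumably be set in any of the usual ways. A minimal sketch, assuming a POSIX-style awk (the shell variable names input, a, b and c are illustrative only):

```shell
input='XXXXXXX[YYYYY--ZZZZ'

# FS set with -F (single-quoted, so two backslashes)
a=$(printf '%s\n' "$input" | awk -F '\\[|--' '{print $2}')
# FS set with -v, which goes through the same escape processing
b=$(printf '%s\n' "$input" | awk -v FS='\\[|--' '{print $2}')
# FS set from a string literal in a BEGIN block
c=$(printf '%s\n' "$input" | awk 'BEGIN{FS="\\[|--"} {print $2}')

printf '%s\n' "$a" "$b" "$c"    # prints YYYYY three times
```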
    

    Note that we could have put the [ inside a bracket expression, [[], instead of escaping it, to have the [ treated literally:

    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/[[]|--/); print f[2]}'
    YYYYY
    

    and then we don't have to worry about escaping the escapes as we add layers of parsing:

    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "[[]|--" '{print $2}'
    YYYYY
    
    $ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '[[]|--' '{print $2}'
    YYYYY
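
    As a sanity check, the sketch below runs the same bracket-expression separator through each of the contexts discussed above without changing a single character of it (the shell variable names input and r1 to r4 are illustrative only):

```shell
input='XXXXXXX[YYYYY--ZZZZ'

# Literal regexp
r1=$(printf '%s\n' "$input" | awk '{ split($0, f, /[[]|--/); print f[2] }')
# Dynamic regexp from a string literal
r2=$(printf '%s\n' "$input" | awk '{ split($0, f, "[[]|--"); print f[2] }')
# Dynamic regexp passed in with -v
r3=$(printf '%s\n' "$input" | awk -v re='[[]|--' '{ split($0, f, re); print f[2] }')
# Field separator in shell double quotes
r4=$(printf '%s\n' "$input" | awk -F "[[]|--" '{ print $2 }')

printf '%s\n' "$r1" "$r2" "$r3" "$r4"    # prints YYYYY four times
```

    With no backslashes in the pattern, there is nothing for the shell or awk's string parser to consume, which is why the same text works at every layer.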
    
