Understanding awk delimiter - escaping in a regex-based field separator

送分小仙女 submitted on 2019-12-10 10:52:25

Question


I have the following shell command:

awk -F'\[|\]' '{print $2}'

What is this command doing? Does it split each line into fields, using [sometext] as the delimiter?

E.g.:

$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line

Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.


Answer 1:


The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields at each occurrence of [ or ], which, with the sample line, yields "this " as field 1 ($1), "line" as field 2 ($2), and " passed to awk" as the last field ($3).

This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] in a regex are needed to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk interprets any value of the FS variable (-F option) that is longer than one character as a regex; a single space and other single characters are special cases.
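
To see the regex interpretation in isolation, here is a minimal sketch with a separator that has nothing to do with brackets. The sample input a1b22c and the separator [0-9]+ are made up for illustration:

```shell
# FS is the regex [0-9]+, so each run of digits acts as one separator.
echo 'a1b22c' | awk -F'[0-9]+' '{ print $1, $2, $3 }'
# prints: a b c
```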

However, the correct form is '\\[|\\]':

$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line

That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:

$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line

Note the careful placement of ] before [ inside the enclosing [...] to make this work, and how the enclosing [...] now have special meaning: they enclose a set of characters, any of which matches.
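
As a quick check of the resulting fields, the following sketch prints every field of the sample line with the character-set separator. The <...> wrapping and the loop variable i are only there for illustration, to make leading and trailing spaces visible:

```shell
# Print each field of the sample line, wrapped in <...> to expose spaces.
echo "this [line] passed to awk" |
  awk -F'[][]' '{ for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }'
# prints:
# 1=<this >
# 2=<line>
# 3=< passed to awk>
```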


As for why 2 \ instances are needed in '\\[|\\]':

Taken as a regex in isolation, \[|\] would work:

  • \[ matches literal [
  • \] matches literal ]
  • | is an alternation that matches one or the other.

However, Awk's string processing comes first:

  • It should, due to \ handling in a string, reduce \[|\] to [|] before interpretation as a regex.

    • Unfortunately, Mawk (the default Awk on Ubuntu, for instance) resorts to guesswork in this scenario.[1]
  • [|], interpreted as a regex, would then only match a single, literal |
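
What the reduced regex [|] does can be checked directly by passing it as the separator, which sidesteps the backslash question altogether. This is only a sketch; the sample input is the one used in the implementation comparison of this answer:

```shell
# [|] is a character set containing only |, so the line is split at | alone.
echo 'a[b]|c' | awk -F'[|]' '{ print $2 }'
# prints: c
```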

Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.

This quote from the relevant section of the GNU Awk manual sums it up well:

To get a backslash into a regular expression inside a string, you have to type two backslashes.
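
The string-processing step can be observed on its own, without any field splitting. In this minimal sketch, the variable name s is arbitrary; it shows that the doubled backslashes in the string literal leave exactly one backslash each in the value that is later handed to the regex engine:

```shell
# The string literal "\\[|\\]" holds the 5 characters \[|\]
awk 'BEGIN { s = "\\[|\\]"; print s; print length(s) }'
# prints:
# \[|\]
# 5
```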


[1] Implementation differences:

Unfortunately, at least 1 major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.

BSD/macOS Awk and GNU Awk act predictably; GNU Awk also issues a helpful warning when it finds a regex metacharacter prefixed with a single \:

# GNU Awk: Predictable string-first processing + a helpful warning.
$ echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c

# BSD/macOS Awk: Predictable string-first processing, no warning.
$ echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c

# Mawk: *Guesses* that a *regex* was intended.
#       The unambiguous form -F'\\[|\\]' works too, fortunately.
$ echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b

Optional reading: regex literals inside Awk scripts

Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.

However:

  • These literals (which are invariably constant) are only available inside an Awk script,
  • and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.

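Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you cannot use the following, because the regex literal cannot be assigned to the (special) variable FS: on the right-hand side of an assignment, a bare regex literal is evaluated as the match $0 ~ /\[|\]/, so FS receives 0 or 1 (here 0, because $0 is still empty in BEGIN) rather than the regex.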

# !! DOES NOT WORK in any of the 3 major Awk implementations.
#    Note that only an empty line is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'

# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b
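
By contrast, a regex literal is accepted where a function argument is expected. The sketch below passes it to split(); the array name parts and the count variable n are chosen here for illustration. Only single backslashes are needed, because no string processing is involved:

```shell
# A regex literal needs only single backslashes when passed to split().
echo 'a[b]|c' | awk '{ n = split($0, parts, /\[|\]/); print n, parts[2] }'
# prints: 3 b
```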


Source: https://stackoverflow.com/questions/44072715/understanding-awk-delimiter-escaping-in-a-regex-based-field-separator
