Question
I have the following shell command:
awk -F'\[|\]' '{print $2}'
What is this command doing? Does it split each line into fields, using [sometext] as the delimiter?
E.g.:
$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line
Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.
Answer 1:
The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields at each occurrence of [ or ], which, with the sample line, yields "this" as field 1 ($1), "line" as field 2 ($2), and "passed to awk" as the last field ($3).
This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] are needed in a regex to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk interprets any multi-character value of the FS variable (-F option) as a regex.
However, the correct form is '\\[|\\]':
$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line
That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:
$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line
Note the careful placement of ] before [ inside the enclosing [...] to make this work, and how the enclosing [...] now have special meaning: they enclose a set of characters, any of which matches.
As for why 2 \ instances are needed in '\\[|\\]':
Taken as a regex in isolation, \[|\] would work:
- \[ matches literal [
- \] matches literal ]
- | is an alternation that matches one or the other.
However, Awk's string processing comes first:
- It should, due to \ handling in a string, reduce \[|\] to [|] before interpretation as a regex.
  - Unfortunately, however, Mawk, the default Awk on Ubuntu, for instance, resorts to guesswork in this particular scenario.[1]
- [|], interpreted as a regex, would then only match a single, literal |
Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.
This quote from the relevant section of the GNU Awk manual sums it up well:
To get a backslash into a regular expression inside a string, you have to type two backslashes.
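The two-stage processing can be made visible by printing the string before it is ever used as a regex. This is a minimal sketch; the variable names `regex` and `field` are illustrative, and it assumes, as the answer describes, that a -F value goes through the same escape handling as a string literal inside the script.

```shell
# Stage 1: the string literal "\\[|\\]" collapses to the 5-character text \[|\]
regex=$(awk 'BEGIN { s = "\\[|\\]"; print s }')
printf '%s\n' "$regex"    # prints \[|\]

# Stage 2: that text, read as a regex, matches a literal [ or ]
field=$(echo 'a[b]|c' | awk -F'\\[|\\]' '{ print $2 }')
printf '%s\n' "$field"    # prints b
```

Because each \\ has already been reduced to a single \ by the time the regex engine sees the value, the brackets arrive escaped and the | keeps its role as alternation.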
[1] Implementation differences:
Unfortunately, at least 1 major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.
BSD/macOS Awk and GNU Awk act predictably, and GNU Awk also issues a helpful warning when a singly \-prefixed regex metacharacter is found:
# GNU Awk: Predictable string-first processing + a helpful warning.
echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c
# BSD/macOS Awk: Predictable string-first processing, no warning.
echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c
# Mawk: *Guesses* that a *regex* was intended.
# The unambiguous form -F'\\[|\\]' works too, fortunately.
echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b
Optional reading: regex literals inside Awk scripts
Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.
However:
- These literals (which are invariably constant) are only available inside an Awk script,
- and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.
Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you cannot use the following, because the regex literal cannot be assigned to the (special) variable FS:
# !! DOES NOT WORK in any of the 3 major Awk implementations.
# Note that nothing is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'
# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b
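Since regex literals are accepted as function arguments, one place where the single-backslash form can be used directly is the separator argument of split(). The sketch below is an illustration rather than part of the original answer; the names `parts` and `second` are arbitrary.

```shell
# A regex literal as the third argument of split() needs only single backslashes.
second=$(echo 'a[b]|c' | awk '{ n = split($0, parts, /\[|\]/); print parts[2] }')
printf '%s\n' "$second"    # prints b
```

This splits one record explicitly instead of changing how every record is split, so it suits cases where only a particular string needs the bracket-based separation.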
Source: https://stackoverflow.com/questions/44072715/understanding-awk-delimiter-escaping-in-a-regex-based-field-separator