Question
I'm seeing Bash bracket ranges (e.g. [A-Z]) behave in an unexpected way.
Is there an explanation for this behavior, or is it a bug?
Let's say I have a variable, from which I want to strip all uppercase letters:
$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"
The result I get is this:
a0123
If I do it with sed, I get the expected result:
$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123
The same seems to be the case for BASH built-in regex match:
$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0
If I check all lowercase letters from 'a' to 'z', it seems that only 'a' is an exception:
$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a
I do not have case-insensitive matching enabled, and even if I did, it should not make letter 'a' behave differently:
$ shopt -p nocasematch
shopt -u nocasematch
For reference, I'm using Cygwin, and I don't see this behavior on any other machine:
$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
EDIT:
I've found the exact same issue reported here:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So I guess it is a bug(?) in the "en_GB.UTF-8" collation rather than in Bash itself.
Setting LC_COLLATE=C does indeed solve this.
Answer 1:
It certainly has to do with your locale setting. An excerpt from the GNU bash man page, under Pattern Matching:
[..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option. [..]
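The globasciiranges option mentioned at the end of the excerpt can be tried as follows. This is a minimal sketch that assumes bash 4.3 or later, where the option exists. To my knowledge it affects only shell pattern matching such as ${var//[A-Z]/}, not the =~ operator, which goes through the system regex library.

```shell
# Sketch: assumes bash >= 4.3, where the globasciiranges option exists.
shopt -s globasciiranges      # make ranges in shell patterns use C-locale ordering

var='ABCDabcd0123'
stripped="${var//[A-Z]/}"     # remove every character in the ASCII range A-Z
printf '%s\n' "$stripped"
```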
Use the POSIX character classes ([[:upper:]] in this case), or change your locale by setting LC_ALL or LC_COLLATE to C, as mentioned above.
LC_ALL=C var='ABCDabcd0123'
echo "${var//[A-Z]/}"
abcd0123
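Assigning LC_ALL=C like this changes the locale for the rest of the shell session, so it may be preferable to confine the override. A minimal sketch, where strip_upper is a hypothetical helper name and the function body runs in a subshell so the caller's locale is left untouched:

```shell
# strip_upper is a hypothetical helper, not part of bash.
# The ( ... ) body runs in a subshell, so LC_ALL=C does not leak to the caller.
strip_upper() (
  LC_ALL=C
  printf '%s\n' "${1//[A-Z]/}"
)

strip_upper 'ABCDabcd0123'    # prints abcd0123
```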
Also, with this locale the [A-Z] test fails for every lower-case letter, so the loop from your question prints all of them (a through z):
LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
Under the same locale setting, neither a nor b matches [A-Z]:
[[ a =~ [A-Z] ]] ; echo $?
1
[[ b =~ [A-Z] ]] ; echo $?
1
but both match the lower-case range [a-z]:
[[ a =~ [a-z] ]] ; echo $?
0
[[ b =~ [a-z] ]] ; echo $?
0
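To see which lower-case letters a given locale places inside [A-Z], the loop from the question can be wrapped in a small helper. This is only a sketch: lower_in_upper_range is a made-up name, and the locale passed in must be installed on the system for the override to take effect.

```shell
# lower_in_upper_range is a hypothetical helper for illustration.
# Usage: lower_in_upper_range LOCALE
# Prints the lower-case letters that the regex range [A-Z] matches under LOCALE.
lower_in_upper_range() (
  LC_ALL=$1                      # subshell body: the override stays local
  out=''
  for l in {a..z}; do
    if [[ $l =~ [A-Z] ]]; then   # same test as in the question
      out+=$l
    fi
  done
  printf '%s\n' "$out"
)

lower_in_upper_range C           # prints an empty line: no lower-case letter matches
```

Calling it with en_GB.UTF-8 on the affected machine would presumably list every letter except a, in line with the question's observation, but the result depends on the platform's collation tables.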
That said, all of this can be avoided by using the POSIX character classes, which work even in a new shell without any locale override:
echo "${var//[[:upper:]]/}"
abcd0123
and
for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done
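The same class can be used to check a whole string without depending on collation order. A small sketch, with has_upper as a hypothetical helper name:

```shell
# has_upper is a hypothetical helper for illustration.
# Returns 0 if the argument contains at least one upper-case letter, 1 otherwise.
has_upper() {
  [[ $1 == *[[:upper:]]* ]]
}

has_upper 'abcD' && echo 'contains upper-case'
has_upper 'abcd' || echo 'no upper-case'
```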
Source: https://stackoverflow.com/questions/43448655/weird-behavior-of-bash-glob-regex-ranges