Question
regcomp (from glibc) is a POSIX function for compiling regular expressions.
int regcomp(regex_t *restrict preg, const char *restrict pattern,
int cflags);
Some regular-expression constructs depend on the notion of a single character, for example the bracket expression [abc].
If a multibyte encoding is in use and the expression contains a multibyte letter, the result differs depending on whether the pattern is treated as a sequence of bytes or as a sequence of multibyte characters.
Here I illustrate this with grep (which need not behave the same way as the C function regcomp in this respect):
$ { echo Г; echo Д; } | egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=C egrep '[Д]'
Г
Д
$
LANG supplies the default value for any specific locale variable that is not set, so the question is: which of these variables affects regcomp's idea of the encoding?
$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$
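The precedence among these variables can be sketched in C. This is only an illustration of the POSIX lookup order (LC_ALL, then the category's own variable, then LANG), not glibc's actual code; effective_locale is a hypothetical helper name, and the final fallback to "C" is an assumption, since the default is implementation-defined.

```c
#include <stdlib.h>

/*
 * Return the locale name that applies to one category, following the
 * POSIX precedence: LC_ALL first, then the category's own variable,
 * then LANG. Variables that are unset or empty are skipped.
 * Hypothetical helper; the "C" fallback is an assumption.
 */
const char *effective_locale(const char *category_var)
{
    const char *names[] = { "LC_ALL", category_var, "LANG" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return "C";
}
```

With the environment shown above, effective_locale("LC_CTYPE") would yield ru_RU.utf8 (inherited from LANG), while effective_locale("LC_MESSAGES") would yield POSIX.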
Answer 1:
As for grep (which need not behave the same way as regcomp), it seems to honor LC_CTYPE for this decision:
$ { echo Г; echo Д; } | LANG=en_US.utf8 egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[Д]'
Г
Д
$
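To check regcomp itself rather than grep, one could run a small test along the following lines. This is a minimal sketch, not a confirmed description of glibc's behavior: bracket_matches is a hypothetical helper, the en_US.utf8 locale is assumed to be installed, and the pattern is assumed to be passed as UTF-8 bytes. A C program starts with every category in the "C" locale, so changing only LC_CTYPE isolates that category from LC_COLLATE and the rest.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Compile `pattern` with regcomp() while LC_CTYPE is set to
 * `ctype_locale`, then test it against `subject`.
 * Returns 1 on a match, 0 on no match, and -1 if the locale cannot be
 * selected or the pattern does not compile.
 * Hypothetical helper, for illustration only.
 */
int bracket_matches(const char *ctype_locale, const char *pattern,
                    const char *subject)
{
    regex_t re;
    int rc;

    /* Only LC_CTYPE is changed; all other categories stay as they were. */
    if (setlocale(LC_CTYPE, ctype_locale) == NULL)
        return -1;

    /* The locale in effect at compile time is the one under test. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling bracket_matches("C", "[Д]", "Г") and bracket_matches("en_US.utf8", "[Д]", "Г") from main and comparing the results would show whether LC_CTYPE alone changes what regcomp treats as one character. If regcomp mirrors the grep runs above, the first call would report a match and the second would not. A return value of -1 most likely means the locale is not installed.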
Source: https://stackoverflow.com/questions/40809460/what-does-constitute-one-character-for-regcomp-which-multibyte-encoding-does-de