Question
I've been seeing this issue on Windows. When I try to clear the whitespace on whitespace-only lines on Unix:
const input =
`===

HELLO

WOLRD

===`

console.log(input.replace(/^\s+$/gm, ''))
This produces what I expect:
===

HELLO

WOLRD

===
i.e. if there were spaces on blank lines, they'd get removed.
On the other hand, on Windows, the regex removes the WHOLE blank line, line break included. To illustrate:
const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')

console.log(input.replace(/^\s+$/gm, ''))
(Template literals always produce only \n in JS, so I had to replace it with \r\n to emulate Windows; the ? after \r is just to be sure, for those who don't believe.) The result:
===
HELLO
WOLRD
===
The whole line is gone! But my regex has ^ and $ with the m flag set, so it's kind of /^-to-$/m. What's the difference between \n and \r\n, then, that makes it produce different results?
When I do some logging:
console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))
With \r\n I'm seeing
matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===
and with \n only
matched
matched
matched
===

HELLO

WOLRD

===
Answer 1:
TL;DR: a pattern that matches whitespace, and therefore line breaks, will also match individual characters of a \r\n sequence if you let it.
First of all, let's actually examine what characters are there and aren't there when you do a replacement. Starting with a string that only uses line feeds:
const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");
console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')
debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);
const replaceLF = inputLF.replace(/^\s+$/gm, '');
console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')
debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);
console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);
console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)
function debugPrint(str, charIndex) {
console.log(`index: ${charIndex}
charcode: ${str.charCodeAt(charIndex)}
character: ${str.charAt(charIndex)}`
);
}
Each line ends with char code 10, which is the Line Feed (LF) character, represented in a string literal as \n. Before and after the replacement, the two strings are the same: they not only look the same but actually equal each other, so the replacement did nothing.
Now let's examine the other case:
const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n");
console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')
debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);
const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');
console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')
debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);
function debugPrint(str, charIndex) {
console.log(`index: ${charIndex}
charcode: ${str.charCodeAt(charIndex)}
character: ${str.charAt(charIndex)}`
);
}
This time each line ends with char code 13, which is the Carriage Return (CR) character, represented in a string literal as \r, and then the LF follows. After the replacement, instead of the sequence =\r\n\r\nH, there is now just =\r\nH. Let's look at why.
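Before digging into the cause, the invisible characters can be made visible by passing both strings through JSON.stringify. This is only a minimal sketch: the names crlfInput and crlfResult are made up for illustration, and the input is rebuilt with explicit \r\n separators instead of a template literal.

```javascript
// Build the same text with explicit CRLF separators (names are illustrative only).
const crlfInput = ["===", "", "HELLO", "", "WOLRD", "", "==="].join("\r\n");
const crlfResult = crlfInput.replace(/^\s+$/gm, "");

// JSON.stringify escapes CR and LF, so the exact characters become visible.
console.log(JSON.stringify(crlfInput));  // "===\r\n\r\nHELLO\r\n\r\nWOLRD\r\n\r\n==="
console.log(JSON.stringify(crlfResult)); // "===\r\nHELLO\r\nWOLRD\r\n==="
```

Each blank line has lost exactly two characters, and the remaining \r and \n happen to form a new \r\n pair.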
Here is what MDN says about the meta character ^:
Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.
And here is what MDN says about the meta character $:
Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.
So they match after and before a line break character. By that, MDN means either the LF or the CR (JavaScript also treats U+2028 and U+2029 as line terminators). This can be seen if we test strings that contain different line breaks:
const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";
const regexStart = /^\s/m;
const regexEnd = /\s$/m;
console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));
console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));
If we try to match whitespace next to a line break, nothing matches when there is only an LF, but with CRLF ^\s matches the LF and \s$ matches the CR. So, in that case ^ and $ would match here:
"hello\r\nworld"
        ^^ what `^\s` matches
"hello\r\nworld"
      ^^ what `\s$` matches
So both ^ and $ recognise either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, a line that consists entirely of \r\n does produce a match, but for a reason that is not obvious:
const re = /^\s+$/m;
const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";
console.log(re.exec(stringLF));
console.log(re.exec(stringCRLF));
So, the regex doesn't match a \r\n but rather the \n\r (two whitespace characters) that sits between two other line break characters. That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity (the brackets start at the first \r, which is the line break that ^ anchors after and not part of the match itself):
input = "hello\r\n\r\nworld"
regex = /^\s+$/m
Step 1
hello[\r]\n\r\nworld
matches `^`, symbol satisfied -> continue with next symbol in regex
Step 2
hello[\r\n]\r\nworld
matches `^\s+` -> continue matching to satisfy `+` quantifier
Step 3
hello[\r\n\r]\nworld
matches `^\s+` -> continue matching to satisfy `+` quantifier
Step 4
hello[\r\n\r\n]world
matches `^\s+` -> continue matching to satisfy `+` quantifier
Step 5
hello[\r\n\r\nw]orld
does not match `\s` -> backtrack
Step 6
hello[\r\n\r\n]world
matches `^\s+`, quantifier satisfied -> continue to next symbol in regex
Step 7
hello[\r\n\r\nw]orld
does not match `$` in `^\s+$` -> backtrack
Step 8
hello[\r\n\r]\nworld
matches `^\s+$` (`$` is satisfied just before the final `\n`), last symbol satisfied -> finish
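The outcome of the trace can be checked by listing every match together with its index. The sketch below assumes a runtime that supports String.prototype.matchAll; showMatches, traceCRLF and traceLF are hypothetical names introduced only for this illustration.

```javascript
// Hypothetical helper: list every match with its index, escaping CR/LF so they are visible.
function showMatches(str, regex) {
  // matchAll requires a regex with the g flag.
  return [...str.matchAll(regex)].map((m) => ({
    index: m.index,
    text: JSON.stringify(m[0]),
  }));
}

const traceCRLF = "hello\r\n\r\nworld";
const traceLF = "hello\n\nworld";

console.log(showMatches(traceCRLF, /^\s+$/gm)); // [ { index: 6, text: '"\\n\\r"' } ]
console.log(showMatches(traceLF, /^\s+$/gm));   // []
```

The single CRLF match starts at index 6, right after the first \r, and is two characters long, which agrees with the \n\r described above.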
Lastly, there is something slightly hidden here - it matters that you're matching whitespace. This is because it will behave differently to most other symbols in that it explicitly matches a line break character, whereas . will not:
Matches any single character except line terminators
So, if you specify \s$, this will match the CR in \r\n because the regex engine is forced to look for a match for both \s and $, and therefore it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied when it's before the CR (or at the end of the string).
The same goes for ^\s: it will explicitly look for a whitespace character after a line break, which is satisfied by the LF in CRLF. However, if you're not seeking that, then it will happily match after the LF:
const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";
const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;
console.log(stringLF.match(regexStartAll));
console.log(stringCRLF.match(regexStartAll));
console.log(stringLF.match(regexEndAll));
console.log(stringCRLF.match(regexEndAll));
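The difference between \s and . can also be checked one character at a time. This is a small sketch; the terminators object and its keys are just labels chosen for the example.

```javascript
// The four characters JavaScript treats as line terminators (labels are illustrative).
const terminators = { LF: "\n", CR: "\r", LS: "\u2028", PS: "\u2029" };

for (const [name, ch] of Object.entries(terminators)) {
  // \s accepts every one of them; . (without the s flag) rejects every one.
  console.log(name, "\\s:", /\s/.test(ch), ".:", /./.test(ch));
}
```

Every line of output reports true for \s and false for the dot, which is why a pattern built on \s can bite into a line break while one built on . cannot.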
So, all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.
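If the goal is only to strip spaces and tabs from otherwise blank lines, one possible way to stop the pattern from touching line breaks is to exclude CR and LF from the whitespace class. This is a sketch under that assumption, not something taken from the question; blankLinePadding and stripBlankLinePadding are hypothetical names.

```javascript
// Assumed fix: match only whitespace that is not CR or LF, so line breaks are never consumed.
const blankLinePadding = /^[^\S\r\n]+$/gm;

// Hypothetical helper: remove padding from whitespace-only lines, keeping the line breaks.
function stripBlankLinePadding(text) {
  return text.replace(blankLinePadding, "");
}

const paddedCRLF = "===\r\n  \t\r\nHELLO\r\n\r\n===";
const paddedLF = "===\n  \t\nHELLO\n\n===";

console.log(JSON.stringify(stripBlankLinePadding(paddedCRLF))); // "===\r\n\r\nHELLO\r\n\r\n==="
console.log(JSON.stringify(stripBlankLinePadding(paddedLF)));   // "===\n\nHELLO\n\n==="
```

The double negation [^\S\r\n] reads as "whitespace, but not \r or \n", so both line-ending styles keep their blank lines and only the padding is removed.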
Source: https://stackoverflow.com/questions/60729065/why-does-lf-and-crlf-behave-differently-with-s-gm-regex