I often forget about the regular expression modifiers m and s and their differences. What is a good way to remember them?
As I understand t
It's not uncommon to find someone who's been using regexes for years who still doesn't understand how those two modifiers work. As you observed, the names "multiline" and "singleline" are not very helpful. They sound like they must be mutually exclusive, but they're completely independent. I suggest you ignore the names and concentrate on what they do: m changes the behavior of the anchors (^ and $), and s changes the behavior of the dot (.).
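To make the "m is for anchors, s is for the dot" rule concrete, here is a minimal JavaScript sketch. The sample string and variable names are made up for illustration, and the s flag assumes an ES2018-capable engine such as a recent Node.js.

```javascript
// Hypothetical two-line sample string.
const text = "first\nsecond";

// m changes only the anchors (^ and $).
const anchorsPlain = /^second$/.test(text);  // ^ and $ mean start/end of the whole string
const anchorsMulti = /^second$/m.test(text); // ^ and $ also match at line boundaries

// s changes only the dot.
const dotPlain = /first.second/.test(text);  // . refuses to match \n
const dotAll = /first.second/s.test(text);   // . matches \n as well

// Each flag leaves the other feature alone.
const mLeavesDotAlone = /first.second/m.test(text);
const sLeavesAnchorsAlone = /^second$/s.test(text);

console.log({ anchorsPlain, anchorsMulti, dotPlain, dotAll });
console.log({ mLeavesDotAlone, sLeavesAnchorsAlone });
```

Only the m version of the anchored pattern and only the s version of the dot pattern should report true; the last two checks should both be false, which is the "completely independent" part.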
One prominent person who mixed up the modes is the author of Ruby. He created his own regex implementation based on Perl's, except he decided to have ^ and $ always be line anchors; that is, multiline mode is always on. Unfortunately, he also incorrectly named the dot-matches-everything mode multiline. So Ruby has no s modifier, but its m modifier does what s does in other flavors.
As for always using /ism, I recommend against it. It's mostly harmless, as you've discovered, but it sends a confusing message to anyone else who's trying to figure out what the regex was supposed to do (or even to yourself, in the future).
I like the explanation in 'man perlre':
m Treat string as multiple lines.
s Treat string as single line.
With multiple lines, ^ and $ apply to individual lines (i.e. just before and after newlines).
With a single line, ^ and $ apply to the whole string, and \n just becomes another character you can match.
[Wrong]By using both m and s as you described, I would expect the second one to take precedence, so you would always be in multiline mode with /ism.[/Wrong]
I didn't read far enough:
The "/s" and "/m" modifiers both override the $* setting. That is, no matter what $* contains, "/s" without "/m" will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
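The /ms combination described in that quote can be tried out with all four flag combinations. This is only a sketch in JavaScript rather than Perl; the input string and pattern are invented for the example, and the s flag assumes an ES2018 engine.

```javascript
// Hypothetical three-line input.
const input = "alpha\nbeta\ngamma";

// Needs ^ to match after a newline (m) and .* to cross a newline (s).
const source = "^beta.*gamma$";

const results = {};
for (const flags of ["", "m", "s", "ms"]) {
  results[flags || "(none)"] = new RegExp(source, flags).test(input);
}
console.log(results); // only the "ms" entry should be true

// One difference from the Perl text quoted above: without m, JavaScript's $
// matches only at the very end of the input, not before a trailing newline.
const trailingNewline = /beta$/.test("beta\n");
console.log(trailingNewline); // false in JavaScript
```

With m alone the dot stops at the newline, and with s alone ^ is stuck at the start of the string, so the pattern only succeeds when both are set.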
I can describe more clearly what they are, and a way to remember them, as they apply to JavaScript:

Traditional JavaScript does not have the s flag. It only has the m flag. As of January 2020, Firefox still doesn't have the s flag, while Chrome and NodeJS do. It is in the ES2018 spec. s is also called dotall or singleline, and it really is just for the . to match any character, including \n, \r, \u2028 (line separator), and \u2029 (paragraph separator). When people ask you what . matches and you answer "any character", that is not entirely correct: it matches every character except \n, \r, \u2028, and \u2029. For it to match really all characters, the s flag needs to be on.

To substitute for the s flag in Firefox or on any platform that lacks it, you can use [^] (a JavaScript-only idiom), [\s\S], [\d\D], etc., or (.|\s).

That covers the s flag that is missing in traditional JavaScript.

Now the m flag. It stands for multiline. And it really is very simple: without the m flag, ^ and $ match the beginning and end of the whole string only. So "John Doe\nMary Lee".match(/^John Doe$/) will not match, and "John Doe\nMary Lee".match(/^John Doe$/m) will match. That's all. Don't overthink it. It just changes how ^ and $ match.

If you want to match a, then whatever characters including newlines, and then f, but a must be at the beginning of a line and f must be at the end of a line, even within 2000 lines of text, then "a b c \n d e f\nha".match(/^a.*f$/ms) is what needs to be used: both . matching \n, and ^ and $ matching the beginning and end of a line.

That's it. The above was tested on NodeJS and Chrome, which already support the s flag (the m flag has long been supported). And remember, you can always work around a missing s flag by using [^].
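Here is a small sketch that checks the claims above about what the dot skips and about the s-flag substitutes. The variable names are only for illustration, and it assumes an engine that already supports the s flag.

```javascript
// The four characters the dot refuses to match without the s flag.
const terminators = ["\n", "\r", "\u2028", "\u2029"];

const plainDot = terminators.map(ch => /^.$/.test(ch));    // every entry false
const dotAllFlag = terminators.map(ch => /^.$/s.test(ch)); // every entry true

// The dot is not limited to ASCII: a non-ASCII letter matches without s.
const nonAscii = /^.$/.test("\u00e9");

// Substitutes for the s flag; [^] is specific to JavaScript.
const substitutes = [/^[^]$/, /^[\s\S]$/, /^[\d\D]$/, /^(.|\s)$/];
const substitutesWork = substitutes.map(re => terminators.every(ch => re.test(ch)));

// The examples from the answer.
const noM = "John Doe\nMary Lee".match(/^John Doe$/);   // null
const withM = "John Doe\nMary Lee".match(/^John Doe$/m); // matches "John Doe"
const both = "a b c \n d e f\nha".match(/^a.*f$/ms);
const bothNoS = "a b c \n d e f\nha".match(/^a[^]*f$/m); // same match without the s flag

console.log(plainDot, dotAllFlag, nonAscii, substitutesWork);
console.log(noM, withM[0], JSON.stringify(both[0]), JSON.stringify(bothNoS[0]));
```

The last line shows that swapping .* for [^]* gives the same match with only the m flag, which is the workaround mentioned for engines without s.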
Now, why was ms or ism used so much in the past? Because often, when we have a really long string (e.g. 2000 lines of HTML), such as web content we get back, we rarely want ^ to match the beginning of the entire string and $ the end of the entire string. That's why we use the m flag. We also probably want to match across lines, because (although using regex to match HTML is not recommended) we may use /<h1>.*?<\/h1>/ for a non-greedy match of a header, for example. We don't mind a \n in the content, because the author of the HTML may well have put one there (or not). That's why we use the "dotall" flag s.
But if you are trying to extract some info from a webpage, you probably won't care whether something is at the beginning or end of a line (HTML files can contain spaces, for example as indentation, and they usually don't affect the page content, unless there is a <pre> element or similar), so you won't need ^ or $, and therefore you can forget about the m flag. And if you don't mind using [^]*? instead of .*?, then you can forget about the s flag too. End of story.
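As a rough illustration of the header extraction discussed above, here is a sketch with a made-up HTML snippet. The usual caveat stands that regex is not a real HTML parser, and the capture groups are my addition for pulling out the text.

```javascript
// Hypothetical page fragment where the header spans several lines.
const html = "<body>\n  <h1>\n    Release notes\n  </h1>\n  <p>text</p>\n</body>";

// Without s, the dot stops at the first \n, so nothing is found.
const withoutS = html.match(/<h1>.*?<\/h1>/);

// With s, the non-greedy .*? can run across the newlines.
const withS = html.match(/<h1>(.*?)<\/h1>/s);

// Same result with no flags at all, using [^]*? instead of .*?
const withClass = html.match(/<h1>([^]*?)<\/h1>/);

// No ^ or $ is needed; trimming takes care of the indentation.
const title = withS[1].trim();

console.log(withoutS);                  // null
console.log(title);                     // Release notes
console.log(withClass[1] === withS[1]); // true
```

Note that neither working pattern uses ^ or $, so the m flag never comes into play here.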
Perl Cookbook said it in two sentences:
The difference between /m and /s is important: /m makes ^ and $ match next to a newline, while /s makes . match newlines. You can even use them together - they're not mutually exclusive options.
Maybe this way I will never forget:
When I want to match across lines (usually using .*? to match something where it doesn't matter if it spans multiple lines), I will naturally think of multiline, and therefore 'm'. Well, 'm' is actually not the one, so it is 's'.
(Since I already remember 'ism' so well, I can always remember that if it is not 'm', it must be 's'.)
Other lame attempts include:
s is for DOTALL: it is for the DOT to match ALL.
m is multiline: it is for ^ and $ to match a lot of times.
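The "match a lot of times" mnemonic can be checked by simply counting matches. This is a quick sketch with an invented three-line string; it assumes an engine with s flag support.

```javascript
// Hypothetical three-line string.
const lines = "x\ny\nz";

// m: ^ and $ get to match "a lot of times" (once per line).
const startsWithoutM = lines.match(/^/g).length;
const startsWithM = lines.match(/^/gm).length;
const endsWithoutM = lines.match(/$/g).length;
const endsWithM = lines.match(/$/gm).length;

// s: the DOT matches ALL, so one run of dots swallows every line.
const runsWithoutS = lines.match(/.+/g).length;
const runsWithS = lines.match(/.+/gs).length;

console.log(startsWithoutM, startsWithM); // 1 3
console.log(endsWithoutM, endsWithM);     // 1 3
console.log(runsWithoutS, runsWithS);     // 3 1
```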