I want to use re.MULTILINE but NOT re.DOTALL, so that I can have a regex that includes both an "any character" wildcard and the normal . wildcard.
[^]
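In a regex, brackets contain a list and/or range of possible values for one matching character. If that list is empty, i.e. [], no character of the string can match it.
A caret at the front of that list and/or range negates the permitted values. So, in front of an empty list, any character (including newline) will match.
Note that this holds in engines such as JavaScript's. Python's re module treats a ] directly after [ or [^ as a literal character, so [^] fails to compile there with an "unterminated character set" error.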
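Since the question is about Python, it is worth checking how the re module handles this pattern. Below is a minimal sketch; the helper name compiles is made up for illustration. It tries to compile [^] and, for comparison, [\s\S]:

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# In Python, a ] right after [^ is taken as a literal, so the class never closes.
print(compiles(r'[^]'))     # False
# A class built from two opposite shorthands compiles fine.
print(compiles(r'[\s\S]'))  # True
```

So in Python the empty negated class is not usable as written, and one of the character-class alternatives is needed instead.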
Regular expression (note that the class also contains a literal space ' '):
[\S\n\t\v ]
import re
text = 'abc def ###A quick brown fox.\nIt jumps over the lazy dog### ghi jkl'
# We want to extract "A quick brown fox.\nIt jumps over the lazy dog"
# The capture group makes findall return only the text between the ### markers
matches = re.findall(r'###([\S\n ]+)###', text)
print(matches[0])
matches[0] will contain:
'A quick brown fox.\nIt jumps over the lazy dog'
\S matches any character which is not a whitespace character
(see: https://docs.python.org/3/library/re.html#regular-expression-syntax).
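Because \S excludes every whitespace character, each whitespace character you want to allow has to be listed in the class explicitly. A small sketch with a made-up sample string containing a tab shows the difference between [\S\n ] and [\S\n\t\v ]:

```python
import re

# Sample text invented for illustration; note the tab between "quick" and "brown".
text = 'abc ###A quick\tbrown fox.\nIt jumps### def'

# The tab is not in this class, so no ###...### span can be matched.
narrow = re.findall(r'###([\S\n ]+)###', text)
# Adding \t and \v lets the class cross the tab as well.
wide = re.findall(r'###([\S\n\t\v ]+)###', text)

print(narrow)  # []
print(wide)    # ['A quick\tbrown fox.\nIt jumps']
```

If every whitespace character should be allowed anyway, listing them one by one is no different from using [\S\s].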
To match a newline, or "any symbol", without re.S/re.DOTALL, you may use any of the following:
[\s\S]
[\w\W]
[\d\D]
The main idea is that two opposite shorthand classes inside one character class together match any symbol there is in the input string.
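As a quick illustration with re.MULTILINE and without re.DOTALL (the sample text is made up), the plain . still stops at the end of the line, while each of the three classes runs across newlines:

```python
import re

text = 'start\nline one\nline two\nend'

# Without re.DOTALL the dot does not cross the newline.
dot_only = re.search(r'^line.*', text, re.MULTILINE).group()
print(repr(dot_only))  # 'line one'

# Each opposite-shorthand class matches newlines too.
results = {}
for cls in (r'[\s\S]', r'[\w\W]', r'[\d\D]'):
    results[cls] = re.search(r'^line' + cls + r'*', text, re.MULTILINE).group()
    print(cls, repr(results[cls]))  # 'line one\nline two\nend' each time
```

This way both kinds of wildcard can be used in the same pattern, which is what the question asks for.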
Compared to (.|\s) and other variations with alternation, the character class solution is much more efficient as it involves much less backtracking (when used with a * or + quantifier). Compare the small example: (?:.|\n)+ takes 45 steps to complete, while [\s\S]+ takes just 2 steps.