I'm trying to write a regex that will find all the sentences in a block of text. So far I've got this:
(?=(?<!mr)\.|(?<!mrs)\.|\?|!)+
This finds everything that delimits a sentence. I want the regex to find everything contained between those delimiters, but I don't really know where to go from here.
(Moved from your closed newer question)
In your case, the lookbehinds should come before the periods.
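A quick way to see why the placement matters is the sketch below. The sample string is made up for illustration. It also suggests why two separate alternatives do not work: each branch only rules out one title, so the other branch still lets the period through.

```python
import re

text = "mr. x"  # made-up sample string

# Lookbehind AFTER the period: it inspects text that ends in ".",
# which can never equal "mr", so the check always passes.
after = re.findall(r"\.(?<!mr)", text)

# Lookbehind BEFORE the period: it inspects the two characters that
# precede the period, finds "mr", and rejects the match.
before = re.findall(r"(?<!mr)\.", text)

# Two alternatives: the "mr" branch fails, but the "mrs" branch
# succeeds because the period is not preceded by "mrs".
alternated = re.findall(r"(?<!mr)\.|(?<!mrs)\.", text)

# Both lookbehinds in front of a single period: both must pass.
condensed = re.findall(r"(?<!mr)(?<!mrs)\.", text)

print(after, before, alternated, condensed)
```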
Condensed, your expression becomes the pattern shown below.
Update: to get the text between the delimiters, you could just split on it, discarding the delimiters:
# (?:(?<!mr)(?<!mrs)\.|\?|!)+
(?:
(?<! mr )
(?<! mrs )
\.
| \?
| !
)+
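As a rough sketch of how this could be used from Python: the expanded form needs `re.VERBOSE` so the layout whitespace is ignored. The sample text, the `re.IGNORECASE` flag (so that "Mr." and "Mrs." with capitals are also skipped), and the whitespace trimming are my own assumptions, not part of the pattern above.

```python
import re

# Expanded form of the pattern; re.VERBOSE ignores the layout whitespace.
# re.IGNORECASE is an assumption so that "Mr." / "Mrs." are covered too.
DELIMS = re.compile(r"""
    (?:
        (?<! mr )     # the period must not directly follow "mr"
        (?<! mrs )    # ... nor "mrs"
        \.
      | \?
      | !
    )+
""", re.VERBOSE | re.IGNORECASE)

# Made-up sample text.
text = "Hello there. I met Mr. Smith today! Did Mrs. Jones call?"

# Split on the delimiters, then trim whitespace and drop empty pieces.
sentences = [s.strip() for s in DELIMS.split(text) if s.strip()]
print(sentences)
# ['Hello there', 'I met Mr. Smith today', 'Did Mrs. Jones call']
```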
Or, split keeping delimiters
# ((?:(?<!mr)(?<!mrs)\.|\?|!)+)
(
(?:
(?<! mr )
(?<! mrs )
\.
| \?
| !
)+
)
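With the capturing group, `re.split` returns the delimiters as separate list items, so the result alternates between text and delimiter. One possible way to glue them back together is sketched here. The sample text and the `re.IGNORECASE` flag are assumptions for illustration.

```python
import re

# Delimiter pattern wrapped in a capturing group, so re.split keeps
# the delimiters in the returned list. IGNORECASE is an assumption.
DELIMS = re.compile(r"((?:(?<!mr)(?<!mrs)\.|\?|!)+)", re.IGNORECASE)

text = "Wait... what? I met Mr. Smith!"  # made-up sample text

parts = DELIMS.split(text)
# parts alternates: text, delimiter, text, delimiter, ..., trailing text

# Pair each piece of text with the delimiter run that followed it.
sentences = [
    (body + delim).strip()
    for body, delim in zip(parts[0::2], parts[1::2])
]
print(sentences)
# ['Wait...', 'what?', 'I met Mr. Smith!']
```

Note that `zip` drops any trailing text that has no closing delimiter. If that text matters, check `parts[-1]` separately.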
What about this:
import re
pattern = r'(?=(?<!mr)\.|(?<!mrs)\.|\?|!)+' # I'm assuming this does what you say it does :)
text_block = """long block of sentences"""
sentences = re.split(pattern, text_block)
sentences will be a list containing the resulting substrings. re.split will split text_block up into different elements of the returned list. It splits at each point where pattern matches.
Read about re here:
https://docs.python.org/2/howto/regex.html
EDIT (data imported from your closed newer question):
If you are getting symbols like ?, ! etc. captured into your returned list as well, you should try removing the outer parens, like this:
re.split(r"(?<!mr)(?<!mrs)\.|\?|!", somestring)
Ex:
sentences = [s for s in re.split(r"(?<!mr)(?<!mrs)\.|\?|!", somestring) if s]
Source: https://stackoverflow.com/questions/26511010/match-everything-delimited-by-another-regex