I'm currently trying to make a regex that will find all the sentences in a block of text, and so far I've got this:
(?=(?
(Moved from your closed newer question)
In your case, the lookbehinds should come before the periods.
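A quick way to see why the position matters, as a minimal sketch (the sample text is made up for illustration): a lookbehind placed after the period inspects characters that end in the period itself, so it can never find "mr" there.

```python
import re

text = "mr. Smith left. He was late."  # hypothetical sample text

# Lookbehind AFTER the period: the two characters before the current
# position always end in ".", so (?<!mr) can never fail.
after = re.findall(r"\.(?<!mr)", text)

# Lookbehind BEFORE the period: the two characters preceding the period
# are checked, so the period in "mr." is skipped.
before = re.findall(r"(?<!mr)\.", text)

print(len(after))   # 3
print(len(before))  # 2
```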
Condensing your expression, it is
Update: alternatively, you could just split on the delimiters, discarding them:
# (?:(?<!mr)(?<!mrs)\.|\?|!)+
(?:
(?<! mr )
(?<! mrs )
\.
| \?
| !
)+
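In Python, the expanded form above could be used roughly like this. This is only a sketch: the sample text is invented, and `re.IGNORECASE` is my own addition so that "Mr." and "Mrs." are covered as well as the lowercase forms.

```python
import re

# Expanded form of the expression; re.VERBOSE ignores the whitespace.
# re.IGNORECASE is an assumption, not part of the original expression.
delimiters = re.compile(r"""
    (?:
        (?<! mr ) (?<! mrs ) \.
      | \?
      | !
    )+
""", re.VERBOSE | re.IGNORECASE)

text = "Mr. Smith arrived. Did Mrs. Jones see him?! Yes."  # sample text

# Split on the delimiters, then drop empty pieces and stray whitespace.
sentences = [s.strip() for s in delimiters.split(text) if s.strip()]
print(sentences)  # ['Mr. Smith arrived', 'Did Mrs. Jones see him', 'Yes']
```

The trailing `+` makes a run such as `?!` count as a single delimiter, so no empty string appears between the two characters.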
Or split, keeping the delimiters:
# ((?:(?<!mr)(?<!mrs)\.|\?|!)+)
(
(?:
(?<! mr )
(?<! mrs )
\.
| \?
| !
)+
)
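With the capturing group, Python's `re.split` returns the delimiters between the pieces of text, so they can be glued back on. A minimal sketch, again with made-up sample text and `re.IGNORECASE` added as an assumption:

```python
import re

pattern = re.compile(r"((?:(?<!mr)(?<!mrs)\.|\?|!)+)", re.IGNORECASE)

text = "Mr. Smith arrived. Really?! Yes."  # hypothetical sample text

# parts alternates: text, delimiter, text, delimiter, ..., final piece.
parts = pattern.split(text)

# Re-attach each delimiter to the text that precedes it.
sentences = [
    (body + delim).strip()
    for body, delim in zip(parts[0::2], parts[1::2])
]
print(sentences)  # ['Mr. Smith arrived.', 'Really?!', 'Yes.']
```

Note that `zip` drops the final piece when no delimiter follows it, so text without closing punctuation would need to be handled separately.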
What about this:
import re
# Both lookbehinds sit before the period, so "mr." and "mrs." are not treated as sentence ends.
pattern = r'(?:(?<!mr)(?<!mrs)\.|\?|!)+'
text_block = """long block of sentences"""
sentences = re.split(pattern, text_block)
sentences will be a list containing the resulting substrings. re.split will split text_block up into different elements of the returned list. It splits at each point where pattern matches.
Read about re here:
https://docs.python.org/2/howto/regex.html
EDIT(data imported from your closed newer question):
If you are getting the symbols like ?, ! etc. captured into your returned list as well, you should try removing the outer parens (re.split also returns the text matched by any capturing group), and keep both lookbehinds in front of the period, like this:
re.split(r"(?<!mr)(?<!mrs)\.|\?|!", somestring)
Ex:
sentences = [s for s in re.split(r"(?<!mr)(?<!mrs)\.|\?|!", somestring) if s]
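The `if s` filter is there because re.split produces an empty string whenever the text ends with a delimiter. A small sketch of that behaviour, using a simplified stand-in pattern (`[.?!]`) and a made-up string rather than the pattern above:

```python
import re

somestring = "One. Two? Three!"  # hypothetical sample text

# No capturing group, so the delimiters themselves are not returned.
raw = re.split(r"[.?!]", somestring)
print(raw)      # ['One', ' Two', ' Three', '']

# The trailing '' comes from the final "!"; the filter removes it.
cleaned = [s for s in raw if s]
print(cleaned)  # ['One', ' Two', ' Three']
```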