How to match a paragraph using regex

误落风尘 2020-12-14 13:11

I have been struggling with Python regex for a while, trying to match paragraphs within a text, but I haven't been successful. I need to obtain the start and end positions o

5 Answers
  • 2020-12-14 13:29

    Which newline sequence does your text use? Suppose it is '\r\n'. If you want to match the paragraphs that start with Lorem, you can do it like this:

    import re

    pattern = re.compile(r'\r\nLorem.*\r\n')
    text = '...'    # your source text
    matchlist = pattern.findall(text)
    

    matchlist will contain every paragraph that starts with Lorem. The other two words can be handled the same way.
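
    Since the question asks for start and end positions, here is a minimal sketch of the same idea using finditer, which returns match objects with offsets. The sample text is made up, it assumes '\n' line endings with each paragraph on one line, and it swaps the literal newlines in the pattern for ^ and $ under re.MULTILINE so that a paragraph on the very first line is found too.

```python
import re

# Hypothetical sample text; each paragraph sits on a single line.
text = "Lorem ipsum dolor.\nVestibulum ante.\nLorem again here.\n"

# re.MULTILINE lets ^ and $ anchor at every line, so the first line
# is matched as well and no newline is consumed by the match.
pattern = re.compile(r"^Lorem.*$", re.MULTILINE)

# Collect (start, end, matched text) for every paragraph starting with Lorem.
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
for start, end, para in spans:
    print(start, end, repr(para))
```

    Slicing text[start:end] gives back the matched paragraph, which is a quick way to check the offsets.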

  • I tried to use the recommended regex with the default Java regex engine. That gave me a StackOverflowError several times, so in the end I rewrote the regex and optimized it a little more.

    This works fine for me in Java:

    (?s)(.*?[^\:\-\,])(?:$|\n{2,})
    

    This also handles an end of document without trailing newlines, and it tries to join a line that ends with ':', '-' or ',' to the next paragraph.

    To keep trailing blanks (spaces or tabs) from breaking the feature described above, I strip them beforehand with the following regex (Java spells the POSIX blank class as \p{Blank}):

    (?m)\p{Blank}+$
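
    For readers working in Python, as the question does, the two steps above can be sketched roughly as follows. This is an assumed port rather than the original Java code: the sample text is made up, and [ \t] stands in for the blank class.

```python
import re

# Hypothetical sample: the first line ends with ':' plus trailing blanks.
raw = "Shopping list:  \n\nmilk\neggs\n\nDone."

# Step 1: strip trailing spaces/tabs on every line ((?m) makes $ per-line).
cleaned = re.sub(r"(?m)[ \t]+$", "", raw)

# Step 2: a paragraph ends at a blank line or at the end of the text,
# unless its last character is ':', '-' or ','.
PARAGRAPH = re.compile(r"(?s)(.*?[^:\-,])(?:$|\n{2,})")

paragraphs = [m.group(1) for m in PARAGRAPH.finditer(cleaned)]
# Without step 1, the trailing blanks hide the ':' from the pattern.
unstripped = [m.group(1) for m in PARAGRAPH.finditer(raw)]

print(paragraphs)
print(unstripped)
```

    On the cleaned text the heading stays attached to the list that follows it; on the raw text the trailing spaces make the heading a paragraph of its own.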
    
  • 2020-12-14 13:32

    Using split is one way; you can also do it with a regular expression, like this:

    paragraphs = re.findall(r'(.+?\n\n|.+?$)', TEXT, re.DOTALL)
    

    The .+? is a lazy match: it matches the shortest substring that still lets the whole regex match. A greedy .+ would instead match as much as possible and swallow several paragraphs at once.

    So basically we want to find a sequence of characters (.+?) that ends with a blank line (\n\n) or with the end of the string ($). The re.DOTALL flag makes the dot match newlines as well, because a paragraph may span several lines with no blank line between them.
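
    The difference between the lazy and the greedy form can be seen in a small sketch. The sample text and variable names are made up for illustration; finditer is used so the start and end offsets are available too.

```python
import re

# Hypothetical sample: three paragraphs, the first spanning two lines.
TEXT = "first line\nsecond line\n\nnext paragraph\n\nlast one"

# Lazy version: each match stops at the first blank line it meets.
lazy = re.compile(r".+?\n\n|.+?$", re.DOTALL)
matches = list(lazy.finditer(TEXT))
paragraphs = [m.group() for m in matches]
spans = [(m.start(), m.end()) for m in matches]

# Greedy version: .+ runs on to the last blank line in the text.
greedy = re.findall(r".+\n\n|.+$", TEXT, re.DOTALL)

print(paragraphs)
print(spans)
print(greedy)
```

    Note that each lazy match still includes its trailing blank line; call rstrip("\n") on the matched text if only the paragraph body is wanted.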

  • 2020-12-14 13:38

    Try

    ^(.+?)\n\s*\n
    

    or

    ^(.+?)\r\n\s*\r\n
    

    Just do not forget to append an extra newline at the end of the text, otherwise the last paragraph will not match.
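
    A minimal Python sketch of the first pattern, under two assumptions that the answer does not state: re.MULTILINE so that ^ can anchor at the start of each paragraph, and re.DOTALL so that a paragraph may span several lines. The sample text is made up, and two newlines are appended so the padding works whether or not the text already ends with one.

```python
import re

# Hypothetical sample: the blank line after "gamma" contains spaces.
text = "alpha\nbeta\n\ngamma\n   \ndelta"

# Assumed flags: MULTILINE for ^ at each line start, DOTALL for
# paragraphs that span more than one line.
pattern = re.compile(r"^(.+?)\n\s*\n", re.MULTILINE | re.DOTALL)

# Without padding, the last paragraph has no blank line after it.
unpadded = [m.group(1) for m in pattern.finditer(text)]

# With padding, every paragraph is followed by a blank line.
padded = text + "\n\n"
paragraphs = [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(padded)]

print(unpadded)
print(paragraphs)
```

    Because the padding only adds characters at the end, the offsets of group 1 are valid for the original text as well.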

  • 2020-12-14 13:39

    You can split on double-newline like this:

    paragraphs = re.split(r"\n\n", DATA)
    

    Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

    for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
        print(match.start(), match.end())
    
    # Prints:
    # 0 214
    # 215 298
    # 299 589
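
    To see what these offsets cover, here is a small sketch on made-up data (the DATA value and the names below are illustrative only). Each match runs up to and including the first newline after the paragraph, which is why consecutive spans in the output above are one character apart.

```python
import re

# Hypothetical sample: three paragraphs separated by blank lines.
DATA = "abc\ndef\n\nghi\n\njkl"

# A paragraph is a run of non-newline characters, each optionally
# followed by a single newline; a second newline in a row ends it.
PARAGRAPH = re.compile(r"(?:[^\n]\n?)+")

spans = [(m.start(), m.end()) for m in PARAGRAPH.finditer(DATA)]
# Drop the trailing newline that each non-final match carries.
paragraphs = [DATA[start:end].rstrip("\n") for start, end in spans]

print(spans)
print(paragraphs)
```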
    