Python regex: match a full paragraph including newlines

烂漫一生 submitted on 2020-12-13 03:36:15

Question


I have a text file from which I want to match full paragraph blocks, but my current regex does not match the whole paragraph, including the newlines.

Text Example:

NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020.

From the text above, I want to match each full paragraph that starts with the word NOMEAR:

NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.


NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020.

What I have tried

import re
pattern = re.compile("NOMEAR (.*)", re.DOTALL)

for i, line in enumerate(open('pdf_text_tika.txt')):
    for match in re.finditer(pattern, line):
        print ('Found on line %s: %s' % (i+1, match.group()))

Output:

Found on line 1305: NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão

Found on line 1316: NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo


Answer 1:


You may use this simpler regex using MULTILINE mode:

^NOMEAR.+(?:\n.+)*

In python:

import re

pattern = re.compile(r'^NOMEAR.+(?:\n.+)*', re.MULTILINE)

with open('pdf_text_tika.txt', 'r') as file:
    data = file.read()

print(pattern.findall(data))
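
For reference, here is a minimal self-contained sketch of the same pattern applied to the sample text from the question, held in a string rather than read from a file. The names `sample` and `paragraphs` are illustrative and not part of the answer. Since `.` does not match a newline by default, `.+` consumes one line and `(?:\n.+)*` keeps consuming the following non-empty lines, so each match stops at the first blank line. The attempt in the question could not do this because it ran the regex against one line at a time.

```python
import re

# Sample text from the question, held in a string instead of a file.
sample = (
    "NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão\n"
    "OTHER TEXT GOES HERE\n"
    "....................\n"
    "020007/002832/2020.\n"
    "\n"
    "EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão\n"
    "OTHER TEXT GOES HERE\n"
    "....................\n"
    "020007/002832/2020.\n"
    "\n"
    "NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo\n"
    "OTHER TEXT GOES HERE\n"
    "....................\n"
    "020007/002832/2020.\n"
)

# ^ anchors at the start of any line because of re.MULTILINE.
pattern = re.compile(r'^NOMEAR.+(?:\n.+)*', re.MULTILINE)
paragraphs = pattern.findall(sample)

for number, paragraph in enumerate(paragraphs, start=1):
    print('Paragraph %d:' % number)
    print(paragraph)
    print()
```

Each matched paragraph ends at its last non-empty line, without the trailing newline.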





Answer 2:


Using this pattern:

(NOMEAR (?:.+\n)+)

And this code:

import re

pattern = re.compile(r'(NOMEAR (?:.+\n)+)')
text = 'NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão\n' \
    'OTHER TEXT GOES HERE\n' \
    '....................\n' \
    '020007/002832/2020.\n\n' \
    'EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão\n' \
    'OTHER TEXT GOES HERE\n' \
    '....................\n' \
    '020007/002832/2020.\n\n' \
    'NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo\n' \
    'OTHER TEXT GOES HERE\n' \
    '....................\n' \
    '020007/002832/2020.'

print(pattern.findall(text))

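The output is shown below, with the newlines reformatted for readability since it all came out on one line. Note that the second match is missing its last line: the pattern requires every line to end with \n, and the sample text has no trailing newline.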

['NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão\n
OTHER TEXT GOES HERE\n
....................\n
020007/002832/2020.\n',

'NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo\n
OTHER TEXT GOES HERE\n
....................\n']
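
The second match stops one line short because `(?:.+\n)+` only accepts lines that end in `\n`. One possible adjustment, sketched here as an assumption and not as part of the original answer, is to let each line end with either `\n` or the end of the string (`\Z`). The variable names are illustrative.

```python
import re

# Same sample as in the answer, deliberately without a trailing newline.
text = (
    'NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão\n'
    'OTHER TEXT GOES HERE\n'
    '....................\n'
    '020007/002832/2020.\n\n'
    'EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão\n'
    'OTHER TEXT GOES HERE\n'
    '....................\n'
    '020007/002832/2020.\n\n'
    'NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo\n'
    'OTHER TEXT GOES HERE\n'
    '....................\n'
    '020007/002832/2020.'
)

# Each line must be non-empty and end with a newline or the end of the string.
pattern = re.compile(r'NOMEAR (?:.+(?:\n|\Z))+')
matches = pattern.findall(text)

for match in matches:
    print(repr(match))
```

With this variant the last paragraph keeps its final line even when the input does not end with a newline.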



Answer 3:


Are you trying to find 2 matches in your sample text; i.e., the 2 portions that begin with NOMEAR and end with a period followed by either 2 newlines or the end of the whole text?

import re

text = """NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020."""

# Raw string so that \Z reaches the regex engine; \. matches the literal period.
pattern = re.compile(r"NOMEAR .*?\.(?:\n\n|\Z)", re.DOTALL)

matches = re.findall(pattern, text)

print("".join(matches))
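
If paragraphs are always separated by blank lines, the same result can probably be reached without matching the paragraph body at all. The sketch below swaps in a different technique from the one in this answer: it splits the text on blank lines and keeps the blocks that start with NOMEAR. The names `blocks` and `nomear_blocks` are illustrative.

```python
import re

text = """NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020."""

# Split on blank lines (which may contain stray whitespace).
blocks = re.split(r'\n\s*\n', text.strip())

# Keep only the paragraphs whose first word is NOMEAR.
nomear_blocks = [block for block in blocks if block.startswith('NOMEAR ')]

print('\n\n'.join(nomear_blocks))
```

This relies on the blank-line separator being consistent, which holds for the sample but is an assumption about the real file.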



Answer 4:


import re
t = """NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

EXONERAR DOUGLAS ALVES BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.

NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020."""

r = re.compile(r'(?=NOMEAR)(.*?)(?<=\d[.])', flags=re.S)

for i in r.finditer(t):
    print(i.group(0))

Output:

NOMEAR JOSIAS CARLOS BORRHER do cargo em comissão
OTHER TEXT GOES HERE
....................
020007/002832/2020.
NOMEAR RAFAEL DOS SANTOS PASSAGEM para exercer o cargo
OTHER TEXT GOES HERE
....................
020007/002832/2020.
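
The answer comes without explanation, so a brief note on how the pattern reads: `(?=NOMEAR)` asserts that the match starts at NOMEAR, `.*?` with `re.S` lazily consumes text across newlines, and `(?<=\d[.])` stops at the first position preceded by a digit and a period. That fits the sample, but it would end early if a digit followed by a period appeared sooner in the paragraph. The sketch below illustrates this with a made-up paragraph (the name and the phrase `no item 3. do anexo` are invented for illustration) and shows one possible variant, an assumption and not part of the answer, that stops at the blank line instead.

```python
import re

# Hypothetical paragraph: "3." occurs before the real end of the block.
t = (
    "NOMEAR FULANO DE TAL no item 3. do anexo\n"
    "OTHER TEXT GOES HERE\n"
    "020007/002832/2020.\n"
    "\n"
    "EXONERAR OUTRA PESSOA do cargo em comissão\n"
    "020007/002832/2020."
)

# Pattern from the answer: stops right after the first digit followed by a period.
by_digit = re.compile(r'(?=NOMEAR)(.*?)(?<=\d[.])', flags=re.S)
early = by_digit.search(t).group(0)
print(repr(early))

# Variant (assumption): stop just before a blank line or at the end of the text.
by_blank = re.compile(r'^NOMEAR.*?(?=\n\n|\Z)', flags=re.S | re.M)
full = by_blank.search(t).group(0)
print(repr(full))
```

The first pattern returns only the text up to `3.`, while the variant returns the whole three-line paragraph.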


Source: https://stackoverflow.com/questions/65010417/python-regex-match-full-paragraph-including-new-line
