Python regular expression to split paragraphs

执笔经年

How would one write a regular expression to use in Python to split paragraphs?

A paragraph is defined by 2 line breaks (\n). But one can have any amount of spaces/tabs ...

5 Answers

夕颜

2021-01-19 02:04

Are you trying to deduce the structure of a document in plain text? Are you doing what Docutils does?

You might be able to simply use the Docutils parser rather than roll your own.


佛祖请我去吃肉

2021-01-19 02:04

Not a regexp but really elegant:

```
from itertools import groupby

def paragraph(lines):
    # Group consecutive lines by whether they are whitespace-only;
    # each run of non-blank lines forms one paragraph.
    for group_separator, line_iteration in groupby(lines.splitlines(True), key=str.isspace):
        if not group_separator:
            yield ''.join(line_iteration)

for p in paragraph('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'):
    print(repr(p))
```

Output:

```
'p1\n'
'p2\t\n\tstill p2\t   \n'
'\tp3'
```

It's up to you to strip the output as needed, of course.

Inspired by the famous "Python Cookbook" ;-)


清酒与你

2021-01-19 02:21

Almost the same, but using non-greedy quantifiers and taking advantage of the whitespace escape sequence (\s), which also matches spaces and tabs around the line breaks.
```
\s*?\n\s*?\n\s*?
```
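
As a rough illustration, the pattern above can be handed to re.split. The function name split_paragraphs is made up for this sketch, and the final strip() is an extra step that is not part of the pattern: because the quantifiers are lazy, whitespace after the second line break stays attached to the following paragraph.

```python
import re

# The pattern from the answer above, unchanged.
PARAGRAPH_BREAK = re.compile(r"\s*?\n\s*?\n\s*?")

def split_paragraphs(text):
    """Split text on blank-line separators (hypothetical helper for illustration)."""
    pieces = PARAGRAPH_BREAK.split(text)
    # Assumption: callers want surrounding whitespace removed and empty pieces dropped.
    return [piece.strip() for piece in pieces if piece.strip()]

for p in split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'):
    print(repr(p))
```

Single line breaks inside a paragraph (as in the second paragraph of the sample) are left untouched, since the pattern needs two of them to match.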

死守一世寂寞

2021-01-19 02:27

Unfortunately there's no nice way to write "space but not a newline".

I think the best you can do is add some space with the x (verbose) modifier and try to factor out the ugliness a bit, but that's questionable:

```
(?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
```

You could also try creating a subrule just for the character class and interpolating it three times.
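
One way the "subrule" idea might look in Python is to keep the character class in a string and interpolate it into the pattern. This is only a sketch: the names HSPACE, PARAGRAPH_BREAK and split_on_blank_lines are invented here, and re.VERBOSE is used in place of the inline (?x) flag.

```python
import re

# Hypothetical subrule: whitespace that is not a newline.
HSPACE = r"[ \t\r\f\v]"

# Same structure as the pattern above, with the class interpolated.
# In verbose mode, whitespace outside character classes is ignored,
# while the space inside HSPACE's brackets is kept.
PARAGRAPH_BREAK = re.compile(
    rf"(?: {HSPACE}*? \n ){{2}} {HSPACE}*?",
    re.VERBOSE,
)

def split_on_blank_lines(text):
    """Split text wherever two line breaks occur, allowing spaces/tabs around them."""
    return PARAGRAPH_BREAK.split(text)

print(split_on_blank_lines('p1 \n\t\n  p2'))
```

Because the trailing quantifier is lazy, indentation after the second line break remains part of the next paragraph; strip the pieces afterwards if that is not wanted.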

無奈伤痛

2021-01-19 02:30

FYI: I just wrote 2 solutions for this type of problem in another thread. First using regular expressions as requested here, and second using a state machine approach which streams through the input one line at a time:

https://stackoverflow.com/a/64863601/5201675
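
For readers who only want the general shape of a line-at-a-time approach, here is a minimal sketch. It is not the code from the linked answer; the function name iter_paragraphs and the buffering strategy are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List

def iter_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Yield paragraphs from an iterable of lines, reading one line at a time."""
    buffer: List[str] = []      # lines of the paragraph currently being collected
    for line in lines:
        if line.strip():        # non-blank line: we are inside a paragraph
            buffer.append(line)
        elif buffer:            # blank line: the current paragraph is finished
            yield "".join(buffer)
            buffer = []
    if buffer:                  # flush the last paragraph at end of input
        yield "".join(buffer)

sample = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
for p in iter_paragraphs(sample.splitlines(True)):
    print(repr(p))
```

Since it accepts any iterable of lines, the same function should also work on an open file object without loading the whole file into memory.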
