Convert <br> to end line


别跟我提以往

I'm trying to extract some text using BeautifulSoup. I'm using the get_text() function for this purpose.

My problem is that the text contain


6 Answers

情歌与酒

2020-12-08 19:28
Adding to Ian's and dividebyzero's posts and comments, you can do this to filter and replace many tags in one go:
```
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
```

庸人自扰

2020-12-08 19:35
A regex should do the trick.
```
import re
# Raw string avoids the invalid-escape warning; /? also matches <br/> and <br />
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
```
Hope this helps!
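
A quick way to check a `<br>` pattern is to run it against the usual spellings of the tag. This is only a sketch: the sample string is made up, and the `re.IGNORECASE` flag is my own addition to catch uppercase `<BR>`. Keep in mind a regex like this has to be applied to the raw markup, before it goes through a parser.

```python
import re

# Made-up sample input covering the common ways <br> is written.
html = "one<br>two<br/>three<br />four<BR>five"

# \s*/? allows optional whitespace and an optional self-closing slash;
# IGNORECASE also catches <BR>.
text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
print(text)
```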

花落未央

2020-12-08 19:39

As the official documentation says:

You can specify a string to be used to join the bits of text together: soup.get_text("\n")
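
A minimal sketch of what that looks like, using made-up sample HTML and the built-in `html.parser`. One thing worth knowing: the separator is placed between every text node, not only at `<br>` tags, so inline tags such as `<b>` also produce line breaks.

```python
from bs4 import BeautifulSoup

# Made-up sample input: the only tag boundaries inside <p> are <br> tags.
soup = BeautifulSoup("<p>first line<br>second line<br/>third line</p>", "html.parser")
text = soup.get_text("\n")
print(text)

# Caveat: inline tags split the text into separate nodes too,
# so the separator shows up around "bold" as well.
inline = BeautifulSoup("<p>some <b>bold</b> text</p>", "html.parser")
inline_text = inline.get_text("\n")
print(repr(inline_text))
```

If the markup mixes `<br>` with inline formatting, replacing the `<br>` tags explicitly is probably the safer route.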


旧巷少年郎

2020-12-08 19:39

If you call element.text, you'll get the text without the br tags. You may need to define your own helper function for this purpose:

    from bs4 import BeautifulSoup

    def clean_text(elem):
        """Collect the text under elem, adding a newline for each <br> or <p>."""
        text = ''
        for e in elem.descendants:
            if isinstance(e, str):
                # text nodes (NavigableString) are str subclasses
                text += e.strip()
            elif e.name in ('br', 'p'):
                text += '\n'
        return text

    # get page content (request_response is assumed to be an HTTP response fetched earlier)
    soup = BeautifulSoup(request_response.text, 'html.parser')
    # get your target element
    description_div = soup.select_one('.description-class')
    # clean the data
    print(clean_text(description_div))


庸人自扰

2020-12-08 19:46
Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.

To steal the list from @petezurich:
```
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
```
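
For reference, here is roughly how the append approach behaves on a small snippet. The HTML string and the shortened tag list are made up for illustration. Because nothing is replaced, nested tags stay in the tree and each one contributes its own newline.

```python
from bs4 import BeautifulSoup

# Made-up sample: two paragraphs nested in a div, one containing a <br>.
html = "<div><p>one</p><p>two<br>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list up front, so modifying the tree in the loop is safe.
for elem in soup.find_all(["p", "div", "br"]):
    elem.append("\n")

text = soup.get_text()
print(repr(text))
```

Newlines pile up wherever several tags close together (here the last `<p>` and the `<div>`), so you may want to `strip()` the result or collapse repeated newlines afterwards.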

隐瞒了意图╮

2020-12-08 19:51
You can do this using the BeautifulSoup object itself, or any element of it:
```
for br in soup.find_all("br"):
    br.replace_with("\n")
```
