Suggestions on get_text() in BeautifulSoup


遇见更好的自我

I am using BeautifulSoup to parse some content from an HTML page.

I can extract from the html the content I want (i.e. the text contained in a span defin


3 answers

伪装坚强ぢ

2020-12-30 09:19
If you are using bs4 you can use strings:
```
" ".join(result.strings)
```
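
A minimal Python 3 sketch of this approach. The inline HTML is an assumption standing in for the real page, modelled on the span shown elsewhere in this thread, and the variable name joined is only illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption); the asker's actual page is not reproduced here.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find("span", class_="myclass")

# .strings yields each text node in document order; the <br/> tags yield none,
# so joining with a space puts a separator where each <br/> used to be.
joined = " ".join(result.strings)
print(joined)  # Lorem ipsum dolor sit amet, consectetur...
```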

無奈伤痛

2020-12-30 09:20

Use 'contents', then replace each <br>?

Here is a full (working, tested) example:

```
# Python 2 (uses urllib2 and the print statement)
from bs4 import BeautifulSoup
import urllib2

url = "http://www.floris.us/SO/bstest.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

result = soup.find(attrs={'class': 'myclass'})
print "The result of soup.find:"
print result

print "\nresult.contents:"
print result.contents
print "\nresult.get_text():"
print result.get_text()

# A <br/> tag holds no text, so its .string is None; give it a single space.
for r in result:
    if r.string is None:
        r.string = ' '

print "\nAfter replacing all the 'None' with ' ':"
print result.get_text()
```

Result:

```
The result of soup.find:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

result.contents:
[u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']

result.get_text():
Lorem ipsumdolor sit amet,consectetur...

After replacing all the 'None' with ' ':
Lorem ipsum dolor sit amet, consectetur...
```

This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: the <br/> is its own element in the result.contents list, but it holds no text, so get_text() adds nothing for it and the neighbouring strings run together.
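
For readers on Python 3, here is a hedged sketch of the same "replace the <br>" idea that needs no network access. It swaps in replace_with() rather than assigning to .string, which is a different technique from the loop in this answer, and the inline HTML is an assumption copied from the output shown in this answer.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption), copied from the printed result of soup.find.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# find_all() returns a list up front, so replacing tags while looping is safe.
for br in result.find_all("br"):
    br.replace_with(" ")  # swap each <br/> for a one-space text node

text = result.get_text()
print(text)  # Lorem ipsum dolor sit amet, consectetur...
```

Unlike setting .string, this removes the <br/> tags from the tree entirely, so use it only if the line breaks are not needed afterwards.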


被撕碎了的回忆

2020-12-30 09:36

result.get_text(separator=" ") should work.
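
A short Python 3 sketch comparing the default call with the separator argument. The inline HTML is an assumption standing in for the page from the question, and the names plain and spaced are only illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption); the asker's actual page is not reproduced here.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find("span", class_="myclass")

# Default: text nodes are concatenated with nothing between them.
plain = result.get_text()
# With a separator: the given string is placed between adjacent text nodes.
spaced = result.get_text(separator=" ")

print(plain)   # Lorem ipsumdolor sit amet,consectetur...
print(spaced)  # Lorem ipsum dolor sit amet, consectetur...
```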
