Suggestions on get_text() in BeautifulSoup

遇见更好的自我 2020-12-30 08:54

I am using BeautifulSoup to parse some content from an HTML page.

I can extract from the html the content I want (i.e. the text contained in a span defin

3 Answers
  • 2020-12-30 09:19

    If you are using bs4, you can use .strings:

    " ".join(result.strings)
    
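A minimal Python 3 sketch of the `.strings` approach above. The markup is a hypothetical stand-in modeled on the span shown elsewhere in this thread, and `html.parser` is an assumed parser choice; the original question's page is not reproduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the <span> discussed in this thread.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# .strings yields each text node in document order; the <br/> tags
# contain no text, so they contribute nothing to the sequence.
joined = " ".join(result.strings)
print(joined)
```

If the text nodes carry their own leading or trailing whitespace, `result.stripped_strings` is a variant that strips each piece before you join them.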
  • 2020-12-30 09:20

    Use 'contents', then replace <br>?

    Here is a full (working, tested) example:

    from bs4 import BeautifulSoup
    import urllib2
    
    url="http://www.floris.us/SO/bstest.html"
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    
    result = soup.find(attrs={'class':'myclass'})
    print "The result of soup.find:"
    print result
    
    print "\nresult.contents:"
    print result.contents
    print "\nresult.get_text():"
    print result.get_text()
    # A <br/> tag holds no text, so its .string is None; give it a space.
    for r in result:
        if r.string is None:
            r.string = ' '
    
    print "\nAfter replacing all the 'None' with ' ':"
    print result.get_text()
    

    Result:

    The result of soup.find:
    <span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>
    
    result.contents:
    [u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']
    
    result.get_text():
    Lorem ipsumdolor sit amet,consectetur...
    
    After replacing all the 'None' with ' ':
    Lorem ipsum dolor sit amet, consectetur...
    

    This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: the <br/> is its own element in the result.contents list, but it contains no text, so get_text() adds nothing for it and the neighboring strings run together.
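The example above targets Python 2 (`urllib2`, `print` statements). As a rough Python 3 sketch of the same idea, the snippet below parses an inline string instead of fetching the URL; the markup is copied from the printed result above, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Markup copied from the "result of soup.find" output above (no network fetch).
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# Text before any change: the <br/> tags add nothing between the strings.
before = result.get_text()

# Iterate over a copy of the children, since the loop modifies the tree.
for child in list(result.contents):
    # Text nodes return themselves from .string; an empty tag returns None.
    if child.string is None:
        child.string = " "

after = result.get_text()

print(before)
print(after)
```

Note that this modifies the parsed tree in place, so it suits cases where the soup is not reused for anything that depends on the empty `<br/>` tags.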

  • 2020-12-30 09:36

    result.get_text(separator=" ") should work.
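A short Python 3 sketch of the `separator` argument, using hypothetical markup modeled on the span from this thread and an assumed `html.parser` backend. It leaves the parsed tree untouched, which is the main difference from rewriting the `<br/>` tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the <span> discussed in this thread.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

result = BeautifulSoup(html, "html.parser").find("span", class_="myclass")

# The separator is placed between consecutive text nodes.
spaced = result.get_text(separator=" ")

# Any string works as a separator, e.g. a newline to mimic the <br/> breaks.
lines = result.get_text(separator="\n")

print(spaced)
print(lines)
```

`get_text()` also accepts `strip=True`, which strips whitespace from each text node before joining.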
