BeautifulSoup extract text from comment html [duplicate]

问题

Apologies if this question is simular to others, I wasn't able to make any of the other solutions work. I'm scraping a website using beautifulsoup and I am trying to get the information from a table field that's commented:

<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>

How do I get the part 'views' and 'interaction'?

回答1:

You need to extract the HTML from the comment and parse it again with BeautifulSoup like this:

from bs4 import BeautifulSoup, Comment
html = """<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>"""
soup = BeautifulSoup(html , 'lxml')
comment = soup.find(text=lambda text:isinstance(text, Comment))
commentsoup = BeautifulSoup(comment , 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction= commentsoup.find('span', {'class': 'interaction'})
print (views.get_text(), interaction['likes'])

Outputs:

1.56M Clicks 0

If the comment is not the first on the page you would need to index it like this:

comment = soup.find_all(text=lambda text:isinstance(text, Comment))[1]

or find it from a parent element.

Updated in response to comment:

You can use the parent 'tr' element for this. The page you supplied had "shares" not "interaction" so I expect you got a NoneType object which gave you the error you saw. You could add tests in you code for NoneType objects if you need to.

from bs4 import BeautifulSoup, Comment
import requests
url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html , 'lxml')

for tr in soup.find_all('tr'):
    comment = tr.find(text=lambda text:isinstance(text, Comment))
    commentsoup = BeautifulSoup(comment , 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares= commentsoup.find('span', {'class': 'shares'})
    print (views.get_text(), shares['data-shares'])

Outputs:

3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...

回答2:

The simplest and easiest solution would be to opt for .replace() function. All you need to do is kick out this  signs from the html elements and the rest are as it is. Take a look at the below script.

from bs4 import BeautifulSoup

htdoc = """
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
"""
elem = htdoc.replace("<!--","").replace("-->","")
soup = BeautifulSoup(elem,'lxml')
views = soup.select_one('span.views').get_text(strip=True)
likes = soup.select_one('span.interaction')['likes']
print(f'{views}\n{likes}')

Output:

1.56M Clicks
0

回答3:

If you want only the views then:

views = soup.findAll("span", {"class": "views"})

You also can get the whole paragraph with:

p = soup.findAll("p", {"class": "statistics"})

Then you can get the data from the p.

来源：https://stackoverflow.com/questions/52679150/beautifulsoup-extract-text-from-comment-html