BeautifulSoup extract text from comment html [duplicate]

依然范特西╮ posted on 2020-08-20 07:18:50

Question


Apologies if this question is similar to others; I wasn't able to make any of the other solutions work. I'm scraping a website using BeautifulSoup and I am trying to get the information from a table cell whose contents are commented out:

<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>

How do I get the 'views' and 'interaction' elements?


Answer 1:


BeautifulSoup keeps a comment as a single Comment string, so the tags inside it are not parsed. You need to extract the HTML from the comment and parse it again with BeautifulSoup like this:

from bs4 import BeautifulSoup, Comment
html = """<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>"""
soup = BeautifulSoup(html, 'lxml')
# Find the first Comment node, then parse its contents as HTML.
comment = soup.find(string=lambda text: isinstance(text, Comment))
commentsoup = BeautifulSoup(comment, 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction = commentsoup.find('span', {'class': 'interaction'})
print(views.get_text(), interaction['likes'])

Outputs:

1.56M Clicks 0

If the comment is not the first on the page you would need to index it like this:

comment = soup.find_all(string=lambda text: isinstance(text, Comment))[1]

or find it from a parent element.
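As an alternative to a hard-coded index, one option is to loop over every comment and keep the first one whose contents include the element you want. The following is a minimal sketch under assumptions: the helper name `find_statistics`, the extra "unrelated note" comment, and the use of the built-in `html.parser` (instead of `lxml`) are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment

html = """<table><tr>
<td><!-- unrelated note --></td>
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")


def find_statistics(node):
    """Hypothetical helper: return the first <p class="statistics"> found
    inside any comment under `node`, or None if no comment contains one."""
    for comment in node.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text so its tags become searchable.
        inner = BeautifulSoup(comment, "html.parser")
        stats = inner.find("p", class_="statistics")
        if stats is not None:
            return stats
    return None


stats = find_statistics(soup)
views = stats.find("span", class_="views")
print(views.get_text(strip=True), views["clicks"])
```

Passing a parent element such as a single `td` or `tr` instead of the whole `soup` restricts the search to the comments inside that element.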

Updated in response to comment:

You can use the parent 'tr' element for this. The page you supplied has "shares", not "interaction", so I expect find() returned None, which caused the error you saw. You could add checks for None in your code if you need to.

from bs4 import BeautifulSoup, Comment
import requests
url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html , 'lxml')

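for tr in soup.find_all('tr'):
    comment = tr.find(string=lambda text: isinstance(text, Comment))
    if comment is None:
        continue  # this row has no commented-out block
    commentsoup = BeautifulSoup(comment, 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares = commentsoup.find('span', {'class': 'shares'})
    if views is None or shares is None:
        continue  # the comment does not hold the expected spans
    print(views.get_text(), shares['data-shares'])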

Outputs:

3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...
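The answer suggests checking for None where a span may be missing. One way to keep those checks in one place is a small helper that returns a default instead of raising. This is only a sketch: the name `attr_or_default` and the sample fragment are assumptions made for illustration, and `html.parser` is used so the example needs no extra parser.

```python
from bs4 import BeautifulSoup


def attr_or_default(parent, css_class, attr, default=None):
    """Hypothetical helper: return attribute `attr` of the first <span>
    with class `css_class` under `parent`, or `default` if it is missing."""
    tag = parent.find("span", class_=css_class)
    if tag is None:
        return default
    return tag.get(attr, default)


fragment = BeautifulSoup(
    '<p class="statistics"><span class="shares" data-shares="0"></span></p>',
    "html.parser",
)

shares = attr_or_default(fragment, "shares", "data-shares")
likes = attr_or_default(fragment, "interaction", "likes", default="n/a")
print(shares, likes)
```

Here the "shares" span exists, so its attribute value is returned, while the absent "interaction" span yields the default rather than a TypeError.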



Answer 2:


The simplest solution is to use str.replace(): strip the <!-- and --> markers from the HTML before parsing, so the commented-out elements become ordinary tags. Note that this uncomments every comment in the document. Take a look at the script below.

from bs4 import BeautifulSoup

htdoc = """
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
"""
elem = htdoc.replace("<!--","").replace("-->","")
soup = BeautifulSoup(elem,'lxml')
views = soup.select_one('span.views').get_text(strip=True)
likes = soup.select_one('span.interaction')['likes']
print(f'{views}\n{likes}')

Output:

1.56M Clicks
0
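Because the two replace() calls remove every comment marker in the document, plain-text comments would end up as visible text. If that matters, a regular expression can unwrap only the comments whose body starts with a tag. This is a rough sketch, not part of the original answer: the pattern name `COMMENTED_MARKUP`, the extra "layout note" comment, and the use of `html.parser` are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup, Comment

htdoc = """<td>
    <!-- layout note: keep this cell narrow -->
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>"""

# Match a comment only when its body begins with "<" (optionally after
# whitespace); the captured group is the markup without the comment markers.
COMMENTED_MARKUP = re.compile(r"<!--\s*(<.*?)-->", re.DOTALL)
unwrapped = COMMENTED_MARKUP.sub(r"\1", htdoc)

soup = BeautifulSoup(unwrapped, "html.parser")
views = soup.find("span", class_="views").get_text(strip=True)
likes = soup.find("span", class_="interaction")["likes"]
remaining = soup.find_all(string=lambda s: isinstance(s, Comment))
print(views, likes, len(remaining))
```

The plain-text comment is left in place as a Comment node, while the commented-out paragraph is parsed as regular markup.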



Answer 3:


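If you want only the views, run the search on a soup built from the comment's contents (commentsoup in the first answer, or soup in the second answer); searching the original page will not find tags that are still inside a comment:

views = soup.find_all("span", {"class": "views"})

You can also get the whole paragraph with:

p = soup.find_all("p", {"class": "statistics"})

Then you can get the data from p. Note that find_all returns a list, so index it (or use find) before reading text or attributes.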
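To illustrate getting the data out of the paragraph, here is a minimal sketch that first parses the comment and then collects each span's attributes into a dictionary keyed by its class. The dictionary layout and the variable names are assumptions chosen for the example, and `html.parser` is used in place of `lxml`.

```python
from bs4 import BeautifulSoup, Comment

html = """<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>"""

soup = BeautifulSoup(html, "html.parser")

# The paragraph lives inside a comment, so parse the comment text first.
comment = soup.find(string=lambda s: isinstance(s, Comment))
inner = BeautifulSoup(comment, "html.parser")
p = inner.find("p", class_="statistics")

# Map each span's first class name to its remaining attributes.
stats = {
    span["class"][0]: {k: v for k, v in span.attrs.items() if k != "class"}
    for span in p.find_all("span")
}
print(stats)
```

With the sample markup, `stats["views"]["clicks"]` holds the raw click count and `stats["interaction"]["likes"]` the like count, both as strings.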



Source: https://stackoverflow.com/questions/52679150/beautifulsoup-extract-text-from-comment-html
