BeautifulSoup Grab Visible Webpage Text

北恋 2020-11-22 07:35

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

10 Answers
  •  孤独总比滥情好
    2020-11-22 07:53

    The title is inside an <nyt_headline> tag, which is nested inside two enclosing tags, one of which has id "article".

    soup.find_all('nyt_headline', limit=1)
    

    Should work.

    The article body is inside an <nyt_text> tag, which is nested inside a tag with id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It's difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.

    text = soup.find_all('nyt_text', limit=1)[0]
    text.find_all('p')
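
The two snippets in this answer can be combined into one self-contained sketch. The sample markup, the `<h1>` and `<div>` wrappers in it, and the `extract_article` helper are all assumptions made for illustration; they only mirror the structure described above and are not taken from the real page. The sketch also uses `find()` rather than `find_all(..., limit=1)[0]`, because `find()` returns `None` when nothing matches instead of raising an `IndexError`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup that mirrors the structure described in the
# answer; the wrapper tags and text are invented for this example.
SAMPLE_HTML = """
<div id="article">
  <h1><nyt_headline>Example Headline</nyt_headline></h1>
  <div id="articleBody">
    <nyt_text>
      <p>First paragraph.</p>
      <img src="photo.jpg" alt="not part of the text">
      <p>Second paragraph.</p>
    </nyt_text>
  </div>
</div>
"""


def extract_article(html):
    """Return (title, paragraphs) from markup shaped like SAMPLE_HTML."""
    # html.parser ships with Python and lowercases tag names while parsing.
    soup = BeautifulSoup(html, "html.parser")

    # find() gives the first match, or None if the tag is absent.
    headline = soup.find("nyt_headline")
    body = soup.find("nyt_text")

    title = headline.get_text(strip=True) if headline else None
    # Only <p> tags are collected, so the <img> contributes nothing.
    paragraphs = (
        [p.get_text(" ", strip=True) for p in body.find_all("p")] if body else []
    )
    return title, paragraphs


title, paragraphs = extract_article(SAMPLE_HTML)
print(title)
print(paragraphs)
```

On a real page the same calls should apply as long as the tag names match, but that depends on the page's markup and has not been checked here.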
    
