BeautifulSoup does not read 'full' HTML obtained by requests

Submitted by 只谈情不闲聊 on 2021-02-11 02:54:25

Question


I am trying to scrape URLs from a website's HTML using the BeautifulSoup and requests libraries, both running on Python 3.5. The request itself seems to succeed: when I display page.content, I see the full HTML of the website I am trying to scrape. However, when I pass this to BeautifulSoup, it drops the bulk of the HTML, including the URL I am trying to scrape.

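from bs4 import BeautifulSoup
import requests

# requests needs the URL scheme; without it, requests.get() raises MissingSchema
page = requests.get('https://www.example.com')
soup = BeautifulSoup(page.content, 'html.parser')

# find_all is the current name; findAll is a legacy alias
print(soup.find_all('div'))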

I have already tried other parsers such as html5lib and lxml, without success.

The output does not show all the 'div' elements that are actually in the website's HTML.

This is the link to the website.

I want to scrape the URL from 'h1.post-title'.


Answer 1:


This is because the page you're scraping is dynamic: its content is generated with JavaScript and takes some time to render fully, so it is not present in the initial static HTML.
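To illustrate the point, here is a minimal sketch showing that BeautifulSoup only parses the markup it is handed and never runs scripts. The HTML below is made up for the demonstration (the `posts` id and the script are hypothetical, not taken from the site in the question).

```python
from bs4 import BeautifulSoup

# Hypothetical page: the server sends an empty container, and a script
# would add the <h1 class="post-title"> element in a real browser.
STATIC_HTML = """
<html>
  <body>
    <div id="posts"></div>
    <script>
      var heading = document.createElement("h1");
      heading.className = "post-title";
      document.getElementById("posts").appendChild(heading);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The parser sees the markup exactly as delivered; the script is never run.
divs = soup.find_all("div")
titles = soup.find_all("h1", class_="post-title")

print(len(divs))    # the empty container is present
print(len(titles))  # the script-generated heading is not
```

If the same thing happens on the real page, switching parsers will not help, since the missing elements were never in the downloaded HTML to begin with.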

You should use something like Selenium or Puppeteer to load the page, wait for it to fully render, then scrape the content you need to extract.
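A rough sketch of the Selenium route might look like the following. It assumes Selenium 4 with a local Chrome install, uses a placeholder URL, and assumes each 'h1.post-title' wraps an `<a href>` link; the helper names `fetch_rendered_html` and `extract_post_links` are made up for this example.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load the page in a real browser and return the HTML after rendering."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        # Wait until at least one post title exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1.post-title"))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_post_links(html):
    """Return the href of the first link inside each h1.post-title."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h1", class_="post-title"):
        anchor = heading.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


if __name__ == "__main__":
    # Placeholder URL; substitute the page you actually want to scrape.
    html = fetch_rendered_html("https://www.example.com")
    print(extract_post_links(html))
```

Keeping the parsing step separate from the browser step means BeautifulSoup can still be used for extraction; only the way the HTML is obtained changes.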



Source: https://stackoverflow.com/questions/54568529/beautifulsoup-does-not-read-full-html-obtained-by-requests
