Question
I am trying to scrape URLs from a website's HTML using the BeautifulSoup and requests libraries, both running on Python 3.5. It seems I am successfully getting the HTML from requests, because when I display r.content, the full HTML of the website I am trying to scrape is shown. However, when I pass this to BeautifulSoup, it drops the bulk of the HTML, including the URL I am trying to scrape.
from bs4 import BeautifulSoup
import requests

# requests needs a full URL including the scheme (http:// or https://),
# otherwise it raises requests.exceptions.MissingSchema.
page = requests.get('https://www.example.com')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('div'))
I have already tried other parsers such as html5lib and lxml, without success.
The output does not include all of the 'div' elements that are actually present in the website's HTML.
This is the link to the website.
I want to scrape the URL from 'h1.post-title'.
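A quick offline check (a minimal sketch, independent of the site in question) shows that BeautifulSoup keeps every element it is actually given, which rules out the parser as the culprit: if a div is missing from the parsed soup, it was never in page.content to begin with.

```python
from bs4 import BeautifulSoup

# BeautifulSoup preserves all <div> tags present in its input,
# including nested ones.
html = '<div id="a"></div><div id="b"><div id="c"></div></div>'
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('div')))  # 3
```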
Answer 1:
This is because the page you're scraping is dynamic: its content is generated by JavaScript after the initial load, so it is not present in the static HTML that requests receives, and it takes some time to render fully.
You should use something like Selenium or Puppeteer to load the page, wait for it to render fully, and then scrape the content you need to extract.
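A minimal sketch of that approach with Selenium, assuming chromedriver is installed and on your PATH; the selector 'h1.post-title a' is an assumption about the target page's markup based on the question, and the URL is a placeholder:

```python
from bs4 import BeautifulSoup


def extract_post_urls(html):
    """Parse rendered HTML and return the href of every link inside h1.post-title."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.select('h1.post-title a') if a.has_attr('href')]


def fetch_rendered_html(url, timeout=10):
    """Render `url` in headless Chrome and return the final HTML.

    Assumes chromedriver is available; waits until at least one
    h1.post-title element exists in the DOM (up to `timeout` seconds).
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    opts = webdriver.ChromeOptions()
    opts.add_argument('--headless=new')
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, 'h1.post-title'))
        return driver.page_source
    finally:
        driver.quit()


# Usage (placeholder URL):
#   urls = extract_post_urls(fetch_rendered_html('https://www.example.com'))
```

Separating the rendering step from the parsing step keeps the BeautifulSoup logic testable on static HTML without a browser.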
Source: https://stackoverflow.com/questions/54568529/beautifulsoup-does-not-read-full-html-obtained-by-requests