Webscraping Using BeautifulSoup: Retrieving source code of a website

问题

Good day! I am currently making a web scraper for Alibaba website. My problem is that the returned source code does not show some parts that I am interested in. The data is there when I checked the source code using the browser, but I can't retrieve it when using BeautifulSoup. Any tips?

from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144" soup2 = make_soup(url)

I am interested in the highlighted part as shown in the image using the Developer Tools of Chrome. But when I tried writing in a text file, some parts including the highlighted is nowhere to be found. Any tips? TIA!

回答1:

You need to provide the User-Agent header at least.

Example using requests package instead of urllib2:

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)

print(soup.select_one("a.next").get('href'))

Prints http://www.alibaba.com/catalogs/products/CID144/2.

来源：https://stackoverflow.com/questions/34318036/webscraping-using-beautifulsoup-retrieving-source-code-of-a-website

标签

python

html

beautifulsoup

html-parsing