Webscraping Using BeautifulSoup: Retrieving source code of a website

家住魔仙堡 posted on 2020-08-24 01:30:24

Question


Good day! I am writing a web scraper for the Alibaba website. My problem is that the HTML my script retrieves is missing some of the parts I am interested in. The data is there when I view the page source in the browser, but I can't retrieve it with BeautifulSoup. Any tips?

from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)

I am interested in the part highlighted in the image, as shown in Chrome's Developer Tools. But when I write the retrieved HTML to a text file, some parts, including the highlighted one, are nowhere to be found. Any tips? TIA!


Answer 1:


You need to provide at least the User-Agent header.

Example using the requests package instead of urllib2:

import requests
from bs4 import BeautifulSoup

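def make_soup(url):
    # Send a browser-like User-Agent header with the request.
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
    try:
        html = requests.get(url, headers=headers).content
    except requests.RequestException:
        # Network-level failure (DNS, connection, timeout, ...).
        return None
    return BeautifulSoup(html, "lxml")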

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)

print(soup.select_one("a.next").get('href'))

Prints http://www.alibaba.com/catalogs/products/CID144/2.
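
If you would rather stay with urlopen as in the question, the same header can probably be attached through a Request object. This is only a minimal sketch, assuming Python 3's urllib.request; the names USER_AGENT, build_request and fetch_html are made up for illustration, and the User-Agent string is just an example browser string:

```python
from urllib.request import Request, urlopen

# Example browser-like User-Agent string (an assumption; any current
# browser string should serve the same purpose).
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
)


def build_request(url):
    """Return a Request for url that carries a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Fetch url and return the raw response bytes, or None on a network error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return None
```

The bytes returned by fetch_html(url) can then be passed to BeautifulSoup(html, "lxml") exactly as in make_soup above, after checking for None.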



Source: https://stackoverflow.com/questions/34318036/webscraping-using-beautifulsoup-retrieving-source-code-of-a-website
