Question
Good day! I am writing a web scraper for the Alibaba website. My problem is that the HTML I get back is missing some of the parts I am interested in. The data is there when I inspect the page source in the browser, but I cannot retrieve it with BeautifulSoup. Any tips?
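from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    # Fetch the page; return None if the request fails.
    try:
        html = urlopen(url).read()
    except Exception:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)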
I am interested in the highlighted part shown in the image, taken from Chrome's Developer Tools. But when I write the result to a text file, some parts, including the highlighted one, are nowhere to be found. Thanks in advance!
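A quick way to confirm this kind of symptom is to test the downloaded HTML for the element directly, rather than searching a text file by eye. The sketch below uses only the standard library; the helper names and the two sample strings are hypothetical stand-ins for a browser response and a scripted-client response, and the tag and class are placeholders for whatever element is missing.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any <tag> carrying the given CSS class appears."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def html_contains(html, tag, css_class):
    """Return True if the HTML text has a <tag> with the given class."""
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    return finder.found


# Hypothetical samples, not real responses from the site.
browser_like = '<div><a class="next" href="/catalogs/products/CID144/2">Next</a></div>'
stripped = "<div><p>Please enable JavaScript</p></div>"

print(html_contains(browser_like, "a", "next"))  # True
print(html_contains(stripped, "a", "next"))      # False
```

If the check fails on the raw response, the server sent different HTML to the script than to the browser, so the parser is not at fault.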
Answer 1:
You need to provide at least the User-Agent header.
Example using the requests package instead of urllib2:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/47.0.2526.80 Safari/537.36"
}

def make_soup(url):
    try:
        html = requests.get(url, headers=HEADERS).content
    except requests.RequestException:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)
print(soup.select_one("a.next").get("href"))
Prints: http://www.alibaba.com/catalogs/products/CID144/2
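If switching to requests is not an option, the same User-Agent header can probably be sent with the standard library alone. This is a minimal sketch, not a tested scraper for this site: `build_request`, `fetch_html`, and `DEFAULT_USER_AGENT` are hypothetical names, and the timeout value is an assumption.

```python
from urllib.request import Request, urlopen

# Assumed value: the browser User-Agent string used in the answer above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page body as bytes, or return None on a network error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


# Building the request does not touch the network, so the header can be inspected.
request = build_request("http://www.alibaba.com/Agricultural-Growing-Media_pid144")
print(request.get_header("User-agent"))  # urllib stores the key as "User-agent"
```

The bytes returned by `fetch_html` can be passed to `BeautifulSoup(html, "lxml")` exactly as in `make_soup`.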
Source: https://stackoverflow.com/questions/34318036/webscraping-using-beautifulsoup-retrieving-source-code-of-a-website