BeautifulSoup isn't working while web scraping Amazon

问题

I'm new to web scraping and i am trying to use basic skills on Amazon. I want to make a code for finding top 10 'Today's Greatest Deals' with prices and rating and other information.

Every time I try to find a specific tag using find() and specifying class it keeps saying 'None'. However the actual HTML has that tag. On manual scanning i found out half the code of isn't being displayed in the output terminal. The code displayed is half but then the body and html tag do close. Just a huge chunk of code in body tag is missing.

The last line of code displayed is:

<!--[endif]---->

then body tag closes.

Here is the code that i'm trying:

from bs4 import BeautifulSoup as bs
import requests

source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')

print(soup.prettify())
#On printing this it misses some portion of html

article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
#On printing this it shows 'None'

Ideally, this should give me the code within the div tag, so that i can continue further to get the name of the product. However the output just shows 'None'. And on printing the whole code without tags it is missing a huge chunk of html inside.

And of course the information needed is in the missing html code.

Is Amazon blocking my request? Please help.

回答1:

The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.

(Source: http://go-colly.org/articles/scraping_related_http_headers/)

The only thing you need to do is to set a legitimate user-agent. Therefore add headers to emulate a browser. :

# This is a standard user-agent of Chrome browser running on Windows 10 
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }

Example:

from bs4 import BeautifulSoup
import requests 
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text 
soup = BeautifulSoup(resp, 'html.parser') 
...
<your code here>

Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:

headers = { 
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip', 
'DNT' : '1', # Do Not Track Request Header 
'Connection' : 'close'
}

来源：https://stackoverflow.com/questions/56049736/beautifulsoup-isnt-working-while-web-scraping-amazon

标签

python

web-scraping

beautifulsoup