Question
I'm new to web scraping, have little exposure to HTML, and wanted to know if there is a better, more efficient way to search for required content in the HTML of a web page. Currently, I want to scrape reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem
For this, I have the following code:
import re
import sys
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem'
review_url = url

#-------------------------------------------------------------------------
# Scrape the ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True
while more:
    # Open the URL to get the review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
        elif hasattr(e, 'code'):
            if e.code == 404:
                print 'Error: ', e.code
                sys.exit()
    content = page.read()
    soup = BeautifulSoup(content)
    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
With this, I'm not able to read anything. I'm guessing I have to point it at a more precise location in the page. Any suggestions?
Answer 1:
I understand that the question was initially about BeautifulSoup, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.
Selenium drives a real browser, so you don't have to deal with parsing the results of AJAX calls yourself. For example, here's how you can get the list of review titles and ratings from the first reviews page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()
prints:
Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
clorox wipes 5 out of 5
...
Also, take into account that you can run the same code with a headless PhantomJS browser (example).
Another option is to make use of the Walmart API.
Hope that helps.
Answer 2:
The reviews are loaded with an AJAX call, so you won't find them at the link you provided. They are loaded from the following URL:
http://walmart.ugc.bazaarvoice.com/1336/29701960/reviews.djs?format=embeddedhtml&dir=desc&sort=relevancy
Here, 29701960 is the product id, which you can find in the HTML source of the original page, for example in:

<meta property="og:url" content="http://www.walmart.com/ip/29701960" />

or in:

trackProductId : '29701960',

And 1336 is the Bazaarvoice display code, which also appears in the source:

WALMART.BV.scriptPath = 'http://walmart.ugc.bazaarvoice.com/static/1336/';
Using these values, build the URL above and parse the data from there using BeautifulSoup.
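Extracting the two values can be done with simple regular expressions before building the reviews URL. A minimal sketch (the function name and regex patterns are my own; the sample source fragments are the ones quoted above):

```python
import re


def build_reviews_url(html):
    # Product id: take it from the og:url meta tag (".../ip/<id>") or,
    # failing that, from the trackProductId assignment.
    match = (re.search(r'/ip/(\d+)', html) or
             re.search(r"trackProductId\s*:\s*'(\d+)'", html))
    product_id = match.group(1)

    # Bazaarvoice display code: taken from the scriptPath assignment.
    match = re.search(r'bazaarvoice\.com/static/(\d+)/', html)
    display_code = match.group(1)

    # Assemble the AJAX endpoint in the same shape as the URL above.
    return ('http://walmart.ugc.bazaarvoice.com/%s/%s/reviews.djs'
            '?format=embeddedhtml&dir=desc&sort=relevancy'
            % (display_code, product_id))


# Sample source fragments quoted in this answer:
sample_html = """
<meta property="og:url" content="http://www.walmart.com/ip/29701960" />
WALMART.BV.scriptPath = 'http://walmart.ugc.bazaarvoice.com/static/1336/';
"""

reviews_url = build_reviews_url(sample_html)
```

Fetch `reviews_url` and feed the response to BeautifulSoup as before; note the response is JavaScript with HTML embedded in string literals, so you may still need to unescape it first.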
Source: https://stackoverflow.com/questions/22595693/pin-down-exact-content-location-in-html-for-web-scraping-urllib2-beautiful-soup