Question
I'm new to web scraping, have little exposure to HTML, and wanted to know if there is a better, more efficient way to search for required content in the HTML of a web page. Currently, I want to scrape reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem
For this, I have the following code:
import re
import sys
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem'
review_url = url

#-------------------------------------------------------------------------
# Scrape the ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True
while more:
    # Open the URL to get the review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
        elif hasattr(e, 'code'):
            if e.code == 404:
                print 'Error: ', e.code
                sys.exit()
    content = page.read()
    soup = BeautifulSoup(content)
    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
With this, I'm not able to read anything. I'm guessing I have to point it at a more precise location in the page. Any suggestions?
Answer 1:
I understand that the question was initially about BeautifulSoup, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.
Selenium drives a real browser, so you don't have to deal with parsing the results of AJAX calls yourself. For example, here's how you can get the list of review titles and ratings from the first reviews page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()
prints:
Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
clorox wipes 5 out of 5
...
Also, take into account that you can run the same code with a headless PhantomJS browser (example).
Another option is to make use of the Walmart API.
Hope that helps.
Answer 2:
The reviews are loaded with an AJAX call, so you won't find them at the link you provided. They are loaded from the following URL:
http://walmart.ugc.bazaarvoice.com/1336/29701960/reviews.djs?format=embeddedhtml&dir=desc&sort=relevancy
Here, 29701960 is the product id, which you can find in the HTML source of the original page, for example in:

<meta property="og:url" content="http://www.walmart.com/ip/29701960" />

or in:

trackProductId : '29701960',

And 1336 is the Bazaarvoice display code, which also appears in the source:

WALMART.BV.scriptPath = 'http://walmart.ugc.bazaarvoice.com/static/1336/';
Using these values, build the URL above and parse the data from there using BeautifulSoup.
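Extracting the two values can be done with simple regular expressions before building the reviews URL. A minimal sketch (the function name and regex patterns are my own; the sample source fragments are the ones quoted above):

```python
import re


def build_reviews_url(html):
    # Product id: take it from the og:url meta tag (".../ip/<id>") or,
    # failing that, from the trackProductId assignment.
    match = (re.search(r'/ip/(\d+)', html) or
             re.search(r"trackProductId\s*:\s*'(\d+)'", html))
    product_id = match.group(1)

    # Bazaarvoice display code: taken from the scriptPath assignment.
    match = re.search(r'bazaarvoice\.com/static/(\d+)/', html)
    display_code = match.group(1)

    # Assemble the AJAX endpoint in the same shape as the URL above.
    return ('http://walmart.ugc.bazaarvoice.com/%s/%s/reviews.djs'
            '?format=embeddedhtml&dir=desc&sort=relevancy'
            % (display_code, product_id))


# Sample source fragments quoted in this answer:
sample_html = """
<meta property="og:url" content="http://www.walmart.com/ip/29701960" />
WALMART.BV.scriptPath = 'http://walmart.ugc.bazaarvoice.com/static/1336/';
"""

reviews_url = build_reviews_url(sample_html)
```

Fetch `reviews_url` and feed the response to BeautifulSoup as before; note the response is JavaScript with HTML embedded in string literals, so you may still need to unescape it first.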
Source: https://stackoverflow.com/questions/22595693/pin-down-exact-content-location-in-html-for-web-scraping-urllib2-beautiful-soup