Trouble getting correct Xpath

Question

I am trying to pull all product links and image links out of a shopping widget using general xpaths.

This is the site: http://www.stopitrightnow.com/

This is the xpath I have:

xpath('.//*[@class="shopthepost-widget"]/a/@href').extract()

I would have thought this would pull all the links, but it returns nothing.

Following is the beginning of the widget source for reference.

<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls ">
    <a class="stp-control stp-left stp-hidden">&lt;</a>
    <div class="stp-inner">
        <div class="stp-slide" style="left: -0%">
                        <a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878713">
                            </a>
                        <a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878708">
                            </a>

Just copying the xpath would be too specific.

Any and all help will be appreciated.

UPDATE

Here is the spider.

import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from main.items import MainItem


class WebSpider(scrapy.Spider):
    name = "web"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('.//h3[@class="post-title entry-title"]//text()').extract()
        dates = sel.xpath('.//h2[@class="date-header"]/span/text()').extract()
        picUrls = sel.xpath('.//div[@class="post-body entry-content"]//@href').extract()
        stockUrls = sel.xpath('.//*[@class="stp-slide"]/a/@href').extract()

        items = []

        for title, date, picUrl, stockUrl in zip(titles, dates, picUrls, stockUrls):
            item = MainItem()
            item["title"] = title.strip()
            item["date"] = date.strip()
            item["picUrl"] = picUrl.strip()
            item["stockUrl"] = stockUrl.strip()
            items.append(item)
        return items

Answer 1:

If you look at what Scrapy actually receives, you can see that JavaScript is involved in building the contents of the tags with class="shopthepost-widget":

<div class="shopthepost-widget" data-widget-id="909962">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script><br>
<div class="rs-adblock">
<img onerror="this.parentNode.innerHTML='Disable your ad blocking software to view this content.'" src="//assets.rewardstyle.com/images/search/350.gif" style="height: 15px; width: 15px;"><noscript>JavaScript is currently disabled in this browser. Reactivate it to view this content.</noscript></div>
</div>

You can print this markup with the following callback:

def parse(self, response):
    for widget in response.xpath("//*[@class='shopthepost-widget']"):
        print(widget.extract())

The browser executes the JavaScript code, but Scrapy does not. That is why you have to check what Scrapy itself receives instead of relying on the source shown in the browser.
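A minimal sketch of that check, runnable without Scrapy. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, and RAW_HTML is a simplified, well-formed approximation of the markup above, not the site's exact HTML. The helper name widget_links is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for what Scrapy downloads (assumption: the script
# body is shortened and the markup is made well-formed for the XML parser).
RAW_HTML = """
<body>
  <div class="shopthepost-widget" data-widget-id="909962">
    <script type="text/javascript">/* loads shopthepost.js */</script>
    <div class="rs-adblock">
      <img src="//assets.rewardstyle.com/images/search/350.gif" />
      <noscript>JavaScript is currently disabled in this browser.</noscript>
    </div>
  </div>
</body>
"""


def widget_links(markup):
    """Return every href found in <a> elements below a shopthepost widget."""
    root = ET.fromstring(markup)
    links = []
    for widget in root.findall(".//div[@class='shopthepost-widget']"):
        links.extend(a.get("href") for a in widget.findall(".//a"))
    return links


# The widget exists, but it has no <a> elements until JavaScript runs,
# so no XPath can find product links in the raw download.
print(widget_links(RAW_HTML))  # prints []
```

If the list is empty for the raw download, no change to the xpath will help; the links have to come from wherever the script loads them.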

Answer 2:

Judging from your code, you aren't very familiar with the Selector class and how it works. I'd highly recommend reading up on it so that you can use it reliably. This matters because response.xpath is just a convenience method for response.selector.xpath, where response.selector is a Selector instance with response.body as its text.

With that out of the way, I'm going to assume that Scrapy actually sees the provided HTML, and address only the xpath issue.

In another question (duplicate of this one), you wrote that you're using the following to get at the items:

for widget in response.xpath("//div[@class='shopthepost-widget']"):
    print(response.xpath('.//*[@class="shopthepost-widget"]//a/@href').extract())

As written, every iteration re-processes the entire tree and re-extracts every matching link on the page, not just the ones inside the current widget.

Use widget instead of re-querying the entire page. widget is a Selector instance scoped to the selection you've already made.

for widget in response.xpath("//div[@class='shopthepost-widget']"):
    print(widget.xpath('.//a/@href').extract())
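The difference between querying the whole page and querying the current node can be sketched without Scrapy. The example below uses the standard-library xml.etree.ElementTree in place of Scrapy's selectors, and the PAGE markup and example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two widgets: the first has two links, the second one.
PAGE = """
<body>
  <div class="shopthepost-widget">
    <a href="http://example.com/a1" />
    <a href="http://example.com/a2" />
  </div>
  <div class="shopthepost-widget">
    <a href="http://example.com/b1" />
  </div>
</body>
"""

root = ET.fromstring(PAGE)
widgets = root.findall(".//div[@class='shopthepost-widget']")

# Querying from the page root inside the loop: every iteration returns
# all links from all widgets.
whole_page = [
    [a.get("href") for a in root.findall(".//div[@class='shopthepost-widget']//a")]
    for _ in widgets
]

# Querying relative to the current widget: each iteration returns only
# that widget's own links.
per_widget = [[a.get("href") for a in w.findall(".//a")] for w in widgets]

print(whole_page)
print(per_widget)
```

Here whole_page holds the same three links twice, while per_widget holds two links for the first widget and one for the second.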

The same mistake shows up as another glaring problem in your crawler's parse method. You extract data from various places on the page, then zip it all together and simply assume everything lines up correctly. What if a post doesn't have a title? You would then attribute the wrong date, links, etc. to every post after it. Each missing entry shifts the alignment further and corrupts the remaining items.

Instead, select the post entries one by one and work off them:

for post in response.xpath('//div[@class="post hentry"]'):
    title = post.xpath('.//h3[@class="post-title entry-title"]//text()').extract()
    date = post.xpath('.//h2[@class="date-header"]/span/text()').extract()
    # Do more stuff here...

This will only select the relevant tags found below the node you're working on, not across the entire breadth of the response. You maintain the hierarchical relationship that already exists, and have reliable data.
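To make the misalignment concrete, here is a small sketch that compares the zip approach with per-post extraction. It assumes a made-up page in which the second post has no title, and it uses the standard-library xml.etree.ElementTree instead of Scrapy so that it runs on its own. The function names zipped and per_post are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second post is missing its title.
PAGE = """
<body>
  <div class="post hentry">
    <h3 class="post-title entry-title">First</h3>
    <a href="http://example.com/1" />
  </div>
  <div class="post hentry">
    <a href="http://example.com/2" />
  </div>
  <div class="post hentry">
    <h3 class="post-title entry-title">Third</h3>
    <a href="http://example.com/3" />
  </div>
</body>
"""


def zipped(root):
    """Collect titles and links page-wide, then pair them up by position."""
    titles = [h.text for h in root.findall(".//h3[@class='post-title entry-title']")]
    links = [a.get("href") for a in root.findall(".//a")]
    return list(zip(titles, links))


def per_post(root):
    """Yield one record per post, looking only inside that post."""
    for post in root.findall(".//div[@class='post hentry']"):
        title = post.find(".//h3[@class='post-title entry-title']")
        link = post.find(".//a")
        yield {
            "title": title.text.strip() if title is not None else None,
            "link": link.get("href") if link is not None else None,
        }


root = ET.fromstring(PAGE)
print(zipped(root))          # "Third" ends up paired with http://example.com/2
print(list(per_post(root)))  # the second post keeps its link, with title None
```

The zip version silently drops one post and mislabels another, while the per-post generator keeps all three records and makes the missing title explicit.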

I'd highly recommend you re-read the entire Scrapy documentation and familiarize yourself with it. If you're going to be doing anything further, or scraping multiple pages, also convert parse to a generator and use yield instead of return.

Answer 3:

The a elements are nested several levels below the widget, so you need a double slash (any descendant) rather than a single slash (direct child) before the a element as well:

xpath('.//*[@class="shopthepost-widget"]//a/@href').extract()

Also, the leading . in the xpath means the expression is evaluated relative to the current node, so you need to have the right context at that point. It will work fine if the current node is an ancestor of the widget; otherwise it will match nothing.

The xpath can also be optimized, since // is relatively expensive. For example, you could target the <div class="stp-slide"> element first if you know that it is always going to be the immediate parent of the links:

xpath('.//*[@class="stp-slide"]/a/@href').extract()
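The three expressions can be compared side by side on a trimmed copy of the widget markup from the question. This is only a sketch: it uses the standard-library xml.etree.ElementTree, which supports a subset of XPath and has no /@href step, so the attribute is read with get("href") instead. The body wrapper, the simplified class values, and the helper name hrefs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the rendered widget from the question.
WIDGET = """
<body>
  <div class="shopthepost-widget" data-widget-id="708473">
    <div class="stp-outer">
      <div class="stp-inner">
        <div class="stp-slide">
          <a href="http://rstyle.me/iA-n/zzhv34c_" class="stp-product">
            <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878713" />
          </a>
          <a href="http://rstyle.me/iA-n/zzhvw4c_" class="stp-product">
            <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878708" />
          </a>
        </div>
      </div>
    </div>
  </div>
</body>
"""


def hrefs(root, path):
    """Return the href of every element matched by path, relative to root."""
    return [a.get("href") for a in root.findall(path)]


root = ET.fromstring(WIDGET)

# Single slash: only <a> elements that are direct children of the widget.
direct_children = hrefs(root, ".//*[@class='shopthepost-widget']/a")
# Double slash: <a> elements at any depth below the widget.
descendants = hrefs(root, ".//*[@class='shopthepost-widget']//a")
# Targeting the immediate parent of the links.
via_slide = hrefs(root, ".//*[@class='stp-slide']/a")

print(direct_children)  # prints []
print(descendants)
print(via_slide)
```

The single-slash version finds nothing because the links sit three levels below the widget, while the other two expressions return the same two product links.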

Source: https://stackoverflow.com/questions/32116534/trouble-getting-correct-xpath

Tags

xml

xpath

web-crawler

scrapy

scrapy-spider