Why is Scrapy skipping some URLs but not others?

Submitted by 无人久伴 on 2020-01-25 08:05:31

Question


I am writing a Scrapy crawler to grab info on shirts from Amazon. The crawler starts on an Amazon results page for some search, "funny shirts" for example, and collects all the result item containers. It then parses each result item, collecting data on the shirts.

I use ScraperAPI and scrapy-user-agents to avoid being blocked by Amazon. The code for my spider is:

import re

import scrapy

from ..items import ScrapeAmazonItem  # assumed import path; the item class is defined elsewhere in the project


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    page_number = 2

    keyword_file = open("keywords.txt", "r+")
    all_key_words = keyword_file.readlines()
    keyword_file.close()
    all_links = []
    keyword_list = []

    for keyword in all_key_words:
        keyword_list.append(keyword)
        formatted_keyword = keyword.replace('\n', '')
        formatted_keyword = formatted_keyword.strip()
        formatted_keyword = formatted_keyword.replace(' ', '+')
        all_links.append("http://api.scraperapi.com/?api_key=mykeyd&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2")

    start_urls = all_links

    def parse(self, response):
        print("========== starting parse ===========")

        all_containers = response.css(".s-result-item")
        for shirts in all_containers:
            next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

        second_page = response.css('li.a-last a::attr(href)').get()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(second_page, callback=self.parse)


    def parse_dir_contents(self, response):
        items = ScrapeAmazonItem()

        print("============= parsing page ==============")

        temp = response.css('#productTitle::text').extract()
        product_name = ''.join(temp)
        product_name = product_name.replace('\n', '')
        product_name = product_name.strip()

        temp = response.css('#priceblock_ourprice::text').extract()
        product_price = ''.join(temp)
        product_price = product_price.replace('\n', '')
        product_price = product_price.strip()

        temp = response.css('#SalesRank::text').extract()
        product_score = ''.join(temp)
        product_score = product_score.strip()
        product_score = re.sub(r'\D', '', product_score)

        product_ASIN = re.search(r'(?<=/)B[A-Z0-9]{9}', response.url)
        product_ASIN = product_ASIN.group(0)

        items['product_ASIN'] = product_ASIN
        items['product_name'] = product_name
        items['product_price'] = product_price
        items['product_score'] = product_score

        yield items

Crawling looks like this:

https://i.stack.imgur.com/UbVUt.png

I'm getting a 200 returned, so I know I'm getting data back from the webpage, but sometimes it does not go into parse_dir_contents, or it only grabs info on a few shirts and then moves on to the next keyword without following pagination.

Working with two keywords: the first keyword in my file (keywords.txt) is loaded, it may find 1-3 shirts, and then it moves on to the next keyword. The second keyword is then completely successful, finding all shirts and following pagination. In a keyword file with 5+ keywords, the first 2-3 keywords are skipped, then the next keyword is loaded and only 2-3 shirts are found before it moves on to the next word, which is again completely successful. In a file with 10+ keywords I get very sporadic behavior.

I have no idea why this is happening. Can anyone explain?


Answer 1:


First, check that robots.txt is being ignored; from what you have said, I suppose you already have that set up.
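For reference, a minimal sketch of what that usually means in a standard Scrapy project (the setting name is Scrapy's own; placing it in settings.py is just the usual convention):

# settings.py -- tell Scrapy not to fetch or respect robots.txt
ROBOTSTXT_OBEY = False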

Sometimes the HTML returned in the response is not the same as what you see when you view the product page in a browser. I don't know exactly what is going on in your case, but you can check what the spider is actually "reading" with:

scrapy shell 'yourURL'

After that, run:

view(response)

There you can inspect the HTML that the spider is actually seeing, provided the request succeeds.
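For example, a quick session might look like this (hypothetical API key and keyword, just to show the flow):

$ scrapy shell 'http://api.scraperapi.com/?api_key=mykey&url=https://www.amazon.com/s?k=funny+shirts'
>>> response.status                        # 200 if the request itself went through
>>> view(response)                         # opens the HTML the spider received in your browser
>>> response.css('.s-result-item').get()   # spot-check that the selector matches anything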

Sometimes the request does not succeed (maybe Amazon is redirecting you to a CAPTCHA page or something similar).
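One rough way to spot that from inside the spider itself (a sketch; the "Robot Check" marker is only an assumption about what Amazon's block page contains):

# at the top of parse_dir_contents (sketch)
if response.status != 200 or b"Robot Check" in response.body:
    self.logger.warning("Possibly blocked or redirected page: %s", response.url)
    return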

You can also check the response while scraping with the requests library (please double-check the code below, I'm writing this from memory):

import requests

# inside your parse method
r = requests.get(response.url)
print(r.content)

You can get the URL of the current page from Scrapy itself via response.url.




Answer 2:


Try using dont_filter=True in your Scrapy requests. I had the same problem; it seemed the Scrapy crawler was ignoring some URLs because it thought they were duplicates.

dont_filter=True 

This makes sure that Scrapy doesn't filter out any URLs with its dupefilter.
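Applied to the requests this spider already yields, that would look roughly like this (a sketch based on the question's parse method; only the dont_filter argument is new):

yield scrapy.Request(
    'http://api.scraperapi.com/?api_key=mykey&url=' + next_page,
    callback=self.parse_dir_contents,
    dont_filter=True,  # skip Scrapy's duplicate-request filter for this request
)

# the same keyword argument is accepted by response.follow
yield response.follow(second_page, callback=self.parse, dont_filter=True)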



Source: https://stackoverflow.com/questions/57760431/why-is-scrapy-skipping-some-urls-but-not-others
