scrapy-spider

Scrapy with multiple pages

好久不见. Submitted on 2019-12-11 17:14:48
Question: I have created a simple Scrapy project in which I got the total page number from the initial site example.com/full. Now I need to scrape all the pages, starting from example.com/page-2, up to 100 (if the total page count is 100). How can I do that? Any advice would be helpful. Code: import scrapy class AllSpider(scrapy.Spider): name = 'all' allowed_domains = ['example.com'] start_urls = ['https://example.com/full/'] total_pages = 0 def parse(self, response): total_pages = response.xpath("//body
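A minimal sketch of one way to do this, assuming the pages really follow the example.com/page-N pattern from the question; the XPath for the page count is a placeholder:

```python
import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/full/']

    def parse(self, response):
        # Placeholder XPath - point it at wherever the total page count lives.
        total_pages = int(response.xpath(
            "//body//*[@class='page-count']/text()").get(default='1'))
        for page in range(2, total_pages + 1):
            # URL pattern assumed from the question: example.com/page-2 ... page-N
            yield scrapy.Request(f'https://example.com/page-{page}',
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Per-page extraction goes here.
        yield {'url': response.url}
```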

Requests disappear after queueing in Scrapy

好久不见. Submitted on 2019-12-11 15:59:33
Question: Scrapy seems to complete without processing all the requests. I know this because I am logging before and after queueing each request, and I can clearly see that. I am logging in both the parse and error callback methods, and neither of them got called for the missing requests. How can I debug what happened to those requests? Answer 1: You need to add dont_filter=True when re-queueing the request. Even though the request may not match any other request, Scrapy remembers which requests it has already made and it
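To illustrate the answer: a re-queued URL that the scheduler has already seen is silently dropped by the duplicate filter unless dont_filter=True is set. A small sketch (spider and callback names are arbitrary):

```python
import scrapy

class RequeueSpider(scrapy.Spider):
    name = 'requeue_example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        self.logger.info('re-queueing %s', response.url)
        # Without dont_filter=True the duplicate filter drops any URL Scrapy
        # has already requested, so the re-queued request silently disappears.
        yield scrapy.Request(response.url, callback=self.parse_detail,
                             dont_filter=True)

    def parse_detail(self, response):
        yield {'url': response.url}
```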

Why does my Scrapy spider only scrape some of my data?

我们两清 Submitted on 2019-12-11 14:37:35
Question: I'm trying to use Scrapy to scrape IMDb data (episode information and cast list) for each episode of Law & Order: SVU. After I run the code below, I export it to CSV via the command line with "scrapy crawl svu -o svu.csv". The code below successfully pulls the episode information, but the CSV does not contain the cast list. How do I fix the code to extract and export both the episode information and the cast list? My thoughts and attempts: I believe that the cast list is extracted because it is
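One common fix (a sketch, not the asker's actual code): follow each episode's page and carry the already-extracted episode fields along via cb_kwargs, so one combined item per episode reaches the CSV exporter. All selectors and URLs below are placeholders:

```python
import scrapy

class SvuSpider(scrapy.Spider):
    name = 'svu'
    start_urls = ['https://www.imdb.com/title/tt0203259/episodes']  # hypothetical

    def parse(self, response):
        for ep in response.css('div.info'):  # placeholder selector
            episode = {
                'title': ep.css('a::text').get(),
                'airdate': ep.css('div.airdate::text').get(default='').strip(),
            }
            ep_url = ep.css('a::attr(href)').get()
            if ep_url:
                # Pass the episode fields to the cast-page callback so the
                # exported row contains both episode info and the cast list.
                yield response.follow(ep_url, callback=self.parse_cast,
                                      cb_kwargs={'episode': episode})

    def parse_cast(self, response, episode):
        episode['cast'] = response.css('table.cast_list a::text').getall()  # placeholder
        yield episode
```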

Scrapy not following pagination properly, catches the first link in the pagination

蹲街弑〆低调 Submitted on 2019-12-11 10:37:08
Question: Yesterday I started learning Scrapy to extract some information, but I can't seem to get the pagination right. I followed the tutorial here, but I think the site has a different pagination system. Most pagination schemes have a class="next" link, but this one doesn't. It only has a list where the current page is rendered as a span with the class current: <div class="pagination"> <ul class="page-numbers"> <li><span class='page-numbers current'>1</span></li> <li><a class='page-numbers' href='https:/
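Given the markup in the question, one hedged approach is to locate the span marking the current page and follow the link in the next list item:

```python
def parse(self, response):
    # ... yield the items found on this page ...

    # There is no "next" class, so step from the <span class="... current">
    # to the anchor in the following <li>.
    next_page = response.xpath(
        "//ul[@class='page-numbers']//span[contains(@class, 'current')]"
        "/parent::li/following-sibling::li[1]/a/@href").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
```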

Test that a Scrapy spider is still working - detect page changes

故事扮演 Submitted on 2019-12-11 10:09:28
Question: How can I test a Scrapy spider against online data? I know from this post that it is possible to test a spider against offline data. My goal is to check whether my spider still extracts the right data from a page, or whether the page has changed. I extract the data via XPath, and sometimes the page receives an update and my scraper no longer works. I would love to have the test as close to my code as possible, e.g. using the spider and Scrapy setup and just hooking into the parse method. Answer 1: Referring
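One hedged way to keep such a test close to the spider code is to download the live page, wrap it in an HtmlResponse, and feed it straight into parse(); the import path and asserted field below are assumptions:

```python
import requests
from scrapy.http import HtmlResponse, Request

from myproject.spiders.example import ExampleSpider  # hypothetical import path

def test_parse_still_extracts_expected_fields():
    spider = ExampleSpider()
    url = spider.start_urls[0]
    body = requests.get(url).content
    response = HtmlResponse(url=url, body=body, encoding='utf-8',
                            request=Request(url=url))
    items = list(spider.parse(response))
    # If the page layout changes, the XPath stops matching and these fail.
    assert items, 'parse() yielded nothing - page structure may have changed'
    assert all(item.get('title') for item in items)  # hypothetical field
```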

Custom signal not being handled by Scrapy internal API

ぐ巨炮叔叔 Submitted on 2019-12-11 09:44:41
Question: I am trying to handle a custom signal 'signalizers.item_extracted' in a Scrapy extension 'MyExtension', which is successfully enabled when Scrapy starts. Here is my code: signalizers.py # custom signals item_extracted = object() item_transformed = object() class MyExtension(object): def __init__(self): pass @classmethod def from_crawler(cls, crawler): # first check if the extension should be enabled and raise # NotConfigured otherwise if not crawler.settings.getbool('MYEXTENSION_ENABLED'):
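For context, a sketch of the usual wiring (not necessarily the asker's full code): the extension has to connect a handler for the custom signal in from_crawler, and the spider has to send the signal through the same signal manager:

```python
# signalizers.py - custom signal objects
item_extracted = object()

class MyExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        from scrapy.exceptions import NotConfigured
        if not crawler.settings.getbool('MYEXTENSION_ENABLED'):
            raise NotConfigured
        ext = cls()
        # Registering a handler is what makes the signal "handled" -
        # defining the signal object alone is not enough.
        crawler.signals.connect(ext.item_extracted_handler, signal=item_extracted)
        return ext

    def item_extracted_handler(self, item, spider):
        spider.logger.info('item_extracted received: %r', item)
```

The spider would then fire it with something like self.crawler.signals.send_catch_log(signal=signalizers.item_extracted, item=item, spider=self), so both sides go through the crawler's signal manager.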

XPath selector works in XPath Helper console, but doesn't work in scrapy

。_饼干妹妹 Submitted on 2019-12-11 09:38:41
Question: I'm using Scrapy to parse interest rates from the Russian Central Bank website. I'm also using the XPath Helper extension in Google Chrome to find the necessary XPath selector. The selector I use in the XPath Helper console below works exactly as I need, but for some reason the same query doesn't work in my spider, even though it navigates to the page. You can see my spider code below. import scrapy import urllib.parse class RatesSpider(scrapy.Spider): name = 'rates' allowed_domains = ['cbr.ru'] start_urls = [
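One frequent cause (an assumption here, since the full selector is cut off): Chrome's rendered DOM contains <tbody> elements that the raw HTML Scrapy downloads does not, so an XPath copied from the browser misses. A sketch of the adjustment, with a made-up table selector:

```python
# XPath as copied from the browser tools (hypothetical):
#   //table[@class='data']/tbody/tr[2]/td[2]/text()

# Against the raw HTML that Scrapy receives, drop the browser-inserted <tbody>:
rate = response.xpath("//table[@class='data']/tr[2]/td[2]/text()").get()

# Running `scrapy shell <url>` and then view(response) shows exactly what the
# spider downloaded, which makes mismatches like this easy to spot.
```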

Scrapy and Selenium: only scrape two pages

99封情书 Submitted on 2019-12-11 08:35:32
Question: I want to crawl a website that has more than 10 pages, and every page has 10 links. The spider collects the links in def parse(): and follows each link to crawl further data in def parse_detail():. Please guide me on how to crawl only two pages, not all of them. Thanks. Here is my code; it only crawls one page and then the spider closes: def __init__(self): self.driver = webdriver.Firefox() dispatcher.connect(self.spider_closed, signals.spider_closed) def parse(self, response): self.driver.implicitly
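One way to cap the crawl (a sketch meant to live inside the existing spider class; selectors are placeholders): count the listing pages processed and stop following the pagination link after the second one. Alternatively, the CLOSESPIDER_PAGECOUNT setting stops the whole crawl after a fixed number of responses.

```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.pages_crawled = 0
    self.max_pages = 2  # only the first two listing pages

def parse(self, response):
    self.pages_crawled += 1
    for href in response.css('a.detail::attr(href)').getall():  # placeholder selector
        yield response.follow(href, callback=self.parse_detail)
    # Only follow the pagination link while under the page limit.
    if self.pages_crawled < self.max_pages:
        next_page = response.css('a.next::attr(href)').get()  # placeholder selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```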

How to crawl data from the linked webpages on a webpage we are crawling

橙三吉。 Submitted on 2019-12-11 08:26:11
Question: I am crawling the names of the colleges on this webpage, but I also want to crawl the number of faculty members in each college, which is available if you open the college's specific webpage by clicking its name. What should I add to this code to get that result? The result should be in the form [(name1, faculty1), (name2, faculty2), ...]. import scrapy class QuotesSpider(scrapy.Spider): name = "student" start_urls = [ 'http://www.engineering.careers360.com/colleges/list-of
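A hedged sketch of the missing piece, to be added inside the spider above: follow each college link and pass the name along, so the faculty count scraped from the college's own page ends up in the same item (selectors are placeholders; the truncated start_urls entry from the question is left out):

```python
def parse(self, response):
    for college in response.css('div.title a'):  # placeholder selector
        name = college.css('::text').get()
        # Carry the name to the college's own page, where the faculty
        # count lives, so both end up in one item.
        yield response.follow(college, callback=self.parse_college,
                              cb_kwargs={'name': name})

def parse_college(self, response, name):
    faculty = response.xpath("//div[contains(@class, 'faculty')]//text()").get()  # placeholder
    yield {'name': name, 'faculty': faculty}
```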

URL text file not found when deployed to Scrapinghub and the spider is run

穿精又带淫゛_ Submitted on 2019-12-11 07:25:47
Question: My spider relies on a .txt file that contains the URLs the spider goes to. I have placed that file in the same directory as the spider code, and in every parent directory as well (Hail Mary approach); the end result is this: Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request request = next(slot.start_requests) File "/app/__main__.egg/CCSpider1/spiders/cc_1_spider.py", line 41, in start_requests for
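A common resolution for this class of error (a hedged sketch; file names and paths are assumptions): data files only exist inside the deployed egg if they are declared in setup.py (for example via package_data), and they are then best read with pkgutil rather than open() on a relative path:

```python
import pkgutil
import scrapy

class CC1Spider(scrapy.Spider):
    name = 'cc_1_spider'

    def start_requests(self):
        # Reads the file from inside the deployed egg; 'resources/urls.txt'
        # is a hypothetical location within the CCSpider1 package.
        data = pkgutil.get_data('CCSpider1', 'resources/urls.txt')
        for url in data.decode('utf-8').splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}
```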