scrapy-spider

Why does my CrawlerProcess not have the function “crawl”?

你离开我真会死。 submitted on 2019-12-10 11:26:41
Question:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import BackpageItem, CityvibeItem
from scrapy.shell import inspect_response
import re
import time
import sys

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    # Set last_page to decide how many pages are crawled
    last_page = 10
    start_urls = ['http://www.example.com/washington/?page=%s' %
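The excerpt is cut off, but the title usually points to a version issue: CrawlerProcess only gained its crawl() method in Scrapy 1.0. A minimal sketch, assuming Scrapy >= 1.0 (where scrapy.contrib has also been replaced by scrapy.spiders and scrapy.linkextractors), of driving a CrawlSpider from a script:

```python
# Minimal sketch, assuming Scrapy >= 1.0; on older releases CrawlerProcess has no
# crawl() method, which is the usual cause of the error in the question's title.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule        # modern path for scrapy.contrib.spiders
from scrapy.linkextractors import LinkExtractor     # modern path for scrapy.contrib.linkextractors


class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/washington/?page=1']
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=False),)

    def parse_item(self, response):
        # placeholder item; the real spider would populate BackpageItem/CityvibeItem here
        yield {'url': response.url}


if __name__ == '__main__':
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MySpider)   # the method the question is asking about
    process.start()           # blocks until the crawl finishes
```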

scrapy - separate output file per start_url

梦想与她 submitted on 2019-12-09 22:30:02
Question: I have this Scrapy spider that runs well:

# -*- coding: utf-8 -*-
import scrapy

class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = ['http://www.exampleregelwiki.de/index.php/categoryA.html',
                  'http://www.exampleregelwiki.de/index.php/categoryB.html',
                  'http://www.exampleregelwiki.de/index.php/categoryC.html',]
    #"Titel": :
    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in
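The excerpt stops before the follow-up requests, so the poster's full setup is unknown. One way to get a separate output file per start URL, sketched below on the assumption that each item can be tagged with the category it came from (for example via a value carried along in request meta), is a small item pipeline that opens one JSON-lines file per category; the pipeline name and the 'category' field are illustrative, not taken from the question.

```python
# Hypothetical pipeline: writes items to <category>.jl, one file per category.
# Enable it via ITEM_PIPELINES in settings.py, e.g.
#   ITEM_PIPELINES = {'myproject.pipelines.PerStartUrlExportPipeline': 300}
import json


class PerStartUrlExportPipeline(object):

    def open_spider(self, spider):
        self.files = {}

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        category = item.get('category', 'default')   # assumed field set by the spider
        if category not in self.files:
            self.files[category] = open('%s.jl' % category, 'w')
        self.files[category].write(json.dumps(dict(item)) + '\n')
        return item
```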

Scrapy Return Multiple Items

梦想的初衷 submitted on 2019-12-09 19:39:46
Question: I'm new to Scrapy and I'm really just lost on how I can return multiple items in one block. Basically, I'm getting one HTML tag which has a quote that contains nested tags of text, author name, and some tags about that quote. The code here only returns one quote and that's it. It doesn't use the loop to return the rest. I've been searching the web for hours and I'm just hopeless, I don't get it. Here's my code so far:

Spider.py

import scrapy
from scrapy.loader import ItemLoader
from first
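The usual fix for "only one item comes back" is to iterate over the container selector and yield one item per block, using selectors relative to each block. A minimal sketch; the CSS classes are modelled on quotes.toscrape.com and are assumptions, since the poster's excerpt is truncated before the parse method.

```python
# Sketch: one yielded item per quote block; the inner selectors are looked up
# relative to each `quote` selector, not on the whole response.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes_multi'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):   # one iteration per quote container
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('a.tag::text').extract(),
            }
```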

Scrapy cannot scrape a second page using ItemLoader

对着背影说爱祢 submitted on 2019-12-09 04:29:28
UPDATE: 7/29, 9:29pm: After reading this post, I updated my code. UPDATE: 7/28/15, at 7:35pm: following Martin's suggestion, the message changed, but there is still no listing of items or writing to the database. ORIGINAL: I can successfully scrape a single page (the base page). Now I tried to scrape one of the items from another URL found on the "base" page, using Request and a callback. But it does not work. The spider is here:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items
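The excerpt ends at the imports, so the failing callback isn't visible; the standard pattern for finishing an item on a second page is to pass the partly built item through Request.meta and continue loading it in the next callback. A sketch under that assumption, with a placeholder CAPjobsItem definition and placeholder selectors and URLs:

```python
# Sketch: carry a partly-built item to a detail page via meta, finish it there.
# CAPjobsItem, the field names, selectors and URLs are placeholders, not the poster's code.
import scrapy
from scrapy.loader import ItemLoader


class CAPjobsItem(scrapy.Item):
    title = scrapy.Field()
    detail = scrapy.Field()


class TwoPageSpider(scrapy.Spider):
    name = 'two_page'
    start_urls = ['http://example.com/listing']

    def parse(self, response):
        loader = ItemLoader(item=CAPjobsItem(), response=response)
        loader.add_css('title', 'h1::text')
        item = loader.load_item()                      # partly-built item from the base page
        detail_url = response.css('a.detail::attr(href)').extract_first()
        yield scrapy.Request(response.urljoin(detail_url),
                             callback=self.parse_detail,
                             meta={'item': item})

    def parse_detail(self, response):
        # resume loading into the item that travelled in meta
        loader = ItemLoader(item=response.meta['item'], response=response)
        loader.add_css('detail', 'div.description::text')
        yield loader.load_item()
```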

Scrapy: Rules set inside __init__ are ignored by CrawlSpider

两盒软妹~` submitted on 2019-12-08 18:56:29
I've been stuck on this for a few days, and it's making me go crazy. I call my Scrapy spider like this:

scrapy crawl example -a follow_links="True"

I pass in the "follow_links" flag to determine whether the entire website should be scraped, or just the index page I have defined in the spider. This flag is checked in the spider's constructor to see which rule should be set:

def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback=
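The usual explanation is that CrawlSpider compiles its rules inside __init__, so rules assigned after the super() call are never compiled. A sketch of the common fix (assuming the rest of the poster's spider): assign self.rules before calling the parent constructor, or call self._compile_rules() afterwards.

```python
# Sketch of the fix: set self.rules *before* super().__init__(), which is where
# CrawlSpider compiles its rules. The callback and URLs are placeholders.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        if kwargs.get('follow_links') == "True":
            self.rules = (Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),)
        else:
            self.rules = ()
        super(ExampleSpider, self).__init__(*args, **kwargs)   # rules are compiled here

    def parse_item(self, response):
        yield {'url': response.url}
```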

Scrapy handle 301/302 response code as well as follow the target url

半城伤御伤魂 submitted on 2019-12-08 14:39:48
Question: I am using Scrapy version 1.0.5 to implement a crawler. Currently I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] to scrape pages that return 301 and 302 responses. However, since REDIRECT_ENABLED is set to False, the spider doesn't go to the target URL in the Location response header. How can I achieve this?

Answer 1: It has been a long time since I did anything like this, but you need to generate a Request object with url, meta and callback parameters. But I seem
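Along the lines of that answer, a sketch of handling the redirect manually: keep REDIRECT_ENABLED off, detect 301/302 in the callback, read the Location header, and yield a follow-up Request yourself. The item fields and URLs are illustrative.

```python
# Sketch: record 301/302 responses and still follow their Location target manually.
import scrapy


class RedirectAwareSpider(scrapy.Spider):
    name = 'redirect_aware'
    handle_httpstatus_list = [500, 301, 302]
    custom_settings = {'REDIRECT_ENABLED': False}
    start_urls = ['http://example.com/old-page']

    def parse(self, response):
        if response.status in (301, 302):
            location = response.headers.get('Location')
            if location:
                target = response.urljoin(location.decode('utf-8'))
                yield {'source': response.url, 'redirects_to': target}   # keep the redirect info
                yield scrapy.Request(target, callback=self.parse)        # ...and follow it
        else:
            yield {'url': response.url,
                   'title': response.css('title::text').extract_first()}
```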

Scrapy run multiple spiders from a main spider?

房东的猫 submitted on 2019-12-08 12:20:50
Question: I have two spiders that take URLs and data scraped by a main spider. My approach to this was to use CrawlerProcess in the main spider and pass data to the two spiders. Here's my approach:

class LightnovelSpider(scrapy.Spider):
    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=[]):
        self.novels = novels

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            request = scrapy.Request(novel, callback=self.parseNovel)
            yield request

    def
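Starting a CrawlerProcess from inside a running spider is usually the wrong layer; the documented way to chain spiders is to run them from one script with CrawlerRunner, waiting for the first crawl to finish before starting the next. A sketch under that assumption, with a module-level list standing in for whatever data the main spider collects (spider names and selectors are placeholders):

```python
# Sketch: run two spiders sequentially from one script and hand the URLs collected
# by the first spider to the second one's constructor.
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

collected_novels = []            # filled by the first spider, read by the second


class NovelListSpider(scrapy.Spider):
    name = 'novelList'
    allowed_domains = ['readlightnovel.com']
    start_urls = ['http://www.readlightnovel.com/']

    def parse(self, response):
        for href in response.css('a.novel::attr(href)').extract():   # placeholder selector
            collected_novels.append(response.urljoin(href))


class LightnovelSpider(scrapy.Spider):
    name = 'novelDetail'
    allowed_domains = ['readlightnovel.com']

    def __init__(self, novels=None, *args, **kwargs):
        super(LightnovelSpider, self).__init__(*args, **kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            yield scrapy.Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}


configure_logging()
runner = CrawlerRunner({'USER_AGENT': 'Mozilla/5.0'})


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(NovelListSpider)                            # wait for the list spider
    yield runner.crawl(LightnovelSpider, novels=collected_novels)  # then run the detail spider
    reactor.stop()


crawl()
reactor.run()
```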

Scrapy crawl and follow links within href

廉价感情. submitted on 2019-12-08 08:48:27
Question: I am very new to Scrapy. I need to follow hrefs from the homepage of a URL to multiple depths, and inside those href links there are further hrefs. I need to follow these hrefs until I reach the page I want to scrape. The sample HTML of my page is:

Initial page:

<div class="page-categories">
  <a class="menu" href="/abc.html">
  <a class="menu" href="/def.html">
</div>

Inside abc.html:

<div class="cell category">
  <div class="cell-text category">
    <p class="t">
      <a id="cat-24887" href="fgh.html"/>
<
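A minimal sketch of following those two levels of links with one callback per level, using the class names from the HTML fragments above; the final parse_detail body and the start URL are placeholders.

```python
# Sketch: homepage -> category pages (a.menu) -> item pages (links inside p.t) -> scrape.
import scrapy


class DepthFollowSpider(scrapy.Spider):
    name = 'depth_follow'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # level 1: <a class="menu" href="/abc.html"> inside div.page-categories
        for href in response.css('div.page-categories a.menu::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_category)

    def parse_category(self, response):
        # level 2: <a id="cat-..." href="fgh.html"> inside the category cells
        for href in response.css('div.cell-text.category p.t a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # the desired page: extract whatever is needed here
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}
```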

How to check if a specific button exists in Scrapy?

♀尐吖头ヾ submitted on 2019-12-08 08:41:34
Question: I have a button in a web page as

<input class="nextbutton" type="submit" name="B1" value="Next 20>>"></input>

Now I want to check whether this button exists on the page using XPath selectors, so that if it exists I can go to the next page and retrieve information from there.

Answer 1: First, you have to determine what counts as "this button". Given the context, I'd suggest looking for an input with a class of 'nextbutton'. You could check for an element with only one class like this in XPath: //input[
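A sketch of how that check might look inside a spider callback: test for an input whose class list contains "nextbutton" and, if it is present, submit the surrounding form to load the next 20 results. The use of FormRequest.from_response and the form data are assumptions about how the site pages; adjust to the real form.

```python
# Sketch: only paginate when the "nextbutton" input is present on the page.
import scrapy
from scrapy.http import FormRequest


class NextButtonSpider(scrapy.Spider):
    name = 'next_button'
    start_urls = ['http://example.com/results']

    def parse(self, response):
        # ... extract the items of the current page here ...

        # robust class test: matches class="nextbutton" and class="foo nextbutton"
        next_button = response.xpath(
            '//input[contains(concat(" ", normalize-space(@class), " "), " nextbutton ")]')
        if next_button:
            yield FormRequest.from_response(response,
                                            formdata={'B1': 'Next 20>>'},
                                            callback=self.parse)
```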

How to scrape data in an authenticated session within a dynamic page?

我怕爱的太早我们不能终老 submitted on 2019-12-08 08:31:05
Question: I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/) and taking this post as a reference for dynamic webpages. This is the code:

class MySpider(CrawlSpider):
    login_user = 'myusername'
    login_pass = 'mypassword'
    name = "tv"
    allowed_domains = []
    start_urls = ["https://twitter.com/Acrocephalus/followers"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True),
    )

    def
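The excerpt stops before the login logic, so here is only a generic sketch of an authenticated crawl using Scrapy's built-in FormRequest.from_response rather than the loginform helper the poster used. The login URL and form field names are placeholders; a JavaScript-heavy page like Twitter's followers list would additionally need the site's API or a rendering backend, which plain Scrapy does not provide.

```python
# Sketch: log in first, then crawl pages that require the session cookie.
# URLs, form field names and the failure marker are placeholders.
import scrapy


class AuthenticatedSpider(scrapy.Spider):
    name = 'tv'
    login_user = 'myusername'
    login_pass = 'mypassword'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # submit the login form found on the login page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': self.login_user, 'password': self.login_pass},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
            return
        # the session cookie is now set; request the protected pages
        yield scrapy.Request('https://example.com/followers', callback=self.parse_items)

    def parse_items(self, response):
        for row in response.css('div.user'):                       # placeholder selector
            yield {'name': row.css('span.name::text').extract_first()}
```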