scrapy-spider

Scrapy spider not terminating with use of CloseSpider extension

孤者浪人 submitted on 2019-12-12 12:15:46
Question: I have set up a Scrapy spider that parses an XML feed, processing some 20,000 records. For the purposes of development, I'd like to limit the number of items processed. From reading the Scrapy docs I identified that I need to use the CloseSpider extension. I have followed the guide on how to enable this - in my spider config I have the following: CLOSESPIDER_ITEMCOUNT = 1 EXTENSIONS = { 'scrapy.extensions.closespider.CloseSpider': 500, } However, my spider never terminates - I'm aware that the
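A minimal sketch of how those settings are usually wired up, with a hypothetical spider name, feed URL, and XPath. Two points worth noting: CLOSESPIDER_ITEMCOUNT only takes effect when exposed through custom_settings or settings.py (a plain class attribute is ignored), and requests already in flight still finish, so shutdown is not instantaneous.

```python
import scrapy

class FeedSpider(scrapy.Spider):
    name = "feed_spider"                           # hypothetical name
    start_urls = ["https://example.com/feed.xml"]  # hypothetical feed URL

    # Settings must go through custom_settings (or settings.py); a bare
    # class attribute like CLOSESPIDER_ITEMCOUNT = 1 is never read.
    custom_settings = {
        "CLOSESPIDER_ITEMCOUNT": 1,
        # The CloseSpider extension is enabled by default; listing it is harmless.
        "EXTENSIONS": {"scrapy.extensions.closespider.CloseSpider": 500},
    }

    def parse(self, response):
        # Requests already scheduled keep running after the limit is hit,
        # so a few extra items may still come through before the spider closes.
        for node in response.xpath("//item"):
            yield {"title": node.xpath("title/text()").get()}
```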

Log in not working using scrapy

陌路散爱 submitted on 2019-12-12 10:26:59
Question: I have written Scrapy code to log in to a site. First I tried it on one site and it worked well. But then I changed the URL and tried another site, and it does not work there, even though I used the same code without any change. What could be the problem? # -*- coding: utf-8 -*- import scrapy from scrapy.http import FormRequest from scrapy.utils.response import open_in_browser class QuoteSpider(scrapy.Spider): name = 'Quote' allowed_domains = ["quotes.toscrape.com"] start_urls = ( 'http://quotes
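A hedged sketch of the usual FormRequest.from_response login pattern against quotes.toscrape.com. The form field names ("username", "password") and the "Logout" check are specific to that site; a different site almost certainly uses different field names, hidden tokens, or a different form, which is the most common reason identical login code stops working elsewhere.

```python
import scrapy
from scrapy.http import FormRequest

class QuoteLoginSpider(scrapy.Spider):
    name = "quote_login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # from_response() copies hidden fields (e.g. the CSRF token) from the
        # page's login form and then overrides the fields given in formdata.
        return FormRequest.from_response(
            response,
            formdata={"username": "admin", "password": "admin"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if "Logout" in response.text:
            self.logger.info("Login succeeded")
        else:
            self.logger.warning("Login appears to have failed")
```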

Scrapy not calling any other function after “__init__”

那年仲夏 submitted on 2019-12-12 09:01:41
Question: OS: Ubuntu 16.04. Stack: Scrapy 1.0.3 + Selenium. I'm pretty new to Scrapy and this might sound very basic, but in my spider only __init__ is getting executed. Any code/function after that is not getting called and the spider just halts. class CancerForumSpider(scrapy.Spider): name = "mainpage_spider" allowed_domains = ["cancerforums.net"] start_urls = [ "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum" ] def __init__(self,*args,**kwargs): self.browser=webdriver.Firefox()
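One frequent cause of this symptom is overriding __init__ without calling the base class initializer, so Scrapy never finishes setting the spider up. A sketch under that assumption (the parse body is illustrative only):

```python
import scrapy
from selenium import webdriver

class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = ["http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"]

    def __init__(self, *args, **kwargs):
        # Without this super() call the base Spider never initializes,
        # which is a frequent reason nothing runs after __init__.
        super(CancerForumSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()

    def parse(self, response):
        # If parse() is still never reached, check the log for a blocking
        # failure inside __init__ (e.g. the Firefox driver not starting).
        self.browser.get(response.url)
        self.logger.info("Loaded %s in Selenium", response.url)
```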

Scrapy yield a Request, parse in the callback, but use the info in the original function

给你一囗甜甜゛ submitted on 2019-12-12 05:46:55
Question: So I'm trying to test some webpages in Scrapy. My idea is to yield a Request to the URLs that satisfy the condition, count the number of certain items on each page, and then within the original function return True/False depending on the count... Here is some code to show what I mean: def filter_categories: if condition: test = yield Request(url=link, callback = self.test_page, dont_filter=True) return (test, None) def test_page(self, link): ... parse the response... return True/False depending I have
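Scrapy callbacks run asynchronously, so a yielded Request cannot hand its result back to the function that yielded it. The usual workaround is to carry the needed context into the callback (via cb_kwargs on Scrapy 1.7+, or request.meta on older versions) and finish the decision there. A sketch with a hypothetical URL, selectors, and threshold:

```python
import scrapy

class FilterSpider(scrapy.Spider):
    name = "filter_demo"
    start_urls = ["https://example.com/categories"]  # hypothetical URL

    def parse(self, response):
        for link in response.css("a.category::attr(href)").getall():  # placeholder selector
            # A yielded Request cannot return a value to this function;
            # instead, pass the context the decision needs into the callback.
            yield response.follow(
                link,
                callback=self.test_page,
                cb_kwargs={"category_url": link},
                dont_filter=True,
            )

    def test_page(self, response, category_url):
        item_count = len(response.css("div.item"))  # placeholder selector
        if item_count > 10:  # hypothetical threshold
            yield {"category": category_url, "items": item_count}
```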

scraping: nested url data scraping

99封情书 submitted on 2019-12-12 04:45:14
Question: I have a website, https://www.grohe.com/in. From that page I want to get one type of bathroom faucet: https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. On that page there are multiple products/related products. I want to get each product URL and scrape its data. For that I wrote the following... My items.py file looks like: from scrapy.item import Item, Field class ScrapytestprojectItem(Item): producturl=Field() imageurl=Field() description=Field() The spider code is: import scrapy from
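A sketch of the usual nested-scraping pattern with that item class: collect the product links on the listing page, follow each one, and fill the item in the product-page callback. The CSS selectors are placeholders that would have to be checked against the real Grohe markup in the browser or scrapy shell.

```python
import scrapy
from scrapy.item import Item, Field

class ScrapytestprojectItem(Item):
    producturl = Field()
    imageurl = Field()
    description = Field()

class GroheSpider(scrapy.Spider):
    name = "grohe"
    start_urls = [
        "https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/"
    ]

    def parse(self, response):
        # Placeholder selector for the product links on the listing page.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        item = ScrapytestprojectItem()
        item["producturl"] = response.url
        item["imageurl"] = response.css("img::attr(src)").get()       # placeholder
        item["description"] = response.css("h1::text").get()          # placeholder
        yield item
```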

Crawl website from list of values using scrapy

北城以北 submitted on 2019-12-12 03:32:00
Question: I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a CSV file. I am able to do it manually by pasting the URLs into the code, but I cannot figure out how to do it when I have a whole list of NPIs and want the provider name for each. Here is my current code: import scrapy from scrapy.spider import BaseSpider class MySpider(BaseSpider): name = "npidb" def start_requests(self): urls = [ 'https://npidb.org/npi-lookup/
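A sketch of reading the NPIs from the CSV inside start_requests and building one request per row. The file name, URL pattern, and result selector are assumptions that would need to be verified against the real site.

```python
import csv
import scrapy

class NpiSpider(scrapy.Spider):
    name = "npidb"

    def start_requests(self):
        # "npis.csv" is an assumed file with one NPI per row.
        with open("npis.csv", newline="") as f:
            for row in csv.reader(f):
                npi = row[0].strip()
                if not npi:
                    continue
                # Assumed URL pattern; confirm the real lookup URL in a browser.
                url = f"https://npidb.org/npi-lookup/?npi={npi}"
                yield scrapy.Request(url, callback=self.parse, cb_kwargs={"npi": npi})

    def parse(self, response, npi):
        yield {
            "npi": npi,
            # Placeholder selector; adjust to the page's real markup.
            "provider_name": response.css("h1::text").get(),
        }
```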

How to get plain text in between multiple html tag using scrapy

大城市里の小女人 submitted on 2019-12-12 03:19:15
Question: I am trying to grab all the text from multiple tags at a given URL using Scrapy. I am new to Scrapy and don't have much of an idea how to achieve this; I'm learning through examples and other people's experience on Stack Overflow. Here is the list of tags that I am targeting: <div class="TabsMenu fl coloropa2 fontreg"><p>root div<p> <a class="sub_h" id="mtongue" href="#">Mother tongue</a> <a class="sub_h" id="caste" href="#">Caste</a> <a class="sub_h" id="scases" href="#">My name is nand </a> </div> <div class=
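For markup like that, an XPath of the form //div[...]//text() returns the text of the div and of every descendant tag (the <p> and <a> elements) in one pass. A sketch with a hypothetical start URL:

```python
import scrapy

class TabsTextSpider(scrapy.Spider):
    name = "tabs_text"
    start_urls = ["https://example.com/page"]  # hypothetical URL

    def parse(self, response):
        # Collect the text of the TabsMenu div and all of its descendants.
        texts = response.xpath(
            '//div[contains(@class, "TabsMenu")]//text()'
        ).getall()
        # Drop whitespace-only nodes and join into one plain string.
        yield {"text": " ".join(t.strip() for t in texts if t.strip())}
```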

Grabbed data from a given URL and put it into a file using scrapy

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-12 01:25:17
Question: I am trying to crawl a given web site deeply and grab text from all of its pages. I am using Scrapy. Here is how I am running the spider: scrapy crawl stack_crawler -o items.json, but the items.json file comes out empty. Here is the spider code snippet: # -*- coding: utf-8 -*- import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule #from tutorial.items import TutorialItem from tutorial.items import DmozItem class StackCrawlerSpider(CrawlSpider): name
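Two classic causes of an empty items.json with a CrawlSpider are naming the callback parse (which overrides CrawlSpider's own parse and disables the rules) and omitting the callback from the Rule. A sketch that yields items from the rule callback; the domain, selector, and text truncation are placeholders.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class StackCrawlerSpider(CrawlSpider):
    name = "stack_crawler"
    allowed_domains = ["example.com"]        # assumed domain
    start_urls = ["https://example.com/"]

    rules = (
        # The callback must not be named "parse" on a CrawlSpider.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Yield something here so `scrapy crawl stack_crawler -o items.json`
        # has data to write.
        text = " ".join(response.xpath("//body//text()").getall())
        yield {"url": response.url, "text": text.strip()[:500]}
```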

scrapy crawlspider output

走远了吗. submitted on 2019-12-11 22:24:55
Question: I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine, but I'm having trouble getting it to output to a CSV file (or anything, really). So my question is: can I use this: scrapy crawl dmoz -o items.csv, or do I have to create an item pipeline? UPDATED, now with code!: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from targets.item import TargetsItem
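As long as the spider yields items from its callbacks, `scrapy crawl dmoz -o items.csv` writes them through the feed exports and no item pipeline is required. A sketch using the modern import paths (scrapy.spiders / scrapy.linkextractors rather than the deprecated scrapy.contrib ones) and a stand-in domain:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["example.com"]        # stand-in domain
    start_urls = ["https://example.com/"]

    rules = (
        # Use a callback name other than "parse" on a CrawlSpider.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Each yielded item becomes one row in items.csv via the feed exports.
        yield {"url": response.url, "title": response.css("title::text").get()}
```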

scrapy: “load more result” pages

回眸只為那壹抹淺笑 submitted on 2019-12-11 18:09:14
Question: I was trying to write the following Scrapy script to scrape items from the following web site. I was able to scrape the items on the first page, but there are about 2,000 more pages and I want to scrape them all. There is a "load more results" option; I also tried to scrape the "load more results" pages, but was unable to do that. Please help me. from scrapy.shell import open_in_browser import scrapy from scrapy import Selector import math import json class MyItems(scrapy.Item): date = scrapy.Field() title = scrapy.Field() link
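"Load more results" buttons usually fire an XHR that returns JSON, so the common approach is to find that request in the browser's network tab and have the spider page through it directly. A sketch with an entirely hypothetical endpoint and JSON shape:

```python
import json
import scrapy

class LoadMoreSpider(scrapy.Spider):
    name = "load_more"
    # Hypothetical endpoint: find the real one in the browser dev tools
    # (Network tab) when clicking "load more results".
    api_url = "https://example.com/api/results?page={page}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1),
                             callback=self.parse_page,
                             cb_kwargs={"page": 1})

    def parse_page(self, response, page):
        data = json.loads(response.text)
        for entry in data.get("results", []):
            yield {
                "date": entry.get("date"),
                "title": entry.get("title"),
                "link": entry.get("link"),
            }
        # Keep requesting the next page until the API stops returning results.
        if data.get("results"):
            yield scrapy.Request(self.api_url.format(page=page + 1),
                                 callback=self.parse_page,
                                 cb_kwargs={"page": page + 1})
```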