scrapy-spider

Scrapy spider not terminating with use of CloseSpider extension

孤者浪人 submitted on 2019-12-12 12:15:46
Question: I have set up a Scrapy spider that parses an XML feed, processing some 20,000 records. For the purposes of development, I'd like to limit the number of items processed. From reading the Scrapy docs I identified that I need to use the CloseSpider extension. I have followed the guide on how to enable this - in my spider config I have the following: CLOSESPIDER_ITEMCOUNT = 1 EXTENSIONS = { 'scrapy.extensions.closespider.CloseSpider': 500, } However, my spider never terminates - I'm aware that the
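A minimal sketch of how those settings are usually wired up, with a hypothetical spider name, feed URL, and XPath. Two points worth noting: CLOSESPIDER_ITEMCOUNT only takes effect when exposed through custom_settings or settings.py (a plain class attribute is ignored), and requests already in flight still finish, so shutdown is not instantaneous.

```python
import scrapy

class FeedSpider(scrapy.Spider):
    name = "feed_spider"                           # hypothetical name
    start_urls = ["https://example.com/feed.xml"]  # hypothetical feed URL

    # Settings must go through custom_settings (or settings.py); a bare
    # class attribute like CLOSESPIDER_ITEMCOUNT = 1 is never read.
    custom_settings = {
        "CLOSESPIDER_ITEMCOUNT": 1,
        # The CloseSpider extension is enabled by default; listing it is harmless.
        "EXTENSIONS": {"scrapy.extensions.closespider.CloseSpider": 500},
    }

    def parse(self, response):
        # Requests already scheduled keep running after the limit is hit,
        # so a few extra items may still come through before the spider closes.
        for node in response.xpath("//item"):
            yield {"title": node.xpath("title/text()").get()}
```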

Log in not working using scrapy

陌路散爱 submitted on 2019-12-12 10:26:59
Question: I have written Scrapy code to log in to a site. First I tried it on one site and it worked well. But then I changed the URL and tried another site, and it does not work there, even though I used the same code without any change. What could be the problem? # -*- coding: utf-8 -*- import scrapy from scrapy.http import FormRequest from scrapy.utils.response import open_in_browser class QuoteSpider(scrapy.Spider): name = 'Quote' allowed_domains = ["quotes.toscrape.com"] start_urls = ( 'http://quotes
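A hedged sketch of the usual FormRequest.from_response login pattern against quotes.toscrape.com. The form field names ("username", "password") and the "Logout" check are specific to that site; a different site almost certainly uses different field names, hidden tokens, or a different form, which is the most common reason identical login code stops working elsewhere.

```python
import scrapy
from scrapy.http import FormRequest

class QuoteLoginSpider(scrapy.Spider):
    name = "quote_login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # from_response() copies hidden fields (e.g. the CSRF token) from the
        # page's login form and then overrides the fields given in formdata.
        return FormRequest.from_response(
            response,
            formdata={"username": "admin", "password": "admin"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if "Logout" in response.text:
            self.logger.info("Login succeeded")
        else:
            self.logger.warning("Login appears to have failed")
```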

Scrapy not calling any other function after “__init__”

那年仲夏 submitted on 2019-12-12 09:01:41
Question: OS: Ubuntu 16.04. Stack: Scrapy 1.0.3 + Selenium. I'm pretty new to Scrapy and this might sound very basic, but in my spider only __init__ is getting executed. Any code/function after that is not getting called and the spider just halts. class CancerForumSpider(scrapy.Spider): name = "mainpage_spider" allowed_domains = ["cancerforums.net"] start_urls = [ "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum" ] def __init__(self,*args,**kwargs): self.browser=webdriver.Firefox()
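One frequent cause of this symptom is overriding __init__ without calling the base class initializer, so Scrapy never finishes setting the spider up. A sketch under that assumption (the parse body is illustrative only):

```python
import scrapy
from selenium import webdriver

class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = ["http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"]

    def __init__(self, *args, **kwargs):
        # Without this super() call the base Spider never initializes,
        # which is a frequent reason nothing runs after __init__.
        super(CancerForumSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()

    def parse(self, response):
        # If parse() is still never reached, check the log for a blocking
        # failure inside __init__ (e.g. the Firefox driver not starting).
        self.browser.get(response.url)
        self.logger.info("Loaded %s in Selenium", response.url)
```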

Scrapy yield a Request, parse in the callback, but use the info in the original function

给你一囗甜甜゛ submitted on 2019-12-12 05:46:55
Question: So I'm trying to test some webpages in Scrapy. My idea is to yield a Request to the URLs that satisfy the condition, count the number of certain items on each page, and then within the original function return True/False depending on the count... Here is some code to show what I mean: def filter_categories: if condition: test = yield Request(url=link, callback = self.test_page, dont_filter=True) return (test, None) def test_page(self, link): ... parse the response... return True/False depending I have
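Scrapy callbacks run asynchronously, so a yielded Request cannot hand its result back to the function that yielded it. The usual workaround is to carry the needed context into the callback (via cb_kwargs on Scrapy 1.7+, or request.meta on older versions) and finish the decision there. A sketch with a hypothetical URL, selectors, and threshold:

```python
import scrapy

class FilterSpider(scrapy.Spider):
    name = "filter_demo"
    start_urls = ["https://example.com/categories"]  # hypothetical URL

    def parse(self, response):
        for link in response.css("a.category::attr(href)").getall():  # placeholder selector
            # A yielded Request cannot return a value to this function;
            # instead, pass the context the decision needs into the callback.
            yield response.follow(
                link,
                callback=self.test_page,
                cb_kwargs={"category_url": link},
                dont_filter=True,
            )

    def test_page(self, response, category_url):
        item_count = len(response.css("div.item"))  # placeholder selector
        if item_count > 10:  # hypothetical threshold
            yield {"category": category_url, "items": item_count}
```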

scraping: nested url data scraping

99封情书 submitted on 2019-12-12 04:45:14
Question: I have a website, https://www.grohe.com/in. From that page I want to get one type of bathroom faucet: https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. On that page there are multiple products/related products. I want to get each product URL and scrape its data. For that I wrote the following... My items.py file looks like: from scrapy.item import Item, Field class ScrapytestprojectItem(Item): producturl=Field() imageurl=Field() description=Field() The spider code is: import scrapy from
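A sketch of the usual nested-scraping pattern with that item class: collect the product links on the listing page, follow each one, and fill the item in the product-page callback. The CSS selectors are placeholders that would have to be checked against the real Grohe markup in the browser or scrapy shell.

```python
import scrapy
from scrapy.item import Item, Field

class ScrapytestprojectItem(Item):
    producturl = Field()
    imageurl = Field()
    description = Field()

class GroheSpider(scrapy.Spider):
    name = "grohe"
    start_urls = [
        "https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/"
    ]

    def parse(self, response):
        # Placeholder selector for the product links on the listing page.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        item = ScrapytestprojectItem()
        item["producturl"] = response.url
        item["imageurl"] = response.css("img::attr(src)").get()       # placeholder
        item["description"] = response.css("h1::text").get()          # placeholder
        yield item
```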

Crawl website from list of values using scrapy

北城以北 submitted on 2019-12-12 03:32:00
Question: I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a CSV file. I am able to do it manually by pasting the URLs into the code, but I cannot figure out how to do it when I have a whole list of NPIs and want the provider name for each. Here is my current code: import scrapy from scrapy.spider import BaseSpider class MySpider(BaseSpider): name = "npidb" def start_requests(self): urls = [ 'https://npidb.org/npi-lookup/
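A sketch of reading the NPIs from the CSV inside start_requests and building one request per row. The file name, URL pattern, and result selector are assumptions that would need to be verified against the real site.

```python
import csv
import scrapy

class NpiSpider(scrapy.Spider):
    name = "npidb"

    def start_requests(self):
        # "npis.csv" is an assumed file with one NPI per row.
        with open("npis.csv", newline="") as f:
            for row in csv.reader(f):
                npi = row[0].strip()
                if not npi:
                    continue
                # Assumed URL pattern; confirm the real lookup URL in a browser.
                url = f"https://npidb.org/npi-lookup/?npi={npi}"
                yield scrapy.Request(url, callback=self.parse, cb_kwargs={"npi": npi})

    def parse(self, response, npi):
        yield {
            "npi": npi,
            # Placeholder selector; adjust to the page's real markup.
            "provider_name": response.css("h1::text").get(),
        }
```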

How to get plain text in between multiple html tag using scrapy

大城市里の小女人 submitted on 2019-12-12 03:19:15
Question: I am trying to grab all the text from multiple tags at a given URL using Scrapy. I am new to Scrapy and don't have much of an idea how to achieve this; I'm learning through examples and other people's experience on Stack Overflow. Here is the list of tags that I am targeting: <div class="TabsMenu fl coloropa2 fontreg"><p>root div<p> <a class="sub_h" id="mtongue" href="#">Mother tongue</a> <a class="sub_h" id="caste" href="#">Caste</a> <a class="sub_h" id="scases" href="#">My name is nand </a> </div> <div class=
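For markup like that, an XPath of the form //div[...]//text() returns the text of the div and of every descendant tag (the <p> and <a> elements) in one pass. A sketch with a hypothetical start URL:

```python
import scrapy

class TabsTextSpider(scrapy.Spider):
    name = "tabs_text"
    start_urls = ["https://example.com/page"]  # hypothetical URL

    def parse(self, response):
        # Collect the text of the TabsMenu div and all of its descendants.
        texts = response.xpath(
            '//div[contains(@class, "TabsMenu")]//text()'
        ).getall()
        # Drop whitespace-only nodes and join into one plain string.
        yield {"text": " ".join(t.strip() for t in texts if t.strip())}
```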

Grabbed data from a given URL and put it into a file using scrapy

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-12 01:25:17
Question: I am trying to crawl a given web site deeply and grab text from all of its pages. I am using Scrapy. Here is how I am running the spider: scrapy crawl stack_crawler -o items.json, but the items.json file comes out empty. Here is the spider code snippet: # -*- coding: utf-8 -*- import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule #from tutorial.items import TutorialItem from tutorial.items import DmozItem class StackCrawlerSpider(CrawlSpider): name
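Two classic causes of an empty items.json with a CrawlSpider are naming the callback parse (which overrides CrawlSpider's own parse and disables the rules) and omitting the callback from the Rule. A sketch that yields items from the rule callback; the domain, selector, and text truncation are placeholders.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class StackCrawlerSpider(CrawlSpider):
    name = "stack_crawler"
    allowed_domains = ["example.com"]        # assumed domain
    start_urls = ["https://example.com/"]

    rules = (
        # The callback must not be named "parse" on a CrawlSpider.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Yield something here so `scrapy crawl stack_crawler -o items.json`
        # has data to write.
        text = " ".join(response.xpath("//body//text()").getall())
        yield {"url": response.url, "text": text.strip()[:500]}
```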

scrapy crawlspider output

走远了吗. submitted on 2019-12-11 22:24:55
Question: I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine, but I'm having trouble getting it to output to a CSV file (or anything, really). So my question is: can I use this: scrapy crawl dmoz -o items.csv, or do I have to create an item pipeline? UPDATED, now with code!: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from targets.item import TargetsItem
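As long as the spider yields items from its callbacks, `scrapy crawl dmoz -o items.csv` writes them through the feed exports and no item pipeline is required. A sketch using the modern import paths (scrapy.spiders / scrapy.linkextractors rather than the deprecated scrapy.contrib ones) and a stand-in domain:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["example.com"]        # stand-in domain
    start_urls = ["https://example.com/"]

    rules = (
        # Use a callback name other than "parse" on a CrawlSpider.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Each yielded item becomes one row in items.csv via the feed exports.
        yield {"url": response.url, "title": response.css("title::text").get()}
```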

scrapy: “load more result” pages

回眸只為那壹抹淺笑 submitted on 2019-12-11 18:09:14
Question: I was trying to write the following Scrapy script to scrape items from the following web site. I was able to scrape the items on the first page, but there are about 2,000 more pages and I want to scrape them all. There is a "load more results" option; I also tried to scrape the "load more results" pages, but was unable to do that. Please help me. from scrapy.shell import open_in_browser import scrapy from scrapy import Selector import math import json class MyItems(scrapy.Item): date = scrapy.Field() title = scrapy.Field() link
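"Load more results" buttons usually fire an XHR that returns JSON, so the common approach is to find that request in the browser's network tab and have the spider page through it directly. A sketch with an entirely hypothetical endpoint and JSON shape:

```python
import json
import scrapy

class LoadMoreSpider(scrapy.Spider):
    name = "load_more"
    # Hypothetical endpoint: find the real one in the browser dev tools
    # (Network tab) when clicking "load more results".
    api_url = "https://example.com/api/results?page={page}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1),
                             callback=self.parse_page,
                             cb_kwargs={"page": 1})

    def parse_page(self, response, page):
        data = json.loads(response.text)
        for entry in data.get("results", []):
            yield {
                "date": entry.get("date"),
                "title": entry.get("title"),
                "link": entry.get("link"),
            }
        # Keep requesting the next page until the API stops returning results.
        if data.get("results"):
            yield scrapy.Request(self.api_url.format(page=page + 1),
                                 callback=self.parse_page,
                                 cb_kwargs={"page": page + 1})
```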