scrapyd

Running more than one spider, one by one

Submitted by 馋奶兔 on 2019-12-14 03:12:41
Question: I am using the Scrapy framework to make spiders crawl some webpages. Basically, what I want is to scrape the pages and save them to a database. I have one spider per webpage, but I am having trouble running those spiders so that one spider starts crawling exactly after another spider finishes crawling. How can that be achieved? Is scrapyd the solution?

Answer 1: scrapyd is indeed a good way to go; the max_proc or max_proc_per_cpu configuration can be used to restrict the number of parallel
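Below is a minimal sketch of one way to get the "one spider after another" behaviour through the scrapyd API: schedule each spider with schedule.json and poll listjobs.json until its job shows up as finished before starting the next one. The host, project name and spider names are placeholders.

    # Sketch: run spiders strictly one after another via the scrapyd API.
    # SCRAPYD, PROJECT and SPIDERS are placeholder values.
    import time
    import requests

    SCRAPYD = "http://localhost:6800"
    PROJECT = "myproject"              # hypothetical project name
    SPIDERS = ["spider1", "spider2"]   # hypothetical spider names

    def job_finished(job_id):
        """Return True once the job appears in scrapyd's 'finished' list."""
        jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                            params={"project": PROJECT}).json()
        return any(job["id"] == job_id for job in jobs.get("finished", []))

    for spider in SPIDERS:
        resp = requests.post(f"{SCRAPYD}/schedule.json",
                             data={"project": PROJECT, "spider": spider}).json()
        job_id = resp["jobid"]
        # Wait for this spider to finish before scheduling the next one.
        while not job_finished(job_id):
            time.sleep(10)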

Change the number of running spiders in scrapyd

Submitted by 天大地大妈咪最大 on 2019-12-13 00:35:27
Question: Hey, so I have about 50 spiders in my project and I'm currently running them via a scrapyd server. I'm running into an issue where some of the resources I use get locked, which makes my spiders fail or run really slowly. I was hoping there was some way to tell scrapyd to have only 1 running spider at a time and leave the rest in the pending queue. I didn't see a configuration option for this in the docs. Any help would be much appreciated!

Answer 1: This can be controlled by scrapyd settings. Set max_proc
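As a sketch of what that configuration looks like in scrapyd.conf (the file location varies by installation, e.g. /etc/scrapyd/scrapyd.conf or a scrapyd.conf alongside the project):

    [scrapyd]
    # With max_proc = 1, scrapyd starts at most one Scrapy process at a time;
    # every other scheduled spider waits in the pending queue.
    max_proc = 1
    # max_proc_per_cpu only applies when max_proc is 0 (the default), in which
    # case the limit becomes max_proc_per_cpu * number_of_cpus.
    max_proc_per_cpu = 4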

Not able to log in to this website (https://www.bestpricewholesale.co.in/Registration/login.aspx) in a Python Scrapy project

Submitted by 倖福魔咒の on 2019-12-12 02:55:20
Question: I am not able to log in to only this one website in my Python Scrapy project. I want to scrape a website that requires login, and I have already logged in to many websites in my project, but I cannot log in to this one. I think the problem is in the form response, at the step that is supposed to """Return the most likely field names for username and password""", but I am not sure and have not been able to solve the issue.

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders.init import InitSpider
    from scrapy.http
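For reference, here is a minimal login sketch using FormRequest.from_response with the field names given explicitly, which avoids relying on the automatic username/password field guessing. The input names txtUserName/txtPassword are assumptions (inspect the login page's HTML for the real ones), and it uses a plain scrapy.Spider rather than the InitSpider from the question.

    # Minimal sketch: log in with an explicit formdata mapping.
    # from_response carries over hidden ASP.NET fields such as __VIEWSTATE.
    import scrapy

    class LoginSpider(scrapy.Spider):
        name = "bestprice_login"
        start_urls = ["https://www.bestpricewholesale.co.in/Registration/login.aspx"]

        def parse(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formdata={
                    "txtUserName": "your_username",   # hypothetical field name
                    "txtPassword": "your_password",   # hypothetical field name
                },
                callback=self.after_login,
            )

        def after_login(self, response):
            if "logout" in response.text.lower():
                self.logger.info("Login appears to have succeeded")
            else:
                self.logger.warning("Login may have failed; check the field names")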

Input/output for a scrapyd instance hosted on an Amazon EC2 Linux instance

Submitted by 女生的网名这么多〃 on 2019-12-12 02:23:22
Question: Recently I began building web scrapers using Scrapy. Originally I deployed my Scrapy projects locally using scrapyd. The project I built relies on accessing data from a CSV file in order to run:

    def search(self, response):
        with open('data.csv', 'rb') as fin:
            reader = csv.reader(fin)
            for row in reader:
                subscriberID = row[0]
                newEffDate = datetime.datetime.now()
                counter = 0
                yield scrapy.Request(
                    url = "https://www.healthnet.com/portal/provider/protected/patient/results
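When the project is deployed to scrapyd it runs from an egg, so a relative path like 'data.csv' may not resolve. One sketch (assuming the file is bundled into the egg via package_data in setup.py and that the Scrapy project package is called "myproject") is to read it with pkgutil instead of open():

    # Sketch: read a CSV packaged inside the project egg rather than from a
    # relative path on disk. The package and resource names are assumptions.
    import csv
    import io
    import pkgutil

    def load_rows():
        raw = pkgutil.get_data("myproject", "resources/data.csv")  # hypothetical path
        if raw is None:
            raise FileNotFoundError("data.csv was not bundled into the egg")
        return list(csv.reader(io.StringIO(raw.decode("utf-8"))))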

Scrapy / Python and SQL Server

Submitted by 拟墨画扇 on 2019-12-11 19:13:50
Question: Is it possible to take the data scraped from websites using Scrapy and save it in a Microsoft SQL Server database? If yes, are there any examples of this being done? Is it mainly a Python issue, i.e. if I find some Python code that saves to a SQL Server database, can Scrapy do the same?

Answer 1: Yes, but you'd have to write the code to do it yourself, since Scrapy does not provide an item pipeline that writes to a database. Have a read of the Item Pipeline page from the scrapy
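As an illustration of what such a self-written pipeline might look like with pyodbc (the driver, connection string, table and columns below are assumptions, and the class still has to be registered in ITEM_PIPELINES):

    # Sketch of an item pipeline that inserts items into SQL Server via pyodbc.
    import pyodbc

    class SqlServerPipeline:
        def open_spider(self, spider):
            # Hypothetical connection string; adapt driver, server and credentials.
            self.conn = pyodbc.connect(
                "DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=localhost;DATABASE=scrapydb;UID=user;PWD=password"
            )
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Hypothetical table and columns.
            self.cursor.execute(
                "INSERT INTO scraped_items (title, url) VALUES (?, ?)",
                item.get("title"), item.get("url"),
            )
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()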

On deploying an egg file to the scrapyd server: {"status": "error", "message": "IndexError: list index out of range"}

Submitted by ℡╲_俬逩灬. on 2019-12-11 12:23:00
Question: Deploying to project "projectname" in http://127.0.0.1:6800/addversion.json. Server response (200): {"status": "error", "message": "IndexError: list index out of range"}. When I create the egg file and deploy it to the scrapyd server, this error comes up. If someone has a solution, please share it. Thanks in advance.

Answer 1: Please check the argument parameters used in your code:

    import os
    s_file = sys.argv[2]

Please check your argument parameters, or check that your code should not be outside to
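A common reading of this error (an assumption here, since the full code is not shown) is that module-level sys.argv access runs when scrapyd imports the project during deployment, where those extra arguments do not exist, hence the IndexError. A sketch of moving the value into a spider argument passed at schedule time instead:

    # Sketch: accept the value as a spider argument rather than reading
    # sys.argv at import time. The argument name "s_file" is hypothetical.
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def __init__(self, s_file=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Passed via: curl http://localhost:6800/schedule.json -d ... -d s_file=path
            self.s_file = s_file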

in `escape': undefined method `gsub' for #<URI::HTTP:0x007fa07cb01e08> (NoMethodError)

Submitted by 心已入冬 on 2019-12-11 07:53:14
Question: Hi, I am trying to scrape a web page, take its links, go to those links and scrape them too.

    require 'rubygems'
    require 'scrapi'
    require 'uri'

    Scraper::Base.parser :html_parser

    web = "http://......"

    def sub_web(linksubweb)
      uri = URI.parse(URI.encode(linksubweb))
    end

    scraper = Scraper.define do
      array :items
      process "div.mozaique>div", :items => Scraper.define {
        process "p>a", :title => :text
        process "div.thumb>a", :link => "@href"
        result :title, :link,
      }
      result :items
    end

    uri = URI.parse(URI

Horizontally scaling Scrapyd

Submitted by 拟墨画扇 on 2019-12-07 05:37:15
Question: What tool or set of tools would you use for horizontally scaling scrapyd: adding new machines to a scrapyd cluster dynamically, and having N instances per machine if required? It is not necessary for all the instances to share a common job queue, but that would be awesome. Scrapy Cluster seems promising for the job, but I want a scrapyd-based solution, so I am open to other alternatives and suggestions.

Answer 1: I scripted my own load balancer for Scrapyd using its API and a wrapper.

    from random import
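The answer's script is cut off above; here is a sketch in the same spirit, picking one of a fixed list of scrapyd hosts at random and scheduling through its API. The host names are placeholders, and a shared or external job queue is deliberately not handled.

    # Sketch of a tiny load balancer: choose a scrapyd host at random and
    # schedule the spider there via schedule.json.
    from random import choice

    import requests

    SCRAPYD_HOSTS = [
        "http://scrapyd-1:6800",   # placeholder host
        "http://scrapyd-2:6800",   # placeholder host
    ]

    def schedule(project, spider, **spider_args):
        host = choice(SCRAPYD_HOSTS)
        data = {"project": project, "spider": spider, **spider_args}
        resp = requests.post(f"{host}/schedule.json", data=data)
        resp.raise_for_status()
        return host, resp.json().get("jobid")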

Running multiple spiders using scrapyd

Submitted by 不羁的心 on 2019-12-06 03:30:26
I have multiple spiders in my project, so I decided to run them by uploading the project to a scrapyd server. I uploaded my project successfully, and I can see all the spiders when I run the command

    curl http://localhost:6800/listspiders.json?project=myproject

When I run the following command

    curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

only one spider runs, because only one spider is given. But I want to run multiple spiders here, so is the following command right for running multiple spiders in scrapyd?

    curl http://localhost:6800/schedule.json -d project=myproject -d
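schedule.json starts one spider per request, so the usual approach is simply to call it once per spider. A sketch with placeholder project and spider names:

    # Sketch: schedule several spiders by issuing one schedule.json request each.
    import requests

    for spider in ["spider1", "spider2", "spider3"]:   # hypothetical spider names
        requests.post("http://localhost:6800/schedule.json",
                      data={"project": "myproject", "spider": spider})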

Why does scrapyd throw a "'FeedExporter' object has no attribute 'slot'" exception?

Submitted by 南楼画角 on 2019-12-05 16:16:12
I came across a situation where my Scrapy code works fine when run from the command line, but when I use the same spider after deploying it (scrapy-deploy) and scheduling it with the scrapyd API, it throws errors in the "scrapy.extensions.feedexport.FeedExporter" class: one while handling the "open_spider" signal, a second while handling the "item_scraped" signal, and a last one while handling the "close_spider" signal.

1. "open_spider" signal error

    2016-05-14 12:09:38 [scrapy] INFO: Spider opened
    2016-05-14 12:09:38 [scrapy] ERROR: Error caught on signal handler: <bound method ?.open_spider of <scrapy.extensions
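One frequently reported trigger for the missing 'slot' attribute (an assumption here, not confirmed by this post) is that the feed export is only partly configured when the spider runs under scrapyd, e.g. no FEED_URI is set, so FeedExporter never creates its slot even though its signal handlers still fire. A sketch of passing the feed settings explicitly when scheduling the job:

    # Sketch: supply feed-export settings through scrapyd's "setting" parameter
    # so FeedExporter has a FEED_URI to open its slot with. The output path and
    # format are placeholders.
    import requests

    requests.post(
        "http://localhost:6800/schedule.json",
        data=[
            ("project", "myproject"),                  # hypothetical project name
            ("spider", "myspider"),                    # hypothetical spider name
            ("setting", "FEED_URI=/tmp/items.json"),   # hypothetical output path
            ("setting", "FEED_FORMAT=jsonlines"),
        ],
    )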