scrapyd

Horizontally scaling Scrapyd

旧街凉风 submitted on 2019-12-05 10:21:42
What tool or set of tools would you use to scale Scrapyd horizontally, adding new machines to a Scrapyd cluster dynamically and running N instances per machine if required? It is not necessary for all the instances to share a common job queue, but that would be awesome. Scrapy Cluster seems promising for the job, but I want a Scrapyd-based solution, so I am open to other alternatives and suggestions. I scripted my own load balancer for Scrapyd using its API and a wrapper:

from random import shuffle
from scrapyd_api.wrapper import ScrapydAPI

class JobLoadBalancer(object):

    @classmethod
    def get_less
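
The snippet above is cut off. As a rough illustration of the same idea, here is a minimal sketch of a least-loaded balancer built on the python-scrapyd-api wrapper; the endpoint list, project name, and the get_least_loaded method are assumptions for illustration, not a reconstruction of the original class.

from random import shuffle

from scrapyd_api import ScrapydAPI


class JobLoadBalancer(object):
    """Pick the Scrapyd instance with the fewest queued or running jobs."""

    def __init__(self, endpoints, project):
        # endpoints and project are placeholders, e.g. ['http://host1:6800', 'http://host2:6800']
        self.clients = [ScrapydAPI(url) for url in endpoints]
        self.project = project

    def _load(self, client):
        # list_jobs returns dicts of pending / running / finished jobs for the project.
        jobs = client.list_jobs(self.project)
        return len(jobs.get('pending', [])) + len(jobs.get('running', []))

    def get_least_loaded(self):
        # Shuffle first so ties are broken randomly across instances.
        clients = list(self.clients)
        shuffle(clients)
        return min(clients, key=self._load)

    def schedule(self, spider, **kwargs):
        # Send the job to whichever instance is least busy right now.
        return self.get_least_loaded().schedule(self.project, spider, **kwargs)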

Error in deploying a project using scrapyd

心不动则不痛 submitted on 2019-12-05 04:49:46
I have multiple spiders in my project folder and want to run all of them at once, so I decided to run them using the scrapyd service. I started doing this by following the guide here. First of all, I am in the current project folder. I opened the scrapy.cfg file and uncommented the url line after [deploy]. I ran the scrapy server command, which works fine and the scrapyd server runs. I tried this command:

scrapy deploy -l

Result: default http://localhost:6800/

When I tried this command:

scrapy deploy -L scrapyd

I got the following output:

Usage
=====
scrapy deploy [options] [ [target] | -l | -L <target> ]
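
For the -L scrapyd form to work, a deploy target with that name has to exist in scrapy.cfg. A sketch of such a target, with the project name and URL as placeholders:

[settings]
default = myproject.settings

# A named deploy target; referenced as "scrapyd" by scrapy deploy / scrapyd-deploy.
[deploy:scrapyd]
url = http://localhost:6800/
project = myproject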

芝麻HTTP: A tutorial on testing Gerapy on Alibaba Cloud

你离开我真会死。 submitted on 2019-12-04 23:23:32
I gave Gerapy a quick try on Alibaba Cloud today; here is a brief write-up.

1. Setting up the environment
The Python that ships with my Alibaba Cloud instance is 2.7.5, so I installed a 3.6.4 environment with pyenv; after installation, pyenv global 3.6.4 switches to it. I personally like this approach: environments can be switched freely without interfering with each other (see the screenshot, not reproduced here). Next, following 大才's article, pip install gerapy is all that is needed; I ran into no problems at this step. If you do, you can open an issue with 大才.

2. Starting the service
First, configure the security group in the Alibaba Cloud console (mine is shown in the screenshot, not reproduced here), then open ports 8000 and 6800. Then run:

gerapy init
cd gerapy
gerapy migrate
# note the next step
gerapy runserver 0.0.0.0:8000

(If you are running locally, gerapy runserver is enough; on Alibaba Cloud you have to bind to 0.0.0.0 as above.) Now visiting ip:8000 in a browser should show the main interface. See 大才's article for what each part means.

3. Creating a project
Create a new Scrapy crawler under Gerapy's projects directory; here I use the simplest possible one:

scrapy startproject gerapy_test
cd gerapy_test
scrapy genspider baidu www.baidu.com

That gives the simplest possible crawler; modify the settings
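
For reference, the spider that scrapy genspider baidu www.baidu.com generates looks roughly like the sketch below; the exact template output depends on the Scrapy version, and the parse body here is an illustrative addition rather than what the generator emits.

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Placeholder parse: log the page title to confirm the crawl ran.
        self.logger.info('Title: %s', response.xpath('//title/text()').get())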

How to password protect Scrapyd UI?

主宰稳场 submitted on 2019-12-04 19:41:07
I have my website available to the public, and Scrapyd is running at port 6800, like http://website.com:6800/. I do not want anyone to see the list of my crawlers. I know anyone can easily guess and type in port 6800 and see what's going on. I have a few questions; answering any of them will help me. Is there a way to password protect the Scrapyd UI? Can I password protect a specific port on Linux? I know it can be done with iptables to ONLY ALLOW PARTICULAR IPs, but that's not a good solution. Should I make changes to Scrapyd's source code? Can I password protect a specific port only via .htaccess? You should
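
The answer is cut off after "You should". One common approach, not necessarily the one the original answer goes on to describe, is to bind Scrapyd to 127.0.0.1 (the bind_address option in scrapyd.conf) and put a reverse proxy with HTTP basic auth in front of it. A rough nginx sketch, with the hostname, listening port, and password-file path as placeholder assumptions:

server {
    listen 6801;
    server_name website.com;

    location / {
        auth_basic "Scrapyd";
        auth_basic_user_file /etc/nginx/.htpasswd;  # create with: htpasswd -c /etc/nginx/.htpasswd youruser
        proxy_pass http://127.0.0.1:6800;
        proxy_set_header Host $host;
    }
}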

scrapyd-client command not found

拥有回忆 submitted on 2019-12-03 14:53:16
I had just installed scrapyd-client (1.1.0) in a virtualenv and ran the command scrapyd-deploy successfully, but when I run scrapyd-client, the terminal says: command not found: scrapyd-client. According to the readme file ( https://github.com/scrapy/scrapyd-client ), there should be a scrapyd-client command. I checked the path /lib/python2.7/site-packages/scrapyd-client; only scrapyd-deploy is in the folder. Has the scrapyd-client command been removed for now? Create a fresh environment and install scrapyd-client first using the below: pip install git+https://github.com/scrapy/scrapyd
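
As a sketch of that suggestion (the full repository URL is taken from the readme link quoted above, and the target listing at the end is only a quick sanity check):

# Install scrapyd-client straight from the repository.
pip install git+https://github.com/scrapy/scrapyd-client.git

# Per the question, the 1.1.0 release only ships the scrapyd-deploy entry point;
# listing the deploy targets confirms it is on your PATH.
scrapyd-deploy -l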

Scrapy get request url in parse

亡梦爱人 submitted on 2019-12-03 04:23:19
How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls and some of them redirect my spider to the homepage, and as a result I have an empty item. So I need something like item['start_url'] = request.url to store these URLs. I'm using the BaseSpider.

Jagu: The 'response' variable that's passed to parse() has the info you want. You shouldn't need to override anything. e.g. (EDITED)

def parse(self, response):
    print "URL: " + response.request.url

The request object is accessible from the response object, therefore you can do the following:

def parse(self, response):
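
Putting the answers together, here is a minimal sketch that stores the originally requested URL on the item. The spider name and URLs are placeholders, it uses the modern scrapy.Spider base class rather than the BaseSpider mentioned in the question, and the redirect_urls meta key is what RedirectMiddleware fills in when a start URL gets redirected.

import scrapy


class StartUrlSpider(scrapy.Spider):
    name = 'start_url_example'
    start_urls = ['http://example.com/page-1', 'http://example.com/page-2']

    def parse(self, response):
        # URL of the request that produced this response (after any redirects).
        final_url = response.request.url
        # If the request was redirected, the original start URL is the first entry
        # in the 'redirect_urls' meta key; otherwise fall back to the response URL.
        start_url = response.meta.get('redirect_urls', [response.url])[0]
        yield {'start_url': start_url, 'final_url': final_url}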

Enabling HttpProxyMiddleware in scrapyd

妖精的绣舞 submitted on 2019-12-03 03:58:47
After reading the Scrapy documentation, I thought that the HttpProxyMiddleware is enabled by default. But when I start a spider via scrapyd's webservice interface, HttpProxyMiddleware is not enabled. I receive the following output:

2013-02-18 23:51:01+1300 [scrapy] INFO: Scrapy 0.17.0-120-gf293d08 started (bot: pde)
2013-02-18 23:51:02+1300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, CloseSpider, WebService, CoreStats, SpiderState
2013-02-18 23:51:02+1300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
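
One likely explanation, worth verifying against the Scrapy version shown in the log, is that older releases only enabled HttpProxyMiddleware when proxy environment variables such as http_proxy were set, and a daemonized scrapyd does not inherit them from your login shell. Newer Scrapy versions keep the middleware active and read a per-request proxy from request.meta; a minimal sketch, with the spider name, URL, and proxy address as assumptions:

import scrapy


class ProxiedSpider(scrapy.Spider):
    name = 'proxied_example'

    def start_requests(self):
        # HttpProxyMiddleware reads the 'proxy' key from request.meta.
        yield scrapy.Request(
            'http://example.com/',
            meta={'proxy': 'http://127.0.0.1:8118'},
        )

    def parse(self, response):
        self.logger.info('Fetched %s through the proxy', response.url)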

How to setup and launch a Scrapy spider programmatically (urls and settings)

谁说我不能喝 submitted on 2019-12-03 02:02:07
I've written a working crawler using Scrapy, and now I want to control it through a Django web app, that is to say: set one or several start_urls, set one or several allowed_domains, set settings values, start the spider, stop / pause / resume a spider, retrieve some stats while it is running, and retrieve some stats after the spider is complete. At first I thought scrapyd was made for this, but after reading the doc, it seems that it's more a daemon able to manage 'packaged spiders', aka 'scrapy eggs'; and that all the
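
Scrapyd aside, one way to drive a spider directly from Python code (for example from a Django management command) is Scrapy's CrawlerProcess API. A minimal sketch, where the import path, spider class, URLs, domains, and the overridden setting are all placeholder assumptions:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import MySpider  # placeholder import path


def run_crawl(start_urls, allowed_domains):
    settings = get_project_settings()
    settings.set('DOWNLOAD_DELAY', 1.0)  # example of overriding a setting programmatically

    process = CrawlerProcess(settings)
    # Keyword arguments are handed to the spider, so start_urls and
    # allowed_domains can be chosen per run.
    process.crawl(MySpider, start_urls=start_urls, allowed_domains=allowed_domains)
    process.start()  # blocks until the crawl finishes


if __name__ == '__main__':
    run_crawl(['http://example.com/'], ['example.com'])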