Scrapy crawlers, and an introduction to Gerapy, a distributed crawler manager built on scrapy

Submitted by 蓝咒 on 2020-04-06 10:40:21

Environment: Python 3.8, scrapy

Packages are installed mainly with pip install; install Python 3.8 first.
Installation notes:
1. Installing these packages may require the VS C++ Build Tools (VS2015 or later; installing VS2019 directly also works).
2. .NET 4.6 or later is also required.

Versions used:

scrapy 2.0.1
Twisted 20.3.0
gerapy 0.9.2
pywin32 220

Python downloads: https://www.python.org/downloads/windows/
Twisted downloads: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
pywin32 downloads: https://nchc.dl.sourceforge.net/project/pywin32/pywin32/

python -m venv python_demo    (create a virtual environment with Python 3's built-in venv; you can also work in the system environment directly, and virtualenv offers similar functionality, but here we use the built-in module)

cd python_demo

Scripts\activate      activate the python_demo virtual environment created by venv
Scripts\deactivate    exit the python_demo virtual environment created by venv

python -m pip install --upgrade pip   (upgrade pip)

Install: pip install scrapy scrapyd scrapyd-client gerapy

scrapy — the crawler framework
scrapyd — a crawler management service
scrapyd-client — a client-side tool for the project that makes it easy to push code to scrapyd; the scrapyd-deploy command is one of its tools
gerapy — a distributed crawler management tool built on scrapy and scrapyd
python-scrapyd-api — an API wrapper for scrapyd (not in the pip install line above; install it separately if needed; see the sketch below)
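
As a quick sketch of what python-scrapyd-api looks like in use (a minimal example, assuming a scrapyd instance listening on localhost:6800 and the project/spider names used later in this article; method names are per the python-scrapyd-api package):

from scrapyd_api import ScrapydAPI

# Connect to a locally running scrapyd instance.
scrapyd = ScrapydAPI('http://localhost:6800')

# List the projects deployed on this scrapyd server.
print(scrapyd.list_projects())

# List the spiders of one project, then schedule a run;
# schedule() returns the job id that scrapyd assigns.
print(scrapyd.list_spiders('baiduscrapy'))
job_id = scrapyd.schedule('baiduscrapy', 'baiduscrapySpider')
print(job_id)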

After installing scrapyd-client on Windows, you also need to create scrapyd-deploy.bat in the Scripts directory, with the following content (change the path to match your environment):

@echo off
python F:\pdpy\scrapy0\Scripts\scrapyd-deploy %*


scrapyd-deploy -l                       (list the deploy targets configured in scrapy.cfg)
scrapyd-deploy <target> -p <project>    (package the project and deploy it to the named target)

Edit the crawler project's scrapy.cfg:

[deploy:scrapyd001]
url=http://127.0.0.1:6800/
project=baiduscrapy
#username = test
#password = test

scrapyd-deploy scrapyd001 -p baiduscrapy

curl http://localhost:6800/schedule.json -d project=PROJECT_NAME -d spider=SPIDER_NAME
curl http://localhost:6800/schedule.json -d project=baiduscrapy -d spider=baiduscrapySpider

PROJECT_NAME is the project= value in the scrapy project's scrapy.cfg configuration file shown above.
SPIDER_NAME is the name attribute in the spider file; it is the spider's name and must be unique within the project, e.g.:

import scrapy
from scrapy_redis.spiders import RedisSpider  # RedisSpider comes from the scrapy-redis package

class QuotesSpider(scrapy.Spider):
    name = "quotes"

class csrcSpider(RedisSpider):
    name = 'csrcSpider'
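
The curl scheduling call above can also be issued from Python with requests; a minimal sketch (assuming scrapyd is running on localhost:6800 and the baiduscrapy project has been deployed):

import requests

# Equivalent of the curl command above: POST the project and
# spider names to scrapyd's schedule.json endpoint.
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'baiduscrapy', 'spider': 'baiduscrapySpider'},
)

# On success scrapyd replies with {"status": "ok", "jobid": "..."}.
result = resp.json()
print(result['status'], result.get('jobid'))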

scrapyd's configuration file:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
#bind_address = 127.0.0.1
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

Each xxx.json endpoint above corresponds to one function, e.g.:

Delete a project: curl http://127.0.0.1:6800/delproject.json -d project=sogou
List projects: curl http://127.0.0.1:6800/listprojects.json
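
The other endpoints follow the same pattern. As one more sketch, querying listjobs.json for a project (parameters per the scrapyd docs; the project name is the example one used above):

import requests

# listjobs.json takes a project parameter and returns the pending,
# running and finished job lists for that project.
jobs = requests.get(
    'http://127.0.0.1:6800/listjobs.json',
    params={'project': 'baiduscrapy'},
).json()

for state in ('pending', 'running', 'finished'):
    for job in jobs.get(state, []):
        print(state, job['spider'], job['id'])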


Things to note about the scrapy.cfg configuration file

Note: cd to the project root directory before running the deploy command.

Case 1: an anonymous [deploy] section, so scrapyd-deploy is run with no target name.
cfg:

[deploy]
url = http://192.168.17.129:6800/
project = tutorial
username = enlong
password = test

Output:

python@ubuntu:~/project/tutorial$ scrapyd-deploy 
Packing version 1471069533
Deploying to project "tutorial" in http://192.168.17.129:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1471069533", "spiders": 1, "node_name": "ubuntu"}
Case 2: a named target, [deploy:tutorial_deploy], so the target name is passed to scrapyd-deploy.
cfg:

[deploy:tutorial_deploy]
url = http://192.168.17.129:6800/
project = tutorial
username = enlong
password = test

Output:

python@ubuntu:~/project/tutorial$ scrapyd-deploy tutorial_deploy
Packing version 1471069591
Deploying to project "tutorial" in http://192.168.17.129:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1471069591", "spiders": 1, "node_name": "ubuntu"}


Creating a scrapy crawler project

scrapy startproject tutorial
cd tutorial
scrapy genspider doubanSpider movie.douban.com

Edit scrapy.cfg and write the spider file xxxxSpider; a skeleton sketch follows.
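
genspider generates a skeleton roughly like the one below; fill in parse() with your extraction logic (the CSS selector here is only an illustrative assumption, not taken from the real page):

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'doubanSpider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/']

    def parse(self, response):
        # Illustrative only: yield one item per link text found.
        for title in response.css('a::text').getall():
            yield {'title': title}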

Run: scrapy crawl doubanSpider

Steps for working with gerapy


gerapy init    (initialize a gerapy workspace)
cd gerapy

gerapy migrate    (run the database migrations)

gerapy createsuperuser   (create an admin account)

gerapy runserver    (start the web service)

Open http://127.0.0.1:8000 and log in with the account and password you just created

