Preferred way to run Scrapyd in the background / as a service


Question


I am trying to run Scrapyd on a virtual Ubuntu 16.04 server, to which I connect via SSH. When I start Scrapyd by simply running

$ scrapyd

I can connect to the web interface by going to http://82.165.102.18:6800.

However, once I close the SSH connection, the web interface is no longer available. Therefore, I think I need to run Scrapyd in the background as a service somehow.

After some research I came across a few proposed solutions:

  • daemon (sudo apt install daemon)
  • screen (sudo apt install screen)
  • tmux (sudo apt install tmux)

Does someone know what the best / recommended solution is? Unfortunately, the Scrapyd documentation is rather thin and outdated.

For some background, I need to run about 10-15 spiders on a daily basis.


Answer 1:


Use this command.

cd /path/to/your/project/folder && nohup scrapyd >& /dev/null &

Now you can close your SSH connection but scrapyd will keep running.
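
To double-check that scrapyd is still alive after you log out, you can poll its status endpoint from a fresh SSH session (this assumes the default port 6800; daemonstatus.json is part of the standard Scrapyd HTTP API):

$ curl http://localhost:6800/daemonstatus.json

It should return a small JSON document with "status": "ok".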

To make sure that scrapyd runs automatically whenever your server restarts, do this:

Copy the output of echo $PATH from your terminal, and then open your crontab with crontab -e.

Now at the very top of that file, write this

PATH=YOUR_COPIED_CONTENT

And now at the end of your crontab, write this.

@reboot cd /path/to/your/project/folder && nohup scrapyd >& /dev/null &

This means that each time your server restarts, the command cd /path/to/your/project/folder && nohup scrapyd >& /dev/null & will run automatically.
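
Putting it together, the resulting crontab might look roughly like this (the PATH value and the project folder are placeholders for your own values; note that cron runs commands with /bin/sh, where the portable > /dev/null 2>&1 form is safer than bash's >& shorthand):

# paste the output of `echo $PATH` from your shell here
PATH=/usr/local/bin:/usr/bin:/bin

# restart scrapyd on every boot
@reboot cd /path/to/your/project/folder && nohup scrapyd > /dev/null 2>&1 &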




Answer 2:


To have scrapyd run as a daemon, you can simply do:

$ scrapyd &

The & at the end sends scrapyd to the background of your shell. Note, however, that a plain backgrounded job can still be terminated by SIGHUP when your SSH session closes, so combining it with nohup or disown (as in the answer above) is safer.
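
A slightly more robust variant (assuming a bash shell) detaches the job from the terminal so it survives the logout:

$ nohup scrapyd > /dev/null 2>&1 &
$ disown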

Or, you can run the following command to start the service from your Scrapy project folder:

$ daemon --chdir=/home/ubuntu/crawler scrapyd

As you mentioned, to use "daemon", you first need to install it on your Ubuntu server:

$ sudo apt-get install daemon

After starting scrapyd as a daemon in one of the ways above, you should still be able to access the scrapyd web interface after closing your SSH connection.
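
As a side note, the libslack daemon utility that provides this command can usually also name and respawn the managed process, which makes stopping it later easier; this is only a sketch, so check man daemon for the exact options your version supports:

$ daemon --name=scrapyd --respawn --chdir=/home/ubuntu/crawler scrapyd
$ daemon --name=scrapyd --stop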




Answer 3:


If you have Scrapyd installed on an Ubuntu server, I'd put this command at the end of the /etc/rc.local file:

<path_to_scrapyd_binary>/scrapyd > /dev/null 2>&1 &

where <path_to_scrapyd_binary> is probably going to be something like /usr/local/bin. /etc/rc.local is best suited for cases when you don't want to build your own service file or startup script. Putting the command in the cron table with @reboot was also suggested, but for some reason this sometimes didn't work for me (though I didn't examine the reasons in depth).
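
For reference, a minimal /etc/rc.local using this approach could look like the following (the scrapyd path is an assumption; on Ubuntu the file must be executable and should keep exit 0 as its last line):

#!/bin/sh -e
#
# rc.local - executed at the end of each multiuser runlevel

/usr/local/bin/scrapyd > /dev/null 2>&1 &

exit 0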

Still, my preferred option now is to deploy Scrapyd in Docker. You can get a Scrapyd image from Docker Hub, or you can build the image yourself if you have specific needs. I chose the second option. First I deployed my own Docker registry for that purpose. Once done, I built my own Scrapyd image using this Dockerfile:

FROM ubuntu:16.04

RUN apt-get update -q \
 && apt-get install -y --no-install-recommends \
    build-essential \
    ca-certificates \
    curl \
    libffi-dev \
    libjpeg-turbo8 \
    liblcms2-2 \
    libssl-dev \
    libtiff5 \
    libtool \
    libwebp5 \
    python \
    python-dev \
    zlib1g \
 && curl -sSL https://bootstrap.pypa.io/get-pip.py | python \
 && pip install --no-cache-dir \
    docker \
    future \
    geocoder \
    influxdb \
    Pillow \
    pymongo \
    scrapy-fake-useragent \
    scrapy_splash \
    scrapyd \
    selenium \
    unicode-slugify \
 && apt-get purge -y --auto-remove \
    build-essential \
    curl \
    libffi-dev \
    libssl-dev \
    libtool \
    python-dev \
 && rm -rf /var/lib/apt/lists/*

COPY ./scrapyd.conf /etc/scrapyd/

VOLUME /etc/scrapyd /var/lib/scrapyd
EXPOSE 6800

CMD ["scrapyd", "--logfile=/var/log/scrapyd.log", "--pidfile="]

After building the image and pushing it into the registry, I can deploy it to as many worker servers as I need (or, of course, locally). Once you have pulled the image (either the one from Docker Hub or your own), you can start it using:

sudo docker run --name=scrapyd -d -p 6800:6800 --restart=always -v /var/lib/scrapyd:/var/lib/scrapyd --add-host="dockerhost:"`ip addr show docker0 | grep -Po 'inet \K[\d.]+'` <location>/scrapyd

where <location> is either your Docker Hub account or points to your own registry. This rather complicated command starts the Scrapyd image in the background (-d option), listening on port 6800, every time the Docker service is (re-)started (--restart=always option). It also publishes your host's IP address as dockerhost to the container for cases where you need to access other (probably Dockerized) services on the host.
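
If you would rather not remember that command, roughly the same settings can be expressed in a docker-compose.yml. This is only a sketch: the image name is the same placeholder as above, and 172.17.0.1 is merely the usual default address of the docker0 bridge, so verify it on your host:

version: "3"
services:
  scrapyd:
    image: <location>/scrapyd
    ports:
      - "6800:6800"
    restart: always
    volumes:
      - /var/lib/scrapyd:/var/lib/scrapyd
    extra_hosts:
      - "dockerhost:172.17.0.1"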




Answer 4:


Supervisor is a great way to daemonize scrapyd. Installation is generally straightforward. Once you have it set up, starting and stopping the service is as easy as:

$ supervisorctl start scrapyd
$ supervisorctl stop scrapyd
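
The answer assumes supervisor already knows about a program called scrapyd. A minimal program section for it in /etc/supervisord.conf (or a file in supervisor's include directory) might look roughly like this; the paths are assumptions:

[program:scrapyd]
command=/usr/local/bin/scrapyd
directory=/home/ubuntu/crawler
autostart=true
autorestart=true
stdout_logfile=/var/log/scrapyd.log
stderr_logfile=/var/log/scrapyd.err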

If you choose this route, note that supervisord may throw a warning about not finding the configuration file. One way to fix this is to simply add a reference to the configuration in the init.d script:

prog_bin="${exec_prefix}/bin/supervisord -c /etc/supervisord.conf"


Source: https://stackoverflow.com/questions/47065225/preferred-way-to-run-scrapyd-in-the-background-as-a-service
