Input/output for a Scrapyd instance hosted on an Amazon EC2 Linux instance


Question


Recently I began building web scrapers using Scrapy. Originally I deployed my Scrapy projects locally using Scrapyd.

The Scrapy project I built relies on data from a CSV file in order to run:

    import csv
    import datetime

    import scrapy

    # Method on the spider class: read subscriber IDs from a local CSV and
    # schedule one request per row.
    def search(self, response):
        with open('data.csv', newline='') as fin:
            reader = csv.reader(fin)
            for row in reader:
                subscriberID = row[0]
                newEffDate = datetime.datetime.now()
                counter = 0
                yield scrapy.Request(
                    url="https://www.healthnet.com/portal/provider/protected/patient/results.action?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                    callback=self.find_term,
                    meta={
                        'ID': subscriberID,
                        'newDate': newEffDate,
                        'counter': counter,
                    },
                )

It writes the scraped data out to another CSV file:

    # Append one row to missing.csv for each record that needs follow-up.
    for x in data:
        with open('missing.csv', 'a', newline='') as fout:
            csvwriter = csv.writer(fout, delimiter=',')
            csvwriter.writerow([oldEffDate.strftime("%m/%d/%Y"), subscriberID, ipa])
            return

We are in the initial stages of building an application that needs to access and run these Scrapy spiders. I decided to host my Scrapyd instance on an AWS EC2 Linux instance. Deploying to AWS was straightforward (http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/).

How do I get scraped data into and out of a Scrapyd instance running on an AWS EC2 Linux instance?

EDIT: I'm assuming passing a file would look something like:

curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d in=file_path

Is this correct? How would I grab the output from this spider run? Does this approach have security issues?
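If extra -d parameters really are forwarded by schedule.json as spider arguments, I imagine the spider would pick them up roughly like this (just a sketch; the input_path name and Spider2B class are placeholders, and I use input_path rather than in because in is a reserved word in Python):

    import csv

    import scrapy

    class Spider2B(scrapy.Spider):
        name = 'spider2b'

        def __init__(self, input_path=None, *args, **kwargs):
            # Extra -d parameters sent to schedule.json arrive as keyword arguments
            super(Spider2B, self).__init__(*args, **kwargs)
            self.input_path = input_path

        def start_requests(self):
            # Read subscriber IDs from whatever path was passed when scheduling
            with open(self.input_path, newline='') as fin:
                for row in csv.reader(fin):
                    yield scrapy.Request(
                        "https://www.healthnet.com/portal/provider/protected/patient/results.action"
                        "?__checkbox_viewCCDocs=true&subscriberId=" + row[0] + "&formulary=formulary",
                        callback=self.find_term,
                        meta={'ID': row[0]},
                    )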


Answer 1:


Is S3 an option? I'm asking because you're already using EC2. If that's the case, you could read/write from/to S3.

I'm a bit confused because you mentioned both CSV and JSON formats. If you're reading CSV, you could use CSVFeedSpider. Either way, you could also use boto to read from S3 in your spider's __init__ or start_requests method.
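For example, something along these lines would read the input CSV from S3 in start_requests (a sketch only, using boto3 rather than the older boto; the bucket and key names are placeholders, and the find_term callback is taken from your question):

    import csv
    import io

    import boto3
    import scrapy

    class SearchSpider(scrapy.Spider):
        name = 'search'

        def start_requests(self):
            # Fetch the input CSV from S3 instead of the local filesystem
            # (bucket and key names here are placeholders)
            s3 = boto3.client('s3')
            body = s3.get_object(Bucket='my-scrapy-inputs', Key='data.csv')['Body'].read()
            for row in csv.reader(io.StringIO(body.decode('utf-8'))):
                subscriberID = row[0]
                yield scrapy.Request(
                    "https://www.healthnet.com/portal/provider/protected/patient/results.action"
                    "?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                    callback=self.find_term,
                    meta={'ID': subscriberID},
                )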

Regarding the output, the Scrapy feed exports documentation explains how to write the output of a crawl to S3; a minimal settings sketch follows the list below.

Relevant settings:

  • FEED_URI
  • FEED_FORMAT
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
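
For instance, a settings.py along these lines (the bucket name and credentials are placeholders) would push each run's items straight to S3:

    # settings.py -- write the crawl output to S3 via feed exports
    FEED_URI = 's3://my-scrapy-output/%(name)s/%(time)s.csv'
    FEED_FORMAT = 'csv'

    # Credentials used by the S3 feed storage backend
    AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY_ID'
    AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'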


Source: https://stackoverflow.com/questions/42284726/input-output-for-scrapyd-instance-hosted-on-an-amazon-ec2-linux-instance
