Should I create pipeline to save files with scrapy?

风流意气都作罢 提交于 2019-11-27 10:52:33

问题


I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in a directories much like they are stored on the site I'm scraping them off.

From what I can gather I need to make a pipeline, but from what I understand pipelines save "Items" and "items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save file in spider instead?


回答1:


Yes and no[1]. If you fetch a pdf it will be stored in memory, but if the pdfs are not big enough to fill up your available memory so it is ok.

You could save the pdf in the spider callback:

def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)

If you choose to do it in a pipeline:

# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines. ie. db store
    return item

[1] another approach could be only store pdfs' urls and use another process to fetch the documents without buffering into memory. (e.g. wget)




回答2:


There is a FilesPipeline that you can use directly, assuming you already have the file url, the link shows how to use FilesPipeline:

https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ




回答3:


It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data(items). Pipelines are postprocessors, but they use same asynchronous infrastructure as spiders so it's perfect for fetching media files.

In your case, you'd first extract location of PDFs in spider, fetch them in pipeline and have another pipeline to save items.



来源:https://stackoverflow.com/questions/7123387/should-i-create-pipeline-to-save-files-with-scrapy

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!