Building a RESTful Flask API for Scrapy

余生分开走 2020-12-09 13:35

The API should accept arbitrary HTTP GET requests containing URLs the user wants scraped, and Flask should return the results of the scrape.

The following code w

2 Answers
  • 2020-12-09 14:08

    I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started and stopped more than once in a single thread.
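    You can see that restriction with a minimal script (nothing Scrapy-specific, just plain Twisted):

        from twisted.internet import reactor

        # Schedule a stop right away so the first run() returns.
        reactor.callLater(0, reactor.stop)
        reactor.run()

        try:
            reactor.run()  # starting the reactor a second time fails
        except Exception as exc:
            print(type(exc).__name__)  # prints: ReactorNotRestartable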

    Let's assume there were no problem with the Twisted reactor and you could start and stop it freely. It still wouldn't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads or processes.

    I think the way to go is to build the API with an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API can keep serving requests while Scrapy is running a spider.

    Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. Tornado is also not hard, because you can make it use the Twisted event loop.
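    Here is a minimal sketch of that idea with Klein; the spider and the /crawl route are illustrative assumptions, not a production design:

        import json

        import scrapy
        from klein import Klein
        from scrapy.crawler import CrawlerRunner

        app = Klein()


        class PageSpider(scrapy.Spider):
            name = "page"

            def __init__(self, url, results, **kwargs):
                super().__init__(**kwargs)
                self.start_urls = [url]
                self.results = results  # shared list collecting scraped items

            def parse(self, response):
                self.results.append(
                    {"url": response.url, "title": response.css("title::text").get()}
                )


        @app.route("/crawl")
        def crawl(request):
            url = request.args[b"url"][0].decode()
            results = []
            runner = CrawlerRunner()
            # Klein accepts a Deferred as a response: the reply is sent when
            # the crawl finishes, while the reactor keeps serving requests.
            deferred = runner.crawl(PageSpider, url=url, results=results)
            deferred.addCallback(lambda _: json.dumps(results))
            return deferred


        app.run("localhost", 8080)

    With this in place, GET /crawl?url=... returns JSON once the spider closes, and concurrent requests are served on the same reactor.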

    There is a project called ScrapyRT which does something very similar to what you want to implement: an HTTP API for Scrapy. ScrapyRT is based on Twisted.
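    For example, with ScrapyRT started inside a Scrapy project (it listens on port 9080 by default), a crawl becomes a single HTTP call; the "quotes" spider name here is a placeholder:

        import requests

        # Ask ScrapyRT to run the "quotes" spider against a given URL and
        # return the scraped items in the JSON response.
        resp = requests.get(
            "http://localhost:9080/crawl.json",
            params={"spider_name": "quotes", "url": "http://quotes.toscrape.com/"},
        )
        print(resp.json()["items"])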

    As an example of Scrapy-Tornado integration, check Arachnado: it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.

    If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go that route, you can just as well use requests + BeautifulSoup.
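    For completeness, here is a minimal sketch of the separate-process idea; the spider and the /scrape route are illustrative assumptions:

        import multiprocessing

        import scrapy
        from flask import Flask, jsonify, request
        from scrapy.crawler import CrawlerProcess

        app = Flask(__name__)


        class PageSpider(scrapy.Spider):
            name = "page"

            def __init__(self, url, results, **kwargs):
                super().__init__(**kwargs)
                self.start_urls = [url]
                self.results = results

            def parse(self, response):
                self.results.append(
                    {"url": response.url, "title": response.css("title::text").get()}
                )


        def run_spider(url, queue):
            # Runs in a child process, so each request gets a fresh reactor.
            results = []
            process = CrawlerProcess(settings={"LOG_ENABLED": False})
            process.crawl(PageSpider, url=url, results=results)
            process.start()  # blocks until the crawl finishes
            queue.put(results)


        @app.route("/scrape")
        def scrape():
            url = request.args["url"]
            queue = multiprocessing.Queue()
            worker = multiprocessing.Process(target=run_spider, args=(url, queue))
            worker.start()
            results = queue.get()  # this Flask worker blocks until the crawl ends
            worker.join()
            return jsonify(results)

    Note that each request still ties up a Flask worker for the whole crawl, which is exactly the inefficiency described above.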

  • 2020-12-09 14:14

    I worked on a similar project last week, an SEO service API; my workflow was like this:

    • The client sends a request to the Flask-based server with a URL to scrape and a callback URL to hit when scraping is done (the client here is another web app).
    • Run Scrapy in the background using Celery. The spider saves the scraped data to the database.
    • The background service notifies the client by calling the callback URL when the spider is done; a sketch of this pipeline follows the list.
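    A minimal sketch of that pipeline, assuming Redis as the Celery broker; the spider name ("seo_spider") and the callback payload are illustrative:

        import subprocess

        import requests
        from celery import Celery
        from flask import Flask, jsonify, request

        celery_app = Celery("tasks", broker="redis://localhost:6379/0")
        flask_app = Flask(__name__)


        @celery_app.task
        def crawl_and_notify(url, callback_url):
            # Run the spider in its own process; its pipeline is assumed
            # to persist the scraped data to the database.
            subprocess.run(
                ["scrapy", "crawl", "seo_spider", "-a", f"url={url}"],
                check=True,
            )
            # Tell the client scraping is done by hitting its callback URL.
            requests.post(callback_url, json={"url": url, "status": "done"})


        @flask_app.route("/scrape", methods=["POST"])
        def scrape():
            payload = request.get_json()
            # Queue the crawl and return immediately; the callback
            # delivers the "done" notification later.
            crawl_and_notify.delay(payload["url"], payload["callback_url"])
            return jsonify({"status": "queued"}), 202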