The API should allow arbitrary HTTP GET requests containing URLs the user wants scraped, and then Flask should return the results of the scrape.
I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started and stopped more than once in a single thread.
Let's assume there is no problem with the Twisted reactor and that you can start and stop it. It still won't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.
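To make both problems concrete, here is roughly what a naive attempt of the kind the question describes looks like (the spider and route names are illustrative). At best it serves one request; a second call to `process.start()` raises `twisted.internet.error.ReactorNotRestartable`, and the whole server blocks while the crawl runs:

```python
from flask import Flask
from scrapy.crawler import CrawlerProcess
import scrapy


class TitleSpider(scrapy.Spider):
    # Illustrative spider: fetches one page and logs its <title>.
    name = "title"
    start_urls = ["http://example.com"]

    def parse(self, response):
        self.logger.info(response.css("title::text").get())


app = Flask(__name__)


@app.route("/scrape")
def scrape():
    process = CrawlerProcess()
    process.crawl(TitleSpider)
    # Blocks until the crawl finishes and stops the reactor; a second request
    # fails with ReactorNotRestartable because the reactor can't start again.
    process.start()
    return "done"


app.run()
```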
I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.
Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is not hard either, because you can make it use the Twisted event loop.
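Since Scrapy already runs on Twisted, a Klein endpoint can return a Deferred that fires when the crawl finishes. Below is a minimal sketch, assuming a toy spider and a /crawl endpoint (both illustrative names); it collects items via the item_scraped signal and returns them as JSON:

```python
import json

import scrapy
from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner


class TitleSpider(scrapy.Spider):
    # Illustrative spider: fetches one page and yields its <title>.
    name = "title"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}


app = Klein()
runner = CrawlerRunner()


@app.route("/crawl")
def crawl(request):
    url = request.args[b"url"][0].decode()
    items = []

    crawler = runner.create_crawler(TitleSpider)
    # Collect scraped items as they are produced.
    crawler.signals.connect(
        lambda item, response, spider: items.append(dict(item)),
        signal=signals.item_scraped,
    )

    d = runner.crawl(crawler, url=url)
    # Klein accepts a Deferred; the HTTP response is sent when the crawl ends.
    d.addCallback(lambda _: json.dumps(items))
    request.setHeader(b"Content-Type", b"application/json")
    return d


app.run("localhost", 8080)  # starts the Twisted reactor
```

Because the crawl runs inside the reactor rather than blocking it, the server keeps serving other requests while a spider is running.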
There is a project called ScrapyRT which does something very similar to what you want to implement: it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
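Assuming you start ScrapyRT inside your Scrapy project (the scrapyrt command listens on port 9080 by default), querying it looks like this; the spider name and URL are placeholders, and per the ScrapyRT docs the JSON response carries the scraped items:

```python
import requests

resp = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "title", "url": "http://example.com"},
)
print(resp.json()["items"])  # scraped items, per the ScrapyRT response format
```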
As an example of a Scrapy-Tornado integration, check Arachnado; it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.
If you really want a Flask-based API then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this route, you can just as well use requests + BeautifulSoup, as in the sketch below.
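For completeness, here is a minimal sketch of that Flask + Celery route, assuming a Redis broker on localhost; the scrape_page task and /scrape endpoint are illustrative names, and the fetching uses requests + BeautifulSoup instead of Scrapy:

```python
import requests
from bs4 import BeautifulSoup
from celery import Celery
from flask import Flask, jsonify, request

celery_app = Celery("tasks", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")
flask_app = Flask(__name__)


@celery_app.task
def scrape_page(url):
    # The blocking fetch + parse happens in a Celery worker, not in Flask.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {"url": url, "title": soup.title.string if soup.title else None}


@flask_app.route("/scrape")
def scrape():
    url = request.args["url"]
    result = scrape_page.delay(url)
    # Return the task id; the client polls for the result (endpoint not shown).
    return jsonify({"task_id": result.id})
```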
I was working on a similar project last week. It's an SEO service API; my workflow was like this: