I want to run my spider from a script rather than with scrapy crawl.
I found this page:
http://doc.scrapy.org/en/latest/topics/practices.html
but it doesn't actually say where to put that script.
Any help, please?
It is a simple and straightforward task!
Just check the official documentation. I would make one small change there, so the spider runs only when you execute python myscript.py and not every time you import from it. Just add an if __name__ == "__main__" guard:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    pass

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
Now save the file as myscript.py and run it with python myscript.py.
Enjoy!
Luckily, the Scrapy source is open, so you can follow the way the crawl command works and do the same in your code (this excerpt is from an older Scrapy version; the internal API has since changed):
...
crawler = self.crawler_process.create_crawler()
spider = crawler.spiders.create(spname, **opts.spargs)
crawler.crawl(spider)
self.crawler_process.start()
You can just create a normal Python script and then use Scrapy's command-line option runspider, which allows you to run a spider without having to create a project.
For example, you can create a single file stackoverflow_spider.py
with something like this:
import scrapy
from scrapy.loader import ItemLoader

class QuestionItem(scrapy.Item):
    idx = scrapy.Field()
    title = scrapy.Field()

class StackoverflowSpider(scrapy.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        # The response can be queried with .css() directly; no separate
        # Selector object is needed.
        questions = response.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            loader = ItemLoader(QuestionItem(), elem)
            loader.add_value('idx', i)
            loader.add_xpath('title', './/h3/a/text()')
            yield loader.load_item()
Then, provided you have Scrapy properly installed, you can run it with:
scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
Why don't you just do this?
from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())
Put that script in the same directory as your scrapy.cfg.