Scrapy throws an error when run using crawlerprocess

Submitted by 故事扮演 on 2020-12-12 05:37:07

Question


I've written a script in Python using Scrapy to collect the names of different posts and their links from a website. When I execute the script from the command line it works flawlessly. Now my intention is to run the script using CrawlerProcess(). I've looked for similar problems in different places, but nowhere could I find a direct solution or anything close to one. However, when I try to run the script as is, I get the following error:

from stackoverflow.items import StackoverflowItem
ModuleNotFoundError: No module named 'stackoverflow'
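The likely cause (a general Python behavior, not something stated in the question): when a file is executed directly, Python puts that file's own directory at the front of sys.path rather than the project root, so a package that lives one or two levels above the script cannot be imported. A minimal way to observe this, assuming the spider sits in stackoverflow/spiders/:

```python
import sys

# When a file is run directly (e.g. `python stackoverflowspider.py`),
# Python prepends that file's directory to the module search path --
# here that would be the spiders/ folder, not the project root that
# contains the `stackoverflow` package, so `import stackoverflow` fails.
print(sys.path[0])  # directory of the script being executed
```

Running this from inside stackoverflow/spiders/ would print that folder, which is why the import in the question cannot resolve the `stackoverflow` package.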

This is my script so far (stackoverflowspider.py):

from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy

class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self,response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(stackoverflowspider)
    c.start()

items.py includes:

import scrapy

class StackoverflowItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

This is the project tree (the hierarchy was shown as an image in the original post).

I know I can succeed with the approach below, but I'm only interested in accomplishing the task the way I tried above:

def parse(self, response):
    sel = Selector(response)
    for link in sel.xpath("//*[@class='question-hyperlink']"):
        name = link.xpath('.//text()').extract_first()
        url = link.xpath('.//@href').extract_first()
        yield {"Name": name, "Link": url}

Answer 1:


Although @Dan-Dev pointed me in the right direction, I decided to provide a complete solution that worked flawlessly for me.

Without changing anything other than what I'm pasting below:

import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy


class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self,response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(stackoverflowspider)
    c.start()

Once again: including the following within the script fixed the problem:

import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
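Hardcoding an absolute Windows path makes the script non-portable. A sketch of a more robust variant (my own suggestion, not part of the answer above) computes the project root relative to the script file itself, assuming the spider lives two directory levels below the folder that contains scrapy.cfg:

```python
import sys
from pathlib import Path


def add_project_root(script_path, levels_up=2):
    """Append to sys.path the directory `levels_up` levels above `script_path`.

    For a spider at <root>/stackoverflow/spiders/stackoverflowspider.py,
    levels_up=2 reaches <root>, the folder containing scrapy.cfg.
    """
    root = Path(script_path).resolve().parents[levels_up]
    if str(root) not in sys.path:
        sys.path.append(str(root))
    return root
```

Inside the spider module this would be called as `add_project_root(__file__)` before the `from stackoverflow.items import StackoverflowItem` line, so the same script works regardless of where the project is checked out.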



Answer 2:


It is a Python path problem. The easiest fix is to set the Python path explicitly when calling the script, i.e. from the directory containing scrapy.cfg (and, more importantly, the stackoverflow package) run:

PYTHONPATH=. python3 stackoverflow/spiders/stackoverflowspider.py

This sets the Python path to include the current directory (.).

For alternatives, see https://www.daveoncode.com/2017/03/07/how-to-solve-python-modulenotfound-no-module-named-import-error/



Source: https://stackoverflow.com/questions/53033791/scrapy-throws-an-error-when-run-using-crawlerprocess
