Question
I've written a script in Python using Scrapy to collect the names of different posts and their links from a website. When I execute the script from the command line it works flawlessly. Now my intention is to run the script using CrawlerProcess(). I looked for similar problems in different places, but nowhere could I find a direct solution or anything close to it. However, when I try to run it as is, I get the following error:
from stackoverflow.items import StackoverflowItem
ModuleNotFoundError: No module named 'stackoverflow'
This is my script so far (stackoverflowspider.py):
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy

class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(stackoverflowspider)
    c.start()
items.py includes:
import scrapy

class StackoverflowItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
This is the tree (link to a screenshot of the directory hierarchy).
I know I can succeed this way, but I am only interested in accomplishing the task the way I tried above:
def parse(self, response):
    for link in sel.xpath("//*[@class='question-hyperlink']"):
        name = link.xpath('.//text()').extract_first()
        url = link.xpath('.//@href').extract_first()
        yield {"Name": name, "Link": url}
Answer 1:
Although @Dan-Dev pointed me in the right direction, I decided to provide a complete solution that worked for me flawlessly.
I changed nothing anywhere other than what I'm pasting below:
import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')

from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy

class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(stackoverflowspider)
    c.start()
Once again, including the following within the script fixed the problem:
import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
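As a side note, the hard-coded absolute path only works on that one machine. A more portable sketch (my own variant, not from the answer) computes the project root relative to the script file itself, assuming the spider lives in stackoverflow/spiders/, two levels below the folder containing scrapy.cfg:

```python
import os
import sys

# Assumption: this file sits in <project_root>/stackoverflow/spiders/,
# so climbing up two directory levels reaches the folder that
# contains scrapy.cfg. Appending that folder to sys.path makes
# "from stackoverflow.items import StackoverflowItem" resolvable.
project_root = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(project_root)
```

With this in place the script can be launched from any working directory, since the path is derived from the file location rather than typed in by hand.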
Answer 2:
It is a Python path problem. The easiest fix is to set the Python path explicitly when calling the script, i.e. from the directory containing scrapy.cfg (and, more importantly, the stackoverflow module) run:
PYTHONPATH=. python3 stackoverflow/spiders/stackoverflowspider.py
This sets the Python path to include the current directory (.).
For alternatives see https://www.daveoncode.com/2017/03/07/how-to-solve-python-modulenotfound-no-module-named-import-error/
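Both fixes rely on the same mechanism: a package is importable only if its parent directory is on sys.path (which PYTHONPATH and sys.path.append both extend). A minimal self-contained demonstration, using a throwaway package created in a temp directory rather than the real project:

```python
import importlib
import os
import sys
import tempfile

# Create a dummy package "demopkg" (hypothetical name) in a temp folder.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demopkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# Before the parent directory is on sys.path, the import would fail
# with the same ModuleNotFoundError seen in the question.
sys.path.append(root)  # equivalent to PYTHONPATH=<root> or sys.path.append(...)
mod = importlib.import_module("demopkg")
print(mod.VALUE)  # -> 42
```

The same lookup is what makes `from stackoverflow.items import StackoverflowItem` work once the project root is on the path.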
Source: https://stackoverflow.com/questions/53033791/scrapy-throws-an-error-when-run-using-crawlerprocess