Question
I am trying to create a spider that fetches all the URLs from one domain and builds a record of the domain name together with all the headers found across those URLs. This is a continuation of a previous question.
With the help I got there, I understand that I need to use an item pipeline in the Scrapy framework to achieve this. In the pipeline I create a dict where the domain name is the key and I append all the headers to its value.
The error I receive is: unhashable type: 'list'
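For reference, the error can be reproduced in isolation (the domain string below is just a placeholder): any attempt to use a list as a dict key raises this exact TypeError, because lists are mutable and therefore unhashable:

```python
from collections import defaultdict

accumulator = defaultdict(list)
key = ['web.aitp.se']  # a list, like the value an ItemLoader stores for a field

try:
    accumulator[key].append('Some header')
except TypeError as e:
    print(e)  # unhashable type: 'list'
```

Since hashing the key fails before anything is inserted, the dict is left untouched.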
spider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from Prospecting.items import WebsiteItem

class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = ['web.aitp.se']
    start_urls = ['http://web.aitp.se/']

    rules = (
        # Follow every link on the site and parse each response with parse_item.
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        domain = response.url.split("/")[2]
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_value('domain', domain)
        loader.add_xpath('h1', "//h1/text()")
        yield loader.load_item()
pipelines.py
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.http import Request
from Prospecting.items import WebsiteItem
from collections import defaultdict
class DomainPipeline(object):

    global Accumulator
    Accumulator = defaultdict(list)

    def process_item(self, item, spider):
        Accumulator[ item['domain'] ].append( item['h1'] )

    def close_spider(spider):
        yield Accumulator.items()
I tried to break the problem down by just reading domains and headers from a CSV file and merging them into one record, and this works fine:
from collections import defaultdict
Accumulator = defaultdict(list)
companies = open('test.csv', 'r')
for line in companies:
    fields = line.split(',')
    Accumulator[fields[0]].append(fields[1])

print Accumulator.items()
Answer 1:
In Python, a list cannot be used as a key in a dict. Dict keys need to be hashable, which in practice means they need to be immutable. In your pipeline, item['domain'] is a list rather than a string, because an ItemLoader collects each field's values into a list by default; that is why the dict lookup fails.
So wherever you are using a list as a dict key, convert it to a tuple first: tuple(mylist) is enough to make the value hashable.
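Applied to the pipeline above, a minimal sketch (assuming the loader stored each field as a list, its default behaviour; the item dict here is simulated) would either unwrap the single value or convert the list to a hashable tuple before using it as a key:

```python
from collections import defaultdict

accumulator = defaultdict(list)

# Simulated item, standing in for what the ItemLoader would produce.
item = {'domain': ['web.aitp.se'], 'h1': ['Welcome']}

# Option 1: unwrap the single value and use the plain string as the key.
accumulator[item['domain'][0]].append(item['h1'])

# Option 2: convert the list to an immutable, hashable tuple.
accumulator[tuple(item['domain'])].append(item['h1'])

print(accumulator['web.aitp.se'])     # [['Welcome']]
print(accumulator[('web.aitp.se',)])  # [['Welcome']]
```

Note also two unrelated issues in the pipeline as posted: process_item should return the item (or raise DropItem) so it continues through the pipeline, and close_spider is conventionally defined as close_spider(self, spider) and should not yield.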
Source: https://stackoverflow.com/questions/21353438/scrapy-pipeline-unhashable-type-list