Question: how do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths under the img src attribute?
I would use an Item Pipeline to deal with the duplicated items:
# file: yourproject/pipelines.py
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        # URLs already emitted during this crawl
        self.url_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.url_seen:
            # URL already seen: discard this item
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.url_seen.add(item['url'])
            return item
And add this pipeline to your settings.py:
# file: yourproject/settings.py
ITEM_PIPELINES = {
    'yourproject.pipelines.DuplicatesPipeline': 300,
}
Then you just need to run your spider with scrapy crawl relpathfinder -o items.csv
and the pipeline will drop the duplicate items for you, so you will not see any duplicates in your CSV output.
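For reference, both the pipeline above and the spider code below assume an item class with a single url field. A minimal sketch of such an item (the MyItem name comes from the spider code; the rest is an assumption about your project layout) could be:
# file: yourproject/items.py
import scrapy

class MyItem(scrapy.Item):
    # one absolute image URL per item (illustrative definition)
    url = scrapy.Field()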
What about:
def url_join(self, response):
    # yield one item per absolute image URL
    for link in response.xpath('//img/@src').extract():
        item = MyItem()
        item['url'] = response.urljoin(link)
        yield item
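Yielding a fresh item for each absolute URL pairs nicely with the DuplicatesPipeline above: every repeated URL becomes its own item and is dropped before it reaches the CSV output.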