how to filter duplicate requests based on url in scrapy

前端 未结 5 930
挽巷
挽巷 2020-11-29 18:40

I am writing a crawler for a website using scrapy with CrawlSpider.

Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on ur

相关标签:
5条回答
  • 2020-11-29 18:47

    In the latest scrapy, we can use the default duplication filter or extend and have custom one.

    define the below config in spider settings

    DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'

    0 讨论(0)
  • 2020-11-29 18:48

    Following ytomar's lead, I wrote this filter that filters based purely on URLs that have already been seen by checking an in-memory set. I'm a Python noob so let me know if I screwed something up, but it seems to work all right:

    from scrapy.dupefilter import RFPDupeFilter
    
    class SeenURLFilter(RFPDupeFilter):
        """A dupe filter that considers the URL"""
    
        def __init__(self, path=None):
            self.urls_seen = set()
            RFPDupeFilter.__init__(self, path)
    
        def request_seen(self, request):
            if request.url in self.urls_seen:
                return True
            else:
                self.urls_seen.add(request.url)
    

    As ytomar mentioned, be sure to add the DUPEFILTER_CLASS constant to settings.py:

    DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
    
    0 讨论(0)
  • 2020-11-29 18:54

    https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py

    This file might help you. This file creates a database of unique delta fetch key from the url ,a user pass in a scrapy.Reqeust(meta={'deltafetch_key':uniqe_url_key}). This this let you avoid duplicate requests you already have visited in the past.

    A sample mongodb implementation using deltafetch.py

            if isinstance(r, Request):
                key = self._get_key(r)
                key = key+spider.name
    
                if self.db['your_collection_to_store_deltafetch_key'].find_one({"_id":key}):
                    spider.log("Ignoring already visited: %s" % r, level=log.INFO)
                    continue
            elif isinstance(r, BaseItem):
    
                key = self._get_key(response.request)
                key = key+spider.name
                try:
                    self.db['your_collection_to_store_deltafetch_key'].insert({"_id":key,"time":datetime.now()})
                except:
                    spider.log("Ignoring already visited: %s" % key, level=log.ERROR)
            yield r
    

    eg. id = 345 scrapy.Request(url,meta={deltafetch_key:345},callback=parse)

    0 讨论(0)
  • 2020-11-29 18:56

    Here is my custom filter base on scrapy 0.24.6.

    In this filter, it only cares id in the url. for example

    http://www.example.com/products/cat1/1000.html?p=1 http://www.example.com/products/cat2/1000.html?p=2

    are treated as same url. But

    http://www.example.com/products/cat2/all.html

    will not.

    import re
    import os
    from scrapy.dupefilter import RFPDupeFilter
    
    
    class MyCustomURLFilter(RFPDupeFilter):
    
        def _get_id(self, url):
            m = re.search(r'(\d+)\.html', url)
            return None if m is None else m.group(1)
    
        def request_fingerprint(self, request):
            style_id = self._get_id(request.url)
            return style_id
    
    0 讨论(0)
  • 2020-11-29 19:01

    You can write custom middleware for duplicate removal and add it in settings

    import os
    
    from scrapy.dupefilter import RFPDupeFilter
    
    class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""
    
        def __getid(self, url):
            mm = url.split("&refer")[0] #or something like that
            return mm
    
        def request_seen(self, request):
            fp = self.__getid(request.url)
            if fp in self.fingerprints:
                return True
            self.fingerprints.add(fp)
            if self.file:
                self.file.write(fp + os.linesep)
    

    Then you need to set the correct DUPFILTER_CLASS in settings.py

    DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'
    

    It should work after that

    0 讨论(0)
提交回复
热议问题