How to check if a task is already in a Python Queue?

青春惊慌失措 2020-12-05 15:04

I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue; when a certain thread has finished proc

12 answers
  • 2020-12-05 15:51

    Instead of an "array of pages already visited", keep an "array of pages already added to the queue".

  • 2020-12-05 15:53

    The put method also needs to be overridden; otherwise a join call will block forever: https://github.com/python/cpython/blob/master/Lib/queue.py#L147

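    from queue import Queue

    class UniqueQueue(Queue):
        """Queue that drops items which are already waiting in it."""

        def put(self, item, block=True, timeout=None):
            # Skip duplicates here rather than in _put: Queue.put always
            # increments unfinished_tasks, so an item dropped later would
            # leave join() waiting forever.
            # Note: this check runs without the queue lock, so two threads
            # putting the same item at once can still race.
            if item not in self.queue:
                Queue.put(self, item, block, timeout)

        def _init(self, maxsize):
            self.queue = set()  # replaces the default deque; FIFO order is lost

        def _put(self, item):
            self.queue.add(item)

        def _get(self):
            return self.queue.pop()  # removes and returns an arbitrary item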
    
  • 2020-12-05 15:56

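    Also, instead of a list you might try using a set or a dictionary. Membership tests on lists scan every element and get rather slow when the list is big, whereas a set or dictionary lookup is a hash lookup and stays nice and quick. (In CPython, sets and dictionaries are both hash tables, so either works for this purpose.)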

    My 2c.

  • 2020-12-05 15:57

    Use:

    url in q.queue

    which returns True if and only if url is currently in the queue.
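
    A hedged note on this check: `q.queue` is the `deque` that `queue.Queue` uses internally, so the test is a linear scan, and in a threaded crawler another thread can modify the queue while you are looking. A minimal sketch that holds the queue's own `q.mutex` during the scan follows; the helper name `is_queued` and the URLs are made up for illustration.

```python
from queue import Queue

def is_queued(q, item):
    """Return True if item is currently waiting in q (linear scan)."""
    # q.mutex is the lock Queue itself uses to guard q.queue (a deque).
    with q.mutex:
        return item in q.queue

q = Queue()
q.put("http://example.com/a")
print(is_queued(q, "http://example.com/a"))  # True
print(is_queued(q, "http://example.com/b"))  # False
q.get()
print(is_queued(q, "http://example.com/a"))  # False: already taken off the queue
```

    Even with the lock, "check, then put" is still two separate steps, and an item that has already been taken off the queue is no longer found, so this alone does not stop a page from being queued twice.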

  • 2020-12-05 15:58

    What follows is an improvement over Lukáš Lalinský's latter solution. The important difference is that put is overridden in order to ensure unfinished_tasks is accurate and join works as expected.

    from queue import Queue
    
    class UniqueQueue(Queue):
    
        def _init(self, maxsize):
            self.all_items = set()
            Queue._init(self, maxsize)
    
        def put(self, item, block=True, timeout=None):
            if item not in self.all_items:
                self.all_items.add(item)
                Queue.put(self, item, block, timeout)
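
    A short usage sketch of the class above, repeated here so the snippet runs on its own. The URLs are placeholders. It illustrates that a duplicate is never counted as an unfinished task, so `join` returns, and that an item stays blocked even after it has been taken off the queue.

```python
from queue import Queue

class UniqueQueue(Queue):
    def _init(self, maxsize):
        self.all_items = set()          # every item ever accepted
        Queue._init(self, maxsize)

    def put(self, item, block=True, timeout=None):
        if item not in self.all_items:
            self.all_items.add(item)
            Queue.put(self, item, block, timeout)

q = UniqueQueue()
for url in ["/a", "/b", "/a"]:          # "/a" is offered twice
    q.put(url)
print(q.qsize())                        # 2

while not q.empty():
    q.get()
    q.task_done()
q.join()                                # returns: only two tasks were ever counted

q.put("/a")                             # already seen, so ignored
print(q.qsize())                        # 0
```

    One caveat, stated as an observation rather than a fix: the membership test and the `add` in `put` are not done under a lock, so two threads offering the same new item at the same moment could in principle both get through.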
    
  • 2020-12-05 15:59

    Why only use the array (ideally, a dictionary would be even better) to filter things you've already visited? Add things to your array/dictionary as soon as you queue them up, and only add them to the queue if they're not already in the array/dict. Then you have 3 simple separate things:

    1. Links not yet seen (neither in queue nor array/dict)
    2. Links scheduled to be visited (in both queue and array/dict)
    3. Links already visited (in array/dict, not in queue)
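
    The three states above can be sketched as follows. This is only an illustration under stated assumptions: the `LINKS` dictionary is a made-up link graph standing in for real page fetches, and `schedule`, `worker`, `seen` and `todo` are hypothetical names, not part of any library.

```python
import threading
from queue import Queue

# Hypothetical link graph standing in for real page fetches.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],
    "/b": ["/a", "/c"],
    "/c": [],
}

seen = set()                  # every link ever queued (states 2 and 3)
seen_lock = threading.Lock()  # makes "check, then add" one atomic step
todo = Queue()                # links scheduled but not yet visited (state 2)
visited = []                  # pages that have been processed

def schedule(url):
    """Queue url only if it has never been queued before."""
    with seen_lock:
        if url in seen:
            return False
        seen.add(url)
    todo.put(url)
    return True

def worker():
    while True:
        url = todo.get()
        if url is None:       # sentinel: shut this worker down
            todo.task_done()
            return
        visited.append(url)
        for link in LINKS.get(url, []):
            schedule(link)    # new links are queued before task_done()
        todo.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
schedule("/")
todo.join()                   # every scheduled page has been processed
for _ in threads:
    todo.put(None)
for t in threads:
    t.join()
print(sorted(visited))        # ['/', '/a', '/b', '/c']
```

    Because a link is added to `seen` at the moment it is queued, no page is processed twice even though several pages link to each other.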