How to check if a task is already in a Python Queue?

I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue, when a certain thread has finished proc

12 Answers

孤独总比滥情好

2020-12-05 15:51

Instead of an "array of pages already visited", keep an "array of pages already added to the queue".
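A minimal sketch of that idea, assuming a single producer (with several producer threads the check-and-add step would need a lock). The names `enqueued` and `enqueue_once` are made up for illustration, and a set stands in for the "array" so that lookups stay fast.

```python
from queue import Queue

q = Queue()
enqueued = set()  # every URL that has ever been put on the queue


def enqueue_once(url):
    """Queue url unless it was queued before; return True if it was added."""
    if url in enqueued:
        return False
    enqueued.add(url)
    q.put(url)
    return True


enqueue_once("http://example.com/a")
enqueue_once("http://example.com/a")  # ignored, already queued once
```

Because a URL stays in `enqueued` after a worker has taken it off the queue, a page that was already fetched is never scheduled again.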

一生所求

2020-12-05 15:53

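The put method also needs to be overridden; otherwise a join call will block forever, because Queue.put counts every call as an unfinished task even when the set drops the duplicate. See https://github.com/python/cpython/blob/master/Lib/queue.py#L147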

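from queue import Queue

class UniqueQueue(Queue):

    def put(self, item, block=True, timeout=None):
        # Skip duplicates here, before Queue.put() increments
        # unfinished_tasks; this is the fix for the join bug.
        if item not in self.queue:
            Queue.put(self, item, block, timeout)

    def _init(self, maxsize):
        # Pending items live in a set instead of the default deque.
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        # set.pop() removes an arbitrary item, so FIFO order is lost.
        return self.queue.pop()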
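To see why the put override matters, here is a small sketch of the same set-backed queue without it. `NaiveSetQueue` is a hypothetical name used only for this illustration; `unfinished_tasks` is the counter that CPython's Queue keeps for join.

```python
from queue import Queue


class NaiveSetQueue(Queue):
    """Set-backed queue WITHOUT the put() override (illustration only)."""

    def _init(self, maxsize):
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()


q = NaiveSetQueue()
q.put("a")
q.put("a")  # the set drops the duplicate, but the call is still counted

pending = q.qsize()           # items actually waiting in the queue
counted = q.unfinished_tasks  # tasks that join() will wait for
```

Here `pending` is 1 while `counted` is 2, so after one get and one task_done the counter is still above zero and join would never return.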

执念已碎

2020-12-05 15:56

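Also, whichever container you use to remember URLs, make sure it is hash-based. In CPython both set and dict are hash tables, so a membership test takes roughly constant time on average even when they are big; it is a list that gets slow as it grows, because every lookup has to scan it.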

My 2c.

暖寄归人

2020-12-05 15:57
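Use:
```
url in q.queue
```
which returns True if and only if url is currently waiting in the queue. Note that q.queue is the underlying deque, so this is a linear scan, and it says nothing about items that have already been taken off the queue.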
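If other threads are putting and getting at the same time, it is probably safer to hold the queue's own lock while scanning. A sketch, assuming CPython's `queue.Queue`, where `q.mutex` and `q.queue` are attributes of the implementation rather than documented API; `is_queued` is a hypothetical helper name.

```python
from queue import Queue

q = Queue()
q.put("http://example.com/a")


def is_queued(q, item):
    """Return True if item is currently waiting in q."""
    # Hold the queue's lock so no put/get changes the deque mid-scan.
    with q.mutex:
        return item in q.queue
```

The answer can be stale as soon as the lock is released, so this is a check for diagnostics, not a reliable way to prevent duplicates.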

深忆病人

2020-12-05 15:58

What follows is an improvement over Lukáš Lalinský's latter solution. The important difference is that put is overridden in order to ensure unfinished_tasks is accurate and join works as expected.

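from queue import Queue

class UniqueQueue(Queue):

    def _init(self, maxsize):
        # Every item ever queued, kept even after it has been dequeued.
        self.all_items = set()
        Queue._init(self, maxsize)

    def put(self, item, block=True, timeout=None):
        # Only new items reach Queue.put(), so unfinished_tasks is
        # incremented exactly once per item that is actually queued.
        if item not in self.all_items:
            self.all_items.add(item)
            Queue.put(self, item, block, timeout)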
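A usage sketch with one worker thread, to show that join returns once every distinct item has been processed. This variant guards the check-and-add step with the queue's mutex, which is my own assumption about how to keep two producers from queueing the same item; the short strings stand in for URLs.

```python
import threading
from queue import Queue


class UniqueQueue(Queue):

    def _init(self, maxsize):
        self.all_items = set()
        Queue._init(self, maxsize)

    def put(self, item, block=True, timeout=None):
        # Assumption: check and add under the queue's own lock so that
        # two producers cannot both pass the membership test.
        with self.mutex:
            if item in self.all_items:
                return
            self.all_items.add(item)
        # The lock is released here; Queue.put() acquires it again itself.
        Queue.put(self, item, block, timeout)


q = UniqueQueue()
for url in ["a", "b", "a", "c", "b"]:
    q.put(url)

processed = []


def worker():
    while True:
        url = q.get()
        processed.append(url)  # a real crawler would fetch the page here
        q.task_done()


threading.Thread(target=worker, daemon=True).start()
q.join()  # returns once "a", "b" and "c" have each been marked done
```

One thing to watch for: a sentinel such as None used to stop workers would be deduplicated too, which is why the worker here is a daemon thread instead.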

故里飘歌

2020-12-05 15:59
Why only use the array (ideally, a dictionary would be even better) to filter things you've already visited? Add things to your array/dictionary as soon as you queue them up, and only add them to the queue if they're not already in the array/dict. Then you have 3 simple separate things:
1. Links not yet seen (neither in queue nor array/dict)
2. Links scheduled to be visited (in both queue and array/dict)
3. Links already visited (in array/dict, not in queue)
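The three cases above can be sketched roughly like this. The names `seen`, `schedule` and `state` are hypothetical, a set plays the role of the array/dict, and the lookup into `q.queue` relies on CPython implementation attributes, so treat it as an illustration rather than a recipe.

```python
from queue import Queue

q = Queue()
seen = set()  # every link ever queued: scheduled or already visited


def schedule(url):
    """Queue url only if it has never been queued before."""
    if url not in seen:
        seen.add(url)
        q.put(url)


def state(url):
    """Classify url into one of the three cases."""
    if url not in seen:
        return "not yet seen"
    with q.mutex:  # lock while scanning the underlying deque
        scheduled = url in q.queue
    return "scheduled" if scheduled else "visited"


schedule("http://example.com/a")
schedule("http://example.com/b")
first = q.get()  # a worker takes the first link off the queue to visit it
```

In practice the crawler only ever needs the `url not in seen` test inside `schedule`; the `state` function is there just to make the three cases explicit.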