How to check if a task is already in a Python Queue?


I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue, when a certain thread has finished proc


12 Answers

陌清茗

2020-12-05 15:35
Sadly, I don't have enough reputation to comment on Lukáš Lalinský’s answer, which is the best one here.

To add support for SetQueue.task_done() and SetQueue.join() to the second variant of Lukáš Lalinský’s SetQueue, add an else branch to the if:
```
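def _put(self, item):
    if item not in self.all_items:
        Queue._put(self, item)
        self.all_items.add(item)
    else:
        # put() increments unfinished_tasks after _put() returns,
        # so cancel that out for a dropped duplicate
        self.unfinished_tasks -= 1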
```
Tested and works with Python 3.4.
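
For anyone who wants to try it, here is a minimal self-contained sketch of the whole class with that else branch. It assumes Python 3's `queue` module and relies on a CPython implementation detail: `Queue.put()` adds 1 to `unfinished_tasks` right after `_put()` returns. The URLs and the `seen` list are only for illustration.

```python
from queue import Queue


class SetQueue(Queue):
    """Queue that silently drops items it has already accepted."""

    def _init(self, maxsize):
        super()._init(maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            super()._put(item)
            self.all_items.add(item)
        else:
            # put() will increment unfinished_tasks after we return;
            # pre-compensate so join() does not wait for a dropped duplicate.
            self.unfinished_tasks -= 1


q = SetQueue()
for url in ["a", "b", "a"]:  # the second "a" is a duplicate
    q.put(url)

seen = []
while not q.empty():
    seen.append(q.get())
    q.task_done()

q.join()  # does not block: every accepted item was marked done
```

Without the else branch, the duplicate `put("a")` would leave `unfinished_tasks` at 1 after the loop and `join()` would block forever.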
闹比i

2020-12-05 15:36

The way I solved this (actually I did this in Scala, not Python) was to use both a Set and a Queue, only adding links to the queue (and set) if they did not already exist in the set.

Both the set and queue were encapsulated in a single thread, exposing only a queue-like interface to the consumer threads.
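
In Python, a rough sketch of the same idea could look like the following. Note that it swaps the single owning thread for a `threading.Lock`, which is my substitution and not what the Scala version did; `UniqueLinkQueue` and its method names are hypothetical.

```python
import queue
import threading


class UniqueLinkQueue:
    """Hypothetical wrapper: a 'seen' set plus a queue behind one interface."""

    def __init__(self):
        self._queue = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()

    def put(self, link):
        """Enqueue link unless it was seen before; return True if enqueued."""
        with self._lock:
            if link in self._seen:
                return False
            self._seen.add(link)
        self._queue.put(link)
        return True

    def get(self, timeout=None):
        """Block until a link is available (or raise queue.Empty on timeout)."""
        return self._queue.get(timeout=timeout)


links = UniqueLinkQueue()
results = [links.put(u) for u in ["/a", "/b", "/a"]]  # third call is a duplicate
first = links.get(timeout=1)
```

Consumer threads only ever call `put()` and `get()`, so the set stays an internal detail and could later be replaced by something disk-backed.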

Edit: someone else suggested SQLite and that is also something I am considering, if the set of visited URLs needs to grow large. (Currently each crawl is only a few hundred pages so it easily fits in memory.) But the database is something that can also be encapsulated within the set itself, so the consumer threads need not be aware of it.

没有蜡笔的小新

2020-12-05 15:38

SQLite is so simple to use and would fit perfectly... just a suggestion.
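
To give an idea of what that might look like, here is a small sketch using the standard `sqlite3` module. The table name `visited` and the helper `mark_if_new` are made up for illustration, and it uses an in-memory database from a single thread (by default a `sqlite3` connection may only be used in the thread that created it, so a real crawler would need one connection per thread or a dedicated DB thread).

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path would persist the set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visited (url TEXT PRIMARY KEY)")


def mark_if_new(url):
    """Record url; return True only the first time it is seen."""
    # INSERT OR IGNORE skips the row when the primary key already exists,
    # in which case rowcount is 0.
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


first = mark_if_new("http://example.com/")
second = mark_if_new("http://example.com/")
```

A worker would then only put a link on the queue when `mark_if_new(link)` returns True.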


再見小時候

2020-12-05 15:41

I agree with @Ben James. Try using both a deque and a set.

Here is the code:

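```
from collections import deque
from queue import Queue  # Python 2: from Queue import Queue

class SetUniqueQueue(Queue):

    def _init(self, maxsize):
        self.queue = deque()
        self.setqueue = set()  # every item ever put, used for duplicate checks

    def _put(self, item):
        if item not in self.setqueue:
            self.setqueue.add(item)
            self.queue.append(item)

    def _get(self):
        return self.queue.popleft()
```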


不要未来只要你来

2020-12-05 15:42

This is the full version of SetQueue:

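```
import Queue  # Python 2 module name; on Python 3 use: import queue as Queue

class SetQueue(Queue.Queue):
    def _init(self, maxsize):
        Queue.Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        # only enqueue items that are not currently waiting in the queue
        if item not in self.all_items:
            Queue.Queue._put(self, item)
            self.all_items.add(item)

    def _get(self):
        item = Queue.Queue._get(self)
        # forget the item once it leaves the queue, so it can be queued again
        self.all_items.remove(item)
        return item
```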
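
One thing worth noticing about this variant: because `_get()` removes the item from the set, it only rejects items that are in the queue right now, not items that were processed earlier. A quick sketch of that behaviour, written for Python 3 (the `"page"` item is just a placeholder):

```python
from queue import Queue


class SetQueue(Queue):
    def _init(self, maxsize):
        super()._init(maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            super()._put(item)
            self.all_items.add(item)

    def _get(self):
        item = super()._get()
        self.all_items.remove(item)  # the item may be queued again later
        return item


q = SetQueue()
q.put("page")
q.put("page")                     # dropped: "page" is still waiting
size_after_duplicate = q.qsize()

q.get()                           # "page" leaves the queue and the set
q.put("page")                     # accepted again
size_after_requeue = q.qsize()
```

If a crawler should never revisit a page, the versions that keep the item in the set after `get()` are probably the better fit.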


执笔经年

2020-12-05 15:44
If you don't care about the order in which items are processed, I'd try a subclass of Queue that uses set internally:
```
from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        self.maxsize = maxsize
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()
```
As Paul McGuire pointed out, this would allow adding a duplicate item after it's been removed from the "to-be-processed" set and not yet added to the "processed" set. To solve this, you can store both sets in the Queue instance, but since you are then using the larger set to check whether an item has been processed, you can just as well go back to a regular queue, which will order requests properly.
```
from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()  # every item ever accepted

    def _put(self, item):
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)
```
The advantage of this, as opposed to using a set separately, is that the Queue's methods are thread-safe, so that you don't need additional locking for checking the other set.
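
A small sketch of that point, assuming Python 3: several threads put overlapping items with no extra locking, and each item still ends up in the queue once, because `put()` holds the queue's internal lock while `_put()` runs. The `producer` function and the thread count are arbitrary choices for the demo.

```python
import threading
from queue import Queue


class SetQueue(Queue):
    def _init(self, maxsize):
        super()._init(maxsize)
        self.all_items = set()

    def _put(self, item):
        # Runs while put() holds the queue's lock, so check-then-add is atomic.
        if item not in self.all_items:
            super()._put(item)
            self.all_items.add(item)


q = SetQueue()


def producer():
    # Every thread tries to enqueue the same 100 items.
    for n in range(100):
        q.put(n)


threads = [threading.Thread(target=producer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

size = q.qsize()  # 100, not 400
```

Keep in mind that this version, without the else branch from the other answer, still counts dropped duplicates in `unfinished_tasks`, so `q.join()` should not be used with it as-is.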