I\'m writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue, when a certain thread has finished proc
instead of "array of pages already visited" make an "array of pages already added to the queue"
The put method also needs to be overwritten, if not a join call will block forever https://github.com/python/cpython/blob/master/Lib/queue.py#L147
class UniqueQueue(Queue):
def put(self, item, block=True, timeout=None):
if item not in self.queue: # fix join bug
Queue.put(self, item, block, timeout)
def _init(self, maxsize):
self.queue = set()
def _put(self, item):
self.queue.add(item)
def _get(self):
return self.queue.pop()
Also, instead of a set you might try using a dictionary. Operations on sets tend to get rather slow when they're big, whereas a dictionary lookup is nice and quick.
My 2c.
use:
url in q.queue
which returns True iff url
is in the queue
What follows is an improvement over Lukáš Lalinský's latter solution.
The important difference is that put
is overridden in order to ensure unfinished_tasks
is accurate and join
works as expected.
from queue import Queue
class UniqueQueue(Queue):
def _init(self, maxsize):
self.all_items = set()
Queue._init(self, maxsize)
def put(self, item, block=True, timeout=None):
if item not in self.all_items:
self.all_items.add(item)
Queue.put(self, item, block, timeout)
Why only use the array (ideally, a dictionary would be even better) to filter things you've already visited? Add things to your array/dictionary as soon as you queue them up, and only add them to the queue if they're not already in the array/dict. Then you have 3 simple separate things: