Question
My problem is: I want to extract all valuable text from some domain, for example www.example.com. So I go to this website, visit all the links up to a maximal depth of 2, and write the text to a CSV file.
I wrote a module in Scrapy which solves this problem using one process and yielding multiple crawlers, but it is inefficient: I am able to crawl ~1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC running for some time I also found that my network connection had broken.
When I tried to use multiple processes I just got an error from Twisted: Multiprocessing of Scrapy Spiders in Parallel Processes. So this means I would have to learn Twisted, which I would call deprecated compared to asyncio, but that is only my opinion.
So I have a couple of ideas about what to do:
- Fight back, try to learn Twisted, and implement multiprocessing with a distributed queue backed by Redis, but I don't feel that Scrapy is the right tool for this type of job.
- Go with pyspider, which has all the features I need (I've never used it).
- Go with Nutch, which is very complex (I've never used it).
- Try to build my own distributed crawler, but after crawling 4 websites I have already found 4 edge cases: SSL, duplication, timeouts. On the other hand, it would be easy to add modifications like focused crawling.
What solution do you recommend?
Edit 1: Sharing the code
import html2text
import langdetect

# OrderedSet, regex (a pre-compiled pattern) and comprehension_helper are
# defined elsewhere in the project.

class ESIndexingPipeline(object):
    def __init__(self):
        # self.text = set()
        self.extracted_type = []
        self.text = OrderedSet()
        self.h = html2text.HTML2Text()
        self.h.ignore_links = True
        self.h.images_to_alt = True

    def process_item(self, item, spider):
        body = item['body']
        # convert the raw HTML body to plain text and split it into lines
        body = self.h.handle(str(body, 'utf8')).split('\n')
        first_line = True
        for piece in body:
            piece = piece.strip(' \n\t\r')
            if len(piece) == 0:
                # a blank line ends the current paragraph
                first_line = True
            else:
                e = ''
                # glue continuation lines onto the previous piece unless the
                # pattern marks this line as the start of a new block
                if not self.text.empty() and not first_line and not regex.match(piece):
                    e = self.text.pop() + ' '
                e += piece
                self.text.add(e)
                first_line = False
        return item

    def open_spider(self, spider):
        self.target_id = spider.target_id
        self.queue = spider.queue

    def close_spider(self, spider):
        # keep only the pieces detected as English
        self.text = [e for e in self.text if comprehension_helper(langdetect.detect, e) == 'en']
        if spider.write_to_file:
            self._write_to_file(spider)

    def _write_to_file(self, spider):
        concat = "\n".join(self.text)
        self.queue.put([self.target_id, concat])
And the call:
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from twisted.internet import defer, reactor

# TextExtractionSpider and DEFAULT_SPIDER_SETTINGS are defined elsewhere in the project.

def execute_crawler_process(targets, write_to_file=True, settings=None, parallel=800, queue=None):
    if settings is None:
        settings = DEFAULT_SPIDER_SETTINGS

    # causes the batches of runners to work sequentially
    @defer.inlineCallbacks
    def crawl(runner):
        n_crawlers_batch = 0
        done = 0
        n = float(len(targets))
        for url in targets:
            # print("target: ", url)
            n_crawlers_batch += 1
            runner.crawl(
                TextExtractionSpider,
                url=url,
                target_id=url,
                write_to_file=write_to_file,
                queue=queue)
            if n_crawlers_batch == parallel:
                print('joining')
                done += n_crawlers_batch
                n_crawlers_batch = 0
                d = runner.join()
                # todo: print progress before yielding
                yield d  # wait until the whole batch has finished downloading
        if n_crawlers_batch < parallel:
            # join the last, incomplete batch
            done += n_crawlers_batch
            d = runner.join()
            yield d
        reactor.stop()

    def f():
        runner = CrawlerProcess(settings)
        crawl(runner)
        reactor.run()

    # run the reactor in a child process so each call gets a fresh reactor
    p = Process(target=f)
    p.start()
The spider itself is not particularly interesting.
Answer 1:
You can use Scrapy-Redis. It is basically a Scrapy spider that fetches the URLs to crawl from a queue in Redis. The advantage is that you can start many concurrent spiders, so you can crawl faster. All the spider instances pull URLs from the queue and sit idle when they run out of URLs to crawl. The Scrapy-Redis repository comes with an example project that implements this.
I use Scrapy-Redis to fire up 64 instances of my crawler to scrape 1 million URLs in around 1 hour.
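A minimal sketch of what such a spider could look like, assuming the standard scrapy-redis settings; the spider name, Redis key, and parse logic below are just placeholders, and the pipeline from the question would still do the actual text extraction:

# settings.py -- route scheduling and deduplication through Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the queue between runs
REDIS_URL = "redis://localhost:6379"  # point this at your Redis instance
DEPTH_LIMIT = 2                       # plain Scrapy setting: follow links two levels deep

# spiders/text_spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisSpider

class TextSpider(RedisSpider):
    name = 'text_spider'
    redis_key = 'text_spider:start_urls'  # each instance pops start URLs from this Redis list

    def parse(self, response):
        # hand the raw body to the item pipeline (same item shape as in the question)
        yield {'body': response.body}
        # follow in-page links; DEPTH_LIMIT stops the crawl at depth 2
        for link in LinkExtractor().extract_links(response):
            yield response.follow(link.url, callback=self.parse)

You then start as many copies of the process as your machines can handle (scrapy crawl text_spider) and push the domains into Redis, for example with redis-cli lpush text_spider:start_urls http://www.example.com; every idle instance picks up the next URL from the queue.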
Source: https://stackoverflow.com/questions/41262701/extract-text-from-200k-domains-with-scrapy