I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job.
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I'm fairly sure your CPU is not using SSE 4.2 at all, since you can only access that feature from C/C++.
If you do the math, you quickly become compute bound, not memory bound.
Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED
setting, since it caches responses and avoids fetching them a second time. It helps on subsequent crawls and even for offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
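As a rough sketch, enabling the cache only takes a few lines in settings.py (the expiration value below is just an example; 0 means cached pages never expire):

# settings.py
HTTPCACHE_ENABLED = True         # store every response on disk
HTTPCACHE_EXPIRATION_SECS = 0    # 0 = cached responses never expire
HTTPCACHE_DIR = 'httpcache'      # kept under the project's .scrapy directory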
Here's a collection of things to try:

- the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs)
- LOG_ENABLED = False (docs)
- yielding an item in a loop instead of collecting items into the items list and returning them (see the sketch after this list)
- running Scrapy on PyPy, see Running Scrapy on PyPy

Hope that helps.
One workaround to speed up your Scrapy crawl is to configure your start_urls
appropriately.
For example, if the target data is at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:
start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]
In this way, the crawl starts from 250 and from 750 at the same time and works outward in both directions (towards 251 and 249, and towards 751 and 749), so you get roughly 4 times the speed compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
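If the range is larger, the starting points can also be generated instead of written out by hand. A small sketch, assuming the same URL pattern and a doc range of 1 to 1000 (the step of 250 is just an example):

# spread starting points evenly over the doc range so several
# crawl fronts run in parallel; the step size is illustrative
start_urls = [
    "http://apps.webofknowledge.com/doc=%d" % doc
    for doc in range(250, 1001, 250)    # 250, 500, 750, 1000
]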