Speed up web scraper

野趣味 2021-01-30 03:45

I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but I managed to write a spider that does the job.

4 Answers
  •  无人共我
    2021-01-30 04:08

    Here's a collection of things to try:

    • use the latest Scrapy version (if you are not already)
    • check whether any non-standard middlewares are enabled
    • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs); a settings sketch follows this list
    • turn off logging with LOG_ENABLED = False (docs)
    • try yielding each item in a loop instead of collecting items into a list and returning it (sketch below)
    • use a local DNS cache (see this thread)
    • check whether the site throttles downloads and limits your download speed (see this thread)
    • log CPU and memory usage during the spider run to see whether there are bottlenecks there
    • try running the same spider under the scrapyd service
    • see if grequests + lxml performs better (example at the end of this answer; ask if you need help implementing it)
    • try running Scrapy on PyPy, see Running Scrapy on PyPy
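
    For the concurrency and logging items above, a minimal settings.py sketch; the numbers are only starting points to experiment with, not tuned values for your site:

        # settings.py -- experiment with these values and measure the effect
        CONCURRENT_REQUESTS = 100            # global request cap (Scrapy default: 16)
        CONCURRENT_REQUESTS_PER_DOMAIN = 32  # per-domain cap (default: 8)
        LOG_ENABLED = False                  # drop per-request log output
        DNSCACHE_ENABLED = True              # Scrapy's built-in DNS cache (on by default)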

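    To show what yielding items in a loop looks like, here is a rough parse callback; the spider name, URL and XPath expressions are placeholders, not taken from your code:

        import scrapy

        class PageSpider(scrapy.Spider):
            name = "pages"                               # placeholder name
            start_urls = ["https://example.com/list"]    # placeholder URL

            def parse(self, response):
                # yield each item as soon as it is extracted,
                # instead of appending to a list and returning the list at the end
                for row in response.xpath("//div[@class='row']"):   # placeholder XPath
                    yield {
                        "title": row.xpath(".//h2/text()").get(),
                        "url": response.urljoin(row.xpath(".//a/@href").get(default="")),
                    }

    This keeps memory flat and lets the item pipeline start processing items while the page is still being parsed.
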
    Hope that helps.
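
    If you want to benchmark the grequests + lxml route mentioned in the list, a rough sketch for comparison; the URL list and XPath are placeholders, and both libraries are assumed to be installed:

        import grequests                      # gevent-based concurrent wrapper around requests
        from lxml import html

        urls = ["https://example.com/page/%d" % i for i in range(1, 11)]   # placeholder URLs

        # build the requests lazily, then fire them with a pool of 20 concurrent workers
        reqs = (grequests.get(u, timeout=30) for u in urls)
        for resp in grequests.map(reqs, size=20):
            if resp is None or resp.status_code != 200:
                continue                      # request failed or was rejected
            tree = html.fromstring(resp.content)
            print(resp.url, tree.xpath("//title/text()"))   # placeholder XPath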
