Speed up web scraper

野趣味 2021-01-30 03:45

I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job.

4 Answers
  • 2021-01-30 03:58

    I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.

    Parsing HTML blows the CPU data cache, and I am pretty sure your CPU is not using SSE 4.2 at all, as you can only access that feature from C/C++.

    If you do the math, you quickly become compute bound, not memory bound.

  • 2021-01-30 04:00

    Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches the requests and avoids doing them a second time. It helps on subsequent crawls and even with offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
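
    As a rough illustration, the cache can be enabled from settings.py; these are standard Scrapy setting names, but the expiration value below is an assumption, not a requirement:

     # settings.py -- minimal sketch of enabling Scrapy's HTTP cache
     HTTPCACHE_ENABLED = True
     HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire
     HTTPCACHE_DIR = 'httpcache'       # stored under the project's .scrapy/ directory
     HTTPCACHE_IGNORE_HTTP_CODES = []  # cache responses for every status code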

  • 2021-01-30 04:08

    Here's a collection of things to try:

    • use the latest Scrapy version (if you are not already)
    • check whether any non-standard middlewares are used
    • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs; see the settings sketch after this list)
    • turn off logging with LOG_ENABLED = False (docs)
    • try yielding items one at a time in a loop instead of collecting them into a list and returning it (illustrated in the sketch after this list)
    • use a local caching DNS server (see this thread)
    • check whether the site uses a download threshold and limits your download speed (see this thread)
    • log CPU and memory usage during the spider run and see if there are any problems there
    • try running the same spider under the scrapyd service
    • see if grequests + lxml will perform better (ask if you need any help implementing this solution)
    • try running Scrapy on PyPy; see Running Scrapy on PyPy
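
    A minimal sketch of the concurrency/logging settings and the yield-per-item pattern mentioned above; the numbers, spider name, and selectors are placeholder assumptions, not recommendations:

     # settings.py -- illustrative values only; tune for your target site
     CONCURRENT_REQUESTS = 64
     CONCURRENT_REQUESTS_PER_DOMAIN = 32
     LOG_ENABLED = False

    And yielding items as they are parsed, instead of building a list and returning it at the end:

     # spider sketch (hypothetical names and selectors)
     import scrapy

     class DocsSpider(scrapy.Spider):
         name = "docs"
         start_urls = ["http://apps.webofknowledge.com/doc=1"]

         def parse(self, response):
             for row in response.css("table tr"):
                 # yield each item immediately rather than appending to a list
                 yield {"text": row.css("td::text").get()}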

    Hope that helps.

  • 2021-01-30 04:18

    One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately.

    For example, if your target data is at http://apps.webofknowledge.com/doc=1, where the doc numbers range from 1 to 1000, you can configure your start_urls as follows:

     start_urls = [
        "http://apps.webofknowledge.com/doc=250",
        "http://apps.webofknowledge.com/doc=750",
    ]
    

    In this way, requests will start from 250 to 251, 249 to 248, and from 750 to 751, 749 to 748 simultaneously, so the crawl runs roughly 4 times faster compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
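
    If the document IDs are sequential, the same trick can be generalized by generating the start URLs programmatically. A rough sketch, reusing the example range of 1 to 1000; the chunk count is an arbitrary assumption:

     # sketch: split a sequential doc-id range into several start points
     # so the crawl fans out from each of them concurrently
     TOTAL_DOCS = 1000   # assumed range 1..1000, as in the example above
     CHUNKS = 4          # number of parallel "streams"

     step = TOTAL_DOCS // CHUNKS
     start_urls = [
         "http://apps.webofknowledge.com/doc=%d" % (i * step + 1)
         for i in range(CHUNKS)
     ]
     # -> doc=1, doc=251, doc=501, doc=751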
