Python web scraping: difference between sleep and request(page, timeout=x)

旧时模样 submitted on 2020-01-23 03:59:07

Question


When scraping multiple websites in a loop, I notice a rather large difference in speed between

sleep(10)
response = requests.get(url)

and

response = requests.get(url, timeout=10)

That is, the version with timeout is much faster.

Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.

  1. Why is there such a difference in speed?
  2. Why is the scraping duration per page less than 10 seconds?

I now use multiprocessing, but I seem to remember that the above also holds without multiprocessing.


Answer 1:


time.sleep stops your script from running for a given number of seconds, while timeout is the maximum time to wait for the URL to be retrieved. If the data arrives before the timeout expires, the remaining time is skipped. So with timeout a request can take less than 10 seconds.

time.sleep is different: it pauses your script completely until it is done sleeping, and only then is the request made, which takes a few more seconds. So the time.sleep version takes more than 10 seconds every time.

They have very different uses. In your case, you should use a timer, so that if a request finishes in under 10 seconds the program waits out the remainder.
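The timer suggested above could look roughly like the sketch below. The function name scrape_with_min_interval and its fetch parameter are hypothetical, introduced only for illustration; fetch stands for whatever performs the request.

```python
import time


def scrape_with_min_interval(urls, fetch, min_interval=10.0):
    """Call fetch(url) for each URL, spending at least min_interval seconds per page.

    `fetch` is a hypothetical callable supplied by the caller.
    """
    results = []
    for url in urls:
        start = time.monotonic()
        results.append(fetch(url))
        elapsed = time.monotonic() - start
        # Sleep only for whatever is left of the interval, if anything.
        remaining = min_interval - elapsed
        if remaining > 0:
            time.sleep(remaining)
    return results
```

With requests it might be called as scrape_with_min_interval(urls, lambda u: requests.get(u, timeout=10)). A fast response is then padded out to 10 seconds, while a slow one is not delayed any further.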




Answer 2:


response = requests.get(url, timeout=10)
# timeout specifies the maximum time the program will wait for the server to respond before raising an exception. The program does not necessarily pause for 10 seconds: if the response comes back early, the program stops waiting.

Read more about timeouts in the requests documentation.

time.sleep causes your main thread to sleep, so your program will always wait 10 seconds before making a request to the URL.
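To make the exception behaviour concrete, here is a minimal sketch of handling a timed-out request. The helper name fetch_or_none and its get parameter are assumptions made for illustration, not part of requests; only requests.get and requests.exceptions.Timeout come from the library. As far as I know, timeout bounds the wait for the connection and for data from the server, not the total download time.

```python
import requests


def fetch_or_none(url, timeout=10, get=requests.get):
    """Return the response, or None if the server did not respond in time.

    `get` is injectable only so the sketch can be exercised without a network.
    """
    try:
        # Returns as soon as the response arrives; `timeout` is only an upper bound on the wait.
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised when connecting or waiting for data exceeded `timeout` seconds.
        return None
```

A caller can then skip or retry any URL for which fetch_or_none returns None, instead of letting one slow site stall the whole loop.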



Source: https://stackoverflow.com/questions/43169900/python-web-scraping-difference-between-sleep-and-requestpage-timeout-x
