I am very new to all of this; I need to obtain data on several thousand SourceForge projects for a paper I am writing. The data is all freely available in JSON format at the
This approach can easily be adapted to whatever number of concurrent connections you want.
import time
import grequests

MAX_CONNECTIONS = 100  # Number of simultaneous connections you want to limit it to

# urlsList: your list of URLs.
results = []
for x in range(0, len(urlsList), MAX_CONNECTIONS):
    rs = (grequests.get(u, stream=False) for u in urlsList[x:x + MAX_CONNECTIONS])
    time.sleep(0.2)  # You can tune this delay to whatever works better for you.
    results.extend(grequests.map(rs))  # The key here is to extend, not append, not insert.
    print("Waiting")  # Optional, so you see that something is being done.
In my case, it was not rate limiting by the destination server, but something much simpler: I didn't explicitly close the responses, so they were keeping their sockets open, and the Python process ran out of file handles.
My solution (I don't know for sure which change fixed the issue; in theory either of them should) was to:
1. Set stream=False in grequests.get:

rs = (grequests.get(u, stream=False) for u in urls)
2. Explicitly call response.close() after reading response.content:

responses = grequests.map(rs)
for response in responses:
    make_use_of(response.content)
    response.close()
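Putting both changes together, here is a defensive sketch (make_use_of stands in for whatever processing you do, and the URLs are placeholders); the finally block guarantees the socket is released even if processing raises:

import grequests

def make_use_of(body):
    # Placeholder for whatever you actually do with the payload.
    print(len(body))

urls = ["https://example.org/a.json", "https://example.org/b.json"]  # placeholders
rs = (grequests.get(u, stream=False) for u in urls)
responses = grequests.map(rs)
for response in responses:
    if response is None:  # grequests.map yields None for requests that failed
        continue
    try:
        make_use_of(response.content)
    finally:
        response.close()  # release the underlying socket / file handle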
Note: simply destroying the response object (assigning None to it, calling gc.collect()) was not enough; this did not close the file handles.
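If you want to see the leak for yourself, one quick way (Linux-specific, and only a sketch with placeholder URLs) is to count the process's open file descriptors before and after closing the responses:

import os
import grequests

def open_fd_count():
    # Linux-specific: each entry in /proc/self/fd is one open descriptor.
    return len(os.listdir("/proc/self/fd"))

urls = ["https://example.org/%d.json" % i for i in range(20)]  # placeholder URLs
print("before:", open_fd_count())
responses = grequests.map(grequests.get(u) for u in urls)
print("after map:", open_fd_count())  # climbs while responses hold sockets open
for r in responses:
    if r is not None:
        r.close()
print("after close:", open_fd_count())  # should fall back toward the baseline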