I\'ve built a crawler that had to run on about 5M pages (by increasing the url ID) and then parses the pages which contain the info\' I need.
after using an algorith
this turns the recursion in to a loop:
def checkNextID(ID):
global numOfRuns, curRes, lastResult
while ID < lastResult:
try:
numOfRuns += 1
if numOfRuns % 10 == 0:
time.sleep(3) # sleep every 10 iterations
if isValid(ID + 8):
parseHTML(curRes)
ID = ID + 8
elif isValid(ID + 18):
parseHTML(curRes)
ID = ID + 18
elif isValid(ID + 7):
parseHTML(curRes)
ID = ID + 7
elif isValid(ID + 17):
parseHTML(curRes)
ID = ID + 17
elif isValid(ID+6):
parseHTML(curRes)
ID = ID + 6
elif isValid(ID + 16):
parseHTML(curRes)
ID = ID + 16
else:
ID = ID + 1
except Exception, e:
print "somethin went wrong: " + str(e)
Instead of doing recursion, the parts of the code with checkNextID(ID + 18)
and similar could be replaced with ID+=18
, and then if you remove all instances of return 0
, then it should do the same thing but as a simple loop. You should then put a return 0
at the end and make your variables non-global.
Python don't have a great support for recursion because of it's lack of TRE (Tail Recursion Elimination).
This means that each call to your recursive function will create a function call stack and because there is a limit of stack depth (by default is 1000) that you can check out by sys.getrecursionlimit (of course you can change it using sys.setrecursionlimit but it's not recommended) your program will end up by crashing when it hits this limit.
As other answer has already give you a much nicer way for how to solve this in your case (which is to replace recursion by simple loop) there is another solution if you still want to use recursion which is to use one of the many recipes of implementing TRE in python like this one.
N.B: My answer is meant to give you more insight on why you get the error, and I'm not advising you to use the TRE as i already explained because in your case a loop will be much better and easy to read.
You can increase the capacity of the stack by the following :
import sys
sys.setrecursionlimit(10000)