Question
I am a beginner programmer designing a spider that crawls pages. The logic goes like this (a rough sketch in code follows the list):
- get $url with curl
- create a DOMDocument from the response
- parse out the href attributes using XPath
- store the href attributes in $totalurls (those that aren't already there)
- update $url with the next entry from $totalurls
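A simplified sketch of what that loop looks like (error handling stripped out; everything besides $url and $totalurls is a placeholder name for illustration):

$totalurls = array('http://example.com/');   // seed URL, just an example
for ($i = 0; $i < count($totalurls); $i++) {
    $url = $totalurls[$i];

    // 1. fetch the page with curl
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        continue;                            // fetch failed, skip this URL
    }

    // 2. build a DOM document (@ suppresses warnings from malformed HTML)
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // 3. pull every href attribute out with XPath
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a/@href') as $href) {
        $link = $href->nodeValue;
        // 4. queue links that aren't already stored
        if (!in_array($link, $totalurls)) {
            $totalurls[] = $link;
        }
    }
}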
The problem is that after the 10th crawled page the spider reports that it does not find ANY links on the page, nor on the next one, and so on. But if I begin with the page that was 10th in the previous run, it finds all links with no problem, then breaks again after 10 URLs are crawled.
Any idea what might cause this? My guess is it is something with DOMDocument; I am not 100% familiar with it. Or can storing too much data cause trouble? It could be a really basic beginner issue, because I am brand new, and clueless. Please give me some advice on where to look for the problem.
Answer 1:
My guess is that your script times out after 30 or 60 seconds (the PHP default). You can override this with set_time_limit($num_of_seconds); or by raising max_execution_time in your php.ini, or, if you are on shared hosting, through the host's PHP settings panel (or whatever it is called).
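For example (300 seconds is an arbitrary value; pick a cap that suits your crawl):

set_time_limit(300);                   // allow up to 300 seconds from this point on
// or equivalently via the ini setting:
ini_set('max_execution_time', '300');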
Also, you might want to add this at the top of your script:
error_reporting(E_ALL);          // report every category of error
ini_set("display_errors", 1);    // show errors in the output instead of hiding them
and check your error logs to see if there are messages that pertain to your spider.
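If nothing shows up on the page itself, you can also route errors to a file of your choosing (the path below is just an example):

ini_set('log_errors', 1);                        // write errors to the log as well
ini_set('error_log', '/tmp/spider-errors.log');  // example path, adjust to taste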
Source: https://stackoverflow.com/questions/14638214/php-spider-breaks-in-middle-domdocument-xpath-curl-help-needed