Question
I am a beginner programmer designing a spider that crawls pages. The logic goes like this (a rough sketch in code follows the list):
- get $url with curl
- create a DOMDocument from the response
- parse out the href attributes using XPath
- store the href attributes in $totalurls (those that aren't already there)
- update $url with the next entry from $totalurls
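A simplified sketch of what that loop looks like (error handling stripped out; everything besides $url and $totalurls is a placeholder name for illustration):

$totalurls = array('http://example.com/');   // seed URL, just an example
for ($i = 0; $i < count($totalurls); $i++) {
    $url = $totalurls[$i];

    // 1. fetch the page with curl
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        continue;                            // fetch failed, skip this URL
    }

    // 2. build a DOM document (@ suppresses warnings from malformed HTML)
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // 3. pull every href attribute out with XPath
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a/@href') as $href) {
        $link = $href->nodeValue;
        // 4. queue links that aren't already stored
        if (!in_array($link, $totalurls)) {
            $totalurls[] = $link;
        }
    }
}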
The problem is that after the 10th crawled page the spider reports that it does not find ANY links on the page, nor on the next one, and so on. But if I begin with the page that was 10th in the previous run, it finds all links with no problem, then breaks again after 10 URLs are crawled.
Any idea what might cause this? My guess is it is something with DOMDocument; I am not 100% familiar with it. Or can storing too much data cause trouble? It could be a really basic beginner issue, because I am brand new, and clueless. Please give me some advice on where to look for the problem.
Answer 1:
My guess is that your script times out after 30 or 60 seconds (the PHP default). You can override this with set_time_limit($num_of_seconds); or by raising max_execution_time in your php.ini, or, if you are on shared hosting, through the host's PHP settings panel (or whatever it is called).
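For example (300 seconds is an arbitrary value; pick a cap that suits your crawl):

set_time_limit(300);                   // allow up to 300 seconds from this point on
// or equivalently via the ini setting:
ini_set('max_execution_time', '300');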
Also, you might want to add this at the top of your script:
error_reporting(E_ALL);          // report every category of error
ini_set("display_errors", 1);    // show errors in the output instead of hiding them
and check your error logs to see if there are messages that pertain to your spider.
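If nothing shows up on the page itself, you can also route errors to a file of your choosing (the path below is just an example):

ini_set('log_errors', 1);                        // write errors to the log as well
ini_set('error_log', '/tmp/spider-errors.log');  // example path, adjust to taste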
Source: https://stackoverflow.com/questions/14638214/php-spider-breaks-in-middle-domdocument-xpath-curl-help-needed