php spider breaks in middle (Domdocument, xpath, curl) - help needed

Submitted by 我与影子孤独终老i on 2019-12-25 01:24:49

Question


I am a beginner programmer designing a spider that crawls pages. The logic goes like this:

  • get $url with curl
  • create a DOM document
  • parse out the href tags using XPath
  • store the href attributes in $totalurls (if they aren't already there)
  • update $url from $totalurls
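The parsing half of the steps above (build the DOM, run the XPath query, de-duplicate into `$totalurls`) can be sketched as a single function. This is an assumption about how the asker's code is organized, not their actual code; the curl fetch from step 1 would produce the `$html` string that gets passed in:

```php
<?php
// Sketch of steps 2-4: parse an HTML string and collect the href
// attributes that are not already in $totalurls. The name $totalurls
// comes from the question; the function itself is hypothetical.
function extract_links(string $html, array $totalurls): array
{
    $dom = new DOMDocument();
    // Real-world pages are rarely well-formed; silence parser warnings
    // so malformed HTML does not flood the output.
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    // Select every href attribute of every <a> element.
    foreach ($xpath->query('//a/@href') as $href) {
        $link = $href->nodeValue;
        // Store only links that aren't already there (step 4).
        if (!in_array($link, $totalurls, true)) {
            $totalurls[] = $link;
        }
    }
    return $totalurls;
}
```

A loop that feeds each newly discovered URL back through curl and this function (step 5) completes the crawler.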

The problem is that after the 10th crawled page, the spider reports that it does not find ANY links on the page, none on the next one, and so on.

But if I begin with the page that was 10th in the previous run, it finds all links with no problem, then breaks again after 10 URLs crawled.

Any idea what might cause this? My guess is that it's something with DOMDocument, maybe; I am not 100% familiar with it. Or can storing too much data cause trouble? It could be a really basic beginner issue, because I am brand new and clueless. Please give me some advice on where to look for the problem.


Answer 1:


My guess is that your script times out after 30 or 60 seconds (the PHP default), which can be overridden with set_time_limit($num_of_seconds);. Alternatively, you can change max_execution_time in your php.ini, or, on shared hosting, through your host's PHP settings panel.
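For a long-running crawl, the limit is typically lifted at the top of the script. A minimal sketch of both approaches mentioned above (the concrete values are examples, not requirements):

```php
<?php
// Lift the execution time limit for a long-running crawl.
// 0 means "no limit"; a positive number is a limit in seconds.
set_time_limit(0);

// Equivalent via the ini setting (this is what set_time_limit changes).
ini_set('max_execution_time', '0');
```

Note that set_time_limit() counts only script execution time, not time spent in system calls such as network I/O on some platforms, so behavior can differ between a web SAPI and the CLI, where the default is already unlimited.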

Also, you might want to add this to the top of your page:

error_reporting(E_ALL);
ini_set("display_errors", 1);

and check your error logs to see if there are messages that pertain to your spider.



Source: https://stackoverflow.com/questions/14638214/php-spider-breaks-in-middle-domdocument-xpath-curl-help-needed
