Errors regarding Web Crawler in PHP

独厮守ぢ 2021-01-14 21:37

I am trying to create a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent pages.

I have used Simple HTML DOM for

1 Answer
  • 2021-01-14 22:16

    Flat Loop Example:

    1. Initialize the stack with all URLs you'd like to process first.
    2. Inside the loop:
      1. Shift the first URL off the stack (it is returned and removed).
      2. For each new URL you find, push it onto the end of the stack.

    This will run until every URL on the stack has been processed, so add a counter (as you already have for your foreach) to prevent it from running for too long:

    $URLStack = (array) $parent_Url_Html->getHTML()->find('a');
    $URLProcessedCount = 0;
    while ($URLProcessedCount++ < 500) # this could run endlessly, so cap the number of processed URLs
    {
        $url = array_shift($URLStack);
        if (!$url) break; # exit when the stack is empty
    
        # process $url here
    
        # for each new URL found on the page:
        $URLStack[] = $newURL;
    }
    

    You can make this even more intelligent by not adding URLs to the stack that already exist in it; to do that reliably, you need to insert only absolute URLs into the stack. I highly suggest you do this, because there is no need to process a page you've already fetched again (for example, nearly every page probably links back to the homepage). If you want to do this, read the stack by index and increment $URLProcessedCount inside the loop, so previous entries are kept as well:

    while ($URLProcessedCount < 500) # cap the number of processed URLs
    {
        $url = $URLStack[$URLProcessedCount++] ?? null;
        if (!$url) break; # exit when no unprocessed URLs remain
    
        # process $url as before; processed entries stay in $URLStack
    }
    

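    To avoid re-queuing pages, you can track every URL ever added in a set and normalize links to absolute form before enqueuing. The sketch below assumes this approach; `normalizeUrl()` and `enqueue()` are hypothetical helpers (not part of Simple HTML DOM), and the base-URL resolution is deliberately naive rather than RFC 3986 compliant:

    ```php
    <?php
    // Sketch: de-duplicating the URL stack. normalizeUrl() and enqueue()
    // are illustrative helpers, not library functions.

    function normalizeUrl(string $base, string $href): ?string
    {
        if ($href === '' || $href[0] === '#') return null;          // skip fragments
        if (preg_match('#^https?://#i', $href)) return $href;       // already absolute
        if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) return null; // mailto:, javascript:, ...
        // Naive base resolution -- good enough for a sketch only
        return rtrim($base, '/') . '/' . ltrim($href, '/');
    }

    $seen = [];       // every URL ever enqueued, used as a set via array keys
    $URLStack = [];

    function enqueue(array &$URLStack, array &$seen, string $url): void
    {
        if (!isset($seen[$url])) {  // only enqueue URLs we have never seen
            $seen[$url] = true;
            $URLStack[] = $url;
        }
    }
    ```

    With this in place, calling `enqueue()` twice with the same absolute URL adds it to the stack only once.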
    Additionally, I suggest you use the PHP DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
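    As a minimal sketch of the DOMDocument approach, here is how link extraction could look; the inline HTML string stands in for a page you would normally fetch with file_get_contents() or cURL:

    ```php
    <?php
    // Sketch: extracting <a href> values with PHP's built-in DOMDocument.
    $html = '<html><body><a href="http://example.edu/a">A</a><a href="/b">B</a></body></html>';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely well-formed; suppress parse warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;  // collect raw href values for the crawl stack
        }
    }
    ```

    The extracted `$links` would then be normalized to absolute URLs and pushed onto the stack as described above.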
