Errors regarding Web Crawler in PHP


I am trying to create a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent.

I have used simple html dom for


1 Answer

北恋

2021-01-14 22:16
Flat Loop Example:
1. You initialize the loop with a stack that contains all URLs you'd like to process first.
2. Inside the loop:
  1. You shift the first URL off the stack (you obtain it and it is removed).
  2. If you find new URLs, you add them at the end of the stack (push).
Because URLs are taken from the front and added at the end, the stack behaves as a first-in, first-out queue. The loop runs until every URL on the stack has been processed, so add a counter (as you already do for your foreach) to keep it from running for too long:
```
$URLStack = (array) $parent_Url_Html->getHTML()->find('a');
$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500) # cap the number of processed URLs; otherwise the loop could run indefinitely
{
    $url = array_shift($URLStack);
    if (!$url) break; # exit if the stack is empty

    # process URL

    # for each new URL found on the page:
    $URLStack[] = $newURL;
}
```
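The loop above can be tried without any network access. The following is a minimal sketch, assuming a hypothetical in-memory `$linkGraph` array that stands in for fetching a page and extracting its links; `crawlFlat` and the example URLs are illustrative names and not part of the original code.

```php
<?php
// Hypothetical link graph: page URL => URLs found on that page.
// It stands in for fetching and parsing real pages.
$linkGraph = [
    'http://a.example.edu/'  => ['http://a.example.edu/x', 'http://a.example.edu/y'],
    'http://a.example.edu/x' => ['http://a.example.edu/z'],
    'http://a.example.edu/y' => [],
    'http://a.example.edu/z' => [],
];

// Processes URLs in first-in, first-out order and stops after $limit URLs.
function crawlFlat(array $seeds, array $linkGraph, int $limit): array
{
    $queue = $seeds;
    $processed = [];
    $count = 0;
    while ($count++ < $limit) {
        $url = array_shift($queue);       // take the oldest URL off the front
        if ($url === null) {
            break;                        // nothing left to process
        }
        $processed[] = $url;              // placeholder for the real page processing
        foreach ($linkGraph[$url] ?? [] as $newUrl) {
            $queue[] = $newUrl;           // append newly found URLs at the end
        }
    }
    return $processed;
}

$order = crawlFlat(['http://a.example.edu/'], $linkGraph, 500);
print_r($order);
```

Since this sketch does no duplicate checking, a link graph that contains cycles would keep re-queuing the same pages until the limit is reached, which is why the counter matters.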
You can make it smarter by not adding URLs that are already on the stack. For that to work, you must insert only absolute URLs, so that the same page is always represented by the same string. I highly suggest doing this, because there is no need to process a page you have already fetched (for example, each page probably contains a link to the homepage). To do this, do not shift URLs off the stack; instead, use $URLProcessedCount as the index of the next URL and increment it inside the loop, so that previously processed entries stay on the stack and can be checked against:
```
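while ($URLProcessedCount < 500) # cap the number of processed URLs; otherwise the loop could run indefinitely
{
    if (!isset($URLStack[$URLProcessedCount])) break; # no unprocessed URLs left
    $url = $URLStack[$URLProcessedCount++];
    # process URL; append only new URLs that are not already in $URLStack
}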
```
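The duplicate check described above could look roughly like this. It is a sketch under stated assumptions: `$linkGraph` is a hypothetical stand-in for fetching pages, all URLs in it are already absolute, and `crawlUnique` and the `$seen` lookup array are illustrative names that do not appear in the original code.

```php
<?php
// Hypothetical link graph with absolute URLs only; it stands in for real fetching.
$linkGraph = [
    'http://a.example.edu/'  => ['http://a.example.edu/x', 'http://a.example.edu/'],
    'http://a.example.edu/x' => ['http://a.example.edu/', 'http://a.example.edu/y'],
    'http://a.example.edu/y' => ['http://a.example.edu/x'],
];

// Visits each URL at most once, stopping after $limit URLs.
function crawlUnique(array $seeds, array $linkGraph, int $limit): array
{
    $urls = array_values(array_unique($seeds)); // processed and pending URLs, in order
    $seen = array_flip($urls);                  // URL => index, used only for key lookups
    $i = 0;
    while ($i < $limit && isset($urls[$i])) {
        $url = $urls[$i++];                     // advance the index instead of shifting
        foreach ($linkGraph[$url] ?? [] as $newUrl) {
            if (!isset($seen[$newUrl])) {       // skip URLs that are already in the list
                $seen[$newUrl] = true;
                $urls[] = $newUrl;
            }
        }
    }
    return array_slice($urls, 0, $i);           // only the URLs actually processed
}

$visited = crawlUnique(['http://a.example.edu/'], $linkGraph, 500);
print_r($visited);
```

Keeping a separate lookup array keyed by URL avoids scanning the whole list with `in_array` for every link, which matters once the list grows to thousands of entries.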
Additionally, I suggest you use PHP's DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
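As an illustration of the DOMDocument approach, here is a minimal sketch that pulls the `href` values out of a page. The sample markup and the `extractLinks` helper are assumptions made for this example; in a crawler, `$html` would be the body of a fetched page.

```php
<?php
// Sample markup; in a crawler this would be the fetched page body.
$html = '<html><body><a href="http://a.example.edu/x">X</a>'
      . '<a name="top">no href</a><a href="/y">Y</a></body></html>';

// Returns the href attribute of every <a> element that has one.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$links = extractLinks($html);
print_r($links);
```

Note that `/y` comes back as a relative URL; it would still have to be resolved against the URL of the page it was found on before it can be compared with other entries.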