I am trying to create a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent.
I have used simple html dom for
Flat Loop Example:
This runs until every URL on the stack has been processed, so add a counter (as you already have for the foreach) to prevent it from running for too long:
$URLStack = (array) $parent_Url_Html->getHTML()->find('a');
$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500) # the stack can keep growing, so cap the number of URLs processed
{
    $url = array_shift($URLStack);
    if (!$url) break; # exit if the stack is empty
    # process URL
    # for each new URL:
    $URLStack[] = $newURL;
}
You can make this smarter by not adding URLs that are already on the stack; for that to work, you must insert only absolute URLs. I highly recommend doing so, because there is no need to process a page you have already fetched (for example, nearly every page probably links back to the homepage). To do this, stop shifting entries off the stack and instead use $URLProcessedCount as the index, incrementing it inside the loop, so that previous entries stay in the array:
while ($URLProcessedCount < 500 && isset($URLStack[$URLProcessedCount])) # cap the run and stop at the end of the stack
{
    $url = $URLStack[$URLProcessedCount++];
    # process URL; append only absolute URLs that are not already in $URLStack
}
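To make the duplicate check concrete, here is a minimal sketch of the indexed loop, under some assumptions. The names crawl(), fetchLinks(), $seen and $fakeWeb are hypothetical and not part of the original code. fetchLinks() reads from an in-memory array instead of downloading pages, so the example runs on its own. In a real crawler it would fetch the page and return absolute URLs.

```php
<?php
// Hypothetical helper: returns the absolute URLs linked from $url.
// It looks them up in a fixed array instead of fetching the page.
function fetchLinks(string $url, array $fakeWeb): array
{
    return $fakeWeb[$url] ?? [];
}

// Hypothetical crawl loop: visits at most $limit URLs, each only once.
function crawl(array $seedUrls, array $fakeWeb, int $limit = 500): array
{
    $URLStack = array_values(array_unique($seedUrls));
    // Keyed lookup table so the duplicate check does not scan the whole stack.
    $seen = array_fill_keys($URLStack, true);
    $URLProcessedCount = 0;

    while ($URLProcessedCount < $limit && isset($URLStack[$URLProcessedCount])) {
        $url = $URLStack[$URLProcessedCount++];
        foreach (fetchLinks($url, $fakeWeb) as $newURL) {
            if (!isset($seen[$newURL])) {
                $seen[$newURL] = true;
                $URLStack[] = $newURL; // previous entries stay in the array
            }
        }
    }

    // Only the entries before the counter have actually been processed.
    return array_slice($URLStack, 0, $URLProcessedCount);
}

// Three pages that all link to each other.
$fakeWeb = [
    'http://example.edu/'  => ['http://example.edu/a', 'http://example.edu/b'],
    'http://example.edu/a' => ['http://example.edu/', 'http://example.edu/b'],
    'http://example.edu/b' => ['http://example.edu/'],
];
$visited = crawl(['http://example.edu/'], $fakeWeb);
```

Each page ends up in $visited exactly once even though the pages link back to each other. The separate $seen array is a design choice: in_array() on $URLStack would also work, but it gets slower as the stack grows.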
Additionally, I suggest you use PHP's DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
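As a rough illustration of the DOMDocument route, the sketch below pulls the href values out of an HTML string. The function name extractLinks() and the sample markup are assumptions made for the example. It returns the href values exactly as written, so relative links still need to be resolved to absolute URLs before they go on the stack.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in $html.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when the attribute is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<html><body><a href="http://example.edu/a">A</a><a>no href</a><a href="/b">B</a></body></html>';
$links = extractLinks($html);
```

Note that loadHTML() rejects an empty string, so a real crawler should skip pages whose download returned nothing.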