How can I speed this up?

问题

I have a script which I think is pretty basic scraping, call it what you will, but it takes on average at least 6 seconds...is it possible to speed it up? The $date variables are only there for timing the code and don't add anything significant to the time it takes. I have set two timing markers and each is approx 3 seconds between. Example URL below for testing

$date = date('m/d/Y h:i:s a', time());

echo "start of timing $date<br /><br />"; 

include('simple_html_dom.php');

function getUrlAddress()
{
$url = $_SERVER['HTTPS'] == 'on' ? 'https' : 'http';
return $url .'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}

$date = date('m/d/Y h:i:s a', time());  echo "<br /><br />after geturl $date<br /><br />";

$parts = explode("/",$url);

$html = file_get_html($url);

$date = date('m/d/Y h:i:s a', time());  echo "<br /><br />after file_get_url $date<br /><br />";

$file_string = file_get_contents($url);
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];

foreach($html->find('img') as $e){

    $image = $e->src;

    if (preg_match("/orangeBlue/", $image)) { $image = ''; }

    if (preg_match("/BeaconSprite/", $image)) { $image = ''; }

    if($image != ''){

    if (preg_match("/http/", $image)) { $image = $image; }

    elseif (preg_match("*//*", $image)) { $image = 'http:'.$image; }

    else { $image = $parts['0']."//".$parts[1].$parts[2]."/".$image; }

    $size = getimagesize($image);
    if (($size[0]>110)&&($size[1]>110)){
    if (preg_match("/http/", $image)) { $image = $image; }
    echo '<img src='.$image.'><br>';
    }
    }
    }

$date = date('m/d/Y h:i:s a', time());  echo "<br /><br />end of timing $date<br /><br />";

Example URL

UPDATE

This is actual what timing markers show:

start of timing 01/24/2012 12:31:50 am

after geturl 01/24/2012 12:31:50 am

after file_get_url 01/24/2012 12:31:53 am

end of timing 01/24/2012 12:31:57 am

http://www.ebay.co.uk/itm/Duke-Nukem-Forever-XBOX-360-Game-BRAND-NEW-SEALED-UK-PAL-UK-Seller-/170739972246?pt=UK_PC_Video_Games_Video_Games_JS&hash=item27c0e53896`

回答1:

It's probably the getimagesize function - it is going and fetching every image on the page so it can determine the size. Maybe you can write something with curl to get the header only for Content-size (though, actually, this might be what getimagesize does).

At any rate, back in the day I wrote a few spiders and it's kind of slow to do, with internet speeds better than ever it's still a fetch for each element. And I wasn't even concerned with images.

回答2:

I'm not a PHP guy, but it looks to me like you're going out to the web to get the file twice...

First using this:

$html = file_get_html($url);

Then again using this:

$file_string = file_get_contents($url);

So if each hit takes a couple of seconds, you might be able to reduce your timing by finding a way to cut this down to a single web-hit.

Either that, or I'm blind. Which is a real possibility!

来源：https://stackoverflow.com/questions/8980519/how-can-i-speed-this-up

标签

php

html

simple-html-dom