html scraping and css queries

没有蜡笔的小新 2021-02-02 01:15

What are the advantages and disadvantages of the following libraries?

PHP Simple HTML DOM Parser
QP
phpQuery

From the above I've

1 answer

独厮守ぢ (original poster)

2021-02-02 02:02

I used to use Simple HTML DOM exclusively until some bright SO'ers showed me the light. Hallelujah.

Just use the built-in DOM functions. They are written in C and are part of the PHP core. In my experience they are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is muy simple. This simple change has made my PHP-based scrapers run faster, while saving my precious time.

My scrapers used to take ~60 megabytes to scrape 10 sites asynchronously with curl. That was even with the Simple HTML DOM memory fix you mentioned.

Now my PHP processes never go above 8 megabytes.
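Since the question is about CSS queries, here is a minimal sketch of how a CSS selector such as `div.content a` might be expressed as an XPath query with the built-in DOM classes. The sample markup and variable names are made up for illustration, and the class-matching expression is one common idiom, not the only way to write it.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<div class="nav main"><a href="/home">Home</a></div>'
      . '<div class="content"><a href="/post/1">Post</a><img src="/a.png"></div>';

$dom = new DOMDocument();
// Real-world HTML is often malformed; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Rough XPath equivalent of the CSS selector "div.content a":
// matches "content" as a whole class token, even when several classes are listed.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]//a";

$links = [];
foreach ($xpath->query($query) as $node) {
    $links[] = $node->getAttribute('href');
}

print_r($links);
```

The `concat`/`normalize-space` wrapper is there so that a class like `content-wide` does not match by accident, which a plain `contains(@class, 'content')` would allow.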
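To check a figure like this on your own scraper, one option is to print PHP's peak memory at the end of the run. The sketch below uses a made-up input string. Note that `memory_get_peak_usage()` reports PHP's own allocator and may not include memory that libxml allocates itself, so treat the number as a rough lower bound, not an exact process size.

```php
<?php
// Hypothetical input: 1000 small paragraphs, each containing one link.
$html = str_repeat('<p><a href="/x">x</a></p>', 1000);

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

// Count the links so the parse does some real work.
$count = (new DOMXPath($dom))->query('//a')->length;

// Peak memory obtained from the system by PHP's memory manager, in megabytes.
$peak_mb = memory_get_peak_usage(true) / (1024 * 1024);

printf("parsed %d links, peak memory %.2f MB\n", $count, $peak_mb);
```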

Highly recommended.

EDIT

Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster.

Built-in PHP DOM: 0.007061 seconds
Simple HTML DOM:  0.117781 seconds

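// Assumes simple_html_dom.php is already included and $html holds the page source.
$data = array('dom' => array(), 'simple_dom' => array());

$timer_start = microtime(true);

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed markup
$x = new DOMXPath($dom);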

foreach($x->query("//a") as $node) 
{
     $data['dom'][] = $node->getAttribute("href");
}

foreach($x->query("//img") as $node) 
{
     $data['dom'][] = $node->getAttribute("src");
}

foreach($x->query("//input") as $node) 
{
     $data['dom'][] = $node->getAttribute("name");
}

$dom_time =  microtime(true) - $timer_start;

echo "built in php DOM : $dom_time\n";

$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach( $simple_dom->find("a") as $node)
{
   $data['simple_dom'][] = $node->href;
}

foreach( $simple_dom->find("img") as $node)
{
   $data['simple_dom'][] = $node->src;
}

foreach( $simple_dom->find("input") as $node)
{
   $data['simple_dom'][] = $node->name;
}
$simple_dom_time =  microtime(true) - $timer_start;

echo "simple html  DOM : $simple_dom_time\n";
