Improving HTML scraper efficiency with pcntl_fork()

那年仲夏 提交于 2019-12-02 03:12:15

It seems like I suggest this daily, but have you looked at Gearman? There's even a well documented PECL class.

Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or it can fire it and forget. At your option, workers can even send back status updates, and how far through the process they are.

In other words, you get the benefits of multiple processes or threads, without having to worry about processes and threads. The clients and workers can even be on different machines.

Baba

Introduction

pcntl_fork() is not the only way to improve performance HTML scraper while it might be a good idea to use Message Queue has Charles suggested but you still need a faster effective way to pull that request in your workers

Solution 1

Use curl_multi_init ... curl is actually faster and using multi curl gives you parallel processing

From PHP DOC

curl_multi_init Allows the processing of multiple cURL handles in parallel.

So Instead of using $html->loadHtmlFile('http://xxxx'); to load the files several times you can just use curl_multi_init to load multiple url at the same time

Here are some Interesting Implementations

Solution 2

You can use pthreads to use multi-threading in PHP

Example

// Number of threads you want
$threads = 10;

// Treads storage
$ts = array();

// Your list of URLS // range just for demo
$urls = range(1, 50);

// Group Urls
$urlsGroup = array_chunk($urls, floor(count($urls) / $threads));

printf("%s:PROCESS  #load\n", date("g:i:s"));

$name = range("A", "Z");
$i = 0;
foreach ( $urlsGroup as $group ) {
    $ts[] = new AsyncScraper($group, $name[$i ++]);
}

printf("%s:PROCESS  #join\n", date("g:i:s"));

// wait for all Threads to complete
foreach ( $ts as $t ) {
    $t->join();
}

printf("%s:PROCESS  #finish\n", date("g:i:s"));

Output

9:18:00:PROCESS  #load
9:18:00:START  #5592     A
9:18:00:START  #9620     B
9:18:00:START  #11684    C
9:18:00:START  #11156    D
9:18:00:START  #11216    E
9:18:00:START  #11568    F
9:18:00:START  #2920     G
9:18:00:START  #10296    H
9:18:00:START  #11696    I
9:18:00:PROCESS  #join
9:18:00:START  #6692     J
9:18:01:END  #9620       B
9:18:01:END  #11216      E
9:18:01:END  #10296      H
9:18:02:END  #2920       G
9:18:02:END  #11696      I
9:18:04:END  #5592       A
9:18:04:END  #11568      F
9:18:04:END  #6692       J
9:18:05:END  #11684      C
9:18:05:END  #11156      D
9:18:05:PROCESS  #finish

Class Used

class AsyncScraper extends Thread {

    public function __construct(array $urls, $name) {
        $this->urls = $urls;
        $this->name = $name;
        $this->start();
    }

    public function run() {
        printf("%s:START  #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
        if ($this->urls) {
            // Load with CURL
            // Parse with DOM
            // Do some work

            sleep(mt_rand(1, 5));
        }
        printf("%s:END  #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
    }
}
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!