PHP Simple HTML Dom Parser Memory Leak / Usage

情书的邮戳 2021-01-01 06:46

I'm trying to use PHP Simple HTML Dom Parser to parse some information from some sites. It does not matter what or where. But it seems that there is some HUGE memory problem.

2 Answers
  • 2021-01-01 07:32

    There seem to be memory problems when str_get_html() or a similar function creates a simple_html_dom object several times without clearing and destroying the previous one, especially when using ->find(), which creates an array of simple_html_dom_node objects. Even the FAQ on the author's site says to clear and destroy the previous simple_html_dom object before creating a new one, but sometimes that can't be done without additional code and memory.

    That's why I created this function, to remove all PHP Simple HTML Dom Parser traces from memory:

    function clean_all(&$items,$leave = ''){
        foreach($items as $id => $item){
            // Skip the variable name(s) the caller asked to keep.
            if($leave && ((!is_array($leave) && $id == $leave) || (is_array($leave) && in_array($id,$leave)))) continue;
            // Never touch the GLOBALS entry itself.
            if($id != 'GLOBALS'){
                if(is_object($item) && ((get_class($item) == 'simple_html_dom') || (get_class($item) == 'simple_html_dom_node'))){
                    // Parser object: clear its internal references, then drop the variable.
                    $items[$id]->clear();
                    unset($items[$id]);
                }else if(is_array($item)){
                    // Array (for example a ->find() result): check the first element of the local copy.
                    $first = array_shift($item);
                    if(is_object($first) && ((get_class($first) == 'simple_html_dom') || (get_class($first) == 'simple_html_dom_node'))){
                        unset($items[$id]);
                    }
                    unset($first);
                }
            }
        }
    }
    

    Usage:

    Clean ALL traces of PHP Simple HTML Dom Parser from memory: clean_all($GLOBALS);

    Clean all traces of PHP Simple HTML Dom Parser from memory, except $myobj: clean_all($GLOBALS,'myobj');

    Clean all traces of PHP Simple HTML Dom Parser from memory, except list of objects ($myobj1,$myobj2...): clean_all($GLOBALS,array('myobj1','myobj2'));

    Hope it will help others too.


    Generally I use it when I call str_get_html() more than once, like:

    $site=file_get_contents('http://google.com');
    $site_html=str_get_html($site);
    // find() must be called on the parsed object, not on the raw string
    foreach($site_html->find('a') as $a){
       $site2=file_get_contents($a->href);
       $site2_html=str_get_html($site2);
       echo $site2_html->find('p',0)->plaintext;
    }
    clean_all($GLOBALS);
    

    In this example I can't call $site_html->clear() before the foreach, because the foreach would then fail. And because str_get_html() is called multiple times without clearing the previous objects, the dependencies between them get broken, and clearing everything at the end still leaves memory leaks. That's why my function has to search the defined variables for simple_html_dom objects and clear them manually.

    In my case I forked inside the foreach, and after a few steps the main PHP script was using about 100MB of memory. With each further fork the usage kept growing, and it nearly killed my server. Of course, when a PHP script ends it frees its memory, but with 8GB in use it took ages to end.
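
    One way to limit the growth in a loop like the one above might be to clear each inner document as soon as it has been used, rather than cleaning up everything at the end. The sketch below is only an illustration: with_dom() is a hypothetical helper (not part of the library), and it assumes only that the object passed in exposes a clear() method, as simple_html_dom does. It also assumes that a failed parse yields a non-object, which is guarded against.

    ```php
    <?php
    // Hypothetical helper (not part of PHP Simple HTML Dom Parser):
    // run $work on a parsed document, then always call clear() on it.
    function with_dom($dom, callable $work) {
        // Assumption: a failed parse gives a non-object; nothing to clear then.
        if (!is_object($dom)) {
            return null;
        }
        try {
            return $work($dom);
        } finally {
            // Runs even if $work throws, so the document is always cleared.
            $dom->clear();
        }
    }

    // Intended use with the library loaded (commented out so this block runs alone):
    // $site_html = str_get_html(file_get_contents('http://google.com'));
    // foreach ($site_html->find('a') as $a) {
    //     echo with_dom(str_get_html(file_get_contents($a->href)), function ($page) {
    //         $p = $page->find('p', 0);
    //         return $p ? $p->plaintext : '';
    //     });
    // }
    // $site_html->clear();
    // unset($site_html);
    ```

    The outer $site_html still has to stay alive until the foreach finishes, so only the inner documents are cleared per iteration.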

  • 2021-01-01 07:45

    I believe you need to call clear() on $main_html

    From the docs...

    Q: This script is leaking memory seriously... After it finished running, it's not cleaning up dom object properly from memory..

    A: Due to PHP5 circular references memory leak, after creating DOM object, you must call $dom->clear() to free memory if call file_get_dom() more than once.

    Example:

    $html = file_get_html(...); 
    // do something... 
    $html->clear(); 
    unset($html);
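
    To illustrate the circular-reference problem the FAQ describes, here is a minimal sketch using stand-in classes. DemoDom and DemoNode are made up for this example and are not the library's real classes. It assumes PHP 7.4 or later (for WeakReference) and switches the cycle collector off to roughly mimic a PHP version without one. Under those assumptions, a parent/child cycle survives unset() unless clear() breaks the references first.

    ```php
    <?php
    // Hypothetical stand-ins for a document and its nodes (illustration only).
    class DemoNode {
        public $dom; // back-reference to the owning document
        public function __construct($dom) { $this->dom = $dom; }
    }

    class DemoDom {
        public $nodes = array();
        public function __construct($count) {
            for ($i = 0; $i < $count; $i++) {
                // Document holds the node, node holds the document: a cycle.
                $this->nodes[] = new DemoNode($this);
            }
        }
        // Break the cycle so plain reference counting can free the objects.
        public function clear() {
            foreach ($this->nodes as $node) {
                $node->dom = null;
            }
            $this->nodes = array();
        }
    }

    // Returns true if the document is gone right after unset().
    function dom_is_freed_after_unset($callClear) {
        gc_disable();                        // no cycle collection during the check
        $dom = new DemoDom(3);
        $ref = WeakReference::create($dom);  // observes $dom without keeping it alive
        if ($callClear) {
            $dom->clear();
        }
        unset($dom);
        $freed = ($ref->get() === null);
        gc_enable();
        gc_collect_cycles();                 // tidy up whatever the cycle kept alive
        return $freed;
    }

    var_dump(dom_is_freed_after_unset(true));  // bool(true)
    var_dump(dom_is_freed_after_unset(false)); // bool(false)
    ```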
    