PHP - how to get main HTML content like Reader Mode in Firefox

无人共我 2021-02-11 06:33

In the Firefox app for Android and in Safari on the iPad, "Reader Mode" lets you read only a page's main content. How can I recognize only the main content of an HTML page with PHP?

I need to dete

5 answers
  •  猫巷女王i
    2021-02-11 07:11

    Readability.php works pretty well, but I've found you get better results if you fetch the HTML with cURL and spoof the user agent. You can also have cURL follow redirects in case the URL you are requesting bounces you around. Here is what I'm using now, slightly modified from another post (PHP Curl following redirects). Hope you find it useful.

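    function getData($url) {
        // Decode the URL and turn HTML-escaped ampersands back into plain ones.
        $url = str_replace('&amp;', '&', urldecode(trim($url)));
        $timeout = 5; // seconds, used for both the connect and the total timeout
        $cookie = tempnam('/tmp', 'CURLCOOKIE'); // cookie jar, for sites that set cookies while redirecting
        $ch = curl_init();
        // Present a browser user agent instead of cURL's default.
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
        // Follow up to 10 redirects and set the Referer header automatically.
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_ENCODING, ''); // accept any compression cURL supports
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
        $content = curl_exec($ch); // false if the request failed
        curl_close($ch);
        return $content;
    }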
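
Because the function above follows redirects, it can help to know where the request actually ended up and whether it failed. The following is only a sketch under assumptions: `fetchPage` and the keys of the array it returns are hypothetical names for illustration, not part of the original answer or of Readability.php. It uses only standard cURL extension calls.

```php
<?php
// Hypothetical helper: fetch a page and report the final URL after redirects.
// Returns null when the transfer itself fails (DNS error, timeout, bad URL).
function fetchPage(string $url, int $timeout = 5): ?array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,  // follow Location headers
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_ENCODING       => '',    // accept any supported compression
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }

    // The handle is released when $ch goes out of scope.
    return [
        'html'   => $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),     // status of the last response
        'url'    => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), // address after redirects
    ];
}
```

A caller could check for `null` and a non-200 `status` before parsing, and the `url` value may be a better candidate for `$url` than the address the request started with.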
    

    Implementation:

    $url = 'http://';
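    // Plain file_get_contents() also works, but sends no browser user agent:
    //$html = file_get_contents($url);
    $html = getData($url);
    
    // If the tidy extension is installed, repair broken markup before parsing.
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($html, array(), 'UTF8');
        $tidy->cleanRepair();
        $html = $tidy->value;
    }
    
    // Hand the cleaned HTML and its URL to Readability.php.
    $readability = new Readability($html, $url);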
    
    //...
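
To illustrate what "recognizing the main content" can mean, here is a rough sketch that uses only PHP's built-in DOM extension. It is not Readability's algorithm, just a much cruder heuristic swapped in for illustration: drop obvious non-content elements, then keep the container whose direct `<p>` children hold the most text. The function name `guessMainContent` and the sample markup are assumptions made up for this example.

```php
<?php
// Hypothetical helper: return the HTML of the container with the most
// paragraph text, or null if no container has any paragraphs.
function guessMainContent(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common hint so libxml reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove elements that rarely contain article text.
    $noise = $xpath->query('//script|//style|//nav|//header|//footer|//aside|//form');
    foreach (iterator_to_array($noise) as $node) {
        $node->parentNode->removeChild($node);
    }

    $best = null;
    $bestScore = 0;
    foreach ($xpath->query('//div|//article|//section|//main|//td') as $candidate) {
        // Score a container by the text length of its direct <p> children.
        $score = 0;
        foreach ($xpath->query('./p', $candidate) as $p) {
            $score += strlen(trim($p->textContent));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $candidate;
        }
    }

    return $best === null ? null : $doc->saveHTML($best);
}

// Usage with a small made-up page.
$sample = '<html><body>'
    . '<nav><p>Home | About | Contact | A long navigation paragraph to ignore</p></nav>'
    . '<div id="sidebar"><p>Ad</p></div>'
    . '<div id="post"><p>First paragraph of the article, long enough to win.</p>'
    . '<p>Second paragraph.</p></div>'
    . '</body></html>';
$main = guessMainContent($sample);
```

A heuristic this simple will misfire on many real pages, which is why a library such as Readability.php, with its more elaborate scoring, is usually the better choice.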
    
