Simple html dom character encoding issue

灰色年华 2020-12-18 11:07

I am using Simple HTML DOM to retrieve content from another website, but there's a character encoding issue with the content it retrieves. The c

3 answers
  • 2020-12-18 11:30

    I had this problem too, but it was not a charset problem. It was gzip compression, which Simple HTML DOM doesn't handle. Here is my solution: use the function file_get_html2 instead of file_get_html.

    function curl($url){
        // Browser-like request headers; some sites respond differently without them.
        // Accept-Encoding is left to CURLOPT_ENCODING below, so it is not set by hand.
        $headers = [
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13",
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language: en-us,en;q=0.5",
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Keep-Alive: 115",
            "Connection: keep-alive",
            "Cache-Control: max-age=0",
        ];

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
        // An empty string advertises every encoding this cURL build supports
        // (gzip, deflate, ...) and makes cURL decompress the response for us.
        curl_setopt($curl, CURLOPT_ENCODING, "");
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        $data = curl_exec($curl); // false if the request failed
        curl_close($curl);
        return $data;
    }

    function file_get_html2($url){
        $data = curl($url);
        // str_get_html() expects a string, so stop here if the request failed.
        return $data === false ? false : str_get_html($data);
    }
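
    If you cannot switch to cURL, a rough alternative is to check whether the downloaded body is still gzip-compressed and decode it yourself. The sketch below is only an illustration: maybe_gunzip is a hypothetical helper name (it is not part of Simple HTML DOM), and it assumes PHP's zlib extension is available.

    ```php
    <?php
    // Hypothetical helper: decode the body only if it starts with the
    // gzip magic bytes 0x1f 0x8b; otherwise return it untouched.
    function maybe_gunzip(string $body): string
    {
        if (strncmp($body, "\x1f\x8b", 2) !== 0) {
            return $body; // does not look like gzip data
        }
        $decoded = @gzdecode($body);
        // Fall back to the raw body if decoding fails.
        return $decoded !== false ? $decoded : $body;
    }

    // Simulate a compressed response so the example runs offline.
    $html = '<p>hello</p>';
    $compressed = gzencode($html);
    var_dump(maybe_gunzip($compressed) === $html); // compressed input is decoded
    var_dump(maybe_gunzip($html) === $html);       // plain input passes through
    ```

    The result could then be passed to str_get_html() in the same way file_get_html2 does above.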
    
  • 2020-12-18 11:35

    Try using iconv to convert the charset of the scraped text to the charset you use on your page.

    Signature:

    string|false iconv ( string $in_charset , string $out_charset , string $str )
    

    Example:

    echo iconv("ISO-8859-1", "UTF-8", $text);
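
    To see what the conversion actually does to the bytes, here is a small self-contained sketch. It assumes the scraped text is ISO-8859-1; the sample string is made up for illustration.

    ```php
    <?php
    // "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
    $latin1 = "caf\xE9";

    // iconv() returns false on failure, so check before using the result.
    $utf8 = iconv("ISO-8859-1", "UTF-8", $latin1);
    if ($utf8 === false) {
        throw new RuntimeException("iconv could not convert the input");
    }

    // In UTF-8 the same character takes two bytes: 0xC3 0xA9.
    echo bin2hex($latin1), "\n"; // 636166e9
    echo bin2hex($utf8), "\n";   // 636166c3a9
    ```

    Note that the source charset has to be right: if the text is already UTF-8 and you convert it as ISO-8859-1 anyway, it gets double-encoded and é shows up as Ã©.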
    
  • 2020-12-18 11:35

    Check which charset the website uses (for example, by viewing the page info in your browser), then convert the scraped text to UTF-8:

    $text = iconv(mb_detect_encoding($text), "UTF-8//TRANSLIT//IGNORE", $text);
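
    Since mb_detect_encoding can return false or guess wrong, it may be safer to prefer the charset the page itself declares and only fall back to detection. The following is a minimal sketch under stated assumptions: declared_charset and html_to_utf8 are hypothetical helper names, the candidate list is limited to UTF-8 and ISO-8859-1, and the mbstring and iconv extensions are assumed to be loaded.

    ```php
    <?php
    // Hypothetical helper: read the charset declared in a <meta> tag, if any.
    function declared_charset(string $html): ?string
    {
        $pattern = '/<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)/i';
        return preg_match($pattern, $html, $m) ? strtoupper($m[1]) : null;
    }

    // Hypothetical helper: convert a page to UTF-8 using the declared
    // charset, falling back to strict detection over a short candidate list.
    function html_to_utf8(string $html): string
    {
        $from = declared_charset($html);
        if ($from === null) {
            if (mb_check_encoding($html, 'UTF-8')) {
                return $html; // already valid UTF-8
            }
            $detected = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1'], true);
            // Assumption: treat undetectable input as ISO-8859-1.
            $from = $detected !== false ? $detected : 'ISO-8859-1';
        }
        if ($from === 'UTF-8') {
            return $html;
        }
        $converted = @iconv($from, 'UTF-8', $html);
        // Keep the original bytes if the declared charset is unknown to iconv.
        return $converted !== false ? $converted : $html;
    }

    $page = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\xE9</body></html>";
    echo bin2hex(substr(html_to_utf8("caf\xE9"), 3)), "\n"; // c3a9
    ```

    The declared charset can itself be wrong on some sites, so treat this as a starting point rather than a guarantee.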
    