Character Encoding issue with PHP Simple HTML DOM Parser

被撕碎了的回忆 2021-01-07 00:14

I am using PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/ to fetch data like Page Title, Meta Description and Meta Tags from other domains and

3 Answers
  • 2021-01-07 00:52

    If I switch browser encoding to UTF-8, it works.

    So you're simply not setting the correct HTTP header to designate your document to be UTF-8 encoded and the browser is interpreting it in some other encoding. Use:

    header('Content-Type: text/html; charset=utf-8');
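A minimal sketch of sending that header safely. The helper name send_charset_header() is hypothetical, not part of any library; it only wraps PHP's built-in header() and headers_sent(). The point it illustrates is that the header has to go out before any body output, otherwise PHP can no longer change it.

```php
<?php
// Hypothetical helper: declares the response charset before any body output.
// Returns the header line so the caller can log or inspect it.
function send_charset_header(string $charset = 'utf-8'): string
{
    $line = 'Content-Type: text/html; charset=' . $charset;

    // header() has no effect once output has started, so check first.
    if (!headers_sent()) {
        header($line);
    }

    return $line;
}
```

Calling send_charset_header() as the first statement of the script (before any echo or stray whitespace) is the usual place for it.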
    
  • 2021-01-07 00:59

    @deceze and @Shakti thanks for your help.

    +1 for the article link posted by deceze (Handling Unicode Front to Back in a Web App); it is also worth reading Understanding encoding.

    After reading your comments, answer and of course those two articles, I finally solved my issue.

    Here are the steps I have taken so far to solve this issue:

    1. Added header('Content-Type: text/html; charset=utf-8'); at the top of my init.php file.
    2. Changed the CHARACTER SET of the database table field that stores those values to UTF-8.
    3. Set the MySQL connection charset to UTF-8: mysql_set_charset('utf8', $connection_link_id);
    4. Used the htmlentities() function to convert characters: $meta_title = htmlentities(trim($meta_title_raw), ENT_QUOTES, 'UTF-8');
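Steps 3 and 4 could be sketched as follows on current PHP versions. This is an illustration, not the poster's code: the mysql_* functions were removed in PHP 7, so the sketch assumes the mysqli extension instead, and the helper names, connection parameters, and the utf8mb4 charset (MySQL's full 4-byte UTF-8) are all assumptions.

```php
<?php
// Hypothetical helper: opens a mysqli connection and switches it to UTF-8.
// Host, credentials and the 'utf8mb4' charset name are assumptions.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);

    // mysqli counterpart of mysql_set_charset() from step 3.
    if (!$link->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }

    return $link;
}

// Step 4: trim the raw value, then escape it for HTML output as UTF-8.
function escape_meta(string $raw): string
{
    return htmlentities(trim($raw), ENT_QUOTES, 'UTF-8');
}
```

Note that escape_meta() expects its input to already be valid UTF-8; bytes in another encoding need converting first.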

    Now the issue seems to be solved, but I still have to do the following to solve it completely.

    1. Get the charset of the source page, $source_charset.
    2. Convert the string to UTF-8 if it is not already in that encoding. PHP provides iconv() for this (mb_convert_encoding() is an alternative). Example: iconv($source_charset, "UTF-8", $meta_title_raw);
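The conversion step might look like the sketch below. The helper name to_utf8() is hypothetical; it only wraps PHP's iconv(), skips the conversion when the source is already UTF-8, and falls back to the original bytes if iconv() reports a failure (that fallback is a design assumption, not something the post specifies).

```php
<?php
// Hypothetical helper: converts $raw from $sourceCharset to UTF-8.
function to_utf8(string $raw, string $sourceCharset): string
{
    // Nothing to do when the source already declares UTF-8.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $raw;
    }

    // iconv() returns false when the conversion fails.
    $converted = iconv($sourceCharset, 'UTF-8', $raw);

    // Assumption: on failure, keep the original bytes rather than lose data.
    return $converted === false ? $raw : $converted;
}
```

For example, to_utf8("\xE3", 'ISO-8859-2') turns the single Latin-2 byte for "ă" into its two-byte UTF-8 form.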

    To get $source_charset I will probably have to combine several checks, such as the HTTP headers and the meta tag. I found a good answer at Detect encoding
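One way to layer those checks, as a rough sketch: look at the Content-Type response header first, then at a meta tag in the markup, and finally test whether the bytes are valid UTF-8. The function name detect_charset() and its parameters are hypothetical, and the regular expressions are deliberately simple; real pages can defeat them.

```php
<?php
// Hypothetical helper: guesses the charset of a fetched page.
// $contentTypeHeader is the value of the Content-Type response header, if any.
function detect_charset(?string $contentTypeHeader, string $html): ?string
{
    // 1. charset parameter of the HTTP header, e.g. "text/html; charset=ISO-8859-2".
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*["\']?([\w-]+)/i', $contentTypeHeader, $m)) {
        return $m[1];
    }

    // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return $m[1];
    }

    // 3. The /u modifier makes preg_match() fail on invalid UTF-8,
    //    so a match means the bytes are at least well-formed UTF-8.
    if (preg_match('//u', $html) === 1) {
        return 'UTF-8';
    }

    // Unknown: let the caller pick a default.
    return null;
}
```

The header is checked first because it takes precedence over the meta tag when browsers decode a page.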

    Let me know if there are any improvements to make or any faults in my steps above.

  • 2021-01-07 01:03

    I had the same problem with Romanian characters. Nothing worked until I used

    header('Content-Type: text/html; charset=ISO-8859-2'); 
    

    ISO-8859-2 (Latin-2) is the character set for Central and Eastern European languages. So find the character set your content is actually encoded in and declare it in the header.
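A small sketch of why the declared charset has to match the bytes being sent: the same Romanian letter is stored as different bytes in ISO-8859-2 and in UTF-8. The helper name encoded_hex() is hypothetical; it relies only on PHP's iconv() and bin2hex().

```php
<?php
// Hypothetical helper: shows the bytes (as hex) that a UTF-8 string
// becomes when encoded in $charset.
function encoded_hex(string $utf8Text, string $charset): string
{
    $bytes = iconv('UTF-8', $charset, $utf8Text);

    // iconv() returns false if the text cannot be represented in $charset.
    if ($bytes === false) {
        throw new RuntimeException("Cannot encode text as $charset");
    }

    return bin2hex($bytes);
}
```

With "ă" (U+0103) as input, ISO-8859-2 yields the single byte e3 while UTF-8 yields c4 83, so a browser told the wrong charset will decode the bytes into different characters.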
