PHP DOMDocument loadHTML not encoding UTF-8 correctly

梦如初夏 2020-11-22 15:11

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "

        
13 Answers
  • 2020-11-22 15:19

    DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.

    If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
    echo $dom->saveHTML();
    

    If you cannot know whether the string already contains such a declaration, there's a workaround in SmartDOMDocument which should help you:

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
    echo $dom->saveHTML();
    

    This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
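
The `HTML-ENTITIES` target used with `mb_convert_encoding` above is deprecated as of PHP 8.2, so on newer versions the same idea may be better expressed with `mb_encode_numericentity`. The following is only a minimal sketch under that assumption: the helper name `loadUtf8Html` is hypothetical (it is not part of PHP or SmartDOMDocument), and the conversion map is one common choice that turns every code point from U+0080 upward into a numeric entity while leaving the ASCII markup alone.

```php
<?php
// Hypothetical helper for illustration; requires the dom and mbstring extensions.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything from U+0080 to U+10FFFF becomes a decimal numeric entity.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $asciiOnly = mb_encode_numericentity($html, $convmap, 'UTF-8');

    // The input is now pure ASCII, so the ISO-8859-1 default can no longer
    // misinterpret any bytes; the parser decodes the entities itself.
    $dom = new DOMDocument();
    $dom->loadHTML($asciiOnly);

    return $dom;
}

$dom = loadUtf8Html('<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

Reading `textContent` rather than calling `saveHTML()` keeps the check independent of how the serializer chooses to escape non-ASCII characters.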

  • 2020-11-22 15:19

    The only thing that worked for me was the accepted answer of

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
    echo $dom->saveHTML();
    

    However, this introduced a new problem: <?xml encoding="utf-8" ?> now appeared in the output of the document.

    The solution for me was then to do

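    // Copy the live node list first so that removing nodes does not disturb iteration.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node instanceof \DOMProcessingInstruction) {
            $node->parentNode->removeChild($node);
        }
    }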
    

    Some answers suggested that, to remove the XML header, I should call

    $dom->saveXML($dom->documentElement);
    

    This didn't work for me because, for a partial document (e.g. one with two <p> tags), only one of the <p> tags was returned.
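
For the partial-document case described above, one possible approach is to serialize each child of the generated <body> element instead of the whole document, which leaves out both the <?xml ... ?> prefix and the wrapper elements. This is a rough sketch, not a definitive fix: the function name `serializeUtf8Fragment` is made up for illustration, and it assumes the parser wraps the fragment in html/body as it normally does.

```php
<?php
// Hypothetical helper for illustration; requires the dom extension.
function serializeUtf8Fragment(string $fragment): string
{
    $dom = new DOMDocument();
    // Prepend the encoding hint so the fragment is read as UTF-8.
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $fragment);

    // The parser places the fragment inside an implied <body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize every top-level node of the fragment, not just the first one.
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }

    return $html;
}

echo serializeUtf8Fragment('<p>イリノイ州</p><p>シカゴ</p>'), PHP_EOL;
```

Because only the children of <body> are written out, the processing instruction never reaches the output and does not need to be removed separately.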

  • 2020-11-22 15:23

    I am using PHP 7.3.8 on Manjaro and was working with Persian content. This solved my problem:

    $html = 'hi</b><p>سلام<div>の家庭に、9 ☆';
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;
    
  • 2020-11-22 15:24

    You could prefix a line enforcing utf-8 encoding, like this:

    @$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);
    

    And you can then continue with the code you already have, like:

    $doc->saveXML()
    
  • 2020-11-22 15:31

    You can also encode it as shown below (taken from https://davidwalsh.name/domdocument-utf8-problem):

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
    echo $dom->saveHTML();
    
  • 2020-11-22 15:34

    Works fine for me:

    $dom = new \DOMDocument;
    $dom->loadHTML(utf8_decode($html));
    ...
    return utf8_encode($dom->saveHTML());
    