How do I tell DOMDocument->load() what encoding I want it to use?

旧巷少年郎 2021-01-04 10:48

I search for and process XML files from elsewhere, and need to transform them with some XSLTs. No problem. Using PHP5 and the DOM library, everything's a snap. Worked fine,

3 Answers
  • 2021-01-04 11:16

    I haven't found a way to set the default encoding (yet), but recover mode may be workable in this case.
    When libxml encounters an encoding error and no encoding has been explicitly set, it switches from Unicode/UTF-8 to Latin-1 and continues parsing the document, but it sets the wellFormed property in the parser context to 0/false. PHP's DOM extension accepts the document if wellFormed is true or if the DOMDocument object's recover property is true.

    <?php
    // German umlaut ä in Latin-1 = 0xE4
    $xml = '<foo>'.chr(0xE4).'</foo>';
    
    $doc = new DOMDocument;
    $b = $doc->loadXML($xml);
    echo 'with doc->recover=false(default) : ', ($b) ? 'success':'failed', "\n";
    
    $doc = new DOMDocument;
    $doc->recover = true;
    $b = $doc->loadXML($xml);
    echo 'with doc->recover=true : ', ($b) ? 'success':'failed', "\n";
    

    prints

    Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
    Bytes: 0xE4 0x3C 0x2F 0x66 in Entity, line: 1 in test.php on line 6
    with doc->recover=false(default) : failed
    
    Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding !
    Bytes: 0xE4 0x3C 0x2F 0x66 in Entity, line: 1 in  test.php on line 11
    with doc->recover=true : success
    

    You still get the warning message (which can be suppressed with @$doc->load()), and it also shows up in the internal libxml errors (only once, when the parser switches from UTF-8 to Latin-1). The error code for this particular error is 9 (XML_ERR_INVALID_CHAR).

    <?php
    // Two Latin-1 bytes (ä, ö) plus a bare "&", which is a real well-formedness error
    $xml = sprintf('<foo>
        <ae>%s</ae>
        <oe>%s</oe>
        &
    </foo>', chr(0xE4),chr(0xF6));
    
    // Collect libxml errors internally instead of emitting PHP warnings
    libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->recover = true;
    libxml_clear_errors();
    $b = $doc->loadXML($xml);
    $invalidCharFound = false;
    foreach(libxml_get_errors() as $error) {
        // Code 9 is XML_ERR_INVALID_CHAR; tolerate its first occurrence only
        if ( 9===$error->code && !$invalidCharFound ) {
            $invalidCharFound = true;
            echo "found invalid char, possibly harmless\n";
        }
        else {
            echo "hm, that's probably more severe: ", $error->message, "\n";
        }
    }
    
  • 2021-01-04 11:24

    The only way to specify the encoding is in the XML declaration at the start of the file:

    <?xml version="1.0" encoding="ISO-8859-1"?>
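
    If you know which encoding the undeclared files use, one way to act on this is to add the declaration yourself before parsing. A minimal sketch, assuming the input is ISO-8859-1 whenever no declaration is present; loadXmlWithFallbackEncoding() is a hypothetical helper, not part of the DOM extension:

```php
<?php
// Hypothetical helper (not part of the DOM extension): prepends an XML
// declaration naming $assumedEncoding when the input has none, then parses it.
function loadXmlWithFallbackEncoding(string $xml, string $assumedEncoding): ?DOMDocument
{
    // Leave input that already starts with an XML declaration untouched.
    if (preg_match('/^<\?xml\s/', $xml) !== 1) {
        $xml = '<?xml version="1.0" encoding="' . $assumedEncoding . '"?>' . "\n" . $xml;
    }
    $doc = new DOMDocument();
    return $doc->loadXML($xml) ? $doc : null;
}

// 0xE4 is "ä" in ISO-8859-1; without a declaration this byte is invalid UTF-8.
$doc = loadXmlWithFallbackEncoding('<foo>' . chr(0xE4) . '</foo>', 'ISO-8859-1');
// The DOM API hands strings back as UTF-8, whatever the source encoding was.
$text = $doc->documentElement->textContent;
```

    A declaration that is present but omits the encoding attribute is left alone here, so such input is still parsed as UTF-8.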
    
  • 2021-01-04 11:25

    Does this work for you?

    $doc = new DOMDocument('1.0', 'iso-8859-1');
    $doc->load($xmlPath);
    

    Edit: Since it appears that this doesn't work, you could instead do something similar to your existing method, but without the temp file. Read the XML file from your source using standard I/O operations (file_get_contents() or similar), convert the string to UTF-8, which is what loadXML() assumes when there is no encoding declaration (iconv(), or utf8_encode() on older PHP versions), and then use loadXML():

    // Assumes the file is ISO-8859-1 and carries no encoding declaration
    $myXMLString = file_get_contents($xmlPath);
    $myXMLString = iconv('ISO-8859-1', 'UTF-8', $myXMLString);
    $doc = new DOMDocument();
    $doc->loadXML($myXMLString);
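
    If only some of the files are mis-encoded, the conversion can be made conditional. This is a rough sketch, assuming that anything that is not valid UTF-8 is ISO-8859-1 and that the input has no conflicting encoding declaration; toUtf8() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns $xml as UTF-8, treating any input that is not
// valid UTF-8 as $fallbackEncoding (an assumption about the data source).
function toUtf8(string $xml, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // With the /u modifier, preg_match() fails on subjects that are not valid UTF-8.
    if (preg_match('//u', $xml) === 1) {
        return $xml;
    }
    $converted = iconv($fallbackEncoding, 'UTF-8', $xml);
    if ($converted === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }
    return $converted;
}

// 0xE4 is "ä" in ISO-8859-1 and is not valid UTF-8 on its own.
$doc = new DOMDocument();
$ok = $doc->loadXML(toUtf8('<foo>' . chr(0xE4) . '</foo>'));
$text = $doc->documentElement->textContent;
```

    Input that is already valid UTF-8 passes through unchanged, so the same call can be used for every file.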
    