Does PHP lack a function for XML-safe entity decoding? Is there no xml_entity_decode?

旧时难觅i 2020-12-17 02:20

THE PROBLEM: I need an XML file "fully encoded" in UTF-8; that is, with no entities representing symbols and all symbols encoded as UTF-8, except the only 3 ones that are

6 answers
  • 2020-12-17 02:44

    I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. Sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.

    Note: I'm assuming default encoding is UTF-8

    // Search for named entities (strings like "&abc1;").
    echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
    
    /* <Foo>€&amp;foo Ç</Foo> */
    

    Also, if you want to replace special characters with numeric entities (in case you don't want UTF-8 XML), you can easily add one function call to the code above:

    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
    
    echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
    
    /* <Foo>&#8364;&amp;foo &#199;</Foo> */
    

    In your case you want it the other way around: decode numeric entities into UTF-8 characters.

    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
    
    // Decodes any remaining (uncaught) numeric entities into UTF-8 characters.
    echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
    
    /* <Foo>€&amp;foo Ç</Foo> */
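
    The conversion map [0x80, 0xffff, 0, 0xffff] used above only covers code points up to U+FFFF, so numeric entities for characters outside the Basic Multilingual Plane (emoji, for example) would not be decoded. A minimal sketch, assuming the mbstring extension is loaded and the input is UTF-8; the wider map and the sample string are illustrative assumptions, not part of the original answer:

    ```php
    <?php
    // Conversion map layout: [first code point, last code point, offset, mask].
    // Assumption: the range is widened to 0x10ffff so that entities beyond
    // U+FFFF are decoded as well.
    $wideMap = [0x80, 0x10ffff, 0, 0x1fffff];

    // &#8364; is the euro sign, &#128512; is an emoji outside the BMP.
    $input = '<Foo>&#8364; &#128512; &amp;</Foo>';

    // Named entities such as &amp; are not numeric, so they are left alone.
    $decoded = mb_decode_numericentity($input, $wideMap, 'UTF-8');

    echo $decoded, "\n";
    ```

    The same widened map should work with mb_encode_numericentity() if you need the opposite direction.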
    

    Benchmark

    I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.

    <Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>
    

    Your method

    php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    <Foo>€&amp;foo Ç é &amp; ∬</Foo>
    =====
    Time taken: 2.0397531986237
    

    My method

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    <Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
    =====
    Time taken: 4.045273065567
    

    My method (with unicode to numbered entity):

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    <Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
    =====
    Time taken: 5.4407880306244
    

    My method (with numbered entity to unicode):

    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    
    <Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
    =====
    Time taken: 5.5400078296661
    
  • 2020-12-17 02:46

    For those who came here because a numeric entity in the range 128 to 159 stays a numeric entity instead of being converted to a character:

    echo xml_entity_decode('&#128;');
    // Outputs &#128; instead of the expected €
    

    This depends on the PHP version (at least for PHP >= 5.6 the entity remains) and on the affected characters. The reason is that code points 128 to 159 are not printable characters in Unicode (they are C1 control characters), so the entity is left alone. This can happen if the data to be converted mixes in windows-1252 content (where &#128; is meant as the € sign).
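
    If you know the stray entities really are Windows-1252 byte values, one option is to convert just that range yourself. A minimal sketch, assuming the mbstring extension is available; decode_cp1252_numeric_entities() is a hypothetical helper name, not a PHP built-in:

    ```php
    <?php
    // Hypothetical helper: treat &#128; .. &#159; as Windows-1252 byte values
    // and replace them with the UTF-8 characters they were meant to be.
    function decode_cp1252_numeric_entities(string $s): string
    {
        // These five byte values are unassigned in Windows-1252; leave them as-is.
        static $unassigned = [129, 141, 143, 144, 157];

        return preg_replace_callback(
            '/&#(12[89]|1[3-5][0-9]);/', // decimal 128..159 only
            function (array $m) use ($unassigned): string {
                $code = (int) $m[1];
                if (in_array($code, $unassigned, true)) {
                    return $m[0];
                }
                return mb_convert_encoding(chr($code), 'UTF-8', 'Windows-1252');
            },
            $s
        );
    }

    echo decode_cp1252_numeric_entities('Price: 10 &#128;'), "\n";
    ```

    Entities outside that range, such as &#233;, are not touched, so this can run before or after the general decoding step.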

  • 2020-12-17 02:47

    From time to time this question attracts a "false answer" (see the answers). Perhaps this is because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

    ... So, let's repeat my workaround (which is NOT an answer!) to avoid creating more confusion:

    The best workaround

    Pay attention:

    1. The function xml_entity_decode() below is the best workaround (better than any other).
    2. The function below is not an answer to the present question; it is only a workaround.

      function xml_entity_decode($s) {
        // illustrating how a (hypothetical) PHP built-in function MUST work
        static $XENTITIES = array('&amp;','&gt;','&lt;');
        static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
        $s = str_replace($XENTITIES,$XSAFENTITIES,$s); // protect the 3 XML-reserved entities
        $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // ENT_HTML5 needs PHP 5.4+
        $s = str_replace($XSAFENTITIES,$XENTITIES,$s); // restore them
        return $s;
      }
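
    The placeholder swap in the workaround will also rewrite a literal "#_x_amp#;" that happens to be in the input. A minimal sketch of a variant that decodes each entity on its own and therefore needs no placeholders; xml_entity_decode_no_placeholder() is a hypothetical name, and this regex-callback version is likely slower than the str_replace() version:

    ```php
    <?php
    // Hypothetical variant: decode entity by entity, no placeholder strings.
    function xml_entity_decode_no_placeholder(string $s): string
    {
        return preg_replace_callback(
            '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
            function (array $m): string {
                $decoded = html_entity_decode($m[0], ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
                if ($decoded === $m[0]) {
                    // Unknown entity (or a quote entity): keep it as written.
                    return $m[0];
                }
                // Re-escape &, < and > so the result stays well-formed XML;
                // every other character passes through as UTF-8.
                return htmlspecialchars($decoded, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
            },
            $s
        );
    }

    echo xml_entity_decode_no_placeholder(
        '<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>'
    ), "\n";
    ```

    Because reserved characters are re-escaped after decoding, a numeric reference such as &#60; comes out as &lt; rather than a bare "<".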
    

    To test, and to demonstrate that you have a better solution, please first run this simple benchmark:

      $countBchMk_MAX = 1000;
      $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
      $start_time = microtime(true);
      for ($countBchMk = 0; $countBchMk < $countBchMk_MAX; $countBchMk++) {

        $A = xml_entity_decode($xml); // 0.0002 seconds per iteration

        /* 0.0014 seconds per iteration
         $doc = new DOMDocument;
         $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
         $doc->encoding = 'UTF-8';
         $A = $doc->saveXML();
        */

      }
      $end_time = microtime(true);
      // Print the average time per iteration.
      echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
         ($end_time - $start_time) / $countBchMk_MAX,
         " seconds</h1>";
      
    
  • 2020-12-17 02:47
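    // Class method excerpt: relies on $this->charset and a global is_php() helper, neither of which is shown here.
    public function entity_decode($str, $charset = NULL)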
    {
        if (strpos($str, '&') === FALSE)
        {
            return $str;
        }
    
        static $_entities;
    
        isset($charset) OR $charset = $this->charset;
        $flag = is_php('5.4')
            ? ENT_COMPAT | ENT_HTML5
            : ENT_COMPAT;
    
        do
        {
            $str_compare = $str;
    
            // Decode standard entities, avoiding false positives
            if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
            {
                if ( ! isset($_entities))
                {
                    $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));
    
                    // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                    // entities to the array manually
                    if ($flag === ENT_COMPAT)
                    {
                        $_entities[':'] = '&colon;';
                        $_entities['('] = '&lpar;';
                        $_entities[')'] = '&rpar;';
                        $_entities["\n"] = '&newline;';
                        $_entities["\t"] = '&tab;';
                    }
                }
    
                $replace = array();
                $matches = array_unique(array_map('strtolower', $matches[0]));
                // array_unique() preserves keys, so iterate with foreach rather than by index.
                foreach ($matches as $match)
                {
                    if (($char = array_search($match.';', $_entities, TRUE)) !== FALSE)
                    {
                        $replace[$match] = $char;
                    }
                }
    
                $str = str_ireplace(array_keys($replace), array_values($replace), $str);
            }
    
            // Decode numeric & UTF16 two byte entities
            $str = html_entity_decode(
                preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
                $flag,
                $charset
            );
        }
        while ($str_compare !== $str);
        return $str;
    }
    
  • 2020-12-17 02:52

    Try this function:

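    function xmlsafe($s, $intoQuotes = 1) {
        if ($intoQuotes) {
            // Attribute context: escape the four characters that are unsafe inside a double-quoted attribute.
            return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
        }
        // Text context: decode HTML entities first, then re-escape only &, > and <.
        return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
    }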
    

    Example usage:

    echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';
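
    For the attribute case, PHP's built-in htmlspecialchars() with the ENT_XML1 flag performs the same four replacements. A minimal sketch, assuming the input is plain UTF-8 text with no entities that need decoding first; the sample string is made up for illustration:

    ```php
    <?php
    // Assumption: $description is raw UTF-8 text, not already-escaped markup.
    $description = 'Tom & "Jerry" <3';

    // ENT_XML1 selects the XML entity set; ENT_COMPAT also escapes double
    // quotes, which is what a double-quoted attribute value needs.
    $escaped = htmlspecialchars($description, ENT_XML1 | ENT_COMPAT, 'UTF-8');

    echo '<k description="' . $escaped . '"/>', "\n";
    // prints <k description="Tom &amp; &quot;Jerry&quot; &lt;3"/>
    ```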
    

    See also: https://stackoverflow.com/a/9446666/2312709

    This code is used in production, and no problems with UTF-8 seem to have occurred.

    0 讨论(0)
  • 2020-12-17 02:58

    Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:

    $doc = new DOMDocument;
    $doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
    $doc->encoding = 'UTF-8';
    $doc->save($outputFile);
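
    To try the idea without a JATS file at hand, the same flags can be exercised on a string with an internal DTD subset. A minimal sketch, assuming the dom extension is enabled; the two entity declarations are written by hand here purely for illustration, whereas a real JATS document would pull them from its DTD:

    ```php
    <?php
    // Illustrative input: named entities declared in an internal DTD subset.
    $xml = <<<'XML'
    <?xml version="1.0"?>
    <!DOCTYPE Foo [
      <!ENTITY euro "&#8364;">
      <!ENTITY Ccedil "&#199;">
    ]>
    <Foo>&euro;&amp;foo &Ccedil;</Foo>
    XML;

    $doc = new DOMDocument();
    // LIBXML_NOENT substitutes declared entities while parsing.
    $doc->loadXML($xml, LIBXML_NOENT);
    // Serialize non-ASCII characters as literal UTF-8 instead of references.
    $doc->encoding = 'UTF-8';
    $out = $doc->saveXML();

    echo $out;
    ```

    The reserved &amp; is kept escaped in the output because the serializer always escapes it in text nodes.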
    