How can I convert HTML character references (such as &#1507; for ף) to regular UTF-8?

野趣味 2021-01-03 02:02

I have some Hebrew websites that contain character references like &#1504;&#1493;&#1507; (which should display as נוף).

I can only view these letters if I save the file

2 Answers
  • 2021-01-03 02:50

    Those are XML Character References. You want to decode them using html_entity_decode():

    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
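
    As a minimal sketch of that call, assuming the word from the question (נוף) is stored as decimal character references and the page is meant to be UTF-8:

    ```php
    <?php
    // The asker's example word, written as decimal character references
    // (an assumption; the original markup may have used hexadecimal).
    $string = '&#1504;&#1493;&#1507;';
    $decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    echo $decoded, "\n";          // נוף
    echo strlen($decoded), "\n";  // 6 (three letters, two UTF-8 bytes each)
    ```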
    

    For more information, you can search Google for the entity in question. See these few examples:

    1. Hebrew Characters
    2. HTML Entities for Hebrew Characters
    3. UTF-8 Encoding Table with HTML entities
  • 2021-01-03 02:51

    Those are character references that refer to characters in ISO 10646 by specifying the code point of the character in decimal (&#n;) or hexadecimal (&#xn;) notation.
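
    To illustrate the two notations, here is a small sketch showing that a decimal and a hexadecimal reference to the same code point decode to the same character. It assumes PHP 7.2 or later with the mbstring extension, because it uses mb_ord() to print the code point:

    ```php
    <?php
    // U+05E3 (ף) written both ways; 0x5E3 in hexadecimal is 1507 in decimal.
    $decimal = html_entity_decode('&#1507;', ENT_QUOTES, 'UTF-8');
    $hex     = html_entity_decode('&#x5E3;', ENT_QUOTES, 'UTF-8');
    var_dump($decimal === $hex);                    // bool(true)
    printf("U+%04X\n", mb_ord($decimal, 'UTF-8'));  // U+05E3
    ```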

    You can use html_entity_decode, which decodes such character references as well as the entity references for entities defined for HTML 4, so other references like &lt;, &gt;, and &amp; will also get decoded:

    $str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
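
    The quote flag decides whether quote entities are touched. A short sketch with a made-up input string, comparing ENT_NOQUOTES with ENT_QUOTES:

    ```php
    <?php
    // Sample input is invented for illustration.
    $str = '&lt;a title=&quot;&#1507;&quot;&gt; &amp;';
    $noQuotes = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
    $quotes   = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
    echo $noQuotes, "\n"; // <a title=&quot;ף&quot;> &
    echo $quotes, "\n";   // <a title="ף"> &
    ```

    With ENT_NOQUOTES the &quot; references survive, which can be useful if the result is going back into an HTML attribute.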
    

    If you just want to decode the numeric character references, you can use this:

    function html_dereference($match) {
        if (strtolower($match[1][0]) === 'x') {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
    }
    $str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);
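
    On newer PHP versions the same numeric-only decoding could be sketched with mb_chr() instead of pack() plus mb_convert_encoding(). This is an alternative, not the code above: the helper name decode_numeric_refs is hypothetical, and it assumes PHP 7.2 or later with the mbstring extension.

    ```php
    <?php
    // Hypothetical helper: decode only numeric character references and
    // leave named entities such as &amp; untouched.
    function decode_numeric_refs(string $str)
    {
        return preg_replace_callback(
            '/&#(x[0-9a-f]+|[0-9]+);/i',
            function (array $match) {
                $ref = $match[1];
                $codepoint = ($ref[0] === 'x' || $ref[0] === 'X')
                    ? intval(substr($ref, 1), 16)
                    : intval($ref, 10);
                $char = mb_chr($codepoint, 'UTF-8');
                // mb_chr() returns false for invalid code points (for example
                // surrogates); keep the original reference text in that case.
                return $char === false ? $match[0] : $char;
            },
            $str
        );
    }

    echo decode_numeric_refs('&#1504;&#x5D5;&#1507; &amp; &lt;'), "\n"; // נוף &amp; &lt;
    ```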
    

    As YuriKolovsky and thirtydot have pointed out in another question, it seems that browser vendors ‘silently’ agreed on a mapping for some character references that differs from the specification and is largely undocumented.

    Some character references that would normally be mapped onto the Latin-1 Supplement block are actually mapped onto different characters. This is because browsers apply the mapping that results from Windows-1252 rather than from ISO 8859-1, on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.
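
    The difference between the two encodings can be seen by converting one byte from that range under each of them. This is only an illustrative sketch (it assumes PHP 7.2 or later with mbstring, for mb_ord()); byte 0x96 corresponds to the reference &#150;:

    ```php
    <?php
    // Byte 0x96 (decimal 150) read under each single-byte encoding.
    $byte   = "\x96";
    $latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
    $cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
    printf("ISO 8859-1:   U+%04X\n", mb_ord($latin1, 'UTF-8')); // U+0096 (a C1 control)
    printf("Windows-1252: U+%04X\n", mb_ord($cp1252, 'UTF-8')); // U+2013 (en dash)
    ```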

    Now here’s an extension to the function mentioned above that handles this quirk:

    function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
        $deref = function($match) use ($encoding, $fixMappingBug) {
            if (strtolower($match[1][0]) === "x") {
                $codepoint = intval(substr($match[1], 1), 16);
            } else {
                $codepoint = intval($match[1], 10);
            }
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
                $mapping = array(
                    8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                    338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                    8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
                $codepoint = $mapping[$codepoint-130];
            }
            return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
        };
        return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
    }
    

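    If anonymous functions are not available (they were introduced in PHP 5.3.0), you could use create_function instead. Note that create_function is deprecated as of PHP 7.2.0 and was removed in PHP 8.0.0, so this only applies to legacy installations: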

    $deref = create_function('$match', '
        $encoding = '.var_export($encoding, true).';
        $fixMappingBug = '.var_export($fixMappingBug, true).';
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    ');
    

    Here’s another function that tries to comply with the behavior of HTML 5:

    function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
        $deref = function($match) use ($flags, $charset) {
            if ($match[1][0] === '#') {
                // $match[1] starts with '#'; an 'x' in second position marks hexadecimal
                if (strtolower($match[1][1]) === 'x') {
                    $codepoint = intval(substr($match[1], 2), 16);
                } else {
                    $codepoint = intval(substr($match[1], 1), 10);
                }
    
                // HTML 5 specific behavior
                // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references
    
                // handle Windows-1252 mismapping
                // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
                // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
                $overrides = array(
                    0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                    0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                    0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                    0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                    0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                    0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
                if (isset($overrides[$codepoint])) {
                    $codepoint = $overrides[$codepoint];
                }
    
                if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                    $codepoint = 0xFFFD;
                }
                if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                    ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                    ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                    ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                    in_array($codepoint, array(
                        0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                        0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                        0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                        0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                        0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                    $codepoint = 0xFFFD;
                }
                return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
            } else {
                return html_entity_decode($match[0], $flags, $charset);
            }   
        };
        return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
    }
    

    I’ve also noticed that PHP 5.4.0 added another flag to html_entity_decode, named ENT_HTML5, for HTML 5 behavior.
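
    One visible effect of that flag is the set of named entities that gets decoded. As a small sketch (PHP 5.4 or later), &apos; is not in the HTML 4.01 entity table but is in the HTML 5 one:

    ```php
    <?php
    $str    = '&apos;';
    $html4  = html_entity_decode($str, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $html5  = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    var_dump($html4); // string(6) "&apos;"
    var_dump($html5); // string(1) "'"
    ```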
