How to remove multiple UTF-8 BOM sequences

故里飘歌 2020-11-22 10:28

I'm using PHP5 (CGI) to output template files from the filesystem and am having issues spitting out raw HTML.

private function fetch($name) {
    $path = $this->         


        
11 Answers
  • 2020-11-22 11:24

    A solution without the pack() function (it removes a single leading BOM):

    $a = "\xEF\xBB\xBF1"; // 3-byte BOM followed by "1"
    var_dump($a); // string(4) "1" (the BOM is invisible)
    
    function deleteBom($text)
    {
        return preg_replace("/^\xEF\xBB\xBF/", '', $text);
    }
    
    var_dump(deleteBom($a)); // string(1) "1"
    
  • 2020-11-22 11:25

    If you are reading from an API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(). Sometimes the value returned from file_get_contents has an extraneous BOM that is almost invisible when you inspect the string but makes json_last_error() return JSON_ERROR_SYNTAX (4).

    >>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
    => "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
    >>> json_decode($json);
    => null
    >>>
    

    In this case, check the first 3 bytes. Echoing them is not very useful because the BOM is invisible in most environments:

    >>> substr($json, 0, 3)
    => "  "
    >>> substr($json, 0, 3) == pack('H*','EFBBBF');
    => true
    >>>
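
    Since echoing the bytes shows nothing useful, a hex dump of the prefix can make the BOM visible. This is a minimal sketch, assuming a hypothetical payload built inline rather than fetched from the API:

    ```php
    <?php
    // Hypothetical payload: a UTF-8 BOM followed by a small JSON document.
    $json = "\xEF\xBB\xBF" . '{"ok":true}';

    // bin2hex() renders each byte as two hex digits, so the BOM becomes visible.
    $prefix = bin2hex(substr($json, 0, 3));
    echo $prefix, PHP_EOL; // efbbbf

    // strncmp() compares only the first 3 bytes and returns 0 when they match.
    $hasBom = strncmp($json, "\xEF\xBB\xBF", 3) === 0;
    var_dump($hasBom); // bool(true)
    ```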
    

    If the comparison above returns true, a simple test may fix the problem:

    >>> json_decode($json[0] == "{" ? $json : substr($json, 3))
    => {#204
         +"orgao": [
           {#203
             +"Nome": "Tribunal de Justiça",
             +"ID_Orgao": "59",
             +"Condicao": "1",
           },
         ],
         ...
       }
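
    The first-character test above assumes the payload is a JSON object: a BOM-free array such as [1,2] would lose its first three bytes. A slightly more defensive variant might strip only actual BOM bytes before decoding. The helper name decodeJsonIgnoringBom is hypothetical, and this is a sketch rather than a definitive fix:

    ```php
    <?php
    /**
     * Hypothetical helper: drop any leading UTF-8 BOMs, then decode.
     * Throws if the remaining text is still not valid JSON.
     */
    function decodeJsonIgnoringBom($json)
    {
        // Remove BOMs one at a time while the string still starts with one.
        while (strncmp($json, "\xEF\xBB\xBF", 3) === 0) {
            $json = substr($json, 3);
        }

        $data = json_decode($json);
        if (json_last_error() !== JSON_ERROR_NONE) {
            throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
        }
        return $data;
    }

    $withBom = decodeJsonIgnoringBom("\xEF\xBB\xBF" . '{"ID_Orgao":"59"}');
    echo $withBom->ID_Orgao, PHP_EOL; // 59

    $noBom = decodeJsonIgnoringBom('[1,2]');
    echo count($noBom), PHP_EOL; // 2
    ```

    Checking json_last_error() rather than testing for NULL also keeps a legitimate JSON null from being mistaken for a failure.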
    
  • 2020-11-22 11:28

    In PHP, the single-quoted b'\xef\xbb\xbf' is the literal 12-character string \xef\xbb\xbf, backslashes included, because single quotes do not interpret \x escapes. To check for a BOM, use double quotes so that the \x sequences are actually interpreted as bytes:

    "\xef\xbb\xbf"
    
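
    The difference between the two quoting styles is easy to confirm by comparing lengths. A short illustration (the variable names are only for this example):

    ```php
    <?php
    // Single quotes: PHP keeps the backslashes, giving 12 literal characters.
    $literal = '\xef\xbb\xbf';
    // Double quotes: each \x escape becomes one byte, giving the 3-byte BOM.
    $bom = "\xef\xbb\xbf";

    echo strlen($literal), PHP_EOL; // 12
    echo strlen($bom), PHP_EOL;     // 3
    echo bin2hex($bom), PHP_EOL;    // efbbbf
    ```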

    Your files also seem to contain a lot more garbage than just a single leading BOM:

    $ curl http://ircb.in/jisti/ | xxd
    
    0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef  ................
    0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068  .....<!DOCTYPE h
    0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561  tml>.<html>.<hea
    ...
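
    The dump shows seven consecutive BOMs (21 bytes) before the doctype, so removing a single one is not enough. One possible approach, sketched here with a hypothetical helper name, is to let the regex match one or more BOMs anchored at the start of the string:

    ```php
    <?php
    /**
     * Hypothetical helper: remove every BOM at the start of $text.
     * PCRE itself interprets \xEF\xBB\xBF as bytes, so single quotes are fine here.
     */
    function stripLeadingBoms($text)
    {
        return preg_replace('/^(?:\xEF\xBB\xBF)+/', '', $text);
    }

    // Seven BOMs in front of the markup, mirroring the hex dump.
    $page  = str_repeat("\xEF\xBB\xBF", 7) . "<!DOCTYPE html>\n<html>";
    $clean = stripLeadingBoms($page);

    echo strlen($page) - strlen($clean), PHP_EOL; // 21
    echo substr($clean, 0, 15), PHP_EOL;          // <!DOCTYPE html>
    ```

    If the BOMs come from several template files being concatenated, they may also appear in the middle of the output. In that case str_replace("\xEF\xBB\xBF", '', $text) would remove every occurrence, at the cost of also dropping any intentional U+FEFF characters.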
    
  • 2020-11-22 11:28

    This might help. Let me know if you would like me to expand on my thought process.

    <?php
        //
        // labeled TESTINGSTRIPZ.php
        //
    
        define('CHARSET', 'UTF-8');
    
        $stringy = "\xef\xbb\xbf\"quoted text\" ";
        $str_find_array    = array( "\xef\xbb\xbf");
        $str_replace_array = array(             '');
    
    
        $RESULT =
            trim(
                mb_convert_encoding(
    
                    str_replace(
                        $str_find_array,
                        $str_replace_array,
                        strip_tags( $stringy )
                        ),
    
                    'UTF-8',
    
                    mb_detect_encoding(
                        strip_tags($stringy)
                        )
    
                    )
                );
    
            print("YOUR RESULT IS: " . $RESULT.PHP_EOL);
    
    ?>
    

    Result:

    terminal$ php TESTINGSTRIPZ.php 
          YOUR RESULT IS: "quoted text" // < with no hidden char.
    
  • 2020-11-22 11:29

    This global function normalizes a string for a system whose base charset is UTF-8. Thanks!

    function prepareCharset($str) {
    
        // set default encode
        mb_internal_encoding('UTF-8');
    
        // pre filter
        if (empty($str)) {
            return $str;
        }
    
        // get charset
        $charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
    
        if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
            $str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
        } else {
            $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
        }
    
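        // strip the control character U+0081 (URL-encoded as %C2%81);
        // note that a UTF-8 BOM would URL-encode as %EF%BB%BF instead
        $str = urldecode(str_replace("%C2%81", '', urlencode($str)));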
    
        // prepare string
        return $str;
    }
    