Remove non-utf8 characters from string

心在旅途 2020-11-22 11:56

I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

18 Answers
  • 2020-11-22 12:44

    From a recent patch to Drupal's Feeds JSON parser module:

    //remove everything except valid letters (from any language)
    $raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);
    

    In case you're wondering: yes, it retains spaces as valid characters.

    It did what I needed: it removes the emoji characters, widespread nowadays, that don't fit into MySQL's 'utf8' character set and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

    For details see https://www.drupal.org/node/1824506#comment-6881382
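
    As a rough illustration of the MySQL side of this: the legacy 'utf8' character set stores at most three bytes per character, so the characters it rejects are those above U+FFFF (most emoji among them). The sketch below is not the Drupal patch. It is a minimal stand-alone example, with a hypothetical function name, that strips only those characters and assumes the input is already valid UTF-8.

    ```php
    <?php
    // Hypothetical helper, not part of Drupal or the Feeds module.
    function strip_supplementary_chars(string $text): string
    {
        // The /u modifier makes PCRE treat pattern and subject as UTF-8.
        // U+10000..U+10FFFF are the code points that need 4 bytes in UTF-8.
        $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
        if ($result === null) {
            // preg_replace() returns null on error, e.g. malformed UTF-8 input.
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return $result;
    }
    ```

    Characters in the Basic Multilingual Plane (accented letters, CJK text and so on) pass through unchanged, so this is narrower than the letter-based filter above.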

  • 2020-11-22 12:48

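    Slightly different from the question, but what I do is use HtmlEncode(string).

    Pseudocode:

    // Assumes C#: WebUtility is in System.Net, Regex in System.Text.RegularExpressions.
    var encoded = WebUtility.HtmlEncode(input);
    // Drop the numeric character references produced by the encoder.
    encoded = Regex.Replace(encoded, @"&#\d+?;", "");
    var result = WebUtility.HtmlDecode(encoded);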
    

    Input and output:

    "Headlight\x007E Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢?"
    "Headlight~ Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢?"
    

    I know it's not perfect, but it does the job for me.

  • 2020-11-22 12:50
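    // Match only well-formed UTF-8 sequences; in x mode, "#" starts a regex comment.
    static $preg = <<<'END'
    %(
    [\x09\x0A\x0D\x20-\x7E]              # ASCII (tab, LF, CR, printable)
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs
    END;
    // Keep the matched sequences and discard every byte in between.
    if (preg_match_all($preg, $string, $match)) {
        $string = implode('', $match[0]);
    } else {
        $string = '';
    }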
    

    It works on our service.

  • 2020-11-22 12:56

    Try this:

    $string = iconv("UTF-8","UTF-8//IGNORE",$string);
    

    According to the iconv manual, the function will take the first parameter as the input charset, second parameter as the output charset, and the third as the actual input string.

    If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function drops (strips) all characters in the input string that can't be represented in the output charset, which in effect filters the input string.
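
    A minimal sketch of wrapping that call, under the assumption (worth verifying on your own platform) that some PHP and iconv library combinations raise a notice and return false on illegal input even with //IGNORE. The function name is hypothetical.

    ```php
    <?php
    // Hypothetical wrapper around the iconv() call shown above.
    function iconv_strip_invalid(string $input): string
    {
        // @ silences the notice iconv() may emit for illegal input;
        // the return value is checked explicitly instead.
        $output = @iconv('UTF-8', 'UTF-8//IGNORE', $input);
        if ($output === false) {
            throw new RuntimeException('iconv could not convert the input');
        }
        return $output;
    }
    ```

    Checking for false matters because a failed conversion would otherwise silently turn the whole string into an empty value further down the line.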

  • 2020-11-22 12:56

    UConverter is available since PHP 5.5. It is the better choice if you use the intl extension and don't use mbstring.

    function replace_invalid_byte_sequence($str)
    {
        return UConverter::transcode($str, 'UTF-8', 'UTF-8');
    }
    
    function replace_invalid_byte_sequence2($str)
    {
        return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
    }
    

    htmlspecialchars can be used to replace invalid byte sequences since PHP 5.4 (with the ENT_SUBSTITUTE flag). It handles large inputs better than preg_match and is more accurate; many incorrect regular-expression implementations are in circulation.

    function replace_invalid_byte_sequence3($str)
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
    }
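
    Note that ENT_SUBSTITUTE replaces each invalid sequence with U+FFFD (the Unicode replacement character) rather than deleting it. If the goal is to remove the bytes, as in the question, one option is to strip U+FFFD afterwards. A minimal sketch with a hypothetical function name; it also removes any U+FFFD that was legitimately present in the input:

    ```php
    <?php
    // Hypothetical helper built on the htmlspecialchars() round trip above.
    function remove_invalid_byte_sequence(string $str): string
    {
        // Step 1: invalid sequences become U+FFFD; the decode undoes the escaping.
        $substituted = htmlspecialchars_decode(
            htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8')
        );
        // Step 2: drop every replacement character.
        return str_replace("\u{FFFD}", '', $substituted);
    }
    ```

    Applied to the bytes from the question, "\x97alo", this returns "alo".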
    
  • 2020-11-22 12:57

    Maybe not the most precise solution, but it gets the job done with a single line of code:

    echo str_replace("?","",(utf8_decode($str)));
    

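    utf8_decode converts the string to ISO-8859-1 and replaces anything it cannot represent, invalid bytes included, with a question mark;
    str_replace then strips the question marks, including any that were legitimately in the input.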
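
    If mbstring is available, a variant that drops invalid bytes while leaving literal question marks and non-Latin-1 characters alone might look like the sketch below. It swaps in mb_convert_encoding for utf8_decode (which is deprecated as of PHP 8.2); the function name is hypothetical.

    ```php
    <?php
    // Hypothetical helper; requires the mbstring extension.
    function strip_invalid_utf8(string $input): string
    {
        // Remember the current substitute character so it can be restored.
        $previous = mb_substitute_character();
        // 'none' tells mbstring to drop invalid bytes instead of writing '?'.
        mb_substitute_character('none');
        try {
            return mb_convert_encoding($input, 'UTF-8', 'UTF-8');
        } finally {
            mb_substitute_character($previous);
        }
    }
    ```

    Unlike the one-liner above, the result is still UTF-8, so text such as "中文呢?" comes back unchanged.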
