Remove non-ASCII characters from string

遥遥无期 2020-11-28 03:39

I'm getting strange characters when pulling data from a website:

Â

How can I remove anything that isn't a standard (non-extended) ASCII character?

8 Answers
  • 2020-11-28 04:13

    I think the best way to do this is with the ord() function, which lets you decide byte by byte which characters to keep, so you can retain letters from any single-byte charset. Just remember to check the ord() values of your own text first. This will not work on multibyte encodings such as UTF-8, where one character spans several bytes.

    // Assumes $name is encoded in a single-byte Greek charset (ISO-8859-7), not UTF-8.
    $name = 'βγδεζηΘKgfgebhjrf!@#$%^&';

    // Removes everything except Greek and English letters (Greek ISO charset).
    function replace_characters($string)
    {
        $new_string = ''; // initialize, otherwise the first append raises a notice
        $str_length = strlen($string);
        for ($x = 0; $x < $str_length; $x++) {
            $character = $string[$x];
            $code = ord($character);
            if (($code > 64 && $code < 91)       // A-Z
                || ($code > 96 && $code < 123)   // a-z
                || ($code > 192 && $code < 210)  // Greek letter ranges
                || ($code > 210 && $code < 218)
                || ($code > 219 && $code < 250)
                || $code == 252
                || $code == 254
            ) {
                $new_string .= $character;
            }
        }
        return $new_string;
    }

    $name = replace_characters($name);

    echo $name;
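
    For the case in the question (keeping only plain ASCII), the same ord() idea reduces to a single range check. A minimal sketch, assuming printable ASCII (codes 32 to 126) is what should survive; keep_printable_ascii is a hypothetical name, not part of the answer above:

```php
<?php
// Hypothetical helper: keeps only printable ASCII bytes (codes 32-126).
// It works byte by byte, so every byte of a multibyte UTF-8 character is dropped.
function keep_printable_ascii(string $string): string
{
    $result = '';
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($string[$i]);
        if ($code >= 32 && $code <= 126) {
            $result .= $string[$i];
        }
    }
    return $result;
}

// "\xC2\xA0" is a UTF-8 non-breaking space, a common source of a stray "Â".
echo keep_printable_ascii("Price:\xC2\xA0100"), "\n"; // prints "Price:100"
```

    Note that this also drops tabs and newlines; widen the range check if those should be kept.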
    
  • 2020-11-28 04:14

    $clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
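
    FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, and it also strips tags. A sketch of the same high-byte stripping with FILTER_UNSAFE_RAW, which accepts the same flag. The sample value of $rawstring is made up, and adding FILTER_FLAG_STRIP_LOW (to drop control characters as well) is an assumption about what you want:

```php
<?php
// Sample input: "caf" + UTF-8 "é" (two bytes) + a tab + "ok".
$rawstring = "caf\xC3\xA9\tok";

// STRIP_HIGH drops bytes above 127; STRIP_LOW drops bytes below 32.
$clearstring = filter_var(
    $rawstring,
    FILTER_UNSAFE_RAW,
    FILTER_FLAG_STRIP_HIGH | FILTER_FLAG_STRIP_LOW
);

echo $clearstring, "\n"; // prints "cafok"
```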

  • 2020-11-28 04:16

    Do you want only printable ASCII characters?

    Use this:

    <?php
    header('Content-Type: text/html; charset=UTF-8');
    $str = "abqwrešđčžsff";
    $res = preg_replace('/[^\x20-\x7E]/','', $str);
    echo "($str)($res)";
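
    One detail about the pattern above: without the /u modifier it works on bytes, so a multibyte UTF-8 character counts as several matches. That makes no difference when deleting, but it does when substituting a placeholder. A small sketch; the '?' placeholder and the sample string are assumptions for illustration:

```php
<?php
// "é" is written as its two UTF-8 bytes so the example does not depend
// on the encoding of the source file.
$str = "caf\xC3\xA9";

// Byte mode: each of the two bytes of "é" is replaced separately.
$byBytes = preg_replace('/[^\x20-\x7E]/', '?', $str);

// UTF-8 mode (/u): "é" is one code point, so there is one replacement.
// With /u, preg_replace() returns null if the input is not valid UTF-8.
$byChars = preg_replace('/[^\x20-\x7E]/u', '?', $str);

echo $byBytes, "\n"; // prints "caf??"
echo $byChars, "\n"; // prints "caf?"
```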
    

    Or even better, convert your input to UTF-8 and use the phputf8 library to transliterate non-ASCII characters into their closest ASCII representation:

    require_once('libs/utf8/utf8.php');
    require_once('libs/utf8/utils/bad.php');
    require_once('libs/utf8/utils/validation.php');
    require_once('libs/utf8_to_ascii/utf8_to_ascii.php');
    
    if(!utf8_is_valid($str))
    {
      $str=utf8_bad_strip($str);
    }
    
    $str = utf8_to_ascii($str, '' );
    
  • 2020-11-28 04:16

    I just had to add this header, so that the page is served as UTF-8:

    header('Content-Type: text/html; charset=UTF-8');
    
  • 2020-11-28 04:19

    Somewhat related: we had a web application that had to send data to a legacy system that could only handle 7-bit ASCII (the 128 characters of the basic ASCII set).

    The solution we needed was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.

    Normally I would do something like this:

    <?php
    // transliterate
    if (function_exists('iconv')) {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }
    ?>
    

    ... but that replaces everything that can't be transliterated with a question mark (?).

    So we ended up doing the following. At the end of the function there is a (commented-out) PHP regex that simply strips out non-ASCII characters.

    <?php
    public function cleanNonAsciiCharactersInString($orig_text) {
    
        $text = $orig_text;
    
        // Single letters
        $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
        $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
        $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
        $text = preg_replace("/[βвВ]/u",            "B", $text);
        $text = preg_replace("/[çς©с]/u",            "c", $text);
        $text = preg_replace("/[ÇС]/u",              "C", $text);        
        $text = preg_replace("/[δ]/u",             "d", $text);
        $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
        $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
        $text = preg_replace("/[₣]/u",               "F", $text);
        $text = preg_replace("/[НнЊњ]/u",           "H", $text);
        $text = preg_replace("/[ђћЋ]/u",            "h", $text);
        $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
        $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
        $text = preg_replace("/[Јј]/u",             "j", $text);
        $text = preg_replace("/[ΚЌК]/u",            'K', $text);
        $text = preg_replace("/[ќк]/u",             'k', $text);
        $text = preg_replace("/[ℓ∟]/u",             'l', $text);
        $text = preg_replace("/[Мм]/u",             "M", $text);
        $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
        $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
        $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
        $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
        $text = preg_replace("/[ρφрРф]/u",          "p", $text);
        $text = preg_replace("/[®яЯ]/u",              "R", $text); 
        $text = preg_replace("/[ГЃгѓ]/u",              "r", $text); 
        $text = preg_replace("/[Ѕ]/u",              "S", $text);
        $text = preg_replace("/[ѕ]/u",              "s", $text);
        $text = preg_replace("/[Тт]/u",              "T", $text);
        $text = preg_replace("/[τ†‡]/u",              "t", $text);
        $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
        $text = preg_replace("/[√]/u",               "v", $text);
        $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
        $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
        $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
        $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
        $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
        $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
        $text = preg_replace("/[ζ]/u",              "Z", $text);
    
        // Punctuation
        $text = preg_replace("/[‚‚]/u", ",", $text);        
        $text = preg_replace("/[`‛′’‘]/u", "'", $text);
        $text = preg_replace("/[″“”«»„]/u", '"', $text);
        $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
        $text = preg_replace("/[  ]/u", ' ', $text);
    
        $text = str_replace("…", "...", $text);
        $text = str_replace("≠", "!=", $text);
        $text = str_replace("≤", "<=", $text);
        $text = str_replace("≥", ">=", $text);
        $text = preg_replace("/[‗≈≡]/u", "=", $text);
    
    
        // Exciting combinations    
        $text = preg_replace("/[ыЫ]/u", "bl", $text); // character class: str_replace would only match the two-letter sequence
        $text = str_replace("℅", "c/o", $text);
        $text = str_replace("₧", "Pts", $text);
        $text = str_replace("™", "tm", $text);
        $text = str_replace("№", "No", $text);        
        $text = str_replace("Ч", "4", $text);                
        $text = str_replace("‰", "%", $text);
        $text = preg_replace("/[∙•]/u", "*", $text);
        $text = str_replace("‹", "<", $text);
        $text = str_replace("›", ">", $text);
        $text = str_replace("‼", "!!", $text);
        $text = str_replace("⁄", "/", $text);
        $text = str_replace("∕", "/", $text);
        $text = str_replace("⅞", "7/8", $text);
        $text = str_replace("⅝", "5/8", $text);
        $text = str_replace("⅜", "3/8", $text);
        $text = str_replace("⅛", "1/8", $text);        
        $text = preg_replace("/[‰]/u", "%", $text);
        $text = preg_replace("/[Љљ]/u", "Ab", $text);
        $text = preg_replace("/[Юю]/u", "IO", $text);
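        $text = str_replace("\u{FB01}", "fi", $text); // fi ligature (escape syntax needs PHP 7+)
        $text = str_replace("\u{FB02}", "fl", $text); // fl ligature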
        $text = preg_replace("/[зЗ]/u", "3", $text); 
        $text = str_replace("£", "(pounds)", $text);
        $text = str_replace("₤", "(lira)", $text);
        $text = preg_replace("/[‰]/u", "%", $text);
        $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
        $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);
    
    
        //2) Translation CP1252.
        $trans = get_html_translation_table(HTML_ENTITIES);
        $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
        $trans['-'] = array(
            '&hellip;',     // Horizontal Ellipsis
            '&tilde;',      // Small Tilde
            '&ndash;'       // Dash
            );
        $trans["+"] = '&dagger;';    // Dagger
        $trans['#'] = '&Dagger;';    // Double Dagger         
        $trans['M'] = '&permil;';    // Per Mille Sign
        $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron        
        $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
        $trans["'"] = array(
            '&lsquo;',  // Left Single Quotation Mark
            '&rsquo;',  // Right Single Quotation Mark
            '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
            '&sbquo;',  // Single Low-9 Quotation Mark
            '&circ;',   // Modifier Letter Circumflex Accent
            '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
            );
    
        $trans['"'] = array(
            '&ldquo;',  // Left Double Quotation Mark
            '&rdquo;',  // Right Double Quotation Mark
            '&bdquo;',  // Double Low-9 Quotation Mark
            );
    
        $trans['*'] = '&bull;';    // Bullet
        $trans['n'] = '&ndash;';    // En Dash
        $trans['m'] = '&mdash;';    // Em Dash        
        $trans['tm'] = '&trade;';    // Trade Mark Sign
        $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
        $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
        $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
        $trans['euro'] = '&euro;';    // euro currency symbol
        ksort($trans);
    
        foreach ($trans as $k => $v) {
            $text = str_replace($v, $k, $text);
        }
    
        // 3) remove <p>, <br/> ...
        $text = strip_tags($text);
    
        // 4) &amp; => &, &quot; => "
        $text = html_entity_decode($text);
    
    
        // transliterate
        // if (function_exists('iconv')) {
        // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        // }
    
        // remove non ascii characters
        // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);      
    
        return $text;
    }
    
    ?>
    
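
    The long run of preg_replace() calls above can also be expressed as a lookup table. A minimal sketch of that idea, not the answer's actual code: ASCII_MAP and to_ascii_sketch are hypothetical names, the map holds only a handful of pairs, and, unlike the function above, anything left unmapped is stripped rather than kept:

```php
<?php
// Hypothetical, deliberately tiny map; extend it with the pairs you need.
// Multibyte characters are written as UTF-8 byte escapes so the example
// does not depend on the encoding of this source file.
const ASCII_MAP = [
    "\xC3\xA9"     => 'e',   // e with acute accent
    "\xC3\xB1"     => 'n',   // n with tilde
    "\xE2\x80\x9C" => '"',   // left double quotation mark
    "\xE2\x80\x9D" => '"',   // right double quotation mark
    "\xE2\x80\xA6" => '...', // horizontal ellipsis
    "\xE2\x84\xA2" => 'tm',  // trade mark sign
];

function to_ascii_sketch(string $text): string
{
    // strtr() with an array replaces every key in a single pass.
    $text = strtr($text, ASCII_MAP);

    // Whatever is still outside 7-bit ASCII gets stripped.
    return preg_replace('/[\x80-\xFF]/', '', $text);
}

// Input: left quote, "caf", e-acute, right quote, ellipsis.
echo to_ascii_sketch("\xE2\x80\x9Ccaf\xC3\xA9\xE2\x80\x9D\xE2\x80\xA6"), "\n"; // prints "cafe"...
```

    A single strtr() call avoids re-scanning the string once per pattern, and keeping the pairs in one array makes them easier to review than dozens of separate character classes.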
  • 2020-11-28 04:21

    I also think that the best solution might be to use a regular expression.

    Here's my suggestion:

    function convert_to_normal_text($text) {
    
        $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
        $normal_text = preg_replace("/[^$normal_characters]/", '', $text);
    
        return $normal_text;
    }
    

    Then you can use it like this:

    $before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
    $after = convert_to_normal_text($before);
    echo $after;
    

    Displays:

    Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .
    