PHP iconv TRANSLIT for removing accents: not working as expected?

终归单人心 2020-12-10 06:09

consider this simple code:

echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');

it prints

 `e

instead of just e.

7 Answers
  • 2020-12-10 06:39

    It happened to me with plain iconv, without PHP. The trick was to set the LANG environment variable to en_US.UTF-8 (it was hu_HU.UTF-8 before, in my case). After that it worked as expected.

  • 2020-12-10 06:45

    I'm tempted to say "nothing", although this is a little outside my expertise. PHP's iconv() is notorious, and the inspiration for many workarounds, including

    • dropping to the system's iconv utility (Unix & Linux)
    • crafting a lookup table
    • replacing all accented characters with an ASCII equivalent as kind of a preprocessing stage
    • setting LC_COLLATE (which doesn't seem to work for everyone)
    • using htmlentities() instead of iconv()

    Read the comments for iconv() documentation for more inspiration. (Or commiseration. Too close to call.)
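    As an illustration of the lookup-table workaround from the list above, here is a minimal sketch. The function name remove_accents_table() is hypothetical, and the table is only a small sample, not a complete mapping; it also assumes the input uses precomposed (NFC) characters encoded as UTF-8.

```php
<?php
// Hypothetical helper: replaces accented letters using a hand-written table.
// The table is an illustrative sample only; extend it for your own data.
function remove_accents_table($text)
{
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    ];

    // With an array argument, strtr() replaces whole multi-byte keys,
    // so UTF-8 sequences are never split in the middle.
    return strtr($text, $map);
}

echo remove_accents_table('è'), "\n"; // prints "e"
```

    Characters missing from the table pass through unchanged, which makes the gaps easy to spot, unlike iconv(), whose output depends on the locale.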

  • 2020-12-10 06:47

    It seems the standard way to handle this is with a "remove accents" function, which you can find in libraries like Flourish or CMSs like WordPress. iconv seems to be unable to translate accents (and rightly so), since doing this isn't a good idea for anything other than URL slugs.

  • 2020-12-10 06:54

    I have this standard function to return valid URL strings without the invalid URL characters. The magic seems to be in the line after the "// remove unwanted characters" comment.

    This is taken from the Symfony framework documentation: http://www.symfony-project.org/jobeet/1_4/Doctrine/en/08 which in turn is taken from http://php.vrana.cz/vytvoreni-pratelskeho-url.php but I don't speak Czech ;-)

    function slugify($text)
    {
      // replace non letter or digits by -
      $text = preg_replace('#[^\\pL\d]+#u', '-', $text);
    
      // trim
      $text = trim($text, '-');
    
      // transliterate
      if (function_exists('iconv'))
      {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
      }
    
      // lowercase
      $text = strtolower($text);
    
      // remove unwanted characters
      $text = preg_replace('#[^-\w]+#', '', $text);
    
      if (empty($text))
      {
        return 'n-a';
      }
    
      return $text;
    }
    
    echo slugify('é'); // --> "e"
    
  • 2020-12-10 07:01

    When doing transliteration, you have to make sure that your LC_COLLATE is properly set; otherwise the default POSIX locale will be used.

    Look at http://uk3.php.net/manual/en/function.setlocale.php
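    A minimal sketch of setting the locale right before calling iconv(). The helper name translit_to_ascii() and the candidate locale names are assumptions for illustration, and the locales must actually be installed on the system. The sketch sets LC_CTYPE rather than LC_COLLATE, on the assumption (true of glibc as far as I know) that the transliteration rules are tied to that category; LC_ALL would cover both.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII under a UTF-8 locale.
// The locale names are assumptions; check `locale -a` for what is installed.
function translit_to_ascii($text, array $locales = ['en_US.UTF-8', 'en_US.utf8'])
{
    // Passing '0' only queries the current setting, so it can be restored later.
    $previous = setlocale(LC_CTYPE, '0');

    // Tries each candidate in order; returns false if none is available,
    // in which case the current locale simply stays in effect.
    setlocale(LC_CTYPE, $locales);

    // @ hides the notice iconv() raises for characters it cannot convert;
    // in that case the return value is false.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $result;
}

var_dump(translit_to_ascii('è')); // the result depends on the iconv implementation and locale
```

    Restoring the previous locale keeps the change from leaking into the rest of the request, since setlocale() affects the whole process.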

  • 2020-12-10 07:02

    Cf. @tchrist's answer; with the PHP intl extension:

    http://fr2.php.net/manual/en/book.intl.php

    preg_replace('/\pM*/u','',normalizer_normalize( $mystring, Normalizer::FORM_D));
    

    eéèêëiîïoöôuùûüaâäÅ Ἥ ŐǟǠ ǺƶƈƉųŪŧȬƀ␢ĦŁȽŦ ƀǖ becomes

    eeeeeiiiooouuuuaaaA Η OaA AƶƈƉuUŧOƀ␢ĦŁȽŦ ƀu
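    The one-liner above can be wrapped in a small function, sketched here under the assumption that the intl extension is loaded. The name strip_combining_marks() is hypothetical.

```php
<?php
// Hypothetical wrapper around the normalize-then-strip approach (needs ext-intl).
function strip_combining_marks($text)
{
    // FORM_D splits each precomposed character into a base letter
    // followed by its combining marks (e.g. "è" becomes "e" + U+0300).
    $decomposed = normalizer_normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed (e.g. invalid UTF-8): leave input untouched
    }

    // \pM matches any Unicode combining mark; removing them leaves the base letters.
    return preg_replace('/\pM+/u', '', $decomposed);
}

echo strip_combining_marks('è'), "\n"; // prints "e"
```

    Characters that have no canonical decomposition, such as Ð, come back unchanged.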


    As tchrist emphasises, not all Unicode characters are considered decomposable:

    extract from Unicode charts:

    U0080.pdf

    00CF Ï LATIN CAPITAL LETTER I WITH DIAERESIS

    ≡ 0049 I 0308 ¨

    NB: the symbol « ≡ » indicates an available decomposition

    00D0 Ð LATIN CAPITAL LETTER ETH

    → 00F0 ð latin small letter eth

    → 0110 Đ latin capital letter d with stroke

    → 0189 Ɖ latin capital letter african d

    No decomposition is available, which is strange IMHO (we could consider the ASCII letter D an acceptable equivalent).

    U0100.pdf

    0110 Đ LATIN CAPITAL LETTER D WITH STROKE

    → 00D0 Ð latin capital letter eth

    → 0111 đ latin small letter d with stroke

    → 0189 Ɖ latin capital letter african d

    Even stranger: this one is named LATIN CAPITAL LETTER D (WITH STROKE), yet it is not decomposable as such! Perhaps a cooler solution would be to get the Unicode name of each character, compare it with the name of each ASCII character, and replace accordingly. Anyone? ;-]

    cf http://unicode.org/Public/UNIDATA/UnicodeData.txt
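    A rough sketch of that name-based idea, assuming PHP 7+ with the intl extension (IntlChar::charName()). The function name ascii_by_unicode_name() and the name pattern are my own assumptions; it only handles names of the form "LATIN CAPITAL/SMALL LETTER X ...", where X is a single letter.

```php
<?php
// Hypothetical helper: maps a character to an ASCII letter based on its Unicode name.
function ascii_by_unicode_name($text)
{
    // Split into code points; the /u flag makes PCRE treat the input as UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $text; // invalid UTF-8: leave input untouched
    }

    $out = '';
    foreach ($chars as $char) {
        $name = IntlChar::charName($char);

        // Matches e.g. "LATIN CAPITAL LETTER D WITH STROKE" but not
        // "LATIN CAPITAL LETTER ETH", because \b requires a one-letter word.
        if (is_string($name)
            && preg_match('/^LATIN (CAPITAL|SMALL) LETTER ([A-Z])\b/', $name, $m)) {
            $out .= ($m[1] === 'SMALL') ? strtolower($m[2]) : $m[2];
        } else {
            $out .= $char; // no usable name: keep the character as it is
        }
    }

    return $out;
}

echo ascii_by_unicode_name('Đđ'), "\n"; // prints "Dd"
```

    Letters whose names do not spell out a base letter, such as Ð (LATIN CAPITAL LETTER ETH), still pass through unchanged, so a lookup table may be needed for those.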
