How to convert letters with accents, umlauts, etc to their ASCII counterparts in Perl?

清歌不尽 2020-12-01 17:07

I'm writing a program that works with documents in Perl and a lot of the documents have characters such as ä, ö, ü, é, etc. (both capital and lowercase). I'd l

4 Answers
  • 2020-12-01 17:21

    As usual, if you run into a problem that is almost certainly not unique to you, there is already a solution on CPAN. In this case it's called Text::Unidecode:

    use warnings;
    use strict;
    use utf8;
    use Text::Unidecode;
    print unidecode('ä, ö, ü, é'); # will print 'a, o, u, e'
    
  • 2020-12-01 17:23

    Text::Unidecode

    See the many disclaimers in its documentation, but it's probably just what you need if all you have is Latin text with diacritics.
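
    If installing a CPAN module is not an option, a rough alternative for plain Latin text is to decompose the string with the core Unicode::Normalize module and then strip the combining marks. This is a different technique from Text::Unidecode, not a description of how that module works, and strip_marks is a hypothetical helper name used only for this sketch. It assumes the input is already a decoded character string.

    ```perl
    use strict;
    use warnings;
    use utf8;                          # string literals below are UTF-8 in the source
    use Unicode::Normalize qw(NFD);    # ships with Perl

    # strip_marks() is a hypothetical helper, not part of any module.
    sub strip_marks {
        my ($text) = @_;
        my $decomposed = NFD($text);   # 'é' becomes 'e' followed by U+0301
        $decomposed =~ s/\p{Mn}+//g;   # remove the nonspacing combining marks
        return $decomposed;
    }

    binmode STDOUT, ':encoding(UTF-8)';
    print strip_marks('ä, ö, ü, é, Å'), "\n";   # prints: a, o, u, e, A
    print strip_marks('ß, ø, æ'), "\n";         # prints: ß, ø, æ (no decomposition exists)
    ```

    Letters such as ß, ø and æ have no canonical decomposition, so they pass through unchanged and would still need an explicit mapping.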

  • 2020-12-01 17:34

    I wrote this subroutine and feed each word through it. It could be slow.

    sub store_utf82_encoding{
    ##see file UTF8vowels.txt
    #converts accented Latin-1 letters to their nearest English equivalent
    #expects decoded characters, not raw UTF-8 bytes
    
      my $name=$_[0];
      $name =~ s/\x{00c0}/A/g; # Agrave
      $name =~ s/\x{00c1}/A/g; # Aacute
      $name =~ s/\x{00c2}/A/g; # Acirc
      $name =~ s/\x{00c3}/A/g; # Atilde
      $name =~ s/\x{00c4}/A/g; # Auml
      $name =~ s/\x{00c5}/A/g; # Aring
      $name =~ s/\x{00c6}/AE/g; # AE
      $name =~ s/\x{00c7}/Ch/g; # Ccedilla
      $name =~ s/\x{00c8}/E/g; # Egrave
      $name =~ s/\x{00c9}/E/g; # Eacute
      $name =~ s/\x{00ca}/E/g; # Ecirc
      $name =~ s/\x{00cb}/E/g; # Euml
      $name =~ s/\x{00cc}/I/g; # Igrave
      $name =~ s/\x{00cd}/I/g; # Iacute
      $name =~ s/\x{00ce}/I/g; # Icirc
      $name =~ s/\x{00cf}/I/g; # Iuml
      $name =~ s/\x{00d0}/Th/g; # capital Eth
      $name =~ s/\x{00d1}/NY/g; # Ntilde
      $name =~ s/\x{00d2}/O/g; # Ograve
      $name =~ s/\x{00d3}/O/g; # Oacute
      $name =~ s/\x{00d4}/O/g; # Ocirc
      $name =~ s/\x{00d5}/O/g; # Otilde
      $name =~ s/\x{00d6}/O/g; # Ouml
      $name =~ s/\x{00d8}/O/g; # Ostroke
      $name =~ s/\x{00d9}/U/g; # Ugrave
      $name =~ s/\x{00da}/U/g; # Uacute
      $name =~ s/\x{00db}/U/g; # Ucirc
      $name =~ s/\x{00dc}/U/g; # Uuml
      $name =~ s/\x{00dd}/Y/g; # Yacute
      $name =~ s/\x{00de}/Th/g; # capital Thorn
      $name =~ s/\x{00df}/ss/g; # German sharp s (eszett), a lowercase letter
      $name =~ s/\x{00e0}/a/g; # agrave
      $name =~ s/\x{00e1}/a/g; # aacute
      $name =~ s/\x{00e2}/a/g; # acirc
      $name =~ s/\x{00e3}/a/g; # atilde
      $name =~ s/\x{00e4}/a/g; # auml
      $name =~ s/\x{00e5}/a/g; # aring
      $name =~ s/\x{00e6}/ae/g; # ae
      $name =~ s/\x{00e7}/ch/g; # ccedilla
      $name =~ s/\x{00e8}/e/g; # egrave
      $name =~ s/\x{00e9}/e/g; # eacute
      $name =~ s/\x{00ea}/e/g; # ecirc
      $name =~ s/\x{00eb}/e/g; # euml
      $name =~ s/\x{00ec}/i/g; # igrave
      $name =~ s/\x{00ed}/i/g; # iacute
      $name =~ s/\x{00ee}/i/g; # icirc
      $name =~ s/\x{00ef}/i/g; # iuml
      $name =~ s/\x{00f0}/th/g; # lowercase eth
      $name =~ s/\x{00f1}/ny/g; # ntilde
      $name =~ s/\x{00f2}/o/g; # ograve
      $name =~ s/\x{00f3}/o/g; # oacute
      $name =~ s/\x{00f4}/o/g; # ocirc
      $name =~ s/\x{00f5}/o/g; # otilde
      $name =~ s/\x{00f6}/o/g; # ouml
      $name =~ s/\x{00f8}/o/g; # ostroke
      $name =~ s/\x{00f9}/u/g; # ugrave
      $name =~ s/\x{00fa}/u/g; # uacute
      $name =~ s/\x{00fb}/u/g; # ucirc
      $name =~ s/\x{00fc}/u/g; # uuml
      $name =~ s/\x{00fd}/y/g; # yacute
      $name =~ s/\x{00fe}/th/g; # lowercase thorn
      $name =~ s/\x{00ff}/y/g; # yuml
    
    return $name;
    
    } #endsub store_utf82_encoding
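
    If running one substitution per character does turn out to be slow, a table-driven variant makes a single pass over the string instead. The sketch below is only an illustration: %ascii_for and to_ascii are hypothetical names, the table is deliberately partial, and the input is assumed to be a decoded character string.

    ```perl
    use strict;
    use warnings;

    # Hypothetical, partial lookup table; extend it with the remaining letters.
    my %ascii_for = (
        "\x{c0}" => 'A',  "\x{c4}" => 'A',  "\x{c6}" => 'AE',
        "\x{d6}" => 'O',  "\x{dc}" => 'U',  "\x{df}" => 'ss',
        "\x{e4}" => 'a',  "\x{e6}" => 'ae', "\x{e9}" => 'e',
        "\x{f1}" => 'ny', "\x{f6}" => 'o',  "\x{fc}" => 'u',
    );

    # Build one character class from the table keys and compile it once.
    my $class = join '', map { sprintf('\\x{%04x}', ord $_) } keys %ascii_for;
    my $re    = qr/([$class])/;

    # to_ascii() is a hypothetical helper: one regex pass, one hash lookup per hit.
    sub to_ascii {
        my ($name) = @_;
        $name =~ s/$re/$ascii_for{$1}/g;
        return $name;
    }

    print to_ascii("M\x{fc}nchen, Stra\x{df}e, Espa\x{f1}a"), "\n";
    # prints: Munchen, Strasse, Espanya
    ```

    Characters that are not in the table are left untouched, so the table can grow gradually as new letters show up in the documents.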
    
  • 2020-12-01 17:36

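    Use s/// (search and replace) instead of m// (match); m// only tests for a match and cannot take a replacement part.

    e.g. $name =~ s/\x{00c0}/A/g; (note the braces around the code point)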
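
    One more detail in the pattern is worth checking: without braces, \x reads at most two hex digits, so \x00c0 means a NUL character followed by the literal text "c0" and never matches À. A small sketch of the difference, with a made-up sample string and variable names:

    ```perl
    use strict;
    use warnings;

    my $name = "\x{c0} la carte";           # starts with U+00C0 (A with grave)

    # Two hex digits only: this pattern is NUL + "c0", so nothing is replaced.
    (my $wrong = $name) =~ s/\x00c0/A/g;

    # Braces make Perl read the whole code point.
    (my $right = $name) =~ s/\x{00c0}/A/g;

    print $wrong eq $name ? "unchanged\n" : "changed\n";   # prints: unchanged
    print "$right\n";                                      # prints: A la carte
    ```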
