How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)

旧时难觅i 2020-12-05 10:29

I'm looking for pseudocode, or sample code, to convert higher-bit ASCII characters (like Ü, which is extended ASCII 154) into U (which is ASCII 85).

My initial gues

15 answers
  • 2020-12-05 10:54

    I use this function to escape the accented characters in a variable before passing it to a SOAP function from VB6:

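    Function FixAccents(ByVal Valor As String) As String
    
        Dim x As Long
        ' Escape the ampersand first so the "&" of the references added below is not escaped again
        Valor = Replace(Valor, Chr$(38), "&#" & 38 & ";")
    
        ' Replace DEL and every 8-bit character with a numeric character reference
        For x = 127 To 255
            Valor = Replace(Valor, Chr$(x), "&#" & x & ";")
        Next
    
        FixAccents = Valor
    
    End Function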
    

    And inside the SOAP function I decode it again (for the variable FileName):

    FileName = HttpContext.Current.Server.HtmlDecode(FileName)
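The same escape-then-decode round trip can be sketched in Python. This is only an illustration, not the poster's code: the names `fix_accents` and `restore_accents` are hypothetical, and unlike the VB6 loop, which works on 8-bit character codes, this version emits Unicode code points.

```python
import html


def fix_accents(text: str) -> str:
    """Replace '&' and every non-ASCII character with a numeric character reference."""
    # Escape "&" first so the "&" of the references added below is not escaped again.
    escaped = text.replace("&", "&#38;")
    # "xmlcharrefreplace" turns each non-ASCII character into "&#<code point>;".
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def restore_accents(text: str) -> str:
    """Decode the numeric character references on the receiving side."""
    return html.unescape(text)
```

The escaped string is pure 7-bit ASCII, so it survives transports that mangle 8-bit data, and the receiver gets the original text back. Note that this preserves the accents rather than stripping them.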
    
  • 2020-12-05 10:56

    The upper 128 characters do not have standard meanings. They can take different interpretations (code pages) depending on the user's language.

    For example, see Portuguese versus French Canadian

    Unless you know the code page, your "translation" will be wrong sometimes.

    If you are going to assume a certain code page (e.g. the original IBM code page) then a translation array will work, but for true international users, it will be wrong a lot.

    This is one reason why Unicode is favored over the older system of code pages.

    Strictly speaking, ASCII is only 7 bits.
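To make the code page dependence concrete, here is a small Python sketch that decodes the same byte under two code pages. The helper name `decode_byte` is made up for this example, and the two code pages are just convenient picks from Python's standard codecs.

```python
def decode_byte(value: int, code_page: str) -> str:
    """Interpret a single byte value under the given code page."""
    return bytes([value]).decode(code_page)


# The byte from the question, 154, under two different code pages.
under_ibm_pc = decode_byte(154, "cp437")     # original IBM PC code page
under_windows = decode_byte(154, "cp1252")   # Windows Western European

# The same letter also lands on a different byte in each code page.
u_umlaut_cp437 = "Ü".encode("cp437")
u_umlaut_cp1252 = "Ü".encode("cp1252")
```

Under cp437 byte 154 is Ü, but under cp1252 it is š, and Ü itself moves to byte 220. A table that maps 154 to U is therefore only right for input that really is cp437.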

  • 2020-12-05 10:57

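    Code page 1251 (Windows Cyrillic) is a single-byte code page with no accented Latin letters. This trick appears to rely on .NET's best-fit fallback: when the string is encoded to 1251, accented Latin characters are replaced by their closest base letters. Decoding those bytes as ASCII then keeps only the basic characters.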

    public string RemoveDiacritics(string text)
    {
    
      return System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text));
    
    }
    

    From : http://www.clt-services.com/blog/post/Enlever-les-accents-dans-une-chaine-(proprement).aspx

  • 2020-12-05 11:03

    I think you have nailed it: a 128-byte array, indexed by char & 127, containing the matching 7-bit character for each 8-bit character.
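A minimal Python sketch of such a table follows. It assumes the input is in code page 437 (the one where Ü is 154) and, rather than typing the 128 entries by hand, derives them from Unicode decomposition. The names `build_table` and `to_7bit` are hypothetical, and characters with no ASCII base letter fall back to `?`.

```python
import unicodedata

CODE_PAGE = "cp437"  # assumption: the 8-bit input uses the original IBM PC code page


def build_table(code_page: str) -> bytes:
    """Build a 128-byte table: entry i is the 7-bit stand-in for byte 128 + i."""
    table = bytearray(128)
    for i in range(128):
        ch = bytes([128 + i]).decode(code_page, errors="replace")
        # NFKD splits e.g. "Ü" into "U" plus a combining diaeresis; keep the first part.
        base = unicodedata.normalize("NFKD", ch)[0]
        table[i] = ord(base) if ord(base) < 128 else ord("?")
    return bytes(table)


TABLE = build_table(CODE_PAGE)


def to_7bit(data: bytes) -> bytes:
    """Pass 7-bit bytes through; look up 8-bit bytes in the table, indexed by b & 127."""
    return bytes(b if b < 128 else TABLE[b & 127] for b in data)
```

For example, `to_7bit(bytes([154]))` gives `b"U"`. A hand-written table gives more control over the awkward cases (box-drawing characters, ß, currency signs), which here all collapse to `?`.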

  • 2020-12-05 11:06

    Indeed, as proposed by unexist: the "iconv" function exists to handle all the weird conversions for you. It is available in almost all programming languages and has a special option (`//TRANSLIT` in GNU iconv) which tries to replace characters missing from the target set with approximations.

    Use iconv to simply convert your input UTF-8 string to 7-bit ASCII.

    Otherwise, you'll always end up hitting a corner case: an 8-bit input that uses a different code page with a different set of characters (so your conversion table does not work at all), one last stupid accented character you forgot to map (you mapped all the grave and acute accents, but forgot the Czech caron or the Nordic '°'), etc.

    Of course, if you want to apply the solution to a small, specific problem (making file-system-friendly filenames for your music collection), then look-up arrays are the way to go, because they are quickly hacked together and quickly checked for missing elements. That can be either an array which maps each code number above 128 to an approximation under 128, as proposed by JeeBee, or the source/target pairs proposed by vIceBerg, depending on which substitution functions are already available in your language of choice.
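Python's standard library has no iconv binding, so the sketch below swaps in a different technique that gives a comparable approximation for a Unicode string: decompose each character, drop the combining marks, and replace whatever is still outside ASCII with `?`. The function name `approximate_ascii` is made up for this example, and its output will not match iconv's transliteration in every case.

```python
import unicodedata


def approximate_ascii(text: str) -> str:
    """Approximate a Unicode string in 7-bit ASCII by stripping diacritics."""
    # NFKD separates base letters from their accents ("Č" becomes "C" + caron).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter (such as "Ø" or "ß") become "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")
```

This covers the caron and ring cases mentioned above without a hand-maintained table, but only once the input has been correctly decoded to Unicode. For raw 8-bit input you still have to know the code page.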

  • 2020-12-05 11:07

    I think you have already hit the nail on the head. Given your limited domain, a conversion array or hash is your best bet. There is no sense in creating anything complex to try to do it automagically.
