Convert unicode codepoint to string character in Ruby

难免孤独 2020-12-01 09:56

I have these values from a Unicode database, but I'm not sure how to translate them into human-readable form. What are these even called?

Here they are:

2 answers
  • 2020-12-01 10:00

    How about:

    # Using pack
    puts ["2B71F".hex].pack("U")
    
    # Using chr
    puts 0x2B71F.chr(Encoding::UTF_8)
    

    In Ruby 1.9+ you can also do:

    puts "\u{2B71F}"
    

    That is, the \u{} escape sequence in a double-quoted string literal inserts the character with the given Unicode codepoint.
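
    One caveat worth illustrating: the escape is resolved when the literal is parsed, so it cannot take a codepoint that is only known at run time. The sketch below (variable names are illustrative, not from the answer above) compares the literal with Integer#chr and shows that single quotes leave the escape untouched.

```ruby
# The \u{...} escape needs double quotes and a codepoint written in the source.
ch = "\u{2B71F}"

# Integer#chr builds the same character from a value held in a variable.
value = "2B71F".hex
from_int = value.chr(Encoding::UTF_8)

# In single quotes the escape is not interpreted at all.
raw = '\u{2B71F}'

puts ch == from_int   # true
puts raw.length       # 9 (backslash, u, braces and five hex digits)
```

    For a codepoint read from a file or database, chr or pack is therefore the practical choice.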

  • 2020-12-01 10:18

    Identifiers such as U+2B71F are called codepoints.

    The Unicode standard assigns a unique codepoint to each character across the world's writing systems, scientific symbols, currency signs, and so on. This character set is steadily growing.

    For example, U+221E is the infinity sign (∞).

    A codepoint is just an integer, conventionally written in hexadecimal. Each encoded character is assigned exactly one codepoint.
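
    To make the "U+" notation concrete, here is a small sketch that goes from a character to its number and back to the conventional label. The variable names are placeholders chosen for this example.

```ruby
infinity = "\u221E"

# String#ord returns the codepoint as a plain Integer.
number = infinity.ord

# The U+ label is the same number in hexadecimal, padded to at least four digits.
label = format("U+%04X", number)

puts number            # 8734
puts label             # U+221E
puts 0x221E == number  # true
```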

    A codepoint can be laid out in memory in many ways. Each such layout is called an encoding; the common ones are UTF-8 and UTF-16, and conversion between them is well defined.
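
    As an illustration, the same codepoint produces different bytes under UTF-8 and UTF-16, and converting back recovers it. This is a minimal sketch; the to_hex lambda is a formatting helper written for this example only.

```ruby
ch = 0x221E.chr(Encoding::UTF_8)   # U+221E, the infinity sign

utf8_bytes  = ch.bytes
utf16_bytes = ch.encode(Encoding::UTF_16BE).bytes

# Formatting helper for this example only.
to_hex = ->(bytes) { bytes.map { |b| format("%02X", b) }.join(" ") }

puts to_hex.call(utf8_bytes)    # E2 88 9E
puts to_hex.call(utf16_bytes)   # 22 1E

# Converting back to UTF-8 yields the same codepoint.
round_trip = ch.encode(Encoding::UTF_16BE).encode(Encoding::UTF_8)
puts round_trip.ord == 0x221E   # true
```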

    Here you most probably want to convert the Unicode codepoint to a UTF-8 character.

    codepoint = "U+2B71F"
    

    You need to extract the hex part that follows U+, leaving only 2B71F. It will be the first capture group.

    codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
    

    And your UTF-8 character will be:

    utf_8_character = [$1.hex].pack("U")
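
    The steps above can be combined into one method. This is a sketch rather than a definitive implementation: codepoint_to_char is a hypothetical name, and anchoring the pattern with \A and \z is a choice made here so that input with leading or trailing junk is rejected.

```ruby
# Hypothetical helper (not from the original answer): returns the character
# for a "U+XXXX" string, or nil when the input does not look like a codepoint.
def codepoint_to_char(text)
  # \A and \z anchor the whole string; the capture group holds the hex digits.
  match = text.to_s.match(/\AU\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})\z/)
  return nil unless match

  # String#hex parses the digits; pack("U") emits the UTF-8 character.
  [match[1].hex].pack("U")
end

puts codepoint_to_char("U+2B71F")
puts codepoint_to_char("U+221E")
p codepoint_to_char("not a codepoint")  # nil
```

    Using the MatchData object instead of $1 keeps the method free of global state, which matters if it is called from several places.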
    

    References:

    1. Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
    2. Tim Bray on the goodness of Unicode.
    3. Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
    4. Dissecting the Unicode regular expression.