Can R read html-encoded emoji characters?

旧巷少年郎 2021-01-13 00:58

Question

My question, explained below, is:

How can R be used to read a string that includes HTML emoji codes like &#55358;&#56599;?

4 answers
  •  囚心锁ツ
    2021-01-13 01:22

    I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).

    This is a quick and unpolished implementation of rensa's algorithm, but it works!

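    library(stringr)  # provides str_match_all() and re-exports the %>% pipe

    # Convert a "&#high;&#low;" pair of decimal UTF-16 surrogates into the
    # Unicode code point they encode (despite the function name, the result
    # is a hexmode code point, not UTF-8 bytes).
    utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){
      string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]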
    
      string3a <- string_elements[1]
      string3b <- string_elements[2]
    
      string4a <- sprintf("0x0%x", as.numeric(string3a))
      string4b <- sprintf("0x0%x", as.numeric(string3b))
    
      string5a <- paste0(
        # "0x", 
        as.hexmode(string4a) - 0xd800
      )
      string5b <- paste0(
        # "0x",
        as.hexmode(string4b) - 0xdc00
      )
    
      string6 <- paste0(
        stringi::stri_pad(
          paste0(BMS::hex2bin(string5a), collapse = ""),
          10,
          pad = "0"
        ) %>%
          stringr::str_trunc(10, side = "left", ellipsis = ""),
        stringi::stri_pad(
          paste0(BMS::hex2bin(string5b), collapse = ""),
          10,
          pad = "0"
        ) %>%
          stringr::str_trunc(10, side = "left", ellipsis = "")
      )
    
      string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
    
      string8 <- as.hexmode(string7) + 0x10000
    
      unicode_pattern <- string8
      unicode_pattern
    }
    
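    # Build an escape such as "\U0001f917". The fixed "000" prefix assumes a
    # five-digit hex code point, which holds for U+10000 through U+FFFFF.
    make_unicode_entity <- function(x) {
      paste0("\\U000", utf16_double_dec_code_to_utf8(x))
    }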
    make_html_entity <- function(x) {
      paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
    }
    
    # An example string, using the "hug" emoji:
    example_string <- "test &#55358;&#56599; test"
    
    output_string <- stringr::str_replace_all(
      example_string,
      "(&#[0-9]*?;){2}",  # Find all two-character "&#...;&#...;" codes.
      make_unicode_entity
      # make_html_entity
    )
    
    cat(output_string)
    
    # To print the Unicode string (it may not display in the R console, but it
    # can be copied and pasted elsewhere). This assumes 'make_unicode_entity'
    # was used in the str_replace_all call above:
    stringi::stri_unescape_unicode(output_string)
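
For comparison, the same surrogate-pair arithmetic can be sketched in base R without the round trip through binary strings. This is only a minimal sketch, not the code from the answer above: the helper names `surrogate_pair_to_code_point` and `decode_surrogate_entities` are hypothetical, and it assumes the input is a single string in which each emoji appears as a `&#high;&#low;` pair of decimal UTF-16 surrogates.

```r
# Hypothetical helper: combine a high and a low UTF-16 surrogate (given as
# decimal numbers) into one Unicode code point.
surrogate_pair_to_code_point <- function(high, low) {
  stopifnot(high >= 0xD800, high <= 0xDBFF, low >= 0xDC00, low <= 0xDFFF)
  0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
}

# Hypothetical helper: replace each "&#high;&#low;" surrogate pair in a single
# string with the character it encodes. Other entities are left untouched.
decode_surrogate_entities <- function(x) {
  # Candidate pairs: two five-digit entities in roughly the surrogate range;
  # the exact range is checked below.
  pattern <- "&#5[56][0-9]{3};&#5[67][0-9]{3};"
  matches <- unique(regmatches(x, gregexpr(pattern, x))[[1]])
  for (m in matches) {
    parts <- as.numeric(regmatches(m, gregexpr("[0-9]+", m))[[1]])
    high <- parts[1]
    low <- parts[2]
    is_pair <- high >= 0xD800 && high <= 0xDBFF &&
      low >= 0xDC00 && low <= 0xDFFF
    if (!is_pair) next  # not a valid surrogate pair, so skip it
    code_point <- surrogate_pair_to_code_point(high, low)
    x <- gsub(m, intToUtf8(code_point), x, fixed = TRUE)
  }
  x
}

# The "hug" emoji from the example above:
surrogate_pair_to_code_point(55358, 56599)  # 129303, i.e. 0x1F917
decoded <- decode_surrogate_entities("test &#55358;&#56599; test")
```

Whether `decoded` displays as an emoji depends on the console and font, but `utf8ToInt()` on the sixth character should return the code point either way.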
    
