Can R read html-encoded emoji characters?

旧巷少年郎 2021-01-13 00:58

Question

My question, explained below, is:

How can R be used to read a string that includes HTML emoji codes, i.e. pairs of decimal numeric character references of the form &#...;&#...;?

4 answers

囚心锁ツ (original poster) 2021-01-13 01:22

I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).

This is a quick and unpolished implementation of rensa's algorithm, but it works!

    library(stringr)   # str_match_all(), str_replace_all(), str_trunc()
    library(magrittr)  # %>%
    # Also requires the 'stringi' and 'BMS' packages to be installed.

    # Takes one "&#HIGH;&#LOW;" pair of decimal UTF-16 surrogate codes and
    # returns the Unicode code point in hex (despite the name, not UTF-8 bytes).
    utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){
      # Pull the two decimal numbers out of "&#...;&#...;".
      string_elements <- stringr::str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]

      string3a <- string_elements[1]
      string3b <- string_elements[2]

      # Decimal -> hex strings.
      string4a <- sprintf("0x0%x", as.numeric(string3a))
      string4b <- sprintf("0x0%x", as.numeric(string3b))

      # Subtract the surrogate offsets (0xD800 for the high unit, 0xDC00 for the low unit).
      string5a <- paste0(as.hexmode(string4a) - 0xd800)
      string5b <- paste0(as.hexmode(string4b) - 0xdc00)

      # Convert each result to exactly 10 binary digits and concatenate them (20 bits).
      string6 <- paste0(
        stringi::stri_pad(
          paste0(BMS::hex2bin(string5a), collapse = ""),
          10,
          pad = "0"
        ) %>%
          stringr::str_trunc(10, side = "left", ellipsis = ""),
        stringi::stri_pad(
          paste0(BMS::hex2bin(string5b), collapse = ""),
          10,
          pad = "0"
        ) %>%
          stringr::str_trunc(10, side = "left", ellipsis = "")
      )

      # 20 bits -> hex, then add 0x10000 to get the code point.
      string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))

      string8 <- as.hexmode(string7) + 0x10000

      unicode_pattern <- string8
      unicode_pattern
    }

    # The "\\U000" prefix assumes a five-digit hex code point.
    make_unicode_entity <- function(x) {
      paste0("\\U000", utf16_double_dec_code_to_utf8(x))
    }
    make_html_entity <- function(x) {
      paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
    }

    # An example string, using the "hug" emoji:
    example_string <- "test &#55358;&#56599; test"

    output_string <- stringr::str_replace_all(
      example_string,
      "(&#[0-9]*?;){2}",  # Find all two-character "&#...;&#...;" codes.
      make_unicode_entity
      # make_html_entity
    )

    cat(output_string)

    # To print the Unicode string (doesn't display in the R console, but can be
    # copied and pasted elsewhere).
    # (This assumes you've used 'make_unicode_entity' above in the
    # str_replace_all call):
    stringi::stri_unescape_unicode(output_string)
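For comparison, the same surrogate-pair arithmetic can be written with base R only, without going through binary strings. This is a minimal sketch rather than a replacement for the answer above: `decode_surrogate_entities` is a hypothetical helper name, and it assumes every emoji in the input is encoded as two consecutive five-digit decimal entities (a high surrogate in 0xD800-0xDBFF followed by a low surrogate in 0xDC00-0xDFFF). It uses the standard UTF-16 rule: code point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00).

```r
# Hypothetical helper (not from the answer above): replaces each pair of
# decimal UTF-16 surrogate entities in `x` with the character it encodes.
decode_surrogate_entities <- function(x) {
  # Surrogate code units lie in 55296..57343, so they always have five digits.
  pattern <- "&#[0-9]{5};&#[0-9]{5};"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, function(hit) {
      # Extract the two decimal code units from "&#HIGH;&#LOW;".
      units <- as.integer(regmatches(hit, gregexpr("[0-9]+", hit))[[1]])
      high <- units[1]
      low <- units[2]
      # Leave the text untouched unless this really is a high/low surrogate pair.
      if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
        return(hit)
      }
      # Standard UTF-16 decoding of a surrogate pair.
      code_point <- 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
      intToUtf8(code_point)
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- decode_surrogate_entities("test &#55358;&#56599; test")
utf8ToInt(decode_surrogate_entities("&#55358;&#56599;"))  # 129303, i.e. U+1F917
```

One known limitation of this sketch: if an unrelated five-digit entity sits directly before a surrogate pair, the regex pairs it with the high surrogate and the real pair is left undecoded.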