Emoji in R [UTF-8 encoding]

梦想的初衷 submitted on 2019-12-31 02:42:08

Question


I'm trying to run an emoji analysis in R. I have stored some tweets that contain emojis.

Here is one of the tweets that I want to analyze:

> tweetn2
[1] "Programme du week-end: \xed\xa0\xbd\xed\xb2\x83\xed\xa0\xbc \xed\xbe\xb6\xed\xa0\xbc 
    \xed\xbd\xbb\xed\xa0\xbc\xed\xbd\xbb\xed\xa0\xbc \xed\xbd\xbb\xed\xa0\xbc\xed\xbd\xbb"

To be sure that I have "UTF-8":

> Encoding(tweetn2)
[1] "UTF-8"

Now when I try to match some of the characters, it does not work:

> grepl("\\xed",tweetn2)
[1] FALSE

or

> grepl("xed",tweetn2)
[1] FALSE

But it seems that the emoji bytes "\xed\xa0\xbd" are not valid "UTF-8", because I get an error message when I run:

> str(tweetn2)
Error in str.default(tweetn2) : invalid multibyte string, element 1

I found a kind of solution that uses the iconv() function with "ASCII" encoding here:
http://www.r-bloggers.com/emoticons-decoder-for-social-media-sentiment-analysis-in-r/

But I want to keep using "UTF-8" for my analysis because it works well with French accented letters (à, é, è, ê, ë, û, etc.).

So do you have an idea of how I can get around this?

Thanks


Answer 1:


The string is invalid UTF-8, as indicated. What you have there is UTF-16 whose surrogate code units have each been encoded as a three-byte UTF-8 sequence. So \xED\xA0\xBD is the high surrogate U+D83D, and \xED\xB2\x83 is the low surrogate U+DC83.

If you apply the magical High,Low -> Codepoint formula, you'll end up with the actual codepoint:

(0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000 = 0x1F483

You'll see this is the dancer emoji. Unfortunately I don't have a suggestion for you, as I'm not that familiar with R. But I can say you'd certainly want to get yourself into a position where this data is not double encoded! Hope that helps bump you along in the right direction.
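The arithmetic above can be checked directly in R. The following is only a minimal sketch: `decode3` is a hypothetical helper name, not part of base R, and it assumes each surrogate arrives as a three-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx, as in the tweet shown in the question.

```r
# Hypothetical helper: decode one three-byte UTF-8-style sequence
# (1110xxxx 10xxxxxx 10xxxxxx) into its 16-bit value.
decode3 <- function(b) {
  b <- as.integer(b)
  # Keep the low 4 bits of byte 1 and the low 6 bits of bytes 2 and 3.
  (b[1] %% 16L) * 4096L + (b[2] %% 64L) * 64L + (b[3] %% 64L)
}

hi <- decode3(c(0xED, 0xA0, 0xBD))  # high surrogate, 0xD83D
lo <- decode3(c(0xED, 0xB2, 0x83))  # low surrogate, 0xDC83

# High,Low -> Codepoint formula from the answer
cp <- (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000

sprintf("0x%X", as.integer(cp))  # "0x1F483"
```

Base R's `intToUtf8(cp)` then returns the character itself as a properly encoded UTF-8 string.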




Answer 2:


I use iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve accented characters (the fourth argument is sub, so bytes that cannot be converted are written out as <xx> hex codes):

> tweetn2
[1] "Prógrämmè dü week-eñd: \xed��\xed�\u0083\xed��\xed��\xed��\xed��\xed��\xed��\xed��\xed�� "
> iconv(tweetn2, 'UTF-8', 'latin1', 'byte')
[1] "Prógrämmè dü week-eñd: <ed><a0><bd><ed><b2><83><ed><a0><bc><ed><bd><bb><ed><a0><bc><ed><bd><bb><ed><a0><bc><ed><bd><bb><ed><a0><bc><ed><bd><bb> "

As for decoding the emoji, I would suggest using functions that implement the formula from nj_'s answer, or directly using an emoji dictionary like the one I proposed.

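# Split a code point above U+FFFF into its UTF-16 high and low surrogates.
# Returns both as hex strings, e.g. '0xd83d' and '0xdc83'.
unicode2hilo <- function(unicode){
   hi <- floor((unicode - 0x10000)/0x400) + 0xd800
   lo <- (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400
   hilo <- paste('0x', as.hexmode(c(hi,lo)), sep = '')
   return(hilo)
}

# Combine a high/low surrogate pair back into one code point (as a hex string).
hilo2unicode <- function(hi,lo){
   unicode <- (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
   unicode <- paste('0x', as.hexmode(unicode), sep = '')
   return(unicode)
}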
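Putting the pieces together, one possible way to stay in UTF-8 is to repair the string at the byte level: find each six-byte surrogate pair and replace it with the real UTF-8 encoding of the combined code point. This is a rough sketch under stated assumptions, not a tested production routine. `fix_surrogates` and `decode3` are hypothetical names, the input is assumed to be a single string, and unpaired surrogates are left untouched.

```r
# Hypothetical helper: decode a three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx.
decode3 <- function(b) {
  (b[1] %% 16L) * 4096L + (b[2] %% 64L) * 64L + (b[3] %% 64L)
}

# Hypothetical repair function: rewrite surrogate pairs stored as 2 x 3 bytes
# into the proper 4-byte UTF-8 form; all other bytes are copied unchanged.
fix_surrogates <- function(x) {
  b <- as.integer(charToRaw(x))
  n <- length(b)
  out <- raw(0)
  i <- 1L
  while (i <= n) {
    # High surrogate: ED A0..AF xx, followed by low surrogate: ED B0..BF xx
    is_pair <- i + 5L <= n &&
      b[i] == 0xED && b[i + 1L] >= 0xA0 && b[i + 1L] <= 0xAF &&
      b[i + 3L] == 0xED && b[i + 4L] >= 0xB0 && b[i + 4L] <= 0xBF
    if (is_pair) {
      hi <- decode3(b[i:(i + 2L)])
      lo <- decode3(b[(i + 3L):(i + 5L)])
      cp <- (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
      out <- c(out, charToRaw(intToUtf8(cp)))
      i <- i + 6L
    } else {
      out <- c(out, as.raw(b[i]))
      i <- i + 1L
    }
  }
  res <- rawToChar(out)
  Encoding(res) <- "UTF-8"
  res
}

# Example input built from raw bytes: "A", the double-encoded dancer, "B"
broken <- rawToChar(as.raw(c(0x41, 0xED, 0xA0, 0xBD, 0xED, 0xB2, 0x83, 0x42)))
fixed <- fix_surrogates(broken)
```

Because valid multibyte sequences such as the two bytes of "é" never match the ED A0..AF pattern, accented letters pass through unchanged.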


Source: https://stackoverflow.com/questions/35670238/emoji-in-r-utf-8-encoding
