I'm trying to do an emoji analysis in R. I have stored some tweets that contain emojis.
Here is one of the tweets I want to analyze.
I use iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve the accented characters (tildes, umlauts and so on):
> tweetn2
[1] "Prógrämmè dü week-eñd: \xed��\xed�\u0083\xed��\xed��\xed��\xed��\xed��\xed��\xed��\xed�� "
> iconv(tweetn2, 'UTF-8', 'latin1', 'byte')
[1] "Prógrämmè dü week-eñd: <ed><a0><bd><ed><b2><83><ed><a0><bc><ed><bd><bb><ed><a0><bc><ed><bd><bb><ed><a0><bc><ed><bd><bb><ed><a0><bc><ed><bd><bb> "
As for decoding the emoji, I would suggest using functions that implement nj_'s answer, or directly using an emoji dictionary like the one I proposed. These two helpers convert between a code point and its UTF-16 surrogate pair:
# Convert a code point above 0xFFFF into its UTF-16 surrogate pair (high, low).
unicode2hilo <- function(unicode) {
  hi <- floor((unicode - 0x10000) / 0x400) + 0xD800
  lo <- (unicode - 0x10000) + 0xDC00 - (hi - 0xD800) * 0x400
  hilo <- paste('0x', as.hexmode(c(hi, lo)), sep = '')
  return(hilo)
}

# Combine a UTF-16 surrogate pair back into a single code point.
hilo2unicode <- function(hi, lo) {
  unicode <- (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  unicode <- paste('0x', as.hexmode(unicode), sep = '')
  return(unicode)
}
The string is invalid UTF-8, as indicated. What you have there is UTF-16 surrogate pairs in which each surrogate has been encoded separately as a 3-byte UTF-8 sequence. So \xED\xA0\xBD is the high surrogate U+D83D, and \xED\xB2\x83 is the low surrogate U+DC83.
If you apply the magical High,Low -> Codepoint formula, you'll end up with the actual codepoint:
(0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000 = 0x1F483
You'll see this is the dancer emoji (U+1F483). Unfortunately I don't have a concrete suggestion for you, as I'm not that familiar with R. But I can say you'd certainly want to get yourself into a position where this data is not double encoded! Hope that helps nudge you in the right direction.
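For what it's worth, the byte-level decoding described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: fix_surrogate_bytes is a hypothetical helper name, not something from the original posts or from any package, and it assumes every surrogate pair shows up as two consecutive 3-byte sequences (ED A0..AF xx followed by ED B0..BF xx). All other bytes are passed through unchanged.

```r
# Hypothetical helper: replace each surrogate pair that was stored as two
# 3-byte sequences with the real 4-byte UTF-8 encoding of its code point.
fix_surrogate_bytes <- function(bytes) {
  b <- as.integer(bytes)
  n <- length(b)
  out <- raw(0)
  i <- 1
  while (i <= n) {
    is_pair <- i + 5 <= n &&
      b[i] == 0xED && b[i + 1] >= 0xA0 && b[i + 1] <= 0xAF &&   # high surrogate
      b[i + 3] == 0xED && b[i + 4] >= 0xB0 && b[i + 4] <= 0xBF  # low surrogate
    if (is_pair) {
      # Undo the 3-byte UTF-8 layout 1110xxxx 10yyyyyy 10zzzzzz
      hi <- (b[i] %% 16) * 4096 + (b[i + 1] %% 64) * 64 + b[i + 2] %% 64
      lo <- (b[i + 3] %% 16) * 4096 + (b[i + 4] %% 64) * 64 + b[i + 5] %% 64
      # Same High,Low -> code point formula as in the answer above
      cp <- (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
      out <- c(out, charToRaw(intToUtf8(cp)))
      i <- i + 6
    } else {
      out <- c(out, as.raw(b[i]))  # ordinary byte, copy as-is
      i <- i + 1
    }
  }
  out
}

# The first emoji of the tweet: \xED\xA0\xBD \xED\xB2\x83
bytes <- as.raw(c(0xED, 0xA0, 0xBD, 0xED, 0xB2, 0x83))
fixed <- fix_surrogate_bytes(bytes)
print(fixed)                       # f0 9f 92 83
print(utf8ToInt(rawToChar(fixed))) # 128131, i.e. 0x1F483
```

On a whole tweet, the same idea would presumably be applied as rawToChar(fix_surrogate_bytes(charToRaw(tweetn2))), followed by setting Encoding() of the result to "UTF-8". Fixing the encoding at the point where the tweets are collected or stored is still the cleaner option if that is possible.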