Emoji in R [UTF-8 encoding]

不要未来只要你来 2021-01-21 09:26

I'm trying to do an emoji analysis in R. I have stored some tweets that contain emojis.

Here is one of the tweets that I want

2 answers

孤街浪徒 (original poster)

2021-01-21 10:12
The string is invalid UTF-8, as indicated. What you have there is UTF-16 whose surrogate code units were each encoded as UTF-8. So \xED\xA0\xBD is the high surrogate U+D83D, and \xED\xB2\x83 is the low surrogate U+DC83.
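For readers who want to check this claim, here is a minimal Python sketch (not R, and not part of the original answer). It assumes the six bytes quoted above are the whole emoji; the variable names are only illustrative. Python's strict UTF-8 decoder rejects encoded surrogates, while the `surrogatepass` error handler lets them through so they can be inspected.

```python
# The six bytes quoted in the answer: two 3-byte UTF-8 sequences,
# each one encoding a single UTF-16 surrogate.
raw = b"\xed\xa0\xbd\xed\xb2\x83"

# Strict UTF-8 decoding refuses encoded surrogates.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" lets the surrogates through so we can look at them.
units = [ord(ch) for ch in raw.decode("utf-8", "surrogatepass")]

print(strict_ok)                # False
print([hex(u) for u in units])  # ['0xd83d', '0xdc83']
```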

If you apply the standard formula that maps a high and low surrogate to a code point, you'll end up with the actual code point:
```
(0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000 = 0x1F483
```
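The same arithmetic can be written as a small Python function. This is only a sketch: the name `combine_surrogates` is hypothetical, and the range checks are an addition that the formula above does not spell out.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    # High surrogates occupy U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


cp = combine_surrogates(0xD83D, 0xDC83)
print(hex(cp))  # 0x1f483
```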
You'll see this is the dancer emoji (U+1F483). Unfortunately I don't have a suggestion for you, as I'm not that familiar with R. But I can say you'd certainly want to get yourself in a position where this data is not double encoded! Hope that helps bump you along in the right direction.
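Outside R, one possible way to repair text like this is sketched below in Python. It assumes every surrogate in the input is part of a well-formed pair. The helper name `repair_double_encoded` and the sample text `dance ` are made up for illustration. The idea is to decode the bytes while tolerating surrogates, write those surrogates back out as real UTF-16 code units, and then decode that as UTF-16 so each pair is joined into one character.

```python
def repair_double_encoded(raw: bytes) -> str:
    """Turn UTF-8-encoded surrogate pairs back into real characters."""
    # Step 1: decode as UTF-8, letting encoded surrogates pass through.
    with_surrogates = raw.decode("utf-8", "surrogatepass")
    # Step 2: emit the surrogates as UTF-16 code units, then decode them
    # as genuine UTF-16 so each high/low pair becomes one code point.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


# Hypothetical sample: plain ASCII followed by the bytes from the answer.
fixed = repair_double_encoded(b"dance \xed\xa0\xbd\xed\xb2\x83")
print(fixed.encode("utf-8"))  # b'dance \xf0\x9f\x92\x83'
```

An unpaired surrogate in the input would make the final decode raise `UnicodeDecodeError`, so real data may need extra handling.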