How to avoid Twitter emoticon character while processing string in Java

心不动则不痛 submitted on 2019-12-13 13:27:33

Question


I'm working on processing Tweets from Twitter and storing them in a database (MySQL).

I have my process running perfectly but sometimes I get an error like this one:

2012-08-31 08:11:23,303 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - SQL Error: 1366, SQLState: HY000
2012-08-31 08:11:23,304 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - Incorrect string value: '\xF0\x9F\x98\x9D #...' for column 'twe_text' at row 1

When looking for the problematic tweet in my logs I find the following one:

2012-08-31 08:11:22,971 INFO com.myapp.TweetLoaderJob  - Text for tweet 241175722096480256: RT @totallytoyosi_: My goodies, my goodies, not your goodies  <U+1F61D> #m&ms #sweeties #goodies #food  @ The Ritzy Cinema Café, Brixton htt ...

And finally, looking up what the hell <U+1F61D> is, I discovered that it is an emoticon that Twitter sends as-is.

I have debugged, looking only at this specific tweet, and my Eclipse does not seem to recognize this character. So the question is: how can I handle this exception? I looked into reconfiguring my MySQL database, but I cannot change the encoding (it's a requirement), so my options are to skip this kind of tweet or to suppress this problematic character.

But how to do it, if Java does not recognize it?


Answer 1:


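You could filter your strings and remove the undesired part before storing them in your database. If the text literally contains the <U+1F61D> placeholder, a simple regexp like <U\+[^>]+> will do (the + after U has to be escaped, otherwise it is read as a quantifier). If the string holds the actual character, which the 4-byte sequence \xF0\x9F\x98\x9D in the error message suggests, you need to match code points above U+FFFF instead.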
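Both variants of the filtering idea above can be sketched in Java as follows. This is only a minimal sketch, assuming the column uses MySQL's 3-byte utf8 character set, which cannot store characters outside the Basic Multilingual Plane (code points above U+FFFF, such as U+1F61D). The class name TweetSanitizer and its method names are hypothetical and do not come from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class TweetSanitizer {

    // Matches a literal placeholder such as "<U+1F61D>" (4 to 6 hex digits).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("<U\\+[0-9A-Fa-f]{4,6}>");

    private TweetSanitizer() {
    }

    // Removes literal "<U+XXXX>" placeholders, as they appear in the log output.
    static String stripPlaceholders(String text) {
        return PLACEHOLDER.matcher(text).replaceAll("");
    }

    // Removes every supplementary code point (above U+FFFF). These are the
    // characters that need 4 bytes in UTF-8.
    static String stripSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!Character.isSupplementaryCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            }
            // Advance by 1 char for BMP characters, 2 for a surrogate pair.
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }
}
```

Calling TweetSanitizer.stripSupplementary(tweetText) right before the entity is persisted would drop the emoticon while leaving ordinary accented characters such as the é in "Café" untouched. Iterating over code points rather than chars keeps each surrogate pair together, so a pair is never split in half.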



Source: https://stackoverflow.com/questions/12214163/how-to-avoid-twitter-emoticon-character-while-processing-string-in-java
