R RMySQL query deforms japanese characters

一个人想着一个人 提交于 2020-01-13 19:13:36

问题


I am using RMySQL to connect to an aws MySQL server. It works, except character values are deformed. This question has been asked before but the fixes don't seem to work for me. Here's what I'm doing:

Make sure no connections are open:

dbListConnections(MySQL()) list()

Make sure my connection is set to use UTF-8:

dbGetQuery(credentials, "show variables like 'character_set%'")

             Variable_name                                     Value
1     character_set_client                                      utf8
2 character_set_connection                                      utf8
3   character_set_database                                      utf8
4 character_set_filesystem                                      utf8
5    character_set_results                                      utf8
6     character_set_server                                      utf8
7     character_set_system                                      utf8
8       character_sets_dir /rdsdbbin/mysql-5.5.40.R1/share/charsets/

Fetch data:

data <- dbGetQuery(credentials, Query) head(data) keyword_ja 1 \036 2 \036蜀ャ 3 \036螟\x8f 4 \036譌・譛ャ莠コ 5 \037繧、繝ゥ繧ケ繝\x88 6 \037蜿守ゥォ

When I write this data to disk Excel shows the same deformed characters, but notepad++ can somehow show the japanese as it's intended:

"keyword_ja"

"冬" "夏" "日本人" "イラスト" "収穫"

I've been trying to use functions like Encoding() and enc2utf8() in R to get it to display the characters correctly as notepad++ does, with no success.

Encoding(head(data$keyword_ja))

[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

enc2utf8(head(data$keyword_ja))

[1] "\036" "\036蜀ャ" "\036螟<8f>" "\036譌・譛ャ莠コ" "\037繧、繝ゥ繧ケ繝<88>" "\037蜿守ゥォ"

I can normally type japanese characters and R has no problem displaying them

Sys.getlocale() [1] "LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932" mystring <- "日本語入力できる" mystring [1] "日本語入力できる" Encoding(mystring) [1] "unknown"

I'm pretty desperate to figure this out so any help is very much appreciated. Please let me know if I can provide additional information.


回答1:


Based on this SO article, you might have to write your data to disk with UTF-8 encoding. Try this:

data <- dbGetQuery(credentials, Query)
con <- file('output.csv', encoding="utf8")
write.csv(data, file=con)

Then try opening output.csv in both Excel and Notepad++ and let us know the results. When you read this file back into R, it should hopefully behave as expected:

fread("test.csv", encoding="UTF-8")



回答2:


The only (awful) workaround I've found so far is to write the data to disk with write.csv, and then read it back in with fread from the data.table package. This gives results like this:

fread("test.csv", encoding="UTF-8") keyword_ja \036 \036冬 \036夏 \036日本人 \037イラスト

which is close to what I want, it has the correct japanese words but there are some hidden characters like \036 before them



来源:https://stackoverflow.com/questions/35788226/r-rmysql-query-deforms-japanese-characters

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!