R H2O package import csv file with Chinese characters

大憨熊 posted on 2019-12-12 23:09:02

Question


I have a large dataset in CSV format that I want to use to build a prediction model. Because of its size, I planned to use the h2o package in R to build the model. However, several columns of the data contain Simplified Chinese characters, and h2o has trouble reading them.

I've tried two different approaches. The first was to read the file directly with the h2o.importFile() function. However, this turns the Chinese characters into garbled text.

The second was to first bring the data into R using readr's read_csv or base R's read.csv. Once the data was loaded correctly in R, I tried to convert the data.frame into an H2O frame using the as.h2o function. However, this approach also produced garbled characters.

To illustrate, I've written the following piece of code as an example:

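library(h2o)

# Example data: a column of Chinese city names and a numeric column
dat <- data.frame(x = rep(c("北京", "上海"), 50),
                  y = rnorm(n = 100, mean = 10, sd = 3))

h2o.init(nthreads = -1)   # nthreads = -1 uses all available cores
h2o.dat <- as.h2o(dat)    # convert the data.frame into an H2O frame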

Answer 1:


I don't know if this is the best way, but I have worked with Korean data before and this is the process I generally follow. First, ensure that the data you need to read is encoded as UTF-8. Second, ensure that the locale is set to English. You can check the current locale with:

Sys.getlocale(category="LC_ALL")

You can then read the file using the statement below:

dat <- read.csv("Test.txt", header = TRUE, encoding = "UTF-8", stringsAsFactors = FALSE)

dat[,1]
[1] "北京" "上海" "北京" "上海"

dat
        X.U.FEFF.X Y
1 <U+5317><U+4EAC> 1
2 <U+4E0A><U+6D77> 2
3 <U+5317><U+4EAC> 3
4 <U+4E0A><U+6D77> 4

As you can see, when you print the entire data.frame the characters are shown as Unicode escape codes (such as <U+5317>), but you can still see the Chinese characters by extracting a single column as a vector, for example dat[,1].
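The column name X.U.FEFF.X in the output above suggests that the file starts with a UTF-8 byte order mark (BOM), which read.csv folded into the first header. The following is a minimal sketch, not part of the original answer, of one way to detect and strip it in base R. The helper name has_utf8_bom and the temporary file are made up for illustration, and the Chinese text is written as Unicode escapes so the script itself stays ASCII-only.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes EF BB BF
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Build a small UTF-8 file with a BOM to stand in for Test.txt
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
# "\u5317\u4eac" is "北京" and "\u4e0a\u6d77" is "上海"
writeBin(charToRaw("X,Y\n\u5317\u4eac,1\n\u4e0a\u6d77,2\n"), con)
close(con)

# "UTF-8-BOM" asks R to drop the BOM while reading
enc <- if (has_utf8_bom(path)) "UTF-8-BOM" else "UTF-8"
dat <- read.csv(path, fileEncoding = enc, stringsAsFactors = FALSE)
names(dat)
```

Note that fileEncoding re-encodes the input into the session's native encoding, so the Chinese text itself will only come through intact when the locale can represent it. This differs from the encoding argument used in the answer, which only marks the strings as UTF-8.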




Answer 2:


Your problem is only that R does not display the encoded characters inside H2O frames; the data inside the H2O frame is still fully preserved, exactly as in the original frame. If you open the H2O frame in the H2O Web/Flow UI, you will see that its contents are exactly the same as in the original frame. A screenshot in the original answer compared the results in RStudio, the R view window, and the H2O Flow UI.

Please follow the link below for a solution. Note that you must be able to update the locale on your machine to view those characters in H2O data frames:

how to read data in utf-8 format in R?
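To tell a display problem apart from real data corruption, one option is to compare code points instead of trusting what the console prints. This is a rough base R sketch, not taken from the answer. The helper names code_points and same_text are hypothetical, and the H2O round trip itself is not run here.

```r
# "北京" and "上海" written as Unicode escapes so the script is ASCII-only
original <- c("\u5317\u4eac", "\u4e0a\u6d77")

# Code points of each string, formatted like R's <U+XXXX> escapes
code_points <- function(x) {
  lapply(enc2utf8(x), function(s) sprintf("U+%04X", utf8ToInt(s)))
}

# TRUE when two character vectors hold the same characters,
# however the console chooses to print them
same_text <- function(a, b) identical(code_points(a), code_points(b))

print(code_points(original)[[1]])   # "U+5317" "U+4EAC"
```

With a running H2O cluster, the same check could be applied to the column read back with as.data.frame(h2o.dat), converted with as.character, to see whether the values still match dat$x.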




Answer 3:


I would consider this a bug since R's data.frame can display the characters, but at the same time, the R H2OFrame cannot. I checked that this works for H2OFrames in Python, so it's an R issue only. I filed a bug here.



Source: https://stackoverflow.com/questions/41627290/r-h2o-package-import-csv-file-with-chinese-characters
