Question
I have a large dataset in CSV format that I want to use to build a prediction model. Because of its size, I planned to use the h2o package in R to build the model. However, several columns of the data.frame contain Simplified Chinese characters, and h2o has difficulty receiving the data.
I've tried two different approaches. The first reads the file directly with the h2o.importFile() function. However, this approach ends up converting the Chinese characters into garbled text.
The second approach first brings the data into R using readr's read_csv or base R's read.csv. After the data was loaded correctly into R, I tried to convert the data.frame into an H2O frame using the as.h2o function. However, this approach also produced garbled characters.
To illustrate, I've written the following code as an example:
require(h2o)
dat <- data.frame(x = rep(c("北京", "上海"), 50),
                  y = rnorm(mean = 10, sd = 3, n = 100))
h2o.init(nthreads = -1)
h2o.dat <- as.h2o(dat)
Answer 1:
I don't know if this is the best way, but I have worked with Korean data before and this is the process I generally follow. First, ensure that the file you need to read is encoded as UTF-8. Second, ensure that the locale is set to English. You can check the current locale with:
Sys.getlocale(category="LC_ALL")
You can then read the file using the statement below:
dat <- read.csv("Test.txt", header = TRUE, encoding = "UTF-8", stringsAsFactors = FALSE)
dat[,1]
[1] "北京" "上海" "北京" "上海"
dat
X.U.FEFF.X Y
1 <U+5317><U+4EAC> 1
2 <U+4E0A><U+6D77> 2
3 <U+5317><U+4EAC> 3
4 <U+4E0A><U+6D77> 4
As you can see, when you print the entire data.frame the strings are shown as escaped Unicode code points (<U+XXXX>), but you can still see the Chinese characters by extracting a column, for example dat[,1], and looking at the vector.
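One way to check that the <U+XXXX> output is only a display escape, and that the strings themselves are intact, is to inspect the stored code points rather than the printed data.frame. The following is a minimal sketch using base R only; the small data.frame and the variable names (same_strings, hex_labels) are made up for illustration and are not part of the original answer.

```r
# Build the two city names (Beijing, Shanghai) from Unicode escapes
# so the script itself stays ASCII and does not depend on the file encoding.
x <- c("\u5317\u4eac", "\u4e0a\u6d77")
dat <- data.frame(x = x, y = 1:2, stringsAsFactors = FALSE)

# The column still holds the original strings, whatever print(dat) shows.
same_strings <- identical(dat$x, x)

# Look at the stored code points directly instead of relying on the console.
code_points <- lapply(dat$x, utf8ToInt)
hex_labels <- vapply(code_points,
                     function(cp) paste(sprintf("U+%04X", cp), collapse = " "),
                     character(1))

print(same_strings)
print(hex_labels)
print(nchar(dat$x, type = "chars"))
```

If the code points match the ones shown in the escaped output, the data was read correctly and only the console rendering depends on the locale.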
Answer 2:
Your problem is only that R does not display the encoded characters inside H2O frames; the data inside the H2O frame is fully preserved, exactly as in the original frame. If you open the frame in the H2O Web/Flow UI, you will see that its contents are identical to the original. The following image shows the results in several places: RStudio, the R view window, and the H2O Flow UI.
Please follow the link below for a solution. Note that you must be able to update the locale on your machine to view those characters in H2O data frames:
how to read data in utf-8 format in R?
Answer 3:
I would consider this a bug since R's data.frame can display the characters, but at the same time, the R H2OFrame cannot. I checked that this works for H2OFrames in Python, so it's an R issue only. I filed a bug here.
Source: https://stackoverflow.com/questions/41627290/r-h2o-package-import-csv-file-with-chinese-characters