Question
I am working on a moderately sized data set (train_data) with more than 124 variables and 5,000,000 observations. For the categorical variables, I applied feature hashing using the hashed.model.matrix function from the FeatureHashing package in R.
## feature hashing
b <- 2 ^ 22
f <- ~ .-1
X_train <- hashed.model.matrix(f, train_data, hash.size=b)
As a result, I have a large dgCMatrix (a sparse matrix) as output (X_train). How can I use the H2O R wrapper on this matrix and run the different algorithms available in H2O? Does the H2O wrapper accept a sparse matrix (dgCMatrix)? Any link or example of such usage would be helpful. Thanks in anticipation.
I would like to import X_train into the H2O environment and perform steps like the following:
# initialize connection to H2O server
h2o.init(nthreads = -1)
train.hex <- h2o.uploadFile('./X_train', destination_frame='train')
# list of features for training
feature.names <- names(train.hex)
# train random forest model, use ntrees = 500
drf <- h2o.randomForest(x = feature.names, y = 'outcome', training_frame = train.hex, ntrees = 500)
Answer 1:
You could save your sparse matrix in SVMLight sparse format, then use:
train.hex <- h2o.uploadFile('./X_train', parse_type = "SVMLight", destination_frame='train')
h2o.importFile() also detects the SVMLight sparse format. It is a parallelized reader that loads the file on the server side from a location specified by the client, so the path must be readable by the H2O server:
train.hex <- h2o.importFile('./X_train', destination_frame='train')
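The answer does not show how to write the SVMLight file itself. Below is a minimal sketch using only base R and the Matrix package. The helper name write_svmlight and the label vector y are assumptions introduced for illustration; they are not part of H2O or FeatureHashing. Each output line has the form `label index:value index:value ...`, with 1-based feature indices in increasing order.

```r
library(Matrix)

# Hypothetical helper (not from any package): write a sparse matrix X and a
# label vector y to `path` in SVMLight format, one observation per line.
write_svmlight <- function(X, y, path) {
  stopifnot(length(y) == nrow(X))
  # Triplet view of the non-zero entries: 1-based row i, column j, value x.
  trip <- as.data.frame(summary(as(X, "CsparseMatrix")))
  # SVMLight expects feature indices in increasing order within each row.
  trip <- trip[order(trip$i, trip$j), ]
  pairs <- paste0(trip$j, ":", trip$x)
  # Collapse the "index:value" pairs row by row; rows with no non-zero
  # entries come back as NA and are replaced with an empty string.
  rows <- factor(trip$i, levels = seq_len(nrow(X)))
  feats <- tapply(pairs, rows, paste, collapse = " ")
  feats[is.na(feats)] <- ""
  writeLines(trimws(paste(y, feats)), path)
  invisible(path)
}

# Tiny example: 3 observations, 4 hashed features.
X_small <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1),
                        x = c(1, 2.5, 1), dims = c(3, 4))
y_small <- c(1, 0, 1)
svm_path <- tempfile(fileext = ".svmlight")
write_svmlight(X_small, y_small, svm_path)
```

The resulting file can then be passed to h2o.uploadFile() or h2o.importFile() as shown in the answer. With millions of rows this sketch builds every line in memory before writing, so writing in row chunks may be needed. It is also worth checking names(train.hex) after import to see which column holds the label before passing it as y.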
Source: https://stackoverflow.com/questions/38870109/how-to-use-h2o-on-feature-hashed-matrix-in-r