How do I read/write libsvm data into/from R
?
The libsvm
format is sparse data like
[
The question was asked a long time ago and has several answer. Most answers didn't work for me since my data comes in a long format, and I cant one-hot encode it in R. So here is my take. I wrote a function to one-hot encode the data, and save it without having to first transform the matrix into a sparse one.
RCPP code:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace Rcpp;
// Reading data frame from R and saving it as an libFM file
// [[Rcpp::export]]
std::string createNumber(int x, double y) {
std::string s1 = std::to_string(x);
std::string s2 = std::to_string(y);
std::string X_elem = s1 + ":" + s2;
return X_elem;
}
// [[Rcpp::export]]
std::string createRowLibFM(arma::rowvec row_to_fm, arma::vec factor_levels, arma::vec position) {
int n = factor_levels.n_elem;
std::string total = std::to_string(row_to_fm[0]);
for (int i = 1; i < n; i++) {
if (factor_levels[i] > 1) {
total = total + " " + createNumber(position[i - 1] + row_to_fm[i], 1);
}
if (factor_levels[i] == 1) {
total = total + " " + createNumber(position[i], row_to_fm[i]);
}
}
return total;
}
// [[Rcpp::export]]
void writeFile(std::string file, arma::mat all_data, arma::vec factor_levels) {
int n = all_data.n_rows;
arma::vec position = arma::cumsum(factor_levels);
std::ofstream temp_file;
temp_file.open (file.c_str());
for (int i = 0; i < n; i++) {
std::string temp_row = createRowLibFM(all_data.row(i), factor_levels, position);
temp_file << temp_row + "\n";
}
temp_file.close();
}
R function acting as wrapper for it:
writeFileFM <- function(temp.data, path = 'test.txt') {
### Dealing with y function
if (!(any(colnames(temp.data) %in% 'y'))) {
stop('No y column is given')
} else {
temp.data <- temp.data %>% select(y, everything()) ## y is required to be first column for writeFile
}
### Dealing with factors/strings
temp.classes <- sapply(temp.data, class)
class.num <- rep(0, length(temp.classes))
map.list <- list()
for (i in 2:length(temp.classes)) { ### since y is always the first column
if (any(temp.classes[i] %in% c('factor', 'character'))) {
temp.col <- as.factor(temp.data[ ,i]) ### incase it is character
temp.unique <- levels(temp.col)
factors.new <- seq(0, length(temp.unique) - 1, 1)
levels(temp.col) <- factors.new
temp.data[ ,i] <- temp.col
### Saving changes
class.num[i] <- length(temp.unique)
map.list[[i - 1]] <- data.frame('original.value' = temp.unique,
'transform.value' = factors.new)
} else {
class.num[i] <- 1 ### Numeric values require only 1 column
}
}
### Writing file
print('Writing file to disc')
writeFile(all_data = sapply(temp.data, as.numeric), file = path, factor_levels = class.num)
return(map.list)
}
Comparing it on fake data.
### Creating data to save
set.seed(999)
n <- 10000
factor.lvl1 <- 3
factor.lvl2 <- 2
temp.data <- data.frame('x1' = sample(stri_rand_strings(factor.lvl1, 7), n, replace = TRUE),
'x2' = sample(stri_rand_strings(factor.lvl2, 4), n, replace = TRUE),
'x3' = rnorm(n),
'x4' = rnorm(n),
'y' = rnorm(n))
### Comparing to other method
library(data.table)
library(e1071)
microbenchmark::microbenchmark(
temp.data.table <- model.matrix( ~ 0 + x1 + x2 + x3 + x4, data = temp.data,
contrasts = list(x2 = contrasts(temp.data$x2, contrasts = FALSE))),
write.matrix.csr(temp.data.table, 'out.txt'),
writeFileFM(temp.data))
Results.
min lq mean median uq
1.3061 1.6725 1.890942 1.92475 2.07725
629.9863 653.4345 676.108548 672.52510 687.88330
270.8217 275.1353 283.537898 281.42100 289.39160
max neval cld
3.2328 100 a
793.7040 100 c
328.0863 100 b
It is faster than the e1071 option, and while that option fails when increasing the number of observations, the method suggested is still applicable.