read/write data in libsvm format

前端未结

关注

 7  2028

How do I read/write libsvm data into/from R?

The libsvm format is sparse data like


                      
              相关标签:


      
      
        
          7条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  北海茫月        
                
              
                            
                2020-11-30 12:02
              
            
            
                                                                       
The question was asked a long time ago and has several answer. Most answers didn't work for me since my data comes in a long format, and I cant one-hot encode it in R. So here is my take. I wrote a function to one-hot encode the data, and save it without having to first transform the matrix into a sparse one. 

RCPP code: 

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace Rcpp;

// Reading data frame from R and saving it as an libFM file

// [[Rcpp::export]] 
std::string createNumber(int x, double y) {
  std::string s1 = std::to_string(x); 
  std::string s2 = std::to_string(y); 
  std::string X_elem = s1 + ":" + s2; 
  return X_elem;
}

// [[Rcpp::export]]
std::string createRowLibFM(arma::rowvec row_to_fm, arma::vec factor_levels, arma::vec position) {
  int n = factor_levels.n_elem; 
  std::string total =  std::to_string(row_to_fm[0]); 
  for (int i = 1; i < n; i++) { 
    if (factor_levels[i] > 1) { 
      total = total + " " + createNumber(position[i - 1] + row_to_fm[i], 1);
    } 
    if (factor_levels[i] == 1) {
      total = total + " " + createNumber(position[i], row_to_fm[i]);
    }
  }
  return total; 
}

// [[Rcpp::export]]
void writeFile(std::string file, arma::mat all_data, arma::vec factor_levels) {
  int n = all_data.n_rows;
  arma::vec position = arma::cumsum(factor_levels);
  std::ofstream temp_file;
  temp_file.open (file.c_str());
  for (int i = 0; i < n; i++) {
    std::string temp_row = createRowLibFM(all_data.row(i), factor_levels, position);
    temp_file << temp_row + "\n";
  }
  temp_file.close();
}


R function acting as wrapper for it: 

writeFileFM <- function(temp.data, path = 'test.txt') { 
  ### Dealing with y function 
  if (!(any(colnames(temp.data) %in% 'y'))) { 
    stop('No y column is given')  
  } else { 
    temp.data <- temp.data %>% select(y, everything()) ## y is required to be first column for writeFile 
  }
  ### Dealing with factors/strings 
  temp.classes <- sapply(temp.data, class) 
  class.num    <- rep(0, length(temp.classes))
  map.list     <- list()
  for (i in 2:length(temp.classes)) { ### since y is always the first column 
    if (any(temp.classes[i] %in% c('factor', 'character'))) {
      temp.col         <- as.factor(temp.data[ ,i]) ### incase it is character 
      temp.unique      <- levels(temp.col)
      factors.new      <- seq(0, length(temp.unique) - 1, 1)
      levels(temp.col) <- factors.new 
      temp.data[ ,i]   <- temp.col
      ### Saving changes 
      class.num[i]  <- length(temp.unique)
      map.list[[i - 1]] <- data.frame('original.value'  = temp.unique, 
                                      'transform.value' = factors.new)
    } else { 
      class.num[i]  <- 1  ### Numeric values require only 1 column 
    }
  }
  ### Writing file 
  print('Writing file to disc')
  writeFile(all_data = sapply(temp.data, as.numeric), file = path, factor_levels = class.num)
  return(map.list) 
}


Comparing it on fake data. 

### Creating data to save 
set.seed(999)
n <- 10000 
factor.lvl1 <- 3
factor.lvl2 <- 2 
temp.data <- data.frame('x1' = sample(stri_rand_strings(factor.lvl1, 7), n, replace = TRUE),
                        'x2' = sample(stri_rand_strings(factor.lvl2, 4), n, replace = TRUE), 
                        'x3' = rnorm(n), 
                        'x4' = rnorm(n),
                        'y'  = rnorm(n))

### Comparing to other method 
library(data.table)
library(e1071)

microbenchmark::microbenchmark(
  temp.data.table <- model.matrix( ~ 0 + x1 + x2 + x3 + x4, data = temp.data,
                                   contrasts = list(x2 = contrasts(temp.data$x2, contrasts = FALSE))),
  write.matrix.csr(temp.data.table, 'out.txt'), 
  writeFileFM(temp.data))


Results. 

  min       lq       mean    median        uq
   1.3061   1.6725   1.890942   1.92475   2.07725
 629.9863 653.4345 676.108548 672.52510 687.88330
 270.8217 275.1353 283.537898 281.42100 289.39160
      max neval cld
   3.2328   100 a  
 793.7040   100   c
 328.0863   100  b 


It is faster than the e1071 option, and while that option fails when increasing the number of observations, the method suggested is still applicable. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复