read/write data in libsvm format

Asked by 慢半拍i on 2020-11-30 11:05

How do I read/write libsvm data into/from R?

The libsvm format is sparse data, one observation per line:

    <class/target> [<feature index>:<feature value>]*
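
For example, a file in this format might contain lines like the following (label first, then index:value pairs for the non-zero features only; the values here are made up):

    1 1:0.7 3:0.1 9:0.25
    0 2:0.5 4:1.0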

        
7 Answers
  • 2020-11-30 11:44

    Try these functions and examples:

    https://github.com/zygmuntz/r-libsvm-format-read-write

  • 2020-11-30 11:46

    I have been running a job using the zygmuntz solution on a dataset with 25k observations (rows) for almost 5 hours now, and it has only gotten through about 3k rows. It was taking so long that I coded this up in the meantime (based on zygmuntz's code):

    require(Matrix)

    read.libsvm = function( filename ) {
      content   = readLines( filename )
      num_lines = length( content )

      # one row per line for the label: the pseudo feature index -1 puts the
      # label in column 1 after the +2 shift below; take the whole first
      # token so multi-character labels (e.g. "10", "-1") work too
      labels    = sapply( strsplit( content, ' ', fixed = TRUE ), `[`, 1 )
      tomakemat = cbind( 1:num_lines, -1, labels )

      # loop over lines: split by spaces, drop the label,
      # then split the remaining "index:value" pairs
      makemat = rbind(
        tomakemat,
        do.call( rbind, lapply( 1:num_lines, function( i ) {
          line = strsplit( content[i], ' ', fixed = TRUE )[[1]]
          cbind( i, t( simplify2array( strsplit( line[-1], ':', fixed = TRUE ) ) ) )
        } ) )
      )
      class(makemat) = "numeric"

      # column 1 holds the label; a feature with index k lands in column k + 2
      yx = sparseMatrix( i = makemat[, 1],
                         j = makemat[, 2] + 2,
                         x = makemat[, 3] )
      yx
    }
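
    As a quick sanity check, a minimal round trip might look like this (a sketch; the temporary file and its contents are just for illustration):

    tmp = tempfile( fileext = ".libsvm" )
    writeLines( c( "1 1:0.5 3:2.0",
                   "0 2:1.5" ), tmp )
    m = read.libsvm( tmp )
    m   # a sparse matrix: label in column 1, feature index k in column k + 2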
    

    This ran in minutes on the same machine (there may have been memory issues with the zygmuntz solution too, but I'm not sure). Hope this helps anyone with the same problem.

    Remember, if you need to do big computations in R, VECTORIZE!

    EDIT: fixed an indexing error I found this morning.

  • 2020-11-30 11:47

    I went with a two-hop solution: convert the R data to another format first, and then to LIBSVM:

    1. Used the R package foreign to convert (and write out) the data frame to ARFF format (I modified write.arff so that its internal write.table call uses na="0.0" instead of na="?", otherwise step 2 fails)
    2. Used https://github.com/dat/svm-tools/blob/master/arff2svm.py to convert the ARFF file to LIBSVM format

    My data set is 200K x 500 and this only took 3-5 minutes.
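
    If you would rather not patch write.arff itself, a sketch of the same idea (df, data.arff and the conversion step are placeholders, assuming numeric feature columns) is to zero-fill the NAs before writing:

    library(foreign)
    df[is.na(df)] <- 0                   # so the ARFF file contains 0.0 rather than "?"
    write.arff(df, file = "data.arff")
    # then convert data.arff with the arff2svm.py script linked above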

  • 2020-11-30 11:48
    Based on some comments, I am adding this as an answer so it's easier for others to use. This is for writing data in libsvm format.

    A function to write a data.frame to SVMlight format. I've added a train = {TRUE, FALSE} argument in case the data doesn't have labels; in that case, the class index is ignored.

    write.libsvm = function(data, filename = "out.dat", class = 1, train = TRUE) {
      out = file(filename)
      if (train) {
        writeLines(apply(data, 1, function(X) {
          # non-zero feature columns, excluding the class column itself
          feat = setdiff(which(X != 0), class)
          paste(X[class],
                apply(cbind(feat, X[feat]), 1, paste, collapse = ":"),
                collapse = " ")
        }), out)
      } else {
        # no labels: write 1 as a placeholder and treat every column as a feature
        writeLines(apply(data, 1, function(X) {
          feat = which(X != 0)
          paste('1',
                apply(cbind(feat, X[feat]), 1, paste, collapse = ":"),
                collapse = " ")
        }), out)
      }
      close(out)
    }
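
    For example, a sketch using the built-in iris data (the file name and the numeric recoding of the label are just for illustration):

    df = iris
    df$Species = as.integer(df$Species)   # labels must be numeric
    write.libsvm(df, "iris.libsvm", class = 5, train = TRUE)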
    

    ** EDIT **

    Another option, in case you already have the data in a data.table object:

    libFM and SVMlight use the same format, so this function should work for both.

    library(data.table)
    
    data.table.fm <- function(data = X, fileName = "../out.fm", target = "y_train",
                              train = TRUE) {
      # recode a logical / 0-1 target to the 1 / -1 labels expected downstream
      if (train) {
        if (is.logical(data[[target]]) |
            sum(levels(factor(data[[target]])) == levels(factor(c(0, 1)))) == 2) {
          data[[target]][data[[target]] == TRUE]  = 1
          data[[target]][data[[target]] == FALSE] = -1
        }
      }
      specChar      = "\\(|\\)|\\||\\:"
      specCharSpace = "\\(|\\)|\\||\\:| "
      # strip characters that clash with the output format from the names
      parsingNames <- function(x) {
        ret = c()
        for (el in x) ret = append(ret, gsub(specCharSpace, "_", el))
        ret
      }
      parsingVar <- function(x, keepSpace, hard_parse) {
        if (!keepSpace)
          spch = specCharSpace
        else spch = specChar
        if (hard_parse)
          gsub("(^_( *|_*)+)|(^_$)|(( *|_*)+_$)|( +_+ +)", " ",
               gsub(specChar, "_", gsub("(^ +)|( +$)", "", x)))
        else gsub(spch, "_", x)
      }
      setnames(data, names(data), parsingNames(names(data)))
      target = parsingNames(target)
      # turn a column into "columnindex:value" strings, leaving zeros empty
      format_vw <- function(column, formater) {
        ifelse(as.logical(column), sprintf(formater, j, column), "")
      }
      all_vars = names(data)[!names(data) %in% target]
      cat("Reordering data.table if class isn't first\n")
      target_inx = which(names(data) %in% target)
      rest_inx = which(!names(data) %in% target)
      cat("Adding Variable names to data.table\n")
      for (j in rest_inx) {
        column = data[[j]]
        formater = "%s:%f"
        set(data, i = NULL, j = j, value = format_vw(column, formater))
        cat(sprintf("Fixing %s\n", j))
      }
      data = data[, c(target_inx, rest_inx), with = FALSE]
      drop_extra_space <- function(x) {
        gsub(" {1,}", " ", x)
      }
      cat("Pasting data - Removing extra spaces\n")
      data = apply(data, 1, function(x) drop_extra_space(paste(x, collapse = " ")))
      cat("Writing to disk\n")
      write.table(data, file = fileName, sep = " ", row.names = FALSE,
                  col.names = FALSE, quote = FALSE)
    }
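
    For example, a sketch with the built-in mtcars data, using the 0/1 column am as the target (the file name is arbitrary; note the function modifies its data.table argument by reference via setnames/set):

    dt = as.data.table(mtcars)
    data.table.fm(dt, fileName = "mtcars.fm", target = "am", train = TRUE)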
    
  • 2020-11-30 11:50

    e1071 is off the shelf:

    install.packages("e1071")
    library(e1071)
    read.matrix.csr(...)
    write.matrix.csr(...)
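
    For example (a sketch; the file names are placeholders):

    library(e1071)
    # with labels present, read.matrix.csr() returns a list with components
    # x (a SparseM matrix.csr) and y (the labels)
    d <- read.matrix.csr("train.libsvm")
    write.matrix.csr(d$x, file = "copy.libsvm", y = d$y)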
    

    It even has a special vignette, Support Vector Machines—the Interface to libsvm in package e1071.

    r.vw is bundled with vowpal_wabbit

    Note: it is implemented in R, not in C, so it is dog-slow.

  • 2020-11-30 11:51

    I came up with my own ad hoc solution leveraging some data.table utilities; it ran in almost no time on the test data set I found (the Boston housing data).

    Converting that to a data.table (orthogonal to the solution, but added here for easy reproducibility):

    library(data.table)
    x = fread("/media/data_drive/housing.data.fw",
              sep = "\n", header = FALSE)
    #usually fixed-width conversion is harder, but everything here is numeric
    columns =  c("CRIM", "ZN", "INDUS", "CHAS",
                 "NOX", "RM", "AGE", "DIS", "RAD", 
                 "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
    DT = with(x, fread(paste(gsub("\\s+", "\t", V1), collapse = "\n"),
                       header = FALSE, sep = "\t",
                       col.names = columns))
    

    Here it is:

    DT[ , fwrite(as.data.table(paste0(
      MEDV, " | ", sapply(transpose(lapply(
        names(.SD), function(jj)
          paste0(jj, ":", get(jj)))),
        paste, collapse = " "))), 
      "/path/to/output", col.names = FALSE, quote = FALSE),
      .SDcols = !"MEDV"]
    #what gets sent to as.data.table:
    #[1] "24 | CRIM:0.00632 ZN:18 INDUS:2.31 CHAS:0 NOX:0.538 RM:6.575 
    #  AGE:65.2 DIS:4.09 RAD:1 TAX:296 PTRATIO:15.3 B:396.9 LSTAT:4.98 MEDV:24"      
    #[2] "21.6 | CRIM:0.02731 ZN:0 INDUS:7.07 CHAS:0 NOX:0.469 RM:6.421 
    #  AGE:78.9 DIS:4.9671 RAD:2 TAX:242 PTRATIO:17.8 B:396.9 LSTAT:9.14 MEDV:21.6"
    # ...
    

    There may be a better way to get this understood by fwrite than as.data.table, but I can't think of one (until setDT works on vectors).

    I replicated this to test its performance on a bigger data set (just blow up the current data set):

    DT2 = rbindlist(replicate(1000, DT, simplify = FALSE))
    

    The operation was pretty fast compared to some of the times reported here (I haven't bothered comparing directly yet):

    system.time(.)
    #    user  system elapsed 
    #   8.392   0.000   8.385 
    

    I also tested using writeLines instead of fwrite, but the latter was better.


    Looking at this again, it might take a while to figure out what's going on, so maybe the magrittr-piped version will be easier to follow:

    library(magrittr)  # provides the %>% pipe used below

    DT[ , 
        #1) prepend each column's values with the column name
        lapply(names(.SD), function(jj)
          paste0(jj, ":", get(jj))) %>%
          #2) transpose this list (using data.table's fast tool)
          #   (was column-wise, now row-wise)
          #3) concatenate columns, separated by " "
          transpose %>% sapply(paste, collapse = " ") %>%
          #4) prepend each row with the target value
          #   (with Vowpal Wabbit in mind, separate with a pipe)
          paste0(MEDV, " | ", .) %>%
          #5) convert this to a data.table to use fwrite
          as.data.table %>%
          #6) fwrite it; exclude nonsense column name,
          #   and force quotes off
          fwrite("/path/to/data", 
                 col.names = FALSE, quote = FALSE),
      .SDcols = !"MEDV"]
    

    Reading in such files is much easier.**

    #quickly read data; don't split within lines
    x = fread("/path/to/data", sep = "\n", header = FALSE)
    
    #tstrsplit is transpose(strsplit(.))
    dt1 = x[ , tstrsplit(V1, split = "[| :]+")]
    
    #even columns have variable names
    nms = c("target_name", 
            unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L), 
                       with = FALSE]))
    
    #odd columns have values
    DT = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
    #add meaningful names
    setnames(DT, nms)
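
    The columns read back this way are character; one extra step (a sketch) converts them to numeric:

    DT = DT[ , lapply(.SD, as.numeric)]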
    

    **This will not work with "ragged"/sparse input data; I don't think there's a way to extend this to work in such cases.
