Directly creating dummy variable set in a sparse matrix in R

清酒与你 2020-11-29 10:30

Suppose you have a data frame with a large number of columns (1000 factors, each with 15 levels). You'd like to create a dummy variable data set, but since it would be too sp

2 answers
  • 2020-11-29 11:08

    This can be done slightly more compactly with Matrix::sparse.model.matrix, although the requirement to keep a column for every level of every variable (rather than dropping a reference level) makes things a little more difficult.

    Generate input:

    set.seed(123)
    n <- 6
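    # stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no
    # longer converted to factors by default
    df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                     y = sample(c("D", "E"),      n, TRUE),
                     stringsAsFactors = TRUE)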
    

    If you don't need a column for every level of every variable (with a single formula, only the first factor keeps all of its levels), you can simply do:

    library(Matrix)
    sparse.model.matrix(~ . - 1, data = df)
    

    If you need all columns:

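    # One intercept-free formula per variable, e.g. ~ x - 1
    fList <- lapply(names(df), reformulate, intercept = FALSE)
    # One sparse model matrix per variable, each keeping all of its levels
    mList <- lapply(fList, sparse.model.matrix, data = df)
    # cBind() is defunct in current versions of Matrix; cbind() handles sparse matrices
    do.call(cbind, mList)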
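
    To make the difference between the two variants concrete, here is a minimal sketch that counts the resulting columns. The input values are hard-coded rather than sampled, and the names m_single and m_full are only illustrative; neither appears in the answer above.

    ```r
    library(Matrix)

    # Hypothetical input with fixed values (no random sampling involved)
    df <- data.frame(x = factor(c("A", "C", "B", "C", "C", "A")),
                     y = factor(c("E", "E", "E", "D", "E", "D")))

    # Single formula: only the first factor keeps all of its levels
    m_single <- sparse.model.matrix(~ . - 1, data = df)

    # One formula per variable: every factor keeps all of its levels
    f_list <- lapply(names(df), reformulate, intercept = FALSE)
    m_list <- lapply(f_list, sparse.model.matrix, data = df)
    m_full <- do.call(cbind, m_list)

    ncol(m_single)  # 4: three columns for x, one for y
    ncol(m_full)    # 5: three columns for x, two for y
    ```

    With 1000 factors the single-formula version would therefore be short by roughly one column per factor, which is why the per-variable construction is used when every level needs its own column.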
    
  • 2020-11-29 11:13

    Thanks for clarifying your question. Try this.

    Here is sample data with two columns that have three and two levels respectively:

    set.seed(123)
    n <- 6
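    # stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no
    # longer converted to factors by default
    df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                     y = sample(c("D", "E"),      n, TRUE),
                     stringsAsFactors = TRUE)
    # Output as originally posted; R >= 3.6.0 changed the default sampling
    # method, so the same seed may give different values there.
    #   x y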
    # 1 A E
    # 2 C E
    # 3 B E
    # 4 C D
    # 5 C E
    # 6 A D
    
    library(Matrix)
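    # One indicator matrix per factor: row = observation, column = level code.
    # dims keeps a column for trailing levels that never occur.
    spm <- lapply(df, function(j) sparseMatrix(i = seq_along(j),
                                               j = as.integer(j), x = 1,
                                               dims = c(length(j), nlevels(j))))
    # cBind() is defunct in current versions of Matrix; use cbind()
    do.call(cbind, spm)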
    # 6 x 5 sparse Matrix of class "dgCMatrix"
    #               
    # [1,] 1 . . . 1
    # [2,] . . 1 . 1
    # [3,] . 1 . . 1
    # [4,] . . 1 1 .
    # [5,] . . 1 . 1
    # [6,] 1 . . 1 .
    

    Edit: @user20650 pointed out that the do.call(...) approach above was sluggish or failed with large data. Here is a more complex but much faster and more efficient approach, which builds the row and column indices for the whole matrix in one go:

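    n <- nrow(df)
    nlev <- sapply(df, nlevels)        # number of levels per column
    i <- rep(seq_len(n), ncol(df))     # row index, repeated once per column
    # Column index: level code within each factor, shifted by the total
    # number of levels in all preceding factors
    j <- unlist(lapply(df, as.integer)) +
         rep(cumsum(c(0, head(nlev, -1))), each = n)
    x <- 1
    # dims keeps trailing columns for levels that never occur
    sparseMatrix(i = i, j = j, x = x, dims = c(n, sum(nlev)))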
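
    As a sanity check, the sketch below builds the matrix both ways on a small hard-coded input and confirms that they agree. The names offset, fast and slow are illustrative only, and the check assumes there are no missing values in the factors.

    ```r
    library(Matrix)

    # Hypothetical input with fixed values and explicit levels
    df <- data.frame(x = factor(c("A", "C", "B", "C", "C", "A"), levels = c("A", "B", "C")),
                     y = factor(c("E", "E", "E", "D", "E", "D"), levels = c("D", "E")))

    n      <- nrow(df)
    nlev   <- sapply(df, nlevels)            # x = 3, y = 2
    offset <- cumsum(c(0, head(nlev, -1)))   # x starts at column 1, y at column 4

    # Single call: all row and column indices computed up front
    i <- rep(seq_len(n), ncol(df))
    j <- unlist(lapply(df, as.integer), use.names = FALSE) + rep(offset, each = n)
    fast <- sparseMatrix(i = i, j = j, x = 1, dims = c(n, sum(nlev)))

    # Per-column construction, bound together afterwards
    slow <- do.call(cbind, lapply(df, function(f)
      sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
                   dims = c(length(f), nlevels(f)))))

    # Same entries in both, and exactly one 1 per original column in every row
    stopifnot(all(as.matrix(fast) == as.matrix(slow)),
              all(rowSums(fast) == ncol(df)))
    ```

    The offset vector is what lets a single sparseMatrix() call replace the repeated binding step: each factor's level codes are shifted past the columns already claimed by the factors before it.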
    