Directly creating dummy variable set in a sparse matrix in R

前端未结

关注

 2  1384

Suppose you have a data frame with a high number of columns(1000 factors, each with 15 levels). You\'d like to create a dummy variable data set, but since it would be too sp

相关标签:

2条回答

庸人自扰

2020-11-29 11:08
This can be done slightly more compactly with Matrix:::sparse.model.matrix, although the requirement to have all columns for all variables makes things a little more difficult.

Generate input:
```
set.seed(123)
n <- 6
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE))
```
If you didn't need all columns for all variables you could just do:
```
library(Matrix)
sparse.model.matrix(~.-1,data=df)
```
If you need all columns:
```
fList <- lapply(names(df),reformulate,intercept=FALSE)
mList <- lapply(fList,sparse.model.matrix,data=df)
do.call(cBind,mList)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

臣服心动

2020-11-29 11:13

Thanks for having clarified your question, try this.

Here is sample data with two columns that have three and two levels respectively:

set.seed(123)
n <- 6
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE))
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D

library(Matrix)
spm <- lapply(df, function(j)sparseMatrix(i = seq_along(j),
                                          j = as.integer(j), x = 1))
do.call(cBind, spm)
# 6 x 5 sparse Matrix of class "dgCMatrix"
#               
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .

Edit: @user20650 pointed out do.call(cBind, ...) was sluggish or failing with large data. So here is a more complex but much faster and efficient approach:

n <- nrow(df)
nlevels <- sapply(df, nlevels)
i <- rep(seq_len(n), ncol(df))
j <- unlist(lapply(df, as.integer)) +
     rep(cumsum(c(0, head(nlevels, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x)

0 讨论(0)