Suppose you have a data frame with a large number of columns (1000 factors, each with 15 levels). You'd like to create a dummy variable data set, but since it would be too sp
This can be done slightly more compactly with Matrix::sparse.model.matrix, although the requirement to have all columns for all variables makes things a little more difficult.
Generate input:
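set.seed(123)
n <- 6
# stringsAsFactors = TRUE makes x and y factors; since R 4.0.0 data.frame()
# no longer converts strings to factors by default
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"), n, TRUE),
                 stringsAsFactors = TRUE)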
If you didn't need all columns for all variables you could just do:
library(Matrix)
sparse.model.matrix(~.-1,data=df)
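To see why this one-liner does not give all columns for all variables, here is a minimal sketch on a small hand-written data frame (the name `toy` and its values are made up for illustration, not taken from the question). With `~ . - 1`, only the first factor keeps every level; each later factor loses its first level to the default contrast coding.

```r
library(Matrix)

# Hypothetical toy data with explicit factors (not the sampled df above)
toy <- data.frame(x = factor(c("A", "C", "B", "C")),
                  y = factor(c("E", "E", "D", "E")))

m <- sparse.model.matrix(~ . - 1, data = toy)

# x contributes one column per level (3); y contributes only nlevels(y) - 1 = 1,
# because its first level "D" is absorbed as the reference level
colnames(m)
ncol(m)  # 4 columns rather than 3 + 2 = 5
```

This is the gap that the per-variable approach is meant to close.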
If you need all columns:
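# One intercept-free formula per variable, so no level is dropped as a contrast
fList <- lapply(names(df), reformulate, intercept = FALSE)
mList <- lapply(fList, sparse.model.matrix, data = df)
# cBind() is deprecated in Matrix; with R >= 3.2.0, cbind() handles sparse matrices
do.call(cbind, mList)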
Thanks for clarifying your question. Try this.
Here is sample data with two columns that have three and two levels respectively:
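set.seed(123)
n <- 6
# stringsAsFactors = TRUE is needed in R >= 4.0.0: as.integer() below relies on
# factor codes. The printed values come from the original answer; sample() gives
# different draws for the same seed from R 3.6.0 onward.
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"), n, TRUE),
                 stringsAsFactors = TRUE)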
# x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D
library(Matrix)
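# One n x nlevels(j) indicator block per factor; dims keeps columns for unused levels
spm <- lapply(df, function(j) sparseMatrix(i = seq_along(j),
                                           j = as.integer(j), x = 1,
                                           dims = c(length(j), nlevels(j))))
# cBind() is deprecated in Matrix; with R >= 3.2.0, cbind() handles sparse matrices
do.call(cbind, spm)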
# 6 x 5 sparse Matrix of class "dgCMatrix"
#
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .
Edit: @user20650 pointed out that binding the per-column matrices with do.call() was sluggish or failed with large data. So here is a more complex but much faster and more efficient approach that builds the whole matrix in a single sparseMatrix() call:
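n <- nrow(df)
nlev <- sapply(df, nlevels)       # number of levels per factor
i <- rep(seq_len(n), ncol(df))    # row index, repeated once per column of df
# column index: level code shifted by the number of levels in all preceding factors
j <- unlist(lapply(df, as.integer)) +
  rep(cumsum(c(0, head(nlev, -1))), each = n)
# dims keeps trailing columns even when the last levels never occur
sparseMatrix(i = i, j = j, x = 1, dims = c(n, sum(nlev)))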
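The index arithmetic is easier to follow on a tiny example. The sketch below uses a hand-written data frame (`toy`, `offsets` and `labels` are illustrative names, not part of the answer above) and also attaches column names of the form variable plus level, which is an addition of mine rather than something the original code does.

```r
library(Matrix)

# Hypothetical toy data with explicit factors
toy <- data.frame(x = factor(c("A", "C", "B", "C")),
                  y = factor(c("E", "E", "D", "E")))

n       <- nrow(toy)
nlev    <- vapply(toy, nlevels, integer(1))   # x has 3 levels, y has 2
offsets <- cumsum(c(0L, head(nlev, -1)))      # x starts at column 0 + 1, y at 3 + 1

i <- rep(seq_len(n), ncol(toy))
# Level codes of x are 1, 3, 2, 3; those of y are 2, 2, 1, 2, shifted by 3
j <- unlist(lapply(toy, as.integer), use.names = FALSE) + rep(offsets, each = n)

# Assumed naming scheme: <variable><level>, e.g. "xA", "yD"
labels <- unlist(Map(function(nm, f) paste0(nm, levels(f)), names(toy), toy),
                 use.names = FALSE)

m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(n, sum(nlev)),
                  dimnames = list(NULL, labels))
m
```

Each row of `m` should contain exactly one 1 per original variable, which makes `rowSums(m) == ncol(toy)` a cheap sanity check on larger inputs.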