create a sparse matrix; given the indices of non-zero elements for creation of dummy variables of a categorical column of a large dataset

懵懂的女人 提交于 2019-12-02 13:41:51

问题


I'm trying to use a sparse matrix to generate dummy variables for a set of data with 5.8 million rows and two categorical columns.

The structure of the data is:

mydata: data.table of 5,800,000 rows and two categorical (in integer format) variables Var1 and Var2

nlevel(Var1) : 210,000 (levels include all numbers between 1 and 210,000)

nlevel(Var2) : 500 (levels include all numbers between 1 and 500)

here's an example of mydata:

 Var_1      Var_2
   1          4
   1          2
   2          7
   5          9
   5          500
   .
   .
   .

  200         6
  200         2
  200         80
   .
   .
   .

I'm using a sparse Matrix (sparse_Mx) to create the dummy variable matrix which would be of the form:

Var_1       Var_2_level_1     Var_2_level_2   . . .    Var_2_level_500
  1                0                   1                    0
  2                0                   0                    0
  3                1                   1                    0
  4                0                   0                    0
  5                0                   0                    1

  .
  .
  .

 200              0                    1                    0
  .
  .
  .

210,000           ...                 ...                  ...

I didn't know how to do this efficiently, so i used a for-loop to create the dummy variable matrix:

library(Matrix) #for sparse matrices
m2 <- Matrix(0, nrow = 210000, ncol = 500 , sparse = TRUE) 

for (i in 1: nrow(mydata))
  sparse_Mx[ mydata[i, Var_1] , mydata[i, Var_2] ] <- 1

It basically goes through each row of mydata, and based on the row Var1 value (which determines the row in the matrix) and the row Var2 value (which determines the column number in the matrix, fills the sparse matrix with 1.

It works, except it's taking forever (as the for-loop has to go through 5,800,000 loops!)

Is there any way to do this more efficiently? I really dislike using for-loop for this purpose but couldn't think of another way to do this.


Edit: I'd like to add that I have tried using sparse.model.matrix(), to no avail. the generated matrix is not in the right format (210,000 rows and 500 columns).

The variables were converted to factors and used the following:

sp_mx <- sparse.model.matrix( ~ . -1 , data = mydata)

However, I get a sparse matrix of [5,800,000 x 500 ] as opposed to a matrix of [210,000 x 500]

I've tried many variations and still same result:

sp_mx <- sparse.model.matrix( ~ Var2 -1 , data = mydata)

or

 sp_mx <- sparse.model.matrix(Var1 ~ Var2 -1 , data = mydata)

all of them result in a sparse matrix with all rows. what i need is a [210,000 x 500] matrix that will have more than one 1 in each row.


回答1:


Try this:

spmat<-Matrix(0,nrow = 210000 ,ncol = 500,sparse = T)
locs<-Matrix(data=c(mydata$Var_1,mydata$Var_2),byrow=F,ncol=2)
spmat[locs]=1



回答2:


Why do you want a sparse matrix? For a dummy matrix you can also just use:

model.matrix(~ . + 0, data = df)

The 0 indicates no intercept and the . indicates that all categorical variables will be transformed. Be sure to set these variables as factors using as.factor() beforehand.



来源:https://stackoverflow.com/questions/44692603/create-a-sparse-matrix-given-the-indices-of-non-zero-elements-for-creation-of-d

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!