I have several data sets with 75,000 observations and a type
variable that can take on a value 0-4. I want to add five new dummy variables to each data set f
Drew, this is much faster and shouldn't cause any crashes.
> binom <- data.frame(data=runif(1e5),type=sample(0:4,1e5,TRUE))
> for(t in unique(binom$type)) {
+ binom[paste("type",t,sep="")] <- ifelse(binom$type==t,1,0)
+ }
> head(binom)
data type type2 type4 type1 type3 type0
1 0.11787309 2 1 0 0 0 0
2 0.11884046 4 0 1 0 0 0
3 0.92234950 4 0 1 0 0 0
4 0.44759259 1 0 0 1 0 0
5 0.01669651 2 1 0 0 0 0
6 0.33966184 3 0 0 0 1 0
ifelse is vectorized, so if I understand your code correctly, you don't need that sapply. And I wouldn't use merge; I would use SQLite or PostgreSQL.
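To illustrate the vectorization point: the comparison binom$type == t already returns a whole logical vector, so a dummy column can be built in one step without sapply. A minimal sketch, assuming the same binom layout as in the answer above (the seed and the small row count are only there to keep the example reproducible and short):

```r
set.seed(1)  # seed chosen only for reproducibility of this illustration
binom <- data.frame(data = runif(10), type = sample(0:4, 10, TRUE))

# The comparison is evaluated for every row at once; as.integer() turns
# TRUE/FALSE into 1/0. The values match ifelse(binom$type == 2, 1, 0),
# apart from the storage type (integer rather than double).
type2 <- as.integer(binom$type == 2)

all(type2 == ifelse(binom$type == 2, 1, 0))  # TRUE
```

Either form works inside the loop; as.integer() just skips the extra pass that ifelse makes over the vector.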
Some sample data would help too :-)
If you're open to using the data.table package, mltools has a one_hot() function.
library(data.table)
library(mltools)
binom <- data.table(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
one_hot(binom)
y x catVar_0 catVar_1 catVar_2 catVar_3 catVar_4
1: 0.90511891 0.83045050 0 0 1 0 0
2: 0.91375984 0.73273830 0 0 0 1 0
3: 0.01926608 0.10301409 0 0 1 0 0
4: 0.48691138 0.24428157 0 1 0 0 0
5: 0.60660396 0.09132816 0 0 1 0 0
---
99996: 0.12908356 0.26157731 0 1 0 0 0
99997: 0.96397273 0.98959000 0 1 0 0 0
99998: 0.16818414 0.37460941 1 0 0 0 0
99999: 0.72610508 0.72055867 1 0 0 0 0
100000: 0.89710998 0.24155507 0 0 0 0 1
Usage
one_hot(dt, cols = "auto", sparsifyNAs = FALSE,
naCols = FALSE, dropCols = TRUE,
dropUnusedLevels = FALSE)
The cols argument selects which column(s) to one-hot-encode. The default, cols = "auto", encodes all unordered factor columns, so for this data the command below is equivalent. Naming the columns explicitly only matters when the data.table contains factors that should not be encoded.
one_hot(binom, cols="catVar")
You can use the dummies package:
binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
head(binom)
y x catVar
1 0.4143348 0.09721401 1
2 0.3140782 0.54340539 3
3 0.1262037 0.51820499 2
4 0.7159850 0.13167720 3
5 0.8203528 0.94116026 3
6 0.2169781 0.82020216 1
Solution:
library(dummies)
binom<-dummy.data.frame(binom)
head(binom)
y x catVar0 catVar1 catVar2 catVar3 catVar4
1 0.4143348 0.09721401 0 1 0 0 0
2 0.3140782 0.54340539 0 0 0 1 0
3 0.1262037 0.51820499 0 0 1 0 0
4 0.7159850 0.13167720 0 0 0 1 0
5 0.8203528 0.94116026 0 0 0 1 0
6 0.2169781 0.82020216 0 1 0 0 0
The nnet package for single-layer neural networks (which don't understand factors) has a conversion command: class.ind.
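A minimal sketch of how class.ind could be used here, assuming a plain vector of type codes (the values below are made up for illustration). It returns a 0/1 matrix with one column per distinct value, named after the values themselves, so the columns are renamed before being attached to a data frame:

```r
library(nnet)  # a recommended package, shipped with standard R installations

type <- c(2, 4, 4, 1, 2, 3, 0)  # made-up type codes in the 0-4 range
ind <- class.ind(type)          # one indicator column per distinct value

# The columns come back named "0", "1", ...; give them usable names.
colnames(ind) <- paste0("type", colnames(ind))

head(ind)
```

The result is a matrix, so cbind(your_data_frame, ind) would be one way to add the dummies to the original data.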
R has a "sub-language" for translating formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a cardinal predictor x, a categorical predictor catVar, and a response y.
> binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
> head(binom)
y x catVar
1 0.5051653 0.34888390 2
2 0.4868774 0.85005067 2
3 0.3324482 0.58467798 2
4 0.2966733 0.05510749 3
5 0.5695851 0.96237936 1
6 0.8358417 0.06367418 2
You just do
> A <- model.matrix(y ~ x + catVar,binom)
> head(A)
(Intercept) x catVar1 catVar2 catVar3 catVar4
1 1 0.34888390 0 1 0 0
2 1 0.85005067 0 1 0 0
3 1 0.58467798 0 1 0 0
4 1 0.05510749 0 0 1 0
5 1 0.96237936 1 0 0 0
6 1 0.06367418 0 1 0 0
Done.
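Note that the matrix above has no catVar0 column: with an intercept in the formula, the default treatment contrasts drop the first factor level as the baseline. If all five dummy variables are wanted, as in the question, one option is to remove the intercept. A minimal sketch, assuming the same column names as above (the seed, the small row count, and the explicit levels are only for illustration):

```r
set.seed(1)  # only so the illustration is reproducible
binom <- data.frame(y = runif(10), x = runif(10),
                    catVar = factor(sample(0:4, 10, TRUE), levels = 0:4))

# Dropping the intercept (- 1) makes model.matrix emit one indicator
# column per level of catVar instead of omitting the first level.
D <- model.matrix(~ catVar - 1, binom)

colnames(D)  # "catVar0" "catVar1" "catVar2" "catVar3" "catVar4"
```

The result is again a matrix; cbind(binom, D) would attach the dummies to the original data frame.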