What is an efficient way (any solution including non-base packages welcomed) to collapse dummy variables back into a factor.
race.White race.Hispanic race
Another idea:
ff = function(x)
{
ans = integer(nrow(x))
for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
names(x)[ans]
}
sub("[^.]+\\.", "", ff(dat))
#[1] "White" "Asian" "White" "Black" "Asian" "Hispanic" "White" "White" "White" "Black"
And to compare with akrun's alternatives:
akrun1 = function(x) names(x)[max.col(x, "first")]
akrun2 = function(x) names(x)[(as.matrix(x) %*% seq_along(x))[, 1]]
akrun3 = function(x) names(x)[do.call(pmax, x * seq_along(x)[col(x)])]
akrunlike = function(x) names(x)[do.call(pmax, Map("*", x, seq_along(x)))]
DF = setNames(as.data.frame("[<-"(matrix(0L, 1e4, 1e3),
cbind(seq_len(1e4), sample(1e3, 1e4, TRUE)),
1L)),
paste("fac", 1:1e3, sep = ""))
identical(ff(DF), akrun1(DF))
#[1] TRUE
identical(ff(DF), akrun2(DF))
#[1] TRUE
identical(ff(DF), akrun3(DF))
#[1] TRUE
identical(ff(DF), akrunlike(DF))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF), akrun1(DF), akrun2(DF),
akrun3(DF), akrunlike(DF),
as.matrix(DF), col(DF), times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# ff(DF) 61.99124 64.56194 78.62267 102.18424 152.64891 30
# akrun1(DF) 296.89042 314.28641 327.95059 353.46185 394.46013 30
# akrun2(DF) 103.76105 114.01497 120.12191 129.86513 166.13266 30
# akrun3(DF) 1141.46478 1163.96842 1178.92961 1203.83848 1231.70346 30
# akrunlike(DF) 125.47542 130.20826 141.66123 157.92743 203.42331 30
# as.matrix(DF) 19.46940 20.54543 28.22377 35.69575 87.06001 30
# col(DF) 103.61454 112.75450 116.00120 126.09138 176.97435 30
I included as.matrix()
and col()
just to show that "list"-y structures can be convenient on efficient looping as is. E.g., in contrast to a by-row looping, a way to use by-column looping doesn't need time to transform the structure of data.
We can use max.col
to get the column index, subset the column names based on that and use sub
to remove the prefix.
sub('[^.]+\\.', '', names(dat)[max.col(dat)])
#[1] "White" "Asian" "White" "Black" "Asian" "Hispanic"
#[7] "White" "White" "White" "Black"
Here, I assumed that there is a single 1
per each row. If there are multiple 1s, we can use the option ties.method='first'
or ties.method='last'
.
Or another option is doing the %*%
with the sequence of columns, subset the column names, and remove the prefix with sub
.
sub('[^.]+\\.', '', names(dat)[(as.matrix(dat) %*%seq_along(dat))[,1]])
Or we can use pmax
sub('[^.]+\\.', '', names(dat)[do.call(pmax,dat*seq_along(dat)[col(dat)])])