I would like to split a data frame with thousands of columns. The data frame looks like this:
# sample data of four columns
sample <-read.table(stdin(),header
I guess the interesting data is really a matrix
m = as.matrix(sample[,-1])
The underlying data is then a vector with relatively few unique values; we'll map the unique values to their integer representation, using the map where possible to minimize the number of iterations in any loops that are necessary
s = as.character(m)
map = lapply(strsplit(setNames(unique(s), unique(s)), "/"), as.integer)
Here's the number of times each row needs to be replicated
row.len = apply(matrix(sapply(map, max)[s], ncol=ncol(m)), 1, max) + 1
and the offsets into s
of each row
offset = head(c(1, cumsum(rep(row.len, ncol(m))) + 1), -1)
Calculate the values of each mapped element, and the index of the value in s
v = unlist(unname(map)[match(s, names(map))])
idx = rep(offset, sapply(map, length)[s]) + v
Finally, allocate the result matrix of NA's, and update the non-NA values
ans = matrix(NA_integer_, sum(row.len), ncol(m))
ans[idx] = v
As a function:
flatten <- function(sample) {
m = as.matrix(sample[,-1])
s = as.character(m)
map = lapply(strsplit(setNames(unique(s), unique(s)), "/"), as.integer)
row.len = apply(matrix(sapply(map, max)[s], ncol=ncol(m)), 1, max) + 1
offset = head(c(1, cumsum(rep(row.len, ncol(m))) + 1), -1)
v = unlist(unname(map)[match(s, names(map))])
idx = rep(offset, sapply(map, length)[s]) + v
ans = matrix(NA_integer_, sum(row.len), ncol(m),
dimnames=list(NULL, colnames(sample)[-1]))
ans[idx] = v
cbind(POS=rep(sample[,1], row.len), as.data.frame(ans))
}
The slowest part of this will be the apply
function for calculating row.len
. Some timing (I guess the dimensions are not correct for the problem...)
xx = do.call(rbind, replicate(10000, sample, simplify=FALSE))
dim(xx)
## [1] 30000 5
system.time(flatten(xx))
## user system elapsed
## 0.192 0.000 0.194
versus about 5s for the data.table solution above.