Split thousands of columns at a time by '/' on multiple lines, sort the values in the new rows and add 'NA' values

前端 未结 4 1657
逝去的感伤
逝去的感伤 2021-01-22 21:24

I would like to split a data frame with thousands of columns. The data frame looks like this:

# sample data of four columns
sample <-read.table(stdin(),header         


        
4条回答
  •  梦毁少年i
    2021-01-22 21:52

    I guess the interesting data is really a matrix

    m = as.matrix(sample[,-1])
    

    The underlying data is then a vector with relatively few unique values; we'll map the unique values to their integer representation, using the map where possible to minimize the number of iterations in any loops that are necessary

    s = as.character(m)
    map = lapply(strsplit(setNames(unique(s), unique(s)), "/"), as.integer)
    

    Here's the number of times each row needs to be replicated

    row.len = apply(matrix(sapply(map, max)[s], ncol=ncol(m)), 1, max) + 1
    

    and the offsets into s of each row

    offset = head(c(1, cumsum(rep(row.len, ncol(m))) + 1), -1)
    

    Calculate the values of each mapped element, and the index of the value in s

    v = unlist(unname(map)[match(s, names(map))])
    idx = rep(offset, sapply(map, length)[s]) + v
    

    Finally, allocate the result matrix of NA's, and update the non-NA values

    ans = matrix(NA_integer_, sum(row.len), ncol(m))
    ans[idx] = v
    

    As a function:

    flatten <- function(sample) {
        m = as.matrix(sample[,-1])
        s = as.character(m)
        map = lapply(strsplit(setNames(unique(s), unique(s)), "/"), as.integer)
        row.len = apply(matrix(sapply(map, max)[s], ncol=ncol(m)), 1, max) + 1
        offset = head(c(1, cumsum(rep(row.len, ncol(m))) + 1), -1)
        v = unlist(unname(map)[match(s, names(map))])
        idx = rep(offset, sapply(map, length)[s]) + v
        ans = matrix(NA_integer_, sum(row.len), ncol(m),
            dimnames=list(NULL, colnames(sample)[-1]))
        ans[idx] = v
        cbind(POS=rep(sample[,1], row.len), as.data.frame(ans))
    }
    

    The slowest part of this will be the apply function for calculating row.len. Some timing (I guess the dimensions are not correct for the problem...)

    xx = do.call(rbind, replicate(10000, sample, simplify=FALSE))
    dim(xx)
    ## [1] 30000     5
    system.time(flatten(xx))
    ##   user  system elapsed 
    ##  0.192   0.000   0.194 
    

    versus about 5s for the data.table solution above.

提交回复
热议问题