Split thousands of columns at a time by '/' on multiple lines, sort the values in the new rows and add 'NA' values

逝去的感伤 2021-01-22 21:24

I would like to split a data frame with thousands of columns. The data frame looks like this:

# sample data of four columns
sample <-read.table(stdin(),header         


        
4 answers
  •  借酒劲吻你
    2021-01-22 21:54

    Here is a solution using data.table:

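    library("data.table")
    dt <- data.table(df)
    fun <- function(DT) {
      # one string per column in this row, split on "/"
      split <- strsplit(vapply(DT, as.character, character(1L)), "/")
      # the largest value in the row determines how many output rows are needed
      max.len <- max(as.numeric(unlist(split)))
      lapply(split, function(x) {
        x <- as.numeric(x)
        x[match(0:max.len, x)]  # value k lands at position k + 1, NA if absent
      })
    }
    dt[, fun(.SD), by=POS]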
    #    POS v1 v2 v3 v4
    # 1: 152  0  0  0  0
    # 2: 152 NA  1 NA  1
    # 3: 152 NA NA  2  2
    # 4:  73 NA  0  0  0
    # 5:  73  1 NA  1  1
    # 6: 185  0 NA  0  0
    # 7: 185 NA  1 NA NA
    # 8: 185 NA NA NA NA
    # 9: 185 NA NA  3 NA
    

    The idea is to use data.table to execute our function fun against the data elements of each row (i.e. excluding POS). data.table will stitch back POS for our modified result.

    Here fun starts by converting each data row to a character vector and then splitting it by /. This produces a list with one item per column, each a character vector with one more element than there were / characters in that cell.

    Finally, lapply cycles through each of these list items and converts them all to vectors of the same length: each value is placed at the position matching its number and the gaps are filled with NA, which also leaves the values sorted.

    data.table recognizes the resulting list as representing columns for our result set, and adds back the POS column as noted earlier.
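
    The per-row steps described above can be traced in base R on a single row, without data.table. This is only a sketch: the row below is hypothetical (its values were chosen so the result has the same shape as the POS 152 rows in the output shown earlier), and the names row, parts and aligned are made up for illustration.

    ```r
    # Hypothetical single row of four cells (values made up for illustration)
    row <- c(v1 = "0", v2 = "0/1", v3 = "0/2", v4 = "0/1/2")

    # Step 1: split each cell on "/" to get a list of character vectors
    parts <- strsplit(row, "/")

    # Step 2: the largest value in the row decides how many output rows are needed
    max.len <- max(as.numeric(unlist(parts)))

    # Step 3: for each cell, put value k at position k + 1 and NA where k is absent
    aligned <- lapply(parts, function(x) {
      x <- as.numeric(x)
      x[match(0:max.len, x)]
    })

    # Equal-length vectors can be viewed as columns
    as.data.frame(aligned)
    ```

    Because every vector in the list ends up with length max.len + 1, the list can be read as a set of columns, which is the property data.table relies on when it assembles the result.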


    EDIT: the following addresses a question in the comments:

    val <- "0/2/3:25:0.008,0.85,0.002:0.004,0.013,0.345"
    first.colon <- strsplit(val, ":")[[1]][[1]]
    strsplit(first.colon, "/")[[1]]
    # [1] "0" "2" "3"
    

    The key thing to understand is that strsplit returns a list with as many elements as there are items in your input vector. In this toy example there is only one item in the vector, so there is only one item in the list, though each item is a character vector that can have multiple values (in this case, 3 after we split by /). So something like this should work (but I haven't tested or debugged it):

    dt <- data.table(df)
    fun <- function(DT) {
      split <- strsplit(vapply(DT, as.character, character(1L)), ":")
      split.2 <- vapply(split, `[[`, character(1L), 1)  # get just first value from `:` split
      split.2 <- strsplit(split.2, "/")
      lapply(split.2, 
        function(x, max.len) as.numeric(x)[match(0:max.len, as.numeric(x))],
        # take the max over the "/" values only; the other ":" fields are not single numbers
        max.len=max(as.numeric(unlist(split.2)))
      )
    }
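
    As a rough check of the colon handling, the same steps can be run in base R on one hypothetical row. The first cell reuses the val string from the toy example; the second cell and the names row, fields, first, parts and aligned are assumptions made for illustration. In this sketch max.len is taken from the "/"-split values only.

    ```r
    # Hypothetical row; only the part before the first ":" holds the "/"-separated values
    row <- c(v1 = "0/2/3:25:0.008,0.85,0.002:0.004,0.013,0.345",
             v2 = "1:12:0.5,0.5:0.1,0.9")

    fields <- strsplit(row, ":")                       # split each cell on ":"
    first  <- vapply(fields, `[[`, character(1L), 1)   # keep the first field of each cell
    parts  <- strsplit(first, "/")                     # then split that field on "/"

    # Use only the "/"-split values here: fields such as "0.008,0.85,0.002"
    # are not single numbers and would turn into NA under as.numeric()
    max.len <- max(as.numeric(unlist(parts)))

    aligned <- lapply(parts, function(x) {
      x <- as.numeric(x)
      x[match(0:max.len, x)]
    })
    aligned
    ```

    Each element of aligned has length max.len + 1, so the result has the same column-like shape as in the version without colons.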
    
