I would like to split a data frame with thousands of columns. The data frame looks like this:
# sample data of four columns
sample <-read.table(stdin(),header
Here is a solution using data.table:
library("data.table")
dt <- data.table(df)
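fun <- function(DT) {
  # DT is the one-row .SD for a given POS: turn it into a character vector,
  # then split every cell on "/"
  split <- strsplit(vapply(DT, as.character, character(1L)), "/")
  # For each column, put every value in the slot matching its number (0..max.len)
  lapply(split,
    function(x, max.len) as.numeric(x)[match(0:max.len, as.numeric(x))],
    max.len = max(as.numeric(unlist(split)))
  )
}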
dt[, fun(.SD), by=POS]
# POS v1 v2 v3 v4
# 1: 152 0 0 0 0
# 2: 152 NA 1 NA 1
# 3: 152 NA NA 2 2
# 4: 73 NA 0 0 0
# 5: 73 1 NA 1 1
# 6: 185 0 NA 0 0
# 7: 185 NA 1 NA NA
# 8: 185 NA NA NA NA
# 9: 185 NA NA 3 NA
The idea is to use data.table to execute our function fun against the data elements of each row (i.e. excluding POS). data.table will stitch POS back onto our modified result.

Here fun starts by converting each data row to a character vector and then splitting by /, which produces a list in which each item is a character vector with one more element than there were / characters in that cell.

Finally, lapply cycles through each of these list items, converting them all to vectors of the same length, filling in with NA, and sorting.

data.table recognizes the resulting list as representing columns for our result set, and adds back the POS column as noted earlier.
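To make the per-row step concrete, here is a minimal base R sketch of what fun does to a single row, outside of data.table. The values in row are made up for illustration (they are not copied from the question's data), chosen so that the result resembles the POS 152 group in the output above.

```r
# Hypothetical single row of "/"-separated values (illustrative only)
row <- c(v1 = "0", v2 = "0/1", v3 = "0/2", v4 = "0/1/2")

# Split each cell on "/": one character vector per column, names are kept
parts <- strsplit(row, "/")

# Largest value seen anywhere in the row; each result has max.len + 1 slots
max.len <- max(as.numeric(unlist(parts)))

# match(0:max.len, x) gives the position of each of 0..max.len in x
# (NA when absent), so indexing x by it puts every value in its own slot
aligned <- lapply(parts, function(x) {
  x <- as.numeric(x)
  x[match(0:max.len, x)]
})

aligned$v2  # 0  1 NA
aligned$v3  # 0 NA  2
```

Because every element of aligned has the same length, data.table can treat the list as columns and return max.len + 1 rows for that POS.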
EDIT: the following addresses a question in the comments:
val <- "0/2/3:25:0.008,0.85,0.002:0.004,0.013,0.345"
first.colon <- strsplit(val, ":")[[1]][[1]]
strsplit(first.colon, "/")[[1]]
# [1] "0" "2" "3"
The key thing to understand is that strsplit returns a list with as many elements as there are items in your input vector. In this toy example there is only one item in the vector, so there is only one item in the list, though each item is a character vector that can have multiple values (in this case, 3 after we split by /). So something like this should work (but I haven't tested or debugged it):
dt <- data.table(df)
fun <- function(DT) {
  split <- strsplit(vapply(DT, as.character, character(1L)), ":")
  split.2 <- vapply(split, `[[`, character(1L), 1) # get just first value from `:` split
  split.2 <- strsplit(split.2, "/")
  lapply(split.2,
    function(x, max.len) as.numeric(x)[match(0:max.len, as.numeric(x))],
    max.len = max(as.numeric(unlist(split.2))) # use the "/"-split values; the raw ":" fields are not all numeric
  )
}
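As a rough check of the extraction step on its own, the sketch below runs the ":" split and the first-field selection on a vector with more than one item, using base R only. The second value in vals is hypothetical and exists only to show that strsplit returns one list element per input string.

```r
# First value is the one from the example above; the second is made up
vals <- c("0/2/3:25:0.008,0.85,0.002:0.004,0.013,0.345",
          "0/1:12:0.1,0.2")

# One list element per input string, each holding its ":"-separated fields
fields <- strsplit(vals, ":")

# Keep only the first field of each element, as a plain character vector
first <- vapply(fields, `[[`, character(1L), 1)

# Split those first fields on "/" to get the individual values
alleles <- strsplit(first, "/")
```

Here first is a character vector with one entry per input string, and alleles is a list of the same length, which is the shape the lapply call in fun expects.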