Merge multiple data tables with duplicate column names

感情败类 2020-12-29 10:19

I am trying to merge (join) multiple data tables (obtained with fread from 5 csv files) to form a single data table. I get an error when I try to merge 5 data tables, but wo

6 Answers
  • 2020-12-29 10:41

    Using reshaping gives you a lot more flexibility in how you want to name your columns.

    library(dplyr)
    library(tidyr)
    
    list(DT1, DT2, DT3, DT4, DT5) %>%
      bind_rows(.id = "source") %>%
      mutate(source = paste("y", source, sep = ".")) %>%
      spread(source, y)
    

    Or, with a named list, so that the table names end up in the column names:

    library(dplyr)
    library(tidyr)
    
    list(DT1 = DT1, DT2 = DT2, DT3 = DT3, DT4 = DT4, DT5 = DT5) %>%
      bind_rows(.id = "source") %>%
      mutate(source = paste(source, "y", sep = ".")) %>%
      spread(source, y)
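
    spread() has been superseded in recent tidyr releases. Assuming tidyr 1.0.0 or later is installed, the same reshape can be sketched with pivot_wider(). The two small tables below are hypothetical stand-ins for the question's data, not the original CSV contents.

    ```r
    library(dplyr)
    library(tidyr)

    # Hypothetical stand-ins: same key column x, one value column y
    DT1 <- data.frame(x = letters[1:3], y = 10:12)
    DT2 <- data.frame(x = letters[1:3], y = 11:13)

    wide <- list(DT1 = DT1, DT2 = DT2) %>%
      bind_rows(.id = "source") %>%                  # stack, recording the list name
      pivot_wider(names_from = source,               # one column per source table
                  values_from = y,
                  names_prefix = "y.")               # columns become y.DT1, y.DT2
    ```

    The names_prefix argument plays the role of the paste() call in the spread() versions.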
    
  • 2020-12-29 10:47

    Alternatively, you could rename the columns with setNames before merging, like this:

    dts = list(DT1, DT2, DT3, DT4, DT5)
    names(dts) = paste('DT', c(1:5), sep = '')    
    
    dtlist = lapply(names(dts),function(i) 
             setNames(dts[[i]], c('x', paste('y',i,sep = '.'))))
    
    Reduce(function(...) merge(..., all = TRUE), dtlist)
    
    #   x y.DT1 y.DT2 y.DT3 y.DT4 y.DT5
    #1: a    10    11    12    13    14
    #2: b    11    12    13    14    15
    #3: c    12    13    14    15    16
    #4: d    13    14    15    16    17
    #5: e    14    15    16    17    18
    #6: f    15    16    17    18    19
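
    A variant of the same idea, as a minimal sketch: rename by reference with data.table's setnames() instead of setNames(). The three small tables are hypothetical stand-ins for the question's data, and copy() is used only so the originals keep their column names.

    ```r
    library(data.table)

    # Hypothetical stand-ins: same key column x, duplicate value column y
    DT1 <- data.table(x = letters[1:3], y = 10:12)
    DT2 <- data.table(x = letters[1:3], y = 11:13)
    DT3 <- data.table(x = letters[1:3], y = 12:14)

    # Work on copies so the original tables are left untouched
    dts <- list(DT1 = copy(DT1), DT2 = copy(DT2), DT3 = copy(DT3))

    # Give every y column a unique name derived from the list name
    for (nm in names(dts)) setnames(dts[[nm]], "y", paste0("y.", nm))

    # With unique names, a plain full outer merge on x no longer collides
    merged <- Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), dts)
    ```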
    
  • 2020-12-29 10:51

    Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

    Reduce((function() {counter = 0
                        function(x, y) {
                          counter <<- counter + 1
                          d = merge(x, y, all = TRUE, by = 'x')
                          setnames(d, c(head(names(d), -1), paste0('y.', counter)))
                        }})(), list(DT1, DT2, DT3, DT4, DT5))
    #   x y.x y.1 y.2 y.3 y.4
    #1: a  10  11  12  13  14
    #2: b  11  12  13  14  15
    #3: c  12  13  14  15  16
    #4: d  13  14  15  16  17
    #5: e  14  15  16  17  18
    #6: f  15  16  17  18  19
    
  • 2020-12-29 11:00

    If it's just those 5 data tables (where x is the same in all of them), you could also use nested joins:

    # set the key for each datatable to 'x'
    setkey(DT1,x)
    setkey(DT2,x)
    setkey(DT3,x)
    setkey(DT4,x)
    setkey(DT5,x)
    
    # the nested join
    mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]
    

    Or as @Frank said in the comments:

    DTlist <- list(DT1,DT2,DT3,DT4,DT5)
    Reduce(function(X,Y) X[Y], DTlist)
    

    which gives:

       x y1 y2 y3 y4 y5
    1: a 10 11 12 13 14
    2: b 11 12 13 14 15
    3: c 12 13 14 15 16
    4: d 13 14 15 16 17
    5: e 14 15 16 17 18
    6: f 15 16 17 18 19
    

    This gives the same result as:

    mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
    
    > identical(mergedDT1,mergedDT2)
    [1] TRUE
    

    When the x columns do not all hold the same values, a nested join will not give the desired result, because X[Y] keeps only the rows of Y (here, ultimately those of the innermost table, DT6):

    DT1[DT2[DT3[DT4[DT5[DT6]]]]]
    

    this gives:

       x y1 y2 y3 y4 y5 y6
    1: b 11 12 13 14 15 15
    2: c 12 13 14 15 16 16
    3: d 13 14 15 16 17 17
    4: e 14 15 16 17 18 18
    5: f 15 16 17 18 19 19
    6: g NA NA NA NA NA 20
    

    While:

    Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))
    

    gives:

       x y1 y2 y3 y4 y5 y6
    1: a 10 11 12 13 14 NA
    2: b 11 12 13 14 15 15
    3: c 12 13 14 15 16 16
    4: d 13 14 15 16 17 17
    5: e 14 15 16 17 18 18
    6: f 15 16 17 18 19 19
    7: g NA NA NA NA NA 20
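
    The difference can be seen in isolation with two tiny tables. This is only a sketch; A and B are hypothetical names, not part of the data used above.

    ```r
    library(data.table)

    # Hypothetical tables whose key values only partly overlap
    A <- data.table(x = c("a", "b"), y1 = 1:2, key = "x")
    B <- data.table(x = c("b", "c"), y2 = 3:4, key = "x")

    # X[Y] looks up the rows of Y in X, so only the keys of B survive
    right_join <- A[B]

    # merge(all = TRUE) is a full outer join: the union of keys from A and B
    full_join <- merge(A, B, by = "x", all = TRUE)
    ```

    Here right_join has rows for b and c (with NA in y1 for c), while full_join has rows for a, b and c.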
    

    Used data:

    In order to make the code with Reduce work, I changed the names of the y columns.

    library(data.table)
    
    DT1 <- data.table(x = letters[1:6], y1 = 10:15)
    DT2 <- data.table(x = letters[1:6], y2 = 11:16)
    DT3 <- data.table(x = letters[1:6], y3 = 12:17)
    DT4 <- data.table(x = letters[1:6], y4 = 13:18)
    DT5 <- data.table(x = letters[1:6], y5 = 14:19)
    
    DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")
    
  • 2020-12-29 11:00

    stack and reshape

    I don't think this maps exactly to the merge function, but...

    mycols <- "x"
    DTlist <- list(DT1,DT2,DT3,DT4,DT5)
    
    dcast(rbindlist(DTlist,idcol=TRUE), paste0(paste0(mycols,collapse="+"),"~.id"))
    
    #    x  1  2  3  4  5
    # 1: a 10 11 12 13 14
    # 2: b 11 12 13 14 15
    # 3: c 12 13 14 15 16
    # 4: d 13 14 15 16 17
    # 5: e 14 15 16 17 18
    # 6: f 15 16 17 18 19
    

    I'm not sure whether this would extend to tables with more value columns than just y.
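
    As a hedged sketch of one way it might extend: data.table's dcast accepts several columns in value.var, which casts each of them in one call. The tables and the extra column z below are hypothetical, not taken from the question.

    ```r
    library(data.table)

    # Hypothetical tables with two value columns (y and z)
    DT1 <- data.table(x = letters[1:3], y = 10:12, z = 1:3)
    DT2 <- data.table(x = letters[1:3], y = 11:13, z = 4:6)

    # Stack the tables; id is 1, 2 according to the position in the list
    long <- rbindlist(list(DT1, DT2), idcol = "id")

    # Cast both value columns at once; result columns combine value name and id
    wide <- dcast(long, x ~ id, value.var = c("y", "z"))
    names(wide)
    ```

    With two value columns, the result should contain y_1, y_2, z_1 and z_2 alongside x.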

    merge-assign

    DT <- Reduce(function(...) merge(..., all = TRUE, by = mycols), 
      lapply(DTlist, function(d) d[, mycols, with = FALSE]))  # keep only the key columns
    
    for (k in seq_along(DTlist)){
      js = setdiff( names(DTlist[[k]]), mycols )
      DT[DTlist[[k]], paste0(js,".",k) := mget(paste0("i.",js)), on=mycols, by=.EACHI]
    }
    
    #    x y.1 y.2 y.3 y.4 y.5
    # 1: a  10  11  12  13  14
    # 2: b  11  12  13  14  15
    # 3: c  12  13  14  15  16
    # 4: d  13  14  15  16  17
    # 5: e  14  15  16  17  18
    # 6: f  15  16  17  18  19
    

    (I'm not sure if this fully extends to other cases. Hard to say because the OP's example really doesn't demand the full functionality of merge. In the OP's case, with mycols="x" and x being the same across all DT*, obviously a merge is inappropriate, as mentioned by @eddi. The general problem is interesting, though, so that's what I'm trying to attack here.)

  • 2020-12-29 11:00

    Another way of doing this:

    dts <- list(DT1, DT2, DT3, DT4, DT5)
    
    names(dts) <- paste("y", seq_along(dts), sep="")
    data.table::dcast(rbindlist(dts, idcol="id"), x ~ id, value.var = "y")
    
    #   x y1 y2 y3 y4 y5
    #1: a 10 11 12 13 14
    #2: b 11 12 13 14 15
    #3: c 12 13 14 15 16
    #4: d 13 14 15 16 17
    #5: e 14 15 16 17 18
    #6: f 15 16 17 18 19
    

    The explicit data.table::dcast ensures that the call returns a data.table rather than a data.frame, even if the reshape2 package is also loaded. Without the package prefix, the dcast function from reshape2 might be used instead, which works on a data.frame and returns a data.frame.
