Merge data frames whilst summing common columns in R

星月不相逢 2021-01-06 00:00

My problem is very similar to the one posted here.

The difference is that they knew which columns would conflict, whereas I need a generic method that won't k

3 Answers
  •  执念已碎
    2021-01-06 00:15

    If I understand correctly, you want a flexible method that does not require knowing which columns exist in each table aside from the columns you want to merge by and the columns you want to preserve. This may not be the most elegant solution, but here is an example function to suit your exact needs:

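    # Merge .df2 into the master table .df1, summing columns common to both.
    # .id_Columns: copied from .df1 as-is; .match_Columns: key columns for row matching.
    merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
        merged_Columns <- unique(c(names(.df1), names(.df2)))
        merged_df1 <- data.frame(matrix(nrow = nrow(.df1), ncol = length(merged_Columns)))
        names(merged_df1) <- merged_Columns
        # Build one key per row so that several match columns also work, then
        # find the first matching .df2 row for each .df1 row (NA if there is none)
        key1 <- do.call(paste, c(unname(as.list(.df1[.match_Columns])), sep = "\r"))
        key2 <- do.call(paste, c(unname(as.list(.df2[.match_Columns])), sep = "\r"))
        row_Match <- match(key1, key2)
        for (column in merged_Columns){
            if (column %in% .id_Columns || !column %in% names(.df2)){
                # id column, or column found only in .df1: copy from the master table
                merged_df1[[column]] <- .df1[[column]]
            } else if (!column %in% names(.df1)){
                # column found only in .df2: look it up through the matched rows
                merged_df1[[column]] <- .df2[[column]][row_Match]
            } else {
                # column in both tables: sum, with unmatched rows contributing 0
                df1_Values <- .df1[[column]]
                df2_Values <- .df2[[column]][row_Match]
                df2_Values[is.na(df2_Values)] <- 0
                merged_df1[[column]] <- df1_Values + df2_Values
            }
        }
        return(merged_df1)
    }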
    

    This function assumes that '.df1' is a master table of sorts and that you want to merge in data from a second table '.df2' whose rows match one or more rows in '.df1'. The columns to preserve from the master table are passed as a character vector '.id_Columns', and the columns used to match rows between the two tables are passed as a character vector '.match_Columns'.

    For your example, it would work like this:

    merge_Sum(table1, table2, c("Date","Time"), "Date")
    
    #   Date       Time  ColumnA ColumnB ColumnC
    # 1 01/01/2013 08:00     110     330       1
    # 2 01/01/2013 08:30     115     325       1
    # 3 01/01/2013 09:00     120     320       1
    # 4 02/01/2013 08:00     225     415       2
    # 5 02/01/2013 08:30     230     410       2
    # 6 02/01/2013 09:00     235     405       2
    

    In plain language, this function first collects the unique column names of both tables and creates an empty data frame with the same number of rows as the master table '.df1' to hold the merged data. The '.id_Columns', and any columns found only in '.df1', are copied from '.df1' unchanged. Columns found only in '.df2' are looked up through the row match on '.match_Columns'. For columns present in both tables, the value in '.df1' is added to the matching value in '.df2', and rows with no match in '.df2' contribute 0.
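
    The lookup-and-sum step can be tried in isolation. The following is only a minimal sketch on made-up toy tables ('master', 'lookup' and their values are hypothetical, not the OP's data); it shows how match() pairs rows and how unmatched rows end up adding 0:

```r
# Toy master table and lookup table (made-up values, for illustration only)
master <- data.frame(Date = c("d1", "d1", "d2", "d3"), ColumnA = c(1, 2, 3, 4))
lookup <- data.frame(Date = c("d1", "d2"), ColumnA = c(10, 20))

# For each master row, the index of the first lookup row with the same Date;
# "d3" has no partner in lookup, so its index is NA
idx <- match(master$Date, lookup$Date)

# Pull the lookup values through the index; unmatched rows come back as NA
lookup_values <- lookup$ColumnA[idx]

# Treat "no match" as 0 so the sum keeps the master value unchanged
lookup_values[is.na(lookup_values)] <- 0
summed <- master$ColumnA + lookup_values

print(summed)
```

    Because match() returns only the first hit, this covers the many-to-one and many-to-none cases, but not several '.df2' rows matching the same master row.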

    There is probably some package out there that does something similar, but most of them require knowledge of all the existing columns and how to treat them. As I said before, this is not the most elegant solution, but it is flexible and accurate.
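
    For comparison, something similar may be possible with base R's merge() alone. This is a rough sketch of a different technique, not the function above; the input tables are hypothetical values chosen to be consistent with the output shown earlier, and 'by_cols' and 'common' are names made up for this example:

```r
# Hypothetical inputs, chosen to be consistent with the output shown earlier
table1 <- data.frame(
  Date    = rep(c("01/01/2013", "02/01/2013"), each = 3),
  Time    = rep(c("08:00", "08:30", "09:00"), times = 2),
  ColumnA = c(10, 15, 20, 25, 30, 35),
  ColumnB = c(30, 25, 20, 15, 10, 5)
)
table2 <- data.frame(
  Date    = c("01/01/2013", "02/01/2013"),
  ColumnA = c(100, 200),
  ColumnB = c(300, 400),
  ColumnC = c(1, 2)
)

by_cols <- "Date"
# Columns present in both tables (other than the key) are the ones to sum
common <- setdiff(intersect(names(table1), names(table2)), by_cols)

# Left join: clashing columns come back as <name>.x and <name>.y
merged <- merge(table1, table2, by = by_cols, all.x = TRUE, suffixes = c(".x", ".y"))

for (col in common) {
  y <- merged[[paste0(col, ".y")]]
  y[is.na(y)] <- 0                               # no match in table2 adds 0
  merged[[col]] <- merged[[paste0(col, ".x")]] + y
  merged[[paste0(col, ".x")]] <- NULL            # drop the suffixed helpers
  merged[[paste0(col, ".y")]] <- NULL
}

print(merged)
```

    Note that merge() may reorder rows and columns, and that duplicate keys in the second table would duplicate master rows, so this is not a drop-in replacement.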

    Update: The original function assumed a many-to-one relationship between table1 and table2, that is, every row of table1 had a matching row in table2. The OP asked for it to also allow a many-to-none relationship, so the code has been updated with slightly less efficient but more flexible logic.
