Using one data.frame to update another


Given two data frames that are identical in terms of column names and data types, where some columns uniquely identify the rows, is there an efficient function or method for one data frame to update the other?

7 answers
  • 2020-12-20 12:35

    Using base R, you can use the function replace.df() below, which is loosely based on the source code of merge.data.frame(). Unlike some other solutions, it allows multiple identifying columns. I use it rather often in my job. Feel free to copy and use it.

    This function controls for cases where rows in y are not found in x. Mind that it does not check whether the key combinations are unique: match() only finds the first occurrence of a combination, so duplicated keys are not all replaced.

    The function is used as follows:

    > replace.df(original, replacement,by=c('Name','Id'))
      Name Id Value1 Value2
    1  joe  1    1.2     NA
    2 john  2    2.2    9.2
    

    Note that this effectively detects the typo in your original code: replacement contains a variable named 'value2' (lowercase v) instead of 'Value2' (capital V). After correcting this, the result becomes:

    > replace.df(original, replacement,by=c('Name','Id'))
      Name Id Value1 Value2
    1  joe  1    1.2     NA
    2 john  2    2.2    5.9
    

    You can also use the function to change the values in only some of the columns:

    > replace.df(original, replacement,by=c('Name','Id'),cols='Value2')
      Name Id Value1 Value2
    1  joe  1    1.2     NA
    2 john  2     NA    5.9
    

    The function:

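    replace.df <- function(x, y, by, cols = NULL) {
        nx <- nrow(x)
        ny <- nrow(y)
    
        # Build one string key per row from the identifying columns
        bx <- x[, by, drop = FALSE]
        by.y <- y[, by, drop = FALSE]
        bz <- do.call("paste", c(rbind(bx, by.y), sep = "\r"))
    
        bx <- bz[seq_len(nx)]
        by.y <- bz[nx + seq_len(ny)]
    
        # For each row of y, the first matching row of x (NA if there is none)
        idx <- match(by.y, bx)
        found <- !is.na(idx)
    
        if (is.null(cols)) {
            # Default: all shared columns except the key columns
            cols <- setdiff(intersect(names(x), names(y)), by)
        }
    
        # Pair each matched row of y with its row in x; unmatched rows are skipped
        x[idx[found], cols] <- y[found, cols]
        x
    }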
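Since the function does not check key uniqueness itself, it can be worth checking before calling it. The following is a minimal sketch, assuming made-up sample data and a hypothetical variable name by_cols; it only illustrates how duplicated key combinations can be detected and what match() does with them.

```r
# Hypothetical sample data: the key (john, 2) appears twice
x <- data.frame(Name = c("joe", "john", "john"),
                Id = c(1, 2, 2),
                Value1 = c(1.2, NA, 3.3))
by_cols <- c("Name", "Id")

# anyDuplicated() returns 0 when every key combination is unique
has_dups <- anyDuplicated(x[, by_cols, drop = FALSE]) > 0

# Build string keys the same way replace.df() does
keys <- do.call(paste, c(x[, by_cols, drop = FALSE], sep = "\r"))

# match() reports only the first row carrying the key
first_hit <- match("john\r2", keys)
```

Here has_dups is TRUE and first_hit points at the first of the two duplicated rows, so the second one would be left untouched by an update.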
    
  • 2020-12-20 12:41

    I wrote a function that uses the row-name indexing method (see the answer by John Colby). Hopefully it is useful whenever one data frame needs to be updated with the values from another.

    update.df.with.df <- function(original, replacement, key, value) 
    {
        ## PURPOSE: Update a data frame with the values in another data frame
        ## ----------------------------------------------------------------------
        ## ARGUMENT:
        ##   original: a data frame to update,
        ##   replacement: a data frame that has the updated values,
        ##   key: a character vector of variable names to form the unique key
        ##   value: a character vector of variable names to form the values that need to be updated
        ## ----------------------------------------------------------------------
        ## RETURN: The updated data frame from the old data frame "original". 
        ## ----------------------------------------------------------------------
        ## AUTHOR: Feiming Chen,  Date:  2 Dec 2015, 15:08
    
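        ## Build one string key per row. do.call(paste, ...) avoids the number
        ## padding that apply() can introduce when it converts rows to character.
        n1 <- rownames(original) <- do.call(paste, c(original[, key, drop=FALSE], sep="."))
        n2 <- rownames(replacement) <- do.call(paste, c(replacement[, key, drop=FALSE], sep="."))
    
        ## Keys present in both data frames. intersect() works on character
        ## vectors, so it does not depend on the keys being stored as factors.
        n4 <- intersect(n1, n2)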
    
        original[n4, value] <- replacement[n4, value] # update values on the common keys
        original
    }
    if (FALSE) {                            # Unit tests (not run)
        original <- data.frame(x=c(1, 2, 3), y=c(10, 20, 30))
        replacement <- data.frame(x=2, y=25)
        update.df.with.df(original, replacement, key="x", value="y") # data.frame(x=c(1, 2, 3), y=c(10, 25, 30))
    
        original <- data.frame(x=c(1, 2, 3), w=c("a", "b", "c"), y=c(10, 20, 30))
        replacement <- data.frame(x=2, w="b", y=25)
        update.df.with.df(original, replacement, key=c("x", "w"), value="y") # data.frame(x=c(1, 2, 3), w=c("a", "b", "c"), y=c(10, 25, 30))
    
        original = data.frame(Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,NA), Value2 = c(NA,9.2))
        replacement = data.frame(Name = c("john") , Id = 2 , Value1 = 2.2 , Value2 = 5.9)
        update.df.with.df(original, replacement, key="Id", value=c("Value1", "Value2"))
        ## goal = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,2.2), Value2 = c(NA,5.9) )
    }
    
  • 2020-12-20 12:43
    # limit replacement to elements that have a correspondence in original 
    existing = replacement[is.element(replacement$Id, original$Id),]
    # replace original at positions where IDs from existing match   
    original[match(existing$Id,original$Id),]=existing
    
  • 2020-12-20 12:48
    library(plyr)
    indexes_to_replace <- rownames(match_df(original,replacement,on='Id'))
    indexes_from_replace<-rownames(match_df(replacement,original,on='Id'))
    original[indexes_to_replace,] <- replacement[indexes_from_replace,]
    

    The on argument of match_df() also accepts a vector of column names, so multi-column keys work as well.
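One thing to check with a position-based assignment like the one above is that both index vectors list the matched rows in the same order. The sketch below uses base R only (no plyr) and made-up sample data; the names to_rows and from_rows are hypothetical. It aligns the replacement rows to the order of the original before assigning.

```r
# Hypothetical sample data: replacement lists its rows in a different order
original <- data.frame(Id = c(1, 2, 3), Value = c(10, 20, 30))
replacement <- data.frame(Id = c(3, 1), Value = c(33, 11))

# Rows of original that have a counterpart, in original's order
to_rows <- which(original$Id %in% replacement$Id)

# For each of those rows, the position of the same Id in replacement
from_rows <- match(original$Id[to_rows], replacement$Id)

updated <- original
updated[to_rows, ] <- replacement[from_rows, ]
```

Pairing the rows through match() keeps each Id with its own values even when the two data frames are sorted differently.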

  • 2020-12-20 12:50

    Just set a unique ID as the row names. Then it is simple indexing:

    rownames(original) = original$Id
    rownames(replacement) = replacement$Id
    
    original[rownames(replacement), ] = replacement
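If replacement may contain IDs that are missing from original, indexing by all of its row names would add those rows rather than only update existing ones, because data-frame assignment treats unknown character row names as new rows. A minimal sketch, assuming made-up sample data and a hypothetical variable named shared, restricts the update to IDs present in both:

```r
# Hypothetical sample data: Id 5 exists only in replacement
original <- data.frame(Id = c(1, 2), Value1 = c(1.2, NA), Value2 = c(NA, 9.2))
replacement <- data.frame(Id = c(2, 5), Value1 = c(2.2, 7.7), Value2 = c(5.9, 8.8))

rownames(original) <- original$Id
rownames(replacement) <- replacement$Id

# Only update rows whose ID occurs in both data frames
shared <- intersect(rownames(replacement), rownames(original))
original[shared, ] <- replacement[shared, ]
```

Dropping the intersect() step turns the same indexing into an update-or-append, which may or may not be what is wanted.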
    
  • 2020-12-20 12:52

    Here is an approach using the digest package.

    library(digest)
    # generate keys for each row using the md5 checksum based on first two columns
    check1 <- apply(original[,1:2], 1, digest)
    check2 <- apply(replacement[,1:2], 1, digest)
    
    # set goal to original and replace rows in replacement
    goal <- original
    goal[check1 %in% check2,] <- replacement
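A caveat with row-wise apply() on a data frame that mixes numeric and character columns: the rows are first converted to a character matrix, and numbers can be padded to a common width, so the same key may look different in two data frames. The sketch below uses base R only (no digest) and made-up sample data; the variable names are hypothetical. It compares apply()-based keys with keys built by paste(), which does not pad.

```r
# Hypothetical sample data: Ids of different widths in original
original <- data.frame(Id = 1:10, Name = letters[1:10])
replacement <- data.frame(Id = 2L, Name = "b")

# Row-wise keys via apply(): numbers are formatted per data frame
k_orig_apply <- apply(original, 1, paste, collapse = ".")
k_rep_apply <- apply(replacement, 1, paste, collapse = ".")

# Column-wise keys via paste(): no padding is introduced
k_orig <- do.call(paste, c(original, sep = "."))
k_rep <- do.call(paste, c(replacement, sep = "."))

hit_apply <- match(k_rep_apply, k_orig_apply)
hit_paste <- match(k_rep, k_orig)
```

With these data hit_paste finds row 2 while hit_apply comes back NA, so a check based on row-wise apply() would miss the row. Building the keys column-wise before hashing or matching avoids the problem.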
    