Identifying specific differences between two data sets in R

长发绾君心 2021-02-14 05:52

I would like to compare two data sets and identify specific instances of discrepancies between them (i.e., which variables were different).

While I have found out how t

4 answers
  • 2021-02-14 06:00

    One possibility. First, find out which ids both datasets have in common. The simplest way to do this is:

    commonID<-intersect(A$id,B$id)
    

    Then you can determine which rows are missing from A by:

    > B[!B$id %in% commonID,]
    #       id       name      dob vaccinedate vaccinename dose
    # 3 100002 John Smith 2/5/2010   7/13/2013        HEPB    3
    

    Next, you can restrict both datasets to the ids they have in common.

    Acommon<-A[A$id %in% commonID,]
    Bcommon<-B[B$id %in% commonID,]
    

    If you can't assume that the ids are in the same order in both datasets, sort them both:

    Acommon<-Acommon[order(Acommon$id),]
    Bcommon<-Bcommon[order(Bcommon$id),]
    

    Now that the rows line up, an element-wise comparison shows which fields differ:

    diffs<-Acommon != Bcommon
    diffs
    #      id  name   dob vaccinedate vaccinename  dose
    # 1 FALSE FALSE FALSE       FALSE       FALSE  TRUE
    # 2 FALSE  TRUE FALSE        TRUE       FALSE FALSE
    

    This is a logical matrix, so you can summarise or index it however you like. For example, to count the discrepancies in each column:

    colSums(diffs)
    #         id        name         dob vaccinedate vaccinename        dose 
    #          0           1           0           1           0           1 
    

    To find all ids where the name is different:

    Acommon$id[diffs[,"name"]]
    # [1] 100001
    

    And so on.
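
    Since A and B are not defined in this answer, here is a minimal, self-contained sketch of the same steps. The data frames and the `cell()` helper are made up for illustration. It assumes character columns rather than factors, because comparing factors whose level sets differ with `!=` fails in R. The last step turns the logical matrix into one row per discrepancy, so you can see which id and which variable differ and what both values are.

```r
# Hypothetical example data (character columns, not factors).
A <- data.frame(
  id   = c(100001L, 100000L),
  name = c("Jane Doe", "John Doe"),
  dose = c(1L, 4L),
  stringsAsFactors = FALSE
)
B <- data.frame(
  id   = c(100000L, 100001L, 100002L),
  name = c("John Doe", "Jane Doee", "John Smith"),
  dose = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Keep only the ids present in both data sets, in the same row order.
commonID <- sort(intersect(A$id, B$id))
Acommon <- A[match(commonID, A$id), ]
Bcommon <- B[match(commonID, B$id), ]
rownames(Acommon) <- NULL
rownames(Bcommon) <- NULL

# Logical matrix: TRUE where a cell differs between A and B.
diffs <- Acommon != Bcommon

# Row/column positions of every differing cell.
idx <- which(diffs, arr.ind = TRUE)

# Illustrative helper: pull the value of each flagged cell as text.
cell <- function(df, idx) {
  vapply(
    seq_len(nrow(idx)),
    function(k) as.character(df[[idx[k, "col"]]][idx[k, "row"]]),
    character(1)
  )
}

# One row per discrepancy: id, variable, and the value in each data set.
report <- data.frame(
  id       = commonID[idx[, "row"]],
  variable = colnames(diffs)[idx[, "col"]],
  value_A  = cell(Acommon, idx),
  value_B  = cell(Bcommon, idx),
  stringsAsFactors = FALSE
)
report
```

    With this toy data the report has two rows: the name for id 100001 and the dose for id 100000. If your columns are factors, converting them with as.character() first is one way to make the comparison work.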

  • 2021-02-14 06:12

    This should get you started, but there may be more elegant solutions.

    First, establish df1 and df2 so others can reproduce quickly:

    df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
    
    df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
    

    Next, get the discrepancies from df1 to df2 via mapply and setdiff. That is, what's in set one that's not in set two:

    discrep <- mapply(setdiff, df1, df2)
    discrep
    # $id
    # integer(0)
    # 
    # $name
    # [1] "Jane Doe"
    # 
    # $dob
    # character(0)
    # 
    # $vaccinedate
    # [1] "3/14/2013"
    # 
    # $vaccinename
    # character(0)
    # 
    # $dose
    # [1] 4
    

    To count them up we can use sapply:

    num.discrep <- sapply(discrep, length)
    num.discrep
    #          id        name         dob vaccinedate vaccinename        dose 
    #           0           1           0           1           0           1 
    

    As for your question about finding ids that are in set two but not in set one: reverse the arguments with mapply(setdiff, df2, df1), or, if only the ids matter, use setdiff(df2$id, df1$id).
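
    A small sketch of both directions, using made-up data frames with character columns (an assumption, chosen to keep setdiff() working on plain vectors). Note that setdiff() compares each column as a set of values and ignores row alignment, so it tells you which values are unmatched, not which id they belong to. SIMPLIFY = FALSE is added here so the result is always a list, even when every column happens to yield the same number of values.

```r
# Hypothetical example data.
df1 <- data.frame(
  id   = 100000:100001,
  name = c("John Doe", "Jane Doe"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id   = 100000:100002,
  name = c("John Doe", "Jane Doee", "John Smith"),
  stringsAsFactors = FALSE
)

# Values present in df1 but not in df2, column by column.
only_in_df1 <- mapply(setdiff, df1, df2, SIMPLIFY = FALSE)

# The reverse direction: values present in df2 but not in df1.
only_in_df2 <- mapply(setdiff, df2, df1, SIMPLIFY = FALSE)

# If only the ids matter, compare the id columns directly.
new_ids <- setdiff(df2$id, df1$id)

# Full rows of df2 whose id does not occur in df1.
new_rows <- df2[df2$id %in% new_ids, ]
new_rows
```

    Here only_in_df2$name contains both "Jane Doee" (a changed value) and "John Smith" (a value from a new row), which shows why the per-column sets alone cannot separate the two cases.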

    For more on R's functional functions (e.g., mapply, sapply, lapply, etc.) see this post.


    Updating with a purrr solution:

    library(purrr)  # provides map2() and map_int(), and re-exports %>%

    map2(df1, df2, setdiff) %>% 
      map_int(length)
    
  • 2021-02-14 06:14
    library(compareDF)
    
    compare_df(dataframe1, dataframe2, c("columnname"))
    
  • 2021-02-14 06:18

    There is a new package called waldo:

    install.packages("waldo")
    library(waldo)
    
    # construct the data frames
    
    
    df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
    
    df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
    
    # compare them
    compare(df1,df2)
    

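    The report looks like the following. Note that this sample output comes from a different pair of data frames (with columns X, Y and Z), not from df1 and df2 above: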

    `old` is length 2
    `new` is length 3
    
    `names(old)`: "X" "Y"    
    `names(new)`: "X" "Y" "Z"
    
    `attr(old, 'row.names')`: 1 2 3  
    `attr(new, 'row.names')`: 1 2 3 4
    
    `old$X`: 1 2 3  
    `new$X`: 1 2 3 4
    
    `old$Y`: "a" "b" "c"    
    `new$Y`: "A" "b" "c" "d"
    
    `old$Z` is absent
    `new$Z` is a character vector ('k', 'l', 'm', 'n')
    