Identifying specific differences between two data sets in R

长发绾君心 2021-02-14 05:52

I would like to compare two data sets and identify specific instances of discrepancies between them (i.e., which variables were different).

While I have found out how t

4 Answers
  •  暖寄归人
    2021-02-14 06:12

    This should get you started, but there may be more elegant solutions.

    First, establish df1 and df2 so others can reproduce quickly:

    df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
    
    df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
    

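    Next, get the discrepancies from df1 to df2 with mapply and setdiff. This compares the two data frames column by column and returns the values that appear in a df1 column but not in the matching df2 column (values are treated as sets, not matched row by row):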

    discrep <- mapply(setdiff, df1, df2)
    discrep
    # $id
    # integer(0)
    # 
    # $name
    # [1] "Jane Doe"
    # 
    # $dob
    # character(0)
    # 
    # $vaccinedate
    # [1] "3/14/2013"
    # 
    # $vaccinename
    # character(0)
    # 
    # $dose
    # [1] 4
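
    Because setdiff treats each column as a set, the result above does not say which record each discrepancy belongs to. A minimal sketch of a row-wise comparison follows, assuming id uniquely identifies a record in both data sets. The data frames a and b are hypothetical, simplified copies of df1 and df2 with character columns instead of factors, and the names both, vars, flags, and differs are illustrative only.

```r
# Hypothetical, simplified copies of df1 and df2 (character columns, not factors).
a <- data.frame(id = c(100000L, 100001L),
                name = c("John Doe", "Jane Doe"),
                dose = c(4L, 1L),
                stringsAsFactors = FALSE)
b <- data.frame(id = c(100000L, 100001L, 100002L),
                name = c("John Doe", "Jane Doee", "John Smith"),
                dose = c(3L, 1L, 3L),
                stringsAsFactors = FALSE)

# Inner join on id; the shared non-key columns get the suffixes .a and .b.
both <- merge(a, b, by = "id", suffixes = c(".a", ".b"))

# For every shared column, flag the rows where the two versions disagree.
vars <- setdiff(intersect(names(a), names(b)), "id")
flags <- lapply(setNames(vars, vars), function(v) {
  both[[paste0(v, ".a")]] != both[[paste0(v, ".b")]]
})

# One row per matched id, one logical column per compared variable.
differs <- cbind(both["id"], as.data.frame(flags))
differs
```

    Each TRUE marks a variable that differs for that id. Ids present in only one data set are dropped by the inner join, and a comparison involving NA yields NA rather than TRUE or FALSE.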
    

    To count them up we can use sapply:

    num.discrep <- sapply(discrep, length)
    num.discrep
    # id        name         dob vaccinedate vaccinename        dose 
    # 0           1           0           1           0           1 
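
    One caveat with this approach: mapply simplifies its result by default, so when every column happens to yield the same number of discrepancies, it returns a vector or matrix instead of a list, and counting with length per element no longer works as intended. The sketch below shows a more defensive variant using SIMPLIFY = FALSE and lengths(). The data frames x and y are hypothetical toy inputs, not the data from the question.

```r
# Hypothetical toy inputs where every column has exactly one differing value.
x <- data.frame(a = c(1, 2), b = c("p", "q"), stringsAsFactors = FALSE)
y <- data.frame(a = c(1, 3), b = c("p", "r"), stringsAsFactors = FALSE)

# The default SIMPLIFY = TRUE collapses equal-length results into a vector.
simplified <- mapply(setdiff, x, y)

# SIMPLIFY = FALSE always returns a list with one element per column.
as_list <- mapply(setdiff, x, y, SIMPLIFY = FALSE)

# lengths() counts the elements of each list entry.
counts <- lengths(as_list)
counts
```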
    

    Per your question on obtaining ids that are in set two but not in set one, you can reverse the process with mapply(setdiff, df2, df1); if you only care about the ids, setdiff(df2$id, df1$id) is enough.
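
    The id-only comparison can be sketched as follows. The vectors ids1 and ids2 are hypothetical stand-ins for df1$id and df2$id, and the result names are illustrative.

```r
# Hypothetical id vectors standing in for df1$id and df2$id.
ids1 <- 100000:100001
ids2 <- 100000:100002

only_in_2 <- setdiff(ids2, ids1)    # ids present only in the second set
only_in_1 <- setdiff(ids1, ids2)    # ids present only in the first set
in_both   <- intersect(ids1, ids2)  # ids shared by both sets

only_in_2
```

    Note that setdiff is not symmetric, so the order of the arguments decides which side's extra ids are reported.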

    For more on R's functional functions (e.g., mapply, sapply, lapply, etc.) see this post.


    Updating with a purrr solution:

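    library(purrr)  # provides map2(), map_int(), and re-exports %>%

    # Column-wise setdiff, then the number of discrepancies per column
    map2(df1, df2, setdiff) %>% 
      map_int(length)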
    
