I would like to compare two data sets and identify specific instances of discrepancies between them (i.e., which variables were different).
While I have found out how t
This should get you started, but there may be more elegant solutions.
First, establish df1 and df2 so others can reproduce quickly:
df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
Next, get the discrepancies from df1 to df2 via mapply and setdiff. That is, for each column, the values in set one that are not in set two:
discrep <- mapply(setdiff, df1, df2)
discrep
# $id
# integer(0)
#
# $name
# [1] "Jane Doe"
#
# $dob
# character(0)
#
# $vaccinedate
# [1] "3/14/2013"
#
# $vaccinename
# character(0)
#
# $dose
# [1] 4
To count them up we can use sapply:
num.discrep <- sapply(discrep, length)
num.discrep
#          id        name         dob vaccinedate vaccinename        dose
#           0           1           0           1           0           1
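One caveat with the counting step, offered as a minimal sketch: mapply simplifies its result by default, so if every column happens to yield the same number of discrepancies (two or more), you get back a matrix instead of a list, and sapply(x, length) then reports 1 for every cell. Passing SIMPLIFY = FALSE keeps the list. The data frames a and b below are made-up stand-ins, not the df1 and df2 from above.

```r
# Hypothetical toy data: each column ends up with exactly two discrepancies
a <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
b <- data.frame(x = 3L, y = "c", stringsAsFactors = FALSE)

# Default behaviour: equal-length results are collapsed into a 2 x 2 matrix
simplified <- mapply(setdiff, a, b)
is.matrix(simplified)

# SIMPLIFY = FALSE always returns a list, one element per column
kept <- mapply(setdiff, a, b, SIMPLIFY = FALSE)
counts <- sapply(kept, length)
counts
```

With the example data in this answer the lengths differ between columns, so the default call already returns a list; the extra argument only matters for inputs where they happen to match.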
As for your question about finding ids that are in set two but not in set one, you could reverse the process with mapply(setdiff, df2, df1). If you only care about the ids, setdiff(df2$id, df1$id) is enough.
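As a rough sketch of that reversed comparison, the snippet below uses cut-down stand-ins for the two data sets (only id, name and dose, stored as character and integer rather than factor). The names a, b, rev.discrep and new.ids are illustrative only.

```r
# Cut-down, hypothetical versions of df1 (a) and df2 (b)
a <- data.frame(id = 100000:100001,
                name = c("John Doe", "Jane Doe"),
                dose = c(4L, 1L),
                stringsAsFactors = FALSE)
b <- data.frame(id = 100000:100002,
                name = c("John Doe", "Jane Doee", "John Smith"),
                dose = c(3L, 1L, 3L),
                stringsAsFactors = FALSE)

# Column by column: values present in b that never occur in a
rev.discrep <- mapply(setdiff, b, a, SIMPLIFY = FALSE)
rev.discrep

# ids only: records that exist in b but not in a
new.ids <- setdiff(b$id, a$id)
new.ids
```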
For more on R's apply-family functions (mapply, sapply, lapply, and so on), see this post.
Updating with a purrr solution:
library(purrr)  # provides map2() and map_int(), and re-exports %>%
map2(df1, df2, setdiff) %>%
  map_int(length)
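Note that setdiff compares each column as a set of values, so it tells you which values are missing but not which record they belong to. If you need to know which variables differ for each id, one possible approach (a sketch under assumptions, not the only way) is to merge the two sets on id and compare the paired columns. The data frames a and b are again cut-down, hypothetical stand-ins that use character columns; with factor columns like those in df1 and df2 you would likely need to convert to character first, since comparing factors with different level sets raises an error.

```r
# Cut-down, hypothetical versions of df1 (a) and df2 (b)
a <- data.frame(id = 100000:100001,
                name = c("John Doe", "Jane Doe"),
                dose = c(4L, 1L),
                stringsAsFactors = FALSE)
b <- data.frame(id = 100000:100002,
                name = c("John Doe", "Jane Doee", "John Smith"),
                dose = c(3L, 1L, 3L),
                stringsAsFactors = FALSE)

# Keep ids present in both sets; shared columns get the suffixes .1 and .2
both <- merge(a, b, by = "id", suffixes = c(".1", ".2"))

# For every non-key variable, flag the rows where the two versions disagree
vars <- setdiff(names(a), "id")
flags <- sapply(vars, function(v) {
  both[[paste0(v, ".1")]] != both[[paste0(v, ".2")]]
}, simplify = FALSE)

# One row per shared id, one logical column per variable (TRUE = differs)
differs <- data.frame(id = both$id, flags)
differs
```

Missing values would show up as NA rather than TRUE or FALSE here, so real data may need extra handling for that case.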