Identifying specific differences between two data sets in R

长发绾君心 2021-02-14 05:52

I would like to compare two data sets and identify specific instances of discrepancies between them (i.e., which variables were different).

While I have found out how t

4 Answers
  •  礼貌的吻别
    2021-02-14 06:00

    One possibility. First, find out which ids both datasets have in common. The simplest way to do this is:

    commonID<-intersect(A$id,B$id)
    

    Then you can determine which rows are missing from A (that is, rows of B whose id does not appear in A) by:

    B[!B$id %in% commonID, ]
    #       id       name      dob vaccinedate vaccinename dose
    # 3 100002 John Smith 2/5/2010   7/13/2013        HEPB    3
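
    The same check can be run in both directions. The following is a minimal sketch, assuming two small made-up data frames `A` and `B` (the original data is not shown in the question, so the values and the names `onlyInA` and `onlyInB` are purely illustrative):

    ```r
    # Hypothetical sample data; not the asker's actual data sets.
    A <- data.frame(id = c(100000, 100001, 100003),
                    name = c("Ann Lee", "Bob Jones", "Cara Diaz"),
                    stringsAsFactors = FALSE)
    B <- data.frame(id = c(100000, 100001, 100002),
                    name = c("Ann Lee", "Rob Jones", "John Smith"),
                    stringsAsFactors = FALSE)

    onlyInB <- setdiff(B$id, A$id)  # ids present in B but absent from A
    onlyInA <- setdiff(A$id, B$id)  # ids present in A but absent from B

    B[B$id %in% onlyInB, ]          # rows missing from A
    A[A$id %in% onlyInA, ]          # rows missing from B
    ```

    Note that `setdiff` is not symmetric, so each direction has to be asked for separately.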
    

    Next, you can restrict both datasets to the ids they have in common.

    Acommon<-A[A$id %in% commonID,]
    Bcommon<-B[B$id %in% commonID,]
    

    If you can't assume that the ids are in the same order in both datasets, then sort them both:

    Acommon<-Acommon[order(Acommon$id),]
    Bcommon<-Bcommon[order(Bcommon$id),]
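
    Before comparing cell by cell, it may be worth confirming that the two sorted data frames really line up. This is a sketch under the assumption that `id` is meant to be unique within each dataset. The sample values and the `aligned` flag are hypothetical:

    ```r
    # Hypothetical sample data, deliberately in different row orders.
    Acommon <- data.frame(id = c(3, 1, 2), dose = c(1, 2, 3))
    Bcommon <- data.frame(id = c(2, 3, 1), dose = c(3, 1, 2))

    Acommon <- Acommon[order(Acommon$id), ]
    Bcommon <- Bcommon[order(Bcommon$id), ]

    # Reset row names so they no longer reflect the pre-sort positions.
    rownames(Acommon) <- NULL
    rownames(Bcommon) <- NULL

    # Positional comparison is only meaningful if ids are unique,
    # identical in both frames, and the columns match in name and order.
    aligned <- !anyDuplicated(Acommon$id) &&
      !anyDuplicated(Bcommon$id) &&
      identical(Acommon$id, Bcommon$id) &&
      identical(names(Acommon), names(Bcommon))
    stopifnot(aligned)
    ```

    If an id occurs more than once in either dataset, sorting alone does not guarantee that row k of one frame corresponds to row k of the other.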
    

    Now you can see which fields differ, cell by cell (this assumes both data frames have the same columns in the same order):

    diffs<-Acommon != Bcommon
    diffs
    #      id  name   dob vaccinedate vaccinename  dose
    # 1 FALSE FALSE FALSE       FALSE       FALSE  TRUE
    # 2 FALSE  TRUE FALSE        TRUE       FALSE FALSE
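
    One caveat: `!=` returns `NA` wherever either value is missing, which then propagates into any sums. Below is a minimal sketch of one possible convention, assuming that two missing values count as equal and that a missing value against a present one counts as a difference. The sample data and the names `oneNA` and `diffsNA` are hypothetical:

    ```r
    # Hypothetical sample data containing missing values.
    Acommon <- data.frame(id = c(1, 2, 3), dose = c(1, NA, NA))
    Bcommon <- data.frame(id = c(1, 2, 3), dose = c(2, NA, 3))

    diffs <- Acommon != Bcommon   # NA wherever either side is NA

    # TRUE where exactly one of the two values is missing.
    oneNA <- xor(is.na(Acommon), is.na(Bcommon))

    # Treat NA comparisons as "no difference" unless only one side is missing.
    diffsNA <- (!is.na(diffs) & diffs) | oneNA

    colSums(diffsNA)
    ```

    Whether two missing values should count as a match is a judgment call that depends on the data, so the convention above is only one option.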
    

    This is a logical matrix, and you can do whatever you want with it. For example, to find the total number of differences in each column:

    colSums(diffs)
    #         id        name         dob vaccinedate vaccinename        dose 
    #          0           1           0           1           0           1 
    

    To find all ids where the name is different:

    Acommon$id[diffs[,"name"]]
    # [1] 100001
    

    And so on.
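
    To list every individual discrepancy at once, the logical matrix can be turned into a long-format report with `which(..., arr.ind = TRUE)`. This is a sketch with made-up data. The `report` layout and its column names are assumptions, not part of the original answer:

    ```r
    # Hypothetical sample data, already restricted to common ids and sorted.
    Acommon <- data.frame(id = c(100000, 100001),
                          name = c("Ann Lee", "Bob Jones"),
                          dose = c(1, 2),
                          stringsAsFactors = FALSE)
    Bcommon <- data.frame(id = c(100000, 100001),
                          name = c("Ann Lee", "Rob Jones"),
                          dose = c(2, 2),
                          stringsAsFactors = FALSE)

    diffs <- Acommon != Bcommon

    # One row per differing cell, giving its (row, col) position.
    idx  <- which(diffs, arr.ind = TRUE)
    rows <- unname(idx[, "row"])
    cols <- unname(idx[, "col"])

    # Look up each differing cell and store it as text, since the
    # differing columns may have different types.
    report <- data.frame(
      id       = Acommon$id[rows],
      variable = colnames(diffs)[cols],
      valueA   = vapply(seq_along(rows),
                        function(k) as.character(Acommon[[cols[k]]][rows[k]]),
                        character(1)),
      valueB   = vapply(seq_along(rows),
                        function(k) as.character(Bcommon[[cols[k]]][rows[k]]),
                        character(1)),
      stringsAsFactors = FALSE
    )
    report
    ```

    Each row of `report` names the id, the variable that differs, and the value found in each dataset.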
