How to compare two data frames/tables and extract data in R?

In an attempt to extract the mismatches between the two data frames below, I've already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches:

dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TT
# snp3      AG      AG      AG
# snp4      CA      CA      CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfB
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TB
# snp3      AG      AG      AG
# snp4      CA      DF      DF

To clarify the mismatches, here they are marked as 00's:

#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      CA      00      00
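
As a point of reference, a table like the one above can be produced with a single vectorised comparison rather than lapply. This is only a sketch: it assumes both data frames have identical dimensions with rows and columns in the same order, and the names `mismatch` and `marked` are introduced here purely for illustration.

```r
# Same example data as above, rebuilt so the snippet runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)

# Logical matrix: TRUE wherever the two data frames disagree
mismatch <- dfA != dfB

# Copy dfA and overwrite the disagreeing cells with the "00" marker
marked <- dfA
marked[mismatch] <- "00"
marked
```

The same logical matrix is the natural starting point for listing the mismatches, since it already records which cells differ.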

I need the following output:

structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"), animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"), alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"), alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")), .Names = c("snpname", "animalname", "alleledfA", "alleledfB"), class = "data.frame", row.names = c(NA, -3L))
#  snpname animalname alleledfA alleledfB
#1    snp2    animal3        TT        TB
#2    snp4    animal2        CA        DF
#3    snp4    animal3        CA        DF

So far I've been trying to extract additional data out of the lapply call that I use to replace the mismatches with zeros, but without success. I also tried to write an ifelse function, again without success. Hope you guys can help me out here!

Eventually this will be run on data sets with dimensions of 100K by 1000, so efficiency is a plus.

This question has the data.table tag, so here's my attempt using that package. The first step is to convert the row names to a column, as data.table doesn't keep row names. Then rbind the two tables while setting an id per data set, convert to long format, keep the groups that have more than one unique value, and convert back to wide format.

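library(data.table)
# Convert to data.table by reference; row names are kept in a column named "rn"
setDT(dfA, keep.rownames = TRUE)
setDT(dfB, keep.rownames = TRUE)

# Stack both tables (.id = 1 for dfA, 2 for dfB), melt to long format,
# keep only the (rn, variable) groups holding more than one distinct value,
# then cast back to wide with one column per data set
dcast(melt(rbind(dfA,
                 dfB,
                 idcol = TRUE),
           id.vars = 1:2
           )[,
             if (uniqueN(value) > 1L) .SD,
             by = .(rn, variable)],
      rn + variable ~ .id)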

#      rn variable  1  2
# 1: snp2  animal3 TT TB
# 2: snp4  animal2 CA DF
# 3: snp4  animal3 CA DF
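
For readers who find the nested call hard to follow, the same idea can be split into named steps. This is a sketch rather than a tuned implementation: the intermediate names (`dtA`, `dtB`, `long`, `mism`, `res`) and the `src` id column are made up for illustration, and it works on copies instead of converting the data frames by reference.

```r
library(data.table)

# Same example data as in the question, rebuilt so the snippet runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)

# 1. Copy into data.tables, moving the row names into a column called "rn"
dtA <- as.data.table(dfA, keep.rownames = "rn")
dtB <- as.data.table(dfB, keep.rownames = "rn")

# 2. Stack both tables; the list names become the values of the "src" column
stacked <- rbindlist(list(A = dtA, B = dtB), idcol = "src")

# 3. Long format: one row per (src, rn, animal) combination
long <- melt(stacked, id.vars = c("src", "rn"))

# 4. Keep only the (rn, variable) groups where the two sources disagree
mism <- long[, if (uniqueN(value) > 1L) .SD, by = .(rn, variable)]

# 5. Back to wide format: one column per source
res <- dcast(mism, rn + variable ~ src, value.var = "value")
res
```

Naming the list elements gives the result columns `A` and `B` instead of `1` and `2`, which may be easier to read downstream.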

jogo

Here is a solution using the array indices of the mismatch matrix:

i.arr <- which(dfA != dfB, arr.ind=TRUE)

data.frame(snp=rownames(dfA)[i.arr[,1]], animal=colnames(dfA)[i.arr[,2]],
           A=dfA[i.arr], B=dfB[i.arr])
#   snp  animal  A  B
#1 snp4 animal2 CA DF
#2 snp2 animal3 TT TB
#3 snp4 animal3 CA DF
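
Given the data sizes mentioned in the question, one variation is to convert both data frames to character matrices once and do all the indexing on those. This is a sketch, untested at that scale: it assumes the two inputs share the same row and column names in the same order, and the names `mA`, `mB`, `idx` and `res` are illustrative only. Note that cells where either value is NA are silently skipped, because `which()` drops NA.

```r
# Same example data as in the question, rebuilt so the snippet runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)

# Convert once; all later comparisons and lookups work on plain matrices
mA <- as.matrix(dfA)
mB <- as.matrix(dfB)
stopifnot(identical(dimnames(mA), dimnames(mB)))  # guard the alignment assumption

# Two-column (row, col) index matrix of the cells that differ
idx <- which(mA != mB, arr.ind = TRUE)

res <- data.frame(snpname    = rownames(mA)[idx[, "row"]],
                  animalname = colnames(mA)[idx[, "col"]],
                  alleledfA  = mA[idx],
                  alleledfB  = mB[idx],
                  stringsAsFactors = FALSE)

# Sort to match the layout requested in the question
res <- res[order(res$snpname, res$animalname), ]
rownames(res) <- NULL
res
#   snpname animalname alleledfA alleledfB
# 1    snp2    animal3        TT        TB
# 2    snp4    animal2        CA        DF
# 3    snp4    animal3        CA        DF
```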

This can be done with dplyr/tidyr using an approach similar to the one in @David Arenburg's post.

library(dplyr)
library(tidyr)
bind_rows(add_rownames(dfA), add_rownames(dfB)) %>%
  gather(Var, Val, -rowname) %>%
  group_by(rowname, Var) %>%
  filter(n_distinct(Val) > 1) %>%
  mutate(id = 1:2) %>%
  spread(id, Val)
#  rowname     Var     1     2
#    (chr)   (chr) (chr) (chr)
#1    snp2 animal3    TT    TB
#2    snp4 animal2    CA    DF
#3    snp4 animal3    CA    DF
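
In more recent versions of these packages, `add_rownames()` is deprecated and `gather()`/`spread()` have been superseded, so the same pipeline might be written with `tibble::rownames_to_column()`, `pivot_longer()` and `pivot_wider()`. The following is a sketch under that assumption; the column names (`src`, `snpname`, `animalname`, `allele`) are chosen here for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same example data as in the question, rebuilt so the snippet runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = paste0("snp", 1:4), stringsAsFactors = FALSE)

# Stack both data sets; the argument names A and B end up in the "src" column
long <- bind_rows(A = rownames_to_column(dfA, "snpname"),
                  B = rownames_to_column(dfB, "snpname"),
                  .id = "src") %>%
  pivot_longer(cols = -c(src, snpname),
               names_to = "animalname", values_to = "allele")

# Keep the (snp, animal) pairs with more than one distinct allele,
# then spread the two sources into separate columns
res <- long %>%
  group_by(snpname, animalname) %>%
  filter(n_distinct(allele) > 1) %>%
  ungroup() %>%
  pivot_wider(names_from = src, values_from = allele)
res
```

Carrying the source label in `src` removes the need for the `mutate(id = 1:2)` step, which relies on each group having exactly two rows in a fixed order.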

Source: https://stackoverflow.com/questions/36592184/how-to-compare-two-data-frames-tables-and-extract-data-in-r

Tags: dataframe, compare, data.table, mismatch