In an attempt to extract the mismatches between the two data frames below, I've already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches:
dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB TT
# snp3 AG AG AG
# snp4 CA CA CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
#> dfB
# animal1 animal2 animal3
#snp1 AA AA AA
#snp2 TT TB TB
#snp3 AG AG AG
#snp4 CA DF DF
To clarify the mismatches, here they are marked as 00's:
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB 00
# snp3 AG AG AG
# snp4 CA 00 00
I need the following output:
structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"), animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"), alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"), alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")), .Names = c("snpname", "animalname", "alleledfA", "alleledfB"), class = "data.frame", row.names = c(NA, -3L))
# snpname animalname alleledfA alleledfB
#1 snp2 animal3 TT TB
#2 snp4 animal2 CA DF
#3 snp4 animal3 CA DF
So far I've been trying, without success, to extract additional data out of the lapply function that I use to replace the mismatches with zeros. I also tried to write an ifelse function, again without success. Hope you guys can help me out here!
Eventually this will be run on data sets with dimensions of 100K by 1000, so efficiency matters.
This question has the data.table tag, so here's my attempt using this package. The first step is to convert the row names to a column, since data.table doesn't keep row names. Then rbind the two data sets with an id per data set, convert to long format, keep the groups that have more than one unique value, and convert back to wide format:
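library(data.table)
# setDT() converts in place; keep.rownames = TRUE stores the row names in a column named "rn"
setDT(dfA, keep.rownames = TRUE)
setDT(dfB, keep.rownames = TRUE)
# Stack both tables with an .id column (1 = dfA, 2 = dfB), melt to long format,
# keep only the (rn, variable) groups whose values differ, then cast back to wide
dcast(
  melt(rbind(dfA, dfB, idcol = TRUE), id.vars = 1:2)[
    , if (uniqueN(value) > 1L) .SD, by = .(rn, variable)],
  rn + variable ~ .id
)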
# rn variable 1 2
# 1: snp2 animal3 TT TB
# 2: snp4 animal2 CA DF
# 3: snp4 animal3 CA DF
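The same melt idea can also be written as a join, which gives the column names asked for in the question directly. This is only a sketch under a few assumptions: `to_long` is a hypothetical helper name, both data frames share the same row and column names, and `as.data.table()` is used instead of `setDT()` so the original data frames are left untouched.

```r
library(data.table)

# Example data from the question
dfA <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "CA"),
  animal3 = c("AA", "TT", "AG", "CA"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "DF"),
  animal3 = c("AA", "TB", "AG", "DF"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: copy a data frame into a long data.table with one
# row per (snpname, animalname) pair and the allele in `value_name`
to_long <- function(df, value_name) {
  melt(
    as.data.table(df, keep.rownames = "snpname"),
    id.vars = "snpname",
    variable.name = "animalname",
    value.name = value_name,
    variable.factor = FALSE
  )
}

longA <- to_long(dfA, "alleledfA")
longB <- to_long(dfB, "alleledfB")

# Join on SNP and animal, then keep the rows where the alleles differ
mismatches <- longA[longB, on = .(snpname, animalname)][alleledfA != alleledfB]
setorder(mismatches, snpname, animalname)
mismatches
```

Whether this is faster than the dcast version on a 100K by 1000 table is something I have not measured, so it is worth timing both on a subset first.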
Here is a solution using the array indices of the mismatching cells (which() with arr.ind = TRUE):
i.arr <- which(dfA != dfB, arr.ind=TRUE)
data.frame(snp=rownames(dfA)[i.arr[,1]], animal=colnames(dfA)[i.arr[,2]],
A=dfA[i.arr], B=dfB[i.arr])
# snp animal A B
#1 snp4 animal2 CA DF
#2 snp2 animal3 TT TB
#3 snp4 animal3 CA DF
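For the large case mentioned in the question, a possible variant is to convert both data frames to character matrices once and index those. The sketch below assumes that all columns are character and that both inputs have identical row and column names; the names `mA`, `mB`, `idx` and `mismatches` are only illustrative. It also sorts the indices by row so the result follows the order of the desired output.

```r
# Example data from the question
dfA <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "CA"),
  animal3 = c("AA", "TT", "AG", "CA"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "DF"),
  animal3 = c("AA", "TB", "AG", "DF"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)

# Convert once; matrix comparison and matrix indexing avoid data frame overhead
mA <- as.matrix(dfA)
mB <- as.matrix(dfB)

# Assumption: both tables describe the same SNPs and animals in the same order
stopifnot(identical(dimnames(mA), dimnames(mB)))

# Row/column positions of the cells that differ
idx <- which(mA != mB, arr.ind = TRUE)

# which() returns positions in column-major order; sort by row, then column
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

mismatches <- data.frame(
  snpname    = rownames(mA)[idx[, "row"]],
  animalname = colnames(mA)[idx[, "col"]],
  alleledfA  = mA[idx],
  alleledfB  = mB[idx],
  stringsAsFactors = FALSE
)
mismatches
```

Note that cells where either side is NA compare as NA and are dropped by which(), so missing genotypes would need separate handling.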
This can be done with dplyr/tidyr, using a similar approach to the one in @David Arenburg's post:
library(dplyr)
library(tidyr)
bind_rows(add_rownames(dfA), add_rownames(dfB)) %>%
  gather(Var, Val, -rowname) %>%
  group_by(rowname, Var) %>%
  filter(n_distinct(Val) > 1) %>%
  mutate(id = 1:2) %>%
  spread(id, Val)
# rowname Var 1 2
# (chr) (chr) (chr) (chr)
#1 snp2 animal3 TT TB
#2 snp4 animal2 CA DF
#3 snp4 animal3 CA DF
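In more recent versions of these packages, add_rownames() is deprecated in favour of tibble::rownames_to_column(), and gather()/spread() are superseded by pivot_longer()/pivot_wider(). A rough equivalent with the newer functions might look like the sketch below. It swaps the group-and-filter step for a join on SNP and animal, and `to_long` is a hypothetical helper name, not part of any package.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Example data from the question
dfA <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "CA"),
  animal3 = c("AA", "TT", "AG", "CA"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "DF"),
  animal3 = c("AA", "TB", "AG", "DF"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move row names into a column and reshape to long format
to_long <- function(df, value_name) {
  df %>%
    rownames_to_column("snpname") %>%
    pivot_longer(-snpname, names_to = "animalname", values_to = value_name)
}

# Pair up the alleles per SNP and animal, then keep the pairs that differ
mismatches <- inner_join(
  to_long(dfA, "alleledfA"),
  to_long(dfB, "alleledfB"),
  by = c("snpname", "animalname")
) %>%
  filter(alleledfA != alleledfB)
mismatches
```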
Source: https://stackoverflow.com/questions/36592184/how-to-compare-two-data-frames-tables-and-extract-data-in-r