R - comparing two rows by columns and writing the result in a table

Question

I'm an R newbie, and the solution to my problem is probably very simple, but it's out of my reach for now. I would like to compare the rows of a data frame column by column. Each cell holds a single letter (a nucleotide base):

seq1 A C T G T
seq2 A C G G G
seq3 A G G C A
...

I'd like to compare every row in the data set with every other row, column by column. The result should be a simple 1 or 0 for TRUE or FALSE in each comparison, also written as a table. It would look like this:

seq1_seq2 1 1 0 1 0
seq1_seq3 1 0 0 0 0
seq2_seq3 1 0 1 0 0
...

My R skills are too limited to write something useful. However, I did find that

ifelse(data[1,]==data[2,], 1, 0)

returns almost what I need, although without showing which rows are compared (no seq1_seq2 column). I would appreciate any help with this problem. A complete example solution would be ideal, but I will also be grateful for any suggestions on how to approach it.

Thank you in advance!

Answer 1:

Storing sequences by rows in a data frame is the wrong layout. Store them by columns or, if you do keep them in rows, at least use a matrix rather than a data frame. Below I assume you use a matrix. You can convert a data frame to a matrix with the as.matrix function.
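For illustration, here is a minimal sketch of that conversion, assuming the sequences arrive in a data frame with one sequence per row. The name seqs_df and the column names V1 to V5 are placeholders invented for this example; they are not from the question.

```r
# Hypothetical data frame holding the three example sequences, one per row.
seqs_df <- data.frame(
  V1 = c("A", "A", "A"),
  V2 = c("C", "C", "G"),
  V3 = c("T", "G", "G"),
  V4 = c("G", "G", "C"),
  V5 = c("T", "G", "A"),
  row.names = c("seq1", "seq2", "seq3"),
  stringsAsFactors = FALSE
)

# as.matrix() turns the all-character data frame into a character matrix
# and keeps the row names.
a <- as.matrix(seqs_df)

# Dropping the column names is optional; it only makes the matrix print
# with [,1] ... [,5] headers.
colnames(a) <- NULL
a
```

Row names survive the conversion, which is what later lets each comparison be labelled by the pair of sequences involved.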

If you want to avoid explicit loops, combn is the tool for this kind of task:

> a
     [,1] [,2] [,3] [,4] [,5]
seq1 "A"  "C"  "T"  "G"  "T" 
seq2 "A"  "C"  "G"  "G"  "G" 
seq3 "A"  "G"  "G"  "C"  "A" 

> compare = t(combn(nrow(a),2,FUN=function(x)a[x[1],]==a[x[2],]))
> rownames(compare) = combn(nrow(a),2,FUN=function(x)paste0("seq",x[1],"_seq",x[2]))

> compare
          [,1]  [,2]  [,3]  [,4]  [,5]
seq1_seq2 TRUE  TRUE FALSE  TRUE FALSE
seq1_seq3 TRUE FALSE FALSE FALSE FALSE
seq2_seq3 TRUE FALSE  TRUE FALSE FALSE

To transform booleans to integers (if you really need it):

storage.mode(compare) = "integer"
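Note that the row labels above are built from the literal prefix "seq" plus the row index, so they only match when the rows really are named seq1, seq2, and so on. A possible variant, offered as a sketch rather than the definitive way, takes the labels from rownames(a) and compares all pairs in one vectorised step. The name pairs is introduced here for illustration.

```r
# Example matrix, one sequence per row (same data as above).
a <- rbind(
  seq1 = c("A", "C", "T", "G", "T"),
  seq2 = c("A", "C", "G", "G", "G"),
  seq3 = c("A", "G", "G", "C", "A")
)

# 2 x choose(n, 2) matrix of row indices; each column is one pair.
pairs <- combn(nrow(a), 2)

# Compare the "first" rows of all pairs with the "second" rows in one go.
compare <- a[pairs[1, ], , drop = FALSE] == a[pairs[2, ], , drop = FALSE]

# Label each comparison with the actual row names of the pair.
rownames(compare) <- paste(rownames(a)[pairs[1, ]],
                           rownames(a)[pairs[2, ]], sep = "_")

# Optional: TRUE/FALSE -> 1/0
storage.mode(compare) <- "integer"
compare
```

This assumes the matrix has row names; without them the paste call has nothing to build labels from.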

Answer 2:

In this case, since you want every pair of rows compared (n(n-1)/2 comparisons for n rows), a nested loop is one option:

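# df is the data frame of sequences, one sequence per row
result <- list()
# seq_len() avoids the 1:0 trap if df has only a single row
for (i in seq_len(nrow(df) - 1)) {
    for (j in (i + 1):nrow(df)) {
      # Name each entry after the two rows compared, e.g. "seq1_seq2"
      result[[paste(row.names(df)[i], row.names(df)[j], sep = '_')]] <- as.integer(df[i, ] == df[j, ])
    }
}
# Stack the per-pair vectors into one table
as.data.frame(do.call(rbind, result))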

Resulting output will be as follows:

          V1 V2 V3 V4 V5
seq1_seq2  1  1  0  1  0
seq1_seq3  1  0  0  0  0
seq2_seq3  1  0  1  0  0

Of course, this will be very slow for larger data sets, because the number of pairs grows quadratically with the number of rows.
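To put a number on that: n rows give choose(n, 2) pairs. The sketch below shows the count for a few sizes and a loop that fills a preallocated matrix instead of growing a list. It is only an illustration, not a benchmark; pairwise_equal is a hypothetical helper name, and it assumes the input has row names and at least two rows.

```r
# Number of row pairs for 3, 100 and 1000 rows
choose(c(3, 100, 1000), 2)   # 3 4950 499500

# Hypothetical helper: the same nested loop, writing into a preallocated
# integer matrix. Assumes x has row names and at least two rows.
pairwise_equal <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  n_pairs <- choose(n, 2)
  out <- matrix(0L, nrow = n_pairs, ncol = ncol(x))
  labels <- character(n_pairs)
  k <- 0L
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      k <- k + 1L
      out[k, ] <- as.integer(x[i, ] == x[j, ])
      labels[k] <- paste(rownames(x)[i], rownames(x)[j], sep = "_")
    }
  }
  dimnames(out) <- list(labels, colnames(x))
  out
}

seqs <- rbind(
  seq1 = c("A", "C", "T", "G", "T"),
  seq2 = c("A", "C", "G", "G", "G"),
  seq3 = c("A", "G", "G", "C", "A")
)
res <- pairwise_equal(seqs)
res
```

Preallocation does not change the quadratic number of comparisons; it only avoids repeatedly extending the result while looping.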

Answer 3:

A somewhat different approach from Gopala's. There's probably a simpler way to get there, but here it is:

options(stringsAsFactors = FALSE)
myData <- data.frame(n1=c("A","A","A"),n2=c("C","C","G"),
                     n3=c("T","G","G"),n4=c("G","G","C"),n5=c("T","G","A"))
rownames(myData) <- paste0("seq",1:3)

# Generate all combinations for comparisons
compar <- apply(combn(rownames(myData),2),2,paste0)

# Create a temporary list having pairs of rows
myList <- apply(compar, 2, function(r) myData[r,])
names(myList) <- apply(combn(rownames(myData),2),2,paste0,collapse="_")

# Compare the two rows for each element in the list
results <- t(sapply(myList, function(x) as.numeric(x[1,]==x[2,])))
colnames(results) <- colnames(myData)

results

          n1 n2 n3 n4 n5
seq1_seq2  1  1  0  1  0
seq1_seq3  1  0  0  0  0
seq2_seq3  1  0  1  0  0

Answer 4:

You can use this code (it uses myData from Dominic Comtois's answer):

m <- combn(nrow(myData),2)

result <- sapply(myData,function(C) {z=C[m];z[c(TRUE,FALSE)]==z[c(FALSE,TRUE)]})
#       n1    n2    n3    n4    n5
#[1,] TRUE  TRUE FALSE  TRUE FALSE
#[2,] TRUE FALSE FALSE FALSE FALSE
#[3,] TRUE FALSE  TRUE FALSE FALSE

How it works:

1. combn generates all possible pairs of row indices, one pair per column of m.
2. sapply loops over each column of myData.
3. For each column C, C[m] yields a vector laid out like m (read column by column), with each row index replaced by the value of C at that row.
4. The odd elements of this vector hold the value from the first row of each pair and the even elements hold the value from the second row, so the logical masks c(TRUE,FALSE) and c(FALSE,TRUE), which R recycles along the vector, select the two sides of the comparison.
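The steps above can be traced by hand for a single column. This sketch uses the values of column n3 from myData; the names z, first, second and pair_labels are introduced here for illustration only.

```r
n3 <- c("T", "G", "G")        # column n3 of myData (rows seq1, seq2, seq3)
m  <- combn(3, 2)             # columns are the pairs (1,2), (1,3), (2,3)

# Indexing a vector with the matrix m reads m column by column,
# so z holds the values for rows 1,2, 1,3, 2,3 in that order.
z <- n3[m]                    # "T" "G" "T" "G" "G" "G"

first  <- z[c(TRUE, FALSE)]   # odd positions:  "T" "T" "G"
second <- z[c(FALSE, TRUE)]   # even positions: "G" "G" "G"
first == second               # FALSE FALSE TRUE

# Optional: labels for the rows of the full result, built from m.
# The "seq" prefix is an assumption that matches the example row names.
pair_labels <- paste0("seq", m[1, ], "_seq", m[2, ])
pair_labels                   # "seq1_seq2" "seq1_seq3" "seq2_seq3"
```

The result for this column matches the n3 column of the table above, and pair_labels could be assigned to rownames(result) if labelled rows are wanted.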

Source: https://stackoverflow.com/questions/37228001/r-comparing-two-rows-by-columns-and-writing-the-result-in-a-table
