Question
I'm an R newbie and probably the solution for my problem is very simple but it's out of my reach for now... I would like to compare rows in a data frame by columns. The data in each column is a letter (nucleotide base):
seq1 A C T G T
seq2 A C G G G
seq3 A G G C A
...
I'd like to compare every row in the data set with every other row, column by column. The result I would like to obtain is simply 1 or 0 for TRUE and FALSE in each comparison, also written in the form of a table. So it would look like this:
seq1_seq2 1 1 0 1 0
seq1_seq3 1 0 0 0 0
seq2_seq3 1 0 1 0 0
...
My skills in R are too low to write something useful. However, I managed to find out that
ifelse(data[1,]==data[2,], 1, 0)
returns almost what I need, although without showing which rows are compared (no seq1_seq2 column). I would appreciate any help with this problem. Of course, an example of a complete solution would be ideal, but I will also be grateful for any suggestions about how to solve it.
Thank you in advance!
Answer 1:
Storing sequences by row in a data frame is not a good idea (a data frame is a list of columns, so row-wise operations on it are awkward). Store the sequences by column, or, if you do store them by row, at least use a matrix rather than a data frame. Below I assume you use a matrix. You can convert a data frame to a matrix with the as.matrix function.
If you want to avoid loops, use combn for this kind of task:
> a
     [,1] [,2] [,3] [,4] [,5]
seq1 "A"  "C"  "T"  "G"  "T"
seq2 "A"  "C"  "G"  "G"  "G"
seq3 "A"  "G"  "G"  "C"  "A"
> compare = t(combn(nrow(a),2,FUN=function(x)a[x[1],]==a[x[2],]))
> rownames(compare) = combn(nrow(a),2,FUN=function(x)paste0("seq",x[1],"_seq",x[2]))
> compare
           [,1]  [,2]  [,3]  [,4]  [,5]
seq1_seq2  TRUE  TRUE FALSE  TRUE FALSE
seq1_seq3  TRUE FALSE FALSE FALSE FALSE
seq2_seq3  TRUE FALSE  TRUE FALSE FALSE
To transform booleans to integers (if you really need it):
storage.mode(compare) = "integer"
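For completeness, here is a self-contained sketch of the steps above, starting from a data frame. The data frame name `seqs` and its column names are assumptions made for illustration; the only other change is that the row labels are built from `rownames(a)` instead of being hard-coded as "seq" plus the index.

```r
# Hypothetical input: the question's three sequences, stored by row in a data frame.
seqs <- data.frame(
  V1 = c("A", "A", "A"),
  V2 = c("C", "C", "G"),
  V3 = c("T", "G", "G"),
  V4 = c("G", "G", "C"),
  V5 = c("T", "G", "A"),
  row.names = c("seq1", "seq2", "seq3"),
  stringsAsFactors = FALSE
)

a <- as.matrix(seqs)  # character matrix; row names are preserved

# For every pair of row indices, compare the two rows position by position.
# combn() returns one column per pair, so transpose to get one row per pair.
compare <- t(combn(nrow(a), 2, FUN = function(x) a[x[1], ] == a[x[2], ]))

# Label each comparison with the names of the two rows involved.
rownames(compare) <- combn(rownames(a), 2, FUN = paste, collapse = "_")
colnames(compare) <- colnames(a)

storage.mode(compare) <- "integer"  # TRUE/FALSE -> 1/0, shape and names kept
compare
```

Taking the labels from `rownames(a)` means the code keeps working when the sequences are not named seq1, seq2, and so on.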
Answer 2:
In this case, since you want every pair of rows compared (n(n-1)/2 comparisons, so the work grows quadratically with the number of rows), looping this way is one option:
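result <- list()
# Visit each unordered pair of rows (i < j) exactly once
for (i in seq_len(nrow(df) - 1)) {
  for (j in (i + 1):nrow(df)) {
    # Name the entry after the two rows, e.g. "seq1_seq2"
    key <- paste(row.names(df)[i], row.names(df)[j], sep = '_')
    result[[key]] <- as.integer(df[i, ] == df[j, ])
  }
}
# Stack the per-pair vectors into one table
as.data.frame(do.call(rbind, result))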
Resulting output will be as follows:
          V1 V2 V3 V4 V5
seq1_seq2  1  1  0  1  0
seq1_seq3  1  0  0  0  0
seq2_seq3  1  0  1  0  0
Of course, this will be very slow for larger data sets.
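If the loop does become a bottleneck, one possible alternative is to build all index pairs once and compare whole blocks of rows in a single vectorised step. This is only a sketch, assuming the data fit in a character matrix; the name `seqs` is a hypothetical stand-in for `df` converted with as.matrix.

```r
# Hypothetical input: the question's sequences as a character matrix.
seqs <- matrix(
  c("A", "C", "T", "G", "T",
    "A", "C", "G", "G", "G",
    "A", "G", "G", "C", "A"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("seq1", "seq2", "seq3"), NULL)
)

idx <- combn(nrow(seqs), 2)  # each column holds one pair of row indices

# Pick the "first" rows and the "second" rows of all pairs and compare
# them elementwise; drop = FALSE keeps a matrix even for a single pair.
result <- seqs[idx[1, ], , drop = FALSE] == seqs[idx[2, ], , drop = FALSE]

result <- result + 0L  # logical -> integer, matrix shape is kept
rownames(result) <- paste(rownames(seqs)[idx[1, ]],
                          rownames(seqs)[idx[2, ]], sep = "_")
result
```

The number of comparisons is unchanged, but the per-pair work moves out of the interpreted loop and into one matrix comparison.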
Answer 3:
A somewhat different approach than Gopala's... There's probably a simpler way to get there, but here it is:
options(stringsAsFactors = FALSE)
myData <- data.frame(n1=c("A","A","A"),n2=c("C","C","G"),
n3=c("T","G","G"),n4=c("G","G","C"),n5=c("T","G","A"))
rownames(myData) <- paste0("seq",1:3)
# Generate all pairs of row names (one pair per column)
compar <- combn(rownames(myData), 2)
# Create a temporary list holding each pair of rows as a two-row data frame
myList <- apply(compar, 2, function(r) myData[r,])
# Name each list element after its pair, e.g. "seq1_seq2"
names(myList) <- apply(compar, 2, paste0, collapse="_")
# Compare the two rows for each element in the list
results <- t(sapply(myList, function(x) as.numeric(x[1,]==x[2,])))
colnames(results) <- colnames(myData)
results
          n1 n2 n3 n4 n5
seq1_seq2  1  1  0  1  0
seq1_seq3  1  0  0  0  0
seq2_seq3  1  0  1  0  0
Answer 4:
You can use this code (it uses myData from @Dominic Comtois's answer):
m <- combn(nrow(myData),2)
result <- sapply(myData,function(C) {z=C[m];z[c(TRUE,FALSE)]==z[c(FALSE,TRUE)]})
# n1 n2 n3 n4 n5
#[1,] TRUE TRUE FALSE TRUE FALSE
#[2,] TRUE FALSE FALSE FALSE FALSE
#[3,] TRUE FALSE TRUE FALSE FALSE
How it works:
- combn generates all possible pairs of row indices.
- sapply loops over each column of myData.
- For each column, we obtain a vector analogue of matrix m in which the row indices are replaced by the values from myData.
- Odd elements of this vector hold the first row of each pair and even elements hold the second row, so we can use the logical masks c(TRUE,FALSE) and c(FALSE,TRUE) to compare odd and even elements.
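To make the masking step concrete, here is a small walk-through for a single column. It is only an illustration: `C` is filled by hand with the values of column n3 of myData, and `first` and `second` are names introduced here, not part of the answer's code.

```r
m <- combn(3, 2)         # 2 x 3 index matrix; columns are (1,2), (1,3), (2,3)
C <- c("T", "G", "G")    # column n3: values for seq1, seq2, seq3

# Indexing with a matrix of indices reads it column by column,
# i.e. as the index vector 1, 2, 1, 3, 2, 3.
z <- C[m]                # "T" "G" "T" "G" "G" "G"

# The two-element logical masks are recycled along z.
first  <- z[c(TRUE, FALSE)]   # odd positions:  first row of each pair
second <- z[c(FALSE, TRUE)]   # even positions: second row of each pair

first == second          # FALSE FALSE TRUE
```

The result matches the n3 column of the output above: seq1 and seq2 differ, seq1 and seq3 differ, seq2 and seq3 agree.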
Source: https://stackoverflow.com/questions/37228001/r-comparing-two-rows-by-columns-and-writing-the-result-in-a-table