How to remove duplicate rows in both using a condition in R [duplicate]

对着背影说爱祢 提交于 2019-12-12 02:49:28

问题


The data I have is something like that:

RES1 <- c("A","B","A","A","B")
RES2 <- c("B","A","A","B","A")
VAL1 <-c(3,5,3,6,8)
VAL2 <- c(5,3,7,2,7)
dff <- data.frame(RES1,VAL1,RES2,VAL2)
dff
  RES1 VAL1 RES2 VAL2
  1    A    3    B    5 
  2    B    5    A    3
  3    A    3    A    7
  4    A    6    B    2
  5    B    8    A    7

I want to remove the lines where I already have the same res1-res2 pair. For example: A 3 interacts with B 5. That's the information I want. I do not care which pair is first. B 5 with A 3 or A 3 with B 5. What I want to get is the following dataframe:

output
  RES1 VAL1 RES2 VAL2
   1    A    3    B    5
   2    A    3    A    7
   3    A    6    B    2
   4    B    8    A    7

Then I want to do the same for another data frame such as :

RES3 <- c("B","B","B","A","B")
RES4 <- c("A","A","A","A","B")
VAL4 <- c(3,7,5,3,8)
VAL3 <- c(5,8,3,7,3)
df2 <- data.frame(RES3,VAL3,RES4,VAL4)

df2
  RES3 VAL3 RES4 VAL4
   1     B     5     A     3
   2     B     8     A     7
   3     B     3     A     5
   4     A     7     A     3
   5     B     3     B     8

At the end, I just want to keep mutual pairs (in my definition both pairs are the same, keeping one is essential : "A 5" - "B 3" is the same as "B 3" - "A 5". In other words, order does not matter.

Final output I desire should have the following pairs which are unique and which exist in BOTH dataframes:

mutualpairs
  RESA VALA RESB VALB
  A     3     B     5
  A     3     A     7
  B     8     A     7

回答1:


You can use this code:

dff[!duplicated(t(apply(cbind(paste(dff$RES1,dff$VAL1),paste(dff$RES2,dff$VAL2)),1,sort))),]

Equivalent unrolled code:

v1 <- paste(dff$RES1,dff$VAL1)
v2 <- paste(dff$RES2,dff$VAL2)
mx <- cbind(v1,v2)
mxSorted <- t(apply(mx,1,sort))
duped <- duplicated(mxSorted)
dff[!duped,]

Explanation:

1) we create two character vectors v1, v2 by concatenating columns RES1-VAL1 and RES2-VAL2 (note that paste uses a space as default separator, maybe you could use another character or string to be safer (e.g. |,@,; etc...)
Result:

> v1
[1] "A 3" "B 5" "A 3" "A 6" "B 8"
> v2
[1] "B 5" "A 3" "A 7" "B 2" "A 7"

2) we bind these two vectors to form a matrix using cbind;
Result:

     [,1]  [,2] 
[1,] "A 3" "B 5"
[2,] "B 5" "A 3"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "B 8" "A 7"

3) we sort the values of each row of the matrix using t(apply(mx,1,sort));
by sorting the rows, we simply make identical the rows having the same values just swapped (note that final transpose is necessary since apply function always returns results on the columns).
Result:

     [,1]  [,2] 
[1,] "A 3" "B 5"
[2,] "A 3" "B 5"
[3,] "A 3" "A 7"
[4,] "A 6" "B 2"
[5,] "A 7" "B 8"

4) calling duplicated on a matrix, we get a logical vector of length = nrow(matrix), being TRUE where a row is a duplicate of a previous row, so in our case, we get:

[1] FALSE  TRUE FALSE FALSE FALSE
# i.e. the second row is a duplicate

5) finally we use this vector to filter the rows of the data.frame, getting the final result:

  RES1 VAL1 RES2 VAL2
1    A    3    B    5
3    A    3    A    7
4    A    6    B    2
5    B    8    A    7



回答2:


Possible duplicate of Remove duplicates column combinations from a dataframe in R

Adapting the answer here:

dff[!duplicated(dff[c('RES1','RES2')]),]


来源:https://stackoverflow.com/questions/36802021/how-to-remove-duplicate-rows-in-both-using-a-condition-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!