Question
I have protein-protein interaction data for Homo sapiens. The size of the matrix is <4850628x3>. The first two columns are proteins and the third is the confidence score of their interaction. The problem is that half the rows are duplicate pairs.
If protein A interacts with B, C, and D, it is listed as
- A B 0.8
- A C 0.5
- A D 0.6
- B A 0.8
- C A 0.5
- D A 0.6
Note that the confidence score of A interacting with B and of B interacting with A is the same, 0.8.
Since half the rows of the <4850628x3> matrix are duplicate pairs, if I choose Unique(1,:) I might lose some data.
I want <2425314x3>, i.e. the matrix without duplicate pairs. How can I do it efficiently?
Thanks Naresh
Answer 1:
Suppose that each protein is stored in your matrix as a unique numeric ID (e.g. A=1, B=2, C=3, ...). Your example matrix will then be:
M =
1.0000 2.0000 0.8000
1.0000 3.0000 0.5000
1.0000 4.0000 0.6000
2.0000 1.0000 0.8000
3.0000 1.0000 0.5000
4.0000 1.0000 0.6000
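If the proteins are stored as names rather than numbers, one way to get such IDs is the third output of unique. This is only a sketch: the variables p1, p2 and score are hypothetical cell/numeric columns made up for illustration, not something taken from the question.

```matlab
% Hypothetical input: protein names in two cell columns plus a score column
p1    = {'A'; 'A'; 'A'; 'B'; 'C'; 'D'};
p2    = {'B'; 'C'; 'D'; 'A'; 'A'; 'A'};
score = [0.8; 0.5; 0.6; 0.8; 0.5; 0.6];

% unique returns the sorted distinct names and, as third output,
% the position of every input name in that sorted list
[names, ~, ids] = unique([p1; p2]);
ids = ids(:);                      % force a column vector

% The first n ids belong to p1, the remaining n to p2
n = numel(p1);
M = [ids(1:n), ids(n+1:end), score];
```

names(k) gives back the protein name for ID k, so the numeric result can be translated to names again afterwards.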
You must first sort the first two columns row-wise so that each protein pair always appears in the same order:
M2 = sort(M(:,1:2),2)
M2 =
1 2
1 3
1 4
1 2
1 3
1 4
Then use unique with 'rows' as the second argument and keep the indexes of the unique pairs:
[~, idx] = unique(M2, 'rows')
idx =
1
2
3
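Dropping one row of each pair is only safe if both directions really carry the same score, as in the example. A possible check, sketched here on the small example matrix, groups the rows by sorted pair with the third output of unique and compares the minimum and maximum score per group with accumarray. The names grp, lo, hi, counts and the tolerance 1e-12 are my own choices.

```matlab
M  = [1 2 0.8; 1 3 0.5; 1 4 0.6; 2 1 0.8; 3 1 0.5; 4 1 0.6];
M2 = sort(M(:,1:2), 2);            % put each pair in a fixed order

% grp(k) is the index of the unique pair that row k belongs to
[pairs, ~, grp] = unique(M2, 'rows');
grp = grp(:);

% Smallest and largest score, and number of rows, per unique pair
lo     = accumarray(grp, M(:,3), [], @min);
hi     = accumarray(grp, M(:,3), [], @max);
counts = accumarray(grp, 1);

% True when every pair has one consistent score (tolerance is an assumption)
consistent = all(hi - lo < 1e-12);

% Deduplicated result with each pair in sorted order
R = [pairs, lo];
```

If consistent is false, you have to decide which score to keep (min, max or mean) before removing rows.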
Finally, filter your initial matrix to keep only the unique pairs:
R = M(idx,:)
R =
1.0000 2.0000 0.8000
1.0000 3.0000 0.5000
1.0000 4.0000 0.6000
Et voilà!
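Since the question asks about efficiency: if you can assume that every pair appears exactly once in each direction and that there are no self-interactions (A A), a logical mask avoids both the sort and the unique call. This is a different technique from the one above and it relies entirely on that assumption, so treat it as a sketch.

```matlab
M = [1 2 0.8; 1 3 0.5; 1 4 0.6; 2 1 0.8; 3 1 0.5; 4 1 0.6];

% Keep only the rows where the first ID is smaller than the second;
% the mirrored row (second ID smaller) is dropped
mask = M(:,1) < M(:,2);
R = M(mask, :);
```

A quick sanity check is that size(R,1) is exactly half of size(M,1). If it is not, some pairs are missing one direction and the sort/unique approach is the safer choice.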
Source: https://stackoverflow.com/questions/34811327/matlab-removing-duplicate-interactions