I want to find the indices of all the rows of a matrix which have duplicates. For example:
A = [1 2 3 4
     1 2 3 4
     2 3 4 5
     1 2 3 4
     6 5 4 3]
The vector to be returned would be [1,2,4]
A lot of similar questions suggest using the unique function, which I've tried, but the closest I can get to what I want is:
[C, ia, ic] = unique(A, 'rows');   % ia = [1 3 5]
m = 5;
setdiff(1:m, ia)                   % returns [2 4]
But using unique I can only extract the 2nd, 3rd, 4th, etc. instance of a row, and I need to also obtain the first. Is there any way I can do this?
NB: It must be a method which doesn't involve looping through the rows, as I'm dealing with large sparse matrices.
How about:
[~, ia, ic] = unique(A, 'rows');
setdiff(1:size(A,1), ia( sum(bsxfun(@eq,ic,(1:max(ic))))<=1 ))
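To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps on the sample matrix from the question. The variable names counts, singles and dupRows are introduced only for illustration, and the explicit ic(:) and the dimension argument to sum are defensive additions that are not in the original line.

```matlab
% Sample matrix from the question
A = [1 2 3 4; 1 2 3 4; 2 3 4 5; 1 2 3 4; 6 5 4 3];

[~, ia, ic] = unique(A, 'rows');
ic = ic(:);                                   % force a column vector (defensive)

% One column per unique row; summing down each column counts its occurrences
counts = sum(bsxfun(@eq, ic, 1:max(ic)), 1);

% Original indices of the rows that occur exactly once
singles = ia(counts <= 1);

% Every other row is part of a duplicated group
dupRows = setdiff(1:size(A,1), singles);
```

Because only the rows that occur once are removed, it should not matter whether unique reports the first or the last occurrence of a repeated row in ia.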
Three other possibilities:
1. Sort rows of the matrix (with sortrows), detect equal rows (with diff) and use indexing to undo the sorting:
[As, inds] = sortrows(A);
ind = find(all(diff(As)==0,2));
result = inds(union(ind,ind+1));
2. Directly compare every row against every other row (with bsxfun):
match = squeeze(all((bsxfun(@eq, A, permute(A, [3 2 1]))), 2));
result = find(any(match - eye(size(A,1))));
3. Use pdist with Hamming distance instead of bsxfun:
match = ~squareform(pdist(A,'hamming'));
result = find(any(match - eye(size(A,1))));
The advantage of approaches 2 and 3 is that you additionally get a (symmetric) matrix, match, which tells you which row equals which other. For your example,
>> match
match =
     1     1     0     1     0
     1     1     0     1     0
     0     0     1     0     0
     1     1     0     1     0
     0     0     0     0     1
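As a rough illustration of how the match matrix can be used, the sketch below builds it with the bsxfun approach and reads off the rows equal to row 1, then runs the sortrows approach alongside for comparison. The names sameAsFirst and viaSort are hypothetical, and the call to sort is an addition of mine so that the indices come back in ascending order.

```matlab
% Sample matrix from the question
A = [1 2 3 4; 1 2 3 4; 2 3 4 5; 1 2 3 4; 6 5 4 3];

% bsxfun approach: pairwise row comparison, giving an n-by-n logical matrix
match = squeeze(all(bsxfun(@eq, A, permute(A, [3 2 1])), 2));

% Rows identical to row 1 (row 1 itself is included, since the diagonal is true)
sameAsFirst = find(match(1, :));

% sortrows approach for comparison; sort() is added to return ascending indices
[As, inds] = sortrows(A);
ind = find(all(diff(As) == 0, 2));
viaSort = sort(inds(union(ind, ind + 1)));
```

Note that match is a full n-by-n matrix, so its memory use grows quadratically with the number of rows. For a large matrix the sortrows approach is probably the more practical of the three.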
One way to identify duplicates is to apply accumarray on the ic vector from unique. Then, setdiff will return the full list of indices of duplicate rows.
[~, ia, ic] = unique(A,'rows')
dupRows = setdiff(1:size(A,1),ia(accumarray(ic,1)<=1))
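For reference, here is a small sketch of what accumarray produces on the sample matrix, together with a variant that skips setdiff by indexing the counts with ic. That variant is my own substitution, not part of the answer above, and the names counts and dupRowsAlt are only illustrative.

```matlab
% Sample matrix from the question
A = [1 2 3 4; 1 2 3 4; 2 3 4 5; 1 2 3 4; 6 5 4 3];

[~, ia, ic] = unique(A, 'rows');

% counts(k) is the number of rows of A equal to the k-th unique row
counts = accumarray(ic(:), 1);

% Method from the answer: drop the rows that occur only once
dupRows = setdiff(1:size(A,1), ia(counts <= 1));

% Variant (assumption, not from the answer): look up each row's count directly
dupRowsAlt = find(counts(ic) > 1);
```

On this input both versions should select rows 1, 2 and 4, although the two results may differ in orientation (row versus column vector).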