how to speed up a 'unique' dataframe search

Submitted by 会有一股神秘感 on 2021-01-01 06:51:19

Question


I have a data frame with dimensions of 2,377,426 rows by 2 columns, which looks something like this:

                   Name                                            Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC

All the values in the first column (Name) are unique, but there are many duplicates in the 'Seq' column. I want a data.frame that contains only the unique sequences, each with a name. I have tried unique, but it is too slow. I have also tried sorting the data frame and using the following code:

# sort by Seq, then keep the first row of each run of identical sequences
dat_sorted <- data[order(data$Seq), ]
m <- dat_sorted[1, ]
x <- 1
for (i in 1:nrow(dat_sorted))
  if (dat_sorted[i, 2] != m[x, 2]) { x <- x + 1; m[x, ] <- dat_sorted[i, ] }

Again, this is too slow! Is there a faster way to find the unique values in one column of a data frame?


Answer 1:


data[!duplicated(data$Seq), ]

should do the trick.
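
For illustration, here is a minimal sketch on a tiny made-up data frame (the Name/Seq values are invented; the column names follow the question). duplicated() flags every repeat of a sequence after its first occurrence, so negating it keeps exactly one named row per unique sequence, and it is vectorised rather than looping row by row:

# toy data frame with duplicated sequences (values invented for illustration)
data <- data.frame(
  Name = c("E1:E2", "E1:E3", "E1:E4", "E1:E5"),
  Seq  = c("AAAA",  "AAAA",  "AAGG",  "AAGG"),
  stringsAsFactors = FALSE
)

# keep only the first row of each unique Seq
data[!duplicated(data$Seq), ]
#    Name  Seq
# 1 E1:E2 AAAA
# 3 E1:E4 AAGG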




Answer 2:


library(dplyr)
data %>% distinct

This should be worth it, especially if your data is too big for your machine.
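
If the goal is specifically one row per unique Seq while keeping its Name (as in the question), distinct() can be restricted to that column with .keep_all = TRUE. A sketch, assuming the data frame and column names from the question:

library(dplyr)

# keep the first row encountered for each unique Seq, retaining the Name column
unique_data <- data %>% distinct(Seq, .keep_all = TRUE)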



Source: https://stackoverflow.com/questions/27267510/how-to-speed-up-a-unique-dataframe-search
