Question
I have a data frame with dimensions of 2377426 rows by 2 columns, which looks something like this:
Name Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
All the values in the first column (Name) are unique, but there are many duplicates in the column 'Seq'. I want a data.frame that contains only the unique sequences, each paired with one name. I have tried unique, but it is too slow. I have also tried ordering the data frame and using the following code:
dat_sorted = data[order(data$Seq), ]   # sort so duplicate sequences sit next to each other
m = dat_sorted[1, ]                    # running result, seeded with the first row
x = 1
for (i in 1:nrow(dat_sorted)) {        # append a row whenever a new sequence appears
  if (dat_sorted[i, 2] != m[x, 2]) { x = x + 1; m[x, ] = dat_sorted[i, ] }
}
Again, this is too slow! Is there a faster way to find the unique values in one column of a data frame?
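(For the answers below, a small stand-in for the real data can be built as follows; the column names match the question, but the Name and Seq values are invented for illustration.)
data <- data.frame(
  Name = c("E1", "E2", "E3", "E4", "E5", "E6"),                # unique names, stand-ins for the ENSE ids
  Seq  = c("AAAT", "AAAT", "AAAT", "GGCA", "GGCA", "TTTC"),    # sequences with duplicates
  stringsAsFactors = FALSE
)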
Answer 1:
data[!duplicated(data$Seq), ]
should do the trick.
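On the toy data sketched above, duplicated(data$Seq) flags every repeat of a sequence after its first appearance, so negating it keeps the first Name for each unique Seq. This is a single vectorised pass over the column, which is why it scales to millions of rows far better than the row-by-row loop:
kept <- data[!duplicated(data$Seq), ]   # keep only the first occurrence of each sequence
kept$Seq
# [1] "AAAT" "GGCA" "TTTC"              (rows 1, 4 and 6 survive, one per unique sequence)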
Answer 2:
library(dplyr)
data %>% distinct(Seq, .keep_all = TRUE)
It should be worth it, especially if your data is too big for your machine.
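A minimal sketch on the same toy data, assuming dplyr 0.5 or later for the .keep_all argument: naming Seq tells distinct() to deduplicate on that column only, while .keep_all = TRUE keeps the Name column in the result. (A bare distinct() would compare both columns and, since Name is unique, remove nothing.)
library(dplyr)
data %>% distinct(Seq, .keep_all = TRUE)
#   Name  Seq
# 1   E1 AAAT
# 2   E4 GGCA
# 3   E6 TTTC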
Source: https://stackoverflow.com/questions/27267510/how-to-speed-up-a-unique-dataframe-search