how to speed up a 'unique' dataframe search

Submitted by 会有一股神秘感 on 2021-01-01 06:51:19

Question


I have a data frame with dimensions of 2,377,426 rows by 2 columns, which looks something like this:

                   Name                                            Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC

All the values in the first column (Name) are unique, but there are many duplicates in the 'Seq' column. I want a data.frame that contains only the unique sequences, each with a name. I have tried unique, but it is too slow. I have also tried sorting the data frame and using the following code:

# sort by Seq, then keep the first row of each run of identical sequences
dat_sorted <- data[order(data$Seq), ]
m <- dat_sorted[1, ]
x <- 1
for (i in 1:nrow(dat_sorted))
  if (dat_sorted[i, 2] != m[x, 2]) { x <- x + 1; m[x, ] <- dat_sorted[i, ] }

Again, this is too slow! Is there a faster way to find the unique values in one column of a data frame?


Answer 1:


data[!duplicated(data$Seq), ]

should do the trick.
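
For illustration, here is a minimal sketch on a tiny made-up data frame (the Name/Seq values are invented; the column names follow the question). duplicated() flags every repeat of a sequence after its first occurrence, so negating it keeps exactly one named row per unique sequence, and it is vectorised rather than looping row by row:

# toy data frame with duplicated sequences (values invented for illustration)
data <- data.frame(
  Name = c("E1:E2", "E1:E3", "E1:E4", "E1:E5"),
  Seq  = c("AAAA",  "AAAA",  "AAGG",  "AAGG"),
  stringsAsFactors = FALSE
)

# keep only the first row of each unique Seq
data[!duplicated(data$Seq), ]
#    Name  Seq
# 1 E1:E2 AAAA
# 3 E1:E4 AAGG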




Answer 2:


library(dplyr)
data %>% distinct

This should be worth it, especially if your data is too big for your machine.
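
If the goal is specifically one row per unique Seq while keeping its Name (as in the question), distinct() can be restricted to that column with .keep_all = TRUE. A sketch, assuming the data frame and column names from the question:

library(dplyr)

# keep the first row encountered for each unique Seq, retaining the Name column
unique_data <- data %>% distinct(Seq, .keep_all = TRUE)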



Source: https://stackoverflow.com/questions/27267510/how-to-speed-up-a-unique-dataframe-search
