Remove duplicated rows

清酒与你 2020-11-22 00:00

I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in th

11 Answers
  • 2020-11-22 00:35

    With sqldf:

    # Example by Mehdi Nellen
    a <- c(rep("A", 3), rep("B", 3), rep("C",2))
    b <- c(1,1,2,4,1,1,2,2)
    df <- data.frame(a, b)
    

    Solution:

    library(sqldf)
    sqldf('SELECT DISTINCT * FROM df')
    

    Output:

      a b
    1 A 1
    2 A 2
    3 B 4
    4 B 1
    5 C 2
    
  • 2020-11-22 00:36

    You can also use dplyr's distinct() function. It tends to be faster than the alternatives, especially on data frames with a large number of rows.

    distinct_data <- dplyr::distinct(yourdata)
    
  • 2020-11-22 00:37

    The function distinct() in the dplyr package performs arbitrary duplicate removal, either from specific columns/variables (as in this question) or considering all columns/variables. dplyr is part of the tidyverse.

    Data and package

    library(dplyr)
    dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))
    

    Remove rows duplicated in a specific column (e.g., column a)

    Note that .keep_all = TRUE retains all columns, otherwise only column a would be retained.

    distinct(dat, a, .keep_all = TRUE)
    
      a b
    1 1 A
    2 2 B
    

    Remove rows that are complete duplicates of other rows:

    distinct(dat)
    
      a b
    1 1 A
    2 2 B
    3 1 C
    4 2 D
    
  • 2020-11-22 00:39

    Just subset your data frame to the columns you need, then call unique() on the result :D

    # in the above example, you only need the first three columns
    deduped.data <- unique( yourdata[ , 1:3 ] )
    # the fourth column no longer 'distinguishes' them, 
    # so they're duplicates and thrown out.
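
    Note that unique() on a column subset also drops the remaining columns from the result. If the other columns should be kept, a rough base R sketch is to build a logical mask with duplicated() on the subset and use it to index the full data frame. The data frame and its column names below are made up for illustration; only the 1:3 column range is taken from the snippet above.

    ```r
    # Hypothetical data: columns 1-3 identify a record, column 4 varies
    yourdata <- data.frame(
      id    = c(1, 1, 2, 2),
      group = c("x", "x", "y", "y"),
      value = c(10, 10, 20, 30),
      note  = c("first", "second", "third", "fourth")
    )

    # unique() on the subset returns only the three selected columns
    deduped.subset <- unique(yourdata[, 1:3])

    # duplicated() on the subset gives a logical mask over the rows; indexing the
    # full data frame with its negation keeps the first row of each group,
    # with all four columns intact
    deduped.full <- yourdata[!duplicated(yourdata[, 1:3]), ]
    ```

    Which row survives depends on row order, so sort the data first if a particular row per group should be kept.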
    
  • 2020-11-22 00:43

    Here's a very simple, fast dplyr/tidy solution:

    Remove rows that are entirely the same:

    library(dplyr)
    iris %>% 
      distinct(.keep_all = TRUE)
    

    Remove rows that are the same only in certain columns:

    iris %>% 
      distinct(Sepal.Length, Sepal.Width, .keep_all = TRUE)
    
    
  • 2020-11-22 00:51

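    A general base R approach, for example:

    df <- data.frame(rbind(c(2,9,6), c(4,6,7), c(4,6,7), c(4,6,7), c(2,9,6)))

    # Keep only the first occurrence of each row. A logical mask is used instead
    # of -which(duplicated(df)), which returns zero rows when nothing is duplicated.
    new_df <- df[!duplicated(df), ]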
    

    output:

          X1 X2 X3
        1  2  9  6
        2  4  6  7
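
    For the case in the question, where a single column decides what counts as a duplicate, the same duplicated() idea can be applied to that column alone. This is a minimal sketch on made-up data, assuming the column of interest is X1 (the default name data.frame() gives the first column here):

    ```r
    # Illustrative data: X1 repeats, the other columns differ between some rows
    df <- data.frame(rbind(c(2, 9, 6), c(4, 6, 7), c(4, 1, 7), c(2, 9, 6)))

    # Keep the first row seen for each distinct value of X1
    first_per_x1 <- df[!duplicated(df$X1), ]

    # fromLast = TRUE scans from the bottom, so the last row per value is kept
    last_per_x1 <- df[!duplicated(df$X1, fromLast = TRUE), ]
    ```

    Here first_per_x1 holds original rows 1 and 2, and last_per_x1 holds rows 3 and 4.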
    