I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column.
With sqldf:
# Example by Mehdi Nellen
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <- data.frame(a, b)
Solution:
library(sqldf)
sqldf('SELECT DISTINCT * FROM df')
Output:
a b
1 A 1
2 A 2
3 B 4
4 B 1
5 C 2
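Note that SELECT DISTINCT * only drops rows that are identical in every column. If the goal is one row per value of a single column, as in the question, a minimal base R sketch could look like the following. It assumes the first occurrence is the row you want to keep; first_per_a is just a name I made up for the result.

```r
# Same example data as above
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# duplicated() flags every repeat of a value after its first occurrence;
# negating it keeps the first row seen for each value of column a
first_per_a <- df[!duplicated(df$a), ]
first_per_a
```

This keeps the original row names (1, 4 and 7 here), so reset them with rownames(first_per_a) <- NULL if that matters.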
You can also use dplyr's distinct() function! It tends to be more efficient than alternative options, especially if you have loads of observations.
distinct_data <- dplyr::distinct(yourdata)
The distinct() function in the dplyr package removes duplicate rows, either based on specific columns/variables (as in this question) or considering all columns/variables. dplyr is part of the tidyverse.
Data and package
library(dplyr)
dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))
Remove rows duplicated in a specific column (e.g., column a)
Note that .keep_all = TRUE retains all columns; otherwise only column a would be retained.
distinct(dat, a, .keep_all = TRUE)
a b
1 1 A
2 2 B
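For comparison, here is a small sketch of the same call without .keep_all = TRUE, using the same dat as above (only_a is a name I picked for illustration):

```r
library(dplyr)
dat <- data.frame(a = rep(c(1, 2), 4), b = rep(LETTERS[1:4], 2))

# Without .keep_all = TRUE, only the column named in distinct() is returned
only_a <- distinct(dat, a)
names(only_a)
```

So .keep_all is what decides whether column b survives the deduplication.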
Remove rows that are complete duplicates of other rows:
distinct(dat)
a b
1 1 A
2 2 B
3 1 C
4 2 D
Just restrict your data frame to the columns you need, then use the unique function :D
# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them,
# so they're duplicates and thrown out.
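To make that concrete, here is a small sketch with made-up data. The contents of yourdata and its column names are hypothetical, chosen only so that the fourth column is the one that differs.

```r
# Hypothetical data: rows 1 and 2 agree on the first three columns
yourdata <- data.frame(
  id    = c(1, 1, 2),
  name  = c("x", "x", "y"),
  score = c(10, 10, 20),
  note  = c("first", "second", "third")
)

# unique() on all four columns keeps all 3 rows, because note differs
nrow(unique(yourdata))

# Restricting to the first three columns leaves 2 distinct rows
deduped.data <- unique(yourdata[, 1:3])
deduped.data
```

Keep in mind the result no longer has the fourth column. If you need it, yourdata[!duplicated(yourdata[, 1:3]), ] is one alternative that keeps every column of the first matching row.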
Here's a very simple, fast dplyr/tidy solution:
Remove rows that are entirely the same:
library(dplyr)
iris %>%
  distinct(.keep_all = TRUE)
Remove rows that are the same only in certain columns:
iris %>%
  distinct(Sepal.Length, Sepal.Width, .keep_all = TRUE)
A general answer could be, for example:
df <- data.frame(rbind(c(2,9,6),c(4,6,7),c(4,6,7),c(4,6,7),c(2,9,6)))
# !duplicated(df) keeps the first copy of each row, and still works when there are no duplicates
new_df <- df[!duplicated(df), ]
X1 X2 X3
1 2 9 6
2 4 6 7
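The df[-which(duplicated(df)), ] form of this deserves a caveat, I think: when there are no duplicates, which() returns an empty vector, and a negative empty index selects no rows at all. A sketch of the difference, using a made-up data frame no_dups:

```r
# Hypothetical data frame with no duplicated rows
no_dups <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

# which() returns integer(0); indexing with -integer(0) selects zero rows
via_which <- no_dups[-which(duplicated(no_dups)), ]
nrow(via_which)   # 0

# Logical negation keeps every non-duplicated row, so nothing is lost
via_logical <- no_dups[!duplicated(no_dups), ]
nrow(via_logical) # 3
```

So the logical version looks like the safer default whenever you can't be sure the data actually contains duplicates.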