I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in th
This problem can also be solved by selecting the first row from each group, where the groups are defined by the columns in which we want unique values (in the example shared, that is just the first column).
Using base R:
subset(df, ave(V2, V1, FUN = seq_along) == 1)
# V1 V2 V3 V4 V5
#1 platform_external_dbus 202 16 google 1
In dplyr:
library(dplyr)
df %>% group_by(V1) %>% slice(1L)
Or using data.table:
library(data.table)
setDT(df)[, .SD[1L], by = V1]
If we need unique rows based on multiple columns, just add those column names to the grouping part of each of the answers above.
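As a rough sketch of the multi-column case, here is the base R version grouped on both V1 and V5. The choice of V5 as the second key and the names `res_base` and `res_dup` are only for illustration, and the data frame is rebuilt inline (with plain character columns instead of factors) so the snippet runs on its own. Numbering rows with `seq_along(V1)` instead of `V2` is a small variation that should not depend on `V2` being numeric.

```r
# Same values as the example data in this answer, rebuilt inline.
df <- data.frame(
  V1 = rep("platform_external_dbus", 5),
  V2 = rep(202L, 5),
  V3 = rep(16L, 5),
  V4 = c("google", "space-ghost.verbum", "localhost",
         "users.sourceforge", "hughsie"),
  V5 = c(1L, 1L, 1L, 8L, 1L)
)

# Number the rows within each (V1, V5) group and keep the first of each.
res_base <- subset(df, ave(seq_along(V1), V1, V5, FUN = seq_along) == 1)

# Cross-check: duplicated() on the same key columns keeps the same rows.
res_dup <- df[!duplicated(df[c("V1", "V5")]), ]

print(res_base)

# The dplyr and data.table equivalents would be (not run here):
#   df %>% group_by(V1, V5) %>% slice(1L)
#   setDT(df)[, .SD[1L], by = .(V1, V5)]
```

With this data, rows 1 and 4 are kept, since V5 has only two distinct values within the single V1 group.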
data
df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = "platform_external_dbus", class = "factor"),
V2 = c(202L, 202L, 202L, 202L, 202L), V3 = c(16L, 16L, 16L,
16L, 16L), V4 = structure(c(1L, 4L, 3L, 5L, 2L), .Label = c("google",
"hughsie", "localhost", "space-ghost.verbum", "users.sourceforge"
), class = "factor"), V5 = c(1L, 1L, 1L, 8L, 1L)), class = "data.frame",
row.names = c(NA, -5L))