Distinct in dplyr does not work (sometimes)

问题

I have the following data frame which I have obtained from a count. I have used dput to make the data frame available and then edited the data frame so there is a duplicate of A.

df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"), 
                                         class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)), 
              class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))

print(df)

# A tibble: 4 x 2
  Procedure     n
  <fct>     <int>
1 D         10717
2 A          4412
3 A          2058
4 C          1480

Now I would like to take distinct on Procedure and only keep the first A.

df %>% 
  distinct(Procedure, .keep_all=TRUE)

# A tibble: 4 x 2
  Procedure     n
  <fct>     <int>
1 D         10717
2 A          4412
3 A          2058
4 C          1480

It does not work. Strange...

回答1:

If we print the Procedure column, we can see that there are duplicated levels for a, which is problematic for the distinct function.

df$Procedure
[1] D A A C
Levels: A A C D -1
Warning message:
In print.factor(x) : duplicated level [2] in factor

One way to fix is to drop the factor levels. We can use factor function to achieve this. Another way is to convert the Procedure column to character.

df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"), 
                                           class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))


library(tidyverse)

df %>% 
  mutate(Procedure = factor(Procedure)) %>%
  distinct(Procedure, .keep_all=TRUE)
# # A tibble: 3 x 2
#   Procedure     n
#   <fct>     <int>
# 1 D         10717
# 2 A          4412
# 3 C          1480

回答2:

You have duplicated value in a label parameter .Label = c("A", "A", "C", "D", "-1"). That is an issue. Btw your way of initializing of a tibble seems to be very strange (i do not know exactly your goal but still)

Why not use


df <- tibble(
    Procedure = c("D", "A", "A", "C"),
    n = c(10717L, 4412L, 2058L, 1480L)
)

来源：https://stackoverflow.com/questions/54705659/distinct-in-dplyr-does-not-work-sometimes

标签

dplyr

tibble