R, dplyr: cumulative version of n_distinct


I have a data frame as follows. It is ordered by the column time.

Input -

df = data.frame(time = 1:20,
            grp = sort(rep(1:5,4)),
             


        
4 Answers
  • 2021-02-09 03:49

    Assuming the data is already ordered by time, first define a cumulative distinct-count function:

    dist_cum <- function(var)
      sapply(seq_along(var), function(x) length(unique(head(var, x))))
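
    To make the behaviour of dist_cum concrete, here is a minimal sketch on a made-up vector (the values in v are hypothetical, not taken from the question). For each position it looks at the prefix up to that position and counts the distinct values in it:

    ```r
    dist_cum <- function(var)
      sapply(seq_along(var), function(x) length(unique(head(var, x))))

    v <- c("A", "B", "A", "C", "B")  # hypothetical sample values

    # The prefixes that dist_cum inspects, one per position
    prefixes <- lapply(seq_along(v), function(i) head(v, i))

    # Number of distinct values in each prefix
    counts <- dist_cum(v)
    print(counts)  # 1 2 2 3 3
    ```

    Because every position re-scans its whole prefix, the work grows roughly quadratically with the length of the vector, which is fine for small groups but may be slow for long ones.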
    

    Then a base R solution that uses ave to split by grp and apply our function to each group (note: this assumes var1 is a factor, since it is converted with as.integer):

    transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
    

    A data.table solution, basically doing the same thing:

    library(data.table)
    (data.table(df)[, var2:=dist_cum(var1), by=grp])
    

    And dplyr, again, same thing:

    library(dplyr)
    df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
    
  • 2021-02-09 03:51

    A dplyr solution inspired by @akrun's answer:

    The logic is to set the first occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum to that flag:

    df = df %>%
      arrange(time) %>%
      group_by(grp,var1) %>%
      mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
      group_by(grp) %>%
      mutate(var2 = cumsum(var_temp)) %>%
      select(-var_temp)
    
    head(df,10)
    
    Source: local data frame [10 x 4]
    Groups: grp
    
       time grp var1 var2
    1     1   1    A    1
    2     2   1    B    2
    3     3   1    A    2
    4     4   1    B    2
    5     5   2    A    1
    6     6   2    B    2
    7     7   2    A    2
    8     8   2    B    2
    9     9   3    A    1
    10   10   3    B    2
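
    The same flag-then-cumsum logic can be sketched in base R, which makes the intermediate flag column visible. The data frame toy and the column first_flag below are hypothetical and only for illustration, since the question's full input is not shown here:

    ```r
    # Hypothetical toy data: two groups, already ordered by time
    toy <- data.frame(time = 1:8,
                      grp  = rep(1:2, each = 4),
                      var1 = c("A", "B", "A", "B", "A", "A", "B", "C"),
                      stringsAsFactors = FALSE)

    # 1 at the first row of each (grp, var1) pair, 0 on repeats
    toy$first_flag <- as.integer(!duplicated(toy[c("grp", "var1")]))

    # Running total of the flag within each group
    toy$var2 <- ave(toy$first_flag, toy$grp, FUN = cumsum)

    print(toy)
    ```

    In this toy data the second group sees A twice before B and C, so its first_flag is 1, 0, 1, 1 and its var2 is 1, 1, 2, 3.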
    
  • 2021-02-09 03:56

    Update: with your new dataset, here is an approach in base R:

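      # Note: relies on df being sorted by grp so the unlisted result lines up with the rows
      df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
        x$var2 <- 0
        # positions where each distinct var1 value first appears in this group
        indx <- match(unique(x$var1), x$var1)
        x$var2[indx] <- 1
        # running total of first appearances = cumulative distinct count
        cumsum(x$var2)
      }))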
    
      head(df,7)
      #   time grp var1 var2
      # 1    1   1    A    1
      # 2    2   1    B    2
      # 3    3   1    A    2
      # 4    4   1    B    2
      # 5    5   2    A    1
      # 6    6   2    B    2
      # 7    7   2    A    2
    
  • 2021-02-09 04:01

    Here's another solution using data.table that's pretty quick.

    Generic Function

    library(data.table)

    cum_n_distinct <- function(x, na.include = TRUE){
      # Given a vector x, returns a corresponding vector y
      # where the ith element of y gives the number of unique
      # elements observed up to and including index i
      # if na.include = TRUE (default) NA is counted as an 
      # additional unique element, otherwise it's essentially ignored
    
      temp <- data.table(x, idx = seq_along(x))
      firsts <- temp[temp[, .I[1L], by = x]$V1]
      if (!na.include) firsts <- firsts[!is.na(x)]
      y <- rep(0, times = length(x))
      y[firsts$idx] <- 1
      y <- cumsum(y)
    
      return(y)
    }
    

    Example Use

    cum_n_distinct(c(5,10,10,15,5))  # 1 2 2 3 3
    cum_n_distinct(c(5,NA,10,15,5))  # 1 2 3 4 4
    cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE)  # 1 1 2 3 3
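
    For comparison, the same three results can be reproduced without data.table. This is only a sketch of an alternative: the name cum_n_distinct_base is hypothetical, and the approach swaps in base R's duplicated() rather than the .I lookup used above:

    ```r
    # Hypothetical base R counterpart of cum_n_distinct
    cum_n_distinct_base <- function(x, na.include = TRUE) {
      # TRUE at the first occurrence of each distinct value (NA included)
      first <- !duplicated(x)
      # Optionally stop NA from counting as a distinct value
      if (!na.include) first <- first & !is.na(x)
      cumsum(first)
    }

    r1 <- cum_n_distinct_base(c(5, 10, 10, 15, 5))                      # 1 2 2 3 3
    r2 <- cum_n_distinct_base(c(5, NA, 10, 15, 5))                      # 1 2 3 4 4
    r3 <- cum_n_distinct_base(c(5, NA, 10, 15, 5), na.include = FALSE)  # 1 1 2 3 3
    ```

    It can be dropped into the grouped mutate in the same way, for example mutate(var2 = cum_n_distinct_base(var1)).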
    

    Solution To Your Question

    d_out = df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate(var2 = cum_n_distinct(var1))
    