R, dplyr: cumulative version of n_distinct


I have a dataframe as follows. It is ordered by column time.

Input -

df = data.frame(time = 1:20,
                grp = sort(rep(1:5, 4)),
                var1 = rep(c('A', 'B'), 10))  # var1 values assumed for illustration; the original definition was cut off

4 Answers

夕颜 (OP) · 2021-02-09 04:01

    Here's another solution using data.table that's pretty quick.

    Generic Function

    library(data.table)

    cum_n_distinct <- function(x, na.include = TRUE){
      # Given a vector x, returns a corresponding vector y
      # where the ith element of y gives the number of unique
      # elements observed up to and including index i.
      # If na.include = TRUE (default), NA is counted as an
      # additional unique element; otherwise it is ignored.

      temp <- data.table(x, idx = seq_along(x))
      firsts <- temp[temp[, .I[1L], by = x]$V1]   # rows where each distinct value first appears
      if(!na.include) firsts <- firsts[!is.na(x)]
      y <- rep(0, times = length(x))
      y[firsts$idx] <- 1    # mark first occurrences
      y <- cumsum(y)        # running count of distinct values seen so far

      return(y)
    }
    

    Example Use

    cum_n_distinct(c(5,10,10,15,5))  # 1 2 2 3 3
    cum_n_distinct(c(5,NA,10,15,5))  # 1 2 3 4 4
    cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE)  # 1 1 2 3 3
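
    For the default na.include = TRUE case, the same running counts can also be obtained with base R's duplicated(), which treats NA as its own value. A quick sketch (not part of the original answer) checking the equivalence on the vectors above:

    cumsum(!duplicated(c(5,10,10,15,5)))  # 1 2 2 3 3
    cumsum(!duplicated(c(5,NA,10,15,5)))  # 1 2 3 4 4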
    

    Solution To Your Question

    d_out = df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate(var2 = cum_n_distinct(var1))
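
    If the default treatment of NA is all you need, a data.table-free sketch of the same pipeline (assuming dplyr is loaded and using the illustrative df above) drops that base one-liner straight into the grouped mutate:

    library(dplyr)

    d_out2 = df %>%
      arrange(time) %>%
      group_by(grp) %>%
      mutate(var2 = cumsum(!duplicated(var1))) %>%
      ungroup()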
    
