Number of unique values in sparklyr

名媛妹妹 2020-12-20 02:13

The following example shows that you can't calculate the number of distinct values without aggregating the rows when using dplyr with sparklyr.

Is there a workaround?
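For reference, the naive dplyr approach that fails looks roughly like this (a minimal sketch, assuming an existing Spark connection and a Spark DataFrame `df.spark` with a column `ids`; Spark rejects distinct aggregates inside window functions):

    library(sparklyr)
    library(dplyr)

    # Fails on Spark: n_distinct() translates to COUNT(DISTINCT ids)
    # over a window, and distinct window functions are not supported.
    df.spark %>% mutate(n_ids = n_distinct(ids))
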

2 Answers
  • 2020-12-20 02:14

    The best approach here is to compute the count separately, either with a distinct count:

    n_ids <- df.spark %>%
       select(ids) %>% distinct() %>% count() %>% collect() %>%
       unlist() %>% as.vector()
    
    df.spark %>% mutate(n_ids = n_ids)
    

    or approx_count_distinct:

    n_ids_approx <- df.spark %>%
       select(ids) %>% summarise(approx_count_distinct(ids)) %>% collect() %>%
       unlist() %>% as.vector()
    
    df.spark %>% mutate(n_ids = n_ids_approx)
    

    It is a bit verbose, but the window-function approach used by dplyr is a dead end anyway if you want a global, unbounded frame.

    If you want exact results, you can also:

    df.spark %>% 
        spark_dataframe() %>% 
        invoke("selectExpr", list("COUNT(DISTINCT ids) as cnt_unique_ids")) %>% 
        sdf_register()
    
  • 2020-12-20 02:15

    I want to link in this thread, which answers this for sparklyr.

    I think using approx_count_distinct is the best solution. In my experience, dbplyr doesn't translate this function when it is used in a window, so it is better to write the SQL yourself.

    mtcars_spk <- copy_to(sc, mtcars, "mtcars_spk", overwrite = TRUE)
    mtcars_spk2 <- mtcars_spk %>%
        dplyr::mutate(test = paste0(gear, " ", carb)) %>%
        dplyr::mutate(discnt = sql("approx_count_distinct(test) OVER (PARTITION BY cyl)"))
    

    This thread approaches the problem more generally and discusses CountDistinct vs. approxCountDistinct.
