The following example shows that you can't calculate the number of distinct values without first aggregating the rows when using dplyr with sparklyr. Is there a workaround?
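A minimal sketch of the failing pattern (the connection sc, the toy data, and the table name are illustrative assumptions):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
df.spark <- copy_to(sc, data.frame(ids = c(1, 1, 2, 3)), "df_spark")

# Fails: dbplyr translates n_distinct(ids) into COUNT(DISTINCT ids) over
# an unbounded window, and Spark does not support DISTINCT aggregates
# inside window functions
df.spark %>% mutate(n_ids = n_distinct(ids))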
The best approach here is to compute the count separately, either with count ∘ distinct:
# Collect the exact distinct count to the driver as a plain scalar...
n_ids <- df.spark %>%
  select(ids) %>%
  distinct() %>%
  count() %>%
  collect() %>%
  unlist() %>%
  as.vector()

# ...then add it back as a constant column
df.spark %>% mutate(n_ids = n_ids)
or with approx_count_distinct:
# Approximate distinct count (HyperLogLog-based), collected as a scalar
n_ids_approx <- df.spark %>%
  select(ids) %>%
  summarise(approx_count_distinct(ids)) %>%
  collect() %>%
  unlist() %>%
  as.vector()

df.spark %>% mutate(n_ids = n_ids_approx)
It is a bit verbose, but the window function approach used by dplyr is a dead end anyway if you want to use a global unbounded frame.
If you want exact results, you can also:
df.spark %>%
  spark_dataframe() %>%   # drop down to the underlying Java object
  invoke("selectExpr", list("COUNT(DISTINCT ids) as cnt_unique_ids")) %>%
  sdf_register()          # register the result back as a Spark tbl
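Since sdf_register() returns a one-row Spark table rather than a local value, one way to attach that exact count to every row without collecting is a constant-key join (a sketch; the helper column one is just an illustrative join key):

cnt <- df.spark %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("COUNT(DISTINCT ids) as cnt_unique_ids")) %>%
  sdf_register()

# Cross join via a constant key, then drop the key
df.spark %>%
  mutate(one = 1L) %>%
  inner_join(cnt %>% mutate(one = 1L), by = "one") %>%
  select(-one)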
I want to link to this thread, which answers this for sparklyr.
I think using approx_count_distinct is the best solution. In my experience, dbplyr doesn't translate this function when it is used in a window, so it is better to write the SQL yourself.
mtcars_spk <- copy_to(sc, mtcars, "mtcars_spk", overwrite = TRUE)

mtcars_spk2 <- mtcars_spk %>%
  dplyr::mutate(test = paste0(gear, " ", carb)) %>%
  # hand-written window SQL, since dbplyr won't translate this itself
  dplyr::mutate(discnt = sql("approx_count_distinct(test) OVER (PARTITION BY cyl)"))
This thread approaches the problem more generally and discusses CountDistinct vs. approxCountDistinct.
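For reference, outside of a window both functions do translate at the aggregate level, so you can compare exact and approximate counts directly (a sketch reusing the mtcars_spk table from above):

mtcars_spk %>%
  dplyr::mutate(test = paste0(gear, " ", carb)) %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarise(
    exact  = n_distinct(test),             # COUNT(DISTINCT test): exact, but heavier
    approx = approx_count_distinct(test)   # HyperLogLog-based estimate
  )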