The following example shows that you can't calculate the number of distinct values without first aggregating the rows when using dplyr with sparklyr. Is there a workaround?
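A minimal sketch of the failing pattern (the connection sc, the toy data, and the table name are illustrative assumptions):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
df.spark <- copy_to(sc, data.frame(ids = c(1, 1, 2, 3)), "df_spark")

# Fails: dbplyr translates n_distinct(ids) into COUNT(DISTINCT ids) over
# an unbounded window, and Spark does not support DISTINCT aggregates
# inside window functions
df.spark %>% mutate(n_ids = n_distinct(ids))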
The best approach here is to compute the count separately, either with count ∘ distinct:
# Collect the exact distinct count to the driver as a plain scalar...
n_ids <- df.spark %>%
  select(ids) %>%
  distinct() %>%
  count() %>%
  collect() %>%
  unlist() %>%
  as.vector()

# ...then add it back as a constant column
df.spark %>% mutate(n_ids = n_ids)
or with approx_count_distinct:
# Approximate distinct count (HyperLogLog-based), collected as a scalar
n_ids_approx <- df.spark %>%
  select(ids) %>%
  summarise(approx_count_distinct(ids)) %>%
  collect() %>%
  unlist() %>%
  as.vector()

df.spark %>% mutate(n_ids = n_ids_approx)
It is a bit verbose, but the window function approach used by dplyr is a dead end anyway if you want to use a global unbounded frame.
If you want exact results, you can also:
df.spark %>%
  spark_dataframe() %>%   # drop down to the underlying Java object
  invoke("selectExpr", list("COUNT(DISTINCT ids) as cnt_unique_ids")) %>%
  sdf_register()          # register the result back as a Spark tbl
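Since sdf_register() returns a one-row Spark table rather than a local value, one way to attach that exact count to every row without collecting is a constant-key join (a sketch; the helper column one is just an illustrative join key):

cnt <- df.spark %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("COUNT(DISTINCT ids) as cnt_unique_ids")) %>%
  sdf_register()

# Cross join via a constant key, then drop the key
df.spark %>%
  mutate(one = 1L) %>%
  inner_join(cnt %>% mutate(one = 1L), by = "one") %>%
  select(-one)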
I want to link to this thread, which answers this for sparklyr.
I think using approx_count_distinct is the best solution. In my experience, dbplyr doesn't translate this function when it is used in a window, so it is better to write the SQL yourself.
mtcars_spk <- copy_to(sc, mtcars, "mtcars_spk", overwrite = TRUE)

mtcars_spk2 <- mtcars_spk %>%
  dplyr::mutate(test = paste0(gear, " ", carb)) %>%
  # hand-written window SQL, since dbplyr won't translate this itself
  dplyr::mutate(discnt = sql("approx_count_distinct(test) OVER (PARTITION BY cyl)"))
This thread approaches the problem more generally and discusses CountDistinct vs. approxCountDistinct.
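For reference, outside of a window both functions do translate at the aggregate level, so you can compare exact and approximate counts directly (a sketch reusing the mtcars_spk table from above):

mtcars_spk %>%
  dplyr::mutate(test = paste0(gear, " ", carb)) %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarise(
    exact  = n_distinct(test),             # COUNT(DISTINCT test): exact, but heavier
    approx = approx_count_distinct(test)   # HyperLogLog-based estimate
  )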