Question
I want to keep the n largest groups by row count, and then do some calculations on the filtered data frame.
Here is some data:
Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)
|Brand | Category| Clicks|
|:-----|--------:|------:|
|A | 1| 10|
|B | 2| 11|
|C | 1| 12|
|A | 1| 13|
|A | 2| 14|
|B | 1| 15|
|A | 2| 14|
|A | 1| 13|
|B | 2| 12|
|C | 1| 11|
This is my expected output. I want to keep only the two largest brands by row count and then find the mean clicks for each brand / category combination:
|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A | 1| 12.0|
|A | 2| 14.0|
|B | 1| 15.0|
|B | 2| 11.5|
I thought this could be achieved with code like the following, but it does not work:
df %>%
group_by(Brand, Category) %>%
top_n(2, Brand) %>% # Largest 2 brands by count
summarise(mean_clicks = mean(Clicks))
EDIT: the ideal answer should work on database tables as well as local tables.
Answer 1:
Another dplyr solution, using a join to filter the data frame:
library(dplyr)
df %>%
group_by(Brand) %>%
summarise(n = n()) %>%
top_n(2, n) %>% # keep the 2 brands with the highest counts
left_join(df, by = "Brand") %>% # join back to df, so only rows of those 2 brands remain
group_by(Brand, Category) %>%
summarise(mean_clicks = mean(Clicks))
# A tibble: 4 x 3
# Groups: Brand [?]
# Brand Category mean_clicks
# <fct> <dbl> <dbl>
# 1 A 1 12
# 2 A 2 14
# 3 B 1 15
# 4 B 2 11.5
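The join above is only used as a filter, so a filtering join may express the same idea more directly. The following is a minimal sketch, not part of the original answer: it assumes dplyr 1.0.0 or later (for slice_max and the .groups argument), and the names top_brands and result are made up for illustration. As far as I know, dbplyr can translate semi_join to SQL, but this sketch has only been written against a local data frame.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Brand    = c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C"),
  Category = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  Clicks   = c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
)

# Step 1: one row per brand with its row count, then keep the 2 largest.
# slice_max keeps ties by default, so more than 2 brands can come back
# if several brands share the second-highest count.
top_brands <- df %>%
  count(Brand) %>%
  slice_max(order_by = n, n = 2)

# Step 2: semi_join keeps the rows of df whose Brand appears in top_brands,
# without adding the count column to the result.
result <- df %>%
  semi_join(top_brands, by = "Brand") %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

result
```

On the sample data this should give the same four rows as the output above.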
Answer 2:
A different dplyr solution:
df %>%
group_by(Brand) %>%
mutate(n = n()) %>%
ungroup() %>%
mutate(rank = dense_rank(desc(n))) %>%
filter(rank == 1 | rank == 2) %>%
group_by(Brand, Category) %>%
summarise(mean_clicks = mean(Clicks))
# A tibble: 4 x 3
# Groups: Brand [?]
Brand Category mean_clicks
<fct> <dbl> <dbl>
1 A 1. 12.0
2 A 2. 14.0
3 B 1. 15.0
4 B 2. 11.5
Or a simplified version (based on suggestions from @camille):
df %>%
group_by(Brand) %>%
mutate(n = n()) %>%
ungroup() %>%
filter(dense_rank(desc(n)) < 3) %>%
group_by(Brand, Category) %>%
summarise(mean_clicks = mean(Clicks))
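To see why the ranking step keeps brands A and B, it can help to look at dense_rank on the per-row counts by themselves. This is only a small sketch: the vector n below is typed in by hand to match what mutate(n = n()) produces for the sample data, and the names ranks and keep are illustrative.

```r
library(dplyr)

# Per-row brand counts for the sample data, in row order
# (Brand is A, B, C, A, A, B, A, A, B, C; A occurs 5 times, B 3, C 2)
n <- c(5, 3, 2, 5, 5, 3, 5, 5, 3, 2)

# desc() flips the order so the largest count gets rank 1;
# dense_rank() gives equal values the same rank and leaves no gaps
ranks <- dense_rank(desc(n))
ranks
# [1] 1 2 3 1 1 2 1 1 2 3

# Rows with rank 1 or 2 belong to the two largest count values
keep <- ranks < 3
```

Because the ranks are computed on count values rather than on brands, two brands with the same count share a rank, so this filter can keep more than two brands when there are ties.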
Answer 3:
EDIT
Based on the updated question, we can add a count column first, keep only the rows whose count is among the top n group counts, then group_by Brand and Category to find the mean for each group.
df %>%
add_count(Brand, sort = TRUE) %>%
filter(n %in% head(unique(n), 2)) %>%
group_by(Brand, Category) %>%
summarise(mean_clicks = mean(Clicks))
# Brand Category mean_clicks
# <fct> <dbl> <dbl>
#1 A 1 12
#2 A 2 14
#3 B 1 15
#4 B 2 11.5
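The filter condition in this answer depends on add_count(..., sort = TRUE) putting the largest counts first. A short sketch of that intermediate step, with the illustrative name top_counts:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Brand    = c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C"),
  Category = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  Clicks   = c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
)

# add_count() appends a column n with each row's brand count;
# sort = TRUE orders the rows so the largest counts come first
top_counts <- df %>%
  add_count(Brand, sort = TRUE) %>%
  pull(n) %>%
  unique() %>%   # distinct count values, still in descending order
  head(2)        # the two largest count values

top_counts
# [1] 5 3
```

As with the dense_rank approach, the comparison is on count values, so brands that tie on count are kept together.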
Original Answer
We can group_by Brand, do all the calculations by group, and then keep the top groups with top_n:
library(dplyr)
df %>%
group_by(Brand) %>%
summarise(n = n(),
mean = mean(Clicks)) %>%
top_n(2, n) %>%
select(-n)
# Brand mean
# <fct> <dbl>
#1 A 12.8
#2 B 12.7
Answer 4:
A data.table idea is to get the counts grouped by Brand and keep the top two (after ordering in descending order). Then we merge with the original data frame and find the mean grouped by (Brand, Category):
library(data.table)
#Convert to data.table
dt1 <- setDT(df)
dt1[dt1[, .(cnt = .N), by = Brand][
order(cnt, decreasing = TRUE), .SD[1:2]][,cnt := NULL],
on = 'Brand'][, .(means = mean(Clicks)), by = .(Brand, Category)][]
which gives,
   Brand Category means
1:     A        1  12.0
2:     A        2  14.0
3:     B        2  11.5
4:     B        1  15.0
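The chained expression is compact but hard to read, so here is the same idea split into named steps. This is a sketch under the assumption that the intermediate names (counts, top2, filtered, means) are free to choose; they do not appear in the original answer. It uses head() instead of .SD[1:2] so that a table with fewer than two brands does not produce NA rows.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(
  Brand    = c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C"),
  Category = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  Clicks   = c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
)

# Step 1: number of rows per brand
counts <- dt[, .(cnt = .N), by = Brand]

# Step 2: order by count, descending, and keep the Brand column of the first 2 rows
top2 <- head(counts[order(-cnt)], 2)[, .(Brand)]

# Step 3: join on Brand; only rows of dt whose Brand is in top2 are returned
filtered <- dt[top2, on = "Brand"]

# Step 4: mean clicks per Brand / Category combination
means <- filtered[, .(means = mean(Clicks)), by = .(Brand, Category)]
means
```

The groups come back in order of first appearance rather than sorted, which is why B / 2 is listed before B / 1.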
Answer 5:
How about this approach, using table from base R:
df %>%
filter(Brand %in% names(tail(sort(table(Brand)), 2))) %>%
group_by(Brand, Category) %>%
summarise(mean_clicks = mean(Clicks))
# A tibble: 4 x 3
# Groups: Brand [?]
Brand Category mean_clicks
<chr> <dbl> <dbl>
1 A 1.00 12.0
2 A 2.00 14.0
3 B 1.00 15.0
4 B 2.00 11.5
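The nested call inside filter can be unpacked to show what it evaluates to. A small sketch with the illustrative names counts and top2:

```r
# Brand column from the question
Brand <- c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C")

# table() counts occurrences per brand; sort() orders them ascending
counts <- sort(table(Brand))

# tail(..., 2) takes the last two entries, i.e. the two largest counts,
# and names() returns the brand labels for those entries
top2 <- names(tail(counts, 2))
top2
# [1] "B" "A"
```

Since table is a base R function that runs on in-memory vectors, this variant is probably only suitable for local data frames, not for database tables.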
Answer 6:
Slightly different from the above, simply because I don't like to use joins with large datasets. Some people might not like that I create and then remove a small data frame, sorry :(
df %>% count(Brand) %>% top_n(2,n) -> Top2
df %>% group_by(Brand, Category) %>%
filter(Brand %in% Top2$Brand) %>%
summarise(mean_clicks = mean(Clicks))
remove(Top2)
Source: https://stackoverflow.com/questions/52532080/tidyverse-filtering-n-largest-groups-in-grouped-dataframe