Question
Okay, this question is fairly long and complex (at least for me), and I have done my best to make it as clear, organized, and detailed as possible, so please bear with me...
----------------------------------------------------------------------
I currently have an overly manual process in applying a function to subsets in my data, and I would like to figure out how to make the code more efficient. It is easiest to describe the issue with an example:
The variables in my data (myData): GDP in years 2017, 2018, and 2019 at 4 levels of granularity: Continent, Country, State (or Province), and City. (Note: GDP numbers are arbitrary and used only to make the calculations easier.)
myData:
|------|---------------|---------|------------|-------------|------|
| Year | Continent | Country | State | City | GDP |
|------|---------------|---------|------------|-------------|------|
| 2019 | North America | Canada | Alberta | Edmonton | 13 |
| 2018 | North America | Canada | Alberta | Calgary | 9 |
| 2018 | North America | Canada | Alberta | Edmonton | 3 |
| 2018 | Asia | India | Bihar | Patna | 19 |
| 2018 | Asia | India | Bihar | Gaya | 8 |
| 2017 | Asia | India | Bihar | Patna | 22 |
| 2019 | Asia | India | Bihar | Gaya | 19 |
| 2019 | Asia | India | Bihar | Patna | 16 |
| 2019 | North America | USA | California | San Diego | 23 |
| 2017 | North America | USA | California | Los Angeles | 18 |
| 2018 | North America | USA | California | Los Angeles | 25 |
| 2018 | North America | USA | Florida | Tampa | 14 |
| 2019 | North America | USA | Florida | Miami | 19 |
| 2018 | Asia | China | Guangdong | Shenzhen | 29 |
| 2017 | Asia | China | Guangdong | Shenzhen | 26 |
| 2019 | Asia | China | Guangdong | Shenzhen | 33 |
| 2019 | Asia | China | Guangdong | Guangzhou | 20 |
| 2018 | Asia | China | Guangdong | Guangzhou | 19 |
| 2018 | North America | Canada | Quebec | Montreal | 11 |
| 2019 | North America | Canada | Quebec | Montreal | 7 |
| 2019 | Asia | China | Shandong | Yantai | 30 |
| 2019 | Asia | China | Shandong | Jinan | 16 |
| 2018 | Asia | China | Shandong | Yantai | 17 |
| 2018 | Asia | China | Shandong | Jinan | 11 |
| 2019 | Asia | India | U.P. | Allahabad | 21 |
| 2018 | Asia | India | U.P. | Agra | 15 |
| 2018 | Asia | India | U.P. | Allahabad | 13 |
| 2019 | Asia | India | U.P. | Agra | 18 |
|------|---------------|---------|------------|-------------|------|
The overall goal is to calculate GDP Quantiles (1 = 0-25%, 2 = 25%-50%,...etc.) at varying levels of granularity. Here is exactly what I am looking for:
- Quantiles for each Year; (subset entire dataset for the 3 Years)
- Quantiles for each Continent; (subset data by Continent)
- Quantiles for each Country; (subset data by Continent and Country)
- Quantiles for each State.Province; (subset data by Continent, Country, and State.Province)
- Quantiles for each City; (subset data by Continent, Country, State.Province, and City)
I currently have two steps in this process:
- Subset data at each level.
- Calculate quantiles (based on GDP values) for each subset.
We subset by summing GDP at each level. (Note: this step will generate dataframes with fewer and fewer rows as we move down to level 5.) Here is what I have done; it is rather manual and repetitive, so I would like to find a better way:
Level_1.Year <- aggregate(
  GDP ~ Year + Continent + Country + State.Province + City,
  FUN = sum, data = myData)

Level_2.Continent <- aggregate(
  GDP ~ Continent + Country + State.Province + City,
  FUN = sum, data = myData)

Level_3.Country <- aggregate(
  GDP ~ Country + State.Province + City,
  FUN = sum, data = myData)

Level_4.State.Province <- aggregate(
  GDP ~ State.Province + City,
  FUN = sum, data = myData)

Level_5.City <- aggregate(
  GDP ~ City,
  FUN = sum, data = myData)
----------------------------------------------------------------------
So now that we have the subsets, we calculate the quantiles for each subset. Since they are all different lengths and do not have the same variables, I have resorted to manual/repetitive calculations (again...) for each subset:
Level_1.Year_quantiles <- Level_1.Year %>%
  group_by(Year) %>%
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))

Level_2.Continent_quantiles <- Level_2.Continent %>%
  group_by(Continent) %>%
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))

Level_3.Country_quantiles <- Level_3.Country %>%
  group_by(Country) %>%
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))
.
.
.
# All the way through Level_5.City; I think you get the point.
----------------------------------------------------------------------
Is there a way to (1) subset each level in a more efficient way, then (2) store each subset in a list of dataframes, then (3) add quantiles to each dataframe in the list?
If there's a better way to do this entire process, please let me know! Also, if you have any comments or recommendations, I would love to hear them.
Answer 1:
Consider an apply family solution, namely lapply, by (a wrapper to tapply), and Map (a wrapper to mapply), handling all processing in lists:
agg_factors <- c("City", "State", "Country", "Continent", "Year")
# NAMED LIST OF DATA FRAMES WHERE FORMULA DYNAMICALLY BUILT AND PASS INTO aggregate()
agg_df_list <- setNames(lapply(seq_along(agg_factors), function(i) {
  agg_formula <- as.formula(paste("GDP ~", paste(agg_factors[1:i], collapse = " + ")))
  aggregate(agg_formula, myData, FUN = sum)
}), agg_factors)
# FUNCTION TO CALL by() TO RUN FUNCTION ON EACH SUBSET TO BIND TOGETHER AT END
proc_quantiles <- function(df, nm) {
  dfs <- by(df, df[[nm]], function(sub)
    transform(sub,
              Quantile = tryCatch(cut(GDP,
                                      breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                                      labels = 1:4,
                                      include.lowest = TRUE),
                                  error = function(e) NA)))
  do.call(rbind, unname(dfs))
}
# ELEMENTWISE LOOP THROUGH DFs AND CORRESPONDING NAMES
quantile_df_list <- Map(proc_quantiles, agg_df_list, names(agg_df_list))
Output
head(quantile_df_list$City)
# City GDP Quantile
# 1 Agra 33 NA
# 2 Allahabad 34 NA
# 3 Calgary 9 NA
# 4 Edmonton 16 NA
# 5 Gaya 27 NA
# 6 Guangzhou 39 NA
head(quantile_df_list$State)
# City State GDP Quantile
# 1 Calgary Alberta 9 1
# 2 Edmonton Alberta 16 4
# 3 Gaya Bihar 27 1
# 4 Patna Bihar 57 4
# 5 Los Angeles California 43 4
# 6 San Diego California 23 1
head(quantile_df_list$Country)
# City State Country GDP Quantile
# 1 Calgary Alberta Canada 9 1
# 2 Edmonton Alberta Canada 16 2
# 3 Montreal Quebec Canada 18 4
# 4 Guangzhou Guangdong China 39 2
# 5 Shenzhen Guangdong China 88 4
# 6 Jinan Shandong China 27 1
head(quantile_df_list$Continent)
# City State Country Continent GDP Quantile
# 1 Guangzhou Guangdong China Asia 39 3
# 2 Shenzhen Guangdong China Asia 88 4
# 3 Jinan Shandong China Asia 27 1
# 4 Yantai Shandong China Asia 47 3
# 5 Gaya Bihar India Asia 27 1
# 6 Patna Bihar India Asia 57 4
head(quantile_df_list$Year)
# City State Country Continent Year GDP Quantile
# 1 Shenzhen Guangdong China Asia 2017 26 4
# 2 Patna Bihar India Asia 2017 22 2
# 3 Los Angeles California USA North America 2017 18 1
# 4 Guangzhou Guangdong China Asia 2018 19 3
# 5 Shenzhen Guangdong China Asia 2018 29 4
# 6 Jinan Shandong China Asia 2018 11 1
Answer 2:
First, some clarification: what you are calling subsets are grouped summaries; refer to `?aggregate` for further information. Second, the answer to each of your three questions is yes. Third, your level 1 summary is equivalent to your data frame.
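That last point can be sanity-checked on a toy frame (the column names below are illustrative, not the question's full myData): when the grouping formula includes every column, each combination identifies exactly one row, so aggregate() just reproduces the input, possibly reordered:

```r
# Minimal illustration: grouping by every column leaves one row per
# original row, so the "level 1" aggregate combines nothing.
toy <- data.frame(Year = c(2018, 2019, 2018),
                  City = c("Agra", "Agra", "Patna"),
                  GDP  = c(15, 18, 19))
lvl1 <- aggregate(GDP ~ Year + City, FUN = sum, data = toy)
nrow(lvl1) == nrow(toy)        # TRUE: no rows were collapsed
sum(lvl1$GDP) == sum(toy$GDP)  # TRUE: totals unchanged
```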
As you are using aggregate(), I will first illustrate how to attain a list of the grouped summaries using aggregate():
library(tidyverse)
formula_list <-
list(
GDP ~ Year + Continent + Country + State.Province + City,
GDP ~ Continent + Country + State.Province + City,
GDP ~ Country + State.Province + City,
GDP ~ State.Province + City,
GDP ~ City
)
summaries <- formula_list %>%
  map(~ aggregate(.x, FUN = sum, data = myData))
It's also possible to replace aggregate() with a fully dplyr-based approach. The upside to that is replacing the notoriously inefficient aggregate(). The downside is that we will have to deal with quosures, which are a somewhat more advanced topic (consult vignette("programming") for further information).
var_combs <- list(
vars(Year, Continent, Country, State.Province, City),
vars(Continent, Country, State.Province, City),
vars(Country, State.Province, City),
vars(State.Province, City),
vars(City))
summaries <- var_combs %>%
  map(~ myData %>%
        group_by(!!!.x) %>%
        summarize(GDP = sum(GDP)))
Next comes applying your code for calculating quartiles to each element of the list. As you also vary the grouping variable, we need to iterate over two lists, therefore we will be using purrr::map2():
grp_var <- list(
vars(Year),
vars(Continent),
vars(Country),
vars(State.Province),
vars(City)
)
map2(summaries[1:3],
     grp_var[1:3],
     ~ .x %>%
       group_by(!!!.y) %>%
       mutate(Quantile = cut(GDP,
                             breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                             labels = 1:4,
                             include.lowest = TRUE)))
You will notice that I had to subset the lists to just the first three elements. The code you wrote for calculating the quartiles fails if one of the groups has only one observation (which makes sense: you cannot calculate quartiles from a sample of one). This will always be the case for the last of the five summaries, as it contains exactly one observation per group by definition. It is also questionable whether your result is particularly meaningful if you have just two or three observations per group.
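The single-observation failure is easy to reproduce: with one value, all five quantile breaks coincide, and cut() refuses non-unique breaks. One possible guard, sketched here as an alternative to wrapping the call in tryCatch() as the other answer does (the safe_quartile() helper is hypothetical, not part of either answer):

```r
# With a single observation, every quantile break collapses to that value:
# cut(42, breaks = quantile(42, c(0, 0.25, 0.5, 0.75, 1)))  # error: 'breaks' are not unique

# Hedged workaround: bin only when the breaks are distinct, else return NA
# (same fallback as the tryCatch() approach).
safe_quartile <- function(gdp) {
  brks <- quantile(gdp, c(0, 0.25, 0.5, 0.75, 1))
  if (anyDuplicated(brks)) return(rep(NA, length(gdp)))
  cut(gdp, breaks = brks, labels = 1:4, include.lowest = TRUE)
}

safe_quartile(c(42))          # NA
safe_quartile(c(1, 2, 3, 4))  # factor: 1 2 3 4
```

Dropping duplicated breaks with unique() is another option, but then labels = 1:4 no longer matches the number of bins, so returning NA is the simpler fallback.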
Data:
myData <- structure(list(
Year = c(2019, 2019, 2018, 2019, 2019, 2018, 2019,
2018, 2018, 2018, 2018, 2018, 2018, 2017, 2017, 2019, 2018, 2019,
2019, 2018, 2019, 2017, 2019, 2018, 2018, 2018, 2019, 2019),
Continent = c("North America", "Asia", "Asia", "North America",
"Asia", "North America", "Asia", "North America", "Asia",
"North America", "Asia", "Asia", "Asia", "North America",
"Asia", "North America", "Asia", "North America", "Asia",
"North America", "Asia", "Asia", "Asia", "North America",
"Asia", "Asia", "Asia", "Asia"),
Country = c("Canada", "India", "India", "USA", "China", "USA", "China",
"Canada", "China", "Canada", "India", "India", "China",
"USA", "China", "USA", "India", "Canada", "China", "USA",
"China", "India", "India", "Canada", "China", "China",
"India", "India"),
State.Province = c("Alberta", "Uttar Pradesh", "Bihar", "California",
"Shandong", "Florida", "Shandong", "Quebec", "Guangdong",
"Alberta", "Uttar Pradesh", "Bihar", "Shandong",
"California", "Guangdong", "Florida", "Uttar Pradesh",
"Quebec", "Guangdong", "California", "Guangdong", "Bihar",
"Bihar", "Alberta", "Shandong", "Guangdong", "Uttar Pradesh",
"Bihar"),
City = c("Edmonton", "Allahabad", "Patna", "Los Angeles", "Yantai", "Miami",
"Jinan", "Montreal", "Shenzhen", "Calgary", "Agra", "Gaya", "Yantai",
"Los Angeles", "Shenzhen", "Miami", "Allahabad", "Montreal",
"Shenzhen", "Los Angeles", "Guangzhou", "Patna", "Gaya", "Edmonton",
"Jinan", "Guangzhou", "Agra", "Patna"),
GDP = c(13, 21, 19, 23, 30, 14, 16, 11, 29, 9, 15, 8, 17, 18, 26, 19, 13, 7,
33, 25, 20, 22, 19, 3, 11, 19, 18, 16)),
class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -28L),
spec = structure(list(cols = list(Year = structure(list(), class = c("collector_double", "collector")),
Continent = structure(list(), class = c("collector_character", "collector")),
Country = structure(list(), class = c("collector_character", "collector")),
State.Province = structure(list(), class = c("collector_character", "collector")),
City = structure(list(), class = c("collector_character", "collector")),
GDP = structure(list(), class = c("collector_double", "collector"))),
default = structure(list(), class = c("collector_guess", "collector")),
skip = 2),
class = "col_spec"))
Source: https://stackoverflow.com/questions/58250953/subsetting-data-by-levels-of-granularity-and-applying-a-function-to-each-data-fr