Subsetting data by levels of granularity and applying a function to each data frame in R


Question


Okay, this question is fairly long and complex (at least for me), and I have done my best to make it as clear, organized, and detailed as possible, so please bear with me...

----------------------------------------------------------------------

I currently have an overly manual process for applying a function to subsets of my data, and I would like to figure out how to make the code more efficient. It is easiest to describe the issue with an example:

The variables in my data (myData): GDP for the years 2017, 2018, and 2019 at 4 levels of granularity: Continent, Country, State (or Province), and City. (Note: the GDP numbers are arbitrary; they are only there to keep the calculations simple.)

myData:

   |------|---------------|---------|------------|-------------|------|
   | Year | Continent     | Country | State      | City        | GDP  |
   |------|---------------|---------|------------|-------------|------|
   | 2019 | North America | Canada  | Alberta    | Edmonton    | 13   |
   | 2018 | North America | Canada  | Alberta    | Calgary     | 9    |
   | 2018 | North America | Canada  | Alberta    | Edmonton    | 3    | 
   | 2018 | Asia          | India   | Bihar      | Patna       | 19   |
   | 2018 | Asia          | India   | Bihar      | Gaya        | 8    |
   | 2017 | Asia          | India   | Bihar      | Patna       | 22   | 
   | 2019 | Asia          | India   | Bihar      | Gaya        | 19   |
   | 2019 | Asia          | India   | Bihar      | Patna       | 16   |
   | 2019 | North America | USA     | California | San Diego   | 23   |
   | 2017 | North America | USA     | California | Los Angeles | 18   |
   | 2018 | North America | USA     | California | Los Angeles | 25   |
   | 2018 | North America | USA     | Florida    | Tampa       | 14   |
   | 2019 | North America | USA     | Florida    | Miami       | 19   |
   | 2018 | Asia          | China   | Guangdong  | Shenzhen    | 29   |
   | 2017 | Asia          | China   | Guangdong  | Shenzhen    | 26   |
   | 2019 | Asia          | China   | Guangdong  | Shenzhen    | 33   |
   | 2019 | Asia          | China   | Guangdong  | Guangzhou   | 20   |
   | 2018 | Asia          | China   | Guangdong  | Guangzhou   | 19   |
   | 2018 | North America | Canada  | Quebec     | Montreal    | 11   |
   | 2019 | North America | Canada  | Quebec     | Montreal    | 7    |
   | 2019 | Asia          | China   | Shandong   | Yantai      | 30   |
   | 2019 | Asia          | China   | Shandong   | Jinan       | 16   |
   | 2018 | Asia          | China   | Shandong   | Yantai      | 17   |
   | 2018 | Asia          | China   | Shandong   | Jinan       | 11   |
   | 2019 | Asia          | India   | U.P.       | Allahabad   | 21   |
   | 2018 | Asia          | India   | U.P.       | Agra        | 15   |
   | 2018 | Asia          | India   | U.P.       | Allahabad   | 13   |
   | 2019 | Asia          | India   | U.P.       | Agra        | 18   |
   |------|---------------|---------|------------|-------------|------|

The overall goal is to calculate GDP quantiles (1 = 0-25%, 2 = 25-50%, 3 = 50-75%, 4 = 75-100%) at varying levels of granularity; a toy example of the bucketing follows the list below. Here is exactly what I am looking for:

  • Quantiles for each Year (subset the entire dataset by the 3 years);
  • Quantiles for each Continent (subset the data by Continent);
  • Quantiles for each Country (subset the data by Continent and Country);
  • Quantiles for each State.Province (subset the data by Continent, Country, and State.Province);
  • Quantiles for each City (subset the data by Continent, Country, State.Province, and City).
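To make the bucketing concrete, here is a toy example with made-up GDP values (purely illustrative, not part of my actual code):

gdp <- c(13, 9, 3, 19, 8, 22, 16)   # made-up values

cut(gdp,
    breaks = quantile(gdp, c(0, 0.25, 0.5, 0.75, 1)),
    labels = 1:4,                   # 1 = 0-25%, 2 = 25-50%, 3 = 50-75%, 4 = 75-100%
    include.lowest = TRUE)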


I currently have two steps in this process:

  1. Subset data at each level.
  2. Calculate quantiles (based off GDP values) for each subset.


I build each subset by summing GDP at the corresponding level. (Note: this step generates data frames with fewer and fewer rows as we move down to level 5.) Here is what I have done; it is rather manual and repetitive, so I would like to find a better way:

Level_1.Year <- aggregate(
    GDP ~ 
      Year + 
      Continent + 
      Country + 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_2.Continent <- aggregate(
    GDP ~ 
      Continent + 
      Country + 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_3.Country <- aggregate(
    GDP ~ 
      Country + 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_4.State.Province <- aggregate(
    GDP ~ 
      State.Province + 
      City, 
    FUN = sum, 
    data = myData)

Level_5.City <- aggregate(
    GDP ~ 
      City, 
    FUN = sum, 
    data = myData)

----------------------------------------------------------------------

So now that we have the subsets, we calculate the quantiles for each subset. Since they all have different lengths and do not share the same variables, I have resorted to manual, repetitive calculations (again...) for each subset:

Level_1.Year_quantiles <- Level_1.Year %>% 
  group_by(Year) %>% 
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))

Level_2.Continent_quantiles <- Level_2.Continent %>% 
  group_by(Continent) %>% 
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))

Level_3.Country_quantiles <- Level_3.Country %>% 
  group_by(Country) %>% 
  mutate(Quantile = cut(GDP,
                        breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                        labels = 1:4,
                        include.lowest = TRUE))
        . 
        .
        .

# All the way through Level_5.City; I think you get the point. 

----------------------------------------------------------------------

Is there a way to (1) create each level's subset more efficiently, (2) store each subset in a list of data frames, and (3) add quantiles to each data frame in the list?

If there's a better way to do this entire process, please let me know! Also, if you have any comments or recommendations, I would love to hear them.


Answer 1:


Consider an apply-family solution, namely lapply, by (a wrapper to tapply), and Map (a wrapper to mapply), handling all processing in lists:

agg_factors <- c("City", "State", "Country", "Continent", "Year")

# NAMED LIST OF DATA FRAMES WHERE FORMULA IS DYNAMICALLY BUILT AND PASSED INTO aggregate()
agg_df_list <- setNames(lapply(seq_along(agg_factors), function(i) {
                              agg_formula <- as.formula(paste("GDP ~", paste(agg_factors[1:i], collapse=" + ")))
                              aggregate(agg_formula, myData, FUN=sum)
                       }), agg_factors)
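
To sanity-check what the loop produced, a quick optional inspection of the list (row counts will depend on your data):

# PEEK AT LIST: ONE AGGREGATED DATA FRAME PER LEVEL OF GRANULARITY
names(agg_df_list)
sapply(agg_df_list, nrow)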

# FUNCTION THAT CALLS by() TO RUN THE QUANTILE CALCULATION ON EACH SUBSET AND BIND THE PIECES TOGETHER AT THE END
proc_quantiles <- function(df, nm) {
  dfs <- by(df, df[[nm]], function(sub)
    transform(sub,
              Quantile = tryCatch(cut(GDP,
                                      breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)),
                                      labels = 1:4,
                                      include.lowest = TRUE),
                                  error = function(e) NA)))

  do.call(rbind, unname(dfs))
}

# ELEMENTWISE LOOP THROUGH DFs AND CORRESPONDING NAMES
quantile_df_list <- Map(proc_quantiles, agg_df_list, names(agg_df_list))

Output

head(quantile_df_list$City)
#        City GDP Quantile
# 1      Agra  33       NA
# 2 Allahabad  34       NA
# 3   Calgary   9       NA
# 4  Edmonton  16       NA
# 5      Gaya  27       NA
# 6 Guangzhou  39       NA

head(quantile_df_list$State)
#          City      State GDP Quantile
# 1     Calgary    Alberta   9        1
# 2    Edmonton    Alberta  16        4
# 3        Gaya      Bihar  27        1
# 4       Patna      Bihar  57        4
# 5 Los Angeles California  43        4
# 6   San Diego California  23        1

head(quantile_df_list$Country)
#        City     State Country GDP Quantile
# 1   Calgary   Alberta  Canada   9        1
# 2  Edmonton   Alberta  Canada  16        2
# 3  Montreal    Quebec  Canada  18        4
# 4 Guangzhou Guangdong   China  39        2
# 5  Shenzhen Guangdong   China  88        4
# 6     Jinan  Shandong   China  27        1

head(quantile_df_list$Continent)
#        City     State Country Continent GDP Quantile
# 1 Guangzhou Guangdong   China      Asia  39        3
# 2  Shenzhen Guangdong   China      Asia  88        4
# 3     Jinan  Shandong   China      Asia  27        1
# 4    Yantai  Shandong   China      Asia  47        3
# 5      Gaya     Bihar   India      Asia  27        1
# 6     Patna     Bihar   India      Asia  57        4

head(quantile_df_list$Year)
#          City      State Country     Continent Year GDP Quantile
# 1    Shenzhen  Guangdong   China          Asia 2017  26        4
# 2       Patna      Bihar   India          Asia 2017  22        2
# 3 Los Angeles California     USA North America 2017  18        1
# 4   Guangzhou  Guangdong   China          Asia 2018  19        3
# 5    Shenzhen  Guangdong   China          Asia 2018  29        4
# 6       Jinan   Shandong   China          Asia 2018  11        1



Answer 2:


First, some clarification: What you are calling subsets are grouped summaries. Refer to ?aggregate for further information. Second, the answer to each of your three questions is yes. Third, your level 1 summary is equivalent to your data frame.
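
To make that last point concrete, a quick check (assuming the myData object reproduced in the Data section at the end of this answer):

# every row is already a unique Year/Continent/Country/State.Province/City
# combination, so the level-1 "summary" has exactly as many rows as myData
nrow(aggregate(GDP ~ Year + Continent + Country + State.Province + City,
               FUN = sum, data = myData)) == nrow(myData)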

As you are using aggregate(), I will first illustrate how to obtain a list of the grouped summaries using aggregate():

library(tidyverse)

formula_list <- 
  list(
    GDP ~ Year + Continent + Country + State.Province + City,
    GDP ~ Continent + Country + State.Province + City, 
    GDP ~ Country + State.Province + City, 
    GDP ~ State.Province + City,
    GDP ~ City
    )

summaries <- formula_list %>% 
  map( ~ aggregate(.x, FUN = sum, data = myData))

It's also possible to replace aggregate() with a fully dplyr-based approach. The upside is that you avoid the notoriously inefficient aggregate(); the downside is that we have to deal with quosures, which are a somewhat more advanced topic (consult vignette("programming") for further information).

var_combs <- list(
  vars(Year, Continent, Country, State.Province, City),
  vars(Continent, Country, State.Province, City),
  vars(Country, State.Province, City),
  vars(State.Province, City),
  vars(City)) 

summaries <- var_combs %>% 
  map(~ myData %>% 
          group_by(!!!.x) %>% 
          summarize(GDP = sum(GDP)))

Next comes applying your quartile code to each element of the list. As the grouping variable also varies, we need to iterate over two lists, so we will use purrr::map2():

grp_var <- list(
  vars(Year),
  vars(Continent),
  vars(Country),
  vars(State.Province),
  vars(City)
)

map2(summaries[1:3], 
     grp_var[1:3], 
     ~ .x %>%  
       group_by(!!!.y) %>% 
       mutate(Quantile = cut(GDP,
                             breaks = quantile(GDP, c(0, 0.25, 0.5, 0.75, 1)), 
                             labels = 1:4, 
                             include.lowest = TRUE))
)

You will notice that I had to subset both lists to just their first three elements. The code you wrote for calculating the quartiles fails if one of the groups has only one observation (which makes sense: you cannot calculate quartiles for a sample of one). This will always be the case for the last of the five summaries, as it contains exactly one row per group by definition. It is also questionable whether the result is particularly meaningful when you have just two or three observations per group.
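
If you nevertheless want to run the map2() call over all five summaries, one option is a sketch that borrows the tryCatch() idea from the first answer (safe_quartile is just a helper name introduced here): fall back to NA whenever the quantile breaks collapse.

safe_quartile <- function(x) {
  # return NA instead of erroring when the quantile breaks are not unique,
  # e.g. for groups with a single observation
  tryCatch(
    cut(x,
        breaks = quantile(x, c(0, 0.25, 0.5, 0.75, 1)),
        labels = 1:4,
        include.lowest = TRUE),
    error = function(e) factor(NA, levels = 1:4)
  )
}

map2(summaries,
     grp_var,
     ~ .x %>%
       group_by(!!!.y) %>%
       mutate(Quantile = safe_quartile(GDP)))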

Data:

myData <- structure(list(
    Year = c(2019, 2019, 2018, 2019, 2019, 2018, 2019, 
            2018, 2018, 2018, 2018, 2018, 2018, 2017, 2017, 2019, 2018, 2019, 
            2019, 2018, 2019, 2017, 2019, 2018, 2018, 2018, 2019, 2019), 
    Continent = c("North America", "Asia", "Asia", "North America", 
                 "Asia", "North America", "Asia", "North America", "Asia", 
                 "North America", "Asia", "Asia", "Asia", "North America", 
                 "Asia", "North America", "Asia", "North America", "Asia", 
                 "North America", "Asia", "Asia", "Asia", "North America", 
                 "Asia", "Asia", "Asia", "Asia"), 
    Country = c("Canada", "India", "India", "USA", "China", "USA", "China", 
               "Canada", "China", "Canada", "India", "India", "China", 
               "USA", "China", "USA", "India", "Canada", "China", "USA", 
               "China", "India", "India", "Canada", "China", "China", 
               "India", "India"), 
    State.Province = c("Alberta", "Uttar Pradesh", "Bihar", "California", 
                      "Shandong", "Florida", "Shandong", "Quebec", "Guangdong", 
                      "Alberta", "Uttar Pradesh", "Bihar", "Shandong", 
                      "California", "Guangdong", "Florida", "Uttar Pradesh", 
                      "Quebec", "Guangdong", "California", "Guangdong", "Bihar", 
                      "Bihar", "Alberta", "Shandong", "Guangdong", "Uttar Pradesh", 
                      "Bihar"), 
    City = c("Edmonton", "Allahabad", "Patna", "Los Angeles", "Yantai", "Miami", 
             "Jinan", "Montreal", "Shenzhen", "Calgary", "Agra", "Gaya", "Yantai", 
             "Los Angeles", "Shenzhen", "Miami", "Allahabad", "Montreal", 
             "Shenzhen", "Los Angeles", "Guangzhou", "Patna", "Gaya", "Edmonton", 
             "Jinan", "Guangzhou", "Agra", "Patna"), 
    GDP = c(13, 21, 19, 23, 30, 14, 16, 11, 29, 9, 15, 8, 17, 18, 26, 19, 13, 7, 
            33, 25, 20, 22, 19, 3, 11, 19, 18, 16)), 
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -28L), 
  spec = structure(list(cols = list(Year = structure(list(), class = c("collector_double", "collector")), 
                                    Continent = structure(list(), class = c("collector_character", "collector")), 
                                    Country = structure(list(), class = c("collector_character", "collector")), 
                                    State.Province = structure(list(), class = c("collector_character", "collector")), 
                                    City = structure(list(), class = c("collector_character", "collector")), 
                                    GDP = structure(list(), class = c("collector_double", "collector"))), 
                        default = structure(list(), class = c("collector_guess", "collector")), 
                        skip = 2), 
                   class = "col_spec"))


Source: https://stackoverflow.com/questions/58250953/subsetting-data-by-levels-of-granularity-and-applying-a-function-to-each-data-fr
