Automisation of creating new variables based on the distribution of other variables in the data

佐手、 提交于 2021-02-11 14:34:05

问题


I have data as follows:

EDIT:

Sample of Original Data

DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01", 
year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2, 
upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568, 
from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451, 
c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168, 
native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209, 
c48 = 641), row.names = c(NA, -1L), class = c("data.table", 
"data.frame"))

Melted sample of Orginal Data

DT <- structure(list(Abbreviation = c("AK", "AK", "AK", "AK", "AK", 
"AK", "AK", "AK", "AK", "AK"), date = c("1/31/2011", "10/31/2011", 
"11/30/2011", "12/31/2010", "4/30/2005", "2/28/2011", "3/31/2011", 
"4/30/2011", "5/31/2011", "6/30/2011"), year = c("2011", "2011", 
"2011", "2010", "2005", "2011", "2011", "2011", "2011", "2011"
), c1 = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P"), 
    State = c("Alaska", "Alaska", "Alaska", "Alaska", "Alaska", 
    "Alaska", "Alaska", "Alaska", "Alaska", "Alaska"), month = c("01", 
    "10", "11", "12", "04", "02", "03", "04", "05", "06"), total = c(18395, 
    10654, 14113, 16248, 14029, 17915, 17152, 15543, 13325, 11637
    ), variable = structure(c(1L, 2L, 4L, 5L, 13L, 17L, 18L, 20L, 
    1L, 1L), .Label = c("male", "female", "c4", "upto22", "from22to24", 
    "from25to34", "from35to44", "from45to54", "from55to59", "from60to64", 
    "over65", "c20", "hispanic", "non_nispanic", "c42", "native", 
    "asian", "black", "hawaii", "white", "c48", "c49", "c50", 
    "c87", "c88", "c89", "c90", "c91", "c92", "c93"), class = "factor"), 
    value = c(12288, 5863, 8500, 10508, 8860, 12060, 11594, 9997, 
    8158, 6294)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

    Abbreviation       date year c1  State month total   variable value
 1:           AK  1/31/2011 2011  P Alaska    01 18395       male 12288
 2:           AK 10/31/2011 2011  P Alaska    10 10654     female  5863
 3:           AK 11/30/2011 2011  P Alaska    11 14113     upto22  8500
 4:           AK 12/31/2010 2010  P Alaska    12 16248 from22to24 10508
 5:           AK  4/30/2005 2005  P Alaska    04 14029   hispanic  8860
 6:           AK  2/28/2011 2011  P Alaska    02 17915      asian 12060
 7:           AK  3/31/2011 2011  P Alaska    03 17152      black 11594
 8:           AK  4/30/2011 2011  P Alaska    04 15543      white  9997
 9:           AK  5/31/2011 2011  P Alaska    05 13325       male  8158
10:           AK  6/30/2011 2011  P Alaska    06 11637       male  6294

The column variable has three groups of variables. These are sex, age and ethnicity. All of the groups add up to (more or less) the same total. So male + female == total, black + white + asian == total etc. What I would like to do is create new variables, such as asian_male_upto22, which would be the amount for asian, multiplied with the ratio of males to the total and the ratio of the respective age group to the total.

I am looking for a way to automise this process, but I am stuck figuring out how.

I was thinking about first assigning the variables to groups (A <- c("male", "female")), from there calculating the ratios per group, but it all seems a bit messy.

Could anyone point me in the right direction?


回答1:


That is a challenging question. This is what I came up with (but I am sure there is room for improvement).

If I understand correctly, the dataset in wide format contains 4 variables sex, age, race, and ethnicity where, e.g., sex can assume the values female, male, or NA, and so on. Columns 6 to 26 contain the counts for each value. The variables are not included but need to be added to construct groups of values. As mentioned by the OP, the columns c4, c20, c42, c48 contain the NA counts which add to the counts of the preceeding columns.

The processing includes several steps. The first part of steps will preprocess the data, the second part will create the new variables.

For creation of the new variables, there are two approaches:

  • by a cross join
  • or recursively by using Reduce().

Preprocessing

(1) Create a look-up table to associate the columns and values with the variable they belong to.

lut <- data.table(value = names(DT))[
  , variable := value %>% 
    shift() %>% 
    like("c\\d{1,2}") %>% 
    cumsum() %>% 
    add(1L) %>% 
    extract(c("id", "sex", "age", "race", "ethn"),. )][]
lut
           value variable
 1: Abbreviation       id
 2:         date       id
 3:        month       id
 4:         year       id
 5:           c1       id
 6:         male      sex
 7:       female      sex
 8:           c4      sex
 9:       upto22      age
10:   from22to24      age
11:   from25to34      age
12:   from35to44      age
13:   from45to54      age
14:   from55to59      age
15:   from60to64      age
16:       over65      age
17:          c20      age
18:     hispanic     race
19: non_hispanic     race
20:          c42     race
21:       native     ethn
22:        asian     ethn
23:        black     ethn
24:       hawaii     ethn
25:        white     ethn
26:          c48     ethn
           value variable

(2) Coerce character date to numeric date to get rid of the redundant year and month columns for brevity. In addition, numeric date is more flexible for ordering or plotting.

DT[, date := as.IDate(date, "%m/%d/%Y")]

(3) Reshape the dataset from wide to long format, thereby dropping columns year, month, and c1. (Please do not confuse the parameters value.name and variable.name of the melt() function with my definition of variable and value.)
(4) Append the matching variable to each value by an update join.
(5) Replace the obfuscating values c4, c20, etc. by NA
(6) Append the total for each group (including the NA counts).

long <- 
  melt(DT[, !c("year", "month", "c1")], id.vars = c("Abbreviation", "date"),
       value.name = "count", variable.name = "value")[
         lut, on = .(value), variable := i.variable][
           value %like% "c\\d{1,2}", value := NA][
             , total := sum(count), by = .(Abbreviation, date, variable)][]
long
    Abbreviation       date        value count variable total
 1:           AK 2011-01-31         male 12288      sex 18397
 2:           ZZ 2011-01-31         male 12298      sex 18427
 3:           AK 2011-01-31       female  6107      sex 18397
 4:           ZZ 2011-01-31       female  6117      sex 18427
 5:           AK 2011-01-31         <NA>     2      sex 18397
 6:           ZZ 2011-01-31         <NA>    12      sex 18427
 7:           AK 2011-01-31       upto22   870      age 18397
 8:           ZZ 2011-01-31       upto22   880      age 18487
 9:           AK 2011-01-31   from22to24  1441      age 18397
10:           ZZ 2011-01-31   from22to24  1451      age 18487
11:           AK 2011-01-31   from25to34  5320      age 18397
12:           ZZ 2011-01-31   from25to34  5330      age 18487
... 
31:           AK 2011-01-31       native  4856     ethn 18397
32:           ZZ 2011-01-31       native  4866     ethn 18457
33:           AK 2011-01-31        asian   791     ethn 18397
34:           ZZ 2011-01-31        asian   801     ethn 18457
35:           AK 2011-01-31        black   611     ethn 18397
36:           ZZ 2011-01-31        black   621     ethn 18457
37:           AK 2011-01-31       hawaii   289     ethn 18397
38:           ZZ 2011-01-31       hawaii   299     ethn 18457
39:           AK 2011-01-31        white 11209     ethn 18397
40:           ZZ 2011-01-31        white 11219     ethn 18457
41:           AK 2011-01-31         <NA>   641     ethn 18397
42:           ZZ 2011-01-31         <NA>   651     ethn 18457
    Abbreviation       date        value count variable total

Create new variables by a cross join

(7) Create the names of the new variables by a cross join CJ(). The cross join will also include Abbreviation and date

new_vars <- 
  long[!is.na(value), CJ(Abbreviation, 
                         date, 
                         ethn = .SD[variable == "ethn", value], 
                         sex = .SD[variable == "sex", value], 
                         age = .SD[variable == "age", value],
                         unique = TRUE)][
                           , new.var := paste(ethn, sex, age, sep = "_")][]
new_vars
     Abbreviation       date   ethn    sex        age                 new.var
  1:           AK 2011-01-31 native   male     upto22      native_male_upto22
  2:           AK 2011-01-31 native   male from22to24  native_male_from22to24
  3:           AK 2011-01-31 native   male from25to34  native_male_from25to34
  4:           AK 2011-01-31 native   male from35to44  native_male_from35to44
  5:           AK 2011-01-31 native   male from45to54  native_male_from45to54
 ---                                                                         
156:           ZZ 2011-01-31  white female from35to44 white_female_from35to44
157:           ZZ 2011-01-31  white female from45to54 white_female_from45to54
158:           ZZ 2011-01-31  white female from55to59 white_female_from55to59
159:           ZZ 2011-01-31  white female from60to64 white_female_from60to64
160:           ZZ 2011-01-31  white female     over65     white_female_over65

(8) Reshape new_vars to long format. This is required because the origial dataset has been reshaped to long format as well.

lnv <- melt(new_vars, id.vars = c("Abbreviation", "date", "new.var"))
lnv
     Abbreviation       date                 new.var variable      value
  1:           AK 2011-01-31      native_male_upto22     ethn     native
  2:           AK 2011-01-31  native_male_from22to24     ethn     native
  3:           AK 2011-01-31  native_male_from25to34     ethn     native
  4:           AK 2011-01-31  native_male_from35to44     ethn     native
  5:           AK 2011-01-31  native_male_from45to54     ethn     native
 ---                                                                    
476:           ZZ 2011-01-31 white_female_from35to44      age from35to44
477:           ZZ 2011-01-31 white_female_from45to54      age from45to54
478:           ZZ 2011-01-31 white_female_from55to59      age from55to59
479:           ZZ 2011-01-31 white_female_from60to64      age from60to64
480:           ZZ 2011-01-31     white_female_over65      age     over65

(9) Append new.var by right joining lnv with long.
(10) Aggregate by Abbreviation, date, and new.var, thereby multiplying the counts and dividing by totals (in order to get shares).

long[lnv, on = .(Abbreviation, date, variable, value)][
  , .(new.count = prod(count)/first(total)^2), by = .(Abbreviation, date, new.var)]
     Abbreviation       date                 new.var new.count
  1:           AK 2011-01-31      native_male_upto22 153.38579
  2:           AK 2011-01-31  native_male_from22to24 254.05623
  3:           AK 2011-01-31  native_male_from25to34 937.94527
  4:           AK 2011-01-31  native_male_from35to44 629.05803
  5:           AK 2011-01-31  native_male_from45to54 761.99238
 ---                                                          
156:           ZZ 2011-01-31 white_female_from35to44 720.79330
157:           ZZ 2011-01-31 white_female_from45to54 872.68769
158:           ZZ 2011-01-31 white_female_from55to59 312.04830
159:           ZZ 2011-01-31 white_female_from60to64 180.50050
160:           ZZ 2011-01-31     white_female_over65  92.86912

Alternatively: Create new variables recursively using Reduce()

Steps (7) to (10) can be replaced by recursively joining subsets.

If done manually this will look like:

long[!is.na(value) & variable == "ethn"][
  long[!is.na(value) & variable == "sex"], on = .(Abbreviation, date), allow.cartesian = TRUE,
  .(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)][
    long[!is.na(value) & variable == "age"], on = .(Abbreviation, date), allow.cartesian = TRUE,
    .(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)]

First, the subset for variable ethn is joined with the subset for variable sex (outer join) thereby computing the first part of the new variable name and the new count partially. Then the temporary result is joined with te subset for variable age thereby computing the new variable name and the new count, finally.

This can be written in a more general way as

join_fct <-   function(x, y) {
  x[y, on = .(Abbreviation, date), allow.cartesian = TRUE,
    .(Abbreviation, 
      date, 
      value = paste(value, i.value, sep ="_"), 
      count = count * i.count / i.total)]
}

Reduce(join_fct, 
       lapply(c("ethn", "sex", "age"), 
              function(x) long[!is.na(value) & variable == x])
)
     Abbreviation       date                value      count
  1:           AK 2011-01-31   native_male_upto22 153.385786
  2:           AK 2011-01-31    asian_male_upto22  24.985205
  3:           AK 2011-01-31    black_male_upto22  19.299571
  4:           AK 2011-01-31   hawaii_male_upto22   9.128602
  5:           AK 2011-01-31    white_male_upto22 354.057100
 ---                                                        
156:           ZZ 2011-01-31 native_female_over65  40.280090
157:           ZZ 2011-01-31  asian_female_over65   6.630570
158:           ZZ 2011-01-31  black_female_over65   5.140554
159:           ZZ 2011-01-31 hawaii_female_over65   2.475082
160:           ZZ 2011-01-31  white_female_over65  92.869365

This approach is quite flexible as the number and order of variables can be changed easily, e.g.,

Reduce(join_fct, lapply(c("race", "sex"), function(x) long[!is.na(value) & variable == x]) )

   Abbreviation       date               value      count
1:           AK 2011-01-31       hispanic_male   514.9779
2:           AK 2011-01-31   non_hispanic_male 11660.8090
3:           ZZ 2011-01-31       hispanic_male   521.2318
4:           ZZ 2011-01-31   non_hispanic_male 11657.9728
5:           AK 2011-01-31     hispanic_female   255.9383
6:           AK 2011-01-31 non_hispanic_female  5795.2930
7:           ZZ 2011-01-31     hispanic_female   259.2596
8:           ZZ 2011-01-31 non_hispanic_female  5798.6518

Data

The OP has provided on row of the original dataset (in wide format)

DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01", 
                     year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2, 
                     upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568, 
                     from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451, 
                     c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168, 
                     native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209, 
                     c48 = 641), row.names = c(NA, -1L), class = c("data.table", "data.frame"))

However, to verify the code works as expected, I need a second row. So I have appended a second row by:

library(data.table)
DT <- rbind(DT, DT)
DT[2, (6:ncol(DT)) := lapply(.SD, `+`, y = 10), .SDcols = 6:ncol(DT)]
DT[2, Abbreviation := "ZZ"]
DT
   Abbreviation      date month year c1  male female c4 upto22 from22to24 from25to34
1:           AK 1/31/2011    01 2011  P 12288   6107  2    870       1441       5320
2:           ZZ 1/31/2011    01 2011  P 12298   6117 12    880       1451       5330
   from35to44 from45to54 from55to59 from60to64 over65 c20 hispanic non_hispanic c42 native
1:       3568       4322       1539        886    451   0      771        17458 168   4856
2:       3578       4332       1549        896    461  10      781        17468 178   4866
   asian black hawaii white c48
1:   791   611    289 11209 641
2:   801   621    299 11219 651


来源:https://stackoverflow.com/questions/62406588/automisation-of-creating-new-variables-based-on-the-distribution-of-other-variab

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!