问题
I have data as follows:
EDIT:
Sample of Original Data
DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01",
year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2,
upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568,
from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451,
c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168,
native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209,
c48 = 641), row.names = c(NA, -1L), class = c("data.table",
"data.frame"))
Melted sample of Orginal Data
DT <- structure(list(Abbreviation = c("AK", "AK", "AK", "AK", "AK",
"AK", "AK", "AK", "AK", "AK"), date = c("1/31/2011", "10/31/2011",
"11/30/2011", "12/31/2010", "4/30/2005", "2/28/2011", "3/31/2011",
"4/30/2011", "5/31/2011", "6/30/2011"), year = c("2011", "2011",
"2011", "2010", "2005", "2011", "2011", "2011", "2011", "2011"
), c1 = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P"),
State = c("Alaska", "Alaska", "Alaska", "Alaska", "Alaska",
"Alaska", "Alaska", "Alaska", "Alaska", "Alaska"), month = c("01",
"10", "11", "12", "04", "02", "03", "04", "05", "06"), total = c(18395,
10654, 14113, 16248, 14029, 17915, 17152, 15543, 13325, 11637
), variable = structure(c(1L, 2L, 4L, 5L, 13L, 17L, 18L, 20L,
1L, 1L), .Label = c("male", "female", "c4", "upto22", "from22to24",
"from25to34", "from35to44", "from45to54", "from55to59", "from60to64",
"over65", "c20", "hispanic", "non_nispanic", "c42", "native",
"asian", "black", "hawaii", "white", "c48", "c49", "c50",
"c87", "c88", "c89", "c90", "c91", "c92", "c93"), class = "factor"),
value = c(12288, 5863, 8500, 10508, 8860, 12060, 11594, 9997,
8158, 6294)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Abbreviation date year c1 State month total variable value
1: AK 1/31/2011 2011 P Alaska 01 18395 male 12288
2: AK 10/31/2011 2011 P Alaska 10 10654 female 5863
3: AK 11/30/2011 2011 P Alaska 11 14113 upto22 8500
4: AK 12/31/2010 2010 P Alaska 12 16248 from22to24 10508
5: AK 4/30/2005 2005 P Alaska 04 14029 hispanic 8860
6: AK 2/28/2011 2011 P Alaska 02 17915 asian 12060
7: AK 3/31/2011 2011 P Alaska 03 17152 black 11594
8: AK 4/30/2011 2011 P Alaska 04 15543 white 9997
9: AK 5/31/2011 2011 P Alaska 05 13325 male 8158
10: AK 6/30/2011 2011 P Alaska 06 11637 male 6294
The column variable
has three groups of variables. These are sex
, age
and ethnicity
. All of the groups add up to (more or less) the same total. So male + female == total
, black + white + asian == total
etc. What I would like to do is create new variables, such as asian_male_upto22
, which would be the amount for asian, multiplied with the ratio of males to the total and the ratio of the respective age group to the total.
I am looking for a way to automise this process, but I am stuck figuring out how.
I was thinking about first assigning the variables to groups (A <- c("male", "female")
), from there calculating the ratios per group, but it all seems a bit messy.
Could anyone point me in the right direction?
回答1:
That is a challenging question. This is what I came up with (but I am sure there is room for improvement).
If I understand correctly, the dataset in wide format contains 4 variables sex
, age
, race
, and ethnicity
where, e.g., sex
can assume the values female
, male
, or NA
, and so on. Columns 6 to 26 contain the counts for each value
. The variables are not included but need to be added to construct groups of values. As mentioned by the OP, the columns c4
, c20
, c42
, c48
contain the NA
counts which add to the counts of the preceeding columns.
The processing includes several steps. The first part of steps will preprocess the data, the second part will create the new variables.
For creation of the new variables, there are two approaches:
- by a
cross join
- or recursively by using
Reduce()
.
Preprocessing
(1) Create a look-up table to associate the columns and values with the variable they belong to.
lut <- data.table(value = names(DT))[
, variable := value %>%
shift() %>%
like("c\\d{1,2}") %>%
cumsum() %>%
add(1L) %>%
extract(c("id", "sex", "age", "race", "ethn"),. )][]
lut
value variable 1: Abbreviation id 2: date id 3: month id 4: year id 5: c1 id 6: male sex 7: female sex 8: c4 sex 9: upto22 age 10: from22to24 age 11: from25to34 age 12: from35to44 age 13: from45to54 age 14: from55to59 age 15: from60to64 age 16: over65 age 17: c20 age 18: hispanic race 19: non_hispanic race 20: c42 race 21: native ethn 22: asian ethn 23: black ethn 24: hawaii ethn 25: white ethn 26: c48 ethn value variable
(2) Coerce character date
to numeric date to get rid of the redundant year
and month
columns for brevity. In addition, numeric date is more flexible for ordering or plotting.
DT[, date := as.IDate(date, "%m/%d/%Y")]
(3) Reshape the dataset from wide to long format, thereby dropping columns year
, month
, and c1
. (Please do not confuse the parameters value.name
and variable.name
of the melt()
function with my definition of variable and value.)
(4) Append the matching variable to each value by an update join.
(5) Replace the obfuscating values c4
, c20
, etc. by NA
(6) Append the total for each group (including the NA
counts).
long <-
melt(DT[, !c("year", "month", "c1")], id.vars = c("Abbreviation", "date"),
value.name = "count", variable.name = "value")[
lut, on = .(value), variable := i.variable][
value %like% "c\\d{1,2}", value := NA][
, total := sum(count), by = .(Abbreviation, date, variable)][]
long
Abbreviation date value count variable total 1: AK 2011-01-31 male 12288 sex 18397 2: ZZ 2011-01-31 male 12298 sex 18427 3: AK 2011-01-31 female 6107 sex 18397 4: ZZ 2011-01-31 female 6117 sex 18427 5: AK 2011-01-31 <NA> 2 sex 18397 6: ZZ 2011-01-31 <NA> 12 sex 18427 7: AK 2011-01-31 upto22 870 age 18397 8: ZZ 2011-01-31 upto22 880 age 18487 9: AK 2011-01-31 from22to24 1441 age 18397 10: ZZ 2011-01-31 from22to24 1451 age 18487 11: AK 2011-01-31 from25to34 5320 age 18397 12: ZZ 2011-01-31 from25to34 5330 age 18487 ... 31: AK 2011-01-31 native 4856 ethn 18397 32: ZZ 2011-01-31 native 4866 ethn 18457 33: AK 2011-01-31 asian 791 ethn 18397 34: ZZ 2011-01-31 asian 801 ethn 18457 35: AK 2011-01-31 black 611 ethn 18397 36: ZZ 2011-01-31 black 621 ethn 18457 37: AK 2011-01-31 hawaii 289 ethn 18397 38: ZZ 2011-01-31 hawaii 299 ethn 18457 39: AK 2011-01-31 white 11209 ethn 18397 40: ZZ 2011-01-31 white 11219 ethn 18457 41: AK 2011-01-31 <NA> 641 ethn 18397 42: ZZ 2011-01-31 <NA> 651 ethn 18457 Abbreviation date value count variable total
Create new variables by a cross join
(7) Create the names of the new variables by a cross join CJ()
. The cross join will also include Abbreviation
and date
new_vars <-
long[!is.na(value), CJ(Abbreviation,
date,
ethn = .SD[variable == "ethn", value],
sex = .SD[variable == "sex", value],
age = .SD[variable == "age", value],
unique = TRUE)][
, new.var := paste(ethn, sex, age, sep = "_")][]
new_vars
Abbreviation date ethn sex age new.var 1: AK 2011-01-31 native male upto22 native_male_upto22 2: AK 2011-01-31 native male from22to24 native_male_from22to24 3: AK 2011-01-31 native male from25to34 native_male_from25to34 4: AK 2011-01-31 native male from35to44 native_male_from35to44 5: AK 2011-01-31 native male from45to54 native_male_from45to54 --- 156: ZZ 2011-01-31 white female from35to44 white_female_from35to44 157: ZZ 2011-01-31 white female from45to54 white_female_from45to54 158: ZZ 2011-01-31 white female from55to59 white_female_from55to59 159: ZZ 2011-01-31 white female from60to64 white_female_from60to64 160: ZZ 2011-01-31 white female over65 white_female_over65
(8) Reshape new_vars
to long format. This is required because the origial dataset has been reshaped to long format as well.
lnv <- melt(new_vars, id.vars = c("Abbreviation", "date", "new.var"))
lnv
Abbreviation date new.var variable value 1: AK 2011-01-31 native_male_upto22 ethn native 2: AK 2011-01-31 native_male_from22to24 ethn native 3: AK 2011-01-31 native_male_from25to34 ethn native 4: AK 2011-01-31 native_male_from35to44 ethn native 5: AK 2011-01-31 native_male_from45to54 ethn native --- 476: ZZ 2011-01-31 white_female_from35to44 age from35to44 477: ZZ 2011-01-31 white_female_from45to54 age from45to54 478: ZZ 2011-01-31 white_female_from55to59 age from55to59 479: ZZ 2011-01-31 white_female_from60to64 age from60to64 480: ZZ 2011-01-31 white_female_over65 age over65
(9) Append new.var
by right joining lnv
with long
.
(10) Aggregate by Abbreviation
, date
, and new.var
, thereby multiplying the counts and dividing by totals (in order to get shares).
long[lnv, on = .(Abbreviation, date, variable, value)][
, .(new.count = prod(count)/first(total)^2), by = .(Abbreviation, date, new.var)]
Abbreviation date new.var new.count 1: AK 2011-01-31 native_male_upto22 153.38579 2: AK 2011-01-31 native_male_from22to24 254.05623 3: AK 2011-01-31 native_male_from25to34 937.94527 4: AK 2011-01-31 native_male_from35to44 629.05803 5: AK 2011-01-31 native_male_from45to54 761.99238 --- 156: ZZ 2011-01-31 white_female_from35to44 720.79330 157: ZZ 2011-01-31 white_female_from45to54 872.68769 158: ZZ 2011-01-31 white_female_from55to59 312.04830 159: ZZ 2011-01-31 white_female_from60to64 180.50050 160: ZZ 2011-01-31 white_female_over65 92.86912
Alternatively: Create new variables recursively using Reduce()
Steps (7) to (10) can be replaced by recursively joining subsets.
If done manually this will look like:
long[!is.na(value) & variable == "ethn"][
long[!is.na(value) & variable == "sex"], on = .(Abbreviation, date), allow.cartesian = TRUE,
.(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)][
long[!is.na(value) & variable == "age"], on = .(Abbreviation, date), allow.cartesian = TRUE,
.(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)]
First, the subset for variable ethn
is joined with the subset for variable sex
(outer join) thereby computing the first part of the new variable name and the new count partially. Then the temporary result is joined with te subset for variable age
thereby computing the new variable name and the new count, finally.
This can be written in a more general way as
join_fct <- function(x, y) {
x[y, on = .(Abbreviation, date), allow.cartesian = TRUE,
.(Abbreviation,
date,
value = paste(value, i.value, sep ="_"),
count = count * i.count / i.total)]
}
Reduce(join_fct,
lapply(c("ethn", "sex", "age"),
function(x) long[!is.na(value) & variable == x])
)
Abbreviation date value count 1: AK 2011-01-31 native_male_upto22 153.385786 2: AK 2011-01-31 asian_male_upto22 24.985205 3: AK 2011-01-31 black_male_upto22 19.299571 4: AK 2011-01-31 hawaii_male_upto22 9.128602 5: AK 2011-01-31 white_male_upto22 354.057100 --- 156: ZZ 2011-01-31 native_female_over65 40.280090 157: ZZ 2011-01-31 asian_female_over65 6.630570 158: ZZ 2011-01-31 black_female_over65 5.140554 159: ZZ 2011-01-31 hawaii_female_over65 2.475082 160: ZZ 2011-01-31 white_female_over65 92.869365
This approach is quite flexible as the number and order of variables can be changed easily, e.g.,
Reduce(join_fct, lapply(c("race", "sex"), function(x) long[!is.na(value) & variable == x]) )
Abbreviation date value count 1: AK 2011-01-31 hispanic_male 514.9779 2: AK 2011-01-31 non_hispanic_male 11660.8090 3: ZZ 2011-01-31 hispanic_male 521.2318 4: ZZ 2011-01-31 non_hispanic_male 11657.9728 5: AK 2011-01-31 hispanic_female 255.9383 6: AK 2011-01-31 non_hispanic_female 5795.2930 7: ZZ 2011-01-31 hispanic_female 259.2596 8: ZZ 2011-01-31 non_hispanic_female 5798.6518
Data
The OP has provided on row of the original dataset (in wide format)
DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01",
year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2,
upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568,
from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451,
c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168,
native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209,
c48 = 641), row.names = c(NA, -1L), class = c("data.table", "data.frame"))
However, to verify the code works as expected, I need a second row. So I have appended a second row by:
library(data.table)
DT <- rbind(DT, DT)
DT[2, (6:ncol(DT)) := lapply(.SD, `+`, y = 10), .SDcols = 6:ncol(DT)]
DT[2, Abbreviation := "ZZ"]
DT
Abbreviation date month year c1 male female c4 upto22 from22to24 from25to34 1: AK 1/31/2011 01 2011 P 12288 6107 2 870 1441 5320 2: ZZ 1/31/2011 01 2011 P 12298 6117 12 880 1451 5330 from35to44 from45to54 from55to59 from60to64 over65 c20 hispanic non_hispanic c42 native 1: 3568 4322 1539 886 451 0 771 17458 168 4856 2: 3578 4332 1549 896 461 10 781 17468 178 4866 asian black hawaii white c48 1: 791 611 289 11209 641 2: 801 621 299 11219 651
来源:https://stackoverflow.com/questions/62406588/automisation-of-creating-new-variables-based-on-the-distribution-of-other-variab