问题
I have a data table of the form
ID REGION INCOME_BAND RESIDENCY_YEARS
1 SW Under 5,000 10-15
2 Wales Over 70,000 1-5
3 Center 15,000-19,999 6-9
4 SE 15,000-19,999 15-19
5 North 15,000-19,999 10-15
6 North 15,000-19,999 6-9
created by
exp = data.table(
ID = c(1,2,3,4,5,6),
REGION=c("SW", "Wales", "Center", "SE", "North", "North"),
INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999","?15,000-?19,999"),
RESIDENCY_YEARS = c("10-15","1-5","6-9","15-19","10-15", "6-9"))
I would like to transform this to
I've managed to perform the majority of the work with dcast:
exp.dcast = dcast(exp,ID~REGION+INCOME_BAND+RESIDENCY_YEARS, fun=length,
value.var=c('REGION', 'INCOME_BAND', 'RESIDENCY_YEARS'))
However I need some help creating sensible column headings. Currently I have
["ID"
"REGION.1_Center_?15,000-?19,999_6-9"
"REGION.1_North_?15,000-?19,999_10-15"
"REGION.1_North_?15,000-?19,999_6-9"
"REGION.1_SE_?15,000-?19,999_15-19" "REGION.1_SW_Under ?5,000_10-15" "REGION.1_Wales_Over ?70,000_1-5"
"INCOME_BAND.1_Center_?15,000-?19,999_6-9"
"INCOME_BAND.1_North_?15,000-?19,999_10-15"
"INCOME_BAND.1_North_?15,000-?19,999_6-9"
"INCOME_BAND.1_SE_?15,000-?19,999_15-19"
"INCOME_BAND.1_SW_Under ?5,000_10-15"
"INCOME_BAND.1_Wales_Over ?70,000_1-5"
"RESIDENCY_YEARS.1_Center_?15,000-?19,999_6-9" "RESIDENCY_YEARS.1_North_?15,000-?19,999_10-15" "RESIDENCY_YEARS.1_North_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_SE_?15,000-?19,999_15-19"
"RESIDENCY_YEARS.1_SW_Under ?5,000_10-15"
"RESIDENCY_YEARS.1_Wales_Over ?70,000_1-5"
And I would like the column headings to be
ID SW Wales Center SE North Under 5,000 Over 70,000 15,000-19,999 1-5 6-9 10-15 15-19
Could anybody advise?
回答1:
This apparently simple question is not easy to answer. So, we will go forward step-by step.
First, the OP has tried to reshape multiple value columns simultaneously which creates an unwanted cross product of all available combinations.
In order to treat all values in the same way, we need to melt()
all value columns first before reshaping:
melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)]
ID 1-5 10-15 15-19 6-9 ?15,000-?19,999 Center North Over ?70,000 SE SW Under ?5,000 Wales 1: 1 0 1 0 0 0 0 0 0 0 1 1 0 2: 2 1 0 0 0 0 0 0 1 0 0 0 1 3: 3 0 0 0 1 1 1 0 0 0 0 0 0 4: 4 0 0 1 0 1 0 0 0 1 0 0 0 5: 5 0 1 0 0 1 0 1 0 0 0 0 0 6: 6 0 0 0 1 1 0 1 0 0 0 0 0
Now, the result has 13 columns instead of 19 and the columns are named by the respective value as requested.
Unfortunately, the columns appear in the wrong order because they alphabetically ordered. There are two approaches to change the order:
Change order of columns after reshaping
The setcolorder()
function reorders the columns of a data.table
in place, e.g. without copying:
# define column order = order of values
col_order <- c("North", "Wales", "Center", "SW", "SE", "Under ?5,000", "?15,000-?19,999", "Over ?70,000", "1-5", "6-9", "10-15", "15-19")
melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)][
# reorder columns
, setcolorder(.SD, c("ID", col_order))]
ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19 1: 1 0 0 0 1 0 1 0 0 0 0 1 0 2: 2 0 1 0 0 0 0 0 1 1 0 0 0 3: 3 0 0 1 0 0 0 1 0 0 1 0 0 4: 4 0 0 0 0 1 0 1 0 0 0 0 1 5: 5 1 0 0 0 0 0 1 0 0 0 1 0 6: 6 1 0 0 0 0 0 1 0 0 1 0 0
Now, all REGION
columns appear first, followed by INCOME_BAND
and RESIDENCY_YEARS
columns in the specified order.
Set factor levels before reshaping
If value
is turned into a factor with appropriately ordered factor levels dcast()
will use the factor levels for ordering the columns:
melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
, dcast(.SD, ID ~ value, length)]
ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19 1: 1 0 0 0 1 0 1 0 0 0 0 1 0 2: 2 0 1 0 0 0 0 0 1 1 0 0 0 3: 3 0 0 1 0 0 0 1 0 0 1 0 0 4: 4 0 0 0 0 1 0 1 0 0 0 0 1 5: 5 1 0 0 0 0 0 1 0 0 0 1 0 6: 6 1 0 0 0 0 0 1 0 0 1 0 0
Set factor levels before reshaping - lazy version
If it is sufficient to have the columns grouped by REGION
, INCOME_BAND
, and RESIDENCY_YEARS
then we can use a short cut to avoid specifying each value in col_order
. The fct_inorder()
function from the forcats
package reorders factor levels by their first appearance in a vector:
melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
, dcast(.SD, ID ~ value, length)]
ID SW Wales Center SE North Under ?5,000 Over ?70,000 ?15,000-?19,999 10-15 1-5 6-9 15-19 1: 1 1 0 0 0 0 1 0 0 1 0 0 0 2: 2 0 1 0 0 0 0 1 0 0 1 0 0 3: 3 0 0 1 0 0 0 0 1 0 0 1 0 4: 4 0 0 0 1 0 0 0 1 0 0 0 1 5: 5 0 0 0 0 1 0 0 1 1 0 0 0 6: 6 0 0 0 0 1 0 0 1 0 0 1 0
This works because the output of melt()
is ordered by variable
:
melt(exp, id.vars = "ID")
ID variable value 1: 1 REGION SW 2: 2 REGION Wales 3: 3 REGION Center 4: 4 REGION SE 5: 5 REGION North 6: 6 REGION North 7: 1 INCOME_BAND Under ?5,000 8: 2 INCOME_BAND Over ?70,000 9: 3 INCOME_BAND ?15,000-?19,999 10: 4 INCOME_BAND ?15,000-?19,999 11: 5 INCOME_BAND ?15,000-?19,999 12: 6 INCOME_BAND ?15,000-?19,999 13: 1 RESIDENCY_YEARS 10-15 14: 2 RESIDENCY_YEARS 1-5 15: 3 RESIDENCY_YEARS 6-9 16: 4 RESIDENCY_YEARS 15-19 17: 5 RESIDENCY_YEARS 10-15 18: 6 RESIDENCY_YEARS 6-9
来源:https://stackoverflow.com/questions/46224334/data-table-dcast-column-headings