Question
I have a table like the one below (rows with the same ID share the same gender and age values but have different category and sub-category values):
  ID product.category sub.category gender   age
1  1             food      chicken      M young
2  1          kitchen       napkin      M young
3  1             food        steak      M young
4  2       electronic        phone      F   mid
5  3            cloth        shirt      M   old
6  3          kitchen         bowl      M   old
7  4             alch         beer      F young
By combining the rows that share the same ID, I want to reshape the table as below:
  ID product.category1 sub.category1 product.category2 sub.category2 product.category3 sub.category3 gender   age
1  1              food       chicken           kitchen        napkin              food         steak      M young
2  2        electronic         phone              null          null              null          null      F   mid
3  3             cloth         shirt           kitchen          bowl              null          null      M   old
4  4              alch          beer              null          null              null          null      F young
How can I do this in R?
New dataset (the text variable is actually a text column of notes):
 text Category Subcategory variable1 variable2 variable3 variable4    date
aaaaa       c1         s11        v1         N    RETAIL         Y 2014-01
aaaaa       c2         s22        v1         N     LEASE         Y 2014-01
aaaaa       c3         s31        v1         N     LEASE         Y 2014-01
bbbbb       c1         s12        v2         N     LEASE         Y 2014-01
ccccc       c2         s21        v1         N     LEASE         Y 2014-01
ddddd       c2         s21        v1         N    RETAIL         Y 2014-01
ddddd       c3         s31        v1         N     LEASE         Y 2014-01
eeeee       c1         s11        v1         N    RETAIL         Y 2014-01
fffff       c2         s21        v2         U    RETAIL         Y 2014-01
Thanks
Answer 1:
We use a combination of melt and dcast from the package reshape2.
library(dplyr)
library(reshape2)
m2 <- melt(df, c("ID", "gender", "age")) %>% group_by(ID, variable) %>%
mutate(variable2 = paste0(variable, seq_along(value)))
newdf <- dcast(m2[!names(m2) %in% "variable"], ...~variable2, value.var="value", fill="null")
We first melt the original data frame, keeping ID, gender and age as identifiers so that the product category and sub-category columns are stacked. Next, using dplyr, we group by the ID column and the melted column (called "variable" by default) and create a new column called variable2. This is just a paste of the column name and a running count of observations.
Now we have a new column that we can spread the data out by. We use dcast to go "wide" on the new variable2 column. There's also an argument called fill that we set equal to "null", telling dcast what to fill the missing values with.
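To make the helper column concrete, here is a minimal base R sketch of the running count on its own. The data frame is rebuilt from the question's first table; the names obs and key are made up for illustration, and ave() stands in for the group_by/mutate step rather than reproducing it exactly.

```r
# Rebuild the question's first table (values copied from the post).
df <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3, 4),
  product.category = c("food", "kitchen", "food", "electronic",
                       "cloth", "kitchen", "alch"),
  sub.category = c("chicken", "napkin", "steak", "phone",
                   "shirt", "bowl", "beer"),
  gender = c("M", "M", "M", "F", "M", "M", "F"),
  age = c("young", "young", "young", "mid", "old", "old", "young"),
  stringsAsFactors = FALSE
)

# Running count of rows within each ID (hypothetical name: obs).
obs <- ave(seq_along(df$ID), df$ID, FUN = seq_along)

# Labels of the kind the melt + paste0 step attaches to the
# product.category rows (hypothetical name: key).
key <- paste0("product.category", obs)
print(data.frame(ID = df$ID, obs = obs, key = key))
```

Each ID restarts the count at 1, which is what lets the later reshape line up the first, second and third product of every ID in the same column.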
Below we reorder the columns to match the desired output. The trick is small but worth noting: we need an interleaving sequence. As is, the output columns are ordered alphabetically ("p1", "p2", "p3", "s1", "s2", "s3"), and we want to weave the two groups together, which means producing an index like (1,4,2,5,3,6). So we use:
c(rbind(1:3, 4:6))
[1] 1 4 2 5 3 6
Cool huh? We take advantage of the fact that rbind stacks 1:3 and 4:6 as rows, while c() unwinds the resulting matrix column-wise. Hard-coding 1:3 won't do in our case, because there might be more products in the data. But we know there are exactly two headings, "product category" and "sub category", so we divide the number of unique values of variable2 by 2 and use that instead.
n <- nrow(unique(m2[,"variable2"]))
newdf[c(1:3,(c(rbind(1:(n/2), (n/2+1):n))+3))]
# ID gender age product.category1 sub.category1 product.category2
# 1 1 M young food chicken kitchen
# 2 2 F mid electronic phone null
# 3 3 M old cloth shirt kitchen
# 4 4 F young alch beer null
# sub.category2 product.category3 sub.category3
# 1 napkin food steak
# 2 null null null
# 3 bowl null null
# 4 null null null
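The interleaving index can be wrapped in a small helper so that it also covers more than two blocks of columns. This is only a sketch: interleave_idx and its arguments n_per and k are hypothetical names, not part of reshape2 or dplyr.

```r
# Hypothetical helper: positions that interleave k consecutive blocks
# of n_per columns each (block 1 = 1..n_per, block 2 = the next n_per, ...).
interleave_idx <- function(n_per, k = 2) {
  # Fill a k-row matrix row by row, then let c() unwind it column by column.
  c(matrix(seq_len(n_per * k), nrow = k, byrow = TRUE))
}

print(interleave_idx(3))         # 1 4 2 5 3 6
print(interleave_idx(2, k = 3))  # 1 3 5 2 4 6
```

With the objects from the answer in scope, newdf[c(1:3, interleave_idx(n / 2) + 3)] should select the same columns as the line above.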
Update
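With the new data set provided, the same code structure works with the new column names. One caveat: this data already has a column named variable2, so the helper column created by mutate overwrites it, which is why variable2 is missing from the output below. Give the helper column a different name if you need to keep the original.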
m2 <- melt(df, measure.vars=c("Category", "Subcategory")) %>% group_by(text, variable) %>%
mutate(variable2 = paste0(variable, seq_along(value)))
newdf <- dcast(m2[!names(m2) %in% "variable"], ... ~ variable2, value.var="value", fill="null")
n <- nrow(unique(m2[,"variable2"]))
newdf2 <- newdf[c(1:5, c(rbind(1:(n/2), (n/2+1):n))+5)]
newdf2
# text variable1 variable3 variable4 date Category1 Subcategory1 Category2
# 1 aaaaa v1 LEASE Y 2014-01 null null c2
# 2 aaaaa v1 RETAIL Y 2014-01 c1 s11 null
# 3 bbbbb v2 LEASE Y 2014-01 c1 s12 null
# 4 ccccc v1 LEASE Y 2014-01 c2 s21 null
# 5 ddddd v1 LEASE Y 2014-01 null null c3
# 6 ddddd v1 RETAIL Y 2014-01 c2 s21 null
# 7 eeeee v1 RETAIL Y 2014-01 c1 s11 null
# 8 fffff v2 RETAIL Y 2014-01 c2 s21 null
# Subcategory2 Category3 Subcategory3
# 1 s22 c3 s31
# 2 null null null
# 3 null null null
# 4 null null null
# 5 s31 null null
# 6 null null null
# 7 null null null
# 8 null null null
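Answer 2:
data.table dcast
You could use dcast from the reshape2 or data.table package. The call below needs data.table's version, because it passes several value.var columns at once: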
library(data.table)
setDT(DT)
DT[, obsno := 1:.N, by=ID]
res <- dcast(DT, ID+gender+age~obsno, value.var=c("product.category","sub.category"))
which gives
   ID gender   age product.category_1 product.category_2 product.category_3 sub.category_1 sub.category_2 sub.category_3
1:  1      M young               food            kitchen               food        chicken         napkin          steak
2:  2      F   mid         electronic                 NA                 NA          phone             NA             NA
3:  3      M   old              cloth            kitchen                 NA          shirt           bowl             NA
4:  4      F young               alch                 NA                 NA           beer             NA             NA
To see the columns in your desired order, you could do something like
res[, c(1:3,4,7,5,8,6,9), with=FALSE]
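If hard-coded positions feel fragile, the same order can be built from the column names. The sketch below assumes the names shown in the output above and a maximum of three rows per ID; n_obs, wide_cols and new_order are illustrative names.

```r
# Column names as they appear in the dcast output above.
res_names <- c("ID", "gender", "age",
               "product.category_1", "product.category_2", "product.category_3",
               "sub.category_1", "sub.category_2", "sub.category_3")

# Assumed maximum number of rows per ID.
n_obs <- 3

# paste0 recycles the two prefixes against 1,1,2,2,3,3, which interleaves them.
wide_cols <- paste0(c("product.category_", "sub.category_"),
                    rep(seq_len(n_obs), each = 2))
new_order <- c("ID", "gender", "age", wide_cols)

# Positions of the new order within the original names.
print(match(new_order, res_names))  # 1 2 3 4 7 5 8 6 9
```

With the data.table res in scope, data.table::setcolorder(res, new_order) would then reorder the columns by reference.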
A similar approach is probably possible with the tidyr package (though it won't be called "dcast").
I'd suggest sticking to long format (what you had originally) for any analysis. This wide format that you're looking for is very cumbersome for anything but browsing the data.
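For comparison, the counter-then-widen idea can also be sketched without any packages, using reshape() from base R. This is an alternative technique, not what the answer above uses; the data frame is rebuilt from the question's first table and obs is an illustrative name for the per-ID counter.

```r
# Rebuild the question's first table (values copied from the post).
df <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3, 4),
  product.category = c("food", "kitchen", "food", "electronic",
                       "cloth", "kitchen", "alch"),
  sub.category = c("chicken", "napkin", "steak", "phone",
                   "shirt", "bowl", "beer"),
  gender = c("M", "M", "M", "F", "M", "M", "F"),
  age = c("young", "young", "young", "mid", "old", "old", "young"),
  stringsAsFactors = FALSE
)

# Per-ID observation counter, playing the role of obsno above.
df$obs <- ave(seq_along(df$ID), df$ID, FUN = seq_along)

# Go wide: one row per ID, one pair of value columns per counter value.
# gender and age are constant within ID, so reshape() carries them along.
wide <- reshape(df, idvar = "ID", timevar = "obs",
                v.names = c("product.category", "sub.category"),
                direction = "wide", sep = "")
rownames(wide) <- NULL
print(wide)
```

Missing combinations come back as NA rather than "null", and the value columns should come out grouped by counter value, so no reordering step is needed.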
Second example
For the OP's second example, I would do
DT2[, obsno := 1:.N, by=text]
dcast(DT2, ...~obsno, value.var=c("Category", "Subcategory"))
copying the ...~ trick from @PierreLafortune's answer. The result is
text variable1 variable2 variable3 variable4 date Category_1 Category_2 Category_3 Subcategory_1 Subcategory_2 Subcategory_3
1: aaaaa v1 N LEASE Y 2014-01 NA c2 c3 NA s22 s31
2: aaaaa v1 N RETAIL Y 2014-01 c1 NA NA s11 NA NA
3: bbbbb v2 N LEASE Y 2014-01 c1 NA NA s12 NA NA
4: ccccc v1 N LEASE Y 2014-01 c2 NA NA s21 NA NA
5: ddddd v1 N LEASE Y 2014-01 NA c3 NA NA s31 NA
6: ddddd v1 N RETAIL Y 2014-01 c2 NA NA s21 NA NA
7: eeeee v1 N RETAIL Y 2014-01 c1 NA NA s11 NA NA
8: fffff v2 U RETAIL Y 2014-01 c2 NA NA s21 NA NA
Answer 3:
An alternative with dplyr & tidyr:
library(dplyr)
library(tidyr)
newdf <- df %>% gather(variable, value, product.category, sub.category) %>%
  group_by(ID, variable) %>%
  mutate(variable2 = paste0(variable, seq_along(value))) %>%
  ungroup() %>%
  select(-variable) %>%
  spread(variable2, value)
which gives:
> newdf
Source: local data frame [4 x 9]
ID gender age product.category1 product.category2 product.category3 sub.category1 sub.category2 sub.category3
(int) (fctr) (fctr) (chr) (chr) (chr) (chr) (chr) (chr)
1 1 M young food kitchen food chicken napkin steak
2 2 F mid electronic NA NA phone NA NA
3 3 M old cloth kitchen NA shirt bowl NA
4 4 F young alch NA NA beer NA NA
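In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A rough equivalent with pivot_wider might look like the sketch below; it assumes tidyr >= 1.0.0 is installed, rebuilds the data from the question's first table, and uses obs as an illustrative name for the per-ID counter.

```r
library(tidyr)

# Rebuild the question's first table (values copied from the post).
df <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3, 4),
  product.category = c("food", "kitchen", "food", "electronic",
                       "cloth", "kitchen", "alch"),
  sub.category = c("chicken", "napkin", "steak", "phone",
                   "shirt", "bowl", "beer"),
  gender = c("M", "M", "M", "F", "M", "M", "F"),
  age = c("young", "young", "young", "mid", "old", "old", "young"),
  stringsAsFactors = FALSE
)

# Per-ID counter, computed in base R to keep the sketch short.
df$obs <- ave(seq_along(df$ID), df$ID, FUN = seq_along)

# Widen both value columns at once; names_sep = "" yields names
# such as product.category1 instead of product.category_1.
wide <- pivot_wider(df,
                    names_from = obs,
                    values_from = c(product.category, sub.category),
                    names_sep = "")
print(wide)
```

Because both value columns are widened in one call, the intermediate long step and the pasted key column are not needed.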
The same can be done on the second example dataset:
newdat <- dat %>% gather(variable, value, Category, Subcategory) %>%
group_by(text, variable) %>%
mutate(var2 = paste0(variable, seq_along(value))) %>%
ungroup() %>%
select(-variable) %>%
spread(var2 , value)
which gives:
> newdat
Source: local data frame [8 x 12]
text variable1 variable2 variable3 variable4 date Category1 Category2 Category3 Subcategory1 Subcategory2 Subcategory3
(fctr) (fctr) (fctr) (fctr) (fctr) (fctr) (chr) (chr) (chr) (chr) (chr) (chr)
1 aaaaa v1 N LEASE Y 2014-01 NA c2 c3 NA s22 s31
2 aaaaa v1 N RETAIL Y 2014-01 c1 NA NA s11 NA NA
3 bbbbb v2 N LEASE Y 2014-01 c1 NA NA s12 NA NA
4 ccccc v1 N LEASE Y 2014-01 c2 NA NA s21 NA NA
5 ddddd v1 N LEASE Y 2014-01 NA c3 NA NA s31 NA
6 ddddd v1 N RETAIL Y 2014-01 c2 NA NA s21 NA NA
7 eeeee v1 N RETAIL Y 2014-01 c1 NA NA s11 NA NA
8 fffff v2 U RETAIL Y 2014-01 c2 NA NA s21 NA NA
Source: https://stackoverflow.com/questions/32570473/reformat-table-in-r