Reformat table in R

Question

I have a table like the one below (rows with the same ID share the same gender and age values but have different category and sub-category values):

  ID product.category sub.category gender   age
1  1             food      chicken      M young
2  1          kitchen       napkin      M young
3  1             food        steak      M young
4  2       electronic        phone      F   mid
5  3            cloth        shirt      M   old
6  3          kitchen         bowl      M   old
7  4             alch         beer      F young

By combining the rows that share an ID, I want to reshape the table as below:

  ID product.category1 sub.category1 product.category2 sub.category2 product.category3 sub.category3 gender   age
1  1              food       chicken           kitchen        napkin              food         steak      M young
2  2        electronic         phone              null          null              null          null      F   mid
3  3             cloth         shirt           kitchen          bowl              null          null      M   old
4  4              alch          beer              null          null              null          null      F young

How can I do this in R?

New dataset (the text variable is actually a text column of notes):

text    Category    Subcategory variable1   variable2   variable3   variable4   date
aaaaa   c1  s11 v1  N   RETAIL  Y   2014-01
aaaaa   c2  s22 v1  N   LEASE   Y   2014-01
aaaaa   c3  s31 v1  N   LEASE   Y   2014-01
bbbbb   c1  s12 v2  N   LEASE   Y   2014-01
ccccc   c2  s21 v1  N   LEASE   Y   2014-01
ddddd   c2  s21 v1  N   RETAIL  Y   2014-01
ddddd   c3  s31 v1  N   LEASE   Y   2014-01
eeeee   c1  s11 v1  N   RETAIL  Y   2014-01
fffff   c2  s21 v2  U   RETAIL  Y   2014-01

Thanks

Answer 1:

We use a combination of melt and dcast from the package reshape2.

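library(dplyr)
library(reshape2)
# Stack the two category columns, then number the rows within each ID/column pair
m2 <- melt(df, c("ID", "gender", "age")) %>% group_by(ID, variable) %>%
  mutate(variable2 = paste0(variable, seq_along(value)))
# Drop the now-redundant "variable" column and spread variable2 into wide columns
newdf <- dcast(m2[!names(m2) %in% "variable"], ...~variable2, value.var="value", fill="null")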

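We first melt the original data frame so that the product.category and sub.category values are stacked in a single value column, keeping ID, gender and age as identifiers. Next, using dplyr, we group by ID and by the column holding the original column names (called "variable" by default) and create a new column called variable2. This is just a paste of the original column name and a running count of the observations within each group (product.category1, product.category2, and so on).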

Now we have a new column that we can spread the data out by. We use dcast to go "wide" on the new variable2 column. There's also an argument called fill that we set equal to "null" telling dcast what to fill the missing values with.
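For comparison, the same two steps (number the rows within each ID, then go wide on that number) can be sketched in base R. This is only an illustrative alternative, not the code used in this answer: df is rebuilt by hand from the question's table, obs is a column name introduced here, and missing cells come out as NA rather than "null".

```r
# Rebuild the question's example table (the name `df` is assumed, matching the answer).
df <- data.frame(
  ID               = c(1, 1, 1, 2, 3, 3, 4),
  product.category = c("food", "kitchen", "food", "electronic", "cloth", "kitchen", "alch"),
  sub.category     = c("chicken", "napkin", "steak", "phone", "shirt", "bowl", "beer"),
  gender           = c("M", "M", "M", "F", "M", "M", "F"),
  age              = c("young", "young", "young", "mid", "old", "old", "young"),
  stringsAsFactors = FALSE
)

# Running count of rows within each ID: 1, 2, 3 for ID 1; 1 for ID 2; and so on.
# `obs` is a helper column name chosen for this sketch.
df$obs <- ave(seq_along(df$ID), df$ID, FUN = seq_along)

# Spread the two category columns across the running count.
wide <- reshape(
  df,
  idvar     = c("ID", "gender", "age"),
  timevar   = "obs",
  v.names   = c("product.category", "sub.category"),
  direction = "wide",
  sep       = ""
)
rownames(wide) <- NULL
wide
```

Because reshape appends the value columns one observation at a time, wide already comes out in the order product.category1, sub.category1, product.category2, and so on, after the ID, gender and age columns.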

Below we reorder the columns to match the desired output. The trick is a small one, but worth noting: we need an interleaving sequence. As it stands, the output columns are ordered alphabetically ("p1", "p2", "p3", "s1", "s2", "s3"). We want a sequence that weaves the two groups together, that is, something like (1,4,2,5,3,6). So we use:

c(rbind(1:3, 4:6))
[1] 1 4 2 5 3 6

Cool, huh? We enter the values by row with rbind, and c then unwinds the resulting matrix column-wise. In our case, hard-coding 1:3 won't do, because there might be more products per ID in other data. But we know that there are exactly two headings, "product category" and "sub category", so we divide the number of unique values of variable2 by 2 and use that instead.
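The same idea can be wrapped in a small helper so that the number of column pairs is not hard-coded. A minimal sketch; interleave_idx is a hypothetical name introduced here, not part of any package:

```r
# Hypothetical helper: positions that interleave two blocks of k columns each.
interleave_idx <- function(k) {
  # rbind() stacks the two index runs as the rows of a 2 x k matrix;
  # c() then reads that matrix column by column.
  c(rbind(seq_len(k), k + seq_len(k)))
}

interleave_idx(3)
# [1] 1 4 2 5 3 6
interleave_idx(4)
# [1] 1 5 2 6 3 7 4 8
```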

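# n = number of widened columns (two headings times the largest row count per ID)
n <- nrow(unique(m2[,"variable2"]))
# Keep the 3 identifier columns, then interleave the two halves (offset by 3)
newdf[c(1:3,(c(rbind(1:(n/2), (n/2+1):n))+3))]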
#   ID gender   age product.category1 sub.category1 product.category2
# 1  1      M young              food       chicken           kitchen
# 2  2      F   mid        electronic         phone              null
# 3  3      M   old             cloth         shirt           kitchen
# 4  4      F young              alch          beer              null
#   sub.category2 product.category3 sub.category3
# 1        napkin              food         steak
# 2          null              null          null
# 3          bowl              null          null
# 4          null              null          null

Update

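With the new data set provided, the same code structure works with the new column names. Note that the helper column is still named variable2, which collides with the data's own variable2 column: mutate overwrites it, so that column is absent from the output below (only five identifier columns remain). Giving the helper column a different name avoids this.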

m2 <- melt(df, measure.vars=c("Category", "Subcategory")) %>% group_by(text, variable) %>%
  mutate(variable2 = paste0(variable, seq_along(value)))

newdf <- dcast(m2[!names(m2) %in% "variable"], ... ~ variable2, value.var="value", fill="null")
n <- nrow(unique(m2[,"variable2"]))
newdf2 <- newdf[c(1:5, c(rbind(1:(n/2), (n/2+1):n))+5)]
newdf2
#    text variable1 variable3 variable4    date Category1 Subcategory1 Category2
# 1 aaaaa        v1     LEASE         Y 2014-01      null         null        c2
# 2 aaaaa        v1    RETAIL         Y 2014-01        c1          s11      null
# 3 bbbbb        v2     LEASE         Y 2014-01        c1          s12      null
# 4 ccccc        v1     LEASE         Y 2014-01        c2          s21      null
# 5 ddddd        v1     LEASE         Y 2014-01      null         null        c3
# 6 ddddd        v1    RETAIL         Y 2014-01        c2          s21      null
# 7 eeeee        v1    RETAIL         Y 2014-01        c1          s11      null
# 8 fffff        v2    RETAIL         Y 2014-01        c2          s21      null
#   Subcategory2 Category3 Subcategory3
# 1          s22        c3          s31
# 2         null      null         null
# 3         null      null         null
# 4         null      null         null
# 5          s31      null         null
# 6         null      null         null
# 7         null      null         null
# 8         null      null         null

Answer 2:

data.table dcast

You could use dcast from the reshape2 or data.table package; the call below passes several columns to value.var, which data.table's dcast supports:

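library(data.table)
setDT(DT)  # DT is the question's table; converts it to a data.table by reference

# Number the rows within each ID
DT[, obsno := 1:.N, by=ID]
# One row per ID/gender/age, one column per (value column, obsno) pair
res <- dcast(DT, ID+gender+age~obsno, value.var=c("product.category","sub.category"))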

which gives

   ID gender   age product.category_1 product.category_2 product.category_3 sub.category_1 sub.category_2 sub.category_3
1:  1      M young               food            kitchen               food        chicken         napkin          steak
2:  2      F   mid         electronic                 NA                 NA          phone             NA             NA
3:  3      M   old              cloth            kitchen                 NA          shirt           bowl             NA
4:  4      F young               alch                 NA                 NA           beer             NA             NA

To see the columns in your desired order, you could do something like

res[, c(1:3,4,7,5,8,6,9), with=FALSE]
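If the number of observations per ID is not known in advance, the order can be built from the column names instead of typed-in positions. A rough sketch, assuming the wide result uses the name_number pattern shown above; cols, id_cols, k and ord are names introduced here for illustration:

```r
# Column names as they appear in the wide result above.
cols <- c("ID", "gender", "age",
          "product.category_1", "product.category_2", "product.category_3",
          "sub.category_1", "sub.category_2", "sub.category_3")

id_cols <- c("ID", "gender", "age")
k <- (length(cols) - length(id_cols)) / 2   # observations per ID, here 3

# Pair each product.category_i with its sub.category_i.
ord <- c(id_cols,
         c(rbind(paste0("product.category_", seq_len(k)),
                 paste0("sub.category_", seq_len(k)))))
ord
```

With the real table, cols would be names(res), and the vector could then be used as res[, ord, with=FALSE] or passed to setcolorder(res, ord).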

A similar approach is probably possible with the tidyr package (though it won't be called "dcast").

I'd suggest sticking to long format (what you had originally) for any analysis. This wide format that you're looking for is very cumbersome for anything but browsing the data.
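To illustrate why long format is easier to work with, here is a small base R sketch on a hand-built copy of the question's table (the name long is an assumption made for this example). Each summary is a one-liner, whereas in wide format the same counts would mean scanning across product.category_1, product.category_2 and so on:

```r
# Hand-built copy of the question's long table (only the columns needed here).
long <- data.frame(
  ID               = c(1, 1, 1, 2, 3, 3, 4),
  product.category = c("food", "kitchen", "food", "electronic", "cloth", "kitchen", "alch"),
  stringsAsFactors = FALSE
)

# Number of rows per ID.
per_id <- table(long$ID)

# Number of rows per product category, across all IDs.
per_cat <- table(long$product.category)

per_id
per_cat
```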

Second example

For the OP's second example, I would do

DT2[, obsno := 1:.N, by=text]
dcast(DT2, ...~obsno, value.var=c("Category", "Subcategory"))

copying the ...~ trick from @PierreLafortune's answer. The result is

    text variable1 variable2 variable3 variable4    date Category_1 Category_2 Category_3 Subcategory_1 Subcategory_2 Subcategory_3
1: aaaaa        v1         N     LEASE         Y 2014-01         NA         c2         c3            NA           s22           s31
2: aaaaa        v1         N    RETAIL         Y 2014-01         c1         NA         NA           s11            NA            NA
3: bbbbb        v2         N     LEASE         Y 2014-01         c1         NA         NA           s12            NA            NA
4: ccccc        v1         N     LEASE         Y 2014-01         c2         NA         NA           s21            NA            NA
5: ddddd        v1         N     LEASE         Y 2014-01         NA         c3         NA            NA           s31            NA
6: ddddd        v1         N    RETAIL         Y 2014-01         c2         NA         NA           s21            NA            NA
7: eeeee        v1         N    RETAIL         Y 2014-01         c1         NA         NA           s11            NA            NA
8: fffff        v2         U    RETAIL         Y 2014-01         c2         NA         NA           s21            NA            NA

Answer 3:

An alternative with dplyr & tidyr:

library(dplyr)
library(tidyr)

newdf <- df %>% gather(variable, value, product.category, sub.category) %>%
  group_by(ID, variable) %>%
  mutate(variable2 = paste0(variable, seq_along(value))) %>%
  ungroup() %>%
  select(-variable) %>%
  spread(variable2 , value)

which gives:

> newdf
Source: local data frame [4 x 9]

     ID gender    age product.category1 product.category2 product.category3 sub.category1 sub.category2 sub.category3
  (int) (fctr) (fctr)             (chr)             (chr)             (chr)         (chr)         (chr)         (chr)
1     1      M  young              food           kitchen              food       chicken        napkin         steak
2     2      F    mid        electronic                NA                NA         phone            NA            NA
3     3      M    old             cloth           kitchen                NA         shirt          bowl            NA
4     4      F  young              alch                NA                NA          beer            NA            NA

The same can be done on the second example dataset:

newdat <- dat %>% gather(variable, value, Category, Subcategory) %>%
  group_by(text, variable) %>%
  mutate(var2 = paste0(variable, seq_along(value))) %>%
  ungroup() %>%
  select(-variable) %>%
  spread(var2 , value)

which gives:

> newdat
Source: local data frame [8 x 12]

    text variable1 variable2 variable3 variable4    date Category1 Category2 Category3 Subcategory1 Subcategory2 Subcategory3
  (fctr)    (fctr)    (fctr)    (fctr)    (fctr)  (fctr)     (chr)     (chr)     (chr)        (chr)        (chr)        (chr)
1  aaaaa        v1         N     LEASE         Y 2014-01        NA        c2        c3           NA          s22          s31
2  aaaaa        v1         N    RETAIL         Y 2014-01        c1        NA        NA          s11           NA           NA
3  bbbbb        v2         N     LEASE         Y 2014-01        c1        NA        NA          s12           NA           NA
4  ccccc        v1         N     LEASE         Y 2014-01        c2        NA        NA          s21           NA           NA
5  ddddd        v1         N     LEASE         Y 2014-01        NA        c3        NA           NA          s31           NA
6  ddddd        v1         N    RETAIL         Y 2014-01        c2        NA        NA          s21           NA           NA
7  eeeee        v1         N    RETAIL         Y 2014-01        c1        NA        NA          s11           NA           NA
8  fffff        v2         U    RETAIL         Y 2014-01        c2        NA        NA          s21           NA           NA

Source: https://stackoverflow.com/questions/32570473/reformat-table-in-r
