Question
I have the following table:
perid date rating
10001 2005 RD
10001 2006 GN
10001 2007 GD
10002 2008 GD
10002 2009 YW
10002 2010 GN
10002 2011 GN
10003 2005 GD
10003 2006 GN
10003 2007 YW
How can I turn this table into the following format:
perid 2005 2006 2007 2008 2009 2010 2011
10001 RD GN GD N/A N/A N/A N/A
10002 N/A N/A N/A GD YW GN GN
10003 GD GN YW N/A N/A N/A N/A
Or can I do this in R?
Thanks, P
Answer 1:
In base R, the function to use would be reshape, and you would be converting your data from "long" to "wide".
reshape(mydf, direction = "wide", idvar="perid", timevar="date")
# perid rating.2005 rating.2006 rating.2007 rating.2008 rating.2009 rating.2010 rating.2011
# 1 10001 RD GN GD <NA> <NA> <NA> <NA>
# 4 10002 <NA> <NA> <NA> GD YW GN GN
# 8 10003 GD GN YW <NA> <NA> <NA> <NA>
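The columns that reshape produces carry a "rating." prefix. If plain years are wanted as headers, as in the question, a small follow-up step could strip the prefix. This is only a sketch: it rebuilds mydf from the question's table (the real data is assumed to look like this), and the name wide is introduced here for illustration.

```r
# Rebuild the question's table (assumed to match the asker's real data)
mydf <- data.frame(
  perid  = c(10001, 10001, 10001, 10002, 10002, 10002, 10002, 10003, 10003, 10003),
  date   = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2005, 2006, 2007),
  rating = c("RD", "GN", "GD", "GD", "YW", "GN", "GN", "GD", "GN", "YW"),
  stringsAsFactors = FALSE
)

# Long to wide with base R
wide <- reshape(mydf, direction = "wide", idvar = "perid", timevar = "date")

# Drop the "rating." prefix so the headers are just the years
names(wide) <- sub("^rating\\.", "", names(wide))

# Reset the row names, which reshape() carries over from the long data
rownames(wide) <- NULL
wide
```

Because the headers now start with a digit, individual columns have to be accessed with quotes or backticks, for example wide[["2005"]].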
Alternatively, you can look at dcast from the "reshape2" package and try:
library(reshape2)
dcast(mydf, perid ~ date, value.var="rating")
# perid 2005 2006 2007 2008 2009 2010 2011
# 1 10001 RD GN GD <NA> <NA> <NA> <NA>
# 2 10002 <NA> <NA> <NA> GD YW GN GN
# 3 10003 GD GN YW <NA> <NA> <NA> <NA>
For better speed, convert your data.frame to a data.table and use dcast.data.table instead.
library(reshape2)
library(data.table)
DT <- data.table(mydf)
dcast.data.table(DT, perid ~ date, value.var = "rating")
# perid 2005 2006 2007 2008 2009 2010 2011
# 1: 10001 RD GN GD NA NA NA NA
# 2: 10002 NA NA NA GD YW GN GN
# 3: 10003 GD GN YW NA NA NA NA
From your comments, it sounds like you have duplicated values among the combinations of columns 1 and 2, which means that by default, dcast will use length as its aggregation function.
To get past this, you need to make a secondary ID (or "time", actually) column, which can be done like this.
First, some sample data. Note the duplicated combination of the first two columns in rows 1 and 2.
mydf <- data.frame(
period = c(10001, 10001, 10002, 10002, 10003, 10003, 10001, 10001),
date = c(2005, 2005, 2006, 2007, 2005, 2006, 2006, 2007),
rating = c("RD", "GN", "GD", "GD", "YW", "GN", "GD", "YN"))
mydf
# period date rating
# 1 10001 2005 RD
# 2 10001 2005 GN
# 3 10002 2006 GD
# 4 10002 2007 GD
# 5 10003 2005 YW
# 6 10003 2006 GN
# 7 10001 2006 GD
# 8 10001 2007 YN
When you try dcast, it just "counts" the number of rows under each combination.
## Not what you want
dcast(mydf, period ~ date, value.var="rating")
# Aggregation function missing: defaulting to length
# period 2005 2006 2007
# 1 10001 2 1 1
# 2 10002 0 1 1
# 3 10003 1 1 0
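Before casting, it can help to check whether any period/date combination occurs more than once, since that is what triggers the fallback to length. A minimal base R sketch using the same sample data; the names key_counts and dups are made up for this example.

```r
# Same sample data as above
mydf <- data.frame(
  period = c(10001, 10001, 10002, 10002, 10003, 10003, 10001, 10001),
  date   = c(2005, 2005, 2006, 2007, 2005, 2006, 2006, 2007),
  rating = c("RD", "GN", "GD", "GD", "YW", "GN", "GD", "YN"),
  stringsAsFactors = FALSE
)

# Count rows per period/date combination
key_counts <- aggregate(rating ~ period + date, data = mydf, FUN = length)
names(key_counts)[3] <- "n"

# Keep only the combinations that appear more than once
dups <- key_counts[key_counts$n > 1, ]
dups
```

If dups comes back with zero rows, a plain dcast should not need an aggregation function.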
Either decide which duplicated row should be dropped, or, if all the data belongs in your dataset, add a "time" variable, like this:
mydf$time <- ave(1:nrow(mydf), mydf$period, mydf$date, FUN = seq_along)
mydf
# period date rating time
# 1 10001 2005 RD 1
# 2 10001 2005 GN 2
# 3 10002 2006 GD 1
# 4 10002 2007 GD 1
# 5 10003 2005 YW 1
# 6 10003 2006 GN 1
# 7 10001 2006 GD 1
# 8 10001 2007 YN 1
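To see what the ave call is doing, here is a tiny standalone sketch: within each period/date group it numbers the rows in order of appearance. The vectors below are invented purely for illustration and are not the sample data.

```r
# Hypothetical mini example: three rows share the key (10001, 2005)
period <- c(10001, 10001, 10002, 10001)
date   <- c(2005, 2005, 2006, 2005)

# Number the rows within each period/date group
time <- ave(seq_along(period), period, date, FUN = seq_along)
time
```

The third repeat of a key gets 3, so every period/date/time combination ends up unique.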
Now, dcast should work fine. Here's a semi-long version...
dcast(mydf, period + time ~ date, value.var="rating")
# period time 2005 2006 2007
# 1 10001 1 RD GD YN
# 2 10001 2 GN <NA> <NA>
# 3 10002 1 <NA> GD GD
# 4 10003 1 YW GN <NA>
... and a semi-wide version.
dcast(mydf, period ~ date + time, value.var="rating")
# period 2005_1 2005_2 2006_1 2007_1
# 1 10001 RD GN GD YN
# 2 10002 <NA> <NA> GD GD
# 3 10003 YW <NA> GN <NA>
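For the other option mentioned above, dropping the duplicated rows, a rough base R sketch might look like the following. Which row to keep is a judgment call; keeping the first occurrence is only an assumption here, and first_only and wide are names invented for this example.

```r
# Same sample data as above (without the extra "time" column)
mydf <- data.frame(
  period = c(10001, 10001, 10002, 10002, 10003, 10003, 10001, 10001),
  date   = c(2005, 2005, 2006, 2007, 2005, 2006, 2006, 2007),
  rating = c("RD", "GN", "GD", "GD", "YW", "GN", "GD", "YN"),
  stringsAsFactors = FALSE
)

# Keep only the first row for each period/date combination
# (assumption: the first occurrence is the one worth keeping)
first_only <- mydf[!duplicated(mydf[c("period", "date")]), ]

# With unique keys, a plain long-to-wide reshape works
wide <- reshape(first_only, direction = "wide", idvar = "period", timevar = "date")
wide
```

This loses the second rating for 10001 in 2005, so it only makes sense when the extra rows really are redundant.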
Answer 2:
A simple way of doing this is to use the reshape2 package:
period <- c(10001,10001,10001,10002,10002,10002,10002,10003,10003,10003)
date <- c(2005, 2006,2007,2008, 2009,2010,2011,2005,2006,2007)
rating <- c("RD","GN","GD","GD","YW","GN", "GN","GD", "GN","YW")
a <- data.frame(period,date,rating)
library(reshape2)
b <- dcast(a,formula=period~date,value.var="rating")
b
# period 2005 2006 2007 2008 2009 2010 2011
# 1 10001 RD GN GD <NA> <NA> <NA> <NA>
# 2 10002 <NA> <NA> <NA> GD YW GN GN
# 3 10003 GD GN YW <NA> <NA> <NA> <NA>
Source: https://stackoverflow.com/questions/23003508/reshape-table-in-mysql-or-r