问题
I am trying to use ddply
with transform
to populate a new variable (summary_Date
) in a dataframe with variables ID
and Date
. The value of the variable is chosen based on the length of the piece that is being evaluated using ifelse
:
If there are less than five observations for an ID in a given month, I want to have summary_Date
be calculated by rounding the date to the nearest month (using round_date
from package lubridate
); if there are more than five observations for an ID in a given month, I want the summary_Date
to simply be Date
.
require(plyr)
require(lubridate)
test.df <- structure(
list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1
, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2)
, Date = structure(c(-247320000, -246196800, -245073600, -243864000
, -242654400, -241444800, -126273600, -123595200
, -121176000, -118497600, 1359385200, 1359388800
, 1359392400, 1359396000, 1359399600, 1359403200
, 1359406800, 1359410400, 1359414000, 1359417600
, 55598400, 56116800, 58881600, 62078400, 64756800
, 67348800, 69854400, 72964800, 76161600, 79012800
, 1358589600, 1358676000, 1358762400, 1358848800
, 1358935200, 1359021600, 1359108000, 1359194400
, 1359280800, 1359367200), tzone = "GMT"
, class = c("POSIXct", "POSIXt"))
, Val=rnorm(40))
, .Names = c("ID", "Date", "Val"), row.names = c(NA, 40L)
, class = "data.frame")
test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
, summary_Date=as.POSIXct(ifelse(length(ID)<5
, round_date(Date, "month")
,Date)
, origin="1970-01-01 00:00.00"
, tz="GMT")
# Included length_x to easily see the length of the subset
, length_x = length(ID))
head(test.df,5)
# floor_date(Date, "month") ID Date Val summary_Date length_x
# 1 1962-03-01 1 1962-03-01 12:00:00 -0.1037988 1962-03-01 3
# 2 1962-03-01 1 1962-03-14 12:00:00 0.2923056 1962-03-01 3
# 3 1962-03-01 1 1962-03-27 12:00:00 0.4435410 1962-03-01 3
# 4 1962-04-01 1 1962-04-10 12:00:00 0.1159164 1962-04-01 2
# 5 1962-04-01 1 1962-04-24 12:00:00 2.9824075 1962-04-01 2
The ifelse
statement seems to be working, but the value in 'summary_Date' seems to be the first value calculated for the subset that transform is working on, rather than the row-specific value. For example in row 3, summary_Date
should be 1962-04-01
because the date 1962-03-27 12:00:00'
should be rounded up (because there are fewer than five rows in the subset), but instead the first calculated value of summary_Date
(1962-03-01
) is repeated in all rows in that subset.
EDIT: I was inspired by Ricardo's answer using data.table
to try it in two steps with ddply
. It works also:
test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
, length_x = length(ID))
test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
, summary_Date=as.POSIXct(ifelse(length_x<5
, round_date(Date, "month")
,Date)
, origin="1970-01-01 00:00.00"
, tz="GMT"))
head(test.df,5)[c(1,3:7)]
# floor_date(Date, "month") ID Date Val length_x summary_Date
# 1 1962-03-01 1 1962-03-01 12:00:00 -0.1711212 3 1962-03-01
# 2 1962-03-01 1 1962-03-14 12:00:00 -0.1531571 3 1962-03-01
# 3 1962-03-01 1 1962-03-27 12:00:00 0.1256238 3 1962-04-01
# 4 1962-04-01 1 1962-04-10 12:00:00 1.4481225 2 1962-04-01
# 5 1962-04-01 1 1962-04-24 12:00:00 -0.6508731 2 1962-05-01
回答1:
One Step ddply
solution (also posted as comment)
ddply(test.df, .(ID, floor_date(Date, "month")), mutate,
length_x = length(ID),
summary_Date=as.POSIXct(ifelse(length_x < 5, round_date(Date, "month") ,Date)
, origin="1970-01-01 00:00.00", tz="GMT")
)
回答2:
# transform to data.table
library(data.table)
test.dt <- data.table(test.df)
# calculate length of id by month-year.
test.dt[, idlen := length(ID), by=list(month(Date), year(Date)) ]
# calculate the summary date
test.dt[, summary_Date := ifelse(idlen<5, as.Date(round_date(Date, "month")), as.Date(Date))]
# If you would like to have it formatted add the following:
test.dt[, summary_Date := as.Date(summary_Date, origin="1970-01-01")]
Results:
> test.dt
ID Date Val idlen summary_Date
1: 1 1962-03-01 12:00:00 0.42646422 3 1962-03-01
2: 1 1962-03-14 12:00:00 -0.29507148 3 1962-03-01
3: 1 1962-03-27 12:00:00 0.89512566 3 1962-04-01 <~~~~~
4: 1 1962-04-10 12:00:00 0.87813349 2 1962-04-01
5: 1 1962-04-24 12:00:00 0.82158108 2 1962-05-01
6: 1 1962-05-08 12:00:00 0.68864025 1 1962-05-01
UPDATE:
Explanation of why two steps are needed
The reason it cannot be done in one step has to do with the fact that you are only getting a single value per group. When you assign that value to the members of the group, you are assigning 1 element to many. R
knows how to handle such situations very well: recycling
the single element.
However, in this specifica case, you do not want to recycle; Rather, you do not want to apply the 1
element to many
. Therefore, you need unique groups, which is what we do in the second step. Each element (row) of the group then gets assigned its own, specific value.
UPDATE 2:
@Ramnath gave a great suggestion of using mutate
. Taking a look at ?mutate
, it gives:
This function is very similar to transform but it executes the transformations iteratively ... later transformations can use the columns created by earlier transformations
Which is exactly what you want to do!
来源:https://stackoverflow.com/questions/15164759/using-ifelse-with-transform-in-ddply