Question
Suppose I have two dataframes.
The first one contains the "Date" on which a "Name" issues a "Rec" for an "ID", and the "Stop.Date" on which that "Rec" becomes invalid.
df (only a part)
structure(list(Date = structure(c(13236, 13363, 14074, 13199,
14554), class = "Date"), ID = c("AU0000XINAA9", "AU0000XINAA9",
"AU0000XINAC5", "AU0000XINAI2", "AU0000XINAJ0"), Name = c("N+1 BREWIN",
"N+1 BREWIN", "ARBUTHNOT SECURITIES LTD.", "INVESTEC BANK (UK) PLC",
"AWRAQ INVESTMENTS"), Rec = c(1, 2, 2, 2, 1), Stop.Date = structure(c(13363,
13509, 14937, 13230, 16702), class = "Date")), .Names = c("Date",
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L))
The second dataframe only contains a time series of dates, in this case from 2006-02-20 until the end of 2006.
df2
Date1
1: 2006-02-20
2: 2006-02-21
3: 2006-02-22
4: 2006-02-23
5: 2006-02-24
---
311: 2006-12-27
312: 2006-12-28
313: 2006-12-29
314: 2006-12-30
315: 2006-12-31
Now I want my code to sum all "Rec" values, grouped by ID and Name, if the "Date1" variable in df2 falls within the time range (Date until Stop.Date).
I found the post "R - If date falls within range, then sum", which seems very close to my problem, but its solution does not consider any groups.
I want to end up with a data.frame in which, for each date in df2, the sum of "Rec" for each single "ID" is shown.
Expected output e.g.
Date1 ID SumRec
1 2006-02-20 AU0000XINAI2 2
2 2006-02-21 AU0000XINAI2 2
...
4 2006-03-29 AU0000XINAA9 1
5 2006-03-30 AU0000XINAA9 1
6 2006-08-03 AU0000XINAA9 2 # Date1 2006-08-03 is the end of the range in df row 1,
                            # so it falls within the range in df row 2
...
Please keep in mind this is only a small part of the data. Usually there are many more Recs for each "ID" from different "Names" (which is why the sum function makes sense).
Many thanks for your help in advance.
UPDATED VERSION
new dataframes:
df
structure(list(Date = structure(c(9905, 10381, 10381, 10954,
10584, 10632, 10778, 10520, 10631, 10905), class = "Date"), ID = c("BMG4593F1389",
"BMG4593F1389", "BMG4593F1389", "BMG4593F1389", "BMG4593F1389",
"BMG4593F1389", "BMG4593F1389", "BMG526551004", "BMG526551004",
"BMG526551004"), Name = c("ING FM", "Permission Denied 128064",
"Permission Denied 2880", "Permission Denied 2880", "Permission Denied 32",
"Permission Denied 888", "Permission Denied 888", "Permission Denied 2880",
"Permission Denied 2880", "Permission Denied 2880"), Rec = c(2,
3, 2, 2, 3, 3, 3, 1, 3, 3), Stop.Date = structure(c(12095, 11232,
10954, 11180, 11345, 10764, 11667, 10631, 10905, 11087), class = "Date")), .Names = c("Date",
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))
df2
structure(list(Date1 = structure(c(10954, 10955, 10956, 10957,
10958, 10959), class = "Date")), .Names = "Date1", row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
If I now execute the following code:
library(data.table)
library(lubridate)  # provides interval() and %within%

df <- df[, interval := interval(df$Date, df$Stop.Date)]

df1 <- do.call(rbind, lapply(df2$Date1, function(x) {
  index <- x %within% df$interval
  # any(index) is a length-one test, so each ifelse() keeps only the first match
  list(ID       = ifelse(any(index), df$ID[index], NA),
       Rec      = ifelse(any(index), df$Rec[index], NA),
       Name     = ifelse(any(index), df$Name[index], NA),
       interval = ifelse(any(index), df$interval[index], NA))
}))

df3 <- cbind(df2, df1)
I come up with the following result:
Date1 ID Rec Name interval
1: 1999-12-29 BMG4593F1389 2 ING FM 189216000
2: 1999-12-30 BMG4593F1389 2 ING FM 189216000
3: 1999-12-31 BMG4593F1389 2 ING FM 189216000
4: 2000-01-01 BMG4593F1389 2 ING FM 189216000
5: 2000-01-02 BMG4593F1389 2 ING FM 189216000
6: 2000-01-03 BMG4593F1389 2 ING FM 189216000
But since, e.g., df2$Date1 "1999-12-29" falls within the date range of more entries in df for the df$ID "BMG4593F1389" (issued by different df$Name), the result for this particular df2$Date1 should be:
Expected result for Date1 1999-12-29 (the df3$interval variable is omitted here for simplicity)
Date1 ID Rec Name
1: 1999-12-29 BMG4593F1389 2 ING FM
2: 1999-12-29 BMG4593F1389 3 Permission Denied 128064
3: 1999-12-29 BMG4593F1389 2 Permission Denied 2880
4: 1999-12-29 BMG4593F1389 3 Permission Denied 32
5: 1999-12-29 BMG4593F1389 3 Permission Denied 888
6: 1999-12-29 BMG526551004 3 Permission Denied 2880
7: 1999-12-30 BMG4593F1389 2 ING FM
... etc
So in the end I need the dates in df2$Date1 replicated whenever more than one name has issued a Rec for a specific df$ID whose date range contains that date.
Can somebody help me with that?
Answer 1:
If I understand the updated version of the question correctly, this can be solved using a non-equi join and subsequent aggregation:
library(data.table)
# non-equi join
df[df2, on = .(Date <= Date1, Stop.Date > Date1), allow.cartesian = TRUE][
  # aggregation
  , .(sumRec = sum(Rec)), by = .(Date, ID, Name)]
          Date           ID                     Name sumRec
 1: 1999-12-29 BMG4593F1389                   ING FM      2
 2: 1999-12-29 BMG4593F1389 Permission Denied 128064      3
 3: 1999-12-29 BMG4593F1389   Permission Denied 2880      2
 4: 1999-12-29 BMG4593F1389     Permission Denied 32      3
 5: 1999-12-29 BMG4593F1389    Permission Denied 888      3
 6: 1999-12-29 BMG526551004   Permission Denied 2880      3
 7: 1999-12-30 BMG4593F1389                   ING FM      2
 8: 1999-12-30 BMG4593F1389 Permission Denied 128064      3
 9: 1999-12-30 BMG4593F1389   Permission Denied 2880      2
10: 1999-12-30 BMG4593F1389     Permission Denied 32      3
11: 1999-12-30 BMG4593F1389    Permission Denied 888      3
12: 1999-12-30 BMG526551004   Permission Denied 2880      3
13: 1999-12-31 BMG4593F1389                   ING FM      2
14: 1999-12-31 BMG4593F1389 Permission Denied 128064      3
15: 1999-12-31 BMG4593F1389   Permission Denied 2880      2
16: 1999-12-31 BMG4593F1389     Permission Denied 32      3
17: 1999-12-31 BMG4593F1389    Permission Denied 888      3
18: 1999-12-31 BMG526551004   Permission Denied 2880      3
19: 2000-01-01 BMG4593F1389                   ING FM      2
20: 2000-01-01 BMG4593F1389 Permission Denied 128064      3
21: 2000-01-01 BMG4593F1389   Permission Denied 2880      2
22: 2000-01-01 BMG4593F1389     Permission Denied 32      3
23: 2000-01-01 BMG4593F1389    Permission Denied 888      3
24: 2000-01-01 BMG526551004   Permission Denied 2880      3
25: 2000-01-02 BMG4593F1389                   ING FM      2
26: 2000-01-02 BMG4593F1389 Permission Denied 128064      3
27: 2000-01-02 BMG4593F1389   Permission Denied 2880      2
28: 2000-01-02 BMG4593F1389     Permission Denied 32      3
29: 2000-01-02 BMG4593F1389    Permission Denied 888      3
30: 2000-01-02 BMG526551004   Permission Denied 2880      3
31: 2000-01-03 BMG4593F1389                   ING FM      2
32: 2000-01-03 BMG4593F1389 Permission Denied 128064      3
33: 2000-01-03 BMG4593F1389   Permission Denied 2880      2
34: 2000-01-03 BMG4593F1389     Permission Denied 32      3
35: 2000-01-03 BMG4593F1389    Permission Denied 888      3
36: 2000-01-03 BMG526551004   Permission Denied 2880      3
          Date           ID                     Name sumRec
Please note that I experienced a strange error message when using df as provided in structure(...) directly. The error message went away after calling
df <- as.data.table(df)
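The first version of the question asked for one sum per date and ID rather than one row per name. A minimal sketch of that variant, assuming the same join with a coarser grouping. To keep the block runnable on its own, the sample data is rebuilt here with data.table() from the day counts in the updated question; the names per_id and SumRec are chosen for illustration only.

```r
library(data.table)

# Sample data from the updated question, rebuilt from days since 1970-01-01
df <- data.table(
  Date = as.Date(c(9905, 10381, 10381, 10954, 10584, 10632, 10778,
                   10520, 10631, 10905), origin = "1970-01-01"),
  ID   = c(rep("BMG4593F1389", 7), rep("BMG526551004", 3)),
  Name = c("ING FM", "Permission Denied 128064", "Permission Denied 2880",
           "Permission Denied 2880", "Permission Denied 32",
           "Permission Denied 888", "Permission Denied 888",
           "Permission Denied 2880", "Permission Denied 2880",
           "Permission Denied 2880"),
  Rec  = c(2, 3, 2, 2, 3, 3, 3, 1, 3, 3),
  Stop.Date = as.Date(c(12095, 11232, 10954, 11180, 11345, 10764, 11667,
                        10631, 10905, 11087), origin = "1970-01-01")
)
df2 <- data.table(Date1 = as.Date(10954:10959, origin = "1970-01-01"))

# Same non-equi join as above; after the join, the Date column holds the
# values of Date1, so it is renamed while grouping by date and ID only.
per_id <- df[df2, on = .(Date <= Date1, Stop.Date > Date1),
             allow.cartesian = TRUE][
  , .(SumRec = sum(Rec)), by = .(Date1 = Date, ID)]

print(per_id)
```

With this sample data, each of the six dates yields a SumRec of 13 for BMG4593F1389 and 3 for BMG526551004.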
Explanation
I was asked to explain how the non-equi join works. Non-equi joins are an extension of data.table joins. data.table is a package which enhances base R's data.frame.
Here, we right join df2 with df, i.e., we want to see all rows of df2 with matches in df in the result, but only those where Date1 (from df2) lies between Date and Stop.Date (from df), Date <= Date1 < Stop.Date to be exact. As there are many possible matches, we need to use allow.cartesian = TRUE.
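To make the matching rule concrete, here is a small sketch on toy data. The tables recs and days and all their values are hypothetical, not taken from the question. It also uses the x. and i. prefixes in j to show the original interval bounds next to the matched date, since the join columns in the result otherwise take their values from Date1.

```r
library(data.table)

# Toy data (hypothetical values, for illustration only)
recs <- data.table(
  ID        = c("A", "A", "B"),
  Rec       = c(1, 2, 3),
  Date      = as.Date(c("2020-01-01", "2020-01-03", "2020-01-02")),
  Stop.Date = as.Date(c("2020-01-03", "2020-01-06", "2020-01-04"))
)
days <- data.table(Date1 = as.Date("2020-01-01") + 0:4)

# For every row of `days`, keep the rows of `recs` with
# Date <= Date1 < Stop.Date. x.Date and x.Stop.Date are the
# original bounds from `recs`; i.Date1 is the date from `days`.
joined <- recs[days,
               on = .(Date <= Date1, Stop.Date > Date1),
               .(Date1 = i.Date1, ID, Rec, From = x.Date, To = x.Stop.Date),
               allow.cartesian = TRUE]

print(joined)
```

For 2020-01-03 the join returns the second "A" row and the "B" row. The first "A" row is excluded because its Stop.Date equals Date1 and the upper bound is strict.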
There is a video of Arun's talk at the useR! 2016 international R User conference introducing Efficient in-memory non-equi joins using data.table.
Source: https://stackoverflow.com/questions/49662243/r-sum-by-group-if-date-within-date-range