I am seeking help after having wasted almost a day. I have a big data frame (bdf) and a small data frame (sdf). I want to add variable z to bdf depending on the value of sdf
Edit note: I initially got a slightly different result than you did, which I now think was related to my lack of understanding of R difftime objects. Timezones in POSIXt objects also remain a mystery to me, but I now see that when I coerced a 'difftime' object to 'numeric' I got the value in "days".
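To illustrate the difftime point, here is a minimal sketch. The two timestamps are made up for the example and are not taken from the question's data. It shows that a plain as.numeric() on a difftime returns the value in whatever units R picked automatically, while asking for units explicitly avoids the surprise.

```r
# Hypothetical timestamps, 2.5 days apart (not from the original data)
t1 <- as.POSIXct("2013-05-19 00:00:00", tz = "UTC")
t2 <- as.POSIXct("2013-05-21 12:00:00", tz = "UTC")

d <- t2 - t1                 # a difftime; R chooses the units ("days" here)
units(d)                     # inspect which units were chosen

in_auto <- as.numeric(d)                    # value in the automatic units
in_secs <- as.numeric(d, units = "secs")    # value converted to seconds

# Equivalent: fix the units when the difftime is created
in_secs2 <- as.numeric(difftime(t2, t1, units = "secs"))
```

Fixing the units at creation time, as the answer's code does, keeps the arithmetic independent of how far apart the two times happen to be.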
The findInterval function is very useful as an index-creation function: it maps a vector of values onto multiple adjoining, non-overlapping intervals. You really only have two time-points that split the range into three intervals.
bdf$z <- c(0.2, -0.1, 0.3)[findInterval(bdf$tb,
    c(-Inf,
      sdf$ts[2] - 0.5 * as.numeric(difftime(sdf$ts[2], sdf$ts[1], units = "secs")),
      sdf$ts[3] - 0.5 * as.numeric(difftime(sdf$ts[3], sdf$ts[2], units = "secs")),
      Inf))]
> bdf
tb z
1 2013-05-19 17:11:22 0.2
2 2013-05-21 06:40:58 0.2
3 2013-05-22 20:10:34 0.2
4 2013-05-24 09:40:10 -0.1
5 2013-05-25 23:09:46 -0.1
6 2013-05-27 12:39:22 0.3
7 2013-05-29 02:08:58 0.3
8 2013-05-30 15:38:34 0.3
9 2013-06-01 05:08:10 0.3
10 2013-06-02 18:37:46 0.3
I also checked whether my result would be affected if the intervals in findInterval were closed on the right rather than on the left (the default), and saw no difference.
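The same idea can be written so that it does not hard-code three rows. The following is only a sketch under stated assumptions: the sdf and bdf values are hypothetical stand-ins (the question's actual data is not shown here), and sdf is assumed to be sorted by ts.

```r
# Hypothetical stand-in data (not the data from the question)
sdf <- data.frame(
  ts = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-03 00:00:00",
                    "2013-01-07 00:00:00"), tz = "UTC"),
  y  = c(0.2, -0.1, 0.3)
)
bdf <- data.frame(
  tb = as.POSIXct(c("2012-12-30 00:00:00", "2013-01-01 12:00:00",
                    "2013-01-02 06:00:00", "2013-01-04 00:00:00",
                    "2013-01-06 00:00:00", "2013-01-10 00:00:00"), tz = "UTC")
)

ts_num <- as.numeric(sdf$ts)              # seconds since the epoch
breaks <- ts_num[-1] - diff(ts_num) / 2   # midpoints between consecutive ts

# findInterval() returns 0 below the first break, so shift by one
idx   <- findInterval(as.numeric(bdf$tb), breaks) + 1L
bdf$z <- sdf$y[idx]
```

Working on the numeric representation sidesteps the difftime units question entirely, at the cost of losing the POSIXct class on the break points.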
This now seems absolutely unnecessary, but in base R:
bdf$z <- numeric(nrow(bdf))
for (i in seq_along(bdf$z)) {
  ind <- which.min(abs(bdf$tb[i] - sdf$ts))
  bdf$z[i] <- sdf$y[ind]
}
While a little clumsy, it has the advantage of clarity, which accommodates easy adaptation to dplyr:
library(dplyr)
bdf %>% rowwise() %>%
mutate(z= sdf$y[which.min(abs(as.numeric(tb)-as.numeric(sdf$ts)))])
#Source: local data frame [10 x 2]
#Groups: <by row>
# tb z
#1 2013-05-19 17:11:22 0.2
#2 2013-05-21 06:40:58 0.2
#3 2013-05-22 20:10:34 0.2
#4 2013-05-24 09:40:10 -0.1
#5 2013-05-25 23:09:46 -0.1
#6 2013-05-27 12:39:22 0.3
#7 2013-05-29 02:08:58 0.3
#8 2013-05-30 15:38:34 0.3
#9 2013-06-01 05:08:10 0.3
#10 2013-06-02 18:37:46 0.3
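If the loop or rowwise() turns out to be slow, the nearest-time lookup can also be done in one step with a distance matrix. This is a sketch only, using hypothetical stand-in data rather than the question's, and it builds an nrow(bdf) by nrow(sdf) matrix, so it assumes sdf is small.

```r
# Hypothetical stand-in data (not the data from the question)
sdf <- data.frame(
  ts = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-03 00:00:00",
                    "2013-01-07 00:00:00"), tz = "UTC"),
  y  = c(0.2, -0.1, 0.3)
)
bdf <- data.frame(
  tb = as.POSIXct(c("2012-12-30 00:00:00", "2013-01-01 12:00:00",
                    "2013-01-02 06:00:00", "2013-01-04 00:00:00",
                    "2013-01-06 00:00:00", "2013-01-10 00:00:00"), tz = "UTC")
)

# Absolute time differences in seconds: one row per tb, one column per ts
d <- abs(outer(as.numeric(bdf$tb), as.numeric(sdf$ts), "-"))

# For each row, the column holding the smallest distance
idx   <- apply(d, 1, which.min)
bdf$z <- sdf$y[idx]
```

The logic is the same as in the loop above: which.min() picks the closest sdf$ts for every bdf$tb.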
Here's a solution using data.table's rolling joins:
require(data.table)
setkey(setDT(sdf), ts)
sdf[bdf, roll = "nearest"]
# ts y
# 1: 2013-05-19 17:11:22 0.2
# 2: 2013-05-21 06:40:58 0.2
# 3: 2013-05-22 20:10:34 0.2
# 4: 2013-05-24 09:40:10 -0.1
# 5: 2013-05-25 23:09:46 -0.1
# 6: 2013-05-27 12:39:22 0.3
# 7: 2013-05-29 02:08:58 0.3
# 8: 2013-05-30 15:38:34 0.3
# 9: 2013-06-01 05:08:10 0.3
# 10: 2013-06-02 18:37:46 0.3
setDT converts a data.frame to a data.table by reference.
setkey sorts the data.table by reference in increasing order by the columns provided, and marks those columns as key columns (so that we can join on those key columns later).
In data.table, x[i] performs a join when i is a data.table. I'll refer you to this answer to catch up on data.table joins, if you're not already familiar with them.
x[i] performs an equi-join. That is, it finds matching row indices in x for every row in i and then extracts those rows from x to return the join result along with the corresponding row from i. In case a row in i doesn't find matching row indices in x, that row would have NA for x by default.
However, x[i, roll = .] performs a rolling join. When there's no match, either the last observation is carried forward (roll = TRUE or Inf), or the next observation is carried backward (roll = -Inf), or the value is rolled to the nearest observation (roll = "nearest"). In this case you require roll = "nearest", if I understand correctly.
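To make the three roll modes concrete, here is a small sketch. The data is hypothetical (not the question's), and it uses the on = argument instead of setkey, which assumes a data.table version that supports on =.

```r
library(data.table)

# Hypothetical stand-in data (not the data from the question)
sdf <- data.table(
  ts = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-03 00:00:00",
                    "2013-01-07 00:00:00"), tz = "UTC"),
  y  = c(0.2, -0.1, 0.3)
)
bdf <- data.table(
  tb = as.POSIXct(c("2012-12-30 00:00:00", "2013-01-01 12:00:00",
                    "2013-01-02 06:00:00", "2013-01-04 00:00:00",
                    "2013-01-06 00:00:00", "2013-01-10 00:00:00"), tz = "UTC")
)

# One result row per row of bdf, in bdf's order
locf    <- sdf[bdf, on = .(ts = tb), roll = TRUE]$y       # carry last value forward
nocb    <- sdf[bdf, on = .(ts = tb), roll = -Inf]$y       # carry next value backward
nearest <- sdf[bdf, on = .(ts = tb), roll = "nearest"]$y  # closest observation

# Add the nearest-match value to bdf by reference
bdf[, z := nearest]
```

With the default rollends, the forward roll leaves NA for a tb earlier than every ts, and the backward roll leaves NA for a tb later than every ts; "nearest" fills both ends.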
HTH
Here's my approach:
library(zoo)
m <- c(rollmean(as.POSIXct(sdf$ts), 2), Inf)
transform(bdf, z = sdf$y[sapply(tb, function(x) which.max(x < m))])
# tb z
#1 2013-05-19 17:11:22 0.2
#2 2013-05-21 06:40:58 0.2
#3 2013-05-22 20:10:34 0.2
#4 2013-05-24 09:40:10 -0.1
#5 2013-05-25 23:09:46 -0.1
#6 2013-05-27 12:39:22 0.3
#7 2013-05-29 02:08:58 0.3
#8 2013-05-30 15:38:34 0.3
#9 2013-06-01 05:08:10 0.3
#10 2013-06-02 18:37:46 0.3
Update: removed conversion to numeric (not required)
Brief explanation:
as.POSIXct(sdf$ts) converts the dates to POSIXct-style date-times.
rollmean(as.POSIXct(sdf$ts), 2) computes the rolling mean of each two consecutive rows. This happens to be exactly the time you want to use for separating the observations. rollmean is from package zoo. Computing a rollmean(.., 2) means the output vector is shortened by 1 compared to the input vector.
The rollmean is wrapped in c(.., Inf), which means that the infinity value is added to the rollmean vector as the last value. This ensures that the last value of y in sdf is also returned (0.3 in the specific example).
transform is used to add the z column to bdf.
sapply(tb, function(x) which.max(x < m)) loops through the entries in bdf$tb and, for each entry, returns the first index at which bdf$tb is less (earlier) than m (which holds the vector of rollmean entries). which.max on a logical vector returns the position of the first TRUE, so only that one index is returned for each bdf$tb entry.
sdf$y[sapply(tb, function(x) which.max(x < m))] is used to extract the corresponding elements of sdf$y, which are then stored in the new z column in bdf.
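The role of the trailing Inf is easiest to see in isolation. The sketch below uses hypothetical data and computes the midpoints in base R (so it does not need zoo); it is meant to illustrate the which.max behaviour, not to replace the code above.

```r
# Hypothetical stand-in data (not the data from the question)
ts <- as.numeric(as.POSIXct(c("2013-01-01 00:00:00", "2013-01-03 00:00:00",
                              "2013-01-07 00:00:00"), tz = "UTC"))
y  <- c(0.2, -0.1, 0.3)

# Midpoints of consecutive timestamps, i.e. a rolling mean of width 2
mid <- (head(ts, -1) + tail(ts, -1)) / 2
m   <- c(mid, Inf)

# which.max() on a logical vector gives the position of the first TRUE
first_true <- which.max(c(FALSE, TRUE, TRUE))

# A time later than every midpoint
late <- as.numeric(as.POSIXct("2013-01-10 00:00:00", tz = "UTC"))

without_inf <- which.max(late < mid)  # all FALSE, so it falls back to 1
with_inf    <- which.max(late < m)    # Inf guarantees a TRUE in the last slot

z_late <- y[with_inf]
```

Without the sentinel, a late timestamp would silently be assigned the first value of y, because which.max returns 1 when no element is TRUE.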
Hope that helps