Creating variable in R data frame depending on another data frame

后悔当初 2020-12-31 20:34

I am seeking help after having wasted almost a day. I have a big data frame (bdf) and a small data frame (sdf). I want to add a variable z to bdf whose value depends on the values in sdf.

4 answers
  • 2020-12-31 21:02

    Edit note: I initially got a slightly different result than you did, which I now think was due to my lack of understanding of R difftime objects. Time zones in POSIXt objects also remain a mystery to me, but I now see that when I coerced a 'difftime' object to 'numeric' I got the value in days.
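    As a small illustration of the unit issue described in the note above, here is a sketch with two made-up timestamps (they are not taken from the question). Coercing a difftime to numeric returns the value in whatever units the object carries, so it seems safer to request the units explicitly:

```r
# Hypothetical timestamps, chosen only to illustrate difftime units
t1 <- as.POSIXct("2013-05-19 00:00:00", tz = "UTC")
t2 <- as.POSIXct("2013-05-21 12:00:00", tz = "UTC")

d <- t2 - t1                  # difftime; units are picked automatically
auto_units <- units(d)        # "days" here, because the gap exceeds one day
as_days <- as.numeric(d)      # value expressed in those automatic units

# Ask for specific units instead of relying on the automatic choice
as_secs <- as.numeric(d, units = "secs")
as_hours <- as.numeric(difftime(t2, t1, units = "hours"))
```

    With shorter gaps the automatic units become hours, minutes, or seconds, which is why the same as.numeric() call can return values on different scales.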

    The findInterval function is useful for building an index: it maps each value in a vector to one of several adjoining, non-overlapping intervals. Here you really only have two cut points, which split the time axis into three intervals.

    # Cut points are the midpoints between consecutive sdf$ts values
    bdf$z <- c(0.2, -0.1, 0.3)[findInterval(bdf$tb,
      c(-Inf,
        sdf$ts[2] - 0.5 * as.numeric(difftime(sdf$ts[2], sdf$ts[1], units = "secs")),
        sdf$ts[3] - 0.5 * as.numeric(difftime(sdf$ts[3], sdf$ts[2], units = "secs")),
        Inf))]
    
    > bdf
                        tb    z
    1  2013-05-19 17:11:22  0.2
    2  2013-05-21 06:40:58  0.2
    3  2013-05-22 20:10:34  0.2
    4  2013-05-24 09:40:10 -0.1
    5  2013-05-25 23:09:46 -0.1
    6  2013-05-27 12:39:22  0.3
    7  2013-05-29 02:08:58  0.3
    8  2013-05-30 15:38:34  0.3
    9  2013-06-01 05:08:10  0.3
    10 2013-06-02 18:37:46  0.3
    

    I also checked whether the result would change if the intervals in findInterval were closed on the right rather than on the left (the default), and saw no difference.
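    The same idea can be sketched for any number of rows in sdf by computing all midpoints at once and indexing sdf$y instead of typing the values by hand. The two data frames below are hypothetical stand-ins (the question's actual data is not shown here), and the sketch assumes sdf is sorted by ts:

```r
# Hypothetical stand-ins for the question's data frames
sdf <- data.frame(
  ts = as.POSIXct(c("2013-05-20 00:00:00", "2013-05-25 00:00:00",
                    "2013-05-30 00:00:00"), tz = "UTC"),
  y  = c(0.2, -0.1, 0.3)
)
bdf <- data.frame(
  tb = as.POSIXct(c("2013-05-19 12:00:00", "2013-05-23 00:00:00",
                    "2013-05-28 00:00:00"), tz = "UTC")
)

# Midpoints between consecutive sdf$ts values, in seconds since the epoch
ts_num <- as.numeric(sdf$ts)
mid <- (head(ts_num, -1) + tail(ts_num, -1)) / 2

# findInterval() returns 0 for values before the first midpoint, so add 1
# to turn the result into a row index of sdf
bdf$z <- sdf$y[findInterval(as.numeric(bdf$tb), mid) + 1]
```

    Working on the numeric (seconds) scale sidesteps the difftime unit question entirely.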

  • 2020-12-31 21:03

    This now seems entirely unnecessary, but here is a base R version:

    bdf$z <- numeric(nrow(bdf))
    for(i in seq_along(bdf$z)){
      ind <- which.min(abs(bdf$tb[i] - sdf$ts))
      bdf$z[i] <- sdf$y[ind]
    }
    

    While a little clumsy, it has the advantage of clarity, which makes it easy to adapt to dplyr:

    library(dplyr)
    bdf %>% rowwise() %>% 
      mutate(z= sdf$y[which.min(abs(as.numeric(tb)-as.numeric(sdf$ts)))])
    
    #Source: local data frame [10 x 2]
    #Groups: <by row>
    
    #                    tb    z
    #1  2013-05-19 17:11:22  0.2
    #2  2013-05-21 06:40:58  0.2
    #3  2013-05-22 20:10:34  0.2
    #4  2013-05-24 09:40:10 -0.1
    #5  2013-05-25 23:09:46 -0.1
    #6  2013-05-27 12:39:22  0.3
    #7  2013-05-29 02:08:58  0.3
    #8  2013-05-30 15:38:34  0.3
    #9  2013-06-01 05:08:10  0.3
    #10 2013-06-02 18:37:46  0.3
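    If the row-by-row loop is a concern, one possible vectorised variant of the same nearest-timestamp idea builds the full matrix of time differences with outer(). This is only a sketch on hypothetical data (not the question's), and it allocates nrow(bdf) x nrow(sdf) values, so it suits a small sdf:

```r
# Hypothetical stand-ins for the question's data frames
sdf <- data.frame(
  ts = as.POSIXct(c("2013-05-20 00:00:00", "2013-05-25 00:00:00",
                    "2013-05-30 00:00:00"), tz = "UTC"),
  y  = c(0.2, -0.1, 0.3)
)
bdf <- data.frame(
  tb = as.POSIXct(c("2013-05-19 12:00:00", "2013-05-23 00:00:00",
                    "2013-05-28 00:00:00"), tz = "UTC")
)

# Absolute differences in seconds: one row per bdf row, one column per sdf row
d <- abs(outer(as.numeric(bdf$tb), as.numeric(sdf$ts), "-"))

# For each bdf row, the column (sdf row) with the smallest difference
nearest <- apply(d, 1, which.min)
bdf$z <- sdf$y[nearest]
```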
    
  • 2020-12-31 21:07

    Here's a solution using data.table's rolling joins:

    library(data.table)
    setkey(setDT(sdf), ts)
    sdf[bdf, roll = "nearest"]
    #                      ts    y
    #  1: 2013-05-19 17:11:22  0.2
    #  2: 2013-05-21 06:40:58  0.2
    #  3: 2013-05-22 20:10:34  0.2
    #  4: 2013-05-24 09:40:10 -0.1
    #  5: 2013-05-25 23:09:46 -0.1
    #  6: 2013-05-27 12:39:22  0.3
    #  7: 2013-05-29 02:08:58  0.3
    #  8: 2013-05-30 15:38:34  0.3
    #  9: 2013-06-01 05:08:10  0.3
    # 10: 2013-06-02 18:37:46  0.3
    
    • setDT converts data.frame to data.table by reference.

    • setkey sorts the data.table by reference in increasing order by the columns provided, and marks those columns as key columns (so that we can join on them later).

    • In data.table, x[i] performs a join when i is a data.table. I'll refer you to this answer to catch up on data.table joins, if you're not already familiar with them.

    • x[i] performs an equi-join. That is, it finds matching row indices in x for every row in i and then extracts those rows from x to return the join result along with the corresponding row from i. In case a row in i doesn't find matching row indices in x, that row would have NA for x by default.

      However, x[i, roll = .] performs a rolling join. When there's no match, either the last observation is carried forward (roll = TRUE or Inf), or the next observation is carried backward (roll = -Inf), or the value is rolled to the nearest observation (roll = "nearest"). If I understand correctly, roll = "nearest" is what you need here.
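    To see the three rolling modes side by side, here is a minimal sketch on a toy table with plain numeric keys. The tables x and i and their values are hypothetical, and the join column is passed through on = rather than a key purely to keep the example short:

```r
library(data.table)

# Hypothetical lookup table (plays the role of sdf) and query table (bdf)
x <- data.table(t = c(10, 20, 30), y = c(0.2, -0.1, 0.3))
i <- data.table(t = c(12, 18, 26))

# Last observation carried forward: each query takes the previous x row
locf <- x[i, on = "t", roll = TRUE]$y

# Next observation carried backward: each query takes the following x row
nocb <- x[i, on = "t", roll = -Inf]$y

# Nearest: each query takes whichever x row is closest in t
near <- x[i, on = "t", roll = "nearest"]$y
```

    Here 12 and 18 both fall between the keys 10 and 20, so the forward and backward rolls agree with each other for those two rows, while the nearest roll sends 12 to the row at 10 and 18 to the row at 20.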

    HTH

  • 2020-12-31 21:07

    Here's my approach:

    library(zoo)
    m <- c(rollmean(as.POSIXct(sdf$ts), 2), Inf)
    transform(bdf, z = sdf$y[sapply(tb, function(x) which.max(x < m))])
    #                    tb    z
    #1  2013-05-19 17:11:22  0.2
    #2  2013-05-21 06:40:58  0.2
    #3  2013-05-22 20:10:34  0.2
    #4  2013-05-24 09:40:10 -0.1
    #5  2013-05-25 23:09:46 -0.1
    #6  2013-05-27 12:39:22  0.3
    #7  2013-05-29 02:08:58  0.3
    #8  2013-05-30 15:38:34  0.3
    #9  2013-06-01 05:08:10  0.3
    #10 2013-06-02 18:37:46  0.3
    

    Update: removed conversion to numeric (not required)

    Brief explanation:

    • as.POSIXct(sdf$ts) converts the dates to POSIXct-style date-times
    • rollmean(as.POSIXct(sdf$ts), 2) computes the rolling mean of each two consecutive rows. This happens to be exactly the time you want to use for separating the observations. rollmean is from package zoo. Computing a rollmean(..,2) means the output vector is shortened by 1 compared to the input vector.
    • That is why I wrap the result of rollmean in c(.., Inf), which appends infinity to the rollmean vector as its last value. This ensures that the last entry of y in sdf can also be returned (0.3 in the specific example).
    • I use transform to add the z column to bdf
    • sapply(tb, function(x) which.max(x < m)) loops through the entries in bdf$tb and, for each entry, finds the first index at which that entry is less (earlier) than m (which holds the vector of rollmean entries). On a logical vector, which.max returns the position of the first TRUE, so only that single index is returned for each bdf$tb entry.
    • That vector of indices is used in sdf$y[sapply(tb, function(x) which.max(x < m))] to extract the corresponding elements of sdf$y which will then be stored/copied to the new z column in bdf
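    The indexing step above can be traced on plain numbers. In this sketch the values are hypothetical, and a base R pairwise average stands in for rollmean(., 2), so zoo is not needed to follow along:

```r
# Hypothetical times as plain numbers, with the values to look up
ts_num <- c(10, 20, 30)
y <- c(0.2, -0.1, 0.3)

# Midpoints between consecutive times (what rollmean(., 2) yields), plus Inf
m <- c((head(ts_num, -1) + tail(ts_num, -1)) / 2, Inf)   # 15, 25, Inf

x <- 18
below <- x < m              # FALSE TRUE TRUE
idx <- which.max(below)     # position of the first TRUE
z <- y[idx]                 # value attached to the nearest time (20)
```

    Because m ends in Inf, the comparison always has at least one TRUE, so which.max never falls back to index 1 for a value later than every midpoint.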

    Hope that helps
