Set column value using the first next row in the same group that meets a condition

前端 未结 3 841
予麋鹿
予麋鹿 2021-02-08 10:26

I am new to R and this is my first question on stackoverflow.

I am trying

  • to assign by reference to a new column
  • for each row
  • using the v
3条回答
  •  梦毁少年i
    2021-02-08 10:46

    Join a data.table to a subset of itself, by group, to get values from rows matching non-equal criteria.

    Summary:

    • Below I show 5 working data.table solutions that were candidates to performance test against the OP's actual data set (1.4M records).

    • All 5 solutions use "non-equi" joins (using inequality to compare columns for the join) in the on clause.

    • Each solution is just a small progressive code change so it should be easy to follow along to compare different data.table options and syntax choices.

    Approach

    To work through data.table syntax for this I broke it into to the following steps for the OP's problem:

    1. Join the dt to a subset of itself (or another data.table for that matter).
    2. Select (and rename) the columns you want from either dt or the subset.
    3. Define the join criteria based on columns from dt compared to columns in the subset, including using "non-equi" (non-equal) comparisons.
    4. Optionally define whether first or last match should be selected when multiple matching records are found in the subset.

    Solution 1:

    # Add row numbers to all records in dt (only because you 
    # have criteria based on comparing sequential rows):
    dt[, row := .I] 
    
    # Compute result columns (  then standard assignment into dt using <-  )
    dt$found_date  <- 
                dt[code=='p'][dt,   # join dt to the data.table matching your criteria, in this case dt[code=='p']
                              .( x.date_up ),   # columns to select, x. prefix means columns from dt[code=='p'] 
                              on = .(id==id, row > row, date_up > date_down),   # join criteria: dt[code=='p'] fields on LHS, main dt fields on RHS
                              mult = "first"]   # get only the first match if multiple matches
    

    Note in the join expressions above:

    • i in this case is your main dt. This way you get all records from your main data.table.
    • x is the subset (or any other data.table) from which you want to find matching values.

    Result matches requested output:

    dt
    
        id code  date_down    date_up row found_date
     1:  1    p 2019-01-01 2019-01-02   1       
     2:  1    f 2019-01-02 2019-01-03   2       
     3:  2    f 2019-01-02 2019-01-02   3       
     4:  2    p 2019-01-03          4       
     5:  3    p 2019-01-04          5       
     6:  4  2019-01-05 2019-01-05   6       
     7:  5    f 2019-01-07 2019-01-08   7 2019-01-08
     8:  5    p 2019-01-07 2019-01-08   8 2019-01-09
     9:  5    p 2019-01-09 2019-01-09   9       
    10:  6    f 2019-01-10 2019-01-10  10 2019-01-11
    11:  6    p 2019-01-10 2019-01-10  11 2019-01-11
    12:  6    p 2019-01-10 2019-01-11  12       
    

    Note: You may remove the row column by doing dt[, row := NULL] if you like.

    Solution 2:

    Identical logic as above to join and find the result columns, but now using "assign by reference" := to create found_date in dt:

    dt[, row := .I] # add row numbers (as in all the solutions)
    
    # Compute result columns (  then assign by reference into dt using :=  
    
    # dt$found_date  <- 
    dt[, found_date :=   # assign by reference to dt$found_date 
                dt[code=='p'][dt, 
                              .( x.date_up ), 
                              on = .(id==id, row > row, date_up > date_down),
                              mult = "first"]]
    

    In Solution 2, the slight variation to assign our results "by reference" into dt should be more efficient than Solution 1. Solution 1 calculated results the exact same way - the only difference is Solution 1 used standard assignment <- to create dt$found_date (less efficient).

    Solution 3:

    Like Solution 2 but now using .(.SD) in place of dt to refer to the original dt without naming it directly.

    dt[, row := .I] # add row numbers (as in all the solutions)
    setkey(dt, id, row, date_down)  #set key for dt 
    
    # For all rows of dt, create found_date by reference :=
    dt[, found_date := 
                # dt[code=='p'][dt, 
                dt[code=='p'][.(.SD),   # our subset (or another data.table), joined to .SD (referring to original dt)
                              .( x.date_up ), 
                              on = .(id==id, row > row, date_up > date_down),  
                              mult = "first"] ]  
    

    .SD above references back to the original dt that we are assigning back into. It corresponds to the subset of data.table that contains the rows selected in the first dt[, which is all the rows because we didn't filter it.

    Note: In Solution 3 I used setkey() to set the key. I should have done that in Solution 1 & Solution 2 - however I didn't want to change those solutions after @OllieB tested them successfully.

    Solution 4:

    Like Solution 3 but using .SD once more than previously. Our main data.table name dt now appears only once across our entire expression!

    # add row column and setkey() as previous solutions
    
    dt[, found_date :=
                # dt[code=='p'][.(.SD), 
                .SD[code=='p'][.SD,   # .SD in place of dt at left!  Also, removed .() at right (not sure on this second change)
                               .(found_date = x.date_up),
                               on = .(id==id, row > row, date_up > date_down),
                               mult = "first"]]
    

    With the change above our data.table name dt appears only once. I like that a lot because it makes it easy to copy, adapt and reuse elsewhere.

    Also note: Where I'd previously used .(SD) I've now removed the .() around .SD because it doesn't appear to require it. However for that change I'm not sure if it has any performance benefit or whether it's data.table preferred syntax. I would be grateful if anyone can add a comment to advise on that point.

    Solution 5:

    Like previous solutions but making use of by to explicitly group subsets over operations when joining

    # add row column and setkey() as previous solutions
    
    dt[, found_date :=
           .SD[code=='p'][.SD,
                          .(found_date = x.date_up),
                          # on = .(id==id, row > row, date_up > date_down),
                          on = .(row > row, date_up > date_down),  # removed the id column from here
                          mult = "first"]
       , by = id]   # added by = id to group the .SD subsets 
    

    On this last solution I changed it to use the by clause to explicitly group the .SD subsets on id.

    Note: Solution 5 did not perform well against OllieB's actual data compared to Solutions 1 - 4. However, testing my own mock data I found that Solution 5 could perform well when the number of unique groups from the id column were low:
    - With only 6 groups in 1.5M records this solution worked just as fast as the others.
    - With 40k groups in 1.5M records I saw similar poor performance as OllieB reported.

    Results

    Solutions 1 - 4 performed well:

    • For 1.45M records in OllieB's actual data each of Solutions 1 to 4 were all 2.42 seconds or less "elapsed" time according to OllieB's feedback. Solution 3 appears worked fastest for OllieB having "elapsed=1.22" seconds.

    • I personally prefer Solution 4 because of the simpler syntax.

    Solution 5

    • Solution 5 (using by clause) performed poorly taking 577 seconds for OllieB's testing on his real data.

    Versions used

    data.table version: 1.12.0

    R version 3.5.3 (2019-03-11)


    Possible further improvements:

    • Changing the date fields to integer may help join more efficiently. See as.IDate() to convert dates to integer in data.tables.
    • The setkey() step may no longer bee needed: As explained here by @Arun due to on envoking [often] more efficient secondary indicies and auto indexing.

    References to data.table

    As part of your question you've asked for "any good references to data.table". I've found the following helpful:

    • data.table Getting started Wiki on GitHub is the place to start.

    • In particular for this problem it's worth reading:

      • What does .SD stand for in data.table in R
      • The HTML vignette for Secondary indices and auto indexing

    Importantly note this answer by @Arun which explains "the reason for implementing on= argument" suggests it may no longer be necessary to set keys any more:

    It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

    In most cases therefore, there shouldn't be a need to set keys any more. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.

    • This SO question seems to be the hub of information about the different data.table joins: How to join (merge) data frames (inner, outer, left, right)?

    • Finally, data.table cheat sheet is an excellent reference (from link found on the data.table Getting started Wiki on GitHub).


    As always I'm grateful if anyone has suggestions as perhaps this can be improved further.

    Please feel free to comment, correct or post other solutions if you can add anything.

提交回复
热议问题