Remove rows in data.table according to another data.table

长发绾君心 2021-01-06 04:50

I have a data.table named dtA:

My actual dtA has 62871932 rows and 3 columns:

  date    company    value
19810         


        
2 answers
  • 2021-01-06 05:22

    I think I know how to solve this:

    In dtB, I add a pointer column using data.table syntax:

    dtB[, pointer := 1]
    

    dtB will then look like this:

      date    company    value    pointer
    198101          A        2          1
    198102          B        5          1
    

    Then I use the LEFT OUTER JOIN method from here: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html

    setkey(dtA, date, company, value)
    setkey(dtB, date, company, value)
    dtA = merge(dtA, dtB, all.x = TRUE)  # left outer join on the shared key columns
    

    This means that, in the pointer column, every row of dtA that also exists in dtB gets 1, and every row of dtA that has no match in dtB gets NA.

    The result will be:

      date    company    value    pointer
    198101          A        1         NA
    198101          A        2          1
    198101          B        5         NA
    198102          A        2         NA
    198102          B        5          1
    198102          B        6         NA
    

    I then keep only the rows where pointer is NA and drop the pointer column:

    dtA = dtA[is.na(pointer)][, pointer := NULL]  # unmatched rows only, then remove the flag
    

    I get my result:

      date    company    value
    198101          A        1
    198101          B        5
    198102          A        2
    198102          B        6
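
    Put together, the steps above can be sketched as one runnable script. The sample rows are reconstructed from the tables shown in this answer, not taken from the real 62-million-row data, and the names merged and result are only illustrative:

    ```r
    library(data.table)

    # Sample rows reconstructed from the tables in this answer (an assumption,
    # not the real data)
    dtA <- data.table(
      date    = c(198101L, 198101L, 198101L, 198102L, 198102L, 198102L),
      company = c("A", "A", "B", "A", "B", "B"),
      value   = c(1L, 2L, 5L, 2L, 5L, 6L)
    )
    dtB <- data.table(
      date    = c(198101L, 198102L),
      company = c("A", "B"),
      value   = c(2L, 5L)
    )

    dtB[, pointer := 1]                      # flag every row of dtB
    setkey(dtA, date, company, value)
    setkey(dtB, date, company, value)

    # Left outer join on the shared key: unmatched dtA rows get pointer = NA
    merged <- merge(dtA, dtB, all.x = TRUE)

    # Keep the unmatched rows and drop the helper column
    result <- merged[is.na(pointer)][, pointer := NULL]
    print(result)
    ```

    Note that dtB[, pointer := 1] modifies dtB by reference, so dtB keeps the extra column afterwards.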
    
  • 2021-01-06 05:35

    Use an anti-join:

    dtA[!dtB, on=.(date, company, value)]
    

    This returns all rows of dtA that have no match in dtB on the columns listed in on.
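
    As a minimal sketch, here is the anti-join applied to sample rows reconstructed from the tables in the other answer (the data and the name result are assumptions for illustration):

    ```r
    library(data.table)

    # Sample rows reconstructed from the other answer's tables (an assumption,
    # not the real data)
    dtA <- data.table(
      date    = c(198101L, 198101L, 198101L, 198102L, 198102L, 198102L),
      company = c("A", "A", "B", "A", "B", "B"),
      value   = c(1L, 2L, 5L, 2L, 5L, 6L)
    )
    dtB <- data.table(
      date    = c(198101L, 198102L),
      company = c("A", "B"),
      value   = c(2L, 5L)
    )

    # Anti-join: keep the rows of dtA with no match in dtB on all three columns
    result <- dtA[!dtB, on = .(date, company, value)]
    print(result)
    ```

    No key and no helper column are needed here, and dtB is left unchanged.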
