Incorrect behavior with dplyr's left_join?

青春惊慌失措 2021-01-17 14:52

Surely this is not intended? Is this something that happens in other parts of dplyr's functionality and should I be concerned? I love the performance and hat

2 answers
  • 2021-01-17 15:09

    I posted something similar the other day. I think you need to convert ORDER to numeric in A (or to integer in B): A has ORDER as integer, but B has ORDER as numeric. At the moment, dplyr requires the join-by variables to be of the same class. I received a comment from an SO user saying that Hadley and his team have been working on this, so the issue should be fixed in a future release.

    A$ORDER <- as.numeric(A$ORDER)
    left_join(A,B, by = "ORDER")
    
         ORDER               COST   AREA
    1 30305720                  0     NA
    2 30334659                         0
    3 30379936           11430.52   2339
    4 30406397 20196.279999999999   2162
    5 30407697                  0  23040
    6 30431950           10445.99 475466
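
    A quick way to confirm the mismatch before joining is to compare the classes of the two key columns. The sketch below uses base R only. The data frames are reconstructed from the output printed in this answer (the original A and B are not shown), so treat the COST column type in particular as an assumption.

    ```r
    # Reconstructed from the printed output; not the original data.
    A <- data.frame(
      ORDER = c(30305720L, 30334659L, 30379936L, 30406397L, 30407697L, 30431950L),
      COST  = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99"),
      stringsAsFactors = FALSE
    )
    B <- data.frame(
      ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950),
      AREA  = c(0, 2339, 2162, 23040, 475466)
    )

    class(A$ORDER)  # "integer"
    class(B$ORDER)  # "numeric"

    # Align the key class before joining.
    A$ORDER <- as.numeric(A$ORDER)
    identical(class(A$ORDER), class(B$ORDER))  # TRUE
    ```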
    

    UPDATE After exchanging comments with thelatemail, I decided to add more observations here.

    CASE 1: Treat ORDER as numeric

    A$ORDER <- as.numeric(A$ORDER)
    
    > left_join(A,B, by = "ORDER")
         ORDER               COST   AREA
    1 30305720                  0     NA
    2 30334659                         0
    3 30379936           11430.52   2339
    4 30406397 20196.279999999999   2162
    5 30407697                  0  23040
    6 30431950           10445.99 475466
    
    > left_join(B,A, by = "ORDER")
    Source: local data frame [5 x 3]
    
         ORDER   AREA               COST
    1 30334659      0                   
    2 30379936   2339           11430.52
    3 30406397   2162 20196.279999999999
    4 30407697  23040                  0
    5 30431950 475466           10445.99
    

    If you have ORDER as integer in both A and B, that works too.

    CASE 2: Leave ORDER as integer in A and numeric in B

    > left_join(A,B, by = "ORDER")
         ORDER               COST AREA
    1 30305720                  0   NA
    2 30334659                      NA
    3 30379936           11430.52   NA
    4 30406397 20196.279999999999   NA
    5 30407697                  0   NA
    6 30431950           10445.99   NA
    
    > left_join(B,A, by = "ORDER")
    Source: local data frame [5 x 3]
    
         ORDER   AREA               COST
    1 30334659      0                   
    2 30379936   2339           11430.52
    3 30406397   2162 20196.279999999999
    4 30407697  23040                  0
    5 30431950 475466           10445.99
    

    As suggested by thelatemail, the integer/numeric combination (A on the left) does not work: every AREA value comes back as NA. The numeric/integer combination (B on the left) works.

    Given these observations, it is safest for now to keep the join-by variable in the same class in both data frames. Alternatively, merge() is the way to go: it can handle joining integer with numeric.

    > merge(A,B, by = "ORDER", all = TRUE)
         ORDER               COST   AREA
    1 30305720                  0     NA
    2 30334659                         0
    3 30379936           11430.52   2339
    4 30406397 20196.279999999999   2162
    5 30407697                  0  23040
    6 30431950           10445.99 475466 
    
    > merge(B,A, by = "ORDER", all = TRUE)
         ORDER   AREA               COST
    1 30305720     NA                  0
    2 30334659      0                   
    3 30379936   2339           11430.52
    4 30406397   2162 20196.279999999999
    5 30407697  23040                  0
    6 30431950 475466           10445.99
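
    Note that all = TRUE asks merge() for a full outer join, which is why row 30305720 shows up even with B on the left. For the closest base R equivalent of left_join(A, B), all.x = TRUE keeps every row of the first data frame only. A minimal sketch, again with data frames reconstructed from the printed output rather than the original data:

    ```r
    # Reconstructed from the printed output; not the original data.
    A <- data.frame(
      ORDER = c(30305720L, 30334659L, 30379936L, 30406397L, 30407697L, 30431950L),
      COST  = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99"),
      stringsAsFactors = FALSE
    )
    B <- data.frame(
      ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950),
      AREA  = c(0, 2339, 2162, 23040, 475466)
    )

    # Left join in base R: keep all rows of A, fill AREA with NA where B has no match.
    left  <- merge(A, B, by = "ORDER", all.x = TRUE)

    # Same call with B first: 30305720 is dropped because it is not in B.
    right <- merge(B, A, by = "ORDER", all.x = TRUE)

    nrow(left)   # 6
    nrow(right)  # 5
    ```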
    

    UPDATE2 (as of the 8th of November, 2014)

    I am using a dev version of dplyr (dplyr_0.3.0.9000), which you can download from GitHub. The issue above is now solved.

    left_join(A,B, by = "ORDER")
    #     ORDER               COST   AREA
    #1 30305720                  0     NA
    #2 30334659                         0
    #3 30379936           11430.52   2339
    #4 30406397 20196.279999999999   2162
    #5 30407697                  0  23040
    #6 30431950           10445.99 475466
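
    To check whether an installed dplyr includes this fix, something along these lines should do. The data frames are reconstructed from the output printed in this answer, not taken from the original question, and the comment on the last line assumes a dplyr release at least as new as the dev version mentioned above.

    ```r
    library(dplyr)

    # Reconstructed from the printed output; not the original data.
    A <- data.frame(
      ORDER = c(30305720L, 30334659L, 30379936L, 30406397L, 30407697L, 30431950L),
      COST  = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99"),
      stringsAsFactors = FALSE
    )
    B <- data.frame(
      ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950),
      AREA  = c(0, 2339, 2162, 23040, 475466)
    )

    # Integer key on the left, numeric key on the right: the failing combination.
    res <- left_join(A, B, by = "ORDER")

    # Only 30305720 should lack a match; 6 NAs would mean the mismatch bug is present.
    sum(is.na(res$AREA))
    ```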
    
  • 2021-01-17 15:35

    From the dplyr documentation:

    left_join()

    returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

    semi_join()

    returns all rows from x where there are matching values in y, keeping just columns from x.

    A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
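
    The difference is easiest to see on a small example where y contains a duplicated key. The data frames x and y below are made up for illustration and are unrelated to A and B from the question.

    ```r
    library(dplyr)

    # Hypothetical data: id 1 appears twice in y, id 3 has no match in y.
    x <- data.frame(id = c(1, 2, 3))
    y <- data.frame(id = c(1, 1, 2), v = c("a", "b", "c"))

    inner <- inner_join(x, y, by = "id")  # one row per matching pair, columns from x and y
    semi  <- semi_join(x, y, by = "id")   # rows of x that have a match, columns from x only

    nrow(inner)  # 3: id 1 is duplicated because it matches two rows of y
    nrow(semi)   # 2: ids 1 and 2, each kept once
    names(semi)  # "id"
    ```

    Since semi_join() never adds columns from y, it would filter A down to the orders present in B but would not bring AREA along.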

    Would semi_join() be a viable option for you?
