Merge (join) data frames - too many rows in result

萌比男神i 2021-01-16 19:50

I have two data frames (df1 and df2). I want to join them using the merge function.

df1 has 3903 rows and df2 has 351 rows.

I want to left join df2 to df1 by a c

3 answers
  • 2021-01-16 20:30

    This is probably because the values in column1 of df2 are not unique, so the mapping is not one-to-one: a single value in column1 appears on more than one row of df2 (for example, paired with several different values of column2), and each of those rows is matched to every df1 row with the same key. You can check this with table(df2$column1). If any value of column1 has a count greater than 1, that is the reason.
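
    As a minimal sketch of that check, the snippet below lists the repeated keys and the rows that carry them. The data frame here is a made-up stand-in for df2, and the column names column1 and column2 are assumptions taken from the wording above, not the asker's actual data.

    ```r
    # Hypothetical stand-in for the asker's df2 (values invented for illustration).
    df2 <- data.frame(column1 = c("a", "a", "b"), column2 = 4:6,
                      stringsAsFactors = FALSE)

    # Count how often each key occurs, then keep only the keys seen more than once.
    key_counts <- table(df2$column1)
    repeated_keys <- names(key_counts[key_counts > 1])
    repeated_keys

    # Pull out the rows that carry a repeated key so they can be inspected.
    dup_rows <- df2[df2$column1 %in% repeated_keys, ]
    dup_rows
    ```

    If repeated_keys comes back empty, duplicated keys in df2 are not the cause and the keys of df1 are worth checking the same way.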

    As an alternative, if you are more comfortable with SQL, there is a very nice library called sqldf that lets you run SQL-like queries on your data frames.

  • 2021-01-16 20:31

    I can't be sure without seeing an example of your problem, but usually the syntax is:

    df <- merge(df1, df2, by="name_of_column_in_common", all.x=TRUE)
    

    However, if the columns you are matching on contain duplicated values, R will match all possible combinations. For example:

    df1 <- data.frame(id=c("a","a","b","c"), x1=rnorm(4))
    df2 <- data.frame(id=c("a","a","b"), x2=rnorm(3))
    df <- merge(df1, df2, by="id", all.x=TRUE)
    

    This gives a data frame of 6 rows by 3 columns: each "a" in df1 is matched to each "a" in df2 (2 × 2 = 4 rows), "b" matches once (1 row), and "c" has no match in df2 but is kept by the left join with NA in x2 (1 row).
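
    The row count can also be predicted from the key counts before merging. The following is only a sketch of that arithmetic on the same toy data; the variable names keys, n1, n2 and expected_rows are introduced here for illustration.

    ```r
    set.seed(1)
    df1 <- data.frame(id = c("a", "a", "b", "c"), x1 = rnorm(4),
                      stringsAsFactors = FALSE)
    df2 <- data.frame(id = c("a", "a", "b"), x2 = rnorm(3),
                      stringsAsFactors = FALSE)

    # How many times each df1 key occurs in df1 and in df2.
    keys <- unique(df1$id)
    n1 <- sapply(keys, function(k) sum(df1$id == k))
    n2 <- sapply(keys, function(k) sum(df2$id == k))

    # Left join: every df1 row appears once per matching df2 row,
    # or exactly once (padded with NA) when there is no match.
    expected_rows <- sum(n1 * pmax(n2, 1))

    merged <- merge(df1, df2, by = "id", all.x = TRUE)
    c(expected = expected_rows, actual = nrow(merged))
    ```

    When the expected count is larger than nrow(df1), at least one key is repeated on the right-hand side.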

  • 2021-01-16 20:34

    To make sure that your second data frame is unique on the join column(s), you can use my package safejoin (a wrapper around dplyr's join functions), which gives you an explicit error when that is not the case.

    Current situation:

    df1 <- data.frame(column1 = c("a","b","b"), X = 1:3)
    df2 <- data.frame(column1 = c("a","b"), Y = 4:5)
    df3 <- data.frame(column1 = c("a","a","b"), Y = 4:6)
    
    merge(df1,df2, by="column1",all.x=TRUE)
    #   column1 X Y
    # 1       a 1 4
    # 2       b 2 5
    # 3       b 3 5
    
    merge(df1,df3, by="column1",all.x=TRUE)
    #   column1 X Y
    # 1       a 1 4
    # 2       a 1 5
    # 3       b 2 6
    # 4       b 3 6
    

    In the second merge, the "a" row of df1 was silently duplicated because column1 is not unique in df3.

    Using safejoin:

    # devtools::install_github("moodymudskipper/safejoin")
    library(safejoin)
    safe_left_join(df1, df2, check= "V")
    #   column1 X Y
    # 1       a 1 4
    # 2       b 2 5
    # 3       b 3 5
    
    safe_left_join(df1, df3, check= "V")
    # Error: y is not unique on column1
    # Call `rlang::last_error()` to see a backtrace
    

    check = "V" verifies that the join columns are unique on the right-hand side. (check = "U", for "Unique", verifies that they are unique on the left-hand side; "V" is simply the next letter in the alphabet.)
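
    If installing a package is not an option, a similar guard can be sketched in base R. The helper checked_left_merge below is hypothetical (it is not part of safejoin or base R) and only mimics the idea of the "V" check by refusing to merge when the right-hand key is not unique.

    ```r
    df1 <- data.frame(column1 = c("a", "b", "b"), X = 1:3)
    df2 <- data.frame(column1 = c("a", "b"), Y = 4:5)
    df3 <- data.frame(column1 = c("a", "a", "b"), Y = 4:6)

    # Hypothetical helper: left join that stops if y has duplicated keys.
    checked_left_merge <- function(x, y, by) {
      if (anyDuplicated(y[by]) > 0) {
        stop("y is not unique on ", paste(by, collapse = ", "))
      }
      merge(x, y, by = by, all.x = TRUE)
    }

    # Unique right-hand key: behaves like a plain left merge.
    ok <- checked_left_merge(df1, df2, by = "column1")

    # Duplicated right-hand key: the error is captured here as a message string.
    result <- tryCatch(checked_left_merge(df1, df3, by = "column1"),
                       error = function(e) conditionMessage(e))
    result
    ```

    Failing loudly is a design choice: the duplicated keys usually point to a data problem that should be fixed upstream rather than hidden by dropping rows.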
