Error while merging data frames using “data.table” package

借酒劲吻你 2021-01-20 01:27

The following is a reproducible example of a situation that I'm experiencing and stuck with (it's a test client I'm using to evaluate various ap

1 Answer
  •  佛祖请我去吃肉
    2021-01-20 01:45

    I'd approach the issue in this manner:

    First, there's an error message. What does it say?

    Join results in 121229 rows; more than 100000 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

    Great! But I'm working with many data sets, many packages and many functions, so I first have to narrow down which data set produces this error.

    Testing one by one:

    ans1 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[2]]), 
                    all.x=TRUE, all.y=FALSE, by="Project ID")
    ## works fine.
    
    ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]), 
                    all.x=TRUE, all.y=FALSE, by="Project ID")
    ## same error
    

    Aha, got the same error.
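
    With more than a handful of data sets, the one-by-one test can be looped. This is only a sketch: the `dataSets` list below is a small made-up stand-in for the real one, and `base_dt` and `status` are names I introduce for illustration.

    ```r
    library(data.table)

    # Toy stand-in for the question's `dataSets` list (hypothetical data).
    dataSets <- list(
      data.frame(`Project ID` = rep(1:3, each = 4), a = 1:12, check.names = FALSE),
      data.frame(`Project ID` = 1:3, b = letters[1:3], check.names = FALSE),
      data.frame(`Project ID` = rep(1:3, each = 4), c = 12:1, check.names = FALSE)
    )

    # Merge the first data set with each of the others and record whether it fails.
    base_dt <- as.data.table(dataSets[[1]])
    others <- seq_along(dataSets)[-1]
    status <- vapply(others, function(k) {
      tryCatch({
        merge(base_dt, as.data.table(dataSets[[k]]),
              all.x = TRUE, all.y = FALSE, by = "Project ID")
        "ok"
      }, error = function(e) "error")
    }, character(1))
    names(status) <- others
    print(status)
    ```

    In this toy list only the third data set repeats keys that are also repeated in the first, so it is the one I'd expect to be flagged.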

    Reading the second line of the error message:

    So, something seems to happen with dataSets[[3]]. It says to check for duplicate key values in i. Let's do that:

    dim(dataSets[[3]])
    # [1] 81487     3
    dim(unique(as.data.table(dataSets[[3]]), by="Project ID"))
    # [1] 49999     3
    

    So, dataSets[[3]] has duplicated 'Project ID' values, and for each duplicated value all the matching rows from dataSets[[1]] are returned again. That is what the second part of the second line of the error message explains: each of which join to the same group in x over and over again.
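
    To see which keys are repeated and how large the merged result will get, the rows per key can be counted on both sides before merging. A minimal sketch on made-up tables (`x`, `y`, `id`, `vx`, `vy` are illustrative names, not from the question):

    ```r
    library(data.table)

    x <- data.table(id = c(1, 1, 2, 3), vx = 1:4)           # id 1 appears twice
    y <- data.table(id = c(1, 1, 1, 2), vy = letters[1:4])  # id 1 appears three times

    # Keys that occur more than once in y.
    dup_keys <- y[, .N, by = id][N > 1]
    print(dup_keys)

    # Predicted size of a left join: per key, rows in x times matching rows in y;
    # x rows without a match in y are still returned once.
    nx <- x[, .(n_x = .N), by = id]
    ny <- y[, .(n_y = .N), by = id]
    counts <- merge(nx, ny, by = "id", all.x = TRUE)
    counts[is.na(n_y), n_y := 1L]
    predicted_rows <- counts[, sum(n_x * n_y)]
    print(predicted_rows)  # 2 * 3 + 1 * 1 + 1 = 8
    ```

    If the predicted number is far larger than either input, the duplicates are probably worth a closer look before merging.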

    Trying out allow.cartesian=TRUE:

    I know that there are duplicate keys and still wish to proceed. The error message says how to do that: add allow.cartesian=TRUE.

    ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]), 
                    all.x=TRUE, all.y=FALSE, by="Project ID", allow.cartesian=TRUE)
    

    Aha, now it works fine! So what does allow.cartesian = TRUE do? Or why was it added? The error message says to search for the message on Stack Overflow (amidst other things).
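
    The behaviour is easy to reproduce on a tiny made-up example (the tables `x` and `y` below are hypothetical, not the question's data). Both sides repeat the same key ten times, so the join pairs every row of one with every row of the other:

    ```r
    library(data.table)

    x <- data.table(id = rep(1L, 10), vx = 1:10)
    y <- data.table(id = rep(1L, 10), vy = 101:110)

    # Without allow.cartesian, data.table is expected to stop with the error quoted above.
    res <- tryCatch(merge(x, y, by = "id"), error = function(e) e)
    print(inherits(res, "error"))

    # With allow.cartesian = TRUE, every x row is paired with every matching y row.
    ans <- merge(x, y, by = "id", allow.cartesian = TRUE)
    print(nrow(ans))  # 10 * 10 = 100
    ```

    So the argument does not change what is computed; it only confirms that the larger result is intended.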

    Searching for allow.cartesian=TRUE on SO:

    The search lands me on this question: "Why is allow.cartesian required at times when when joining data.tables with duplicate keys?", which explains the purpose. In its comments there is another link from @Roland, "Merging data.tables uses more than 10 GB RAM", which points to the issue that started it all. Let me read those posts now.


    Is base::merge giving a different result?

    Now, does base::merge return a different result (with 100,000 rows)?

    dim(merge(dataSets[[1]], dataSets[[3]], all.x=TRUE, all.y=FALSE, by="Project ID"))
    # [1] 121229      4
    

    Not really. It gives the same dimensions as data.table does, but it just doesn't care whether there are duplicate keys, whereas data.table alerts you to a potential explosion of the merged result and lets you make an informed decision.
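
    The same comparison can be made on a small made-up pair of data frames (`df1` and `df2` are illustrative, not the question's data), which makes it quick to check by hand:

    ```r
    library(data.table)

    df1 <- data.frame(id = c(1, 1, 2), a = 1:3)
    df2 <- data.frame(id = c(1, 1, 1, 2), b = 4:7)

    # base::merge on plain data frames: duplicate keys are expanded silently.
    base_res <- merge(df1, df2, by = "id", all.x = TRUE)

    # data.table's merge, with the expansion explicitly allowed.
    dt_res <- merge(as.data.table(df1), as.data.table(df2), by = "id",
                    all.x = TRUE, allow.cartesian = TRUE)

    print(dim(base_res))
    print(dim(dt_res))
    ```

    Both calls should report the same dimensions here: id 1 contributes 2 * 3 rows and id 2 contributes one.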
