Error while merging data frames using “data.table” package


The following is a reproducible example of a situation that I'm experiencing and stuck with (it's a test client I'm using to evaluate various ap

1 answer

佛祖请我去吃肉

2021-01-20 01:45
I'd approach the issue in this manner:

First, there's an error message. What does it say?

Join results in 121229 rows; more than 100000 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

Great! But I'm working with so many datasets, packages and functions that I first have to narrow down which data set produces this error.

Testing one by one:
```
ans1 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[2]]), 
                all.x=TRUE, all.y=FALSE, by="Project ID")
## works fine.

ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]), 
                all.x=TRUE, all.y=FALSE, by="Project ID")
## same error
```
Aha, got the same error.

Reading the second line of the error message:

So, the problem seems to lie with dataSets[[3]]. The message says to check for duplicate key values in i. Let's do that:
```
dim(dataSets[[3]])
# [1] 81487     3
dim(unique(as.data.table(dataSets[[3]]), by="Project ID"))
# [1] 49999     3
```
So, dataSets[[3]] has duplicated 'Project ID' values, and for each duplicated value, all the matching rows from dataSets[[1]] are returned. That is what the second part of the second line explains: each of which join to the same group in x over and over again.
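To see which keys are repeated, and how often, counting rows per key is one option. The following is only a minimal sketch on made-up data: the table `dt` and its values are hypothetical stand-ins for dataSets[[3]], not the actual data.

```
library(data.table)

# Hypothetical stand-in for dataSets[[3]]: "a" occurs twice, "c" three times
dt = data.table("Project ID" = c("a", "a", "b", "c", "c", "c"), value = 1:6)

# Count rows per key, then keep only the keys that occur more than once
dup_counts = dt[, .N, by = "Project ID"][N > 1]
print(dup_counts)

# Quick yes/no check: anyDuplicated() returns 0 when every key is unique
has_dups = anyDuplicated(dt[["Project ID"]]) > 0
```

Every key listed in `dup_counts` will be matched more than once in the merge, which is where the extra rows come from.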

Trying out allow.cartesian=TRUE:

I know that there are duplicate keys and I still wish to proceed. The error message says how to do that: add "allow.cartesian=TRUE".
```
ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]), 
                all.x=TRUE, all.y=FALSE, by="Project ID", allow.cartesian=TRUE)
```
Aha, now it works fine! So what does allow.cartesian = TRUE do, and why was it added? The error message says to search for the message on Stack Overflow (among other places).

Searching for allow.cartesian=TRUE on SO:

The search lands me on the question "Why is allow.cartesian required at times when when joining data.tables with duplicate keys?", which explains the purpose. Its comments also contain another link, from @Roland, to "Merging data.tables uses more than 10 GB RAM", the initial issue that started it all. Let me read those posts now.

Is base::merge giving a different result?

Now, does base::merge return a different result (with 100,000 rows)?
```
dim(merge(dataSets[[1]], dataSets[[3]], all.x=TRUE, all.y=FALSE, by="Project ID"))
# [1] 121229      4
```
Not really. It returns the same dimensions as data.table does; it just doesn't care whether there are duplicate keys, whereas data.table warns you of a potential explosion in the size of the merged result and lets you make an informed decision.
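
The whole situation can be reproduced on a small scale. The following is a sketch on made-up tables: `x` and `y` are hypothetical, with the same key value repeated 10 times on each side, so the join would produce 10 * 10 = 100 rows from two 10-row inputs.

```
library(data.table)

# Hypothetical tables: key value 1 appears 10 times on each side
x = data.table(id = rep(1L, 10), a = 1:10)
y = data.table(id = rep(1L, 10), b = 101:110)

# By default data.table refuses; capture the error message instead of stopping
res_default = tryCatch(merge(x, y, by = "id"),
                       error = function(e) conditionMessage(e))

# Opting in explicitly returns every combination of matching rows
res_cartesian = merge(x, y, by = "id", allow.cartesian = TRUE)

# base::merge on plain data frames gives the same row count, without any check
res_base = merge(as.data.frame(x), as.data.frame(y), by = "id")

dim(res_cartesian)
dim(res_base)
```

Here `res_default` holds the error message rather than a table, while the other two results have the same number of rows.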