The following is a reproducible example of a situation that I'm experiencing and stuck with (it's a test client I'm using to evaluate various ap
I'd approach the issue in this manner:
First, there's an error message. What does it say?
Join results in 121229 rows; more than 100000 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Great! But I've so many datasets I'm working with, and so many packages and so many functions. I've got to narrow this down to which data set produces this error.
ans1 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[2]]),
all.x=TRUE, all.y=FALSE, by="Project ID")
## works fine.
ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]),
all.x=TRUE, all.y=FALSE, by="Project ID")
## same error
Aha, got the same error.
So, something seems to happen with dataSets[[3]]. It says to check for duplicate key values in i. Let's do that:
dim(dataSets[[3]])
# [1] 81487 3
dim(unique(as.data.table(dataSets[[3]]), by="Project ID"))
# [1] 49999 3
So, dataSets[[3]] has duplicated 'Project ID' values, and for each duplicated value, all the matching rows from dataSets[[1]] are returned again. That is what the second sentence of the error message is getting at: "each of which join to the same group in x over and over again".
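To see which keys are the culprits, and how many rows to expect before actually running the merge, something along these lines should do. It's only a sketch on made-up data: the tables x and y and the columns id, a and b are hypothetical stand-ins for dataSets[[1]] and dataSets[[3]], not the real data. I pass allow.cartesian=TRUE in the final merge just so the toy example runs regardless of the row limit.

```r
library(data.table)

# Hypothetical toy stand-ins for dataSets[[1]] and dataSets[[3]]
x <- data.table(id = c(1L, 2L, 3L, 4L), a = 1:4)
y <- data.table(id = c(1L, 1L, 1L, 2L, 2L, 3L), b = 1:6)

# Keys that occur more than once in y, with their counts
dups <- y[, .N, by = id][N > 1]

# Predict the row count of a left join (all.x = TRUE):
# each x row is repeated once per matching y row, or kept once if unmatched
cx <- x[, .(nx = .N), by = id]
cy <- y[, .(ny = .N), by = id]
cnt <- merge(cx, cy, by = "id", all.x = TRUE)
cnt[is.na(ny), ny := 1L]
predicted <- sum(as.numeric(cnt$nx) * cnt$ny)

# Compare with the real thing
actual <- nrow(merge(x, y, by = "id", all.x = TRUE, allow.cartesian = TRUE))
```

Here dups lists ids 1 and 2, and predicted and actual agree. On the real data you would swap in "Project ID" as the key.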
allow.cartesian=TRUE:
I know that there are duplicate keys and still wish to proceed. The error message itself says how to do that: add allow.cartesian=TRUE.
ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]),
all.x=TRUE, all.y=FALSE, by="Project ID", allow.cartesian=TRUE)
Aha, now it works fine! So what does allow.cartesian=TRUE do? Or why was it added? The error message says to search for the message on Stack Overflow (among other things).
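For a feel of what the flag changes, here's a tiny reproduction. Again, x, y and their columns are invented for illustration. I made the key duplicated in both tables, because as far as I know the exact row limit has differed between data.table versions and this case should trip it either way. Treat the error part as expected behaviour rather than a guarantee.

```r
library(data.table)

# Hypothetical toy tables: the same key repeated 4 times on each side
x <- data.table(id = rep(1L, 4), a = 1:4)
y <- data.table(id = rep(1L, 4), b = 5:8)

# Without the flag: capture the error message instead of stopping the script
# (msg stays NA if no error is raised)
msg <- tryCatch(
  { merge(x, y, by = "id"); NA_character_ },
  error = function(e) conditionMessage(e)
)

# With the flag: every x row pairs with every matching y row
res <- merge(x, y, by = "id", allow.cartesian = TRUE)
nrow(res)  # 4 * 4 = 16
```

So the flag doesn't change the result of the join at all. It only tells data.table that you really do want the larger-than-expected result.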
allow.cartesian=TRUE on SO:
And the search lands me on the question "Why is allow.cartesian required at times when when joining data.tables with duplicate keys?", which explains the purpose, and which also contains, in the comments, another link from @Roland: "Merging data.tables uses more than 10 GB RAM", which points to the initial issue that started it all. Let me read those posts now.
base::merge giving a different result?
Now, does base::merge return a different result (with 100,000 rows)?
dim(merge(dataSets[[1]], dataSets[[3]], all.x=TRUE, all.y=FALSE, by="Project ID"))
# [1] 121229 4
Not really. It gives the same dimensions as data.table does, but it just doesn't care whether there are duplicate keys, whereas data.table warns you of a potential explosion of the merged result and lets you make an informed decision.
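The same comparison can be tried on toy data. This is a sketch with hypothetical tables x and y (invented for illustration), converting them to plain data.frames so that base::merge is dispatched:

```r
library(data.table)

# Hypothetical toy tables with a duplicated key on both sides
x <- data.table(id = rep(1L, 4), a = 1:4)
y <- data.table(id = rep(1L, 4), b = 5:8)

# base::merge on data.frames: no complaint about duplicate keys
base_res <- merge(as.data.frame(x), as.data.frame(y), by = "id")

# data.table's merge, with the explicit opt-in
dt_res <- merge(x, y, by = "id", allow.cartesian = TRUE)

dim(base_res)
dim(dt_res)
```

Both come back with the same dimensions. The only difference is that data.table made me ask for it.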