Summarize the self-join index while avoiding cartesian product in R data.table

前端未结

With a 2-column data.table, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words,

3 Answers

不思量自难忘°

2021-01-15 07:31
You already have a solution written in SQL, so I suggest the R package sqldf, which lets you run that query directly against the data.

Here's code:
```
library(sqldf)

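# Self-join on Y, then count the shared Y values for each pair of X values.
# The second X is aliased so the result does not end up with two columns named X.
result <- sqldf("SELECT A.X AS X, B.X AS i_X, COUNT(A.Y) AS N
                 FROM test AS A JOIN test AS B ON A.Y = B.Y
                 GROUP BY A.X, B.X")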
```
被撕碎了的回忆

2021-01-15 07:39
If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:
```
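d[, grp := Y <= 3] # this particular split works best for OP data
# self-join each group on Y, count pairs within the group, then add up the group counts
d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)], by = grp][,
    .(N = sum(N)), by = .(X, i.X)]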
```
The intermediate table above has only 16 rows, as opposed to 26. Unfortunately I can't think of an easy way to create such grouping automatically.
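To see the effect of the split, here is a minimal sketch that runs the direct self-join and the grouped variant side by side and checks that they agree. It assumes the sample data from the `foverlaps()` answer in this thread (stored as integers) and joins with an explicit `on = "Y"`; the names `full`, `partial`, `direct` and `grouped` are introduced here only for illustration.

```r
library(data.table)

# Sample data copied from the foverlaps() answer in this thread (as integers)
d <- data.table(X = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
                Y = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 5L, 6L, 4L, 5L))

# Direct self-join on Y: one row per (X, i.X, shared Y) combination
full   <- d[d, on = "Y", allow.cartesian = TRUE]
direct <- full[, .N, by = .(X, i.X)]

# Grouped variant: join and count within each group, then sum across groups
d[, grp := Y <= 3]
partial <- d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)],
             by = grp]
grouped <- partial[, .(N = sum(N)), by = .(X, i.X)]

# Compare intermediate sizes and confirm both routes give the same counts
nrow(full)
nrow(partial)
setkey(direct, X, i.X)
setkey(grouped, X, i.X)
all.equal(direct, grouped)
```

On this data the two row counts should correspond to the 26 and 16 quoted above. The saving depends entirely on how well the chosen split separates the X values, so a different split may help less or not at all.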
[愿得一人]

2021-01-15 07:41

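How about this one, using foverlaps()? The idea is to collapse each run of consecutive Y values within an X into a single [start, end] interval and then count overlaps between intervals. The more consecutive values of Y you have for each X, the fewer rows this produces compared to a cartesian join.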

```
d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setorder(d, X)
d[, id := cumsum(c(0L, diff(Y)) != 1L), by=X]
dd = d[, .(start=Y[1L], end=Y[.N]), by=.(X,id)][, id := NULL][]

ans <- foverlaps(dd, setkey(dd, start, end))
ans[, count := pmin(abs(i.end-start+1L), abs(end-i.start+1L), 
                    abs(i.end-i.start+1L), abs(end-start+1L))]
ans[, .(count = sum(count)), by=.(X, i.X)][order(i.X, X)]
#     X i.X count
#  1: 1   1     3
#  2: 2   1     3
#  3: 3   1     1
#  4: 1   2     3
#  5: 2   2     4
#  6: 3   2     1
#  7: 4   2     1
#  8: 1   3     1
#  9: 2   3     1
# 10: 3   3     3
# 11: 4   3     1
# 12: 2   4     1
# 13: 3   4     1
# 14: 4   4     2
```

Note: make sure X and Y are integers for faster results. This is because joins on integer types are faster than on double types (foverlaps performs binary joins internally).

You can make this more memory efficient by using which=TRUE in foverlaps() and using the indices to generate count in the next step.
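A rough sketch of that `which=TRUE` variant follows. It assumes `foverlaps(..., which = TRUE)` returns a two-column table of row indices named `xid` and `yid` (as I understand the data.table documentation), and it computes the overlap length as `min(end) - max(start) + 1`, which is a swapped-in formula rather than the `pmin(abs(...))` expression above. The names `idx` and `res` are illustrative.

```r
library(data.table)

# Same sample data, stored as integers as recommended in the note above
d <- data.table(X = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
                Y = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 5L, 6L, 4L, 5L))
setorder(d, X, Y)

# Collapse each run of consecutive Y values within an X into one interval
d[, id := cumsum(c(0L, diff(Y)) != 1L), by = X]
dd <- d[, .(start = Y[1L], end = Y[.N]), by = .(X, id)][, id := NULL][]
setkey(dd, start, end)

# Only the row indices of overlapping interval pairs are returned
idx <- foverlaps(dd, dd, which = TRUE)

# Length of the intersection of two overlapping closed integer intervals
idx[, count := pmin(dd$end[xid], dd$end[yid]) -
               pmax(dd$start[xid], dd$start[yid]) + 1L]

# Look up the X values by index and aggregate as before
idx[, `:=`(X = dd$X[yid], i.X = dd$X[xid])]
res <- idx[, .(count = sum(count)), by = .(X, i.X)][order(i.X, X)]
res
```

The index table carries two integer columns per overlapping pair instead of a full copy of both sides of the join, which is where the memory saving would come from; how much it matters depends on the width of the real table.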
