Question
I’ve devised a solution to look up values from multiple columns of two separate data tables and add a new column based on calculations of their values (multiple conditional comparisons). Code below. It involves a data.table join while calculating values from both tables; however, the tables aren’t joined on the columns I’m comparing, so I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited to tap into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”
The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 x 100, which takes ~1.5 sec to run the join on my machine, but I can’t go much bigger without an enormous performance hit (200 x 200 takes ~22 sec).
I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I’d like to retain that functionality.
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / more efficient outcome?
Thanks so much!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
dtEventsWithinRadius <- dtEvents[dtGrid, {
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]; #basic circle formula: x^2 + y^2 = radius^2
list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
}, by=.EACHI]
)
Answer 1:
Very interesting question.. and great use of by=.EACHI! Here’s another approach using the NEW non-equi joins from the current development version, v1.9.7.
Issue: Your use of by=.EACHI is completely justified, because the only alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents), and that is too exhaustive and bound to explode very quickly.
However, by=.EACHI is performed along with an equi-join on a dummy column, which still results in computing all distances (just one group at a time, hence memory efficient). That is, in your code, for each row of dtGrid the distance to every row of dtEvents is still computed; hence it doesn’t scale as well as expected.
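To put a number on that (my own back-of-the-envelope illustration, using the objects from the question), the dummy join evaluates one distance per (grid point, event) pair:
#Distance evaluations under the dummy join: one per (grid point, event) pair
nrow(dtGrid) * nrow(dtEvents)   # 101^2 * 100^2 = 102,010,000 for the 100 x 100 grid
That count grows with roughly the fourth power of the grid side, which is why 200 x 200 is so much slower.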
Strategy: Then you’d agree that an acceptable improvement is to restrict the number of rows that result from joining each row of dtGrid to dtEvents.
Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, say, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, for i = 1, all j satisfying (x1 - a_j)^2 + (y1 - b_j)^2 < 1 need to be extracted. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where:
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
This constraint can be directly translated to a non-equi join and combined with by=.EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1, as follows:
foo1 <- function(dt1, dt2) {
  dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]   ## (1)
  tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
            .(sum((i.x-x)^2 + (i.y-y)^2 < 1)), by=.EACHI,
            allow.cartesian=TRUE, nomatch=0L
          ][, c("xp", "yp") := NULL]            ## (2)
  tmp[]
}
## (1) constructs all the columns necessary for the non-equi join (expressions are not yet allowed in on=).
## (2) performs a non-equi join that computes the distances and checks which are < 1, but only on the restricted set of combinations for each row in dtGrid -- hence it should be much faster.
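As a quick sanity check (not part of the original answer; the small objects dtGridS and dtEventsS below are made up for illustration), foo1 can be compared against a brute-force count that evaluates every grid-point/event distance explicitly:
#Sanity check: brute force vs. foo1 on a small 10 x 10 grid
cstSmall  <- 10L
vecSmall  <- seq(0, cstSmall, 1)
dtGridS   <- data.table(expand.grid(x=vecSmall, y=vecSmall))
dtEventsS <- data.table(x=runif(cstSmall^2, 0, cstSmall + 1),
                        y=runif(cstSmall^2, 0, cstSmall + 1))
#Brute force: full (events x grid points) distance matrix, one count per grid point
bf <- copy(dtGridS)[, V1 := as.integer(colSums(
        outer(dtEventsS$x, x, "-")^2 + outer(dtEventsS$y, y, "-")^2 < 1))]
ans <- foo1(dtGridS, dtEventsS)
#foo1 drops grid points with no events in the bounding box (nomatch=0L),
#so compare only the non-zero counts
all.equal(ans[V1 != 0][order(xm, ym), .(xm, ym, V1)],
          bf[V1 != 0][order(x, y), .(xm=x, ym=y, V1)])   # expect TRUE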
Benchmarks:
# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
ans = dt2[dt1,
{
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
.(xm=i.x, ym=i.y, V1=sum(val))
},
by=.EACHI][, "DummyJoin" := NULL]
ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them
The speedups are ~10x, 32x and 53x respectively.
Note that the rows in dtGrid for which the condition is not satisfied by even a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you’ll have to also add one of the xm/xp/ym/yp cols.. and check them for NA (= no matches). This is the reason we had to remove all 0 counts to get identical = TRUE.
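A minimal sketch of that idea (my own variation, not code from the original answer; it assumes the non-equi join behavior of current data.table, where unmatched rows yield NA under the default nomatch):
foo1_keep0 <- function(dt1, dt2) {
  dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]
  #nomatch left at its default NA: unmatched rows of dt1 get NA for dt2's
  #columns, so the sum below is NA where no event falls in the bounding box
  tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
            .(V1 = sum((i.x-x)^2 + (i.y-y)^2 < 1)), by=.EACHI,
            allow.cartesian=TRUE]
  tmp[is.na(V1), V1 := 0L]   #NA count = no matches = 0 events within radius
  tmp[, c("xp", "yp") := NULL][]
}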
HTH
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.
Source: https://stackoverflow.com/questions/38297148/r-data-table-multiple-conditions-join