Question
I’ve devised a solution to look up values from multiple columns of two separate data tables and add a new column based on calculations of their values (multiple conditional comparisons). Code below. It involves a data.table join while calculating values from both tables; however, the tables aren’t joined on the columns I’m comparing, so I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited to tap into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”
The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 x 100, which takes ~1.5 sec to run the join on my machine, but I can’t go much bigger without an enormous performance hit (200 x 200 takes ~22 sec).
I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I’d like to retain that functionality.
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / more efficient outcome?
Thanks so much!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
dtEventsWithinRadius <- dtEvents[dtGrid, {
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]; #basic circle formula: x^2 + y^2 = radius^2
list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
}, by=.EACHI]
)
Answer 1:
Very interesting question.. and great use of by=.EACHI! Here’s another approach using the NEW non-equi joins from the current development version, v1.9.7.
Issue: Your use of by=.EACHI is completely justified, because the only alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents), and that is too exhaustive and bound to explode very quickly.
However, by=.EACHI is performed along with an equi-join on a dummy column, which still results in computing all distances (just one group at a time, hence memory efficient). That is, in your code, for each row of dtGrid the distance to every row of dtEvents is still computed; hence it doesn’t scale as well as expected.
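To put a number on that (my own back-of-the-envelope illustration, using the objects from the question), the dummy join evaluates one distance per (grid point, event) pair:
#Distance evaluations under the dummy join: one per (grid point, event) pair
nrow(dtGrid) * nrow(dtEvents)   # 101^2 * 100^2 = 102,010,000 for the 100 x 100 grid
That count grows with roughly the fourth power of the grid side, which is why 200 x 200 is so much slower.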
Strategy: Then you’d agree that an acceptable improvement is to restrict the number of rows that result from joining each row of dtGrid to dtEvents.
Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, say, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, for i = 1, all j satisfying (x1 - a_j)^2 + (y1 - b_j)^2 < 1 need to be extracted. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where:
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
This constraint can be directly translated to a non-equi join and combined with by=.EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1, as follows:
foo1 <- function(dt1, dt2) {
  dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]   ## (1)
  tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
            .(sum((i.x-x)^2 + (i.y-y)^2 < 1)), by=.EACHI,
            allow.cartesian=TRUE, nomatch=0L
          ][, c("xp", "yp") := NULL]            ## (2)
  tmp[]
}
## (1) constructs all the columns necessary for the non-equi join (expressions are not yet allowed in on=).
## (2) performs a non-equi join that computes the distances and checks which are < 1, but only on the restricted set of combinations for each row in dtGrid -- hence it should be much faster.
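As a quick sanity check (not part of the original answer; the small objects dtGridS and dtEventsS below are made up for illustration), foo1 can be compared against a brute-force count that evaluates every grid-point/event distance explicitly:
#Sanity check: brute force vs. foo1 on a small 10 x 10 grid
cstSmall  <- 10L
vecSmall  <- seq(0, cstSmall, 1)
dtGridS   <- data.table(expand.grid(x=vecSmall, y=vecSmall))
dtEventsS <- data.table(x=runif(cstSmall^2, 0, cstSmall + 1),
                        y=runif(cstSmall^2, 0, cstSmall + 1))
#Brute force: full (events x grid points) distance matrix, one count per grid point
bf <- copy(dtGridS)[, V1 := as.integer(colSums(
        outer(dtEventsS$x, x, "-")^2 + outer(dtEventsS$y, y, "-")^2 < 1))]
ans <- foo1(dtGridS, dtEventsS)
#foo1 drops grid points with no events in the bounding box (nomatch=0L),
#so compare only the non-zero counts
all.equal(ans[V1 != 0][order(xm, ym), .(xm, ym, V1)],
          bf[V1 != 0][order(x, y), .(xm=x, ym=y, V1)])   # expect TRUE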
Benchmarks:
# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
ans = dt2[dt1,
{
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
.(xm=i.x, ym=i.y, V1=sum(val))
},
by=.EACHI][, "DummyJoin" := NULL]
ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them
The speedups are ~10x, 32x and 53x respectively.
Note that the rows in dtGrid for which the condition is not satisfied by even a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you’ll have to also add one of the xm/xp/ym/yp cols.. and check them for NA (= no matches). This is the reason we had to remove all 0 counts to get identical = TRUE.
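A minimal sketch of that idea (my own variation, not code from the original answer; it assumes the non-equi join behavior of current data.table, where unmatched rows yield NA under the default nomatch):
foo1_keep0 <- function(dt1, dt2) {
  dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]
  #nomatch left at its default NA: unmatched rows of dt1 get NA for dt2's
  #columns, so the sum below is NA where no event falls in the bounding box
  tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
            .(V1 = sum((i.x-x)^2 + (i.y-y)^2 < 1)), by=.EACHI,
            allow.cartesian=TRUE]
  tmp[is.na(V1), V1 := 0L]   #NA count = no matches = 0 events within radius
  tmp[, c("xp", "yp") := NULL][]
}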
HTH
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.
Source: https://stackoverflow.com/questions/38297148/r-data-table-multiple-conditions-join