Question
I have two tables, A and B. For each row of table A, I want to get some summary statistics for B$value where the value of B$location is within 100 of A$location. I've accomplished this using the for-loop below, but this is a slow solution: it works well while the tables are small, but I would like to scale up to a table A with thousands of rows and a table B with nearly a million rows. Any ideas of how to achieve this? Thanks in advance!
The for-loop:
for (i in 1:nrow(A)) {
  temp <- subset(B, abs(A$location[i] - B$location) <= 100)
  A$n[i]   <- nrow(temp)
  A$sum[i] <- sum(temp$value)
  A$avg[i] <- mean(temp$value)
}
An example:

A
loc
150
250
400

B
loc value
 25     7
 77    19
170    10
320    15

Would become:

A
loc n sum  avg
150 2  29 14.5
250 2  25 12.5
400 1  15 15.0
Answer 1:
Similar to Matt Summersgill's answer, you could do a non-equi join to update A:
library(data.table)   # A and B are assumed to be data.tables (e.g. via setDT())

A[, up := loc + 100]
A[, dn := loc - 100]
A[, c("n", "s", "m") :=
    B[copy(.SD), on = .(loc >= dn, loc <= up),
      .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]
]
Or in one chained command:
A[, up := loc + 100][, dn := loc - 100][, c("n", "s", "m") :=
    B[copy(.SD), on = .(loc >= dn, loc <= up),
      .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]
]
This should be fairly efficient, I guess.
How it works

Inside j of x[i, j], .SD refers to the subset of data from x (in this case it's all of A).

x[i, on=, j, by=.EACHI] is a join, using each row of i (in this case copy(.SD) == A) to look up matching rows of x (in this case B) using the conditions in on=. For each row of i, j is calculated (which is what by=.EACHI means).

When j doesn't have names, they are assigned automatically: V1, V2, and so on. .N by default gets named N.
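As a concrete illustration (my own sketch, reusing the toy tables from the question), running the join above on those tables yields the requested columns directly on A:

library(data.table)

A <- data.table(loc = c(150L, 250L, 400L))
B <- data.table(loc = c(25L, 77L, 170L, 320L), value = c(7L, 19L, 10L, 15L))

A[, up := loc + 100]
A[, dn := loc - 100]
A[, c("n", "s", "m") :=
    B[copy(.SD), on = .(loc >= dn, loc <= up),
      .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]
]

A[, .(loc, n, s, m)]
#    loc n  s    m
# 1: 150 2 29 14.5
# 2: 250 2 25 12.5
# 3: 400 1 15 15.0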
Answer 2:
My pure R solution (below) is still considerably slow; on my system it required 32 seconds to finish Matt Summersgill's big example, but compared to other solutions it's still reasonable.
The logic of my solution is that, since the inputs are sorted, as you consider new elements of A_loc, the range of values in B_loc will either stay the same if the new A_loc element is identical to the previous one, or it will shift to the right in B_loc, possibly contracting or expanding. Note that if you were working with double inputs, you'd have to be a bit more careful with the comparisons.

This C++ version is naturally faster. If you can Rcpp::sourceCpp this code:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame foo(IntegerVector A_loc, IntegerVector B_loc, IntegerVector B_val) {
  IntegerVector n(A_loc.length());
  IntegerVector sum(A_loc.length());
  NumericVector avg(A_loc.length());

  int lower = 0;
  int upper = 0;
  int count = 0;
  int current_sum = 0;

  for (int i = 0; i < A_loc.length(); i++) {
    checkUserInterrupt();

    // drop B values below A_loc[i] - 100 from the running window
    while (lower < B_loc.length()) {
      if (B_loc[lower] >= A_loc[i] - 100) {
        break;
      }
      if (count > 0) {
        count--;
        current_sum -= B_val[lower];
      }
      lower++;
    }

    if (upper < lower) {
      upper = lower;
    }

    // extend the window to include B values up to A_loc[i] + 100
    while (upper < B_loc.length()) {
      if (B_loc[upper] > A_loc[i] + 100) {
        break;
      }
      count++;
      current_sum += B_val[upper++];
    }

    n[i] = count;
    sum[i] = current_sum;
    avg[i] = static_cast<double>(current_sum) / count;
  }

  DataFrame df = DataFrame::create(
    Named("loc") = A_loc,
    Named("n") = n,
    Named("sum") = sum,
    Named("avg") = avg
  );

  return df;
}
then this:
A <- data.frame(loc = sample.int(1000, size = 1e4, replace = TRUE))
B <- data.frame(loc = sample.int(1000, size = 1e6, replace = TRUE),
                value = sample.int(100, size = 1e6, replace = TRUE))

test <- function() {
  # remove unique if you want to consider duplicated values
  A_loc <- sort(unique(A$loc), decreasing = FALSE)
  B <- B[order(B$loc), ]
  out <- foo(A_loc, B$loc, B$value)
}

microbenchmark::microbenchmark(test())
shows these timings:
Unit: milliseconds
   expr      min      lq     mean   median       uq      max neval
 test() 44.74469 45.8118 51.35361 47.34657 48.99376 95.00938   100
If you can't use Rcpp, then consider the R version below, or Frank's solution with data.table; I think sorting the inputs might also help in that case. for loops are usually avoided in R, but I don't think they are always slow; you just have to be careful not to copy data too much. Also, since R v3.5.0, writing something like for (i in 1:10) no longer allocates the whole vector first; it supports a compact representation.
A_loc <- sort(unique(A$loc), decreasing = FALSE)
B <- B[order(B$loc), ]

out <- data.frame(loc = A_loc,
                  n = 0L,
                  sum = 0L,
                  avg = 0)

lower <- 1L
upper <- 1L
count <- 0L
sum <- 0L
upper_limit <- nrow(B)

for (i in seq_along(A_loc)) {
  current_loc <- A_loc[i]

  while (lower <= upper_limit) {
    if (B$loc[lower] >= current_loc - 100L) {
      break
    }
    if (count > 0L) {
      count <- count - 1L
      sum <- sum - B$value[lower]
    }
    lower <- lower + 1L
  }

  if (upper < lower) {
    upper <- lower
  }

  while (upper <= upper_limit) {
    if (B$loc[upper] > current_loc + 100L) {
      break
    }
    count <- count + 1L
    sum <- sum + B$value[upper]
    upper <- upper + 1L
  }

  out$n[i] <- count
  out$sum[i] <- sum
  out$avg[i] <- sum / count
}
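If A$loc does contain duplicates and you keep the unique() call, one way to map the per-location summaries back onto the original rows (my suggestion, not part of the original answer) is a plain merge:

# Hypothetical follow-up: attach the per-unique-location summaries back to A
A_summary <- merge(A, out, by = "loc", all.x = TRUE, sort = FALSE)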
Answer 3:
Here's a tidyverse solution:
library(tidyverse)
A = read.table(text = "
loc
150
250
400
", header=T)
B = read.table(text = "
loc value
25 7
77 19
170 10
320 15
", header=T)
A %>%
  mutate(B = list(B)) %>%               # create all combinations of rows of A and B
  unnest() %>%
  filter(abs(loc - loc1) <= 100) %>%    # keep rows that satisfy your condition
  group_by(loc) %>%                     # for each loc value
  summarise(sum = sum(value),           # calculate sum
            avg = mean(value))          # calculate mean

# # A tibble: 3 x 3
#     loc   sum   avg
#   <int> <int> <dbl>
# 1   150    29  14.5
# 2   250    25  12.5
# 3   400    15  15
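Since the question also asks for the count n, one could additionally compute n() inside summarise() (my addition, reusing the same A and B and the same older-tidyr unnest() behaviour as above):

A %>%
  mutate(B = list(B)) %>%
  unnest() %>%
  filter(abs(loc - loc1) <= 100) %>%
  group_by(loc) %>%
  summarise(n = n(),
            sum = sum(value),
            avg = mean(value))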
Maybe not the best solution if you have large A and B tables, as you have to create all combinations of rows and then filter.
Answer 4:
This is possible with the foverlaps function within data.table, and the following method actually has a prayer at finishing your actual use case -- a table A with thousands of rows and a table B with nearly a million rows -- in a reasonable amount of time.
With your toy example:
library(data.table)
A <- fread("
loc
150
250
400")
B <- fread("
loc value
25 7
77 19
170 10
320 15")
## Create a 'dummy' value to create an interval w/same start and end in A
A[,loc_Dummy := loc]
## Create values bounding the match range for loc in B
B[,loc_Plus100 := loc + 100]
B[,loc_Minus100 := loc - 100]
## Set up for the overlap join
setkey(A,loc,loc_Dummy)
setkey(B,loc_Minus100, loc_Plus100)
## Create a table of only matches instead of doing a full cartesian join of all cases
Matches <- foverlaps(A[,.(loc, loc_Dummy)],
B[,.(loc_Minus100,loc_Plus100,value)])
## Create a summary table
Matches[,.(n = .N, sum = sum(value), avg = mean(value)), by = .(loc)]
#    loc n sum  avg
# 1: 150 2  29 14.5
# 2: 250 2  25 12.5
# 3: 400 1  15 15.0
Scaling up - yikes!
However, this is actually an extremely computationally intensive problem. Scaling up to your actual case sizes shows the challenge here -- using 10,000 rows for table A and 1,000,000 rows for table B, this method completes in 91 seconds on the server I'm running on, but uses over 112 GB of memory!
A <- data.table(loc = sample.int(1000, size = 1e4, replace = TRUE))
B <- data.table(loc = sample.int(1000, size = 1e6, replace = TRUE),
                value = sample.int(100, size = 1e6, replace = TRUE))

system.time({
  A[, loc_Dummy := loc]
  B[, loc_Plus100 := loc + 100]
  B[, loc_Minus100 := loc - 100]
  setkey(A, loc, loc_Dummy)
  setkey(B, loc_Minus100, loc_Plus100)
  Matches <- foverlaps(A[, .(loc, loc_Dummy)],
                       B[, .(loc_Minus100, loc_Plus100, value)])
  Summary <- Matches[, .(n = .N, sum = sum(value), avg = mean(value)), by = .(loc)]
})

## Warning: Memory usage peaks at ~112 GB!

#    user  system elapsed
#  56.407  46.258  91.135
This is barely within the capabilities of the server I use, and likely isn't feasible for your case.
If you don't have hundreds of Gigabytes of memory at your disposal, you'll probably have to get a little more clever in the way you approach this and iterate through chunks at a time.
From what I can tell, your problem is actually similar in ways to the one posed (and solved) by Lorenzo Busetto and detailed in a blog post: Speeding up spatial analyses by integrating sf and data.table: a test case.
Chunking to the rescue
Requiring over ~100 gigabytes of memory isn't really a feasible solution -- especially if you wanted to scale A or B up by an order of magnitude at some point.

However, a chunking method (inspired by Lorenzo's post linked above) that splits the problem into 100 chunks only increases run-time by a trivial amount, to 116 seconds, but reduces peak memory usage to less than 3 GB! If I were planning on doing this in production, I'd go with something like the following.

One note: I didn't really do an in-depth audit of the accuracy of the results (I might have specified one of the range bounds incorrectly open or closed), so I'd scrutinize the output against data you're familiar with before putting it into production.
A <- data.table(loc = sample.int(1000, size = 1e4, replace = TRUE))
B <- data.table(loc = sample.int(1000, size = 1e6, replace = TRUE),
                value = sample.int(100, size = 1e6, replace = TRUE))

system.time({
  A[, loc_Dummy := loc]
  B[, loc_Plus100 := loc + 100]
  B[, loc_Minus100 := loc - 100]

  setkey(A, loc)
  setkey(B, loc)

  ChunkCount <- 100
  ChunkSize <- A[, .N / ChunkCount]

  ResultList <- vector("list", ChunkCount)

  for (j in seq_len(ChunkCount)) {
    A_loc_Min <- A[((j - 1) * ChunkSize + 1):(min(nrow(A), (j) * ChunkSize)), min(loc)]
    A_loc_Max <- A[((j - 1) * ChunkSize + 1):(min(nrow(A), (j) * ChunkSize)), max(loc)]

    A_Sub <- A[loc >= A_loc_Min & loc < A_loc_Max]
    B_Sub <- B[loc_Plus100 >= A_loc_Min & loc_Minus100 < A_loc_Max]

    setkey(A_Sub, loc, loc_Dummy)
    setkey(B_Sub, loc_Minus100, loc_Plus100)

    Matches <- foverlaps(A_Sub[, .(loc, loc_Dummy)],
                         B_Sub[, .(loc_Minus100, loc_Plus100, value)])

    ResultList[[j]] <- Matches[, .(n = .N, sum = sum(value), avg = mean(value)), by = .(loc)]
  }

  Summary <- rbindlist(ResultList)
})
#     user  system elapsed
#  109.125  16.864 116.129
Validating
Update: @Alexis's and @Frank's suggestions in the comments generate the same result set; mine comes out slightly different, but only on the count. If someone else can verify that the correct answer is actually the one provided by @Alexis/@Frank, then I'd be happy to retract my answer, as both methods execute faster than the one I proposed.
set.seed(1234)

A <- data.table(loc = sample.int(1000, size = 1e3, replace = TRUE))
B <- data.table(loc = sample.int(1000, size = 1e4, replace = TRUE),
                value = sample.int(10, size = 1e4, replace = TRUE))

## Matt
Matt_A <- copy(A)
Matt_B <- copy(B)

Matt_A[, loc_Dummy := loc]
Matt_B[, loc_Plus100 := loc + 100]
Matt_B[, loc_Minus100 := loc - 100]

setkey(Matt_A, loc, loc_Dummy)
setkey(Matt_B, loc_Minus100, loc_Plus100)

Matches <- foverlaps(Matt_A[, .(loc, loc_Dummy)],
                     Matt_B[, .(loc_Minus100, loc_Plus100, value)])

Summary_Matt <- Matches[, .(n = .N, sum = sum(value), avg = mean(value)), keyby = .(loc)]

## Alexis
Rcpp::sourceCpp("RowRanges.cpp")

A_loc <- sort(A$loc, decreasing = FALSE)
B <- B[order(B$loc), ]

Alexis <- foo(unique(A_loc), B$loc, B$value)

Summary_Alexis <- as.data.table(Alexis)
colnames(Summary_Alexis) <- c("loc", "n", "sum", "avg")   # foo() as defined above also returns the loc column
Summary_Alexis[, loc := unique(A_loc)]
setcolorder(Summary_Alexis, c("loc", "n", "sum", "avg"))

## Frank
Frank <- A[, up := loc + 100][
  , dn := loc - 100][
  , c("n", "s", "m") := B[copy(.SD), on = .(loc >= dn, loc <= up),
                          .(.N, sum(value), mean(value)), by = .EACHI][
    , .(N, V2, V3)]][]

Summary_Frank <- unique(Frank[, .(loc, n, sum = s, avg = m)][order(loc)])

## Comparing
all.equal(Summary_Frank, Summary_Alexis)
# [1] TRUE

all.equal(Summary_Frank, Summary_Matt)
# [1] "Column 'n': Mean relative difference: 1.425292"
Answer 5:
I don't normally suggest solutions that rely on installing packages, but I think this one will do the trick for you. It will install a package that enables you to code in SQL inside R.
# Load the package
install.packages("sqldf")
library(sqldf)

# Create tables
A <- data.frame("loc" = c(150, 250, 400))
B <- data.frame("loc" = c(25, 77, 170, 320), "value" = c(7, 19, 10, 15))

# Join tables
df0 <- sqldf('select a.loc
                    ,count(b.value) as n_value
                    ,sum(b.value)   as sum_value
                    ,avg(b.value)   as avg_value
              from A as a
              left join B as b
                on abs(a.loc - b.loc) <= 100
              group by a.loc')

# Print data frame
df0
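For the toy tables above, df0 should match the expected summary from the question; the printed result should look roughly like this:

#   loc n_value sum_value avg_value
# 1 150       2        29      14.5
# 2 250       2        25      12.5
# 3 400       1        15      15.0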
Answer 6:
I'm not sure how well this solution will scale - it depends on whether the filter matrix fits in memory.
A <- within(A, {
  B.filter <- outer(B$loc, A$loc, function(x, y) abs(x - y) <= 100)
  n   <- colSums(B.filter)
  sum <- colSums(B$value * B.filter)
  avg <- sum / n
  rm(B.filter)
})
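As a rough feasibility check (my own back-of-the-envelope sketch, not part of the original answer): a logical matrix in R uses 4 bytes per cell, so at the question's target sizes the filter matrix alone would need tens of gigabytes of RAM:

# approximate size of the nrow(B) x nrow(A) logical filter matrix
n_A <- 1e4   # thousands of rows in A
n_B <- 1e6   # roughly a million rows in B
n_A * n_B * 4 / 1024^3   # ~37 GB just for the filter matrix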
If locations in A and/or B repeat, you might be able to reduce the size of the filter matrix by only using unique values:
A <- within(A, {
  B.filter <- outer(unique(B$loc), unique(A$loc), function(x, y) abs(x - y) <= 100)
  colnames(B.filter) <- unique(A$loc)
  rownames(B.filter) <- unique(B$loc)
  n   <- colSums(B.filter[, as.character(A$loc)])
  sum <- colSums(B$value * B.filter[as.character(B$loc), ])
  avg <- sum / n
  rm(B.filter)
})
Source: https://stackoverflow.com/questions/50514654/r-efficiently-computing-summaries-of-value-subsets-whose-contents-are-determine