I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuc
The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that
con <- textConnection("SNP BP
rs064 12292
rs319 345367
rs285 700042")
snps <- read.delim(con, head=TRUE, sep="")
con <- textConnection("Gene BP_start BP_end
E613 345344 363401
E92 694501 705408
E49 362370 368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")
then create 'IRanges' out of each
library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene)
(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')
olaps <- findOverlaps(isnps, igenes)
two of the SNPs overlap
> queryHits(olaps)
[1] 2 3
and they overlap the first and second genes
> subjectHits(olaps)
[1] 1 2
If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as
> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
SNP BP Gene BP_start BP_end
2 rs319 345367 E613 345344 363401
3 rs285 700042 E92 694501 705408
Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you
library(GenomicRanges)
isnps <-
with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))
You can merge by range in base by using apply
and which
. To return only those which match (inner join):
do.call(rbind, apply(file1test, 1, function(x) {
if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
cbind(rbind(x), file2[idx,])
}
}))
# SNP BP Gene BP_start BP_end
#x rs2343 860269 E3543 860260 879955
#1 rs754 861822 E3543 860260 879955
#2 rs754 861822 E11 861322 879533
#x1 rs854 367934 E613 367640 368634
or all from file1test (left join):
do.call(rbind, apply(file1test, 1, function(x) {
if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
cbind(rbind(x), file2[idx,])
} else {
cbind(rbind(x), file2[1,][NA,])
}
}))
# SNP BP Gene BP_start BP_end
#x rs2343 860269 E3543 860260 879955
#x1 rs211 369640 <NA> NA NA
#1 rs754 861822 E3543 860260 879955
#2 rs754 861822 E11 861322 879533
#x2 rs854 367934 E613 367640 368634
#x3 rs343 706940 <NA> NA NA
#x4 rs626 717244 <NA> NA NA
I believe what you're asking for is a conditional join
. They're easy in SQL, and the sqldf
package makes it easy to query data frames in R using SQL.
Just pick a version depending on how you want unmatched SNPs handled.
Inner join version:
> sqldf("select * from file1test f1 inner join file2 f2
+ on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")
Output:
SNP BP Gene BP_start BP_end
1 rs2343 860269 E3543 860260 879955
2 rs754 861822 E3543 860260 879955
3 rs754 861822 E11 861322 879533
4 rs854 367934 E613 367640 368634
>
Left Join version:
> sqldf("select * from file1test f1 left join file2 f2
+ on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")
Output:
SNP BP Gene BP_start BP_end
1 rs2343 860269 E3543 860260 879955
2 rs211 369640 <NA> NA NA
3 rs754 861822 E3543 860260 879955
4 rs754 861822 E11 861322 879533
5 rs854 367934 E613 367640 368634
6 rs343 706940 <NA> NA NA
7 rs626 717244 <NA> NA NA
>
Note that you may want to be careful where you place the =
if it matters which group a BP will fall in for the case where a BP exactly matches a BP_start or BP_end.