Merge by Range in R - Applying Loops

前端未结

关注

 3  1060

I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuc

相关标签:

3条回答

灰色年华

2020-11-28 13:33
The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that
```
con <- textConnection("SNP     BP
rs064   12292
rs319   345367
rs285   700042")
snps <- read.delim(con, head=TRUE, sep="")

con <- textConnection("Gene    BP_start    BP_end
E613    345344      363401
E92     694501      705408
E49     362370      368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")
```
then create 'IRanges' out of each
```
library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene)
```
(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')
```
olaps <- findOverlaps(isnps, igenes)
```
two of the SNPs overlap
```
> queryHits(olaps)
[1] 2 3
```
and they overlap the first and second genes
```
> subjectHits(olaps)
[1] 1 2
```
If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as
```
> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
    SNP     BP Gene BP_start BP_end
2 rs319 345367 E613   345344 363401
3 rs285 700042  E92   694501 705408
```
Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you
```
library(GenomicRanges)
isnps <-
    with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
    with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

心在旅途

2020-11-28 13:42

You can merge by range in base by using apply and which. To return only those which match (inner join):

do.call(rbind, apply(file1test, 1, function(x) {
  if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
    cbind(rbind(x), file2[idx,])
  }
}))
#      SNP     BP  Gene BP_start BP_end
#x  rs2343 860269 E3543   860260 879955
#1   rs754 861822 E3543   860260 879955
#2   rs754 861822   E11   861322 879533
#x1  rs854 367934  E613   367640 368634

or all from file1test (left join):

do.call(rbind, apply(file1test, 1, function(x) {
  if(length(idx <- which(x["BP"] >= file2$BP_start & x["BP"] <= file2$BP_end)) > 0) {
    cbind(rbind(x), file2[idx,])
  } else {
    cbind(rbind(x), file2[1,][NA,])
  }
}))
#      SNP     BP  Gene BP_start BP_end
#x  rs2343 860269 E3543   860260 879955
#x1  rs211 369640  <NA>       NA     NA
#1   rs754 861822 E3543   860260 879955
#2   rs754 861822   E11   861322 879533
#x2  rs854 367934  E613   367640 368634
#x3  rs343 706940  <NA>       NA     NA
#x4  rs626 717244  <NA>       NA     NA

0 讨论(0)

一整个雨季

2020-11-28 13:45

I believe what you're asking for is a conditional join. They're easy in SQL, and the sqldf package makes it easy to query data frames in R using SQL.

Just pick a version depending on how you want unmatched SNPs handled.

Inner join version:

> sqldf("select * from file1test f1 inner join file2 f2 
+       on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")

Output:

     SNP     BP  Gene BP_start BP_end
1 rs2343 860269 E3543   860260 879955
2  rs754 861822 E3543   860260 879955
3  rs754 861822   E11   861322 879533
4  rs854 367934  E613   367640 368634
>

Left Join version:

> sqldf("select * from file1test f1 left join file2 f2 
+       on (f1.BP > f2.BP_start and f1.BP<= f2.BP_end) ")

Output:

     SNP     BP  Gene BP_start BP_end
1 rs2343 860269 E3543   860260 879955
2  rs211 369640  <NA>       NA     NA
3  rs754 861822 E3543   860260 879955
4  rs754 861822   E11   861322 879533
5  rs854 367934  E613   367640 368634
6  rs343 706940  <NA>       NA     NA
7  rs626 717244  <NA>       NA     NA
>

Note that you may want to be careful where you place the = if it matters which group a BP will fall in for the case where a BP exactly matches a BP_start or BP_end.

0 讨论(0)