Merge 2 dataframes if value within range


I have been struggling with this for some time now and couldn't find any way of doing it, so I would be incredibly grateful if you could help! I am a novice in programming.


4 answers

独厮守ぢ

2020-11-27 22:21

You can use the sqldf package:

```
library(sqldf)

#dummy data
fixes <- read.table(text="
Order Participant Sentence Fixation StartPosition
1       1          1         1       -6.89
2       1          1         2       -5.88
3       1          1         3       -5.33
4       1          1         4       -4.09
5       1          1         5       -5.36
",header=TRUE)
zones <- read.table(text="
Sentence     Zone  ZoneStart   ZoneEnd
1           1     -8.86      -7.49
1           2     -7.49      -5.89
1           3     -5.88      -4.51
1           4     -4.51      -2.90
",header=TRUE)

#output merged result
res <- 
  sqldf("SELECT [Order],Participant,f.Sentence,Fixation,StartPosition,Zone
       FROM fixes f,zones z
       WHERE f.Sentence=z.Sentence AND
             f.StartPosition>=z.ZoneStart AND
             f.StartPosition<z.ZoneEnd")
```
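For comparison, the same range condition can be written in base R without SQL. This is only a sketch, assuming the same fixes and zones data frames as above (recreated here so the block runs on its own); merged, in_zone and res_base are names introduced for illustration.

```r
# Sample data, copied from the answer above so the block is self-contained
fixes <- read.table(text = "
Order Participant Sentence Fixation StartPosition
1 1 1 1 -6.89
2 1 1 2 -5.88
3 1 1 3 -5.33
4 1 1 4 -4.09
5 1 1 5 -5.36
", header = TRUE)
zones <- read.table(text = "
Sentence Zone ZoneStart ZoneEnd
1 1 -8.86 -7.49
1 2 -7.49 -5.89
1 3 -5.88 -4.51
1 4 -4.51 -2.90
", header = TRUE)

# Pair every fixation with every zone of the same sentence ...
merged <- merge(fixes, zones, by = "Sentence")

# ... then keep only the pairs where the position lies in [ZoneStart, ZoneEnd)
in_zone <- merged$StartPosition >= merged$ZoneStart &
  merged$StartPosition < merged$ZoneEnd
res_base <- merged[in_zone, c("Order", "Participant", "Sentence",
                              "Fixation", "StartPosition", "Zone")]

# merge() does not preserve the original row order, so sort by Order again
res_base <- res_base[order(res_base$Order), ]
rownames(res_base) <- NULL
res_base
```

This version builds every fixation/zone pair within a sentence before filtering, so its memory use grows with the product of the two tables' row counts per sentence.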


忘掉有多难

2020-11-27 22:28
There is a package in Bioconductor called IRanges that does what you want.

First, form an IRanges object for your zones:
```
library(IRanges)
zone.ranges <- with(zones, IRanges(ZoneStart, ZoneEnd))
```
Next, find the overlaps:
```
zone.ind <- findOverlaps(fixes$StartPosition, zone.ranges, select="arbitrary")
```
Now you have indices into the rows of the zones data frame, so you can merge:
```
fixes$Zone <- zones$Zone[zone.ind]
```
Edit: I just realized that you have floating-point values, while IRanges is integer-based. So you would need to multiply the coordinates by 100 and convert them to integers, given your two-decimal precision.
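The scaling step mentioned in the edit might look like the following in base R. It is a sketch under the assumption that all coordinates have at most two decimal places; to_int is a helper name made up for this illustration, and its results would then be passed to IRanges and findOverlaps in place of the raw values.

```r
# Hypothetical helper: scale two-decimal coordinates to whole numbers.
# round() guards against floating-point error before as.integer() truncates.
to_int <- function(x, scale = 100) as.integer(round(x * scale))

start_pos  <- c(-6.89, -5.88, -5.33, -4.09, -5.36)  # fixes$StartPosition
zone_start <- c(-8.86, -7.49, -5.88, -4.51)         # zones$ZoneStart
zone_end   <- c(-7.49, -5.89, -4.51, -2.90)         # zones$ZoneEnd

start_int      <- to_int(start_pos)
zone_start_int <- to_int(zone_start)
zone_end_int   <- to_int(zone_end)
start_int
```

IRanges intervals include both endpoints, so if the zones are meant to be half-open as in the other answers, subtracting 1 from the scaled end coordinates is one possible adjustment.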

野趣味

2020-11-27 22:42

Since version 1.9.8 (on CRAN 25 Nov 2016), data.table has been able to perform non-equi joins and range joins:

```
library(data.table)
setDT(fixes)[setDT(zones),
             on = .(Sentence, StartPosition >= ZoneStart, StartPosition < ZoneEnd),
             Zone := Zone][]
```

```
   Order Participant Sentence Fixation StartPosition Zone
1:     1           1        1        1         -6.89    2
2:     2           1        1        2         -5.88    3
3:     3           1        1        3         -5.33    3
4:     4           1        1        4         -4.09    4
5:     5           1        1        5         -5.36    3
```

Data

```
fixes <- readr::read_table(
  "Order Participant Sentence Fixation StartPosition
  1       1          1         1       -6.89
  2       1          1         2       -5.88
  3       1          1         3       -5.33
  4       1          1         4       -4.09
  5       1          1         5       -5.36"
)
zones <- readr::read_table(
  "Sentence     Zone  ZoneStart   ZoneEnd
  1           1     -8.86      -7.49
  1           2     -7.49      -5.89
  1           3     -5.88      -4.51
  1           4     -4.51      -2.90"
)
```


独厮守ぢ

2020-11-27 22:42
I think the best approach is to change zones to a more friendly format for what you're doing:
```
ZoneLookUp = lapply(split(zones, zones$Sentence), function(x) c(x$ZoneStart, x$ZoneEnd[nrow(x)]))

#$`1`
#[1] -8.86 -7.49 -5.88 -4.51 -2.90
```
Then you can easily look up each zone:
```
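# start with an NA column so the element-wise assignment below is well defined
fixes$Zone = NA_integer_
for(i in seq_len(nrow(fixes))) {
    # look up the breaks by sentence name rather than by list position;
    # right=FALSE makes each interval [start, end), matching the zone table
    fixes$Zone[i] = cut(fixes$StartPosition[i], ZoneLookUp[[as.character(fixes$Sentence[i])]], labels=FALSE, right=FALSE)
}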
```
If performance is an issue, you can take an only slightly less simple approach that handles each sentence in one step, using by or data.table with by.
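One way to avoid the row-by-row loop, sketched here in base R, is to split the row indices by sentence and call cut once per sentence. This assumes, as the lookup above does, that the zones within a sentence are contiguous, sorted by ZoneStart, and numbered in that order; zone_breaks and assign_zones are names introduced for illustration.

```r
# Sample data matching the question, so the block runs on its own
fixes <- data.frame(
  Order = 1:5, Participant = 1, Sentence = 1, Fixation = 1:5,
  StartPosition = c(-6.89, -5.88, -5.33, -4.09, -5.36)
)
zones <- data.frame(
  Sentence = 1, Zone = 1:4,
  ZoneStart = c(-8.86, -7.49, -5.88, -4.51),
  ZoneEnd   = c(-7.49, -5.89, -4.51, -2.90)
)

# Break points per sentence: all zone starts plus the end of the last zone
zone_breaks <- lapply(split(zones, zones$Sentence),
                      function(z) c(z$ZoneStart, z$ZoneEnd[nrow(z)]))

# Hypothetical helper: one vectorised cut() call per sentence
assign_zones <- function(fixes, zone_breaks) {
  zone <- rep(NA_integer_, nrow(fixes))
  rows_by_sentence <- split(seq_len(nrow(fixes)), fixes$Sentence)
  for (s in names(rows_by_sentence)) {
    # sentences without any zone definition keep NA
    if (is.null(zone_breaks[[s]])) next
    rows <- rows_by_sentence[[s]]
    # right = FALSE makes each interval [start, end)
    zone[rows] <- cut(fixes$StartPosition[rows], zone_breaks[[s]],
                      labels = FALSE, right = FALSE)
  }
  zone
}

fixes$Zone <- assign_zones(fixes, zone_breaks)
fixes
```

The loop now runs once per sentence instead of once per fixation, and positions outside every zone come back as NA.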