Merge 2 dataframes if value within range

盖世英雄少女心 2020-11-27 22:13

I have been struggling with this for some time now and couldn't find any way of doing it, so I would be incredibly grateful if you could help! I am a novice in programming.

4 Answers
  • 2020-11-27 22:21

    You can use the sqldf package:

    library(sqldf)
    
    #dummy data
    fixes <- read.table(text="
    Order Participant Sentence Fixation StartPosition
    1       1          1         1       -6.89
    2       1          1         2       -5.88
    3       1          1         3       -5.33
    4       1          1         4       -4.09
    5       1          1         5       -5.36 
    ",header=TRUE)
    zones <- read.table(text="
    Sentence     Zone  ZoneStart   ZoneEnd
    1           1     -8.86      -7.49
    1           2     -7.49      -5.89
    1           3     -5.88      -4.51
    1           4     -4.51      -2.90
    ",header=TRUE)
    
    #output merged result
    res <- 
      sqldf("SELECT [Order],Participant,f.Sentence,Fixation,StartPosition,Zone
           FROM fixes f,zones z
           WHERE f.Sentence=z.Sentence AND
                 f.StartPosition>=z.ZoneStart AND
                 f.StartPosition<z.ZoneEnd")
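
    For readers who do not have sqldf installed, the same logic can be sketched in base R. This is only an illustration of what the query does, not part of the original answer: merge() pairs each fixation with every zone of its sentence, and subset() keeps the pairs whose position lies in [ZoneStart, ZoneEnd). The names pairs and res_base are made up for this example.

    ```r
    # Dummy data, same values as above
    fixes <- read.table(text = "
    Order Participant Sentence Fixation StartPosition
    1 1 1 1 -6.89
    2 1 1 2 -5.88
    3 1 1 3 -5.33
    4 1 1 4 -4.09
    5 1 1 5 -5.36
    ", header = TRUE)
    zones <- read.table(text = "
    Sentence Zone ZoneStart ZoneEnd
    1 1 -8.86 -7.49
    1 2 -7.49 -5.89
    1 3 -5.88 -4.51
    1 4 -4.51 -2.90
    ", header = TRUE)

    # Pair every fixation with every zone of the same sentence
    # (the f.Sentence = z.Sentence part of the WHERE clause)
    pairs <- merge(fixes, zones, by = "Sentence")

    # Keep only the pairs whose position falls in [ZoneStart, ZoneEnd)
    res_base <- subset(pairs,
                       StartPosition >= ZoneStart & StartPosition < ZoneEnd,
                       select = c(Order, Participant, Sentence,
                                  Fixation, StartPosition, Zone))

    # Restore the original fixation order
    res_base <- res_base[order(res_base$Order), ]
    ```

    As with the SQL version, a fixation that falls outside every zone is dropped from the result. The intermediate table has one row per fixation and zone pair, so this sketch is best suited to small data.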
    
  • 2020-11-27 22:28

    There is a package in Bioconductor called IRanges that does what you want.

    First, form an IRanges object for your zones:

    zone.ranges <- with(zones, IRanges(ZoneStart, ZoneEnd))
    

    Next, find the overlaps:

    zone.ind <- findOverlaps(fixes$StartPosition, zone.ranges, select="arbitrary")
    

    Now you have indices into the rows of the zones data frame, so you can merge:

    fixes$Zone <- zones$Zone[zone.ind]
    

    Edit: I just realized that your values are floating point, while IRanges is integer-based, so you would need to multiply the coordinates by 100 (given your two-decimal precision) before building the ranges.
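
    A minimal sketch of that scaling step, assuming the IRanges package from Bioconductor is installed and that all coordinates have at most two decimals. The helper to_int and the variable fix.ranges are hypothetical names introduced here. Because IRanges intervals include both end points, 1 is subtracted from each scaled ZoneEnd so that the end stays exclusive, and each fixation is wrapped in a width-1 range before calling findOverlaps. Like the steps above, the sketch ignores Sentence, so it only applies as written to data with a single sentence.

    ```r
    library(IRanges)  # Bioconductor package; assumed to be installed

    fixes <- data.frame(Sentence = 1L,
                        StartPosition = c(-6.89, -5.88, -5.33, -4.09, -5.36))
    zones <- data.frame(Sentence = 1L, Zone = 1:4,
                        ZoneStart = c(-8.86, -7.49, -5.88, -4.51),
                        ZoneEnd   = c(-7.49, -5.89, -4.51, -2.90))

    # Hypothetical helper: turn two-decimal coordinates into integers
    to_int <- function(x) as.integer(round(x * 100))

    # Ranges are closed, so subtract 1 to keep ZoneEnd exclusive
    zone.ranges <- IRanges(start = to_int(zones$ZoneStart),
                           end   = to_int(zones$ZoneEnd) - 1L)

    # Represent each fixation as a range of width 1
    fix.ranges <- IRanges(start = to_int(fixes$StartPosition), width = 1L)

    # Index of the zone hit by each fixation (NA if there is none)
    zone.ind <- findOverlaps(fix.ranges, zone.ranges, select = "arbitrary")
    fixes$Zone <- zones$Zone[zone.ind]
    ```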

  • 2020-11-27 22:42

    With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins and range joins:

    library(data.table)
    setDT(fixes)[setDT(zones), 
                 on = .(Sentence, StartPosition >= ZoneStart, StartPosition < ZoneEnd), 
                 Zone := Zone][]
    
       Order Participant Sentence Fixation StartPosition Zone
    1:     1           1        1        1         -6.89    2
    2:     2           1        1        2         -5.88    3
    3:     3           1        1        3         -5.33    3
    4:     4           1        1        4         -4.09    4
    5:     5           1        1        5         -5.36    3
    

    Data

    fixes <- readr::read_table(
      "Order Participant Sentence Fixation StartPosition
      1       1          1         1       -6.89
      2       1          1         2       -5.88
      3       1          1         3       -5.33
      4       1          1         4       -4.09
      5       1          1         5       -5.36"
    )
    zones <- readr::read_table(
      "Sentence     Zone  ZoneStart   ZoneEnd
      1           1     -8.86      -7.49
      1           2     -7.49      -5.89
      1           3     -5.88      -4.51
      1           4     -4.51      -2.90"
    )
    
  • 2020-11-27 22:42

    I think the best approach is to change zones to a more friendly format for what you're doing:

    ZoneLookUp = lapply(split(zones, zones$Sentence), function(x) c(x$ZoneStart, x$ZoneEnd[nrow(x)]))
    
    #$`1`
    #[1] -8.86 -7.49 -5.88 -4.51 -2.90
    

    Then you can easily look up each zone:

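    fixes$Zone = NA_integer_
    for(i in seq_len(nrow(fixes)))
        # look up the breaks by sentence name; right=FALSE gives [start, end) intervals
        fixes$Zone[i] = cut(fixes$StartPosition[i], ZoneLookUp[[as.character(fixes$Sentence[i])]], labels=FALSE, right=FALSE)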
    

    If performance is an issue, you can take an only slightly more involved approach and call cut once per sentence rather than once per row, using by or data.table with by.
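
    One way to sketch the per-sentence idea, under the assumption that a plain loop over the sentences is acceptable in place of by. It makes one vectorised cut call per sentence. The variable rows is introduced only for this example, and right=FALSE is used so that the intervals are [start, end).

    ```r
    fixes <- data.frame(Order = 1:5, Sentence = 1L,
                        StartPosition = c(-6.89, -5.88, -5.33, -4.09, -5.36))
    zones <- data.frame(Sentence = 1L, Zone = 1:4,
                        ZoneStart = c(-8.86, -7.49, -5.88, -4.51),
                        ZoneEnd   = c(-7.49, -5.89, -4.51, -2.90))

    # Break points per sentence: every ZoneStart plus the last ZoneEnd
    ZoneLookUp <- lapply(split(zones, zones$Sentence),
                         function(x) c(x$ZoneStart, x$ZoneEnd[nrow(x)]))

    # One vectorised cut() call per sentence instead of one per row
    fixes$Zone <- NA_integer_
    for (s in names(ZoneLookUp)) {
      rows <- which(as.character(fixes$Sentence) == s)
      fixes$Zone[rows] <- cut(fixes$StartPosition[rows], ZoneLookUp[[s]],
                              labels = FALSE, right = FALSE)
    }
    ```

    Note that cut returns the index of the interval, so this matches the Zone column only when the zones of each sentence are numbered 1, 2, 3, ... in increasing order of ZoneStart.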
