categorize based on date ranges in R

这一生的挚爱 提交于 2021-01-28 07:55:26

问题


How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?

My large dataframe, captures, looks similar to this when called via head(captures) :

       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1

My small dataframe, seasons, looks similar to this in its entirety:

Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15 

I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.

Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.

#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
  for (j in 1:length(captures$id{
    captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
  }
}

Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!


回答1:


It would be great indeed if you could do a join operation efficiently based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. In the time being, I suggest using a single for loop.

The efficiency of vectorization is best done along the tallest data. That is, if we loop on one data.frame and vectorize the other, it makes more sense to vectorize the longer vector and loop on the shorter ones. With this in mind, we'll loop on the frame of seasons and vectorize the 2M rows of data.

Your data:

txt <- "Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)

And the start the process, we assume that all data's season is as yet not defined:

dat$season <- NA

Loop around each of the seasons' rows:

for (i in seq_len(nrow(seasons))) {
  dat$season <- ifelse(is.na(dat$season) &
                         dat$date >= seasons$Opening.Date[i] &
                         dat$date < seasons$Closing.Date[i],
                       seasons$Season[i], dat$season)                       
}
dat
#        id       date sex season
# 1  160520 2016-11-22   1   2016
# 2 1029735 2016-11-12   1   2016
# 3 1885200 2016-11-05   1   2016
# 4 2058366 2015-09-26   2   2015
# 5 2058367 2015-09-26   1   2015
# 6 2058368 2015-09-26   1   2015



回答2:


Here's using non equi joins from data.table:

require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)

ans <- seasons[captures, Season,
                 on=.(Opening.Date<=date, Closing.Date>=date), 
                 mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
seasons[, season := ans]

For each row in captures, the index corresponding to the first matching row (mult="first") in seasons is figured out based on the condition provided to on argument. The value of Season for corresponding indices is then returned and saved under ans. It is then added as a new column to seasons by reference.

I've shown it in two steps for sake of understanding.


You can see the first matching indices by using which=TRUE instead:

seasons[captures, 
          on=.(Opening.Date<=date, Closing.Date>=date),
          mult="first", 
          which=TRUE]
# [1] 1 1 1 2 2 2



回答3:


You could try with sqldf. Note, I had to change the point in Opening_Date and Closing_Date to an "_".

library(sqldf)

captures$season <- sqldf("select Season from seasons s, captures c
where c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures    
       id       date sex Season
1  160520 2016-11-22   1   2016
2 1029735 2016-11-12   1   2016
3 1885200 2016-11-05   1   2016
4 2058366 2015-09-26   2   2015
5 2058367 2015-09-26   1   2015
6 2058368 2015-09-26   1   2015

data

txt <- "Season Opening_Date Closing_Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)


来源:https://stackoverflow.com/questions/43480624/categorize-based-on-date-ranges-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!