问题
How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?
My large dataframe, captures, looks similar to this when called via head(captures)
:
id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1
My small dataframe, seasons, looks similar to this in its entirety:
Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15
I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date
falls in the ranges defined in seasons.
Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.
#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
for (j in 1:length(captures$id{
captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
}
}
Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!
回答1:
It would be great indeed if you could do a join
operation efficiently based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. In the time being, I suggest using a single for
loop.
The efficiency of vectorization is best done along the tallest data. That is, if we loop on one data.frame and vectorize the other, it makes more sense to vectorize the longer vector and loop on the shorter ones. With this in mind, we'll loop on the frame of seasons and vectorize the 2M rows of data.
Your data:
txt <- "Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)
And the start the process, we assume that all data's season
is as yet not defined:
dat$season <- NA
Loop around each of the seasons' rows:
for (i in seq_len(nrow(seasons))) {
dat$season <- ifelse(is.na(dat$season) &
dat$date >= seasons$Opening.Date[i] &
dat$date < seasons$Closing.Date[i],
seasons$Season[i], dat$season)
}
dat
# id date sex season
# 1 160520 2016-11-22 1 2016
# 2 1029735 2016-11-12 1 2016
# 3 1885200 2016-11-05 1 2016
# 4 2058366 2015-09-26 2 2015
# 5 2058367 2015-09-26 1 2015
# 6 2058368 2015-09-26 1 2015
回答2:
Here's using non equi joins from data.table:
require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)
ans <- seasons[captures, Season,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
seasons[, season := ans]
For each row in captures
, the index corresponding to the first matching row (mult="first"
) in seasons
is figured out based on the condition provided to on
argument. The value of Season
for corresponding indices is then returned and saved under ans
. It is then added as a new column to seasons
by reference.
I've shown it in two steps for sake of understanding.
You can see the first matching indices by using which=TRUE
instead:
seasons[captures,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first",
which=TRUE]
# [1] 1 1 1 2 2 2
回答3:
You could try with sqldf
. Note, I had to change the point in Opening_Date and Closing_Date to an "_".
library(sqldf)
captures$season <- sqldf("select Season from seasons s, captures c
where c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures
id date sex Season
1 160520 2016-11-22 1 2016
2 1029735 2016-11-12 1 2016
3 1885200 2016-11-05 1 2016
4 2058366 2015-09-26 2 2015
5 2058367 2015-09-26 1 2015
6 2058368 2015-09-26 1 2015
data
txt <- "Season Opening_Date Closing_Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)
来源:https://stackoverflow.com/questions/43480624/categorize-based-on-date-ranges-in-r