Question:
I'm trying to subset some data and am stuck on the last part of cleaning.
What I need to do is calculate, for each individual (indivID), the number of observations in each of the months June, July, and August, express that as a percentage of days without missing data, and then keep only the observations that are over 75%.
I was able to create a nested for loop, but it took probably 6 hours to process today. I would like to take advantage of parallel computing by using ddply, or another function, but am very lost.
Here's the data (Note this is a very small subset that only includes individuals from 1:5): https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0
And here's the for loop:
epa.d <- read.csv("/.../data.csv")
#Function for loops
days <- function(month) {
  if (month == 6) return(30)
  if (month == 7) return(31)
  if (month == 8) return(31)
}
#Subset data for 75% in June, July, and August
for (i in unique(epa.d$indivID)) {
  for (j in unique(epa.d$year)) {
    for (k in unique(epa.d$month)) {
      monthsum <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k)
      monthperc <- (monthsum / days(k)) * 100
      if (monthperc < 75) {
        epa.d <- epa.d[!(epa.d$indivID == i & epa.d$year == j), ]
      }
    }
  }
}
Answer 1:
If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:
library(dplyr)
# For each indivID, calculate percent of days in each month with
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)
We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.
# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
all.x=TRUE)
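As a side note (not part of the original answer): if you only need the qualifying daily rows and don't want epa.d_75's summary columns copied into the result, dplyr's semi_join() filters epa.d directly against the qualifying combinations. A minimal self-contained sketch on toy data, with column names assumed to match the real file:

```r
library(dplyr)

# Toy stand-in for the real data: indivID 1 has 30/30 June days (100%
# coverage), indivID 2 has only 10/30 (~33%), so only indivID 1 survives.
epa.d <- data.frame(
  indivID = rep(c(1, 2), c(30, 10)),
  year    = 2014,
  month   = 6,
  day     = c(1:30, 1:10),
  ozone   = rnorm(40)
)

epa.d_75 <- epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)

# semi_join() keeps the rows of epa.d that have a match in epa.d_75,
# without adding any of epa.d_75's columns to the result.
daily_75 <- semi_join(epa.d, epa.d_75, by = c("indivID", "month", "year"))
```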
Update: To answer the questions in the comments:
Can you explain what the %>% is doing, and if possible a breakdown of how you logically thought about this?
The %>% is a "chaining" operator that allows you to chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr vignette to learn more about how to use it. Here's how the logic works in this case: group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year; then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31))
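For intuition, x %>% f(y) is just f(x, y), so the chain above is equivalent to nesting the same calls inside out. A tiny demonstration on toy data:

```r
library(dplyr)

d <- data.frame(g = c("a", "a", "b"), v = 1:3)

# The piped form reads left to right...
piped <- d %>% group_by(g) %>% summarise(count = n())

# ...and is equivalent to nesting the same calls inside out.
nested <- summarise(group_by(d, g), count = n())

identical(piped, nested)  # TRUE
```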
- Why is this so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its magic in C code under the hood, which is faster than native R. Hopefully someone else can give a more precise and detailed answer.
Answer 2:
Another option would be to use data.table (similar to @eipi10's dplyr method), which would be very fast.
library(data.table)
epa.d_75 <- setDT(epa.d)[, list(pctCoverage=ifelse(month==6, .N/30,
.N/31)),by=list(indivID, year, month)][pctCoverage >=0.75]
epa.d_75New = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
all.x=TRUE)
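A self-contained run of the same chain on toy data (column names assumed to match the real file). Note that because month is a by-column, it is a single value inside j, so the ifelse() picks exactly one denominator per group:

```r
library(data.table)

# Toy data: indivID 1 has 31/31 July days, indivID 2 only 20/31 (~65%).
epa.d <- data.frame(
  indivID = rep(c(1, 2), c(31, 20)),
  year    = 2014,
  month   = 7,
  day     = c(1:31, 1:20)
)

epa.d_75 <- setDT(epa.d)[, list(pctCoverage = ifelse(month == 6, .N/30, .N/31)),
                         by = list(indivID, year, month)][pctCoverage >= 0.75]
# Only indivID 1 (31/31 = 100% coverage) remains.
```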
data
epa.d <- read.csv('data.csv', row.names=1)
Source: https://stackoverflow.com/questions/27074023/subsetting-based-on-observations-in-a-month