Elegant way to filter records based on multiple criteria using R

ぐ巨炮叔叔 提交于 2020-08-26 07:15:07

问题


I have a data frame like as shown below

test_df <- data.frame("subject_id" = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), 
                      "date_1" = c("01/01/2003", "12/31/2007", "12/30/2008", "12/31/2005",
                                   "01/01/2007", "01/01/2013", "12/31/2008", "03/04/2006", 
                                   "12/31/2009", "01/01/2015", "01/01/2009"))

What I would like to do is

  1. Arrange the dates in ascending order for each subject (sort asc within groups)

  2. Remove date records for each subject based on below criteria (year doesn't matter):

    2a. remove only Dec 31st records if the first record of the subject is Jan 1st ex: subject_id = 1

    2b. remove only Jan 1st records if the first record of the subject is Dec 31st ex: subject_id = 2

    2c. remove only Dec 31st records if the subject has both Dec 31st and Jan 1st in their non-first records (meaning from 2nd record till the end of its records) ex:subject_id = 3

I was trying the below

sorted <- test_df %>% arrange(date_1,group_by = subject_id) #Am I right in sorts the dates within group?
test_df$month = month(test_df$date_1)  #get the month
test_df$day = day(test_df$date_1)  #get the year
filter(test_df, month==12 and day == 31)  # doesn't work here

Can you help me with how can I filter out records based on my criteria?

I expect my output to be like as shown below


回答1:


starting_names <- names(test_df)

test_df %>% 
  mutate(date_1 = lubridate::mdy(date_1)) %>% 
  group_by(subject_id) %>% 
  arrange() %>%
  mutate(
    without_year = format(date_1, "%m-%d"),
    first_date = first(without_year),
    has_both = all(c("01-01", "12-31") %in% tail(without_year, -1))
  ) %>%
  filter(!(first_date == "01-01" & without_year == "12-31")) %>%
  filter(!(first_date == "12-31" & without_year == "01-01")) %>%
  filter(!(first_date != "01-01" & first_date != "12-31" & has_both == TRUE & without_year == "12-31")) %>%
  select(all_of(starting_names)) %>%
  ungroup()

gives:

# A tibble: 7 x 2
  subject_id date_1    
       <dbl> <date>    
1          1 2003-01-01
2          1 2008-12-30
3          2 2005-12-31
4          2 2008-12-31
5          3 2006-03-04
6          3 2015-01-01
7          3 2009-01-01



回答2:


You can maybe also try a base solution with a bit of lubridate:

library(lubridate)
# put date_1 as date
test_df$date_1 <- lubridate::mdy(test_df$date_1)
# create the field that's going to be the filter
test_df$cntrl <- paste0(month(test_df$date_1),day(test_df$date_1))

Now the idea is to split your df in a list by group subject_id, then lapply a function that filter using your condition:

# split as list
listed <- split(test_df, test_df$subject_id)

# order each df: requested and fundamental for the function
listed <- lapply(listed, function(df){df[order(df$date_1),]})

# here the function: it's a nested if else statement on the field
filtering <- function(x){if
                        (head(x,1)$cntrl == "11") { x[x$cntrl != '1231', ] }
                        else if
                        (head(x,1)$cntrl == "1231") { x[x$cntrl != '11', ] }
                        else if
                        ( "11" %in% tail(x,nrow(x)-1)$cntrl & "1231" %in% tail(x,nrow(x)-1)$cntrl) { x[x$cntrl != '1231', ] }
                        else(x)}

# lapply it!
listed  <- lapply(listed, function(x)filtering(x))

# now as a dataframe, removing the useless column:
res <- do.call(rbind, listed)[,-3]

# lastly you can rename the rownames
rownames(res) <- 1:nrow(res)

res
 subject_id     date_1
1          1 2003-01-01
2          1 2008-12-30
3          2 2005-12-31
4          2 2008-12-31
5          3 2006-03-04
6          3 2009-01-01
7          3 2015-01-01



回答3:


It's not the prettiest code I've every written, but this works. I assumed the filters were performed in sequence; otherwise, the second and third filters knock out all of subject 2.

    test_df %>%
      mutate(date_1 = as.Date(as.character(date_1), format = "%m/%d/%Y"),
      month = as.numeric(format(date_1, "%m")),
      day = as.numeric(format(date_1, "%d"))) %>%
      group_by(subject_id) %>%
      arrange(date_1) %>%
      filter(!(rep(month[1] == 1 & day[1] == 1, n()) & month == 12 & day == 31)) %>%
      filter(!(rep(month[1] == 12 & day[1] == 31, n()) & month == 1 & day == 1)) %>%
      filter(!(rep(sum(month[-1] == 1 & day[-1] == 1) > 0 & sum(month[-1] == 12 & day[-1] == 31) > 0, n()) & month == 12 & day == 31)) %>%
      ungroup() %>%
      arrange(subject_id, date_1)

      subject_id date_1     month   day
           <dbl> <date>     <dbl> <dbl>
    1          1 2003-01-01     1     1
    2          1 2008-12-30    12    30
    3          2 2005-12-31    12    31
    4          2 2008-12-31    12    31
    5          3 2006-03-04     3     4
    6          3 2009-01-01     1     1
    7          3 2015-01-01     1     1


来源:https://stackoverflow.com/questions/62708660/elegant-way-to-filter-records-based-on-multiple-criteria-using-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!