R - Obtaining the highest/lowest value in a set of columns defined by the value in a different dataframe

Question

I have two dataframes: one (A) containing the start and end dates (Julian date, so a continuous count of days) of an event, and the other (B) containing values at dates from start to beyond the end dates in the first dataframe. The start date in A is stable, the end date varies.

For each row, I want to identify the value with the greatest magnitude of change (the highest and/or lowest value) between the start and end dates in the series in B, and then write the result to a new dataframe.

Example dataframes

dfA <- data.frame(ID = c(1,2,3,4,5), 
                  startDate = rep(1001,5),
                  endDate = c(1007, 1003, 1004, 1005, 1006))

dfB <- data.frame(ID = c(1,2,3,4,5),
                  "1001" = c(0.5,0.3,1,2,1.1),
                  "1002" = c(0.9,0.3,0.5,1.0,1.2), 
                  "1003" = c(0.8,0.3,0.1,1,2), 
                  "1004" = c(1,0.7,0.8,0.9,1.1), 
                  "1005" = c(2,1,3,1,4), 
                  "1006" = c(1,0.5,0.1,0.3,2), 
                  "1007" = c(1,2,3,4,5),
                  "1008" = c(0.5,1,2,1,0.3))

So, for ID = 1, I want to find the lowest value in B between 1001 and 1007, the start and end dates. This would then be repeated for ID = 1, 2, 3, ..., n.

Is there a solution in the tidyverse package for this?

Thanks in advance.

Answer 1:

Inspired by Matt's answer, but taking highest and lowest values inside the time interval (as I understand the question):

library(tidyverse)

test2 <- left_join(dfA, dfB, by = "ID") %>% 
  pivot_longer(-c(ID, startDate, endDate)) %>% 
  # strip the "X" prefix and convert to numeric so the date comparison below is numeric, not string-based
  mutate(name = as.numeric(str_remove(name, "X"))) %>% 
  # keep only the rows with name between startDate and endDate
  filter(name >= startDate & name <= endDate) %>%
  group_by(ID) %>%
  # one row per ID with the highest and lowest value inside the interval
  summarise(highest = max(value), 
            lowest = min(value))
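
The str_remove(name, "X") step is needed because data.frame(), with its default check.names = TRUE, runs column names through make.names(), so names that start with a digit are stored with an "X" prefix. A small check, using a cut-down copy of dfB (the names dfB_small and dfB_raw are made up for this illustration):

```r
# A cut-down copy of dfB (two date columns only) to show the renaming
dfB_small <- data.frame(ID = c(1, 2, 3, 4, 5),
                        "1001" = c(0.5, 0.3, 1, 2, 1.1),
                        "1002" = c(0.9, 0.3, 0.5, 1.0, 1.2))
names(dfB_small)   # "ID", "X1001", "X1002"

# With check.names = FALSE the names are kept exactly as typed
dfB_raw <- data.frame(ID = c(1, 2),
                      "1001" = c(0.5, 0.3),
                      check.names = FALSE)
names(dfB_raw)     # "ID", "1001"
```

After stripping the prefix the column is still character, so converting it with as.numeric() before comparing against startDate and endDate avoids relying on string ordering.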

Answer 2:

It's hard to tell what your expected output should be, but here's a dplyr/tidyverse approach by joining dataframes:

library(tidyverse)

left_join(dfA, dfB, by = "ID") %>% 
  pivot_longer(-c(ID, startDate, endDate)) %>% 
  group_by(ID) %>%
  mutate(name = str_remove(name, "X"),
         highest = max(value), 
         lowest = min(value)) %>% 
  filter(name <= endDate)

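This gives us the following. Note that highest and lowest are computed before the filter() step, so they cover every date column for each ID, not only the dates up to endDate: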

     ID startDate endDate name  value highest lowest
   <dbl>     <dbl>   <dbl> <chr> <dbl>   <dbl>  <dbl>
 1     1      1001    1007 1001    0.5       2    0.5
 2     1      1001    1007 1002    0.9       2    0.5
 3     1      1001    1007 1003    0.8       2    0.5
 4     1      1001    1007 1004    1         2    0.5
 5     1      1001    1007 1005    2         2    0.5
 6     1      1001    1007 1006    1         2    0.5
 7     1      1001    1007 1007    1         2    0.5
 8     2      1001    1003 1001    0.3       2    0.3
 9     2      1001    1003 1002    0.3       2    0.3
10     2      1001    1003 1003    0.3       2    0.3
11     3      1001    1004 1001    1         3    0.1
12     3      1001    1004 1002    0.5       3    0.1
13     3      1001    1004 1003    0.1       3    0.1
14     3      1001    1004 1004    0.8       3    0.1
15     4      1001    1005 1001    2         4    0.3
16     4      1001    1005 1002    1         4    0.3
17     4      1001    1005 1003    1         4    0.3
18     4      1001    1005 1004    0.9       4    0.3
19     4      1001    1005 1005    1         4    0.3
20     5      1001    1006 1001    1.1       5    0.3
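
The question does not define "greatest magnitude of change" precisely. One possible reading, sketched here as an assumption rather than the asker's confirmed intent, is the value inside the start-to-end window that differs most, in absolute terms, from the value at startDate. The column names date and change and the result name biggest_change are made up for this sketch:

```r
library(dplyr)
library(tidyr)

dfA <- data.frame(ID = c(1, 2, 3, 4, 5),
                  startDate = rep(1001, 5),
                  endDate = c(1007, 1003, 1004, 1005, 1006))

dfB <- data.frame(ID = c(1, 2, 3, 4, 5),
                  "1001" = c(0.5, 0.3, 1, 2, 1.1),
                  "1002" = c(0.9, 0.3, 0.5, 1.0, 1.2),
                  "1003" = c(0.8, 0.3, 0.1, 1, 2),
                  "1004" = c(1, 0.7, 0.8, 0.9, 1.1),
                  "1005" = c(2, 1, 3, 1, 4),
                  "1006" = c(1, 0.5, 0.1, 0.3, 2),
                  "1007" = c(1, 2, 3, 4, 5),
                  "1008" = c(0.5, 1, 2, 1, 0.3))

biggest_change <- left_join(dfA, dfB, by = "ID") %>%
  # long format: one row per ID and date, with the "X" prefix dropped
  pivot_longer(-c(ID, startDate, endDate),
               names_to = "date", values_to = "value",
               names_prefix = "X") %>%
  mutate(date = as.numeric(date)) %>%
  # restrict to the window first, so later steps only see in-range values
  filter(date >= startDate, date <= endDate) %>%
  group_by(ID) %>%
  # change relative to the value on the start date (assumed definition)
  mutate(change = value - value[date == startDate]) %>%
  # keep the single row with the largest absolute change per ID
  slice_max(abs(change), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, date, value, change)
```

For ID 1 this keeps date 1005 (value 2, a change of +1.5 from the starting 0.5). When every value in the window equals the start value, as for ID 2, the change is 0 and a single in-window row is returned.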

Answer 3:

Base R solution:

If all dates in range are present as vectors in dfB:

# Build the full range of Julian dates and join dfA to dfB:
jdrng <- seq(min(dfA$startDate, na.rm = TRUE), max(dfA$endDate, na.rm = TRUE))
prod_df <- merge(dfA, dfB, by = "ID")

# Calculate vector indices to be used in moc, max and min value calcs:
vnidx <- which(grepl("^X\\d+", names(prod_df)))
strtidx <- vnidx[match(prod_df$startDate, jdrng)]
endidx <- vnidx[match(prod_df$endDate, jdrng)]

# Calculate moc (relative change from start to end value), max and min vals:
res <- cbind(ID = prod_df$ID, do.call(rbind, Map(function(x, y, z) {
  data.frame(moc = (x[, z] - x[, y]) / x[, y],
             maxval = max(unlist(x[, y:z]), na.rm = TRUE),
             minval = min(unlist(x[, y:z]), na.rm = TRUE))
}, split(prod_df, prod_df$ID), strtidx, endidx)))

If not:

# Ensure all dates in range have a corresponding vector in dfB copy:
jdrng <- seq(min(dfA$startDate, na.rm = TRUE), max(dfA$endDate, na.rm = TRUE))
jdvecs <- as.integer(gsub("\\D+", "", grep("^X\\d+", names(dfB), value = TRUE)))
missing_jd <- setdiff(jdrng, jdvecs)
# Add NA columns only if dates are missing; paste0("X", integer(0)) would create a stray "X" column:
if(length(missing_jd) > 0){dfB[, paste0("X", missing_jd)] <- NA_real_}
prod_df <- merge(dfA,
                 dfB[, c(names(dfB)[!grepl("^X\\d+", names(dfB))],
                         paste0("X", sort(jdrng)))], by = "ID")

# Calculate vector indices to be used in moc, max and min value calcs:
vnidx <- which(grepl("^X\\d+", names(prod_df)))
strtidx <- vnidx[match(prod_df$startDate, jdrng)]
endidx <- vnidx[match(prod_df$endDate, jdrng)]

# Calculate moc (relative change from start to end value), max and min vals:
res <- cbind(ID = prod_df$ID, do.call(rbind, Map(function(x, y, z) {
  data.frame(moc = (x[, z] - x[, y]) / x[, y],
             maxval = max(unlist(x[, y:z]), na.rm = TRUE),
             minval = min(unlist(x[, y:z]), na.rm = TRUE))
}, split(prod_df, prod_df$ID), strtidx, endidx)))
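
As a more compact variation on the same idea, the window columns can be looked up by name instead of by index arithmetic. This is only a sketch: the helper window_range and the result name res_by_name are hypothetical, and it assumes every date in each window exists as a column in dfB (otherwise the column lookup fails).

```r
dfA <- data.frame(ID = c(1, 2, 3, 4, 5),
                  startDate = rep(1001, 5),
                  endDate = c(1007, 1003, 1004, 1005, 1006))

dfB <- data.frame(ID = c(1, 2, 3, 4, 5),
                  "1001" = c(0.5, 0.3, 1, 2, 1.1),
                  "1002" = c(0.9, 0.3, 0.5, 1.0, 1.2),
                  "1003" = c(0.8, 0.3, 0.1, 1, 2),
                  "1004" = c(1, 0.7, 0.8, 0.9, 1.1),
                  "1005" = c(2, 1, 3, 1, 4),
                  "1006" = c(1, 0.5, 0.1, 0.3, 2),
                  "1007" = c(1, 2, 3, 4, 5),
                  "1008" = c(0.5, 1, 2, 1, 0.3))

# Hypothetical helper: max and min of one ID's values between start and end
window_range <- function(id, start, end) {
  cols <- paste0("X", seq(start, end))        # e.g. "X1001", ..., "X1007"
  vals <- unlist(dfB[dfB$ID == id, cols])     # that ID's values in the window
  c(maxval = max(vals, na.rm = TRUE),
    minval = min(vals, na.rm = TRUE))
}

# mapply() returns a 2 x n matrix (rows maxval, minval); transpose to n x 2
res_by_name <- data.frame(
  ID = dfA$ID,
  t(mapply(window_range, dfA$ID, dfA$startDate, dfA$endDate))
)
```

With the example data, ID 1 gets maxval 2 and minval 0.5, and ID 4 gets maxval 2 and minval 0.9.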

Source: https://stackoverflow.com/questions/64886149/r-obtaining-the-highest-lowest-value-in-a-set-of-columns-defined-by-the-value

Tags

dataframe

tidyverse

data-manipulation