accelerating code execution in R to provide counting of all possible combinations of events belonging to certain ID

问题

I have a data set that has 3 columns , (ID , D , AE).

sample=data.frame(
ID=c(1,1,1,2,2,2),
  D=c('a','b','c','a','c','c'),
  AE=c('m','x','w','y','m','f')
      )

I want to count the number of IDs where all possible combinations that consist of any combination between any two drugs within a certain ID and the AEs corresponding to that ID. please see the image to understand exactly what i mean enter image description here.

Someone could help me with a code that worked perfectly on the small dataset(sample). However the code would take several hours to work on the actual dataset that has (46000 unique IDs , 1600 unique D values and 3200 unique AE values). Actually i had to interrupt the session after 3 hours (while code execution) without getting any output.

here is the code that i want to improve its performance to suit huge dataset:

enter code here

library(tidyverse)

combinations <- sample %>%
  mutate(D2 = D) %>%
  group_by(ID) %>%
  expand(crossing(D, D2, AE)) %>% # Get all D1, D2, AE combinations within-ID
  filter(D2 > D) %>% # Deduplicate to unique combinations
  rename(D1 = D) %>%
  ungroup() %>%
  distinct(D1, D2, AE) # Deduplicate across IDs

# For a given combination of D1, D2, AE; check how many IDs in sample have that combination
count_ids <- function(D1_val, D2_val, AE_val, data) {
  data %>%
    group_by(ID) %>%
    mutate(
      has_D1 = if_else(D1_val %in% D, "D1", "no D1"),
      has_D2 = if_else(D2_val %in% D, "D2", "no D2"),
      has_AE = if_else(AE_val %in% AE, "AE", "no AE")
    ) %>% 
    group_by(has_D1, has_D2, has_AE) %>%
    summarise(n_IDs = n_distinct(ID), .groups = "drop") %>%
    list(.)
}

combinations %>%
  mutate(data = list(sample)) %>%
  rowwise() %>%
  mutate(data = count_ids(D1, D2, AE, data)) %>% # Get the ID counts for each combination
  unnest(data) %>%
  mutate(colname = str_c(has_D1, has_D2, has_AE, sep = ",")) %>% # Create a column name for each possibility of the combinations
  select(-starts_with("has_")) %>%
  pivot_wider(names_from = colname, values_from = n_IDs, values_fill = 0L) # Spread out to wide format

I would really appreciate your help. Thanks in advance

来源：https://stackoverflow.com/questions/64140017/accelerating-code-execution-in-r-to-provide-counting-of-all-possible-combination

标签

counting

data-wrangling