Question
I have a nested list
combine <- list(c('A', 'B', 'C'), c('D', 'H', 'G', 'J'), c('A', 'E'))
and a data frame
df <- data.frame(appln_id = c(1, 1, 2, 2, 4, 4, 4, 3, 3, 3, 3, 5, 9, 9),
                 prior_year = c(1997, 1997, 1997, 1997, 1997, 1997, 1997, 1998, 1998, 1998, 1998, 2000, 2000, 2000),
                 IPC = c('B', 'E', 'E', 'B', 'H', 'J', 'D', 'H', 'J', 'D', 'E', 'A', 'E', 'B'))
I want to aggregate IPC according to appln_id (e.g. for appln_id = 1: c('B','E'); for appln_id = 2: c('E','B'); for appln_id = 4: c('H','J','D'); etc.). Then, for each value of prior_year, I want to compare the IPC sets to the elements of the list combine.
For every IPC set that is not a subset of any element of combine, I want to save its rows of df in another data frame called new, as follows:
new <- data.frame(appln_id = c(1, 1, 3, 3, 3, 3),
                  prior_year = c(1997, 1997, 1998, 1998, 1998, 1998),
                  IPC = c('B', 'E', 'H', 'J', 'D', 'E'))
and add this IPC set to combine, as follows:
combine <- list(c('A', 'B', 'C'), c('D', 'H', 'G', 'J'), c('A', 'E'), c('B', 'E'), c('D','E','J','H'))
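To make the subset test concrete, this is the kind of check I mean (an illustration only, not my actual attempt):
# Illustration only: the per-appln_id IPC sets, and the "is this set contained
# in some element of combine?" test described above.
ipc_sets <- split(as.character(df$IPC), df$appln_id)
is_covered <- function(x, sets) any(sapply(sets, function(k) all(x %in% k)))
is_covered(ipc_sets[["1"]], combine)  # FALSE: c('B','E') is not contained in any element
is_covered(ipc_sets[["4"]], combine)  # TRUE:  c('H','J','D') is contained in c('D','H','G','J')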
This is my code:
new <- data.frame(appln_id = integer(), prior_year = integer(), IPC = character())
new_combine <- list()
prior_year <- unique(df$prior_year)
appln_id <- unique(df$appln_id)
for (i in prior_year) {
  for (j in appln_id) {
    # the IPC set for this appln_id / prior_year pair
    x <- sort(df[df$prior_year == i & df$appln_id == j, 3])
    for (k in combine) {
      if (!all(x %in% k)) {
        new <- rbind(new, df[df$appln_id == j, ])
        new_combine[[length(new_combine) + 1]] <- x
      }
    }
  }
  combine <- c(combine, unique(new_combine))
}
However, my code takes too long to run. Does anyone have another, faster way to do this? Thank you.
Answer 1:
Here's something that is only a single loop. Up front, though, I changed $IPC from factor to character, since merging differing factor levels can be a little annoying. (If you're on R 4.0 or later, or $IPC is already character, there's no need for this step.)
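The conversion itself is the standard coercion (shown here explicitly in case it's needed):
df$IPC <- as.character(df$IPC)  # only needed if IPC is a factor (the pre-R-4.0 default)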
df$usable <- TRUE
df$grps <- interaction(df$appln_id, df$prior_year)
newlist <- list()
for (grp in levels(df$grps)) {
  rows <- df$grps == grp & df$usable
  if (!any(rows)) next              # skip id/year combinations with no usable rows
  thisIPC <- df$IPC[rows]
  # is this IPC set a subset of some element already in 'combine'?
  matches <- sapply(combine, function(comb) all(thisIPC %in% comb))
  if (any(matches)) {
    # repeat
  } else {
    # new!
    combine <- c(combine, list(thisIPC))
    newlist <- c(newlist, list(df[rows, ]))
    df$usable[rows] <- FALSE
  }
}
df <- df[df$usable, ]
new <- do.call(rbind, newlist)
df$usable <- df$grps <-
  new$usable <- new$grps <- NULL
df
# appln_id prior_year IPC
# 3 2 1997 E
# 4 2 1997 B
# 5 4 1997 H
# 6 4 1997 J
# 7 4 1997 D
# 12 5 2000 A
# 13 9 2000 E
# 14 9 2000 B
new
# appln_id prior_year IPC
# 1 1 1997 B
# 2 1 1997 E
# 8 3 1998 H
# 9 3 1998 J
# 10 3 1998 D
# 11 3 1998 E
str(combine)
# List of 5
# $ : chr [1:3] "A" "B" "C"
# $ : chr [1:4] "D" "H" "G" "J"
# $ : chr [1:2] "A" "E"
# $ : chr [1:2] "B" "E"
# $ : chr [1:4] "H" "J" "D" "E"
Notes:
- I create the $grps variable to make easy single-loop grouping; once this loop is done, feel free to remove it. Using factor and then levels ensures that I iterate over every present combination, nothing more. (A small illustration of interaction() appears after these notes.)
- I may be going to more extremes than necessary, but iteratively growing frames is bad in the long term for performance: each time you "add rows", the entire frame is completely copied in memory, so with each addition you duplicate the memory taken by the frame. Granted, the memory is cleared, but it is a "known thing" that this slows things down noticeably, asymptotically. (See chapter 2, Growing Objects, in the R Inferno.) This applies (to a slightly lesser degree) to iteratively removing rows, too.
- Because of this, I don't actually change the contents of the frame until the very end. To accommodate this, I also add a column $usable to indicate whether a row should be removed at the end. (In the unlikely event that you run this code twice on the same frame, I also use $usable when grabbing $IPC; that might just be overly defensive.)
- Post-loop, I remove the relevant rows from df once, and do a single row-concatenation (rbind) on newlist, which is a list of frames (or empty, if nothing happened).
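As a side illustration (my own toy example, not part of the original answer), interaction() combines the grouping columns into a single factor with one level per appln_id/prior_year combination, which is what makes the single loop over levels(df$grps) possible:
# Illustration with toy vectors: one factor level per combination of the inputs
interaction(c(1, 1, 2), c(1997, 1997, 1998))
# [1] 1.1997 1.1997 2.1998
# Levels: 1.1997 2.1997 1.1998 2.1998
Combinations that never occur (like 2.1997 above) match no rows, which is why the loop skips groups with no usable rows.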
Source: https://stackoverflow.com/questions/62077664/comparison-loop