Question
I have a nested list
combine <- list(c('A', 'B', 'C'), c('D', 'H', 'G', 'J'), c('A', 'E'))
and a data frame
df <- data.frame(appln_id = c(1, 1, 2, 2, 4, 4, 4, 3, 3, 3, 3, 5, 9, 9),
                 prior_year = c(1997, 1997, 1997, 1997, 1997, 1997, 1997, 1998, 1998, 1998, 1998, 2000, 2000, 2000),
                 IPC = c('B', 'E', 'E', 'B', 'H', 'J', 'D', 'H', 'J', 'D', 'E', 'A', 'E', 'B'))
I want to aggregate IPC according to appln_id (e.g. for appln_id = 1: c('B','E'); for appln_id = 2: c('E','B'); for appln_id = 4: c('H','J','D'); etc.). Then, for each value of prior_year, I want to compare the IPC sets to the elements of the list combine.
For every IPC set that is not a subset of any element of combine, I want to save its rows of df in another data frame called new, as follows:
new <- data.frame(appln_id = c(1, 1, 3, 3, 3, 3),
                  prior_year = c(1997, 1997, 1998, 1998, 1998, 1998),
                  IPC = c('B', 'E', 'H', 'J', 'D', 'E'))
and add this IPC set to combine, as follows:
combine <- list(c('A', 'B', 'C'), c('D', 'H', 'G', 'J'), c('A', 'E'), c('B', 'E'), c('D','E','J','H'))
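To make the subset test concrete, this is the kind of check I mean (an illustration only, not my actual attempt):
# Illustration only: the per-appln_id IPC sets, and the "is this set contained
# in some element of combine?" test described above.
ipc_sets <- split(as.character(df$IPC), df$appln_id)
is_covered <- function(x, sets) any(sapply(sets, function(k) all(x %in% k)))
is_covered(ipc_sets[["1"]], combine)  # FALSE: c('B','E') is not contained in any element
is_covered(ipc_sets[["4"]], combine)  # TRUE:  c('H','J','D') is contained in c('D','H','G','J')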
This is my code:
new <- data.frame(appln_id = integer(), prior_year = integer(), IPC = character())
new_combine <- list()
prior_year <- unique(df$prior_year)
appln_id <- unique(df$appln_id)
for (i in prior_year) {
  for (j in appln_id) {
    # the IPC set for this appln_id / prior_year pair
    x <- sort(df[df$prior_year == i & df$appln_id == j, 3])
    for (k in combine) {
      if (!all(x %in% k)) {
        new <- rbind(new, df[df$appln_id == j, ])
        new_combine[[length(new_combine) + 1]] <- x
      }
    }
  }
  combine <- c(combine, unique(new_combine))
}
However, my code takes too long to run. Does anyone have another, faster way to do this? Thank you.
Answer 1:
Here's something that is only a single loop. Up front, though, I changed $IPC from factor to character, since merging differing factor levels can be a little annoying. (If you're on R 4.0 or later, or $IPC is already character, there's no need for this step.)
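The conversion itself is the standard coercion (shown here explicitly in case it's needed):
df$IPC <- as.character(df$IPC)  # only needed if IPC is a factor (the pre-R-4.0 default)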
df$usable <- TRUE
df$grps <- interaction(df$appln_id, df$prior_year)
newlist <- list()
for (grp in levels(df$grps)) {
  rows <- df$grps == grp & df$usable
  if (!any(rows)) next              # skip id/year combinations with no usable rows
  thisIPC <- df$IPC[rows]
  # is this IPC set a subset of some element already in 'combine'?
  matches <- sapply(combine, function(comb) all(thisIPC %in% comb))
  if (any(matches)) {
    # repeat
  } else {
    # new!
    combine <- c(combine, list(thisIPC))
    newlist <- c(newlist, list(df[rows, ]))
    df$usable[rows] <- FALSE
  }
}
df <- df[df$usable, ]
new <- do.call(rbind, newlist)
df$usable <- df$grps <-
  new$usable <- new$grps <- NULL
df
# appln_id prior_year IPC
# 3 2 1997 E
# 4 2 1997 B
# 5 4 1997 H
# 6 4 1997 J
# 7 4 1997 D
# 12 5 2000 A
# 13 9 2000 E
# 14 9 2000 B
new
# appln_id prior_year IPC
# 1 1 1997 B
# 2 1 1997 E
# 8 3 1998 H
# 9 3 1998 J
# 10 3 1998 D
# 11 3 1998 E
str(combine)
# List of 5
# $ : chr [1:3] "A" "B" "C"
# $ : chr [1:4] "D" "H" "G" "J"
# $ : chr [1:2] "A" "E"
# $ : chr [1:2] "B" "E"
# $ : chr [1:4] "H" "J" "D" "E"
Notes:
- I create the $grps variable to make easy single-loop grouping; once this loop is done, feel free to remove it. Using factor and then levels ensures that I iterate over every present combination, nothing more. (A small illustration of interaction() appears after these notes.)
- I may be going to more extremes than necessary, but iteratively growing frames is bad in the long term for performance: each time you "add rows", the entire frame is completely copied in memory, so with each addition you duplicate the memory taken by the frame. Granted, the memory is cleared, but it is a "known thing" that this slows things down noticeably, asymptotically. (See chapter 2, Growing Objects, in the R Inferno.) This applies (to a slightly lesser degree) to iteratively removing rows, too.
- Because of this, I don't actually change the contents of the frame until the very end. To accommodate this, I also add a column $usable to indicate whether a row should be removed at the end. (In the unlikely event that you run this code twice on the same frame, I also use $usable when grabbing $IPC; that might just be overly defensive.)
- Post-loop, I remove the relevant rows from df once, and do a single row-concatenation (rbind) on newlist, which is a list of frames (or empty, if nothing happened).
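As a side illustration (my own toy example, not part of the original answer), interaction() combines the grouping columns into a single factor with one level per appln_id/prior_year combination, which is what makes the single loop over levels(df$grps) possible:
# Illustration with toy vectors: one factor level per combination of the inputs
interaction(c(1, 1, 2), c(1997, 1997, 1998))
# [1] 1.1997 1.1997 2.1998
# Levels: 1.1997 2.1997 1.1998 2.1998
Combinations that never occur (like 2.1997 above) match no rows, which is why the loop skips groups with no usable rows.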
Source: https://stackoverflow.com/questions/62077664/comparison-loop