Question
I have a data frame from several experiments. I am looking to calculate cumulative number of unique values obtained after each successive experiment.
For example, consider:
test <- data.frame(
  exp = c(rep("exp1", 4), rep("exp2", 4), rep("exp3", 4), rep("exp4", 5)),
  entries = c("abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "abcd", "efgh",
              "ijkl", "qrst", "uvwx", "yzab", "yzab", "cdef", "mnop", "uvwx", "ghij")
)
> test
exp entries
1 exp1 abcd
2 exp1 efgh
3 exp1 ijkl
4 exp1 mnop
5 exp2 qrst
6 exp2 uvwx
7 exp2 abcd
8 exp2 efgh
9 exp3 ijkl
10 exp3 qrst
11 exp3 uvwx
12 exp3 yzab
13 exp4 yzab
14 exp4 cdef
15 exp4 mnop
16 exp4 uvwx
17 exp4 ghij
The total number of unique entries is nine. Now I want the result to look like:
exp cum_unique_entries
1 exp1 4
2 exp2 6
3 exp3 7
4 exp4 9
Finally, I want to plot this as a barplot. I can do this with a for-loop approach, but I feel there has to be a more elegant way.
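For reference, a plain for-loop version of what the question describes might look like this (a sketch; the example data is repeated so the snippet runs stand-alone):

```r
# Example data from the question
test <- data.frame(
  exp = c(rep("exp1", 4), rep("exp2", 4), rep("exp3", 4), rep("exp4", 5)),
  entries = c("abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "abcd", "efgh",
              "ijkl", "qrst", "uvwx", "yzab", "yzab", "cdef", "mnop", "uvwx", "ghij")
)

# For-loop baseline: accumulate the set of entries seen so far,
# recording its size after each experiment
exps <- unique(test$exp)
seen <- character(0)                  # entries observed so far
cum_unique <- integer(length(exps))
for (i in seq_along(exps)) {
  seen <- union(seen, test$entries[test$exp == exps[i]])
  cum_unique[i] <- length(seen)
}
result <- data.frame(exp = exps, cum_unique_entries = cum_unique)
# cum_unique_entries is 4, 6, 7, 9
```

The answers below replace this explicit loop with vectorised idioms.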
Answer 1:
Here's a solution with dplyr:
library(dplyr)
test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  slice(n()) %>%
  select(-entries)
or
test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))
Result:
# A tibble: 4 x 2
     exp cum_unique_entries
  <fctr>              <int>
1   exp1                  4
2   exp2                  6
3   exp3                  7
4   exp4                  9
Note:
First find the cumulative sum of all non-duplicates (cumsum(!duplicated(entries))), group_by exp, and take the last cumsum of each group; this number is the cumulative count of unique entries up to and including that group.
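A tiny illustration of how cumsum(!duplicated(...)) produces a running count of distinct values:

```r
# duplicated() flags repeats; negating it marks first occurrences,
# and cumsum() turns those flags into a running count of distinct values.
x <- c("a", "b", "a", "c")
!duplicated(x)           # TRUE TRUE FALSE TRUE
cumsum(!duplicated(x))   # 1 2 2 3
```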
Answer 2:
With library(data.table) we can do:
setDT(test)[, new := cumsum(!duplicated(entries))]
test[, .(cum_unique_entries = new[.N]), by = exp]
Answer 3:
1) sqldf This can be done in a single complex join:
library(sqldf)
sqldf("select b.exp, count(Distinct a.entries) cum
from test a join test b on a.exp <= b.exp group by b.exp")
giving:
exp cum
1 exp1 4
2 exp2 6
3 exp3 7
4 exp4 9
This idea could be translated into corresponding code using the other frameworks as well, although it would be rather inefficient in base R and dplyr: one would have to materialize an n x n cartesian join and then subset it, whereas the SQL engine may optimize the query to avoid producing the large intermediate result.
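As a sketch of that cartesian-join idea in base R (using merge(..., by = NULL) for the cross join; fine at this toy size, inefficient at scale, as noted above):

```r
# Example data from the question
test <- data.frame(
  exp = c(rep("exp1", 4), rep("exp2", 4), rep("exp3", 4), rep("exp4", 5)),
  entries = c("abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "abcd", "efgh",
              "ijkl", "qrst", "uvwx", "yzab", "yzab", "cdef", "mnop", "uvwx", "ghij")
)

# Cross-join test with itself, keep pairs where a.exp <= b.exp,
# then count distinct entries per b.exp -- the base-R analogue of the SQL query
cj <- merge(test, test, by = NULL, suffixes = c(".a", ".b"))   # n x n cross join
cj <- cj[as.character(cj$exp.a) <= as.character(cj$exp.b), ]   # a.exp <= b.exp
res <- aggregate(entries.a ~ exp.b, data = cj,
                 FUN = function(x) length(unique(x)))
res   # counts per experiment: 4, 6, 7, 9
```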
2) rollapplyr Sort test by exp and use rollapplyr to calculate the cumulative number of unique entries. Then use !duplicated(..., fromLast = TRUE) to take the last row of each exp group:
library(zoo)
n <- nrow(test)
test_sorted <- test[order(test$exp), ]
len_uniq <- function(x) length(unique(x))
test_cum <- transform(test_sorted, cum = rollapplyr(entries, 1:n, len_uniq, fill = NA))
test_cum[!duplicated(test_cum$exp, fromLast = TRUE), -2]
giving:
exp cum
4 exp1 4
8 exp2 6
12 exp3 7
17 exp4 9
Answer 4:
# split entries by experiment, build the cumulative concatenation of the
# groups with Reduce(..., accumulate = TRUE), and count unique values in each
temp <- split(test$entries, test$exp)
data.frame(E = names(temp),
           V = sapply(Reduce(c, temp, accumulate = TRUE),
                      function(x) length(unique(x))))
# E V
#1 exp1 4
#2 exp2 6
#3 exp3 7
#4 exp4 9
Answer 5:
I would use the cast function from the reshape package (simple example, full reference). It should be as simple as:
reshape::cast(test, exp~., value="entries", function(x) length(unique(x)) )
What I'm doing is telling the function to take your dataset, use the exp variable as the id-variable, "ignore" all the other variables (that's what the . means), and add the "metric" returned by the given function (function(x) length(unique(x))) computed on the column named entries. It's something like SQL's aggregate functions and group by construct!
Its SQL equivalent is:
SELECT exp, count(distinct entries)
FROM test
GROUP BY exp
Source: https://stackoverflow.com/questions/47836031/r-calculating-cumulative-number-of-unique-entries