R: Cumulatively count number of times column value appears in other column

Question

This is easiest to describe with an example. Say I have the following data frame:

id1 id2 var
1   2   a
2   3   b
2   1   a
3   2   a
2   3   a
4   2   a
3   1   b

You can generate it as follows:

df <- data.frame(id1 = c(1,2,2,3,2,4,3),
                 id2 = c(2,3,1,2,3,2,1),
                 var = c('a','b','a','a','a','a','b'))

For each row, I want a cumulative count of the number of times that row's id2 value has already appeared in id1 with the same var, so I would end up with:

id1 id2 var count
1   2   a   0
2   3   b   0
2   1   a   1
3   2   a   1
2   3   a   1
4   2   a   2
3   1   b   0

So the count in row 3 is 1, since id1 = 1 with var = 'a' appears once before row 3 (in row 1). In row 4 the count is also 1, since id1 = 2 with var = 'a' appears in row 3; only rows before row 4 are checked, so the occurrence in row 5 is not counted.

If I were counting the number of times id1 had appeared in id1, I would do something like:

with(df, ave(id1 == id1, paste(id1, var), FUN = cumsum))

Is there a quick and easy way of doing this for id2?

Thanks in advance

Answer 1:

There might be more elegant ways to do it, but this gets the job done. The key here is the split<- function.

df$count <- NA # This column must be added prior to calling `split<-`
               # because otherwise we can't assign values to it
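split(df, df$var) <- lapply(split(df, df$var), function(x){
    # For each row i, test whether id2[i] occurs among id1[1:i],
    # then take a running total of those hits over the whole var group
    x$count <- cumsum(sapply(seq_len(nrow(x)), function(i) x$id2[i] %in% x$id1[1:i]))
    x
})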

The result is the following. It differs from your expected output, so either you made some errors when constructing the desired result by hand or I have misunderstood the question.

  id1 id2 var count
1   1   2   a     0
2   2   3   b     0
3   2   1   a     1
4   3   2   a     2
5   2   3   a     3
6   4   2   a     4
7   3   1   b     0
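
To see where the discrepancy comes from, it can help to look at the var = 'a' group on its own. The sketch below is only an illustration (the names x, hit and running are introduced here, not taken from the original code): the running total is taken over every hit in the group, regardless of which id2 value produced it, so it keeps increasing.

```r
df <- data.frame(id1 = c(1,2,2,3,2,4,3),
                 id2 = c(2,3,1,2,3,2,1),
                 var = c('a','b','a','a','a','a','b'))

# Look only at the rows with var == 'a'
x <- df[df$var == 'a', ]

# hit[i] is TRUE when id2[i] occurs anywhere in id1[1:i]
hit <- sapply(seq_len(nrow(x)), function(i) x$id2[i] %in% x$id1[1:i])

# cumsum() adds up all hits so far, whichever id2 value they belong to
running <- cumsum(hit)

print(data.frame(x, hit = hit, running = running))
```

Four of the five rows are hits, which is why the count column above climbs to 4 instead of restarting for each id2 value.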

Update:

Just to make this answer complete and working, here is my take on your solution. It gives the same result on this data, and I think it is nicer and more readable to have the ave inside the lapply.

df$count <- NA
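split(df, df$var) <- lapply(split(df, df$var), function(x){
    # hit[i] is TRUE when id2[i] occurs among id1[1:i] in this var group
    hit <- sapply(seq_len(nrow(x)), function(i) x$id2[i] %in% x$id1[1:i])
    # Running total of hits, kept separately for each id2 value
    x$count <- ave(hit, x$id2, FUN=cumsum)
    x
})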
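
As a rough check, the per-id2 running total can be compared with a direct count of matches in id1. This is a sketch under assumptions: the helper names direct and per_id2 and the toy data frame are made up for illustration. On the sample data the two agree, but they need not agree in general, for example when a value shows up several times in id1 before it first shows up in id2.

```r
df <- data.frame(id1 = c(1,2,2,3,2,4,3),
                 id2 = c(2,3,1,2,3,2,1),
                 var = c('a','b','a','a','a','a','b'))

# Hypothetical helper: count how often id2[i] occurs in id1[1:i]
direct <- function(x) {
  sapply(seq_len(nrow(x)), function(i) sum(x$id2[i] == x$id1[1:i]))
}

# Hypothetical helper: running total of hits, kept per id2 value
per_id2 <- function(x) {
  hit <- sapply(seq_len(nrow(x)), function(i) x$id2[i] %in% x$id1[1:i])
  ave(hit, x$id2, FUN = cumsum)
}

# Same answer on both var groups of the sample data
a <- df[df$var == 'a', ]
b <- df[df$var == 'b', ]
same_on_sample <- all(direct(a) == per_id2(a)) && all(direct(b) == per_id2(b))

# Made-up data: 2 occurs twice in id1 before it first occurs in id2
toy <- data.frame(id1 = c(2, 2, 5), id2 = c(9, 9, 2))
toy_direct  <- direct(toy)
toy_per_id2 <- per_id2(toy)
```

Here toy_direct and toy_per_id2 differ in the last row, so which version is appropriate depends on whether occurrences in id1 or earlier hits for the same id2 are meant to be counted.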

Answer 2:

I used and edited Backlin's answer to get what I want. The code is as follows:

df$count <- NA

split(df, df$var) <- lapply(split(df, df$var), function(x){
    # Count how often id2[i] occurs in id1[1:i]; note that 1:i includes row i itself
    x$count <- sapply(seq_len(nrow(x)), function(i) sum(x$id2[i] == x$id1[1:i]))
    x
})

There is probably a more elegant way of doing it but I think this works...
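
For comparison, here is a minimal sketch of the same idea without split<-, checking only rows strictly before the current one, as described in the question. The helper name count_before is an assumption made for this example; the expected vector is the count column from the question.

```r
df <- data.frame(id1 = c(1,2,2,3,2,4,3),
                 id2 = c(2,3,1,2,3,2,1),
                 var = c('a','b','a','a','a','a','b'))

# Count column given in the question
expected <- c(0, 0, 1, 1, 1, 2, 0)

# Hypothetical helper: for each row i, count the earlier rows whose id1
# equals id2[i] and whose var equals var[i]
count_before <- function(d) {
  sapply(seq_len(nrow(d)), function(i) {
    prev <- seq_len(i - 1)   # empty for the first row, so its count is 0
    sum(d$id1[prev] == d$id2[i] & d$var[prev] == d$var[i])
  })
}

df$count <- count_before(df)
print(df)
```

Because prev stops at row i - 1, a row whose id1 equals its own id2 would not count itself here, whereas indexing with 1:i would include it. The sample data has no such row, so both give the same counts.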

Source: https://stackoverflow.com/questions/19491258/r-cumulatively-count-number-of-times-column-value-appears-in-other-column

Tags

cumsum