R: Cumulatively count number of times column value appears in other column

It is probably easiest to describe what I want with an example. Say I have the following data frame:

id1 id2 var
1   2   a
2   3   b
2   1   a
3   2   a
2   3   a
4   2   a
3   1   b

You can generate it as follows:

df <- data.frame(id1 = c(1,2,2,3,2,4,3),
                 id2 = c(2,3,1,2,3,2,1),
                 var = c('a','b','a','a','a','a','b'))

For each row, I want a running count of how many times that row's id2 value has already appeared in id1 with the same var, so I would end up with:

id1 id2 var count
1   2   a   0
2   3   b   0 
2   1   a   1
3   2   a   1
2   3   a   1
4   2   a   2
3   1   b   0

The count in row 3 is 1 because id1 = 1 with var = 'a' occurs once before row 3 (in row 1). In row 4 the count is also 1 because id1 = 2 with var = 'a' occurs in row 3. Only earlier rows are checked, so the match in row 5 is not counted for row 4.

If I were counting the number of times id1 had appeared in id1, I would do something like:

with(df, ave(id1 == id1, paste(id1, var), FUN = cumsum))

Is there a quick and easy way of doing this for id2?

Thanks in advance

There might be more elegant ways to do it, but this gets the job done. The key here is the split<- replacement function, which writes each processed group back into its original rows.

df$count <- NA # This column must be added prior to calling `split<-`
               # because otherwise we can't assign values to it
split(df, df$var) <- lapply(split(df, df$var), function(x){
    x$count <- cumsum(sapply(1:nrow(x), function(i) x$id2[i] %in% x$id1[1:i]))
    x
})

The result is the following. There are some discrepancies, so either you made some errors in your manual construction of the desired results or I have misunderstood the question.

  id1 id2 var count
1   1   2   a     0
2   2   3   b     0
3   2   1   a     1
4   3   2   a     2
5   2   3   a     3
6   4   2   a     4
7   3   1   b     0
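
For anyone unfamiliar with split<-, here is a minimal sketch of the mechanism on a made-up data frame (toy, g, x and run are illustrative names, not from the question). It computes a running sum within each group and lets split<- put the pieces back in the original row order.

```r
# Toy data; all names here are illustrative and not part of the question.
toy <- data.frame(g = c("a", "b", "a", "b"), x = c(1, 10, 2, 20))
toy$run <- NA  # the column must exist before `split<-` can fill it

# Process each group separately, then assign the pieces back in place.
split(toy, toy$g) <- lapply(split(toy, toy$g), function(piece) {
    piece$run <- cumsum(piece$x)  # running sum within this group only
    piece
})

toy$run  # 1 10 3 30
```

Rows keep their original order, and each row only sees values from its own group.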

Update:

Just to make this answer complete and working, this is my take on your solution. Essentially the same, but I think it's nicer and more readable to have the ave inside the lapply.

df$count <- NA
split(df, df$var) <- lapply(split(df, df$var), function(x){
    hit <- sapply(1:nrow(x), function(i) x$id2[i] %in% x$id1[1:i])
    x$count <- ave(hit, x$id2, FUN=cumsum)
    x
})
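
One caveat, as far as I can tell: the ave version reproduces the desired counts on the sample data, but it may not be equivalent to counting matching id1 rows in general. The sketch below uses a made-up single-group data frame (chk is a hypothetical name) in which an id occurs three times in id1 before it first shows up in id2.

```r
# Made-up data for one var group: id 2 appears three times in id1
# before it appears in id2. `chk` is an illustrative name.
chk <- data.frame(id1 = c(2, 2, 2, 1), id2 = c(9, 9, 9, 2))
rows <- seq_len(nrow(chk))

# ave-based variant: cumulative number of "hit" rows per id2 value
hit <- sapply(rows, function(i) chk$id2[i] %in% chk$id1[1:i])
via_ave <- ave(hit, chk$id2, FUN = cumsum)

# direct variant: number of rows up to i whose id1 equals id2[i]
direct <- sapply(rows, function(i) sum(chk$id2[i] == chk$id1[1:i]))

via_ave  # 0 0 0 1
direct   # 0 0 0 3
```

If the goal is the number of matching id1 rows, the direct count looks like the safer choice.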

I have used and edited Backlin's answer to get what I want. The code is as follows:

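df$count <- NA # must exist before calling `split<-`

split(df, df$var) <- lapply(split(df, df$var), function(x){
    # For row i, count the rows up to and including i in this var group
    # whose id1 equals id2[i]. Row i itself only matters if id1 == id2 there.
    x$count <- sapply(seq_len(nrow(x)), function(i) sum(x$id2[i] == x$id1[1:i]))
    x
})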

There is probably a more elegant way of doing it but I think this works...
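
If the count should only ever look at strictly earlier rows, as described in the question, one possible sketch without split<- is below. count_prior is a hypothetical helper name, and the row-by-row scan assumes the data frame is small enough for quadratic work to be acceptable.

```r
df <- data.frame(id1 = c(1,2,2,3,2,4,3),
                 id2 = c(2,3,1,2,3,2,1),
                 var = c('a','b','a','a','a','a','b'))

# count_prior is a hypothetical helper, not part of any package.
# For each row i, count strictly earlier rows that have the same var
# and whose id1 equals id2[i].
count_prior <- function(d) {
    sapply(seq_len(nrow(d)), function(i) {
        earlier <- seq_len(i - 1)  # empty for the first row
        sum(d$id1[earlier] == d$id2[i] & d$var[earlier] == d$var[i])
    })
}

df$count <- count_prior(df)
df$count
```

On the sample data this gives 0 0 1 1 1 2 0, matching the table in the question.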

Source: https://stackoverflow.com/questions/19491258/r-cumulatively-count-number-of-times-column-value-appears-in-other-column

Tags

cumsum