Create a “sessionID” based on “userID” and differences in “timeStamp”


Sorry, another newbie question. I am trying to take parts of a data frame based on an existing ID or index, and then create a new ID or index column based on the difference i

2 Answers

梦如初夏

2021-01-22 07:46

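library(plyr)

# Within each userID, start a new session whenever the gap to the previous
# timeStamp exceeds 30. sessID3 numbers sessions from 0, sessID4 from 1.
# Rows must already be sorted by timeStamp within each userID.
# seq_len() + vapply() keep this working for users with a single row.
ddply(myDF, .(userID), transform, 
      sessID3 = paste(userID, 
                      c(0, cumsum(vapply(seq_len(length(userID) - 1),
                                         function(x)
                                           ifelse((timeStamp[x + 1] - timeStamp[x]) > 30,
                                                  1, 0),
                                         FUN.VALUE = numeric(1)))), sep = '.'),
      sessID4 = paste(userID, 
                      c(0, cumsum(vapply(seq_len(length(userID) - 1),
                                         function(x)
                                           ifelse((timeStamp[x + 1] - timeStamp[x]) > 30,
                                                  1, 0),
                                         FUN.VALUE = numeric(1)))) + 1, sep = '.'))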

Gives me:

#    userID timeStamp var1 var2 varN sessID1 sessID2 sessID3 sessID4
# 1       1         1    x    y    N     1.0     1.1     1.0     1.1
# 2       1         3    x    y    N     1.0     1.1     1.0     1.1
# 3       1         6    x    y    N     1.0     1.1     1.0     1.1
# 4       1        40    x    y    N     1.1     1.2     1.1     1.2
# 5       1        42    x    y    N     1.1     1.2     1.1     1.2
# 6       1        43    x    y    N     1.1     1.2     1.1     1.2
# 7       1        47    x    y    N     1.1     1.2     1.1     1.2
# 8       2         5    x    y    N     2.0     2.1     2.0     2.1
# 9       2         8    x    y    N     2.0     2.1     2.0     2.1
# 10      3         2    x    y    N     3.0     3.1     3.0     3.1
# 11      3         5    x    y    N     3.0     3.1     3.0     3.1
# 12      3        38    x    y    N     3.1     3.2     3.1     3.2
# 13      3        39    x    y    N     3.1     3.2     3.1     3.2
# 14      3        39    x    y    N     3.1     3.2     3.1     3.2
# 15      3        82    x    y    N     3.2     3.3     3.2     3.3
# 16      3        83    x    y    N     3.2     3.3     3.2     3.3
# 17      3        90    x    y    N     3.2     3.3     3.2     3.3
# 18      3        91    x    y    N     3.2     3.3     3.2     3.3
# 19      3       102    x    y    N     3.2     3.3     3.2     3.3
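
For anyone who wants to try this without the original data, here is a minimal sketch that rebuilds just the userID and timeStamp columns from the printed output above and derives the same 0-based IDs in base R with ave(). The column names sessIdx and sessID_base are made up for illustration, and it assumes the rows are already sorted by timeStamp within each user.

```r
# Rebuild the two key columns from the printed table (var1..varN omitted).
myDF <- data.frame(
  userID    = c(rep(1, 7), rep(2, 2), rep(3, 10)),
  timeStamp = c(1, 3, 6, 40, 42, 43, 47,
                5, 8,
                2, 5, 38, 39, 39, 82, 83, 90, 91, 102)
)

# Per user: flag gaps > 30, then count them cumulatively to get a session index.
# sessIdx is a hypothetical helper column, not part of the original data.
myDF$sessIdx <- ave(myDF$timeStamp, myDF$userID,
                    FUN = function(stamps) cumsum(c(0, diff(stamps) > 30)))

# Character ID of the form "<userID>.<session index>".
myDF$sessID_base <- paste(myDF$userID, myDF$sessIdx, sep = ".")
```

This should line up with the sessID3 column above, though I have only checked it against this small sample.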


梦如初夏

2021-01-22 08:04
And a "data table" way...
```
library(data.table)
myDT <- data.table(myDF)
setkey(myDT,userID)
myDT[,sessID3:=paste(userID,cumsum(c(0,diff(timeStamp)>30)),sep="."),by=userID]
all.equal(myDT$sessID1,as.numeric(myDT$sessID3))
# [1] TRUE
```
Explanation:

Using by=userID with data.table groups the rows by userID. diff(timeStamp)>30 creates a logical vector with one fewer element than the number of rows in the group, so we prepend 0 with c(0,diff(timeStamp)>30), which also coerces the logicals to numeric. cumsum(c(0,diff(timeStamp)>30)) then calculates the cumulative sum, so every time we encounter a diff > 30, the cumsum increments by 1. Finally, paste(...) just concatenates the userID with the secondary index.
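
To make the steps concrete, here is a small sketch that walks through them for a single user outside of data.table. The timeStamp values are those of userID 3 in the question's sample output; the variable names (stamps, gaps, new_sess, sess_idx, sess_id) are just for illustration.

```r
# timeStamp values for userID 3, copied from the sample output.
stamps <- c(2, 5, 38, 39, 39, 82, 83, 90, 91, 102)

gaps     <- diff(stamps)         # 9 gaps for 10 rows
new_sess <- c(0, gaps > 30)      # prepend 0 so the first row opens session 0
sess_idx <- cumsum(new_sess)     # goes up by 1 at every gap > 30
sess_id  <- paste(3, sess_idx, sep = ".")

sess_idx
# [1] 0 0 1 1 1 2 2 2 2 2
```

The two jumps happen at the 5 to 38 and 39 to 82 gaps, which is where the IDs switch from 3.0 to 3.1 and from 3.1 to 3.2.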

One note: you have it set up so that the sessID is numeric. This breaks down once a user has more than 10 sessions: as a number, 1.10 is the same as 1.1, so the eleventh session (index 10) would collide with the second (index 1). IMO better to use character for sessID.
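
A quick sketch of that collision, using two made-up IDs for user 1 (0-based session index 1 and 10):

```r
ids <- c("1.1", "1.10")           # session index 1 and session index 10

as.numeric(ids)                   # both parse to the same number, 1.1
length(unique(as.numeric(ids)))   # 1: the two sessions are merged
length(unique(ids))               # 2: kept apart as character
```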