Create a “sessionID” based on “userID” and differences in “timeStamp”

攒了一身酷 2021-01-22 07:40

Sorry, another newbie question. I am trying to take parts of a data frame based on an existing ID or index, and then create a new ID or index column based on the difference i

2 Answers
  • 2021-01-22 07:46
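    library(plyr)
    
    # Within each user, start a new session whenever the gap to the previous
    # timeStamp exceeds 30. Rows are assumed to be sorted by timeStamp per user.
    # diff() also copes with single-row groups, where 1:(length(userID) - 1)
    # would evaluate to c(1, 0) and index out of range.
    ddply(myDF, .(userID), transform, 
          # session counter starts at 0
          sessID3 = paste(userID, cumsum(c(0, diff(timeStamp) > 30)), sep = '.'),
          # session counter starts at 1
          sessID4 = paste(userID, cumsum(c(0, diff(timeStamp) > 30)) + 1, sep = '.'))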
    

    Gives me:

    #    userID timeStamp var1 var2 varN sessID1 sessID2 sessID3 sessID4
    # 1       1         1    x    y    N     1.0     1.1     1.0     1.1
    # 2       1         3    x    y    N     1.0     1.1     1.0     1.1
    # 3       1         6    x    y    N     1.0     1.1     1.0     1.1
    # 4       1        40    x    y    N     1.1     1.2     1.1     1.2
    # 5       1        42    x    y    N     1.1     1.2     1.1     1.2
    # 6       1        43    x    y    N     1.1     1.2     1.1     1.2
    # 7       1        47    x    y    N     1.1     1.2     1.1     1.2
    # 8       2         5    x    y    N     2.0     2.1     2.0     2.1
    # 9       2         8    x    y    N     2.0     2.1     2.0     2.1
    # 10      3         2    x    y    N     3.0     3.1     3.0     3.1
    # 11      3         5    x    y    N     3.0     3.1     3.0     3.1
    # 12      3        38    x    y    N     3.1     3.2     3.1     3.2
    # 13      3        39    x    y    N     3.1     3.2     3.1     3.2
    # 14      3        39    x    y    N     3.1     3.2     3.1     3.2
    # 15      3        82    x    y    N     3.2     3.3     3.2     3.3
    # 16      3        83    x    y    N     3.2     3.3     3.2     3.3
    # 17      3        90    x    y    N     3.2     3.3     3.2     3.3
    # 18      3        91    x    y    N     3.2     3.3     3.2     3.3
    # 19      3       102    x    y    N     3.2     3.3     3.2     3.3
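
For readers without plyr, the same result can be sketched in base R. This is an illustration rather than part of the original answer: it swaps ddply for ave(), rebuilds only the userID and timeStamp columns from the printed output, and uses a helper name (sessIdx) that is made up here.

```r
# Rebuild the userID and timeStamp columns from the printed output
myDF <- data.frame(
  userID    = c(rep(1, 7), rep(2, 2), rep(3, 10)),
  timeStamp = c(1, 3, 6, 40, 42, 43, 47,
                5, 8,
                2, 5, 38, 39, 39, 82, 83, 90, 91, 102)
)

# Per-user session counter: +1 whenever the gap to the previous row exceeds 30
# (sessIdx is a hypothetical helper name; rows must be sorted by timeStamp per user)
sessIdx <- ave(myDF$timeStamp, myDF$userID,
               FUN = function(t) cumsum(c(0, diff(t) > 30)))

myDF$sessID3 <- paste(myDF$userID, sessIdx, sep = ".")      # counts from 0
myDF$sessID4 <- paste(myDF$userID, sessIdx + 1, sep = ".")  # counts from 1
```

Because ave() returns its result in the original row order, the new columns line up with myDF without any re-sorting.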
    
  • 2021-01-22 08:04

    And a "data table" way...

    library(data.table)
    myDT <- data.table(myDF)
    setkey(myDT,userID)
    myDT[,sessID3:=paste(userID,cumsum(c(0,diff(timeStamp)>30)),sep="."),by=userID]
    all.equal(myDT$sessID1,as.numeric(myDT$sessID3))
    # [1] TRUE
    

    Explanation:

    Using by=userID with data.table groups the rows by userID. diff(timeStamp)>30 creates a logical vector with one fewer element than the number of rows in the group, so we prepend 0 with c(0,diff(timeStamp)>30). Wrapping that in cumsum(c(0,diff(timeStamp)>30)) coerces the logicals to numbers and calculates the cumulative sum, so every time we encounter a diff > 30, the cumsum increments by 1. Finally, paste(...) just concatenates the userID with the secondary index.
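
To make the steps concrete, here is a small trace for a single user, using the timeStamp values that the first answer's output shows for userID 3. The intermediate names (gaps, new_session, session_idx) are made up for this illustration; the one-liner above does the same thing without them.

```r
# timeStamp values for userID 3, copied from the printed output
timeStamp <- c(2, 5, 38, 39, 39, 82, 83, 90, 91, 102)

gaps <- diff(timeStamp)             # 9 gaps for 10 rows: 3 33 1 0 43 1 7 1 11
new_session <- c(0, gaps > 30)      # prepend 0 so the first row opens session 0
session_idx <- cumsum(new_session)  # 0 0 1 1 1 2 2 2 2 2
sessID <- paste(3, session_idx, sep = ".")
```

The two gaps above 30 (33 and 43) are exactly where the counter steps from 0 to 1 and from 1 to 2.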

    One note: you have it set up so that the sessID is numeric. This gets a bit dicey once a user has more than 10 sessions, because as a number 1.10 is the same as 1.1. IMO it is better to use character for sessID.
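
A quick check of that collision, as a minimal sketch (num_ids and chr_ids are names invented for this example):

```r
# Session 1 and session 10 of user 1, stored as numbers versus as strings
num_ids <- as.numeric(c("1.1", "1.10"))
chr_ids <- c("1.1", "1.10")

num_ids[1] == num_ids[2]  # TRUE:  the two sessions become indistinguishable
chr_ids[1] == chr_ids[2]  # FALSE: the character IDs stay distinct
```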
