Strange number of subsequences?

Posted by 我只是一个虾纸丫 on 2019-12-10 19:10:16

Question


I have a sequence object created like this:

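library(TraMineR)

# Convert spell data (one row per event) into a TraMineR state-sequence object.
subsequences <- function(data){
  slmax <- max(data$time)  # largest time index, passed to seqformat() as limit
  sequences.seqe <- seqecreate(data)  # event-sequence object; not used below
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  sequences.sts
}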

data <- subsequences(data)

head(data)

Which gives the output:

    Sequence                                                                     
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged             
[3] *-discussed-*-discussed-*-discussed-*-discussed                              
[4] *-opened-*-discussed-merged-discussed                                        
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed     
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed

But when I calculate the subsequences, I get seemingly ridiculous answers:

seqsubsn(head(data))
 [!] found missing state in the sequence(s), adding missing state to the alphabet
    Subseq.
[1]    1036
[2]    1248
[3]      88
[4]      49
[5]     294
[6]     240

How could the number of subsequences be far larger than the number of events in each sequence?

A 'dput()' of the dataset can be found here. The issue seems to be that the original data has timestamps in seconds. However, I've used the function below to replace the timestamps with sequential indices:

library(sqldf)

# Read the event log, keep the events between startdate and stopdate, and
# replace the Unix timestamps by their rank so that time runs 1, 2, 3, ...
read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  # Map each distinct timestamp to its rank among the sorted timestamps
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)  # not used inside this function
  data
}

This makes it possible to create appropriate measures for entropy, sequence length etc., but the number of subsequences is still problematic.


Answer 1:


The numbers of subsequences returned are not surprising at all. It is a matter of how 'subsequence' is defined, which should not be confused with 'substring'.

A sequence $x = (x_1, x_2, \ldots, x_k)$ is a subsequence of $y$ if its elements $x_i$ are all in $y$ and occur in the same order as in $y$, not necessarily contiguously. For instance, A-B-A is a subsequence of C-A-D-B-C-D-A-D.
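The definition can be checked mechanically. The following is a minimal base-R sketch; the helper name is_subsequence is made up for illustration and is not part of TraMineR.

```r
# Return TRUE if `x` is a subsequence of `y`: the elements of x all appear in y,
# in the same order, but not necessarily next to each other.
is_subsequence <- function(x, y) {
  pos <- 0L  # position in y of the most recently matched element
  for (el in x) {
    hits <- which(y == el)
    hits <- hits[hits > pos]          # only matches after the previous one count
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]                   # greedy: take the earliest possible match
  }
  TRUE
}

y <- strsplit("C-A-D-B-C-D-A-D", "-", fixed = TRUE)[[1]]
is_subsequence(c("A", "B", "A"), y)  # TRUE, although A-B-A is not a substring of y
is_subsequence(c("B", "A", "B"), y)  # FALSE, y contains only one B
```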

To illustrate, consider the 'mvad' example from the TraMineR package.

library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")

##    Sequence                      
##[1] (EM,4)-(TR,2)-(EM,64)         
##[2] (FE,36)-(HE,34)               
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)

seqsubsn(mvad.seq)[1:3]

##[1]  7  4 16

By default, seqsubsn computes the number of subsequences of the distinct successive states (DSS). The DSS of the first sequence, for example, is EM-TR-EM. The seven subsequences of EM-TR-EM are:

  • the empty sequence
  • the two sequences made of a single element: EM and TR
  • the two-length subsequences: EM-TR, EM-EM, TR-EM
  • the three-length sequence: EM-TR-EM
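These counts can be reproduced without TraMineR. The sketch below is a standard dynamic-programming count of distinct subsequences applied to the DSS; it is not claimed to be how seqsubsn is implemented, and the names dss and count_distinct_subseq are hypothetical helpers.

```r
# Distinct successive states: collapse runs of identical consecutive states.
dss <- function(states) rle(as.character(states))$values

# Number of distinct subsequences of `states`, the empty one included.
# dp[i + 1] holds the count for the first i elements.
count_distinct_subseq <- function(states) {
  n <- length(states)
  dp <- numeric(n + 1)
  dp[1] <- 1                 # the empty prefix has only the empty subsequence
  last_seen <- list()        # last position at which each state occurred
  for (i in seq_len(n)) {
    s <- as.character(states[i])
    dp[i + 1] <- 2 * dp[i]   # each earlier subsequence, with or without states[i]
    if (!is.null(last_seen[[s]])) {
      # subtract the subsequences already produced by the previous occurrence of s
      dp[i + 1] <- dp[i + 1] - dp[last_seen[[s]]]
    }
    last_seen[[s]] <- i
  }
  dp[n + 1]
}

seq1 <- rep(c("EM", "TR", "EM"), times = c(4, 2, 64))  # (EM,4)-(TR,2)-(EM,64)
count_distinct_subseq(dss(seq1))                       # 7
count_distinct_subseq(c("FE", "HE"))                   # 4
count_distinct_subseq(c("TR", "FE", "EM", "JL"))       # 16
```

Applied to the third sequence of the question, `*-discussed-*-discussed-*-discussed-*-discussed`, the same function returns 88, which matches the seqsubsn output shown there.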

Proceeding the same way, you can verify that your fourth sequence (which is equal to its DSS)

*-opened-*-discussed-merged-discussed

has 49 subsequences, ten of which have length two (the missing state * counts as a state like any other):

*-*, *-opened, *-discussed, *-merged, opened-*, opened-discussed, opened-merged, discussed-merged, discussed-discussed, merged-discussed
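For a sequence this short, the count can also be verified by brute force. The sketch below enumerates every subset of positions with base R's combn and keeps the distinct results; subseq_by_length is a hypothetical helper, and the approach assumes no state label contains "-". Because * is treated as an ordinary state, *-* appears among the two-length subsequences.

```r
# Distinct non-empty subsequences of `states`, grouped by length.
# Brute force over all subsets of positions, so only suitable for short sequences.
subseq_by_length <- function(states) {
  n <- length(states)
  lapply(seq_len(n), function(k) {
    # combn(n, k) visits every increasing set of k positions
    unique(combn(n, k, FUN = function(idx) paste(states[idx], collapse = "-")))
  })
}

s4 <- c("*", "opened", "*", "discussed", "merged", "discussed")
by_len <- subseq_by_length(s4)

sum(lengths(by_len)) + 1  # 49, the extra 1 being the empty subsequence
by_len[[2]]               # the two-length subsequences, "*-*" among them
```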

Hope this helps



Source: https://stackoverflow.com/questions/20718879/strange-number-of-subsequences
