Indexing sequence chunks using data.table

佛祖请我去吃肉 2021-01-21 11:22

Say I have a data set where sequences of length 1 are illegal, length 2 are legal, greater than length 5 are illegal but it is allowed to break longer sequences up into <=5 s

1 answer
  • 2021-01-21 11:49

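    If I understand your question correctly, fix_min should be FALSE when R == 0 and, within a sequence of R == 1, in three cases: when the sequence has length 1, on every 6th row (so that longer sequences are split into chunks of at most 5), and on a trailing row that would otherwise be left as a chunk of length 1. Then the following should give you what you want: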

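    library(data.table)

    # recreating the data from your first code block
    set.seed(1)
    DT1 <- data.table(R = sample(0:1, 20000, replace = TRUE))[, smp := .I
      ][, Seq := seq(.N), by = rleid(R)      # position of each row within its sequence
      ][, Seq2 := Seq[.N], by = rleid(R)]    # length of the sequence the row belongs to

    # adding the needed 'fix_min' column:
    # TRUE for R == 1 in sequences longer than 1, except on every 6th row
    DT1[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)
        ][R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]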
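
One way to sanity-check the result is to measure every consecutive block of TRUE values in fix_min and confirm that none is shorter than 2 or longer than 5. The sketch below rebuilds the table in separate steps so that it runs on its own; the names chunks, chunk, keep and len are introduced here for illustration and are not part of the original code.

```r
library(data.table)

# Rebuild the table from the answer (same seed, same logic, written step by step)
set.seed(1)
DT1 <- data.table(R = sample(0:1, 20000, replace = TRUE))
DT1[, Seq := seq_len(.N), by = rleid(R)]   # position within each sequence
DT1[, Seq2 := .N, by = rleid(R)]           # length of each sequence
DT1[, fix_min := R == 1 & Seq2 > 1 & Seq %% 6 != 0]
DT1[R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

# One row per consecutive block of equal fix_min values; keep only the TRUE blocks
chunks <- DT1[, .(keep = fix_min[1], len = .N), by = .(chunk = rleid(fix_min))][keep == TRUE]

# Every TRUE block should have a length between 2 and 5
range(chunks$len)
table(chunks$len)
```

If the rule is implemented as intended, range() stays within 2 and 5, and no row with R == 0 is marked TRUE.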
    

    Explanation:

    • data.table(R=sample(0:1, 20000, replace=TRUE)) creates the base data.table with a single column R of random 0s and 1s
    • [, smp:=.I] adds a row index to the data.table
    • by=rleid(R) identifies the sequences: rleid gives every run of consecutive equal values of R its own id; to see what it does try: data.table(R=sample(0:1, 20000, replace=TRUE))[, seq.id:=rleid(R)]
    • [, Seq:=seq(.N), by=rleid(R)] numbers the rows within each sequence (1, 2, 3, ...); the sequences are identified by rleid(R)
    • [, Seq2 := Seq[.N], by=rleid(R)] stores the length of the sequence (the last value of Seq) on every row of that sequence
    • fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0) creates a logical vector that is TRUE where R==1 and the sequence is longer than one (Seq[.N] > 1), except on every 6th row of a sequence (Seq%%6!=0); those 6th rows separate one chunk of at most 5 from the next
    • R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2 selects the last row (Seq==Seq2) of every sequence with R==1 whose length is 7, 13, 19, etc. (Seq2%%6==1). That row would otherwise form a chunk of length 1, which is illegal, so fix_min := FALSE sets it to FALSE. Sequences of length 1 also match this filter, but their fix_min is already FALSE.
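
The steps above are easier to follow on a small input. The following is a minimal sketch on a hand-made vector (hypothetical data, not from the question) with sequences of 1s of length 1, 2, 6 and 7; the names toy, run, len and kept are chosen here for illustration only.

```r
library(data.table)

# Hypothetical toy input: sequences of 1s with lengths 1, 2, 6 and 7, separated by single 0s
toy <- data.table(R = c(1, 0, 1, 1, 0, rep(1, 6), 0, rep(1, 7)))
toy[, run := rleid(R)]                  # run id: increases whenever R changes
toy[, Seq := seq_len(.N), by = run]     # position within the run
toy[, Seq2 := .N, by = run]             # run length, repeated on every row of the run

# The same two steps as in the answer
toy[, fix_min := R == 1 & Seq2 > 1 & Seq %% 6 != 0]
toy[R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

# One row per run of 1s: its length and the positions where fix_min is TRUE
toy[R == 1, .(len = .N, kept = paste(Seq[fix_min], collapse = " ")), by = run]
```

Here the run of length 1 keeps nothing, the run of length 2 keeps both rows, and the runs of length 6 and 7 both keep positions 1 to 5: position 6 is dropped by the Seq%%6!=0 condition, and position 7 by the second step because it would be a chunk of length 1.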