Say I have a data set where sequences of length 1 are illegal, length 2 are legal, greater than length 5 are illegal but it is allowed to break longer sequences up into <=5 s
If I understand your question correctly, you want to set fix_min to FALSE when R == 0 or when R == 1 & (1 <= Seq < 6 | Seq > 6). Then the following should give you what you want:
# recreating the data from your first code block
library(data.table)
set.seed(1)
DT1 <- data.table(R=sample(0:1, 20000, rep=TRUE))[, smp:=.I
][, Seq:=seq(.N), by=rleid(R)
][, Seq2 := Seq[.N], by=rleid(R)]
# adding the needed 'fix_min' column
DT1[, fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0), by=rleid(R)
][R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2, fix_min := FALSE]
Explanation:
data.table(R=sample(0:1, 20000, rep=TRUE)) creates the base of the data.table.
[, smp:=.I] creates an index and adds it to the data.table.
by=rleid(R) identifies the sequences; to see what it does, try: data.table(R=sample(0:1, 20000, rep=TRUE))[, seq.id:=rleid(R)]
[, Seq:=seq(.N), by=rleid(R)] creates an index for each sequence and adds it to the data.table; the sequences are identified by rleid(R).
[, Seq2 := Seq[.N], by=rleid(R)] creates a variable that contains the length of the sequence.
fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0) creates a logical vector with TRUE values where R==1 and the length of the sequence is larger than one (Seq[.N] > 1), excluding the values where the sequence number is a multiple of 6 (Seq%%6!=0).
R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2 filters the data.table as follows: only rows where R==1, the sequence value is 7, 13, 19, etc. (Seq%%6==1), and the length of the sequence is 7, 13, 19, etc. (Seq2%%6==1); of the sequences that meet these conditions, only the last row is selected (Seq==Seq2). With fix_min := FALSE you set those rows to FALSE.
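To see the effect on something small enough to read by eye, here is a minimal sketch that applies the same steps to a toy vector. The input vector and the names toy, runs and true_lens are made up for illustration; they are not part of your data. The toy input contains runs of 1s with lengths 1, 2, 7 and 8, so it covers the illegal single, a short legal run, the case where the last row would be left over on its own, and an ordinary long run.

```r
library(data.table)

# Toy input (hypothetical): runs of 1s with lengths 1, 2, 7 and 8,
# each separated by a single 0
toy <- data.table(R = c(1, 0, 1, 1, 0, rep(1, 7), 0, rep(1, 8)))

# Position within each run, and the length of the run
toy[, Seq := seq(.N), by = rleid(R)]
toy[, Seq2 := Seq[.N], by = rleid(R)]

# Same two steps as above: mark rows, then unmark a leftover last row
toy[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)]
toy[R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

# Lengths of the consecutive TRUE stretches in fix_min
runs <- toy[, .(len = .N, val = fix_min[1]), by = .(run = rleid(fix_min))]
true_lens <- runs[val == TRUE, len]
true_lens
# [1] 2 5 5 2
```

The run of length 1 stays FALSE, the run of length 7 keeps only its first five rows (row 6 is the break and row 7 would otherwise be a lone TRUE), and the run of length 8 is split into 5 and 2.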