问题
I want to cluster sequences with optimal matching with TraMineR::seqdist()
from data that contains missings, i.e. sequences containing gaps.
library(TraMineR)
data(ex1)
sum(is.na(ex1))
# [1] 38
sq <- seqdef(ex1[1:13])
sq
# Sequence
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 *-D-D-D-D-D-D-D-D-D-D
# s4 A-A-*-*-B-B-B-B-D-D
# s5 A-*-A-A-A-A-*-A-A-A
# s6 *-*-*-C-C-C-C-C-C-C
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*
sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)
# A-> B-> C-> D->
# A-> 0 2.000 2 2.000
# B-> 2 0.000 2 1.823
# C-> 2 2.000 0 2.000
# D-> 2 1.823 2 0.000
When I run seqdist()
dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)
I'm receiving
Error: 'with.missing' must be TRUE when 'seqdata' or 'refseq' contains missing values
but when I set option with.missing=TRUE
I'm receiving
[>] including missing values as an additional state
[>] 7 sequences with 5 distinct states
[>] checking 'sm' (one value for each state, triangle inequality)
Error: [!] size of substitution cost matrix must be 5x5
So, how can we compute the dissimilarities between sequences using seqdist()
with the output of seqsubm()
the right way when the data contain missings i.e. sequences contain gaps?
Note: I'm not very sure if this makes sense at all. So far I just exclude observations with missings but due to my data I loose lots of observations by that. Therefore it would be worthwhile to know if there's such an option.
回答1:
There are different strategies to computing distances when you have gaps.
1) A first solution is to consider the missing state as an additional state. This is what seqdist
does when you set with.missing=TRUE
. In that case the sm
matrix should contain costs of substituting any state with the missing state. Using seqsubm
you just need to specify with.missing=TRUE
for that function too. By default, the substitution costs to substitute 'missing' are set as the fixed value miss.cost
(2 by default).
sm <- seqsubm(sq, method='TRATE', with.missing=TRUE)
round(sm,digits=3)
# A-> B-> C-> D-> *->
# A-> 0 2.000 2 2.000 2
# B-> 2 0.000 2 1.823 2
# C-> 2 2.000 0 2.000 2
# D-> 2 1.823 2 0.000 2
# *-> 2 2.000 2 2.000 0
To get substitution costs for 'missing' based on transition probabilities
sm <- seqsubm(sq, method='TRATE', with.missing=TRUE, miss.cost.fixed=FALSE)
round(sm,digits=3)
# A-> B-> C-> D-> *->
# A-> 0.000 2.000 2.000 2.000 1.703
# B-> 2.000 0.000 2.000 1.823 1.957
# C-> 2.000 2.000 0.000 2.000 1.957
# D-> 2.000 1.823 2.000 0.000 1.957
# *-> 1.703 1.957 1.957 1.957 0.000
Using the latter sm
, we get the distances between sequences
dist.om <- seqdist(sq, method="OM", indel=1, sm=sm, with.missing=TRUE)
round(dist.om, digits=2)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 0.00 22.87 21.91 18.41 6.41 17.00 17.03
# [2,] 22.87 0.00 13.76 11.56 19.91 19.87 22.57
# [3,] 21.91 13.76 0.00 14.25 18.96 18.91 21.57
# [4,] 18.41 11.56 14.25 0.00 13.70 15.70 18.14
# [5,] 6.41 19.91 18.96 13.70 0.00 15.70 16.62
# [6,] 17.00 19.87 18.91 15.70 15.70 0.00 16.70
# [7,] 17.03 22.57 21.57 18.14 16.62 16.70 0.00
Of course, sequences would then be close from each other just because they share many missing states (*). Therefore, you may want to retain only sequences with say less than 10 % of their elements missing.
2) A second solution is to delete the gaps, which you do in seqdef
. (However, note that this changes alignment.)
## Here, we drop seq 7 that contains only missing values
sq <- seqdef(ex1[-7,1:13], left='DEL', gaps='DEL')
sq
# Sequence
# s1 A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 D-D-D-D-D-D-D-D-D-D
# s4 A-A-B-B-B-B-D-D
# s5 A-A-A-A-A-A-A-A
# s6 C-C-C-C-C-C-C
sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)
# A-> B-> C-> D->
# A-> 0.000 1.944 2 2.000
# B-> 1.944 0.000 2 1.823
# C-> 2.000 2.000 0 2.000
# D-> 2.000 1.823 2 0.000
dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)
round(dist.om, digits=2)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 19.61 20.00 13.78 2.00 17
# [2,] 19.61 0.00 12.76 9.59 17.61 17
# [3,] 20.00 12.76 0.00 13.29 18.00 17
# [4,] 13.78 9.59 13.29 0.00 11.78 15
# [5,] 2.00 17.61 18.00 11.78 0.00 15
# [6,] 17.00 17.00 17.00 15.00 15.00 0
来源:https://stackoverflow.com/questions/53701467/how-to-compute-dissimilarities-between-sequences-when-sequences-contain-gaps