Question
Using the biofam dataset that comes as part of TraMineR:
library(TraMineR)
data(biofam)
lab <- c("P","L","M","LM","C","LC","LMC","D")
biofam.seq <- seqdef(biofam[,10:25], states=lab)
head(biofam.seq)
Sequence
1167 P-P-P-P-P-P-P-P-P-LM-LMC-LMC-LMC-LMC-LMC-LMC
514 P-L-L-L-L-L-L-L-L-L-L-LM-LMC-LMC-LMC-LMC
1013 P-P-P-P-P-P-P-L-L-L-L-L-LM-LMC-LMC-LMC
275 P-P-P-P-P-L-L-L-L-L-L-L-L-L-L-L
2580 P-P-P-P-P-L-L-L-L-L-L-L-L-LMC-LMC-LMC
773 P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P
I can perform a cluster analysis:
library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))
However, in this process the unique IDs from biofam.seq are lost: the cluster vector is indexed only by position, 1 through N:
head(cluster3, 10)
[1] Type 1 Type 2 Type 2 Type 2 Type 2 Type 3 Type 3 Type 2 Type 1
[10] Type 2
Levels: Type 1 Type 2 Type 3
Now I want to know which sequences fall into each cluster, so that I can apply other functions to get the mean length, entropy, subsequences, dissimilarity, etc. within each cluster. What I need to do is:
1. Map the old IDs to the new IDs
2. Put the sequences of each cluster into separate sequence objects
3. Run the statistics I want on each of the new sequence objects
How can I accomplish steps 2 and 3 in the list above?
Answer 1:
The state sequence object for the first cluster, for example, can simply be obtained with
bio1.seq <- biofam.seq[cluster3=="Type 1",]
summary(bio1.seq)
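This works because cutree returns the cluster memberships in the same row order as biofam.seq, so the logical vector cluster3 == "Type 1" lines up with the rows of the sequence object. A minimal sketch of how the original IDs can be recovered and a statistic computed on the subset, rebuilding the objects from the question so that it runs on its own (the names ids1 and ent1 are only illustrative):

```r
library(TraMineR)
library(cluster)

# Rebuild the objects from the question
data(biofam)
lab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam[, 10:25], states = lab)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- factor(cutree(clusterward, k = 3),
                   labels = c("Type 1", "Type 2", "Type 3"))

# Attach the original row IDs to the cluster vector (same row order)
names(cluster3) <- rownames(biofam.seq)

# Sequence object for the first cluster; row names are preserved
bio1.seq <- biofam.seq[cluster3 == "Type 1", ]
ids1 <- rownames(bio1.seq)   # original IDs of the cluster members
ent1 <- seqient(bio1.seq)    # within-sequence entropy, one value per sequence
head(ids1)
summary(as.numeric(ent1))
```

The same pattern applies to "Type 2" and "Type 3".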
Answer 2:
I think this answers your questions. I created biofam.seq with the code from http://www.bristol.ac.uk/cmm/software/support/workshops/materials/solutions-to-r.pdf, since the code in your question did not work for me.
# create data
library(TraMineR)
data(biofam)
bf.states <- c("Parent", "Left", "Married", "Left/Married", "Child",
               "Left/Child", "Left/Married/Child", "Divorced")
bf.shortlab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam[, 10:25], states = bf.shortlab,
                     labels = bf.states)
# cluster
library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))
First, I use split to create a list of row indices for each cluster, which I then pass to lapply to build a list of per-cluster sequence objects from biofam.seq:
# create a list of sequences
idx.list <- split(seq_len(nrow(biofam)), cluster3)
seq.list <- lapply(idx.list, function(idx)biofam.seq[idx, ])
Finally, you can run analytics on each cluster's sequence object by using lapply or sapply:
# compute statistics on each sub-sequence (just an example)
cluster.sizes <- sapply(seq.list, FUN = nrow)
where FUN can be any function you would normally run on a sequence object.
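As a rough illustration of the statistics mentioned in the question, the sketch below computes the mean sequence length, mean entropy and mean within-cluster OM dissimilarity per cluster. It rebuilds the objects so that it runs on its own; cluster.stats and the helper variable names are only illustrative, and the choice of statistics is an assumption about what is wanted.

```r
library(TraMineR)
library(cluster)

# Rebuild the sequence object, distances and clusters
data(biofam)
bf.shortlab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam[, 10:25], states = bf.shortlab)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- factor(cutree(clusterward, k = 3),
                   labels = c("Type 1", "Type 2", "Type 3"))

# Row indices and sequence objects per cluster
idx.list <- split(seq_len(nrow(biofam.seq)), cluster3)
seq.list <- lapply(idx.list, function(idx) biofam.seq[idx, ])

# One row of summary statistics per cluster
cluster.stats <- data.frame(
  size         = sapply(seq.list, nrow),
  mean.length  = sapply(seq.list, function(s) mean(seqlength(s))),
  mean.entropy = sapply(seq.list, function(s) mean(seqient(s))),
  # mean pairwise OM distance among the members of each cluster
  mean.diss    = sapply(idx.list, function(idx) {
    d <- biofam.om[idx, idx]
    mean(d[upper.tri(d)])
  })
)
print(cluster.stats)
```

The dissimilarity column reuses the distance matrix biofam.om through idx.list rather than recomputing distances on each subset.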
Source: https://stackoverflow.com/questions/21342706/how-to-identify-sequences-within-each-cluster