Question
I have several thousand gene trees that I am trying to prepare for analysis with codeml. The tree below is a typical example. What I want to do is automate the collapsing of tips or nodes that appear to be duplicates. For instance, the descendants of node 56 are tips 26, 27, 28, and so on, all the way to 36. All of these other than tip 28 appear to be duplicates. How can I collapse them all into a single tip, leaving just tip 28 and one representative of the other tips as the descendants of node 56?
I know how to manually do this one by one, but I am trying to automate the process so that a function can identify which tips need to be collapsed and then reduce them to a single representative tip. So far I have been looking at the cophenetic function which calculates the distances between the tips. However, I am not sure how to use that information to collapse tips.
Here is the Newick string for the tree below:
((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;
Answer 1:
One option is to drop the tips whose terminal branch length falls below a threshold.
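library(ape)  # provides Ntip() and drop.tip()
drop_dupes <- function(tree, thres = 1e-5) {
  # Rows of the edge matrix whose child node is a tip (tips are numbered 1..Ntip)
  tips <- which(tree$edge[, 2] %in% 1:Ntip(tree))
  # Tip numbers whose terminal branch is shorter than the threshold
  toDrop <- tree$edge[tips, 2][tree$edge.length[tips] < thres]
  drop.tip(tree, tree$tip.label[toDrop])
}
# `tree` is a phylo object, e.g. read with read.tree()
plot(drop_dupes(tree))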
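Note that dropping by threshold removes every short tip, so a group of duplicates disappears entirely instead of being reduced to one representative. A minimal sketch of a variant that keeps one tip per group is shown here. The helper name `collapse_dupes` and the small example tree are hypothetical and not part of the original answer, and the sketch assumes that "duplicates" means near-zero-length tips attached directly to the same parent node.

```r
library(ape)

# Hypothetical helper (not from the original answer): under each parent node,
# keep the first near-zero-length child tip and drop the remaining ones.
collapse_dupes <- function(tree, thres = 1e-5) {
  n <- Ntip(tree)
  # Rows of the edge matrix that end in a tip with a near-zero branch
  short <- which(tree$edge[, 2] <= n & tree$edge.length < thres)
  parent <- tree$edge[short, 1]
  tip <- tree$edge[short, 2]
  # duplicated() is FALSE for the first short tip under each parent,
  # so that tip survives as the representative
  to_drop <- tip[duplicated(parent)]
  if (length(to_drop) == 0) return(tree)
  drop.tip(tree, tree$tip.label[to_drop])
}

# Made-up example tree: A and B are near-identical siblings, D has no duplicate
example_tree <- read.tree(
  text = "((A:1e-08,B:1e-08,C:0.001)n1:0.002,(D:1e-08,E:0.003)n2:0.001);"
)
collapsed <- collapse_dupes(example_tree)
print(collapsed$tip.label)
```

Which tip is kept depends on the order of rows in the edge matrix, so if a particular representative matters, that choice would need to be made explicitly. Duplicates that are spread across different parent nodes are not merged by this sketch.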
Source: https://stackoverflow.com/questions/38570074/phylogenetics-in-r-collapsing-descendant-tips-of-an-internal-node