How do I generate all possible Newick Tree permutations for a set of species given an outgroup?
For those who don\'t know what Newick tree format is, a good description
This was a hard question! Here is the journey I took.
First observation is that the outgroup is always a single node tacked onto the end of the newick string. Let's call the rest of the species the ingroup and try to generate all the permutations of these. Then simply add the outgroup.
from itertools import permutations
def ingroup_generator(species, n):
for perm in permutations(species, n):
yield tuple([tuple(perm), tuple(s for s in species if s not in perm)])
def format_newick(s, outgroup=''):
return '(' + ', '.join('({})'.format(', '.join(p)) for p in s) + ',({}));'.format(outgroup)
species = ["A","B","C","D","E"]
outgroup = "E"
ingroup = [s for s in species if s != outgroup]
itertools_newicks= []
for n in range(1, len(ingroup)):
for p in ingroup_generator(ingroup, n):
itertools_newicks.append(format_newick(p, outgroup))
for newick in itertools_newicks:
print newick
This returns 40 newick strings:
((A), (B, C, D),(E));
((B), (A, C, D),(E));
((C), (A, B, D),(E));
((D), (A, B, C),(E));
((A, B), (C, D),(E));
((A, C), (B, D),(E));
((A, D), (B, C),(E));
((B, A), (C, D),(E));
((B, C), (A, D),(E));
((B, D), (A, C),(E));
((C, A), (B, D),(E));
((C, B), (A, D),(E));
((C, D), (A, B),(E));
((D, A), (B, C),(E));
((D, B), (A, C),(E));
((D, C), (A, B),(E));
((A, B, C), (D),(E));
((A, B, D), (C),(E));
((A, C, B), (D),(E));
((A, C, D), (B),(E));
((A, D, B), (C),(E));
((A, D, C), (B),(E));
((B, A, C), (D),(E));
((B, A, D), (C),(E));
((B, C, A), (D),(E));
((B, C, D), (A),(E));
((B, D, A), (C),(E));
((B, D, C), (A),(E));
((C, A, B), (D),(E));
((C, A, D), (B),(E));
((C, B, A), (D),(E));
((C, B, D), (A),(E));
((C, D, A), (B),(E));
((C, D, B), (A),(E));
((D, A, B), (C),(E));
((D, A, C), (B),(E));
((D, B, A), (C),(E));
((D, B, C), (A),(E));
((D, C, A), (B),(E));
((D, C, B), (A),(E));
Some of these are duplicates, but we will get to removing the duplicates later.
As bli noted in the comments, (((("A","B"),"C"),"D"),("E"));
and its variants should also be considered valid solutions.
The comments on BioStar pointed me in the right direction that this is the same as generating all the possible groupings of a binary tree. I found a nice Python implementation in this StackOverflow answer by rici:
# A very simple representation for Nodes. Leaves are anything which is not a Node.
class Node(object):
def __init__(self, left, right):
self.left = left
self.right = right
def __repr__(self):
return '(%s, %s)' % (self.left, self.right)
# Given a tree and a label, yields every possible augmentation of the tree by
# adding a new node with the label as a child "above" some existing Node or Leaf.
def add_leaf(tree, label):
yield Node(label, tree)
if isinstance(tree, Node):
for left in add_leaf(tree.left, label):
yield Node(left, tree.right)
for right in add_leaf(tree.right, label):
yield Node(tree.left, right)
# Given a list of labels, yield each rooted, unordered full binary tree with
# the specified labels.
def enum_unordered(labels):
if len(labels) == 1:
yield labels[0]
else:
for tree in enum_unordered(labels[1:]):
for new_tree in add_leaf(tree, labels[0]):
yield new_tree
Then,
enum_newicks= []
for t in enum_unordered(ingroup):
enum_newicks.append('({},({}));'.format(t, outgroup))
for newick in enum_newicks:
print newick
produces the following 15 newicks:
((A, (B, (C, D))),(E));
(((A, B), (C, D)),(E));
((B, (A, (C, D))),(E));
((B, ((A, C), D)),(E));
((B, (C, (A, D))),(E));
((A, ((B, C), D)),(E));
(((A, (B, C)), D),(E));
((((A, B), C), D),(E));
(((B, (A, C)), D),(E));
(((B, C), (A, D)),(E));
((A, (C, (B, D))),(E));
(((A, C), (B, D)),(E));
((C, (A, (B, D))),(E));
((C, ((A, B), D)),(E));
((C, (B, (A, D))),(E));
So now we already have 40 + 15 = 55 possible newick strings and we have to remove the duplicates.
I first dead end that I tried was to create a canonical representation of each newick string so that I could use these as keys in a dictionary. The idea was to recursively sort the strings in all the nodes. But first I had to capture all the (nested) nodes. I couldn't use regular expressions, because nested structures are by definition not regular.
So I used the pyparsing
package and come up with this:
from pyparsing import nestedExpr
def sort_newick(t):
if isinstance(t, str):
return sorted(t)
else:
if all(isinstance(c, str) for c in t):
return sorted(t)
if all(isinstance(l, list) for l in t):
return [sort_newick(l) for l in sorted(t, key=lambda k: sorted(k))]
else:
return [sort_newick(l) for l in t]
def canonical_newick(n):
n = n.replace(',', '')
p = nestedExpr().parseString(n).asList()
s = sort_newick(p)
return str(s)
This gave for
from collections import defaultdict
all_newicks = itertools_newicks + enum_newicks
d = defaultdict(list)
for newick in all_newicks:
d[canonical_newick(newick)].append(newick)
for canonical, newicks in d.items():
print canonical
for newick in newicks:
print '\t', newick
print
A dictionary with 22 keys:
[[[['A'], [['C'], ['B', 'D']]], ['E']]]
((A, (C, (B, D))),(E));
[[[['B'], [['A'], ['C', 'D']]], ['E']]]
((B, (A, (C, D))),(E));
[[[['B'], [['A', 'C'], ['D']]], ['E']]]
((B, ((A, C), D)),(E));
[[['A', 'C', 'D'], ['B'], ['E']]]
((B), (A, C, D),(E));
((A, C, D), (B),(E));
((A, D, C), (B),(E));
((C, A, D), (B),(E));
((C, D, A), (B),(E));
((D, A, C), (B),(E));
((D, C, A), (B),(E));
[[['A', 'B'], ['C', 'D'], ['E']]]
((A, B), (C, D),(E));
((B, A), (C, D),(E));
((C, D), (A, B),(E));
((D, C), (A, B),(E));
[[[[['A'], ['B', 'C']], ['D']], ['E']]]
(((A, (B, C)), D),(E));
[[[['A', 'C'], ['B', 'D']], ['E']]]
(((A, C), (B, D)),(E));
[[['A'], ['B', 'C', 'D'], ['E']]]
((A), (B, C, D),(E));
((B, C, D), (A),(E));
((B, D, C), (A),(E));
((C, B, D), (A),(E));
((C, D, B), (A),(E));
((D, B, C), (A),(E));
((D, C, B), (A),(E));
[[[['A', 'D'], ['B', 'C']], ['E']]]
(((B, C), (A, D)),(E));
[[['A', 'B', 'C'], ['D'], ['E']]]
((D), (A, B, C),(E));
((A, B, C), (D),(E));
((A, C, B), (D),(E));
((B, A, C), (D),(E));
((B, C, A), (D),(E));
((C, A, B), (D),(E));
((C, B, A), (D),(E));
[[['A', 'C'], ['B', 'D'], ['E']]]
((A, C), (B, D),(E));
((B, D), (A, C),(E));
((C, A), (B, D),(E));
((D, B), (A, C),(E));
[[['A', 'B', 'D'], ['C'], ['E']]]
((C), (A, B, D),(E));
((A, B, D), (C),(E));
((A, D, B), (C),(E));
((B, A, D), (C),(E));
((B, D, A), (C),(E));
((D, A, B), (C),(E));
((D, B, A), (C),(E));
[[[['A'], [['B'], ['C', 'D']]], ['E']]]
((A, (B, (C, D))),(E));
[[[[['A', 'B'], ['C']], ['D']], ['E']]]
((((A, B), C), D),(E));
[[[[['B'], ['A', 'C']], ['D']], ['E']]]
(((B, (A, C)), D),(E));
[[[['C'], [['B'], ['A', 'D']]], ['E']]]
((C, (B, (A, D))),(E));
[[[['C'], [['A', 'B'], ['D']]], ['E']]]
((C, ((A, B), D)),(E));
[[[['A'], [['B', 'C'], ['D']]], ['E']]]
((A, ((B, C), D)),(E));
[[[['A', 'B'], ['C', 'D']], ['E']]]
(((A, B), (C, D)),(E));
[[[['B'], [['C'], ['A', 'D']]], ['E']]]
((B, (C, (A, D))),(E));
[[[['C'], [['A'], ['B', 'D']]], ['E']]]
((C, (A, (B, D))),(E));
[[['A', 'D'], ['B', 'C'], ['E']]]
((A, D), (B, C),(E));
((B, C), (A, D),(E));
((C, B), (A, D),(E));
((D, A), (B, C),(E));
But closer inspection revealed some problems. Let's look for example at the newicks '(((A, B), (C, D)),(E));
and ((D, C), (A, B),(E));
. In our dictionary d
they have a different canonical key, respectively [[[['A', 'B'], ['C', 'D']], ['E']]]
and [[['A', 'B'], ['C', 'D'], ['E']]]
. But in fact, these are duplicate trees. We can confirm this by looking at the Robinson-Foulds distance between two trees. If it is zero, the trees are identical.
We use the robinson_foulds
function from the ete3 toolkit package
from ete3 import Tree
tree1 = Tree('(((A, B), (C, D)),(E));')
tree2 = Tree('((D, C), (A, B),(E));')
rf, max_parts, common_attrs, edges1, edges2, discard_t1, discard_t2 = tree1.robinson_foulds(tree2, unrooted_trees=True)
print rf # returns 0
OK, so Robinson-Foulds is a better way of checking for newick tree equality then my canonical tree approach. Let's wrap all newick strings in a custom MyTree
object where equality is defined as having a Robinson-Foulds distance of zero:
class MyTree(Tree):
def __init__(self, *args, **kwargs):
super(MyTree, self).__init__(*args, **kwargs)
def __eq__(self, other):
rf = self.robinson_foulds(other, unrooted_trees=True)
return not bool(rf[0])
trees = [MyTree(newick) for newick in all_newicks]
It would have been ideal if we could also define a __hash__()
function that returns the same value for duplicate trees, then set(trees)
would automatically remove all the duplicates.
Unfortunately, I haven't been able to find a good way to define __hash__(), but with __eq__
in place, I could make use of index():
unique_trees = [trees[i] for i in range(len(trees)) if i == trees.index(trees[i])]
unique_newicks = [tree.write(format=9) for tree in unique_trees]
for unique_newick in unique_newicks:
print unique_newick
So, here we are at the end of our journey. I can't fully provide proof that this is the correct solution, but I am pretty confident that the following 19 newicks are all the possible distinct permutations:
((A),(B,C,D),(E));
((B),(A,C,D),(E));
((C),(A,B,D),(E));
((D),(A,B,C),(E));
((A,B),(C,D),(E));
((A,C),(B,D),(E));
((A,D),(B,C),(E));
((A,(B,(C,D))),(E));
((B,(A,(C,D))),(E));
((B,((A,C),D)),(E));
((B,(C,(A,D))),(E));
((A,((B,C),D)),(E));
(((A,(B,C)),D),(E));
((((A,B),C),D),(E));
(((B,(A,C)),D),(E));
((A,(C,(B,D))),(E));
((C,(A,(B,D))),(E));
((C,((A,B),D)),(E));
((C,(B,(A,D))),(E));
If we pairwise compare every newick to all other newicks, we get confirmation that there are no more duplicates in this list
from itertools import product
for n1, n2 in product(unique_newicks, repeat=2):
if n1 != n2:
mt1 = MyTree(n1)
mt2 = MyTree(n2)
assert mt1 != mt2