How do I generate all possible Newick Tree permutations for a set of species given an outgroup?
For those who don't know what Newick tree format is, a good description
Let's set aside the Newick representation for the moment, and think of a possible Python representation of the problem.
A rooted tree can be viewed as a recursive hierarchy of sets of (sets of (sets of ...)) leaves. Sets are unordered, which is well suited to describing clades in a tree: {{{"A", "B"}, {"C", "D"}}, "E"} should be the same thing as {{{"C", "D"}, {"B", "A"}}, "E"}.
If we consider the initial set of leaves {"A", "B", "C", "D", "E"}, the trees with "E" as outgroup are the sets of the form {tree, "E"}, where each tree is taken from the set of trees that can be built from the set of leaves {"A", "B", "C", "D"}. We could try to write a recursive trees function to generate this set of trees, and our total set of trees would be expressed as follows:
{{tree, "E"} for tree in trees({"A", "B", "C", "D"})}
(Here, I use the set comprehension notation.)
Actually, Python doesn't allow sets of sets, because the elements of a set must be "hashable" (that is, Python must be able to compute a "hash" value for each object in order to check whether it belongs to the set). Python sets are mutable and therefore not hashable. Fortunately, we can use a similar data structure named frozenset, which behaves much like a set, but cannot be modified and is hashable. Therefore, our set of trees would be:
all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})
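The hashability point can be checked directly. The snippet below is only a small illustration (the variable names are mine, not part of the solution): it shows that a plain set cannot be nested inside another set, whereas frozensets can, and that two nested frozensets describing the same clades in a different order compare equal.

```python
# A plain set is unhashable, so it cannot be an element of another set.
try:
    nested = {{"A", "B"}, "E"}
    set_of_sets_ok = True
except TypeError:
    set_of_sets_ok = False

# frozensets are hashable, so they can be nested to describe clades.
tree_1 = frozenset(
    {frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})}), "E"})
tree_2 = frozenset(
    {"E", frozenset({frozenset({"D", "C"}), frozenset({"B", "A"})})})

print(set_of_sets_ok)         # False
print(tree_1 == tree_2)       # True
print(len({tree_1, tree_2}))  # 1: both describe the same tree
```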
trees function
Now let's focus on the trees function.
For each possible partition (decomposition into a set of disjoint subsets, including all elements) of the set of leaves, we need to find all possible trees (through a recursive call) for each part of the partition. For a given partition, we will then make a tree for each possible combination of subtrees taken across its parts.
For instance, if a partition is {"A", {"B", "C", "D"}}, we will consider all possible trees that can be made from part "A" (actually, just the leaf "A" itself), and all possible trees that can be made from part {"B", "C", "D"} (that is, trees({"B", "C", "D"})). Then, the possible trees for this partition will be obtained by taking all possible pairs where one element comes from just "A", and the other from trees({"B", "C", "D"}).
This can be generalized to partitions with more than two parts, and the product function from itertools is useful here.
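To make the combination step concrete, here is a minimal sketch for the partition {"A", {"B", "C", "D"}}. The two lists of subtrees are written by hand for illustration (they stand in for what the recursive calls would return); only itertools.product is taken from the approach described above.

```python
from itertools import product

# Hand-written stand-ins for the subtrees of each part of the partition.
trees_for_a = ["A"]
trees_for_bcd = [
    frozenset({"B", "C", "D"}),
    frozenset({frozenset({"B", "C"}), "D"}),
    frozenset({frozenset({"B", "D"}), "C"}),
    frozenset({frozenset({"C", "D"}), "B"}),
]

# One tree per way of picking one subtree from each part.
combined = [
    frozenset(choice) for choice in product(trees_for_a, trees_for_bcd)]

print(len(combined))  # 4
```

With more than two parts, product simply receives one iterable of subtrees per part.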
Therefore, we need a way to generate the possible partitions of a set of leaves.
Here I made a partitions_of_set function adapted from this solution:
# According to https://stackoverflow.com/a/30134039/1878788:
# The problem is solved recursively:
# If you already have a partition of n-1 elements, how do you use it to partition n elements?
# Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    # https://stackoverflow.com/a/43804050/1878788
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition
To check the obtained partitions, we make a function to make them easier to display (that will also be useful to make a newick representation of the trees later):
def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"
We test that the partitioning works:
for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))
Output:
1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)
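As a sanity check on the listing above: the number of partitions of a set of n elements is the Bell number B(n), and B(4) = 15, which matches the 15 lines printed. The helper below is a sketch of my own (bell_number is not part of the original solution) that computes these numbers with the Bell triangle.

```python
def bell_number(n):
    """Number of partitions of a set of n elements (Bell triangle)."""
    row = [1]
    for _ in range(n - 1):
        # Each new row starts with the last value of the previous row;
        # each next value adds the previous-row value to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


print([bell_number(n) for n in range(1, 6)])  # [1, 2, 5, 15, 52]
```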
trees function
Now we can write the trees function:
from itertools import product

def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *trees* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)
Testing it:
all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})
for tree in all_trees:
    print(print_set(tree) + ";")
Output:
(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);
I hope the result is correct.
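One way to gain some confidence is to count the trees independently. The sketch below uses the same decomposition as the trees function (a tree on a set of leaves is a partition into at least two parts, with one subtree per part), but it only counts instead of enumerating. The helper name count_rooted_trees is my own, and math.comb assumes Python 3.8 or later. For four ingroup leaves it gives 26, which agrees with the 26 trees printed above.

```python
from math import comb


def count_rooted_trees(n_leaves):
    """Count rooted trees (multifurcations allowed) on n labelled leaves."""
    # t[k]: number of trees on k leaves.
    # p[m]: sum over all partitions of m leaves of the product of t[size]
    #       over the parts (single-part partition included).
    t = [0, 1]
    p = [1, 1]
    for n in range(2, n_leaves + 1):
        # Pick the part containing a fixed leaf (size k < n, so at least
        # two parts overall), then partition the remaining n - k leaves.
        with_several_parts = sum(
            comb(n - 1, k - 1) * t[k] * p[n - k] for k in range(1, n))
        t.append(with_several_parts)
        # The single-part partition contributes t[n] once more.
        p.append(2 * with_several_parts)
    return t[n_leaves]


print([count_rooted_trees(n) for n in range(1, 6)])  # [1, 1, 4, 26, 236]
```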
This approach was a bit tricky to get right. It took me some time to figure out how to avoid the infinite recursion (this happens when the partition is {{"A", "B", "C", "D"}}, that is, a single part equal to the whole set of leaves).