The comparison of treesibling time consistent phylogenetic networks is graph isomorphismcomplete
Abstract
In [2] we gave a metric on the class of semibinary treesibling time consistent phylogenetic networks that is computable in polynomial time; in particular, the problem of deciding if two networks of this kind are isomorphic is in P. In this paper, we show that if we remove the semibinarity condition above, then the problem becomes much harder. More precisely, we proof that the isomorphism problem for generic treesibling time consistent phylogenetic networks is polynomially equivalent to the graph isomorphism problem. Since the latter is believed to be neither in P nor NPcomplete, the chances are that it is impossible to define a metric on the class of all treesibling time consistent phylogenetic networks that can be computed in polynomial time.
1 Introduction
After the realization that reticulation processes, like hybridizations, recombinations or lateral gene transfers, have been more relevant in the evolution of life on Earth than previously thought [6], there has been a growing interest in the development of algorithms for the reconstruction of phylogenetic networks: graphical models of evolutionary histories that go beyond phylogenetic trees by including hybrid nodes of indegree greater than one representing reticulation events. As the number of available such algorithms increases, the need of methods for the comparison of phylogenetic networks also increases, as they are used, for instance, to assess the reliability and robustness of these algorithms [12, 14].
One of the types of phylogenetic networks for which there exist reconstruction methods [9, 10] are the treesibling time consistent networks, TSTC networks, for short (see §2 for a formal definition). There have been several attempts to define a metric on the class of all TSTC networks on a given set of taxa [13], and we have recently given a metric on the class of all semibinary TSTC networks, where all hybrid nodes have indegree two [2], but none of the metrics for phylogenetic networks computable in polynomial time proposed so far satisfies the separation axiom (distance 0 means isomorphism) for generic TSTC networks: see [3, 4]. In this paper we show why it should come as no surprise: such a metric would solve in polynomial time the graph isomorphism problem.
The graph isomorphism problem is one of the most important decision problems for which the computational complexity is not known yet [7, 11]. It is believed to be neither in P nor NPcomplete, and subexponential time solutions for it are known. A problem is said to be graph isomorphismcomplete when it is polynomially equivalent to the graph isomorphism problem. In this paper we show that, for every set with more than two elements, the isomorphism problem for TSTC phylogenetic networks with taxa bijectively labeled in is graph isomorphismcomplete.
2 Preliminaries
Let be a nonempty rooted directed acyclic graph (a rDAG, for short). A node of is a leaf if it has outdegree , internal if its outdegree is , of tree type if its indegree is , of hybrid type if its indegree is , and elementary if it is a tree node of outdegree 1. A node is a child of another node (and, hence, is a parent of ) if . Two nodes and are siblings of each other if they share a parent. An arc in a rDAG is a tree arc when is a tree node, and a hybridization arc when is a hybrid node. The height of a node is the longest length of a directed path from to a leaf, and the depth of is the longest length of a directed path from the root to .
Given a finite set of labels, a rDAG is a rDAG with its leaves injectively labelled by . By an isomorphism of rDAGs we understand an isomorphism of directed graphs that preserves the labelling, that is, that maps each leaf in one network to the leaf with the same label in the other network (in particular, isomorphic rDAGs must have the same sets of actual leaf labels). In a rDAG, we shall always identify without any further reference every leaf with its label.
A phylogenetic network on a set of taxa is a rDAG such that:

No tree node is elementary.

Every hybrid node has outdegree , and its single child is a tree node.
We will say that a phylogenetic network is treesibling if every hybrid node has at least one sibling that is a tree node.
A temporal assignment [1] on a network is mapping such that:

If is a hybrid node and , then .

If is a tree node and , then .
We will say that a phylogenetic network is timeconsistent if it admits a temporal assignment. The following alternative characterization of time consistency will be used later. For a proof, see [1, 5].
Proposition 1
Let be a phylogenetic network, let be its set of hybridization arcs, and let be the directed graph with the same set of nodes as and set of arcs . Then, is time consistent if, and only if, does not have any cycle containing some tree arc of .
The underlying biological motivation for the definitions on phylogenetic networks introduced so far is the following. In a phylogenetic network, tree nodes model species (either extant, the leaves, or nonextant, the internal tree nodes), while hybrid nodes model reticulation events, where different species interact to create new species, the parents of the hybrid node being the species involved in this event and its single child being the resulting species. The tree children of a tree node represent direct descendants through mutation. The first condition in the definition of phylogenetic network says that every nonextant species is assumed to have at least two different direct descendants, be them by mutation or through some reticulation event. This is a very common restriction in any definition of phylogeny (be it a tree or a network), since species with only one child cannot be reconstructed from biological data.
The treesibling condition says then that, for every reticulation event, at least one of the species involved in it must have some descendant through mutation. This condition was introduced with the name class I in L. Nakhleh’s PhD Thesis [13], and it has reappeared in several phylogenetic network reconstruction methods [9, 10]. As far as the time consistency goes, we understand that the time assigned to a node represents the time when the corresponding species existed, or when the reticulation event took place. The first condition in time consistency means then that the species involved in a reticulation event must coexist in time in order to interact, while the second condition means that speciation takes some amount of time to take place.
3 Main Results
It is well known [7, 15] that the isomorphism problem for rDAGs is graph isomorphismcomplete. It turns out that the isomorphism problem for rDAGs with their leaves bijectively labeled in any given set of labels is also graph isomorphismcomplete: since we have not been able to find a proof of this easy result in the literature, we provide one here.
Proposition 2
For every nonempty set of labels, the isomorphism for rDAGs is graph isomorphismcomplete.
Proof. Without any loss of generality, we assume that .
Let us prove first that the isomorphism of rDAGs reduces to the isomorphism of rDAGs. For every rDAG , let be the rDAG obtained from by unlabelling its leaves and then, for each , if contained a leaf labeled with , then adding to this leaf treechildren leaves; see Fig. 1. The construction of from adds nodes and arcs, and therefore it is polynomial in the size of . And can be reconstructed from by simply replacing, for each , the node of height 1 with leaves by a leaf labeled with . Then, it is straightforward to check that, for every pair of rDAGs and over , as rDAGs if, and only if, as rDAGs.
Let us prove now that the isomorphism of rDAGs reduces to the isomorphism of rDAGs. For every rDAG , let be the rDAG obtained from by adding a new node , arcs from each leaf of to and finally adding one child leaf to labeled ; see Fig. 2. The construction of from adds 2 nodes and arcs, and therefore it is polynomial. And can be reconstructed from by simply removing its leaves and its only height 1 node, , as well as all arcs pointing to or to the leaves. It is straightforward to check that, for every pair of rDAGs and over , if, and only if, as rDAGs.