Figure 1. PHYRN concept and work flow. |
I have been communicating with Randen Patterson on and off over the last five years or so about his efforts to try and study the evolution of gene families when the sequence similarity in the gene family is so low that making multiple sequence alignments are very difficult. Recently, Randen moved to UC Davis so I have been talking / emailing with jim more and more about this issue. Of note, Randen has a new paper in PLoS One about this topic: Bhardwaj G, Ko KD, Hong Y, Zhang Z, Ho NL, et al. (2012) PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences. PLoS ONE 7(4): e34261. doi:10.1371/journal.pone.0034261.
Figure 8. Model for the Evolution of the DANGER Superfamily. |
I invited Randen and the first author Gaurav Bhardwaj to do a guest post here providing some of the story behind their paper for my ongoing series on this topic. I note - if you have published an open access paper on some topic related to this blog I would love to have a guest post from you too. I note - I personally love the fact that they used the "DANGER" family as an example to test their method.
Here is their guest post:
A fundamental problem to phylogenetic inference in the “twilight zone” (<25% pairwise identity), let alone the “midnight zone” (<12% pairwise identity), is the inability to accurately assign evolutionary relationships at these levels of divergence with statistical confidence. This lack of resolution arises from difficulties in separating the phylogenetic signal from the random noise at these levels of divergence. This obviously and ultimately stymies all attempts to truly resolve the Tree of Life. Since most attempts at phylogenetic inferences in twilight/midnight zone have relied on MSA, and with no clear answer on the best phylogenetic methods to resolve protein families in twilight/midnight zone, we have presented rest of this blog post as two questions representative of these problems.
Question1: Is MSA required for accurate phylogenetic inference?
Our Opinion: MSA is an excellent tool for the inference from conserved data sets, but it has been shown by others and us, that the quality of MSA degrades rapidly in the twilight zone. Further, the quest for an optimal MSA becomes increasingly difficult with increased number of taxa under study. Although, quality of MSA methods has improved in last two decades, we have not made significant improvements towards overcoming these problems. Multiple groups have also designed alignment-free methods (see Hohl and Ragan, Syst. Biol. 2007), but so far none of these methods has been able to provide better phylogenetic accuracy than MSA+ML methods. We recently published a manuscript in PLoS One entitled “PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences” introducing a hybrid profile-based method. Our approach focuses on measuring phylogenetic signal from homologous biological patterns (functional domains, structural folds, etc), and their subsequent amplification and encoding as phylogenetic profile. Further, we adopt a distance estimation algorithm that is alignment-free, and thus bypasses the need for an optimal MSA. Our benchmarking studies with synthetic (from ROSE and Seqgen) and biological datasets show that PHYRN outperforms other traditional methods (distance, parsimony and Maximum Liklihood), and provides significantly accurate phylogenies even in data sets exhibiting ~8% average pairwise identity. While this still needs to be evaluated in other simulations (varying tree shapes, rates, models), we are convinced that these types of methods do work and deserve further exploration.
Question 2: How can we as a field critically and fairly evaluate phylogenetic methods?
Our Opinion: A similar problem plagued the field of structural biology whereby there were multiple methods for structural predictions, but no clear way of standardizing or evaluating their performance. An additional problem that applies to phylogenetic inference is that, unlike crystal structures of proteins, phylogenies do not have a corresponding “answer” that can be obtained. Synthetic data sets have tried to answer this question to a certain extent by simulating protein evolution and providing true evolutionary histories that can be used for benchmarking. However, these simulations cannot truly replicate biological evolution (e.g. indel distribution, translocations, biologically relevant birth-death models, etc). In our opinion, we need a CASP-like model (solution adopted by our friends in computational structural biology), where same data sets (with true evolutionary history known only to organizers) are inferred by all the research groups, and then submitted for a critical evaluation to the organizers. To convert this thought to reality, we hereby announce CAPE (Critical Assessment of Protein Evolution) for Summer 20132. We are still in pre-production stages, and we welcome any suggestions, comments and inputs about data sets, scoring and evaluating methods.
Bhardwaj, G., Ko, K., Hong, Y., Zhang, Z., Ho, N., Chintapalli, S., Kline, L., Gotlin, M., Hartranft, D., Patterson, M., Dave, F., Smith, E., Holmes, E., Patterson, R., & van Rossum, D. (2012). PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences PLoS ONE, 7 (4) DOI: 10.1371/journal.pone.0034261