Showing posts with label phylogenomics. Show all posts

Saturday, March 17, 2012

The Axis of Evol: Getting to the Root of DNA Repair with Philogeny



In 2005 I wrote an essay about my time in graduate school that was potentially going to be included in a special issue of Mutation Research in honor of my PhD advisor Phil Hanawalt.  Alas, publishing my essay ran into complications in regard to the closed access policies of this journal.  So in the end, my essay was not published.  I had forgotten about it mostly until very recently.  And so I decided to convert the essay to a blog post.  The essay is sort of about what I did in grad. school and sort of about Phil ...

Abstract:
Phylogenomics is a field in which genome analysis and evolutionary reconstructions are integrated. This integration is important because genome data is of great value in evolutionary reconstructions, because evolutionary analysis is critical for understanding and interpreting genomic data, and because there are feedback loops between evolutionary and genome analysis such that they need to be done in an integrated manner. In this paper I describe how I developed my particular phylogenomic approach under the guidance of my Ph.D advisor Philip C. Hanawalt. Since I was the first to use the term phylogenomics in a publication, I have decided to rename the field (at least temporarily) Philogenomics.

1. Doctor of Philosophy
When I went to Stanford for graduate school, I was interested in combining evolutionary analysis and molecular biology in a way that would allow me to study molecular mechanisms through an evolutionary perspective. Although I had gone to Stanford ostensibly to work on butterfly population genetics, within two days of starting a rotation in Phil’s lab, I knew that that was where I wanted to work. This decision was somewhat traumatic, since the work on butterflies included spending the summers at 10,000 feet in the Rocky Mountains and possibly chasing butterflies like a Nabokov wanna-be all over the mountain ranges of the world. As an avid outdoor person, this was quite appealing. Nevertheless, I chose to spend 99% of my graduate work in the dingy confines of Herrin Hall, studying DNA repair. The choice of joining Phil’s lab did have one very positive effect – and that was on my relationship with my grandfather on my mother’s side. Benjamin Post was in many ways like a father to me, especially after my father passed away. He was a physicist from the “old school” and thought that most of biology was completely useless. Needless to say, when I told him I was going to graduate school in California (which he considered already one strike against me) to study butterflies, he decided I was simply a lost cause. Despite all his talk of Einstein and computers and math when I was a child, I might as well have been a poet from his point of view. To make matters worse, my grandfather was a crystallographer, and my brother was getting his Ph.D in crystallography at Harvard. When I informed my grandfather that I was going to be working on DNA repair, he seemed somewhat interested. And then I told him that my advisor, Phil Hanawalt, is relatively well known, and actually used to be considered a biophysicist. Then my grandfather really perked up.
He said, “Hanawalt – is he related to Don Hanawalt?” It turns out that my grandfather worked in the same field as Phil’s father (they both did powder diffraction) and knew him. So my grandfather said “You may not be doing real science, but at least you are doing it with the relative of a real scientist.” Thankfully, I was no longer the black sheep in the family. So, with my grandfather’s approval, I embarked on a career in DNA repair.


I would like to add that I was very torn in writing this article. On the one hand, Phil was the greatest advisor I could ever imagine, allowing me to pursue studies on the evolution of DNA repair and comparative genomic analysis, even though nobody else in the lab worked on such things and at times, nobody seemed interested in them either. Phil’s support allowed me to explore my own interests and develop my concepts for the idea of “Phylogenomics” or the combining of evolutionary reconstructions and genome analysis. On the other hand, this special issue is being published in an Elsevier journal. As a supporter of the Open Access movement on scientific publications (see http://www.plos.org) and the brother of one of the founders of the Public Library of Science, publishing in an Elsevier journal is like cavorting with the devil. But the pull of Phil is very strong (some strange sort of force actually) and despite the effects that this may have on my relationship with my brother, I have agreed to publish in this special issue, and thus can now say that I sold my soul for Phil Hanawalt. [[OOPS - Spoke too soon on this when I wrote it --- in the end I just could not sign on the dotted line]].
In this essay, I describe my development in Phil’s lab of the idea of “Phylogenomics” or the combination of evolutionary reconstructions and genome analysis. I would like to add that this is not an attempt to review the field of phylogenomics or all the studies that could be called phylogenomics of DNA repair. For that I recommend reading other papers by myself (some of which are discussed below) as well as those by Rick Wood [1-4], Janusz M Bujnicki [5], Eugene Koonin [6-14], Carlos Menck [15-18], Michael Lynch [19-21], Patrick Forterre [22-24], Nancy Moran [25-29], and others. This is just meant to review my angle on the phylogenomics of repair and Phil’s contribution to this.

2. RecAgnizing the value of evolutionary analysis in studies of DNA repair
A post-doc in Phil’s lab at the time I was there, Shi-Kau (now known as Scott) Liu, was working on analysis of some studies of recA mutants he had done while working in Irwin Tessman’s lab. He asked me if I could help him with some comparative analyses of RecA protein sequences from different species, in the hopes that this might help interpret his experimental data. We then downloaded and aligned all available RecA protein sequences from different species of bacteria and compared the sequence variation to the recently solved crystal structure of a form of the E. coli RecA protein. Specifically, we were looking for compensatory mutations, in which a change in one amino acid in a region was accompanied by a correlated change in another amino acid in the same region (these were detected using an evolutionary method called character-state reconstruction). Interestingly, in some regions of the crystal structure (e.g., the monomer-monomer contact regions) extensive compensatory mutations could be detected, suggesting that this region of the crystal was conserved between species. In other regions of the crystal (e.g., the filament-filament contact regions), no compensatory mutations could be detected, suggesting either that this region of the structure was not conserved between species or that the filament contact regions were some artifact of crystallization. This was important to show since the mutations Shi-Kau was looking at were suppressors of another recA mutant (recA1202) and the suppressors we found did not make complete sense if the filament-filament contact regions of the crystal reflected perfectly what was going on in vivo (30).
In this way, evolutionary reconstructions helped inform experimental studies in E. coli. While this concept was not necessarily novel, it is important to point out that most molecular sequence comparisons used for structure-function studies both then and now focus on sequence conservation (that is, what is identical or similar between sequences). This does not take full advantage of the evolutionary history of sequences since it does not specifically examine how the sequence conservation came to be (that is, it does not look at the amino-acid changes that occurred, just what is conserved). This made me realize that comparative analysis (identifying what is similar or different between genes or species) was fundamentally different from evolutionary reconstructions (which can identify how and possibly even why the similarities and differences came into being). I should point out that to do the compensatory mutation analysis well requires lots of sequences and this was one of the hidden reasons behind why I have pushed for ten years for people studying the evolutionary relationships among microbes to use recA as a marker as they use rRNA (31).

3. Sniffing around at homologs of repair genes
Shortly after the recA analysis was complete, another problem being addressed in the Hanawalt lab presented an even more powerful test for evolutionary reconstructions. Kevin Sweder, another post-doc in the lab, was working on yeast strains with defects in homologs of human DNA repair genes. It was at this time that many of the human DNA repair genes were being cloned and shown to be members of the helicase superfamily of proteins. Many of these could further be assigned to one particular subfamily within the helicase superfamily – the subfamily that contained the yeast SNF2 protein. Proteins in the SNF2 family could be readily identified because their helicase-like domains were all much more similar to each other than any were to other helicase-domain containing proteins. Yet many scientists, including Kevin, were presented with a problem. As the yeast genome was being completed, BLAST searches could identify that yeast encoded many proteins in the SNF2 family. However, these same BLAST searches could not readily identify which yeast gene was the ortholog of which human gene. For those who do not know, homologous genes or proteins come in two primary forms – paralogs, which are genes related by gene duplications (e.g., alpha and beta globin) and orthologs, which are the same form of a gene in different species (e.g., human and mouse alpha-globin). Thus if one wanted to use yeast as a model to study a human disease due to a mutation in a SNF2 homolog, it would be helpful to know which yeast gene was the ortholog of the human gene of interest. Since paralogs are related to each other by duplication events and since duplication events are evolutionary events, I figured that an evolutionary tree of the SNF2 family proteins might help divide the gene family into groups of orthologs.
Indeed, this is exactly what we found – the SNF2 family could be divided into many subfamilies, each of which contained a human and a yeast gene, and thus these genes could be considered orthologs of each other. In our analysis we found something even more striking. For every subfamily in the SNF2 superfamily, if the function of more than one member of the subfamily was known (e.g., the human and yeast genes), the function was always conserved. Also, all different subfamilies appeared to have different functions (32). Thus one could predict the function of a gene from the subfamily in which it resided. As with the analysis of RecA, it should be pointed out that the phylogenetic tree-based assignment of genes to subfamilies was more useful than BLAST searches because BLAST is simply a way to identify similarity among genes/proteins. The tree allows one to group genes into correct subfamilies even if rates and patterns of evolution have changed over time and are different in different groups. Again, this is a distinction between comparative analysis and evolutionary analysis.
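The subfamily-based prediction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the SNF2 paper: it assumes a tree has already been built and each gene has already been assigned to a subfamily, and every gene name, subfamily label, and function string below is a hypothetical placeholder.

```python
from collections import defaultdict


def predict_functions(subfamily_of, known_function):
    """Transfer known functions to uncharacterized genes in the same subfamily.

    subfamily_of:   dict gene -> subfamily label (assumed to come from a tree)
    known_function: dict gene -> experimentally determined function
    Returns a dict gene -> predicted function for genes lacking a known one.
    """
    # Collect the set of known functions observed in each subfamily.
    functions_in = defaultdict(set)
    for gene, function in known_function.items():
        functions_in[subfamily_of[gene]].add(function)

    predictions = {}
    for gene, subfamily in subfamily_of.items():
        if gene in known_function:
            continue
        functions = functions_in.get(subfamily, set())
        # Predict only when the characterized members of the subfamily agree.
        if len(functions) == 1:
            predictions[gene] = next(iter(functions))
    return predictions


# Hypothetical example data (placeholder names, not real annotations).
subfamily_of = {
    "human_geneA": "subfamily1",
    "yeast_geneA": "subfamily1",
    "human_geneB": "subfamily2",
    "yeast_geneB": "subfamily2",
    "yeast_geneC": "subfamily3",
}
known_function = {
    "human_geneA": "function X",
    "human_geneB": "function Y",
}
predictions = predict_functions(subfamily_of, known_function)
print(predictions)
```

In this sketch a gene in a subfamily with no characterized member (yeast_geneC) gets no prediction at all, which mirrors the point that not predicting a function can be as useful as predicting one.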

4. A gut feeling leads to the idea of “Phylogenomics”
With the SNF2 analysis as a backdrop, I proceeded to proselytize to anyone who would listen that phylogenetic trees of genes were going to revolutionize genome sequencing projects by allowing one to predict the functions of many unknown genes. Genome sequencing projects of course produce lots of sequence data and little functional information. Although most of the people in the Hanawalt lab (except maybe Phil) could not have cared less about my evolutionary rantings, fortunately for me, one person called my bluff. Rick Myers, a professor in the Stanford Medical School and one of the heads of the Stanford Human Genome Center, was asked to write a News and Views for Nature Medicine about the recent publications of the genomes of E. coli O157:H7 and Helicobacter pylori. So Rick challenged me and said I should try and come up with a real example of how the people who worked on these genomes screwed something up by not doing an evolutionary analysis. Fortunately, it was easy to find an interesting case to study in one of the genomes. In the H. pylori paper, the authors had predicted that the species should have mismatch repair but then reported something quite unusual – the genome encoded a homolog of MutS but did not encode a homolog of MutL. I suppose this should have raised a red-flag to them since all species known to have mismatch repair required homologs of both of these proteins for the process. While some species had other bells and whistles (e.g., the use of MutH and Dam in gamma proteobacteria), the use of MutS and MutL was absolutely conserved. An evolutionary tree of the MutS homologs available at the time, including the one in H. pylori, also suggested a red-flag should have been raised before predicting that this species possessed mismatch repair.
The MutS family in prokaryotes could be divided into two separate subfamilies, which I called MutS1 and MutS2. All genes known to be involved in mismatch repair were in the MutS1 family. No gene in the MutS2 family had a known function. The H. pylori gene was in the MutS2 family. So this species had no MutL and a MutS homolog in a novel subfamily. To us, this suggested that it would be a bad idea to predict the presence of mismatch repair in this species (33). Later, I showed that there was a general trend – all prokaryotes with just a MutS2-like protein did not have a MutL-homolog, and all species with a MutS1-like protein did (34-36). Experimental work has now shown that the MutS2 of H. pylori is not involved in MMR and that this species apparently does not have any MMR (37). This is important because this apparently causes this species to have an exceptionally high mutation rate, which in turn can affect how one designs vaccines and drugs and diagnostics to target it. It should be pointed out that the role of the MutS2 homologs is not known, although they have been knocked out in many species and none has yet been shown to have a role in MMR. Thus predicting function by evolutionary analysis (or more specifically, not incorrectly predicting function) can be of great practical value.
It is from this analysis that I came up with the idea of “Phylogenomics” or the integration of evolutionary reconstructions and genome analysis (34-36). These approaches should be fully integrated because there is a feedback loop between them such that they cannot be done separately. For example, in the studies of MutS and MutL it is necessary to do a genome analysis to identify the presence or absence of homologs of these genes, then an evolutionary analysis to determine which forms of each of the genes are present, then a genome analysis again to determine the number and combination of different forms and then an evolutionary analysis to determine whether and when particular forms were gained and lost over evolutionary time, and so on.
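As a rough illustration of the MutS/MutL reasoning above, here is a minimal Python sketch of a cautious mismatch repair call. It assumes the two earlier steps are already done (a genome search for homologs, then a tree that labels each MutS homolog as MutS1 or MutS2). The function name, the wording of the calls, and the genome names are hypothetical and are not taken from any published pipeline.

```python
def predict_mismatch_repair(homologs):
    """Make a cautious mismatch repair (MMR) call from the homologs present.

    homologs: set of labels such as {"MutS1", "MutS2", "MutL"}; the MutS1 and
    MutS2 labels are assumed to come from a phylogenetic tree, not from a
    similarity search alone.
    """
    has_muts1 = "MutS1" in homologs
    has_mutl = "MutL" in homologs
    if has_muts1 and has_mutl:
        # Both proteins known to be required for MMR are present.
        return "mismatch repair likely"
    if has_muts1 or has_mutl:
        # Only half of the required pair was found; flag it for a closer look.
        return "uncertain: incomplete MutS1/MutL pair"
    # A MutS2-only genome (or one with neither) gives no basis for an MMR call.
    return "no basis to predict mismatch repair"


# Hypothetical genomes; the second follows the MutS2-only pattern in the text.
genomes = {
    "genome_with_MutS1_and_MutL": {"MutS1", "MutL"},
    "genome_with_MutS2_only": {"MutS2"},
}
calls = {name: predict_mismatch_repair(found) for name, found in genomes.items()}
for name, call in calls.items():
    print(name, "->", call)
```

The sketch covers only a single pass through the loop. Working out when particular forms were gained or lost would need the species tree as well.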

5. Lions and TIGRs and bears
Since leaving Phil’s lab I have been a faculty member at The Institute for Genomic Research (TIGR) and in that time we have found dozens of new uses for a phylogenomic approach and designed many new methods to implement phylogenomics. Such an approach has led to many interesting findings relating to DNA repair. Phylogenetic analysis of eukaryotic genomes has allowed us to identify many nuclear encoded genes that are homologs of DNA repair genes but appear to be evolutionarily derived from the organellar genomes and thus are good candidates for still having a role in DNA repair in the organelles (38). These include both putatively plastid-derived genes (encoding RecA, Mfd, Fpg, RecG, MutS2, Phr, Lon) and mitochondrial-derived genes (encoding RecA, Tag). Interestingly, the presence of Mfd but not UvrABCD is also found in many endosymbiotic bacteria, although the explanation for what this Mfd might be doing is unclear. Phylogenomic analysis has allowed us to identify the loss of important DNA repair genes in various species, such as the apparent loss of all the genes for non-homologous end joining in the causative agent of malaria, Plasmodium falciparum (39). An important component of this analysis was the finding that this species did not have an ortholog of DNA ligase IV, even though the original annotation of the genome had suggested it did (Figure 1).

Figure 1. Phylogenetic tree of DNA ligase homologs showing the presence of an ortholog of DNA ligase I in Plasmodium falciparum but no ortholog of DNA ligase IV, consistent with the absence of non-homologous end joining.

Among the other interesting repair-related features we have found are: the presence of two MutL homologs in an intracellular bacterium, Wolbachia pipientis wMel (40), the presence of two UvrA homologs in Deinococcus radiodurans (41) and Chlorobium tepidum (42), the absence of MutS and MutL from Mycobacterium tuberculosis (43), and the presence of multiple ligases for each chromosome in Agrobacterium tumefaciens (44). Continued surprises come from almost every genome.
However, all is not good in the world of phylogenomics. One of the biggest problems is that most of the experimental studies of DNA repair that have formed the basis of our knowledge in the field have been done in a narrow range of species. For example, there are estimated to be over 100 major divisions of bacteria (phyla) and, of these, most DNA repair studies have been restricted to three phyla (Proteobacteria, Firmicutes (also known as low-GC Gram-positives), and Actinobacteria (also known as high-GC Gram-positives)). This means that if anything novel evolved in any of the other lineages, we would not know about it. This probably explains why, when we sequenced the genome of the radiation-resistant bacterium D. radiodurans, analysis of the homologs of DNA repair genes in the genome did reveal many homologs of known repair genes, but this list did not have many features that were unusual compared to non-radiation-resistant species (Table 1) and thus was not of much use in understanding what makes this species so resistant (41).


Table 1. Homologs of known DNA repair genes identified in the initial analysis of the D. radiodurans genome sequence

Process | Genes in D. radiodurans | Unusual features
--- | --- | ---
Nucleotide Excision Repair | UvrABCD, UvrA2 | UvrA2 not found in most species
Base Excision Repair | AlkA, Ung, Ung2, GT, MutM, MutY-Nths, MPG | More MutY-Nths than most species
AP Endonuclease | Xth | -
Mismatch Excision Repair | MutS, MutL | -
Recombination | | -
   Initiation | RecFJNRQ, SbcCD, RecD |
   Recombinase | RecA |
   Migration and resolution | RuvABC, RecG |
Replication | PolA, PolC, PolX, phage Pol | PolX not in many bacteria
Ligation | DnlJ | -
dNTP pools, cleanup | MutTs, RRase | -
Other | LexA, RadA, HepA, UVDE, MutS2 | UVDE not in many bacteria



This of course means that genome sequencing and analysis, even if done in a robust way, only works well if there is a core of experimental studies on which to base the analysis.
In the end, I would like to define a new word – philogenomics which is the combination of studies of evolution, genomics, DNA repair, thymine metabolism, and punning. The ultimate proof of a philogenomic approach, of course, will come when it figures out the mechanism underlying thymineless death. But that is another story.

6. Acknowledgements

I would like to thank Philip C. Hanawalt for his support during and after my Ph.D research in his lab. Everyone in the field knows he is a great scientist. What they may not all know is that he is an even better human being.

References

[1]       Wood, R.D., DNA repair in eukaryotes. Annu Rev Biochem, 1996. 65: p. 135-167.
[2]       Wood, R.D., Nucleotide excision repair in mammalian cells. J. Biol. Chem., 1997. 272(38): p. 23465-23468.
[3]       Wood, R.D. and M.K. Shivji, Which DNA polymerases are used for DNA-repair in eukaryotes? Carcinogenesis, 1997. 18(4): p. 605-610.
[4]       Wood, R.D., et al., Human DNA repair genes. Science, 2001. 291(5507): p. 1284-9.
[5]       Kurowski, M.A., et al., Phylogenomic identification of five new human homologs of the DNA repair enzyme AlkB. BMC Genomics, 2003. 4(1): p. 48.
[6]       Aravind, L., D.R. Walker, and E.V. Koonin, Conserved domains in DNA repair proteins and evolution of repair systems. Nucleic Acids Res, 1999. 27(5): p. 1223-1242.
[7]       Kulaeva, O.I., et al., Identification of a DinB/UmuC homolog in the archeon Sulfolobus solfataricus. Mutat Res, 1996. 357(1-2): p. 245-53.
[8]       Gorbalenya, A.E. and E.V. Koonin, Superfamily of UvrA-related NTP-binding proteins. Implications for rational classification of recombination/repair systems. J Mol Biol, 1990. 213(4): p. 583-91.
[9]       Gorbalenya, A.E., et al., Two related superfamilies of putative helicases involved in replication, recombination, repair and expression of DNA and RNA genomes. Nucleic Acids Res, 1989. 17(12): p. 4713-4730.
[10]     Makarova, K.S., et al., A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res, 2002. 30(2): p. 482-96.
[11]     Makarova, K.S., et al., Genome of the extremely radiation-resistant bacterium Deinococcus radiodurans viewed from the perspective of comparative genomics. Microbiol Mol Biol Rev, 2001. 65(1): p. 44-79.
[12]     Aravind, L. and E.V. Koonin, Prokaryotic homologs of the eukaryotic DNA-end-binding protein Ku, novel domains in the Ku protein and prediction of a prokaryotic double-strand break repair system. Genome Res, 2001. 11(8): p. 1365-74.
[13]     Aravind, L. and E.V. Koonin, The alpha/beta fold uracil DNA glycosylases: a common origin with diverse fates. Genome Biol, 2000. 1(4): p. RESEARCH0007.
[14]     Aravind, L., K.S. Makarova, and E.V. Koonin, SURVEY AND SUMMARY: holliday junction resolvases and related nucleases: identification of new families, phyletic distribution and evolutionary trajectories. Nucleic Acids Res, 2000. 28(18): p. 3417-32.
[15]     Menck, C.F., Shining a light on photolyases. Nat Genet, 2002. 32(3): p. 338-9.
[16]     Simpson, A.J., et al., The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature, 2000. 406(6792): p. 151-7.
[17]     Morgante, P.G., et al., Functional XPB/RAD25 redundancy in Arabidopsis genome: characterization of AtXPB2 and expression analysis. Gene, 2005. 344: p. 93-103.
[18]     Martins-Pinheiro, M., et al., Different patterns of evolution for duplicated DNA repair genes in bacteria of the Xanthomonadales group. BMC Evol Biol, 2004. 4(1): p. 29.
[19]     Estes, S., et al., Mutation accumulation in populations of varying size: the distribution of mutational effects for fitness correlates in Caenorhabditis elegans. Genetics, 2004. 166(3): p. 1269-79.
[20]     Denver, D.R., et al., Mutation rates, spectra, and hotspots in mismatch repair-deficient Caenorhabditis elegans. Genetics, 2005.
[21]     Denver, D.R., S.L. Swenson, and M. Lynch, An evolutionary analysis of the helix-hairpin-helix superfamily of DNA repair glycosylases. Mol Biol Evol, 2003. 20(10): p. 1603-11.
[22]     Forterre, P., Displacement of cellular proteins by functional analogues from plasmids or viruses could explain puzzling phylogenies of many DNA informational proteins. Mol Microbiol, 1999. 33(3): p. 457-65.
[23]     Cohen, G.N., et al., An integrated analysis of the genome of the hyperthermophilic archaeon Pyrococcus abyssi. Mol Microbiol, 2003. 47(6): p. 1495-512.
[24]     Bouyoub, A., et al., A putative SOS repair gene (dinF-like) in a hyperthermophilic archaeon. Gene, 1995. 167(1-2): p. 147-149.
[25]     Moran, N.A. and A. Mira, The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol, 2001. 2(12): p. RESEARCH0054.
[26]     Dale, C., et al., Loss of DNA recombinational repair enzymes in the initial stages of genome degeneration. Mol Biol Evol, 2003. 20(8): p. 1188-94.
[27]     van Ham, R.C., et al., Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci U S A, 2003. 100(2): p. 581-6.
[28]     Moran, N.A. and J.J. Wernegreen, Lifestyle evolution in symbiotic bacteria: insights from genomics. Trends in Ecology and Evolution, 2000. 15(8): p. 321-326.
[29]     Moran, N.A., Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc Natl Acad Sci U S A, 1996. 93(7): p. 2873-8.
[30]     Liu SK, Eisen JA, Hanawalt PC, Tessman IW. 1993. recA mutations that reduce the constitutive coprotease activity of the RecA1202(PrtC) protein: possible involvement of interfilament association in proteolytic and recombination activities. J. Bacteriol. 175: 6518-6529.
[31]     Eisen JA. 1995. The RecA protein as a model molecule for molecular systematic studies of bacteria: comparison of trees of RecAs and 16S rRNAs from the same species. J. Mol. Evol. 41: 1105-1123.
[32]     Eisen JA, Sweder KS, Hanawalt PC. 1995. Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions. Nucl. Acids Res. 23: 2715-2723.
[33]     Eisen JA, Kaiser D, Myers RM. 1997. Gastrogenomic delights: a movable feast. Nature Medicine 3: 1076-1078.
[34]     Eisen JA. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8: 163-167.
[35]     Eisen JA. 1998. A phylogenomic study of the MutS family of proteins. Nucl. Acids Res. 26: 4291-4300.
[36]     Eisen JA. Hanawalt PC. 1999. A phylogenomic study of DNA repair genes, proteins, and processes. Mut. Res. 435: 171-213.
[37]     Bjorkholm B, Sjolund M, Falk PG, Berg OG, Engstrand L, Andersson DI. 2001. Mutation frequency and biological cost of antibiotic resistance in Helicobacter pylori. Proc Natl Acad Sci U S A. 98(25):14607-12.
[38]     Britt AB, Eisen JA. 2000. DNA repair and recombination. In 'Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.' Nature 408: 796-815.
[39]     Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B. 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498-511.
[40]     Wu M, Sun L, Vamathevan J, Riegler M, Deboy R, Brownlie J, McGraw E, Mohamoud Y, Lee P, Berry K, Khouri HM, Paulsen IT, Nelson KE, Martin W, Esser C, Ahmadinejad N, Wiegand C, Durkin AS, Nelson WC, Beanan MJ, Brinkac LM, Daugherty SC, Dodson RJ, Gwinn M, Kolonay JF, Madupu R, Craven MB, Utterback T, Weidman J, Nierman WC, Aken SV, Tettelin H, O’Neill S, Eisen JA. 2004. Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined genome massively infected with mobile genetic elements. PLOS Biology 2: 327-341.
[41]     White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ, Haft DH, Gwinn ML, Nelson WC, Richardson DL, Moffat KS, Qin H, Jiang L, Pamphile W, Crosby M, Shen M, Vamathevan JJ, Lam P, McDonald L, Utterback T, Zalewski C, Makarova KS, Aravind L, Daly MJ, Minton KW, Fleischmann RD, Ketchum KA, Nelson KE, Salzberg SL, Smith HO, Venter JC, Fraser CM. 1999. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 286: 1571-1577.
[42]     Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, Dodson RJ, Deboy R, Gwinn ML, Nelson WC, Haft DH, Hickey EK, Peterson JD, Durkin AS, Kolonay JL, Yang F, Holt I, Umayam LA, Mason T, Brenner M, Shea TP, Parksey D, Nierman WC, Feldblyum TV, Hansen CL, Craven MB, Radune D, Vamathevan J, Khouri H, White O, Venter JC, Gruber TM, Ketchum KA, Tettelin H, Bryant DA, Fraser CM. 2002. The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc. Natl. Acad. Sci. USA 99: 9509-9514.
[43]     Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, DeBoy R, Dodson R, Gwinn M, Haft D, Hickey E, Kolonay JF, Nelson WC, Umayam LA, Ermolaeva M, Salzberg SL, Delcher A, Utterback T, Weidman J, Khouri H, Gill J, Mikula A, Bishai W, Jacobs WR Jr, Venter JC, Fraser CM. 2002. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol. 184: 5479-5490.
[44]     Wood DW, Setubal JC, Kaul R, Monks DE, Kitajima JP, Okura VK, Zhou Y, Chen L, Wood GE, Almeida Jr. NF, Woo L, Chen Y, Paulsen IT, Eisen JA, Karp PD, Bovee Sr. D, Chapman P, Clendenning J, Deatherage G, Gillet W, Grant C, Kutyavin T, Levy R, Li M-J, McClelland R, Palmieri A, Raymond C, Rouse G, Saenphimmachak C, Wu Z, Romero P, Gordon D, Zhang S, Yoo H, Tao Y, Biddle P, Jung M, Krespan W, Perry M, Gordon-Kamm B, Liao L, Kim S, Hendrick C, Zhao Z-Y, Dolan M, Chumley F, Tingey SC, Tomb J-F, Gordon MP, Olson MV, Nester EW. 2001. The genome of the natural genetic engineer Agrobacterium tumefaciens C58. Science 294: 2317-2323.




Tuesday, February 7, 2012

New openaccess paper from my lab on "Zorro" software for automated masking of sequence alignments


A new Open Access paper from my lab was just published in PLoS One: Accounting For Alignment Uncertainty in Phylogenomics. Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288

The paper describes the software "Zorro" which is used for automated "masking" of sequence alignments.  Basically, if you have a multiple sequence alignment you would like to use to infer a phylogenetic tree, in some cases it is desirable to block out regions of the alignment that are not reliable.  This blocking is called "masking."

Masking is thought by many to be important because sequence alignments are in essence a hypothesis about the common ancestry of specific residues in different genes/proteins/regions of the genome.  This "positional homology" is not always easy to assign and for regions where positional homology is ambiguous it may be better to ignore such regions when inferring phylogenetic trees from alignments.

Historically, masking has been done by hand/eye looking for columns in a multiple sequence alignment that seem to have issues and then either eliminating those columns or giving them a lower weight and using a weighting scheme in the phylogenetic analysis.

Zorro removes much of the subjectivity of this process and generates automated masking patterns for sequence alignments.  It does this by assigning confidence scores to each column in a multiple sequence alignment. These scores can then be used to account for alignment accuracy in phylogenetic inference pipelines.
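To make the idea of masking concrete, here is a small Python sketch of the simplest way per-column confidence scores could be used: dropping every column that scores below a cutoff. This is not Zorro's code or file format. The function name, the toy alignment, the scores, and the threshold are all made up for illustration, and down-weighting columns instead of removing them (as mentioned above) is the other obvious option.

```python
def mask_alignment(sequences, column_scores, threshold):
    """Keep only the alignment columns whose confidence score is >= threshold.

    sequences:     dict name -> aligned sequence (all the same length)
    column_scores: list with one confidence score per alignment column
    threshold:     minimum score a column needs in order to be kept
    """
    lengths = {len(seq) for seq in sequences.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("every sequence must have one score per column")

    # Indices of the columns that pass the confidence cutoff.
    keep = [i for i, score in enumerate(column_scores) if score >= threshold]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in sequences.items()
    }


# Toy alignment and made-up scores (hypothetical values for illustration).
alignment = {
    "seq1": "MK-LVA",
    "seq2": "MKQLIA",
    "seq3": "MR-LVG",
}
scores = [9.5, 8.0, 1.2, 9.0, 7.5, 3.0]
masked = mask_alignment(alignment, scores, threshold=5.0)
for name, seq in masked.items():
    print(name, seq)
```

Here the gappy third column and the low-scoring last column are dropped, and the four remaining columns would be passed on to tree inference.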

The software is available at Sourceforge: ZORRO – probabilistic masking for phylogenetics.  It was written primarily by Martin Wu (who is now a Professor at the University of Virginia) and Sourav Chatterji with a little help here and there from Aaron Darling I think.  The development of Zorro was part of my "iSEEM" project that was supported by the Gordon and Betty Moore Foundation.

In the interest of sharing, since the paper is fully open access, I am posting it here below the fold. UPDATE 2/9 - decided to remove this since it got in the way of getting to the comments ...

Thursday, January 19, 2012

Draft post cleanup #21: Tracking progress on the vertebrate tree of life

Yet another post in my "draft blog post cleanup" series. Here is #21; from March 2010:

A very interesting paper came out recently from colleagues of mine at UC Davis:  Rapid progress on the vertebrate tree of life.  I did not know they were working on this but perhaps should have.  It has some fun/interesting analysis of the accumulation of phylogenetic knowledge over time.  For example, see Figure 1:

Cumulative phylogenetic information amassed for the last 16 years. The accumulation of sequences for vertebrates in GenBank (a), papers using the term 'phylogeny' or 'phylogenetics' in the Web of Science database (b) and phylogenetic resolution (measured as the proportion of nodes with at least 50% bootstrap support) in the vertebrate tree of life resulting from these research efforts (c). In all cases, the data are cumulative from the start of each analysis. Phylogenetic resolution is calculated as in Table 1. Trend lines are exponential in (a), and second order polynomial in (b) and (c).
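
As a rough illustration of the resolution measure named in that caption (the proportion of nodes with at least 50% bootstrap support), here is a small Python sketch. The function name and the support values are invented, and how the paper's Table 1 handles unresolved nodes is not shown here, so treat this as an assumption-laden toy rather than the authors' calculation.

```python
def phylogenetic_resolution(bootstrap_supports, cutoff=50.0):
    """Proportion of nodes whose bootstrap support is at least `cutoff` percent."""
    if not bootstrap_supports:
        raise ValueError("need at least one node")
    supported = sum(1 for support in bootstrap_supports if support >= cutoff)
    return supported / len(bootstrap_supports)


# Made-up bootstrap values for eight internal nodes (illustrative only).
supports = [100, 87, 49, 50, 12, 95, 33, 71]
resolution = phylogenetic_resolution(supports)
```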



The rest of the paper is worth a look.

And alas I stopped there ... I think I wanted to get Brad Shaffer and Bob Thomson's comments on the paper but never got around to it.  Two years later the paper still is worth a look ...

Thursday, December 15, 2011

Very nice new #PLoSGenetics paper on "Functional Phylogenomics" of Seed Plants

Update2 - 12/22 - Data available here.  Thanks to the authors for clearing things up quickly.

Update1 -  12/19 - Data for this paper seems to be unavailable - not sure why - but looking into this after a TWEET from Karen Cranston. The paper says data is available at: http://nypg.bio.nyu.edu/main/ but I could not find any there.  Note - this is one reason that all data sets should be made available at the journal or third party sites.

Original post:

OK never mind that the terminology of "functional phylogenomics" is a tiny bit vexing to me (long story - some other time perhaps). The paper behind it - PLoS Genetics: A Functional Phylogenomic View of the Seed Plants is very cool.

Here's what the authors did (a very coarse summary)

1. Identified sets of orthologs between plant species using the OrthologID system (which has a phylogenetic underpinning) (the data input for this appeared to have mostly been Unigene EST clusters)

2. Constructed a "total evidence" phylogeny for these taxa (using a few approaches) 


3. Used this phylogeny to reinterpret some general features of the evolution of plants 

4. Searched for gene ontology categories (in annotated genes from these organisms) that agreed with the phylogeny. In essence, this seems to be a search for shared-derived traits (i.e., synapomorphies) in particular clades. 




5. Generated hypotheses about functional evolution in particular clades.

Overall, there is a lot that is really fascinating in here and this approach seems very powerful (though I note that I think something akin to this, though not as comprehensive or as careful, has been done for other groups, but I am not sure).  Check out the paper for more detail ...
Lee EK, Cibrian-Jaramillo A, Kolokotronis S-O, Katari MS, Stamatakis A, et al. (2011) A Functional Phylogenomic View of the Seed Plants. PLoS Genet 7(12): e1002411. doi:10.1371/journal.pgen.1002411

Tuesday, September 20, 2011

Special Guest Post & Discussion Invitation from Matthew Hahn on Ortholog Conjecture Paper



I am very excited about today's post.  It is the first in what I hope will be many - posts from authors of interesting papers describing the "Story behind the paper".  I write extensive detailed posts about my papers and also have tried to interview others about their papers if they are relevant to this blog.  But Matthew Hahn approached me recently about the possibility of him writing up some details on his recent paper on the functions of orthologs vs. paralogs.  So I said "sure" and set up a guest account for him to write up his comments and details of the paper.  


For those of you who do not know, Matt is on the faculty at U. Indiana.  He was a post doc at UC Davis so I have a particular bias in favor of him.  But his recent paper has generated some controversy (I posted some links about it here).  So it is great to get some more detail from him.  In addition, I note, I am also using this approach to try and teach people how easy it is to write a blog post by getting them guest accounts on Blogger and letting them write up something with links, pictures, etc.  So hopefully we can get more scientists blogging too.


Anyway - without any further ado - here is Matt's post:
-----------------------------------------------------------------------
Following Jonathan’s excellent example of how explaining the history of a project helps to illuminate how the process of science actually happens, I thought I’d start by giving a bit of history behind our study, and the paper that we recently published in PLoS Computational Biology (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002073). And then I’ll address the critics…

How this all got started
It all started a bit more than three years ago, in the summer of 2008. Pedja (as Predrag Radivojac is known to friends) was giving a talk to a group of us on protein function prediction that he also presented as a tutorial at the Automated Function Prediction SIG at ISMB 2008. Pedja and I had already collaborated on a small project involving the evolution of phosphorylation sites, but I really had no idea about his work on function prediction, and little idea in general about how function prediction was done. Reviewing different ways to accomplish transfer-by-similarity, he eventually got around to evolutionary (phylogenomic) approaches. Here is what I remember of this specific exchange during his talk:


Pedja: …and of course these methods only use orthologs for prediction, because orthologs have more similar functions than do paralogs.
me (from audience): Who says?
Pedja: Umm, you say. I mean, the evolutionary biologists say.
me: No, we don’t. I don’t know of any data that says any such thing.
Pedja: Whatever, Matt. We’ll talk about this later.

Well, we did talk about it later, and it turned out that although this claim is made in tons of papers, there is basically no data to support it. In the best cases a real example of one gene family will be cited, but there are very few of these. In the worst cases, the authors will just cite some random paper about gene duplicates (or Fitch's original paper defining orthologs and paralogs). Of course I agree that patterns of sequence evolution might lead you to conclude this relationship was true, but there was no experimental data.


In fact, as we say in our paper, rarely did anyone recognize that this claim needed to be tested, or even that it was a claim that could be tested. At the time Eugene Koonin was the only person to say this: “The validity of the conjecture on functional equivalency of orthologs is crucial for reliable annotation of newly sequenced genomes and, more generally, for the progress of functional genomics. The huge majority of genes in the sequenced genomes will never be studied experimentally, so for most genomes transfer of functional information between orthologs is the only means of detailed functional characterization” (http://www.ncbi.nlm.nih.gov/pubmed/16285863). I really liked the way that Eugene had said this, and started to refer to the idea that orthologs were more functionally similar than paralogs as the “ortholog conjecture.” So to be clear: I completely made up this phrase, but used the most evocative word from the Koonin paper.

Luckily for Pedja and me we had just gotten a small internal grant to work on genome annotation and we had an incoming master’s student (Nathan Nehrt) who was willing to work on a project intending to test the ortholog conjecture.


Interlude: the crappy state of things in the study of the evolution of function

In order to test anything about how function evolves between orthologs and paralogs—or between any genes—one of course needs some kind of data on gene function in multiple species. And this turns out to be a big problem.

Because, as Koonin says in the earlier quote, the vast majority of experimental data comes from a very few species, and these species are not exactly closely related. Here is an approximate phylogeny of the major eukaryotic model organisms:


It’s obvious from this figure that if you need both 1) lots of functional data from two species, and 2) a pretty good idea of exactly what the homologous relationships are between the genes you’re studying, you’re going to have to study human and mouse.

This is actually a pretty bleak picture for people who study molecular evolution (as I do). While we have tons and tons of sequence data both within and between species, and a very good idea about how these sequences evolve, and fancy models with which to analyze these sequences…we know next to diddly-squat about general patterns relating these sequence differences to functional differences. There are lots of interesting things to be gleaned from studies of sequence evolution, but it really would be nice to know something about the relationship between sequence and function.


What we found

What exactly does the ortholog conjecture predict? In my mind, at least, it predicts something like this:


In this completely fictitious graph the relationship between protein function and sequence similarity is a declining one, only it declines faster for paralogs than it does for orthologs. Also, just possibly, gene duplicates start out with slightly diverged function the minute they appear. Anyway, those were our predictions.

But here is what we found (Figure 1 in Nehrt et al. 2011):


(Panel A uses the Biological Process ontology and panel B uses the Molecular Function ontology.)

There are really two different, equally surprising results here. First, there is no relationship between sequence divergence and functional divergence for orthologs (among 2,579 one-to-one orthologs between human and mouse). Absolutely none—it’s a straight horizontal line. Second, there is a relationship for paralogs (among 21,771 comparisons), exactly as we predicted there would be. So according to our results, paralogs actually have more conserved function than do orthologs. Our interpretation of the data was that the most important determinant of function was the organismal context in which a gene/protein found itself: given the same amount of sequence divergence, two proteins in the same organism would be more functionally similar. For orthologs, this means that the sequence divergence of our target gene was not the most important thing, but rather the sum total of divergence in all of the genes that contribute to its cellular context. Which is why all the orthologs have on average similar functional divergence—they are all exactly the same age and hence have approximately the same levels of divergence in these interactors (in this case sequence divergence for paralogs is a much better indicator of their splitting time).
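
One way to picture the contrast described above is to fit a straight line to functional similarity as a function of sequence divergence, separately for orthologs and for paralogs, and compare the slopes. The Python sketch below does that with ordinary least squares. The numbers are invented to mimic the qualitative pattern (flat for orthologs, declining for paralogs); they are not data from Nehrt et al. 2011, and the function name is hypothetical.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented values: sequence divergence on x, functional similarity on y.
divergence = [0.1, 0.2, 0.3, 0.4, 0.5]
ortholog_similarity = [0.60, 0.60, 0.60, 0.60, 0.60]  # flat line
paralog_similarity = [0.90, 0.80, 0.70, 0.60, 0.50]   # declining line

ortholog_slope = least_squares_slope(divergence, ortholog_similarity)
paralog_slope = least_squares_slope(divergence, paralog_similarity)
```

With these toy numbers the ortholog slope comes out at zero and the paralog slope is negative, which is the shape of the result described in the text, not a reproduction of it.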

Without going through every result in the paper and our interpretation of every result, suffice it to say that after about a year-and-a-half of working on this (around February 2010), we were satisfied that we had something we were willing to submit. I even seem to remember showing the above figure to Jonathan on a visit to UC-Davis! So we did submit the paper, first to PNAS and then, after rejection, to PLoS Computational Biology, where it was rejected again.

The content of the reviews was approximately the same at both journals. Basically, people were not convinced of our results mostly because the functional relationships were all based on data in the Gene Ontology database. To be specific, the functional data we used came from experiments conducted in 12,204 different papers. We didn’t use any predicted functions, only functions assigned using experimental data. And we did A LOT of work to try to eliminate problems that might have affected our results, including repeating the main analysis using only GO terms common to both the human and mouse datasets. But there can still be bias hidden within these functional assignments because someone always has to interpret the experiment—to say that a yeast two-hybrid experiment means that a gene has function X. And because of these biases, people weren’t buying it.

To get a measure of functional similarity that did not depend on the interpretation of any experiments, we decided to repeat the entire analysis using microarray data, using the correlation in expression levels across 25 tissues as the measure of functional similarity. By this time Nathan was graduating and moving on to Maricel Kann’s lab as a research programmer, so we recruited one of Pedja’s Ph.D. students, Wyatt Clark, to pick up where Nathan had left off. (Wyatt had actually been a student in my undergraduate Evolution course a few years earlier, so we figured he knew something…) After repeating all of the GO-based analyses himself—always better to double-check, right?—Wyatt got all of the microarray data in order and produced this figure (Figure 4 in Nehrt et al. 2011):
So a year after we first submitted a paper, we submitted a new version to PLoS CB including the array analysis, and this was enough to convince the reviewers that our results were not merely due to some strange bias in GO.
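
The expression-based measure of functional similarity described above (the correlation in expression levels across tissues) can be sketched in a few lines of Python. This assumes a plain Pearson correlation and uses four invented tissue values per gene rather than the 25 tissues in the study; the gene names and numbers are hypothetical, and the paper's actual normalization steps are not reproduced here.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length profiles with at least two tissues")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)


# Invented expression levels across four tissues (illustrative only).
gene_a = [5.0, 7.0, 9.0, 11.0]
gene_b = [10.0, 14.0, 18.0, 22.0]  # same pattern as gene_a, doubled
gene_c = [11.0, 9.0, 7.0, 5.0]     # opposite pattern

similar = pearson_correlation(gene_a, gene_b)
opposite = pearson_correlation(gene_a, gene_c)
```

Because the correlation ignores overall scale, gene_b counts as maximally similar to gene_a even though it is expressed at twice the level, which is the appeal of this measure: no one has to interpret an experiment to assign it.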


The fallout, and some responses

First, let me say that I had some idea that this would be a controversial-ish paper, and that we’d get at least some blowback. For about the first 20 versions of the manuscript (including some submitted versions) I put the words “ortholog conjecture” in quotes in the title, never an endearing move. (Pedja finally convinced me to take them out of the latest submissions.) But I also thought people would be happier that an untested assumption had finally been tested, and we have definitely gotten some positive feedback along these lines, including several groups that told us they have data that support our findings. By coincidence my lab had another paper come out the same week as this one (http://www.ncbi.nlm.nih.gov/pubmed/21636278), and I mistakenly thought it would generate much more attention. I still think the biological importance of the results in that one is much greater than that of the ortholog conjecture results, but either because we didn’t publish in an open-access journal (Jonathan is always right) or simply because the function-prediction community is more active on the interweb tubes, there have been a surprising number of critical responses (partially collected here: http://phylogenomics.blogspot.com/2011/09/some-links-on-ortholog-conjecture-paper.html). So here are some responses to general critiques.


The ortholog conjecture says only that orthologs are similar.
Okay, this one is a bit unfair, as only one person has said this. The real problem here is that Michael Galperin seems to have deeply misunderstood what we mean by the ortholog conjecture. According to him the ortholog conjecture is “the assumption that orthologs (genes with a common origin that were vertically inherited from the same gene in the last common ancestor of the host organisms) typically retain the same function or have closely related ones.” Umm, no. In fact, if you really think this is what the ortholog conjecture says, then our results support it—human and mouse orthologs do typically have closely related functions. But we are explicitly testing for a difference between orthologs and paralogs, not whether or not orthologs retain any functions. At no point did we say (or even hint) that orthologs should not be used for functional prediction. The whole point of our analysis and conclusions is that we should stop ignoring paralogs, which would give us a ton more data to use for the prediction of functions.


The assignments of orthology and paralogy are incorrect.
This is an easy one: we did in fact get the definitions of in- and out-paralogs correct (and laid them out in Figure S1). According to Sonnhammer and Koonin: “Our definition of ‘outparalogs’ is: paralogs in the given lineage that evolved by gene duplications that happened before the radiation (speciation) event” (http://www.ncbi.nlm.nih.gov/pubmed/12446146). For the purposes of our study, this means that outparalogs are defined as any paralogs that diverged before the speciation event between human and mouse and inparalogs diverged after this speciation event. Outparalogs do not indicate only paralogs in two different species, though by necessity in our dataset inparalogs are only found in the same species (all in human or all in mouse). Therefore, with respect to our conclusion that the most important determinant of function is which genome you are found in (i.e. context), it wouldn’t matter if we had incorrect gene trees: we would never confuse two genes in the same species (either inparalogs or some of the outparalogs) with two genes in different species (all orthologs and the remaining outparalogs).
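
The in-/out-paralog distinction quoted above boils down to comparing the age of the duplication with the age of the speciation event. Here is a minimal Python sketch of that rule. The function name is hypothetical, the ages are in arbitrary units measured backwards from the present, and treating an exact tie as "inparalogs" is an arbitrary choice made only so the function always returns something.

```python
def classify_paralogs(duplication_age, speciation_age):
    """Label a pair of paralogs relative to one speciation event.

    Ages are measured backwards from the present, in the same (arbitrary) units.
    A duplication older than the speciation gives outparalogs; a younger one
    gives inparalogs. An exact tie is (arbitrarily) treated as inparalogs.
    """
    if duplication_age < 0 or speciation_age < 0:
        raise ValueError("ages must be non-negative")
    if duplication_age > speciation_age:
        return "outparalogs"
    return "inparalogs"


# Hypothetical ages, with the human-mouse split set to 1.0 arbitrary units.
older_duplicate = classify_paralogs(duplication_age=2.5, speciation_age=1.0)
younger_duplicate = classify_paralogs(duplication_age=0.3, speciation_age=1.0)
```

Note that the label says nothing about which genomes the two genes sit in: outparalogs can be in the same species or in different ones, while inparalogs in a two-species dataset are necessarily in the same species.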


You should have inferred functions yourselves
This is a fair suggestion, and not having enough time to annotate functions for 40,000 proteins would be a pretty weak excuse for doing good science. Instead...I'll just say that it turns out professional curators are much better at assigning functions than even the original study authors (see http://www.ncbi.nlm.nih.gov/pubmed/20829821). Curators have a much broader view of the whole set of terms available in any ontology, and a much more consistent idea of how to apply these terms. My favorite line from the above cited article: "...because of the relatively low accuracy of the authors' submissions, the use of authors' annotations did not result in saving of curators' time..."


GO is not appropriate for this analysis because it is biased.
This is the most frustrating criticism of our study, if only because it’s partly true: GO is biased. In our paper we actually detail several of these biases, including the observation that the same set of authors will give two proteins more-similar functions than will two different sets of authors. We tried very hard to attempt to control for these biases, though of course one cannot account for all of them. The most uncharitable part of this critique, however, has to be the fact that people conveniently forgot to say that our array analysis was completely distinct from the GO-based analysis (though it has its own issues), and that Burkhard Rost’s analysis of protein-protein interaction (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020079) was also completely free of any bias in GO and was consistent with all of our results.

More annoying than this, you’d think from some of the critiques of GO that it was some sort of fly-by-night operation that no one should ever depend on. I mean, c’mon—there are human curators and human experimenters and of course they’re all biased so badly one could never compare functions between proteins much less between species. What were we thinking? (Only that the original GO paper has been cited >7000 times.) Funnily enough, at several points during the course of this work Pedja suggested—only half-jokingly—that we should just assume the ortholog conjecture was correct and write a paper about how GO must be wrong. Seriously, though: one would think from the excuses people came up with for the problems inherent in GO that we should simply stop using it to, you know, predict function in other species. And we were applying it to two relatively closely related mammals, one of which is explicitly a model for the other.


What next?

Our paper laid out several explicit hypotheses about the evolution of function that arose from our findings. Unfortunately, testing any of these hypotheses will require a ton more functional data, in more than one species. I know there are multiple groups working to collect these sorts of labor-intensive datasets, and Pedja and I are thinking about doing it ourselves (with collaborators, of course!). Massive datasets that reveal protein function will always be a lot harder to collect than sequence data, especially ones free from biases.


So let’s get to it…


---------------------------

Note - Toni Gabaldón was trying to post a detailed response but Blogger kept cutting him off with a character limit.  So I have posted his response below.

I appreciate the effort by Matthew Hahn in explaining the story behind his paper on the so-called "Ortholog conjecture" and in facing some of the criticism. This paper attracted my interest, as it did that of many others who work on or just use orthology. For instance, it was chosen by one of my postdocs for our "Journal Club" meeting. And it was discussed during our last "Quest for Orthologs" meeting in Cambridge. I think it is raising a necessary discussion and therefore I think it is a good paper. This does not mean that I fully agree with the interpretation and conclusions ;-). I hope to modestly contribute to this debate with the following post.

I think one of the reasons this paper has caused so much debate is that the conclusions seem to challenge common practice (inferring function from orthologs), and could be interpreted as implying a need to change the strategies of genome annotation. I think, however, that one should interpret these results carefully before starting to annotate based on paralogous proteins. As I will discuss below, one of the problems is that we need to agree on what the conjecture is in order to then agree on how to test it. I see three main points that can be a source of confusion: i) the issue of what is actually stated by this conjecture, ii) the issue of annotation, and iii) the issue of time.

i) What is the "ortholog conjecture"?
Or in other terms, when should we expect orthologs to be more likely to share function than paralogs? Always? Of course not. All of us would agree that two recently duplicated paralogs are likely to be more similar in function than two distant orthologs, so it is obvious that the conjecture is not simply "orthologs are more similar in function than paralogs". In reality the expectation that orthologs are more likely to be similar in function than paralogs, at least this is how I interpret it, is directly related to the effect that duplication has on functional divergence. If gene duplication has some effect on functional divergence (even if not in 100% of the cases), then, given all other things equal (divergence time, story of speciation/duplication events - except for the duplication defining the orthologs) one would expect orthologs to be more likely to conserve function.

I think this complexity is not well considered (by many authors, in general). Hahn refers to the famous review of orthology by Koonin (2005) as the source for the term "ortholog conjecture". However, in that paper this conjecture is always discussed within the context of genes across two particular species, whereas in Hahn's paper it is taken to other contexts as well. Thus, the proper context in which to test this conjecture is only between orthologs and between-species paralogs. As we can see, the red and purple lines in Figure 2 of Hahn's paper do not show any clear difference.

 Secondly, Koonin was very cautious in his paper, stating that he was referring to "equivalent functions" and not exactly the same "function", correctly implying that the functional contexts would be different in the two different species. This brings me to the next point.

ii) annotation
If the expectation of functional conservation of orthologs refers to a given pair of species, then it makes no sense to test that expectation between paralogs within the same species and orthologs in different species. We were interested in this issue and it took us some effort to control for this "species" influence on the comparison. If you are interested you can read our paper on divergence of expression profiles between orthologs and paralogs (http://www.ncbi.nlm.nih.gov/pubmed/21515902).

As Hahn found, and as was anticipated by Koonin in that review, there is a huge influence of the "species context", a big constraint on what fraction of the function is shared. Indeed I think it is the dominant signal in Hahn's paper. Why is that? One possibility is that the functional context determines the function, I agree. However, we should not discard biases in how different communities working around a model species define processes and function, and also in the type of experiments that are usually done. For instance, experimental inference from KO mutants might be common for mouse, but I guess this is not the case in humans (!!). I think this may be having a big influence and might even be the dominant signal in Hahn's paper.

Finally, function has many levels and I expect subfunctionalization to mostly affect lower (i.e. more specific) levels. Biases may also exist in the level of annotation between species or between families of different size (contributing more or less to the orthologs/paralogs class).

Microarray data are less likely to be subject to biases (although some may exist); at least they should be expected to be free of "human interpretation biases", and so Hahn and colleagues did well, in my opinion, to test that dataset. It is important to note that for microarrays, and for orthologs and between-species paralogs (which I think is the right frame for testing the conjecture), orthologs are more likely to share an expression context. This is compatible with what we found in the paper mentioned above, and compatible with the ortholog conjecture as stated by Koonin (across species).


iii) time
 Finally, one aspect which I think is fundamental is the notion of "divergence time". Since paralogs can emerge at different time-scales they are composed of a heterogeneous set of protein pairs. Most comparisons of orthologs and paralogs (Hahn's as well) use sequence divergence as a proxy for time. However this is only a poor estimate, especially when duplications (as here) are involved (we explored this issue in the past: http://www.ncbi.nlm.nih.gov/pubmed/21075746). This means that for a given divergence time paralogs may have larger sequence divergence than orthologs at the same divergence time, or the reverse (if gene conversion is playing a role). Is the conjecture based on sequence divergence or on divergence time? I think the initial sense of using orthology to annotate across species is based on the notion of comparing things at the same evolutionary distance. Thus basing our conclusions on divergence times might not be the proper way of doing it.

CONCLUSIONS AND PROPOSAL FOR RE-STATEMENT

To conclude, and with the intention of going beyond this particular paper, I would finish by saying that the key to the problem lies in how we interpret the so-called "ortholog conjecture", or in what our expectations are about how function evolves. What I get from re-reading Eugene Koonin's paper, and how I am using that "assumption" in my day-to-day work, is the following:

"Orthologs in two given species are more likely to share equivalent functions than paralogs between these two species"

Therefore the notion of "across the same pair of species" is important and thus only part of the comparisons made by Hahn and colleagues could directly test this. Looking at the microarray and between-species comparisons data, the conjecture may even hold true!!

I, however, do think that the conjecture as stated above is limited and does not capture the complexity of orthology relationships. Indeed we, and many other researchers, are tuning the confidence of the orthology-based annotation based on whether the orthologs are one-to-one, one-to-many or many-to-many, even when orthologs are "super-orthologs" (with no duplication event in the lineages separating the two orthologs).

Since the underlying assumption of the ortholog conjecture is that duplication may (not necessarily always) promote functional shifts, many-to-many orthology relationships will tend to include orthologous pairs with different functions.

 Thus I would re-state the conjecture (or expectation) as follows:

 "In the absence of additional duplication events in the lineages separating them, two orthologous genes from two given species are more likely to share equivalent functions than two paralogs between these two species"

 This would be a more conservative expectation, which is closer to the current use of orthology-based annotation that tends to identify one-to-one orthologs, rather than any type.

 When duplications start appearing in subsequent lineages, thus creating one-to-many or many-to-many orthology relationships, the situation is less clear. Following the assumption that duplications may promote functional divergence, one could expand the conjecture with: "the more duplications in the evolutionary history separating two genes, the lower the expectation that these two genes would share equivalent functions".
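
 The expanded expectation above depends on counting the duplication events that lie on the evolutionary path between two genes. A minimal Python sketch of that count is given below. The tree encoding (a child-to-parent dictionary plus a flag per internal node), the function name and the toy gene family are all hypothetical choices for illustration, not the method of any particular orthology pipeline.

```python
def duplications_between(parent, is_duplication, gene_a, gene_b):
    """Count duplication nodes on the tree path joining two genes.

    `parent` maps each node to its parent (the root has no entry).
    `is_duplication` maps each internal node to True (duplication)
    or False (speciation).
    """
    def ancestors(node):
        path = []
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    path_a = ancestors(gene_a)
    path_b = ancestors(gene_b)
    shared = set(path_b)
    lca = next((node for node in path_a if node in shared), None)
    if lca is None:
        raise ValueError("genes do not share a common ancestor in this tree")
    # Internal nodes from gene_a up to and including the common ancestor,
    # then the nodes below the common ancestor on gene_b's side.
    on_path = path_a[:path_a.index(lca) + 1] + path_b[:path_b.index(lca)]
    return sum(1 for node in on_path if is_duplication[node])


# Toy gene family: S0 and S1 are speciations, D1 is a duplication.
# Tree: S0 -> (S1, z1); S1 -> (D1, m1); D1 -> (h1, h2)
parent = {"h1": "D1", "h2": "D1", "D1": "S1", "m1": "S1", "S1": "S0", "z1": "S0"}
is_duplication = {"S0": False, "S1": False, "D1": True}

one_to_one = duplications_between(parent, is_duplication, "m1", "z1")
one_to_many = duplications_between(parent, is_duplication, "h1", "z1")
paralogs = duplications_between(parent, is_duplication, "h1", "h2")
```

 In this toy family the m1/z1 pair is separated by no duplications, while h1/z1 and h1/h2 are each separated by one, so under the expanded expectation m1/z1 would be the pair most likely to share equivalent functions.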

 I wrote this contribution on the fly, and surely there are ways of expressing this in more appropriate terms. In any case I hope I made clear the idea that the conjecture emerges from the notion of duplications causing functional shifts and that our expectations will be clearer if expressed in those terms. This goes along the lines of what Jonathan Eisen mentioned on considering the whole phylogenetic story to annotate genes.

 Under this perspective, the really important hypothesis is that "duplications tend to promote functional shifts". I think this is based on solid grounds and has been tested intensively in the past.

 Cheers,

Toni Gabaldón

http://treevolution.blogspot.com