Tattoo World life: genomics

Thursday, May 10, 2012

Quick post - new paper of interest on "The Infinitely Many Genes Model ..."

This paper seems of potential interest: The Infinitely Many Genes Model for the Distributed Genome of Bacteria by Franz Baumdicker, Wolfgang R. Hess, and Peter Pfaffelhuber

Abstract:

The distributed genome hypothesis states that the gene pool of a bacterial taxon is much more complex than that found in a single individual genome. However, the possible fitness advantage, why such genomic diversity is maintained, whether this variation is largely adaptive or neutral, and why these distinct individuals can coexist, remains poorly understood. Here, we present the infinitely many genes (IMG) model, which is a quantitative, evolutionary model for the distributed genome. It is based on a genealogy of individual genomes and the possibility of gene gain (from an unbounded reservoir of novel genes, e.g., by horizontal gene transfer from distant taxa) and gene loss, for example, by pseudogenization and deletion of genes, during reproduction. By implementing these mechanisms, the IMG model differs from existing concepts for the distributed genome, which cannot differentiate between neutral evolution and adaptation as drivers of the observed genomic diversity. Using the IMG model, we tested whether the distributed genome of 22 full genomes of picocyanobacteria (Prochlorococcus and Synechococcus) shows signs of adaptation or neutrality. We calculated the effective population size of Prochlorococcus at 1.01 × 10^11 and predicted 18 distinct clades for this population, only six of which have been isolated and cultured thus far. We predicted that the Prochlorococcus pangenome contains 57,792 genes and found that the evolution of the distributed genome of Prochlorococcus was possibly neutral, whereas that of Synechococcus and the combined sample shows a clear deviation from neutrality.

Wish they had gone beyond these two cyanobacteria ... but still seems of possible interest.

Baumdicker, F., Hess, W., & Pfaffelhuber, P. (2012). The Infinitely Many Genes Model for the Distributed Genome of Bacteria. Genome Biology and Evolution, 4 (4), 443-456. DOI: 10.1093/gbe/evs016

Wednesday, March 28, 2012

Calling on Nature Publishing Group to return all money received for genome papers and article corrections

Well, let's see if Nature Publishing Group actually does the right thing here. A few days ago I showed that they were charging for access to "genome sequencing" papers that were supposed to be freely available (see Hey Nature Publishing Group - When are you going to live up to your promises about "free" genome papers? #opengate #aaaaaarrgh). And in researching this I then discovered that Nature Publishing Group has been charging for access to corrections of articles (see Nature's access absurdity: Human Genome Paper free but access to corrections will costs $64 and Corrections Scamming at Nature: Tantalizing clues, to see errors just pay more money #Seriously?).

Multiple people from NPG have posted on my blog and twitter that they are working on "fixing" these issues. By which I think they mean "We will make these freely available again." But this is not a full fix. NPG really needs to do a self audit and return ALL money that anyone has paid for access to these articles. Charging for something that is supposed to be free is not a good thing ... and if they want to really fix the issue they need to give any money they got for these papers back. Note - I already called for them to do this last year when I wrote about the genome papers not being free. But I never heard back. Please help put the pressure on them to do the right thing this time.

Tuesday, March 27, 2012

A Solution to Nature Publishing Group's Inability to Keep Free Papers Free: Deposit them in Pubmed Central

Well, tick tock tick tock. I am still awaiting some explanation for Nature Publishing Group once again charging for access to genome papers that they promised would be available for free. See my last post for more details: The Tree of Life: Hey Nature Publishing Group - When are you going to live up to your promises about "free" genome papers? #opengate #aaaaaarrgh

In the meantime I have come up with a solution even if NPG folks cannot figure one out. It is very simple. How about Nature Publishing just deposits all genome papers in Pubmed Central? That way, even when the money making machine of Nature switches some setting and makes the papers not freely available at the Nature web site(s) for some time, the papers will still be officially free in Pubmed Central. I think this is probably the only solution I would trust given that this is at least the third time this has happened.

Monday, March 26, 2012

Hey Nature Publishing Group - When are you going to live up to your promises about "free" genome papers? #opengate #aaaaaarrgh

This is just ridiculous. Nature Publishing Group announced in 2007 that all papers in their journals that reported genome sequences would be made freely available and would be given a Creative Commons license: Shared genomes : Article : Nature.

About a year ago I posted to twitter (using the hashtag #opengate) and my blog about how Nature Publishing Group was not following through on their promises. See for example

and more including some from others

Amazingly, and pleasantly, I note, in my complaining I exacted some responses from people from Nature Publishing Group who swore that these were just oversights and they would fix them. Well, alas, the money collecting machine of Nature Publishing Group is back.

For example, currently the following papers are not freely available even though at one point they were or they clearly fit in the "Shared genomes" definition Nature Publishing Group so happily promotes:

Shewanella oneidensis in Nature Biotech.
Desulfovibrio vulgaris in Nature Biotech.
Metagenomic contigs in Nature
Microbial sequencing review in Nature (this used to be freely available ...).

These above are all papers of mine, so I noticed them first (I noticed this when trying to create a Pinterest Board for all my papers; not being able to get to a free page for these papers meant I couldn't add them to the Board). Could it be that Nature Publishing Group is just trying to get my goat? Let's see. A brief search found these papers by others - all also not freely available even though all clearly fit Nature's own definition of genome sequencing papers:

Here are some others

I think the funniest (and scariest) part may be the corrections and errata that are not freely available. And these are just the articles I found in a 15 minute search. I am sure there are more. Yes, Nature Publishing Group has made many genome papers freely available. That is great. Much better than many other publishers. But the cracks in your system are large and suggest that nobody there is actually dedicated to seeing through on the promises. Promises are meaningless. Follow through is the key. Come on Nature Publishing Group - how about assigning a "Free access ombudsman" or something like that who will make sure that free means free. I am sick of writing these posts. You should do your own QC ...

UPDATE: see some more recent blog posts of mine about this topic:

UPDATE 3-28-12 1 PM PST:
Well, if you look at the comments, Nature is apparently trying to fix this and most of the articles I listed above are now freely available (the corrections are still not free but they claim to be working on it). But a simple search of Nature finds there are still some papers that are closed off that shouldn't be:

It's not that hard to find these. It baffles me a bit how people at Nature don't seem to be able to find them. But maybe I am just really good at searching ...

Wednesday, March 21, 2012

Notes (from me and mostly others) from the JGI User Meeting #JGIUM7

OK, so the DOE Joint Genome Institute User Meeting is underway. Day 2 just finished. And I have been there for much of it but alas, not in some of the talks since I can't seem to get past the hallway/gathering area outside the talks. There are way way too many people there who I have not talked to or seen in a while ... So ... apologies to those who thought I might be live tweeting the whole meeting ... it just hasn't happened. But I did use the Storify web tool to make a "storification" of posts to twitter from the meeting - most of which were from other people. Here is the story in slideshow format.

I will update the storification tomorrow. If you want to see the full details in a scrolling window, see below.

Saturday, March 17, 2012

The Axis of Evol: Getting to the Root of DNA Repair with Philogeny

In 2005 I wrote an essay about my time in graduate school that was potentially going to be included in a special issue of Mutation Research in honor of my PhD advisor Phil Hanawalt. Alas, publishing my essay ran into complications in regard to the closed access policies of this journal. So in the end, my essay was not published. I had forgotten about it mostly until very recently. And so I decided to convert the essay to a blog post. The essay is sort of about what I did in grad. school and sort of about Phil ...

Abstract:

Phylogenomics is a field in which genome analysis and evolutionary reconstructions are integrated. This integration is important because genome data is of great value in evolutionary reconstructions, because evolutionary analysis is critical for understanding and interpreting genomic data, and because there are feedback loops between evolutionary and genome analysis such that they need to be done in an integrated manner. In this paper I describe how I developed my particular phylogenomic approach under the guidance of my Ph.D. advisor Philip C. Hanawalt. Since I was the first to use the term phylogenomics in a publication, I have decided to rename the field (at least temporarily) Philogenomics.

1. Doctor of Philosophy

When I went to Stanford for graduate school, I was interested in combining evolutionary analysis and molecular biology in a way that would allow me to study molecular mechanisms through an evolutionary perspective. Although I had gone to Stanford ostensibly to work on butterfly population genetics, within two days of starting a rotation in Phil’s lab, I knew that that was where I wanted to work. This decision was somewhat traumatic, since the work on butterflies included spending the summers at 10,000 feet in the Rocky Mountains and possibly chasing butterflies like a Nabokov wanna-be all over the mountain ranges of the world. As an avid outdoor person, this was quite appealing. Nevertheless, I chose to spend 99% of my graduate work in the dingy confines of Herrin Hall, studying DNA repair. The choice of joining Phil’s lab did have one very positive effect – and that was on my relationship with my grandfather on my mother’s side. Benjamin Post was in many ways like a father to me, especially after my father passed away. He was a physicist from the “old school” and thought that most of biology was completely useless. Needless to say, when I told him I was going to graduate school in California (which he considered already one strike against me) to study butterflies, he decided I was simply a lost cause. Despite all his talk of Einstein and computers and math when I was a child, I might as well have been a poet from his point of view. To make matters worse, my grandfather was a crystallographer, and my brother was getting his Ph.D. in crystallography at Harvard. When I informed my grandfather that I was going to be working on DNA repair, he seemed somewhat interested. And then I told him that my advisor, Phil Hanawalt, is relatively well known, and actually used to be considered a biophysicist. Then my grandfather really perked up. He said, “Hanawalt – is he related to Don Hanawalt?” It turns out that my grandfather worked in the same field as Phil’s father (they both did powder diffraction) and knew him. So my grandfather said “You may not be doing real science, but at least you are doing it with the relative of a real scientist.” Thankfully, I was no longer the black sheep in the family. So, with my grandfather’s approval, I embarked on a career in DNA repair.

I would like to add that I was very torn in writing this article. On the one hand, Phil was the greatest advisor I could ever imagine, allowing me to pursue studies on the evolution of DNA repair and comparative genomic analysis, even though nobody else in the lab worked on such things and at times, nobody seemed interested in them either. Phil’s support allowed me to explore my own interests and develop my concepts for the idea of “Phylogenomics” or the combining of evolutionary reconstructions and genome analysis. On the other hand, this special issue is being published in an Elsevier journal. As a supporter of the Open Access movement on scientific publications (see http://www.plos.org) and the brother of one of the founders of the Public Library of Science, publishing in an Elsevier journal is like cavorting with the devil. But the pull of Phil is very strong (some strange sort of force actually) and despite the effects that this may have on my relationship with my brother, I have agreed to publish in this special issue, and thus can now say that I sold my soul for Phil Hanawalt. [[OOPS - Spoke too soon on this when I wrote it --- in the end I just could not sign on the dotted line]].

In this essay, I describe my development in Phil’s lab of the idea of “Phylogenomics” or the combination of evolutionary reconstructions and genome analysis. I would like to add that this is not an attempt to review the field of phylogenomics or all the studies that could be called phylogenomics of DNA repair. For that I recommend reading other papers by myself (some of which are discussed below) as well as those by Rick Wood [1-4], Janusz M Bujnicki [5], Eugene Koonin [6-14], Carlos Menck [15-18], Michael Lynch [19-21], Patrick Forterre [22-24], Nancy Moran [25-29], and others. This is just meant to review my angle on the phylogenomics of repair and Phil’s contribution to this.

2. RecAgnizing the value of evolutionary analysis in studies of DNA repair

A post-doc in Phil’s lab at the time I was there, Shi-Kau (now known as Scott) Liu, was working on analysis of some studies of recA mutants he had done while working in Irwin Tessman’s lab. He asked me if I could help him with some comparative analyses of RecA protein sequences from different species, in the hopes that this might help interpret his experimental data. We then downloaded and aligned all available RecA protein sequences from different species of bacteria and compared the sequence variation to the recently solved crystal structure of a form of the E. coli RecA protein. Specifically we were looking for compensatory mutations, in which a change in one amino-acid in a region was accompanied by a correlated change in another amino-acid in the same region (these were detected using an evolutionary method called character-state reconstruction). Interestingly, in some regions of the crystal structure (e.g., the monomer-monomer contact regions) extensive compensatory mutations could be detected, suggesting that this region of the crystal was conserved between species. In other regions of the crystal (e.g., the filament-filament contact regions), no compensatory mutations could be detected, suggesting either that this region of the structure was not conserved between species or that the filament contact regions were some artifact of crystallization. This was important to show since the mutations Shi-Kau was looking at were suppressors of another recA mutant (recA1202) and the suppressors we found did not make complete sense if the filament-filament contact regions of the crystal reflected perfectly what was going on in-vivo (30).
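To make the idea of correlated changes concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information. This is a simpler, tree-free stand-in and not the character-state reconstruction used in the actual analysis: it ignores the phylogeny, so shared ancestry alone can inflate the scores. The toy alignment, the function names, and the threshold are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_joint = count / n
        p_indep = (freq_a[a] / n) * (freq_b[b] / n)
        mi += p_joint * log2(p_joint / p_indep)
    return mi


def covarying_pairs(alignment, threshold=0.5):
    """List column pairs whose mutual information reaches the threshold."""
    length = len(alignment[0])
    pairs = []
    for i, j in combinations(range(length), 2):
        mi = mutual_information(column(alignment, i), column(alignment, j))
        if mi >= threshold:
            pairs.append((i, j, mi))
    return sorted(pairs, key=lambda item: -item[2])


# Hypothetical four-residue alignment: columns 1 and 2 always change together,
# column 0 never changes, and column 3 varies independently of the others.
toy_alignment = ["AKLD", "AKLE", "ARMD", "ARME"]

for i, j, mi in covarying_pairs(toy_alignment):
    print(f"columns {i} and {j} co-vary (MI = {mi:.2f} bits)")
```

A fully conserved column scores zero against everything, which is the point made in the text: conservation alone says nothing about how changes at one site track changes at another.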

In this way, evolutionary reconstructions helped inform experimental studies in E. coli. While this concept was not necessarily novel, it is important to point out that most molecular sequence comparisons used for structure-function studies both then and now focus on sequence conservation (that is, what is identical or similar between sequences). This does not take full advantage of the evolutionary history of sequences since it does not specifically examine how the sequence conservation came to be (that is, it does not look at the amino-acid changes that occurred, just what is conserved). This made me realize that comparative analysis (identifying what is similar or different between genes or species) was fundamentally different from evolutionary reconstructions (which can identify how and possibly even why the similarities and differences came into being). I should point out that to do the compensatory mutation analysis well requires lots of sequences and this was one of the hidden reasons behind why I have pushed for ten years for people studying the evolutionary relationships among microbes to use recA as a marker as they use rRNA (31).

3. Sniffing around at homologs of repair genes

Shortly after the recA analysis was complete, another problem being addressed in the Hanawalt lab presented an even more powerful test for evolutionary reconstructions. Kevin Sweder, another post-doc in the lab, was working on yeast strains with defects in homologs of human DNA repair genes. It was at this time that many of the human DNA repair genes were being cloned and shown to be members of the helicase superfamily of proteins. Many of these could further be assigned to one particular subfamily within the helicase superfamily – the subfamily that contained the yeast SNF2 protein. Proteins in the SNF2 family could be readily identified because their helicase-like domains were all much more similar to each other than any were to other helicase-domain containing proteins. Yet many scientists, including Kevin, were presented with a problem. As the yeast genome was being completed, blast searches could identify that yeast encoded many proteins in the SNF2 family. However, these same blast searches could not readily identify which yeast gene was the ortholog of which human gene. For those who do not know, homologous genes or proteins come in two primary forms – paralogs, which are genes related by gene duplications (e.g., alpha and beta globin) and orthologs, which are the same form of a gene in different species (e.g., human and mouse alpha-globin). Thus if one wanted to use yeast as a model to study a human disease due to a mutation in a SNF2 homolog, it would be helpful to know which yeast gene was the ortholog of the human gene of interest. Since paralogs are related to each other by duplication events and since duplication events are evolutionary events, I figured that an evolutionary tree of the SNF2 family proteins might help divide the gene family into groups of orthologs.

Indeed, this is exactly what we found – the SNF2 family could be divided into many subfamilies, each of which contained a human and a yeast gene and thus these genes could be considered orthologs of each other. In our analysis we found something even more striking. For every subfamily in the SNF2 family, if the function of more than one member of the subfamily was known (e.g., the human and yeast genes), the function was always conserved. Also, all different subfamilies appeared to have different functions (32). Thus one could predict the function of a gene by the subfamily in which it resided. As with the analysis of RecA, it should be pointed out that the phylogenetic tree-based assignment of genes to subfamilies was more useful than blast searches because blast is simply a way to identify similarity among genes/proteins. The tree allows one to group genes into correct subfamilies even if rates and patterns of evolution have changed over time and are different in different groups. Again, this is a distinction between comparative analysis and evolutionary analysis.
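One common way to turn a gene tree into groups of orthologs is the species-overlap rule: a node whose two child subtrees share a species is treated as a duplication, and the tree is cut there. The sketch below applies that rule to a toy tree. It is an illustration of the general idea, not a reconstruction of the method used in the SNF2 paper, and the tree shape and gene labels are hypothetical.

```python
def is_leaf(node):
    """A leaf is a (species, gene) pair of strings; an internal node is a pair of nodes."""
    return isinstance(node[0], str)


def leaves(node):
    """Return every (species, gene) leaf below a node."""
    if is_leaf(node):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def species_below(node):
    """Return the set of species represented below a node."""
    return {sp for sp, _gene in leaves(node)}


def ortholog_groups(node):
    """Cut the tree at duplication nodes and return the resulting groups of leaves."""
    if is_leaf(node):
        return [[node]]
    left, right = node
    if species_below(left) & species_below(right):
        # The same species sits on both sides, so this split is read as a gene
        # duplication and each side is treated as a separate subfamily.
        return ortholog_groups(left) + ortholog_groups(right)
    # No shared species: read the split as a speciation and keep the subtree whole.
    return [leaves(node)]


# Hypothetical gene family: one ancient duplication, then a human/yeast split
# inside each of the two resulting subfamilies.
toy_tree = (
    (("human", "h_sub1"), ("yeast", "y_sub1")),
    (("human", "h_sub2"), ("yeast", "y_sub2")),
)

for group in ortholog_groups(toy_tree):
    print(sorted(gene for _sp, gene in group))
```

Each printed group pairs one human gene with one yeast gene, which is the kind of answer a similarity search alone does not give directly: every member of the toy family would hit every other member.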

4. A gut feeling leads to the idea of “Phylogenomics”

With the SNF2 analysis as a backdrop, I proceeded to proselytize to anyone who would listen that phylogenetic trees of genes were going to revolutionize genome sequencing projects by allowing one to predict the functions of many unknown genes. Genome sequencing projects of course produce lots of sequence data and little functional information. Although most of the people in the Hanawalt lab (except maybe Phil) could not have cared less about my evolutionary rantings, fortunately for me, one person called my bluff. Rick Myers, a professor in the Stanford Medical School and one of the heads of the Stanford Human Genome Center, was asked to write a News and Views for Nature Medicine about the recent publications of the genomes of E. coli O157:H7 and Helicobacter pylori. So Rick challenged me and said I should try and come up with a real example of how the people who worked on these genomes screwed something up by not doing an evolutionary analysis. Fortunately, it was easy to find an interesting case to study in one of the genomes. In the H. pylori paper, the authors had predicted that the species should have mismatch repair but then reported something quite unusual – the genome encoded a homolog of MutS but did not encode a homolog of MutL. I suppose this should have raised a red flag to them since all species known to have mismatch repair required homologs of both of these proteins for the process. While some species had other bells and whistles (e.g., the use of MutH and Dam in gamma proteobacteria), the use of MutS and MutL was absolutely conserved. An evolutionary tree of the MutS homologs available at the time, including the one in H. pylori, also suggested a red flag should have been raised before predicting that this species possessed mismatch repair.

The MutS family in prokaryotes could be divided into two separate subfamilies, which I called MutS1 and MutS2. All genes known to be involved in mismatch repair were in the MutS1 family. No gene in the MutS2 family had a known function. The H. pylori gene was in the MutS2 family. So this species had no MutL and a MutS homolog in a novel subfamily. To us, this suggested that it would be a bad idea to predict the presence of mismatch repair in this species (33). Later, I showed that there was a general trend – all prokaryotes with just a MutS2-like protein did not have a MutL-homolog, and all species with a MutS1-like protein did (34-36). Experimental work has now shown that the MutS2 of H. pylori is not involved in MMR and that this species apparently does not have any MMR (37). This is important because this apparently causes this species to have an exceptionally high mutation rate, which in turn can affect how one designs vaccines and drugs and diagnostics to target it. It should be pointed out that the role of the MutS2 homologs is not known although they have been knocked out in many species and as of yet none have a role in MMR. Thus predicting function by evolutionary analysis (or more specifically, not incorrectly predicting function) can be of great practical value.
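The reasoning above amounts to a small decision rule: only predict mismatch repair when a MutS1-type homolog and a MutL homolog are both present, and do not count a MutS2-type homolog as evidence. Here is a minimal sketch of that rule. The function name, the labels, and the return strings are assumptions made for illustration; the three example profiles only restate what the essay says about these species.

```python
def predict_mismatch_repair(homologs):
    """Apply the subfamily-aware rule to a set of homolog labels for one genome."""
    has_muts1 = "MutS1" in homologs
    has_mutl = "MutL" in homologs
    if has_muts1 and has_mutl:
        return "predicted"
    if not has_muts1 and not has_mutl:
        # A MutS2 homolog on its own does not change the answer.
        return "not predicted"
    # Only one of the two required partners was found; flag it for a closer look.
    return "unclear"


# Illustrative profiles based on statements in this essay.
profiles = {
    "H. pylori": {"MutS2"},
    "D. radiodurans": {"MutS1", "MutL", "MutS2"},
    "M. tuberculosis": set(),
}

for name, homologs in profiles.items():
    print(f"{name}: mismatch repair {predict_mismatch_repair(homologs)}")
```

A similarity-only annotation would see "a MutS homolog" in H. pylori and stop there; the rule only works because the tree first says which subfamily that homolog belongs to.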

It is from this analysis that I came up with the idea of “Phylogenomics” or the integration of evolutionary reconstructions and genome analysis (34-36). These approaches should be fully integrated because there is a feedback loop between them such that they cannot be done separately. For example, in the studies of MutS and MutL it is necessary to do a genome analysis to identify the presence or absence of homologs of these genes, then an evolutionary analysis to determine which forms of each of the genes are present, then a genome analysis again to determine the number and combination of different forms and then an evolutionary analysis to determine whether and when particular forms were gained and lost over evolutionary time, and so on.

5. Lions and TIGRs and bears

Since leaving Phil’s lab I have been a faculty member at The Institute for Genomic Research (TIGR) and in that time we have found dozens of new uses for a phylogenomic approach and designed many new methods to implement phylogenomics. Such an approach has led to many interesting findings relating to DNA repair. Phylogenetic analysis of eukaryotic genomes has allowed us to identify many nuclear encoded genes that are homologs of DNA repair genes but appear to be evolutionarily derived from the organellar genomes and thus are good candidates for still having a role in DNA repair in the organelles (38). These include both putatively plastid-derived genes (encoding RecA, Mfd, Fpg, RecG, MutS2, Phr, Lon) and mitochondrial-derived genes (encoding RecA, Tag). Interestingly the presence of Mfd but not UvrABCD is also found in many endosymbiotic bacteria, although the explanation for what this Mfd might be doing is unclear. Phylogenomic analysis has allowed us to identify the loss of important DNA repair genes in various species such as the apparent loss of all the genes for non-homologous end joining in the causative agent of malaria, Plasmodium falciparum (39). An important component of this analysis was the finding that this species did not have an ortholog of DNA ligase IV, even though the original annotation of the genome had suggested it did (Figure 1).

Figure 1. Phylogenetic tree of DNA ligase homologs showing the presence of an ortholog of DNA ligase I in Plasmodium falciparum but no ortholog of DNA ligase IV, consistent with the absence of non-homologous end joining.

Among the other interesting repair-related features we have found are: the presence of two MutL homologs in the intracellular bacterium Wolbachia pipientis wMel (40), the presence of two UvrA homologs in Deinococcus radiodurans (41) and Chlorobium tepidum (42), the absence of MutS and MutL from Mycobacterium tuberculosis (43), and the presence of multiple ligases for each chromosome in Agrobacterium tumefaciens (44). Continued surprises come from almost every genome.

However, all is not good in the world of phylogenomics. One of the biggest problems is that most of the experimental studies of DNA repair that have formed the basis of our knowledge in the field have been done in a narrow range of species. For example, there are estimated to be over 100 major divisions of bacteria (Phyla) and of these, most DNA repair studies have been restricted to three of these phyla (Proteobacteria, Firmicutes (also known as lowGC Gram-positives), and Actinobacteria (also known as highGC Gram-positives)). This means that if anything novel evolved in any of the other lineages, we would not know about it. This probably explains why, when we sequenced the genome of the radiation resistant bacterium D. radiodurans, analysis of the homologs of DNA repair genes in the genome did reveal many homologs of known repair genes but this list did not have many features that were unusual compared to non radiation resistant species (Table 1) and thus was not of much use in understanding what makes this species so resistant (41).

Table 1. Homologs of known DNA repair genes identified in the initial analysis of the D. radiodurans genome sequence

Process	Genes in D. radiodurans	Unusual features
Nucleotide Excision Repair	UvrABCD, UvrA2	UvrA2 not found in most species
Base Excision Repair	AlkA, Ung, Ung2, GT, MutM, MutY-Nths, MPG	More MutY-Nths than most species
AP Endonuclease	Xth	-
Mismatch Excision Repair	MutS, MutL	-
Recombination: Initiation	RecFJNRQ, SbcCD, RecD	-
Recombination: Recombinase	RecA	-
Recombination: Migration and resolution	RuvABC, RecG	-
Replication	PolA, PolC, PolX, phage Pol	PolX not in many bacteria
Ligation	DnlJ	-
dNTP pools, cleanup	MutTs, RRase	-
Other	LexA, RadA, HepA, UVDE, MutS2	UVDE not in many bacteria

This of course means that genome sequencing and analysis, even if done in a robust way, only works well if there is a core of experimental studies on which to base the analysis.

In the end, I would like to define a new word – philogenomics which is the combination of studies of evolution, genomics, DNA repair, thymine metabolism, and punning. The ultimate proof of a philogenomic approach, of course, will come when it figures out the mechanism underlying thymineless death. But that is another story.

6. Acknowledgements

I would like to thank Philip C. Hanawalt for his support during and after my Ph.D. research in his lab. Everyone in the field knows he is a great scientist. What they may not all know is that he is an even better human being.

References

[1] Wood, R.D., DNA repair in eukaryotes. Annu Rev Biochem, 1996. 65: p. 135-167.

[2] Wood, R.D., Nucleotide excision repair in mammalian cells. J. Biol. Chem., 1997. 272(38): p. 23465-23468.

[3] Wood, R.D. and M.K. Shivji, Which DNA polymerases are used for DNA-repair in eukaryotes? Carcinogenesis, 1997. 18(4): p. 605-610.

[4] Wood, R.D., et al., Human DNA repair genes. Science, 2001. 291(5507): p. 1284-9.

[5] Kurowski, M.A., et al., Phylogenomic identification of five new human homologs of the DNA repair enzyme AlkB. BMC Genomics, 2003. 4(1): p. 48.

[6] Aravind, L., D.R. Walker, and E.V. Koonin, Conserved domains in DNA repair proteins and evolution of repair systems. Nucleic Acids Res, 1999. 27(5): p. 1223-1242.

[7] Kulaeva, O.I., et al., Identification of a DinB/UmuC homolog in the archeon Sulfolobus solfataricus. Mutat Res, 1996. 357(1-2): p. 245-53.

[8] Gorbalenya, A.E. and E.V. Koonin, Superfamily of UvrA-related NTP-binding proteins. Implications for rational classification of recombination/repair systems. J Mol Biol, 1990. 213(4): p. 583-91.

[9] Gorbalenya, A.E., et al., Two related superfamilies of putative helicases involved in replication, recombination, repair and expression of DNA and RNA genomes. Nucleic Acids Res, 1989. 17(12): p. 4713-4730.

[10] Makarova, K.S., et al., A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res, 2002. 30(2): p. 482-96.

[11] Makarova, K.S., et al., Genome of the extremely radiation-resistant bacterium Deinococcus radiodurans viewed from the perspective of comparative genomics. Microbiol Mol Biol Rev, 2001. 65(1): p. 44-79.

[12] Aravind, L. and E.V. Koonin, Prokaryotic homologs of the eukaryotic DNA-end-binding protein Ku, novel domains in the Ku protein and prediction of a prokaryotic double-strand break repair system. Genome Res, 2001. 11(8): p. 1365-74.

[13] Aravind, L. and E.V. Koonin, The alpha/beta fold uracil DNA glycosylases: a common origin with diverse fates. Genome Biol, 2000. 1(4): p. RESEARCH0007.

[14] Aravind, L., K.S. Makarova, and E.V. Koonin, SURVEY AND SUMMARY: holliday junction resolvases and related nucleases: identification of new families, phyletic distribution and evolutionary trajectories. Nucleic Acids Res, 2000. 28(18): p. 3417-32.

[15] Menck, C.F., Shining a light on photolyases. Nat Genet, 2002. 32(3): p. 338-9.

[16] Simpson, A.J., et al., The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature, 2000. 406(6792): p. 151-7.

[17] Morgante, P.G., et al., Functional XPB/RAD25 redundancy in Arabidopsis genome: characterization of AtXPB2 and expression analysis. Gene, 2005. 344: p. 93-103.

[18] Martins-Pinheiro, M., et al., Different patterns of evolution for duplicated DNA repair genes in bacteria of the Xanthomonadales group. BMC Evol Biol, 2004. 4(1): p. 29.

[19] Estes, S., et al., Mutation accumulation in populations of varying size: the distribution of mutational effects for fitness correlates in Caenorhabditis elegans. Genetics, 2004. 166(3): p. 1269-79.

[20] Denver, D.R., et al., Mutation rates, spectra, and hotspots in mismatch repair-deficient Caenorhabditis elegans. Genetics, 2005.

[21] Denver, D.R., S.L. Swenson, and M. Lynch, An evolutionary analysis of the helix-hairpin-helix superfamily of DNA repair glycosylases. Mol Biol Evol, 2003. 20(10): p. 1603-11.

[22] Forterre, P., Displacement of cellular proteins by functional analogues from plasmids or viruses could explain puzzling phylogenies of many DNA informational proteins. Mol Microbiol, 1999. 33(3): p. 457-65.

[23] Cohen, G.N., et al., An integrated analysis of the genome of the hyperthermophilic archaeon Pyrococcus abyssi. Mol Microbiol, 2003. 47(6): p. 1495-512.

[24] Bouyoub, A., et al., A putative SOS repair gene (dinF-like) in a hyperthermophilic archaeon. Gene, 1995. 167(1-2): p. 147-149.

[25] Moran, N.A. and A. Mira, The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol, 2001. 2(12): p. RESEARCH0054.

[26] Dale, C., et al., Loss of DNA recombinational repair enzymes in the initial stages of genome degeneration. Mol Biol Evol, 2003. 20(8): p. 1188-94.

[27] van Ham, R.C., et al., Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci U S A, 2003. 100(2): p. 581-6.

[28] Moran, N.A. and J.J. Wernegreen, Lifestyle evolution in symbiotic bacteria: insights from genomics. Trends in Ecology and Evolution, 2000. 15(8): p. 321-326.

[29] Moran, N.A., Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc Natl Acad Sci U S A, 1996. 93(7): p. 2873-8.

[30] Liu SK, Eisen JA, Hanawalt PC, Tessman IW. 1993. recA mutations that reduce the constitutive coprotease activity of the RecA1202(PrtC) protein: possible involvement of interfilament association in proteolytic and recombination activities. J. Bacteriol. 175: 6518-6529.

[31] Eisen JA. 1995. The RecA protein as a model molecule for molecular systematic studies of bacteria: comparison of trees of RecAs and 16S rRNAs from the same species. J. Mol. Evol. 41: 1105-1123.

[32] Eisen JA, Sweder KS, Hanawalt PC. 1995. Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions. Nucl. Acids Res. 23: 2715-2723.

[33] Eisen JA, Kaiser D, Myers RM. 1997. Gastrogenomic delights: a movable feast. Nature Medicine 3: 1076-1078.

[34] Eisen JA. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8: 163-167.

[35] Eisen JA. 1998. A phylogenomic study of the MutS family of proteins. Nucl. Acids Res. 26: 4291-4300.

[36] Eisen JA, Hanawalt PC. 1999. A phylogenomic study of DNA repair genes, proteins, and processes. Mut. Res. 435: 171-213.

[37] Bjorkholm B, Sjolund M, Falk PG, Berg OG, Engstrand L, Andersson DI. 2001. Mutation frequency and biological cost of antibiotic resistance in Helicobacter pylori. Proc Natl Acad Sci U S A. 98(25):14607-12.

[38] Britt AB, Eisen JA. 2000. DNA repair and recombination. In 'Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.' Nature 408: 796-815.

[39] Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B. 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498-511.

[40] Wu M, Sun L, Vamathevan J, Riegler M, Deboy R, Brownlie J, McGraw E, Mohamoud Y, Lee P, Berry K, Khouri HM, Paulsen IT, Nelson KE, Martin W, Esser C, Ahmadinejad N, Wiegand C, Durkin AS, Nelson WC, Beanan MJ, Brinkac LM, Daugherty SC, Dodson RJ, Gwinn M, Kolonay JF, Madupu R, Craven MB, Utterback T, Weidman J, Nierman WC, Aken SV, Tettelin H, O’Neill S, Eisen JA. 2004. Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined genome massively infected with mobile genetic elements. PLOS Biology 2: 327-341.

[41] White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ, Haft DH, Gwinn ML, Nelson WC, Richardson DL, Moffat KS, Qin H, Jiang L, Pamphile W, Crosby M, Shen M, Vamathevan JJ, Lam P, McDonald L, Utterback T, Zalewski C, Makarova KS, Aravind L, Daly MJ, Minton KW, Fleischmann RD, Ketchum KA, Nelson KE, Salzberg SL, Smith HO, Venter JC, Fraser CM. 1999. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 286: 1571-1577.

[42] Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, Dodson RJ, Deboy R, Gwinn ML, Nelson WC, Haft DH, Hickey EK, Peterson JD, Durkin AS, Kolonay JL, Yang F, Holt I, Umayam LA, Mason T, Brenner M, Shea TP, Parksey D, Nierman WC, Feldblyum TV, Hansen CL, Craven MB, Radune D, Vamathevan J, Khouri H, White O, Venter JC, Gruber TM, Ketchum KA, Tettelin H, Bryant DA, Fraser CM. 2002. The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc. Natl. Acad. Sci. USA 99: 9509-9514.

[43] Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, DeBoy R, Dodson R, Gwinn M, Haft D, Hickey E, Kolonay JF, Nelson WC, Umayam LA, Ermolaeva M, Salzberg SL, Delcher A, Utterback T, Weidman J, Khouri H, Gill J, Mikula A, Bishai W, Jacobs WR Jr, Venter JC, Fraser CM. 2002. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol. 184: 5479-5490.

[44] Wood DW, Setubal JC, Kaul R, Monks DE, Kitajima JP, Okura VK, Zhou Y, Chen L, Wood GE, Almeida Jr. NF, Woo L, Chen Y, Paulsen IT, Eisen JA, Karp PD, Bovee Sr. D, Chapman P, Clendenning J, Deatherage G, Gillet W, Grant C, Kutyavin T, Levy R, Li M-J, McClelland R, Palmieri A, Raymond C, Rouse G, Saenphimmachak C, Wu Z, Romero P, Gordon D, Zhang S, Yoo H, Tao Y, Biddle P, Jung M, Krespan W, Perry M, Gordon-Kamm B, Liao L, Kim S, Hendrick C, Zhao Z-Y, Dolan M, Chumley F, Tingey SC, Tomb J-F, Gordon MP, Olson MV, Nester EW. 2001. The genome of the natural genetic engineer Agrobacterium tumefaciens C58. Science 294: 2317-2323.

Thursday, March 15, 2012

Elaine Mardis rocks: nice talk on "Next generation sequencing"

I wish I had seen this before I gave my first lecture on Next Gen Sequencing Methods on Monday. I will post mine later but here is a really really nice talk by Elaine Mardis from Washington University on the same topic:

Tuesday, February 21, 2012

Slideshow w/ audio of my talk on "A Field Guide to the Microbes" from the AAAS Meeting #AAASMtg

I recorded the audio of my talk on "Towards a field guide to the microbes" from the AAAS meeting on Saturday AM. Here is a slideshow of the talk with audio synched to the slides (I did this using Keynote on a Mac with the "record Slideshow" function).

My slides from the talk are available at Slideshare.

Thursday, January 12, 2012

Draft post cleanup #14: Video of Talk of mine from 2005

Yet another post in my "draft blog post cleanup" series. Here is #14 from July 2010.

Embedded here is a video of a talk I gave in 2005 at the NIH entitled "More Questions Than Answers Insights into DNA Repair Processes from Genome Sequencing Projects"

Wednesday, January 11, 2012

Another seminar at #UCDavis 1/11 - Evan Eichler - #TooManyChoices

Well, this is going to be awkward. I really really want to hear this upcoming talk by Evan Eichler. But alas, Jane Lubchenco - head of NOAA - is talking at the same time. And sorry Evan, but Jane wins - this time (never heard her speak before).

UC Davis M.I.N.D. Institute's 2011-2012 Distinguished Lecturer Series

SPEAKER: Evan E. Eichler, Ph.D.
TOPIC: Copy Number Variation, Exome Sequencing and Autism
DATE: Wednesday, January 11, 2012
TIME: 4:30 pm - 6:00 pm
LOCATION: M.I.N.D. Institute Auditorium (2825 50th Street, Sacramento)

Biographical / Presentation Information (attached and pasted below):

Evan E. Eichler, Ph.D., is a Professor and Howard Hughes Medical Institute Investigator in the Department of Genome Sciences, University of Washington School of Medicine. Dr. Eichler is a leader in an effort to identify and sequence normal and disease-causing structural variation in the human genome. His research group provided the first genome-wide view of segmental duplications within human and other primate genomes. The long-term goal of his research is to understand the evolution and mechanisms of recent gene duplication and its relationship to copy number variation and human disease. A graduate in biology of the University of Saskatchewan, Canada, he received his Ph.D. in 1995 from the Department of Molecular and Human Genetics at Baylor College of Medicine, Houston. After a Hollaender post-doctoral fellowship at Lawrence Livermore National Laboratory, he joined the faculty of Case Western Reserve University in 1997 and later the Department of Genome Sciences at the University of Washington in 2004. He was a March of Dimes Basil O’Connor Scholar (1998-2001), appointed as an HHMI Investigator (2005), and awarded an AAAS Fellowship (2006) and the American Society of Human Genetics Curt Stern Award (2008). He is an editor of Genome Research and has served on various scientific advisory boards for both NIH and NSF.

Copy Number Variation, Exome Sequencing and Autism.
It has become apparent that genetic structural variation contributes significantly to both neurocognitive and neuropsychiatric disease. I will present a detailed study of the genomes of children with developmental delay compared to adult controls and show that as much as 14% of pediatric disease, including autism, epilepsy and intellectual disability, is caused by deletions and duplications of large segments of the genome involving multiple genes. These mutations can be either inherited or found in the parents of children depending on the size of the event. I will present evidence from exome sequencing of over 200 parent-child trios with sporadic autism and show how these data may be used to pinpoint novel genes underlying CNV (copy number variation) burden, as well as provide insight into new pathways. We find that some of the same disease-causing mutations can manifest very differently and in particular be more severe if they occur on a background of other compounding mutations. We predict that the overall burden of rare and private severe mutations will correlate with different outcomes ranging from autism, intellectual disability and epilepsy. We propose that the early development of the brain is particularly sensitive to the timing and expression of many different genes and that multiple genetic perturbations within specific pathways can lead to disease with varying severity.

Wednesday, January 4, 2012

Draft post cleanup #6 from 2005: Hydrogen producing microbe mea culpa

Yet another post in my "draft blog post cleanup" series. Here is #6. From 2005. (Yes, the bottom of my draft list). In fact, this would have been my second blog post if I had posted it ...

I had written

OK, so a few months ago we published a paper on a hydrogen producing microbe and issued a press release. I think the paper we published was pretty cool - lots of interesting science.

Then we (me and our public affairs person) wrote a press release about the project. We were fortunate enough to have the press release picked up by all sorts of bloggers and web commentary groups. Examples include Softpedia (article here) and probably most importantly Slashdot.

So - what was wrong? Well, I was starting to get more and more jaded with bad press releases about science papers. And I felt ours had at least one really lame part - my quote

So if you're interested in making clean fuels, this microbe makes an excellent starting point.

Well, WTF? I have never done anything with biofuels and I really knew nothing about them then. That quote should never have been in the press release and I am not sure I even said it.

Other parts of the PR are OK I think but I wish that quote had never been in there. I note - I do like the end though

"What we want to have is a field guide for these microbes, like those available for birds and mammals," Eisen says. "Right now, we can't even answer simple questions. Do similar hot springs, a world apart, share similar microbes? How do microbes move between hot springs? Our new work will help us find out."

I agree with that. I have indeed been obsessed with a Field Guide to the Microbes for a long time ...

Friday, December 23, 2011

Reminder - Monthly Omics Office Hours at #UCDavis Genome Center - Schedule

For those at UC Davis interested in learning a bit about various omics issues - this may be of interest:

Omics Office Hour 2012 — UC Davis Genome Center

Email from the responsible parties:

The UC Davis Genome Center holds an Omics Office Hour from 9:00-10:00am each month in Room 3209 of the Medical Education building in Sacramento. These drop-in sessions are open to anyone in the SOM community with questions regarding Genomics, Epigenomics and Gene Expression, Proteomics, Metabolomics, Network Biology and Bioinformatics.

The mission of the Genome Center is to facilitate your "omics" research at UC Davis. Genome Center staff and faculty will be on hand for consultation in a friendly, informal setting. If you have ideas that you would like to explore, we would be happy to discuss it as well as the possibility of pilot grants.

The next session will be Friday, January 6, 9:00 am in Room 3209 of Med Edu Bldg.

NOTE: THE DECEMBER 23, 2011 MEETING HAS BEEN CANCELED!!!!!

For more details, please link to:
http://www.genomecenter.ucdavis.edu/outreach-and-giving/omics-office-hour-2012

The schedule is also available as a Google Calendar called "'Omics Office Hours". For anyone who wants to subscribe to the calendar, here are instructions:

For Google Calendars:
1 - go to Google Calendar
2 - under "Other calendars" click Add/Add by URL
3 - paste the iCal link shown below into the box (https://www.google.com/calendar/ical/o6rt68uree1205hictul75m614%40group.calendar.google.com/public/basic.ics)
4 - click Add Calendar
5 - DONE

For iCal:
1 - just click on the link below (might require some advanced Mac skills)
- or -
1 - open iCal
2 - in the menu select Calendar/Subscribe
3 - paste the iCal link shown below into the box (https://www.google.com/calendar/ical/o6rt68uree1205hictul75m614%40group.calendar.google.com/public/basic.ics)
4 - click Subscribe
5 - DONE

Thursday, December 15, 2011

Very nice new #PLoSGenetics paper on "Functional Phylogenomics" of Seed Plants

Update2 - 12/22 - Data available here. Thanks to the authors for clearing things up quickly.

Update1 - 12/19 - Data for this paper seems to be unavailable - not sure why - but looking into this after a TWEET from Karen Cranston. The paper says data is available at: http://nypg.bio.nyu.edu/main/ but I could not find any there. Note - this is one reason that all data sets should be made available at the journal or third party sites.

Original post:

OK never mind that the terminology of "functional phylogenomics" is a tiny bit vexing to me (long story - some other time perhaps). The paper behind it - PLoS Genetics: A Functional Phylogenomic View of the Seed Plants is very cool.

Here's what the authors did (a very coarse summary)

1. Identified sets of orthologs between plant species using the OrthologID system (which has a phylogenetic underpinning) (the data input for this appeared to have mostly been Unigene EST clusters)

2. Constructed a "total evidence" phylogeny for these taxa (using a few approaches)

3. Used this phylogeny to reinterpret some general features of the evolution of plants

4. Searched for gene ontology categories (in annotated genes from these organisms) that agreed with the phylogeny. In essence, this seems to be a search for shared-derived traits (i.e., synapomorphies) in particular clades.

5. Generated hypotheses about functional evolution in particular clades.
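
Step 4 can be sketched in a few lines of Python. This is only a rough illustration and not the authors' actual method: it assumes each clade is given as a set of taxon names and each gene ontology category as the set of taxa in which it is annotated, and it reports categories whose distribution exactly matches a clade (a candidate synapomorphy). The function name, taxon names and category labels are all made up for the example.

```python
def clade_specific_categories(clades, category_taxa):
    """Return {clade_name: [categories]} for categories annotated in every
    member of a clade and in no taxon outside it."""
    hits = {}
    for clade_name, members in clades.items():
        members = frozenset(members)
        matched = sorted(
            cat for cat, taxa in category_taxa.items()
            if frozenset(taxa) == members
        )
        if matched:
            hits[clade_name] = matched
    return hits


# Toy input: hypothetical taxa and category labels, not data from the paper.
clades = {
    "cladeA": {"taxon1", "taxon2"},
    "cladeB": {"taxon3", "taxon4", "taxon5"},
}
category_taxa = {
    "GO:example1": {"taxon1", "taxon2"},            # matches cladeA exactly
    "GO:example2": {"taxon3", "taxon4"},            # missing from taxon5
    "GO:example3": {"taxon3", "taxon4", "taxon5"},  # matches cladeB exactly
    "GO:example4": {"taxon1", "taxon3"},            # scattered across clades
}

result = clade_specific_categories(clades, category_taxa)
print(result)
```

A real analysis would presumably need to tolerate missing annotations and secondary losses rather than demand an exact match, so treat the strict equality test here as a simplifying assumption.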

Overall, there is a lot that is really fascinating in here and this approach seems very powerful (though I note - I think something akin to this, though not as comprehensive or as careful, has been done for other groups, but I am not sure). Check out the paper for more detail ...

Lee EK, Cibrian-Jaramillo A, Kolokotronis S-O, Katari MS, Stamatakis A, et al. (2011) A Functional Phylogenomic View of the Seed Plants. PLoS Genet 7(12): e1002411. doi:10.1371/journal.pgen.1002411

Wednesday, December 7, 2011

Have a bite while talking about bits & bytes #UCDavis

Just found out about this ...

Bits & Bites lunch club at UC Davis

"Bits & bites is a new lunch club that aims to meet once a week at UC Davis and talk about various aspects of sequence analysis. The idea is to gather together people in a very informal environment and share expertise on various subjects relating to bioinformatics and genomics."

More detail from the site:

The plan will be to meet on Thursdays between 12:00 and 1:00 at various venues on the UC Campus, possibly including the Genome Center, and Life Sciences Addition – as well as possible forays into Davis. Occasionally – maybe once a month – we would try to host an invited speaker to give deeper insights into a specific topic.

To find out more details please join the bits & bites mailing list (a low traffic list which will mostly be used to announce the venue and discussion topics each week).

Sounds good to me.

#UCDavis Genome Center Omics Office Hours

The UC Davis Genome Center will be holding an Omics Office Hour from 9:00-10:00am each month in Room 5206 GBSF on the Davis campus. These drop-in sessions are open to anyone with questions regarding Genomics, Epigenomics and Gene Expression, Proteomics, Metabolomics, Network Biology and Bioinformatics.

The mission of the Genome Center is to facilitate your "omics" research at UC Davis. Genome Center staff and faculty will be on hand for consultation in a friendly, informal setting. If you have ideas that you would like to explore, we would be happy to discuss it as well as the possibility of pilot grants.

The next session will be Friday, December 9.

Tuesday, November 29, 2011

What's Hot in Biology 2011? Why, the Genomic Encyclopedia paper I am senior author on #Yay?

And now back to some science. Got an email a few days ago from Nikos Kyrpides pointing to this: What's Hot in Biology - 2011. Very cool - the paper on the "Genomic Encyclopedia of Bacteria and Archaea" project that I coordinated (and for which I am the senior author) has been identified as the hot biology paper of November/December 2011 by "Science Watch". Plus they have a reasonably detailed story about it "BRANCHING OUT WITH PHYLOGENETICALLY DRIVEN GENOME SEQUENCING" by Jeremy Cherfas. I note - the project was done at the DOE Joint Genome Institute and involved an enormous number of people there (I have an Adjunct Appointment there). It was done in collaboration with the DSMZ - a microbial culture collection in Germany.

The paper A phylogeny driven genomic encyclopedia of bacteria and archaea apparently has been getting a lot of citations, which I guess is how it got picked as being "hot".

If you want to know more about this project and paper see the following links:

Story Behind the Nature Paper on 'A phylogeny driven genomic encyclopedia of bacteria & archaea' #genomics #evolution
More coverage of the GEBA "Phylogeny Driven Genomic Encyclopedia"
Scientists Start a Genomic Catalog of Earth’s Abundant Microbes (article by Carl Zimmer in the New York Times)
Presenting a genomic encyclopedia of bacteria (and archaea) (article by John Timmer)

Some videos of talks or interviews about the project

Talk at DOE JGI User Meeting 2009

Talk at GME Meeting 2008

JGI Video about the project

Friday, October 7, 2011

The story behind Pseudomonas syringae comparative genomics / pathogenicity paper; guest post by David Baltrus (@surt_lab)

More fun from the community. Today I am very happy to have another guest post in my "Story behind the paper" series. This one comes to us from David Baltrus, an Assistant Professor at University of Arizona. For more on David see his lab page here and his twitter feed here. David has a very nice post here about a paper on the "Dynamic evolution of pathogenicity revealed by sequencing and comparative genomics of 19 Pseudomonas syringae isolates" which was published in PLoS Pathogens in July. There is some fun/interesting stuff in the paper, including analysis of the "core" and "pan" genome of this species. Anyway - David saw my request for posts and I am very happy that he responded. Without further ado - here is his story (I note - I added a few links and Italics but otherwise he wrote the whole thing ...).

---------------------------------------

I first want to thank Jonathan for giving me this opportunity. I am a big fan of “behind the science” stories, a habit I fed in grad school by reading every Perspectives (from the journal Genetics) article that I could get a hold of. Science can be rough, but I remember finding solace in stories about the false starts and triumphs of other researchers and how randomness and luck manage to figure into any discovery. If anything I hope to use this space to document this as it is fresh in my mind so that (inevitably) when the bad science days roll around I can have something to look back on. At the very least, I'm looking forward to mining this space in the future for quotes to prove just how little I truly understood about my research topics in 2011. It took a village to get this paper published, so apologies in advance to those that I fail to mention. I also want to mention this upfront: Marc Nishimura is my co-author and had a hand in every single aspect of this paper.

Joining the Dangl Lab

This project really started way back in 2006, when I interviewed for a postdoc with Jeff Dangl at UNC Chapel Hill. In grad school I had focused on understanding microbial evolution and genetics but I figured that the best use of my postdoc would be to learn and understand genomics and bioinformatics. I was just about to finish up my PhD and was lucky enough to have some choices when it came around to choosing what to do next. I actually had no clue about Dangl’s research until stumbling across one of his papers in Genetics, which gave me the impression that he was interested in bringing an evolutionary approach to studies of the plant pathogen Pseudomonas syringae. I was interested in plant pathogens because, while I wanted to study host/pathogen evolution, my grad school projects on Helicobacter pylori showed me just how much fun it is dealing with the bureaucracy of handling human pathogens. There is extensive overlap in the mechanisms of pathogenesis between plant and human pathogens, but no one really cares how many Arabidopsis plants you infect or if you dispose of them humanely (so long as the transgenes remain out of nature!). By the time I interviewed with Jeff I was leaning towards joining a different lab, but the visit to Chapel Hill went very well and by the end I was primed for Dangl’s sales pitch. This went something along the lines of “look, you can go join another lab and do excellent work that would be the same kinds of things that you did in grad school...or you can come here and be challenged by jumping into the unknown”. How can you turn that down? Jeff sold me on continuing a project started by Jeff Chang (now a PI at Oregon State), on categorizing the diversity of virulence proteins (type III effector proteins to be exact) that were translocated into hosts by the plant pathogen Pseudomonas syringae. 
Type III effectors are one of the main determinants of virulence in numerous gram negative plant and animal pathogens and are translocated into host cells to ultimately disrupt immune functions (I'm simplifying a lot here). Chang had already created genomic libraries and had screened through random genomic fragments of numerous P. syringae genomes to identify all of the type III effectors within 8 or so phylogenetically diverse strains. The hope was that they would find a bunch of new effectors by screening strains from different hosts. Although this method worked well for IDing potential effectors, I was under the impression that it was going to be difficult to place and verify these effectors without more genomic information. I was therefore brought in to figure out a way to sequence numerous P. syringae genomes without burning through a Scrooge McDuckian money bin worth of grant money. We had a thought that some type of grand pattern would emerge after pooling all this data but really we were taking a shot in the dark.

Tomato leaves after 10 days of infection by the tomato pathogen P. syringae DC3000 (left) as well as a less virulent strain (right). Disease symptoms are dependent on a type III secretion system.

Moments of Randomness that Shape Science

When I actually started the postdoc, next generation sequencing technologies were just beginning to take off. It was becoming routine to use 454 sequencing to generate bacterial genome sequences, although Sanger sequencing was still necessary to close these genomes. Dangl had it in his mind that there had to be a way to capitalize on the developing Solexa (later Illumina) technology in order to sequence P. syringae genomes. There were a couple of strokes of luck here that conspired to make this project completely worthwhile. I arrived at UNC about a year before the UNC Genome Analysis core facility came online. Sequencing runs during the early years of this core facility were subsidized by UNC, so we were able to sequence many Illumina libraries very cheaply. This gave us the opportunity to play around with sequencing options at low cost, so we could explore parameter space and find the best sequencing strategy. This also meant that I was able to learn the ins and outs of making libraries at the same time as those working in the core facility (Piotr Mieczkowski was a tremendous resource). Secondly, I started this postdoc without knowing a lick of UNIX or perl and knew that I was going to have to learn these if I had any hope of assembling and analyzing genomes. I was very lucky to have Corbin Jones and his lab 3 floors above me in the same building to help work through my kindergarten level programming skills. Corbin was really instrumental to all of these projects as well as in keeping me sane and I doubt that these projects would have turned out anywhere near as well without him. Lastly, plant pathogens in general, and P. syringae in particular, were poised to greatly benefit from next generation sequencing in 2006. 
While there was ample funding to completely sequence (close) genomes for numerous human pathogens, lower funding opportunities for plant pathogens meant that we were forced to be more creative if we were going to pull off sequencing a variety of P. syringae strains. This pushed us into trying a NGS approach in the first place. I suspect that it’s no coincidence that, independently of our group, the NGS assembler Velvet was first utilized for assembling P. syringae isolates.

The Frustrations of Library Making

Through a collaboration with Elaine Mardis’s group at Washington University St. Louis, we got some initial data back that suggested it would be difficult to make sense of bacterial genomes at that time using only Illumina (the paired end kits weren’t released until later). There simply wasn’t good enough coverage of the genome to create quality assemblies with the assemblers available at this time (SSAKE and VCAKE, our own (really Will Jeck’s) take on SSAKE). Therefore we decided to try a hybrid approach, combining low coverage 454 runs (initially separate GS Flex runs with regular reads and paired ends, and later one run with long paired ends) with Illumina reads to fill in the gaps and leveraging this data to correct for any biases inherent in the different sequencing technologies. Since there was no core facility at UNC when I started making libraries, I had to travel around in order to find the necessary equipment. The closest place that I could find a machine to precisely shear DNA was Fred Dietrich’s lab at Duke. More than a handful of mornings were spent riding a TTA bus from UNC to Duke, with a cooler full of genomic DNA on dry ice (most times having to explain to the bus drivers how I wasn’t hauling anything dangerous), spending a couple of hours on Fred’s hydroshear, then returning to UNC hoping that everything worked well. There really is no feeling like spending a half a day travelling/shearing only to find out that the genomic DNA ended up the wrong size. We were actually planning to sequence one more strain of P. syringae, and already had Illumina data, but left this one out because we filled two plates of 454 sequencing and didn’t have room for a ninth strain. In the end there were two very closely related strains (P.syringae aptata or P. syringae atrofaciens) left to make libraries for and the aptata genome sheared better on the last trip than atrofaciens. 
If you’ve ever wondered why researchers pick certain strains to analyze, know that sometimes it just comes down to which strain worked first. Sometimes there were problems even when the DNA was processed correctly. I initially had trouble making the 454 libraries correctly in that, although I would follow the protocol exactly, I would lose the DNA somewhere before the final step. I was able to trace down the problem to using an old (I have no clue when the Dangl lab bought it, but it looked as useable as salmon sperm ever does) bottle of salmon sperm DNA during library prep. There were also a couple of times that I successfully constructed Illumina libraries only to have the sequencing runs dominated by few actual sequences. These problems ultimately stemmed from trying to use homebrew kits (I think) for constructing Illumina libraries. Once these problems were resolved, Josie Reinhardt managed to pull everything together and create a pipeline for hybrid genome assembly and we published our first hybrid genome assembly in Genome Research. At that moment it was a thrill that we could actually assemble a genome for such a low cost. It definitely wasn’t a completely sequenced genome, but it was enough to make calls about the presence or absence of genes.

Waiting for the story to Emerge

There are multiple ways to perform research. We are all taught about how important it is to define testable hypotheses and to set up appropriate experiments to falsify these educated guesses. Lately, thanks to the age of genomics, it has become easier and feasible to accumulate as much genomic data as possible and find stories within that data. We took this approach with the Pseudomonas syringae genome sequences because we knew that there was going to be a wealth of information, and it was just a matter of what to focus on. Starting my postdoc I was optimistic that our sampling scheme would allow us to test questions about how host range evolves within plant pathogens (and conversely, identify the genes that control host range) because the strains we were going to sequence were all isolated from a variety of diseased hosts. My naive viewpoint was that we were going to be able to categorize virulence genes across all these strains, compare suites of virulence genes from strains that were pathogens of different hosts, and voila...we would understand host range evolution. The more I started reading about plant pathology the more I became convinced that this approach was limited. The biggest problem is that, unlike some pathogens, P. syringae can persist in a variety of environments, with strains able to survive or flourish on a variety of hosts. Sure we had strains that were known pathogens of certain host plants, but you can’t just assume that these are the only relevant hosts. Subjective definitions are not your friend when wading into the waters of genomic comparisons.

We were quite surprised that, although type III effectors are gained and lost rapidly across P. syringae and our sequenced strains were isolated from diverse hosts, we only managed to identify a handful of new effector families. I should also mention here that Artur Romanchuk came on board and did an extensive amount of work analyzing gene repertoires across strains. A couple of nice stories did ultimately emerge by comparing gene sequences across strains and matching these up with virulence in planta (we are able to show how mutation and recombination altered two different virulence genes across strains), but my two favorite stories from this paper came about from my habit of persistently staring at genome sequences and annotations. As I said above, a major goal of this paper was to categorize the suites of a particular type of virulence gene (type III effectors) across P. syringae. I was staring at gene repertoires across strains when I noticed that two of the strains had very few of these effectors (10 or so) compared to most of the other strains (20-30). When I plotted total numbers of effectors across strains, a phylogenetic pattern arose where genomes from a subset of closely related P. syringae strains possessed lower numbers of effectors. I then got the idea to survey for other classes of virulence genes, and sure enough, strains with the lowest numbers of effectors all shared pathways for the production of well characterized toxins (non-ribosomal peptide synthase (NRPS) toxins are secreted out of P. syringae cells and are virulence factors, but are not translocated through the type III secretion system). One exception did arise across this handful of strains (a pea pathogen isolate from pathovar pisi) in that this strain has lost each of these conserved toxin pathways and also contains the highest number of effectors within this phylogenetic group. 
The relationship between effector number and toxin presence remains a correlation at the present time, but I’m excited to be able to try and figure out what this means in my own lab.

Modified Figure 3 from the paper. Strain names are listed on the left and are color coded for phylogenetic similarity. Blue boxes indicate that the virulence gene/toxin pathway is present, green boxes indicate that the pathway is likely present but the sequence was truncated or incomplete, and white boxes indicate absence. I have circled the group II strains, which have the lowest numbers of type III effectors while also having two conserved toxin pathways (syringomycin and syringolin). Note that the Pisi strain (Ppi R6) lacks these toxin pathways.

The other story was a complete stroke of luck. P. syringae genomes are typically 6Mb (6 million base pairs) in size, but one strain that we sequenced (a cucumber pathogen) contained an extra 1Mb of sequence. Moreover, the two largest assembled contigs from this strain were full of genes that weren’t present in any other P. syringae strain. After some similarity comparisons, I learned that there was a small bit of overlap between these two contigs and performed PCR to confirm this. Then, on a hunch, I designed primers facing out of each end of the joined contig and was able to confirm that this extra 1Mb of sequence was circular in conformation and likely separate from the chromosome. I got a bit lucky here because there was only a small bit (500bp or so) of sequence that was not assembled with either of these two contigs and that closed the circle (a lot more and I wouldn’t have gotten the PCR to work at all). We quickly obtained 3 other closely related strains and were able to show that only a subset of strains contain this extra 1Mb and that it doesn’t appear to be directly involved in virulence on cucumber. So it turns out that a small number (2 so far) of P. syringae strains have acquired an extra 1Mb of DNA, and we don’t quite know what any of these ~700 extra genes do. There are no obvious pathways present aside from additional chromosomal maintenance genes, extra tRNAs in the same ratio as the chromosomal copies, and a couple of secretion systems. So somehow we managed to randomly pick the right strain to capture a very recent event that increased the genome size of this one strain by 15% or so. We’ve made some headway on this megaplasmid story since I started my lab, but I’ll save that for future blog posts.

Modified Figure S12 from the paper. Strains that contain the 1Mb megaplasmid (Pla7512 and Pla107) are slightly less virulent during growth in cucumber than strains lacking the megaplasmid (PlaYM8003, PlaYM7902). This growth defect is also measurable in vitro. In case you are wondering, I used blue and yellow because those were the colors of my undergrad university, the University of Delaware.

Reviewer Critiques

We finally managed to get this manuscript written up by the summer of 2010 and submitted it to PLoS Biology. I figured that (as always) it would take a bit of work to address the reviewers’ critiques, but that we would nonetheless be able to publish without great difficulty. I was at a conference on P. syringae at Oxford in August of 2010 when I got the reviews back and learned that our paper had been rejected. Everyone has stories about reviewer comments, so I’d like to share one of my own favorites thus far. I don’t think it ever gets easier to read reviews when your paper has been rejected, but I was knocked back by the main critique of one reviewer:

“I realize that the investigators might not typically work in the field of bacterial genomics, but when looking at divergent strains (as opposed to resequencing to uncover SNPs among strains) it is really necessary to have complete, not draft, genomes. I realize that this might sound like a lot to ask, but if they look at comparisons of, for example, bacterial core and pan-genomes, such as the other paper on this that they cite (and numerous other examples exist), they are based on complete genome sequences. If this group does not wish to come up to the standards applied to even the most conventional bacterial genomics paper, it is their prerogative; however, they should be aware of the expectations of researchers in this field.”

So this reviewer was basically asking us to spend an extra 50k to finish the genomes for these strains before they were scientifically useful. Although I do understand the point, this paper was never about getting things perfect but about demonstrating what is possible with draft genomes. I have to admit that I took the part about working in the field of bacterial genomics a bit personally (c'mon, that's harsh), but I got over that feeling by downing a few pints in Oxford with other researchers who (judging by their research and interest in NGS) also failed to grasp the importance of spending time and money to close P. syringae genomes. We managed to rewrite this paper to address most of the other reviewers’ critiques and were finally able to submit to PLoS Pathogens.

Baltrus DA, Nishimura TM, Reinhardt JA, Romanchuk A, Chang JH, Mukhtar MS, Cherkis K, Roach J, Grant SR, Jones CD, Dangl JL “Dynamic evolution of pathogenicity revealed by sequencing and comparative genomics of 19 Pseudomonas syringae isolates” PLoS Pathogens 7(7):e1002132

Baltrus Lab Website

Dangl Lab Website

Jones Lab Website