Tattoo World life: software


Friday, February 10, 2012

New publication from members of my lab (e.g., @ryneches) & lab of Marc Facciotti on ChIP-seq based mapping of archaeal transcription factors

New publication from members of my lab and the lab of Marc Facciotti on a workflow for ChIP-seq based mapping of archaeal transcription factors. The paper includes a description of new software from Russell Neches in my lab called pique for peak calling.

See: A workflow for genome-wide mapping of archaeal transcription factors with ChIP-seq

Russell's pique software is available on github here: https://github.com/ryneches/pique.

Figure 3.

The Pique software package processes ChIP-seq coverage data to predict protein-binding sites. Strand-specific coverage data are output as tracks for the Gaggle Genome Browser, and putative-binding sites (peaks) are output as ‘bookmark files’. (A) Screenshot of data browsing in the Gaggle Genome Browser. Green box outlines the navigation window for clicking through bookmarks of predicted binding sites. Details of each site can be displayed (inset). The Gaggle toolbar (shown with black arrow) can be used to broadcast selected data to other ‘geese’ in the gaggle package, programs such as R, cytoscape, BLAST or KEGG. (B) Schematic overview of bioinformatics workflow.

Wilbanks, E., Larsen, D., Neches, R., Yao, A., Wu, C., Kjolby, R., & Facciotti, M. (2012). A workflow for genome-wide mapping of archaeal transcription factors with ChIP-seq. Nucleic Acids Research. DOI: 10.1093/nar/gks063

Tuesday, February 7, 2012

New openaccess paper from my lab on "Zorro" software for automated masking of sequence alignments

A new Open Access paper from my lab was just published in PLoS One: Accounting For Alignment Uncertainty in Phylogenomics. Wu M, Chatterji S, Eisen JA (2012) Accounting For Alignment Uncertainty in Phylogenomics. PLoS ONE 7(1): e30288. doi:10.1371/journal.pone.0030288

The paper describes the software "Zorro" which is used for automated "masking" of sequence alignments. Basically, if you have a multiple sequence alignment you would like to use to infer a phylogenetic tree, in some cases it is desirable to block out regions of the alignment that are not reliable. This blocking is called "masking."

Masking is thought by many to be important because a sequence alignment is in essence a hypothesis about the common ancestry of specific residues in different genes/proteins/regions of the genome. This "positional homology" is not always easy to assign, and where it is ambiguous it may be better to ignore those regions when inferring phylogenetic trees from alignments.

Historically, masking has been done by hand/eye: looking for columns in a multiple sequence alignment that seem to have issues and then either eliminating those columns or giving them a lower weight in the phylogenetic analysis.

Zorro removes much of the subjectivity of this process by generating automated masking patterns for sequence alignments. It does this by assigning a confidence score to each column in a multiple sequence alignment. These scores can then be used to account for alignment accuracy in phylogenetic inference pipelines.
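To make the idea of masking from per-column scores concrete, here is a minimal sketch in Python. It is not Zorro's algorithm and does not read Zorro's output format. It just assumes you already have one confidence score per alignment column from somewhere. The function name `mask_alignment`, the toy sequences, the scores, and the threshold are all made up for illustration.

```python
def mask_alignment(alignment, column_scores, threshold):
    """Drop every alignment column whose confidence score is below threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    column_scores: one score per column (hypothetical values, arbitrary scale).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (ncols,) = lengths
    if len(column_scores) != ncols:
        raise ValueError("need exactly one score per alignment column")
    # Indices of the columns we trust enough to keep.
    keep = [i for i, score in enumerate(column_scores) if score >= threshold]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment with invented scores; columns 3 and 6 score poorly.
alignment = {
    "taxonA": "MK-LVT",
    "taxonB": "MKQLVS",
    "taxonC": "MR-LIT",
}
scores = [9.0, 8.0, 1.5, 9.5, 7.0, 3.0]
masked = mask_alignment(alignment, scores, threshold=5.0)
print(masked)  # {'taxonA': 'MKLV', 'taxonB': 'MKLV', 'taxonC': 'MRLI'}
```

A hard cutoff like this corresponds to eliminating columns. The other option mentioned above, down-weighting, would instead pass the scores along as per-column weights to whatever tree inference program supports them.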

The software is available at Sourceforge: ZORRO – probabilistic masking for phylogenetics. It was written primarily by Martin Wu (who is now a Professor at the University of Virginia) and Sourav Chatterji with a little help here and there from Aaron Darling I think. The development of Zorro was part of my "iSEEM" project that was supported by the Gordon and Betty Moore Foundation.

In the interest of sharing, since the paper is fully open access, ~~I am posting it here below the fold~~. UPDATE 2/9 - decided to remove this since it got in the way of getting to the comments ...

Sunday, January 29, 2012

One old, one new - a few phylogeny papers worth checking out

Just a quick one here. A few days ago in my lab we were discussing some challenges with doing phylogenetic diversity (PD) measurements in very very large phylogenetic trees. PD is a measure of total branch length in a phylogenetic tree for a group of taxa ... and we use it for many purposes.
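For anyone who has not run into PD before, here is a small Python sketch of the calculation on a toy tree. It assumes the rooted convention (sum every branch on the paths from the chosen taxa up to the root); unrooted variants count only the subtree connecting the taxa. The tree, the node names, and the function name `phylogenetic_diversity` are all invented for illustration and are not taken from any of our pipelines.

```python
def phylogenetic_diversity(parent, taxa):
    """Total branch length on the paths from the given taxa to the root.

    parent: dict mapping node -> (parent node, length of the branch above node).
            The root has no entry. Each branch is counted once, even when
            several taxa share it.
    """
    counted = set()  # nodes whose branch to the parent is already counted
    total = 0.0
    for node in taxa:
        # Walk upward until we reach the root or a branch counted earlier.
        while node in parent and node not in counted:
            counted.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total


# Toy tree in Newick terms: ((A:1,B:2)X:3,(C:4,D:5)Y:6)root;
parent = {
    "A": ("X", 1.0), "B": ("X", 2.0), "X": ("root", 3.0),
    "C": ("Y", 4.0), "D": ("Y", 5.0), "Y": ("root", 6.0),
}
print(phylogenetic_diversity(parent, ["A", "B"]))            # 6.0
print(phylogenetic_diversity(parent, ["A", "C"]))            # 14.0
print(phylogenetic_diversity(parent, ["A", "B", "C", "D"]))  # 21.0
```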

For many of our applications we have been using an algorithm described by Mike Steel in "Phylogenetic diversity and the Greedy Algorithm". But alas, it is not keeping up with the massive tree sets we are dealing with. Fortunately Aaron Darling in my lab found an alternative paper with a perfect sounding title for us: Phylogenetic Diversity within Seconds from Minh, Klaere, and von Haeseler. This seems like it will do the trick. I note - Kudos to Systematic Biology for making some older papers freely available. Not sure of their general policies on this but good to see.
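The greedy idea itself is simple: to pick k taxa with high PD, grow the set one taxon at a time, always adding the taxon that contributes the most new branch length. Below is a naive Python sketch of that idea on a toy tree. It assumes a rooted tree stored as a child-to-parent map, uses invented names (`path_gain`, `greedy_pd`), and makes no attempt at the speedups in either paper; it rescans every remaining taxon at each step, which is exactly the kind of thing that stops scaling on very large trees.

```python
def path_gain(parent, node, covered):
    """New branch length gained by adding `node`, given covered branches."""
    gain = 0.0
    while node in parent and node not in covered:
        up, length = parent[node]
        gain += length
        node = up
    return gain


def greedy_pd(parent, taxa, k):
    """Greedily choose up to k taxa, each time taking the largest PD gain."""
    covered = set()           # nodes whose branch to the parent is counted
    chosen, total = [], 0.0
    remaining = sorted(taxa)  # sorted so that ties break deterministically
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda t: path_gain(parent, t, covered))
        total += path_gain(parent, best, covered)
        # Mark the newly used branches as covered.
        node = best
        while node in parent and node not in covered:
            covered.add(node)
            node = parent[node][0]
        chosen.append(best)
        remaining.remove(best)
    return chosen, total


# Toy tree in Newick terms: ((A:1,B:2)X:3,(C:4,D:5)Y:6)root;
parent = {
    "A": ("X", 1.0), "B": ("X", 2.0), "X": ("root", 3.0),
    "C": ("Y", 4.0), "D": ("Y", 5.0), "Y": ("root", 6.0),
}
chosen, total = greedy_pd(parent, ["A", "B", "C", "D"], 2)
print(chosen, total)  # ['D', 'B'] 16.0
```

On this toy tree the first pick is D (its path to the root is the longest), and the second is B, since C now shares the already counted branch above Y.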

Anyway - back to the grind ...

Monday, January 9, 2012

Draft post cleanup #11: Tree Hugging

Yet another post in my "draft blog post cleanup" series. Here is #11 from September.

Just a quick one. In August a nice review paper came out on phylogenetic analysis software: Learning to Become a Tree Hugger | The Scientist. Written by Amy Maxmen, it is "A guide to free software for constructing and assessing species relationships". Definitely worth checking out.

Among the key links & tools discussed:

Saturday, August 6, 2011

New paper from my lab (& the Facciotti lab): Mauve Assembly Metrics #Halophiles #Genomics

Just a quick post here. A new paper from my lab has come out in Bioinformatics. The paper is relatively simple. Titled "Mauve Assembly Metrics", it reports work of Aaron Darling and Andrew Tritt (with some minor contributions from me and Marc Facciotti). Aaron wrote the program Mauve when he was a student in Nicole Perna's lab at Wisconsin: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Over the years he (and others) have continued to develop the program and have written a few more papers, including, for example, one on the development of progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. This new paper basically reports a system/scripts to measure assembly quality. Here is the abstract:

High throughput DNA sequencing technologies have spurred the development of numerous novel methods for genome assembly. With few exceptions, these algorithms are heuristic and require one or more parameters to be manually set by the user. One approach to parameter tuning involves assembling data from an organism with an available high quality reference genome, and measuring assembly accuracy using some metrics. We developed a system to measure assembly quality under several scoring metrics, and to compare assembly quality across a variety of assemblers, sequence data types, and parameter choices. When used in conjunction with training data such as a high quality reference genome and sequence reads from the same organism, our program can be used to manually identify an optimal sequencing and assembly strategy for de novo sequencing of related organisms.

Check out the paper: Mauve Assembly Metrics. Download the scripts/code from http://ngopt.googlecode.com, along with Mauve, play around, and let me know what you think.

Note this paper was supported by a grant from the National Science Foundation (ER 0949453). That grant is focused on comparative genomics (sequencing and analysis) of halophilic archaea. Stay tuned for more on that project as we are writing up a series of papers ....

Some related links:
