

Posters


2016


36. Detection of Structural Variants using third generation sequencing. Presented by Fritz Sedlazeck et al.
Advances in Genome Biology and Technology (AGBT), Orlando, FL. Feb 12, 2016.

35. GenomeScope: Fast genome analysis from unassembled short reads. Presented by Greg Vurture et al.
Advances in Genome Biology and Technology (AGBT), Orlando, FL. Feb 12, 2016.


2015


34. Genome and transcriptome of the regeneration-competent flatworm, Macrostomum lignano. Presented by Kaja Wasik et al.
Genome Informatics, Cold Spring Harbor, NY. Oct 28-30, 2015.

33. NextGenMap-LR: Highly accurate read mapping of third generation sequencing reads for improved structural variation analysis. Presented by Philipp Rescheneder et al.
Genome Informatics, Cold Spring Harbor, NY. Oct 28-30, 2015.

32. The Resurgence of Reference Quality Genomes
Genome Informatics, Cold Spring Harbor, NY. Oct 28-30, 2015.

31. Detection of Structural Variants using third generation sequencing. Presented by Fritz Sedlazeck et al.
Genome Informatics, Cold Spring Harbor, NY. Oct 28-30, 2015.


2014


30. Near perfect assemblies of eukaryotic genomes using PacBio long read sequencing
Biology of Genomes, Cold Spring Harbor, NY. May 10, 2014.

29. Accurate detection of de novo and transmitted INDELs within exome-capture data using micro-assembly
Advances in Genome Biology and Technology (AGBT), Marco Island, FL. Feb 14, 2014.

28. Error correction and assembly complexity of single molecule sequencing reads: How long is long enough?
Authored by Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W. Richard McCombie, and Michael Schatz
Advances in Genome Biology and Technology (AGBT), Marco Island, FL. Feb 14, 2014.


2013


27. An improved method for hybrid correction of long-read, low-identity sequencing data
Authored by James Gurtowski, Hayan Lee, and Michael Schatz
Genome Informatics, Cold Spring Harbor, NY. Oct 31, 2013.

26. COIN-VGH: Sensitive and specific virtual genomic hybridization to pinpoint de novo variations
Authored by Laura Gomez-Romero et al.
Biology of Genomes, Cold Spring Harbor, NY. May 7-11, 2013.

25. De novo genome metassembly
Authored by Alejandro H. Wences, Paul Baranay, and Michael C. Schatz
Biology of Genomes, Cold Spring Harbor, NY. May 7-11, 2013.

2012


24. FASTG: Representing the true information content of a genome assembly
Authored by Iain MacCallum et al.
Biology of Genomes, Cold Spring Harbor, NY. May 8-12, 2012.

23. Assessing the role of de novo gene-killers in the incidence of autism
Authored by Ivan Iossifov et al.
Biology of Genomes, Cold Spring Harbor, NY. May 8-12, 2012.

22. Detection and validation of de novo mutations in exome-capture data using micro-assembly
Authored by Giuseppe Narzisi et al.
Biology of Genomes, Cold Spring Harbor, NY. May 8-12, 2012.

21. Hybrid Error Correction and De Novo Assembly of Single-Molecule Sequencing Reads
Authored by Adam Phillippy et al.
AGBT, Marco Island, FL, Feb 15-18, 2012.

20. Combining Sequences from Different Sequencing Platforms (Hiseq, Miseq, PacBio) to Improve de novo Genome Assembly.
Authored by Eric Antoniou et al.
AGBT, Marco Island, FL, Feb 15-18, 2012.


2011


19. Jnomics: A cloud-scale sequence analysis suite.
Authored by Matt Titmus, Sneh Lata, Eric Antoniou, James Gurtowski, W. Richard McCombie, and Michael Schatz
Genome Informatics, Cold Spring Harbor, NY. Nov 3, 2011.

18. Metassembler: A pipeline for improving de novo genome assembly.
Authored by Paul Baranay, Scott Emrich, and Michael Schatz.
Genome Informatics, Cold Spring Harbor, NY. Nov 3, 2011.

17. Rare de novo and transmitted mutations in autistic spectrum disorders.
Authored by Michael Ronemus et al.
ASHG, Montreal, CA. Oct 11-15, 2011.

16. Combining short (Illumina) and long (PacBio) NGS reads to improve de novo genome assemblies
Authored by Michael Schatz, Melissa delaBastide, Stephanie Muller, Laura Gelley, Eric Antoniou, and W. Richard McCombie
ASHG, Montreal, CA. Oct 11-15, 2011.

15. Genome Mappability Analyzer: Characterizing the dark matter of the human genome
Authored by Hayan Lee and Michael Schatz
Personal Genomes, CSHL, Cold Spring Harbor, NY. Sept 30 - Oct 2, 2011.

14. MicroSeq: High-throughput, genome-wide microsatellite genotyping in individual genomes
Authored by Mitch Bekritsky, Jennifer Troge, Dan Levy, Michael Wigler, and Michael Schatz
Personal Genomes, CSHL, Cold Spring Harbor, NY. Sept 30 - Oct 2, 2011.


2010


13. Quality guided correction and filtration of errors in short reads.
Authored by David Kelley, Michael Schatz, and Steven Salzberg
ISMB 2010, Boston, MA. July 11-13, 2010.

High-throughput sequencing technologies such as that offered by Illumina have permeated nearly all areas of biological research. Illumina's technology produces sequencing reads of 35-125 bp that may have base calling errors at rates as high as 1-2%. These errors create difficulties for downstream sequence analysis tasks, such as detecting overlaps for genome assembly or aligning reads to a reference genome for SNP detection.

Due to its lower cost, deep coverage of the genome is generally possible using Illumina sequencing. Past work has shown that errors can be identified in a set of reads with deep coverage by first counting k-mers, and then considering k-mers with coverage less than some threshold to be artifacts of sequencing errors. Previous methods to correct reads with errors have searched for a minimal set of edits to the read that ensure all k-mers have sufficient coverage. We demonstrate that, due to biases in where errors occur in the read and the likelihood of specific nucleotide-to-nucleotide errors, such approaches are prone to mistakes that produce corrected reads failing to represent any true genomic fragment.

We introduce a program named Quake that corrects errors in reads using a more robust model of sequencing errors than previous approaches. By using read quality values and learning the rates at which each nucleotide is miscalled as each other nucleotide, Quake achieves near-perfect accuracy on simulated data. We also demonstrate the role of error correction with Quake in improving assembly and SNP detection.
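
As a concrete illustration of the k-mer coverage idea described above, the following minimal Python sketch (not Quake itself) counts k-mers across a deep read set and flags read positions covered only by low-coverage, untrusted k-mers. The k value, coverage cutoff, and toy reads are illustrative, and the sketch ignores the quality values and nucleotide-specific error rates that Quake additionally models.

    from collections import Counter

    def kmer_counts(reads, k=15):
        """Count every k-mer across all reads."""
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return counts

    def suspect_positions(read, counts, k=15, min_cov=3):
        """Return read positions covered only by low-coverage ('untrusted') k-mers,
        which are the likely sites of sequencing errors."""
        trusted = [False] * len(read)
        for i in range(len(read) - k + 1):
            if counts[read[i:i + k]] >= min_cov:
                for j in range(i, i + k):
                    trusted[j] = True
        return [p for p, ok in enumerate(trusted) if not ok]

    # Toy example: deep coverage of one fragment plus a read with one miscalled base.
    reads = ["ACGTACGTGGAACCTTG"] * 20 + ["ACGTACGTGGAACCTAG"]
    counts = kmer_counts(reads, k=8)
    print(suspect_positions(reads[-1], counts, k=8, min_cov=3))  # positions near the error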


12. De Novo Assembly of Large Genomes with Cloud Computing.
Co-authored with Dan Sommer, David Kelley, and Mihai Pop.
Biology of Genomes 10, Cold Spring Harbor, NY. May 14, 2010.

The first step towards analyzing a previously unsequenced organism is to assemble the genome by merging together the sequencing reads into progressively longer contig sequences. New assemblers such as Velvet, Euler-USR, and SOAPdenovo attempt to reconstruct the genome by constructing, simplifying, and traversing the de Bruijn graph of the reads. These assemblers have successfully assembled small genomes from short reads, but have had limited success scaling to larger mammalian-sized genomes, mainly because they require memory and compute resources that are unobtainable for most users.

Addressing this limitation, we are developing a new assembly program, Contrail (http://contrail-bio.sf.net), which uses the Hadoop/MapReduce distributed computing framework to enable de novo assembly of large genomes. MapReduce was developed by Google to simplify their large data processing needs by scaling computation across many computers, and the open-source version called Hadoop (http://hadoop.apache.org) is becoming a de facto standard for large data analysis, especially in so-called cloud computing environments where compute resources are rented on demand. For example, we have also successfully leveraged Hadoop and the Amazon Elastic Compute Cloud for Crossbow (http://bowtie-bio.sf.net/crossbow) to accelerate short read mapping and genotyping, allowing quick (< 4 hours), cheap (< $100), and accurate (> 99% accuracy) genotyping of an entire human genome from 38-fold short read coverage.

Similar to other leading short read assemblers, Contrail relies on the graph-theoretic framework of de Bruijn graphs. However, unlike these programs, Contrail uses Hadoop to parallelize the assembly across many tens or hundreds of computers, effectively removing memory concerns and making assembly feasible for even the largest genomes. Preliminary results show contigs produced by Contrail are of similar size and quality to those generated by other leading assemblers when applied to small (bacterial) genomes, while scaling far better to large genomes. We are also developing extensions to Contrail to efficiently compute a traditional overlap-graph based assembly of large genomes within Hadoop, a strategy that will be especially valuable as read lengths increase to 100 bp and beyond.
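
The MapReduce pattern sketched above can be illustrated with a small, self-contained Python simulation (an illustration only, not Contrail's code): the map step emits a (k-mer, successor k-mer) edge for every position in every read, the shuffle groups edges by source k-mer, and the reduce step writes out each node's adjacency list. In the real setting the mapper and reducer would run as distributed Hadoop tasks and the framework would perform the grouping; the k-mer size and reads below are arbitrary.

    from collections import defaultdict

    K = 5  # illustrative k-mer size; real assemblies use larger k (e.g. 21-31)

    def map_phase(reads):
        """Emit one (k-mer, successor k-mer) edge per position in each read."""
        for read in reads:
            for i in range(len(read) - K):
                yield read[i:i + K], read[i + 1:i + 1 + K]

    def reduce_phase(edges):
        """Group edges by source node to form each node's adjacency list."""
        graph = defaultdict(set)
        for node, succ in edges:  # the shuffle step groups values by key
            graph[node].add(succ)
        return {node: sorted(succs) for node, succs in graph.items()}

    reads = ["ACGTACGTTACGT", "CGTTACGTAGGA"]
    print(reduce_phase(map_phase(reads)))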



2009


11. Whole Genome Resequencing Analysis in the Clouds.
Co-authored with Ben Langmead, Jimmy Lin, Mihai Pop, and Steven Salzberg.
SC09. Portland, OR. Nov. 17, 2009.

Biological researchers have a critical need for highly efficient methods for analyzing vast quantities of DNA resequencing data. For example, the 1000 Genomes Project aims to characterize the variations within 1000 human genomes by aligning and analyzing billions of short DNA sequences from each individual. Each genome will utilize ~100 GB of compressed sequence data and require ~400 hours of computation. Crossbow is our new high-throughput pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory-efficient short read aligner, and SoapSNP, a highly accurate Bayesian genotyping algorithm, within the distributed processing framework Hadoop to accelerate the computation using many compute nodes. Our results show the pipeline is extremely efficient, and can accurately analyze an entire genome in one day on a small 10-node local cluster, or in one afternoon and for less than $250 in the Amazon EC2 cloud. Crossbow is available open-source at http://bowtie-bio.sourceforge.net/crossbow.
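
The division of work described above can be sketched as a toy map/shuffle/reduce job in Python. This is illustrative only, not Crossbow's implementation, and it substitutes a simple majority vote for SOAPsnp's Bayesian genotyping model: the map step keys each alignment by a genomic partition, the shuffle groups alignments by partition, and the reduce step calls a consensus base at every covered position. The alignment tuples and bin size are made up for the example.

    from collections import defaultdict, Counter

    def map_phase(alignments, bin_size=10_000):
        """alignments: (chrom, pos, aligned_bases) tuples, keyed by (chrom, bin)."""
        for chrom, pos, bases in alignments:
            yield (chrom, pos // bin_size), (pos, bases)

    def reduce_phase(binned):
        """Call the most frequent base at every covered position in each partition."""
        calls = {}
        for (chrom, _bin), hits in binned.items():
            pileup = defaultdict(Counter)
            for pos, bases in hits:
                for offset, base in enumerate(bases):
                    pileup[pos + offset][base] += 1
            for pos, votes in pileup.items():
                calls[(chrom, pos)] = votes.most_common(1)[0][0]
        return calls

    # Shuffle: group mapper output by key, as the MapReduce framework would.
    binned = defaultdict(list)
    for key, value in map_phase([("chr1", 100, "ACGT"), ("chr1", 102, "GTTA")]):
        binned[key].append(value)
    print(reduce_phase(binned))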


10. Commodity Computing in Genomics Research.
Co-authored with Mihai Pop and Dan Sommer.
NSF CLuE PI Meeting. Mountain View CA. Oct 5, 2009.

In the next few years the data generated by DNA sequencing instruments around the world will exceed petabytes, surpassing the amounts of data generated by large-scale physics experiments such as the Large Hadron Collider. Even today, the terabytes of data generated every few days by each sequencing instrument test the limits of existing network and computational infrastructures. Our project is aimed at evaluating whether cloud computing technologies and the MapReduce/Hadoop infrastructure can enable the analysis of the large datasets being generated. We will report on initial results in two specific applications: human genotyping and genome assembly using next-generation sequencing data.


9. Human SNPs from short reads in hours using cloud computing.
Co-authored with Ben Langmead, Jimmy Lin, Mihai Pop, and Steven Salzberg.
Awarded Best Poster at 3rd Annual Young Investigator Symposium on Genomics & Bioinformatics. Johns Hopkins University. Baltimore, MD. Sept 25, 2009.

As growth in short read sequencing throughput vastly outpaces improvements in microprocessor speed, there is a critical need to accelerate common tasks, such as short read alignment and SNP calling, via large-scale parallelization.

Crossbow is a software tool that combines the speed of the short read aligner Bowtie and the accuracy of the SOAPsnp consensus and SNP caller within a cloud computing environment. Crossbow aligns reads and makes highly accurate SNP calls from a dataset comprising 38-fold coverage of the human genome in under 1 day on a local 40-core cluster, and under 3 hours using a 320-core cluster rented from Amazon’s Elastic Compute Cloud (EC2) service. Crossbow’s ability to run on EC2 means that users need not own or operate an expensive computer cluster in order to run Crossbow. Crossbow is available at http://bowtie-bio.sf.net/crossbow under the Artistic license.


8. Human SNPs from short reads in hours using cloud computing.
Co-authored with Ben Langmead, Jimmy Lin, Mihai Pop, and Steven Salzberg.
WABI 2009. Philadelphia, PA. Sept 12, 2009.

7. Towards a de novo short read assembler for large genomes using cloud computing.
Biology of Genomes 09, Cold Spring Harbor, NY. May 7, 2009.

The massive volume of data and short read lengths from next-generation DNA sequencing machines have spurred development of a new class of short read genome assemblers. Several of the new assemblers, such as Velvet and Euler-USR, model the assembly problem as constructing, simplifying, and traversing the de Bruijn graph of the read sequences, where nodes in the graph represent k-mers in the reads, with edges between nodes for consecutive k-mers. This approach has many advantages for these data, such as efficient computation of overlapping reads and robust handling of sequencing errors, and has demonstrated success for assembling small to moderately sized genomes. However, this approach is computationally challenging to scale to mammalian-sized genomes because it requires constructing and manipulating a graph far larger than can fit into memory.

MapReduce was developed at Google for parallel computation on their extremely large data sets, including their database of more than 1 trillion web pages. Computation in MapReduce is structured into two main phases: a map phase, which constructs a large distributed hash table of key-value pairs, and a reduce phase, which evaluates a function on each bucket of that hash table. The power of MapReduce is that dozens or hundreds of map and reduce instances can execute in parallel, enabling efficient computation even on terabyte and petabyte-sized data sets.

Drawing on the success of CloudBurst, a MapReduce-based short read mapping algorithm capable of mapping millions of reads to the human genome with high sensitivity, we have developed a MapReduce-based short read assembler that shows tremendous potential for enabling de novo assembly of mammalian-sized genomes. The de Bruijn graph is constructed with MapReduce by emitting and then grouping key-value pairs (k_i, k_{i+1}) between successive k-mers in the read sequences. After construction, MapReduce is used again to remove spurious nodes and edges from the graph caused by sequencing errors in the reads, and to compress simple chains of nodes into long sequence nodes representing the unambiguous regions of the genome between repeat boundaries. The resulting graph is a small fraction of the size of the original de Bruijn graph, and is output in a format compatible with other short read assemblers for additional analysis.
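
The chain-compression step described above can be sketched in a few lines of Python. This is an in-memory toy (Contrail performs the equivalent transformation with MapReduce over a distributed graph): nodes with exactly one incoming and one outgoing edge are merged into a single long sequence node, and the tiny graph at the end is made up for illustration.

    def compress_chains(edges):
        """edges: dict mapping each k-mer node to its list of successor k-mers.
        Merges unambiguous chains (one in-edge, one out-edge) into single sequences.
        Assumes an acyclic toy graph for brevity."""
        indegree = {}
        for succs in edges.values():
            for s in succs:
                indegree[s] = indegree.get(s, 0) + 1

        def is_chain_node(n):
            return len(edges.get(n, [])) == 1 and indegree.get(n, 0) == 1

        sequences = []
        for node in edges:
            if is_chain_node(node):
                continue  # interior chain nodes are absorbed by their chain's start
            for succ in edges[node]:
                path = [node]
                while is_chain_node(succ):
                    path.append(succ)
                    succ = edges[succ][0]
                path.append(succ)
                # Consecutive k-mers overlap by k-1 bases, so append one base per step.
                sequences.append(path[0] + "".join(p[-1] for p in path[1:]))
        return sequences

    # ACG -> CGT -> GTA is an unambiguous chain; GTA then branches to TAC and TAG.
    edges = {"ACG": ["CGT"], "CGT": ["GTA"], "GTA": ["TAC", "TAG"]}
    print(compress_chains(edges))  # ['ACGTA', 'GTAC', 'GTAG']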


6. Improving the genome sequence of D. simulans via co-assembly of multiple strains.
Co-authored with Adam Phillippy, Timur Chabuk, Bo Liu and Steven Salzberg.
Biology of Genomes 09, Cold Spring Harbor, NY. May 7, 2009.

Using the closely related D. melanogaster as a reference, we aggressively co-assembled the seven sequenced strains of D. simulans using our comparative assembly program, AMOScmp. The AMOScmp contigs were then passed to Celera Assembler to assemble and scaffold alongside reads that failed to match the reference genome. Our improved co-assembly increases depth of coverage threefold over the original assembly and contains thousands of additional genes. These results show that comparative assembly is a promising means for assembling diverse population samples and outperforms traditional assembly in quality and the number of genes it is able to successfully recover. In addition, when combined with overlap-based assembly, comparative assembly can succeed even for reference genomes of a different species.


5. A whole-genome assembly of the domestic cow, B. taurus.
Co-authored with Aleksey Zimin, Steven Salzberg, et al.
Biology of Genomes 09, Cold Spring Harbor, NY. May 7, 2009.

4. Better Modules in Protein-Protein Interaction Networks.
Co-authored with Saket Navlakha and Carl Kingsford.
Pacific Symposium on Biocomputing, Hawaii. Jan. 10, 2009.


2008 and earlier


3. Genome Assembly Forensics: Finding the Elusive Mis-assembly.
Co-authored with Adam Phillippy and Mihai Pop.
Biology of Genomes 08, Cold Spring Harbor, NY. May 9, 2008.

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that the draft was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large-scale assembly quality.

Our new automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have done, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled. This approach is necessary for accurately detecting mis-assemblies because each individual characteristic has unavoidable natural variation, but, when considered together, they provide greatly increased analysis power. Furthermore, our pipeline can easily be adjusted to analyze assemblies utilizing new sequencing technologies where some metrics are unreliable or not available, such as base pair quality or mate pairs.

Our validation pipeline provides a robust measure of assembly quality that goes beyond the simple measures commonly reported. Evaluation of the pipeline has shown it to be highly sensitive for mis-assembly detection, and has revealed mis-assemblies in both draft and finished genomes. This is particularly troubling as scientists move away from the "gene by gene" paradigm and attempt to understand the global organization of genomes. Without a correct genome sequence or even a clear understanding of the errors present, such studies may draw incorrect conclusions. Our goals are to help scientists locate mis-assembled regions of an assembly and help them correct those regions by focusing their efforts where it is needed most. amosvalidate is compatible with many common assembly formats and is released open-source at http://amos.sourceforge.net.
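
The evidence-combination step can be illustrated with a short Python sketch (a toy, not amosvalidate's actual statistics): each metric contributes a list of suspicious intervals along a contig, and only regions supported by at least a chosen number of independent metrics are reported. The intervals and threshold below are made up.

    def combine_evidence(metric_intervals, min_support=2):
        """metric_intervals: dict of metric name -> list of (start, end) intervals.
        Reports regions covered by at least min_support metrics at once."""
        events = []
        for intervals in metric_intervals.values():
            for start, end in intervals:
                events.append((start, +1))
                events.append((end, -1))
        events.sort()

        flagged, depth, region_start = [], 0, None
        for pos, delta in events:
            depth += delta
            if depth >= min_support and region_start is None:
                region_start = pos
            elif depth < min_support and region_start is not None:
                flagged.append((region_start, pos))
                region_start = None
        return flagged

    suspicious = {
        "mate_pairs":  [(1000, 2500), (8000, 9000)],
        "coverage":    [(1200, 2000)],
        "snp_density": [(1800, 2600), (5000, 5200)],
    }
    print(combine_evidence(suspicious))  # [(1200, 2500)]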


2. High-throughput sequence alignment using Graphics Processing Units.
Co-authored with Cole Trapnell.
15th Annual Microbial Genomes Conference. College Park, MD. Sept 17, 2007.

1. Improving Genome Assembly without Sequencing.
Co-authored with Steven Salzberg, Arthur Delcher, and Pawel Gajer.
RECOMB 2005. Cambridge, MA. May 5, 2005.