Algorithms for studying the structure and function of genomes

Michael Schatz, Cold Spring Harbor Laboratory
Host: Randal Burns

One of the most important problems in biology and medicine is determining the relationship between the sequence of a genome and the traits it gives rise to, as this can provide insight into the risks for disease, the nature of development, and the forces of evolution. Among the most powerful methods we have for addressing this question is large-scale computational analysis of genome sequences within different tissues, populations, and species.

These questions have motivated us to develop several new, highly scalable algorithms and high-performance systems for constructing and analyzing large collections of strings, trees, and graphs of biological sequence data. One of our main strengths is de novo genome assembly, where the genome of a species is reconstructed from billions of short DNA sequencing reads through the construction and transformation of large sequence graphs. Recently we have focused on solving this problem using new single-molecule sequencing technologies from PacBio and Oxford Nanopore that produce much longer reads, approaching 100,000 bp instead of mere hundreds, but suffer from very high error rates (15% to 40%). Despite the low fidelity of these reads, we have developed algorithms that overcome most errors and have used the data to assemble several very high-quality microbial, plant, and animal genomes.

We have applied these and other techniques to gain new insights into the genetics of autism, the progression of cancer, and the evolution of plant and animal genomes. Looking forward, we have begun developing the computational theory and scalable systems to construct ‘the graph of life’: a graph encoding how the genomes in a set relate to each other. In our preliminary work, we developed a new algorithm, SplitMEM, to analyze dozens of microbial genomes at once, but our ultimate goal is to scale these ideas to assemble a graph of all sequence variations in the human population.

Speaker Biography

Michael Schatz is an Associate Professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. His research focuses on developing scalable algorithms and systems for biological sequence analysis. Schatz received his Ph.D. in Computer Science from the University of Maryland in 2010 and his B.S. in Computer Science from Carnegie Mellon University in 2000, with four years at The Institute for Genomic Research (TIGR) in between. For more information, please visit his website at http://schatzlab.cshl.edu