Algorithms for Analyzing Intraspecific Sequence Variation

Srinath Sridhar, Carnegie Mellon University
Host: Randal Burns

Analysis of human genetic variation has gained significant momentum due to the success of many large-scale sequencing projects. Within the last five years, millions of single nucleotide polymorphisms (SNPs) have been genotyped over thousands of individuals belonging to several different ethnic groups. These large-scale efforts have now made it possible to analyze genetic variation within humans at very fine-scales.

In this talk, we develop maximum parsimony phylogenetic (evolutionary) tree reconstruction algorithms that are specifically catered to work on SNP data. Such a phylogeny should cluster closely related individuals (a sub-population) together. Therefore, these techniques are widely used to answer questions of human migration and genome evolution. Mathematically, we work with two variants of the phylogeny reconstruction problem, both of which are NP-complete. The first variant is equivalent to finding a Steiner minimum tree on a hypercube and the second is a generalization of this problem. Traditionally such problems are solved using hill-climbing heuristics. We show that the problems can be solved in polynomial time when the size of the phylogeny (Steiner minimum tree) is ‘small’ with respect to the dimensions of the hypercube (’near-perfect’), a standard-assumption of SNPs.

We will also briefly outline related work on clustering sub-populations by solving a max-cut instance on graphs, another well-known NP-complete problem. Again, we show that under practical assumptions of SNP data the max-cut can be found in polynomial time and that it can be proved to be the correct clusters.

We provide extensive empirical analysis to show that our methods work well in practice.

Speaker Biography

Srinath Sridhar graduated completed his BS in Computer Science from University of Texas at Austin with highest honors in 2003. He received his PhD in Computer Science from Carnegie Mellon University in 2007. His research interests include applied algorithms broadly with specific focus on computational genomics.