Topic Modeling with Structured Priors for Text-Driven Science

Michael Paul, Johns Hopkins University

Topic models, statistical models that describe low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. This talk will introduce topic models that can encode additional structure, such as factorizations, hierarchies, and correlations of topics, and that can incorporate supervision and domain knowledge. This is achieved by formulating the Bayesian priors over parameters as functions of underlying components, which can be constrained in various ways to induce different structures. I will first introduce this approach through a topic model called factorial LDA, which models a factorized structure in which topics are conceptually arranged in multiple dimensions. Factorial LDA can be used to model multiple types of information, for example topic and sentiment in reviews. I will then introduce a family of structured-prior topic models called SPRITE, which provides a unifying representation that generalizes factorial LDA as well as other existing topic models and offers a powerful framework for building new models. I will also show how these topic models can be used in various scientific applications, such as extracting medical information from forums, measuring healthcare quality from patient reviews, and monitoring public opinion in social media.
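
As a rough illustration of a prior formulated as a function of underlying components, the sketch below builds Dirichlet hyperparameters for each topic from a log-linear combination of shared component vectors. The log-linear form, the function name structured_dirichlet_prior, and the arrays beta and omega are assumptions made for this example and are not taken from the abstract; the binary pattern in beta only loosely mimics a two-dimensional factorized arrangement of topics.

```python
import numpy as np


def structured_dirichlet_prior(beta, omega):
    """Build per-topic Dirichlet hyperparameters from shared components.

    beta:  (K, C) array, how strongly each of K topics uses each component.
    omega: (C, V) array, one real-valued weight vector over V words per component.
    Returns a (K, V) array of strictly positive Dirichlet parameters.

    Assumption for illustration: the prior is log-linear in the components.
    """
    return np.exp(beta @ omega)


rng = np.random.default_rng(0)
K, C, V = 4, 2, 6  # topics, components, vocabulary size (toy values)

# Hypothetical component weights over the vocabulary.
omega = rng.normal(size=(C, V))

# Constraining beta induces structure: here each topic is an on/off
# combination of two components, like cells in a 2 x 2 grid.
beta = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

prior = structured_dirichlet_prior(beta, omega)

# Draw one word distribution per topic from its structured prior.
topics = np.vstack([rng.dirichlet(alpha) for alpha in prior])
```

Under this assumed form, the topic that uses both components has a prior equal to the elementwise product of the two single-component priors, so topics that share a component also share prior preferences over words. Other constraints on beta (for example sparsity or a tree layout) would induce other structures.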

Speaker Biography

Michael Paul is a PhD candidate in Computer Science at Johns Hopkins University. Beginning in August 2015, he will be an Assistant Professor of Information Science and Computer Science at the University of Colorado, Boulder. He earned a B.S. in CS from the University of Illinois at Urbana-Champaign in 2009 and an M.S.E. in CS from Johns Hopkins University in 2012. He has received PhD fellowships from Microsoft Research, the National Science Foundation, and the Johns Hopkins University Whiting School of Engineering. His research focuses on exploratory machine learning and natural language processing for the web and social media, with applications to computational epidemiology and public health informatics.