Large-Scale Paraphrasing for Text-to-Text Generation

Juri Ganitkevitch, Johns Hopkins University

We present our work on the extraction and estimation of syntactic paraphrases using commodity text data and automated linguistic annotation. Our initial approach leverages bilingual parallel data and builds on the synchronous context-free grammar (SCFG) extraction techniques used in machine translation. We then extend our estimation methods to include contextual similarity metrics drawn from vast amounts of plain text. We evaluate the quality of our paraphrases by applying a generalizable adaptation scheme that tunes our paraphraser to arbitrary text-to-text generation tasks, producing competitive results with little data and effort. We further discuss scaling our extraction method to large data sizes and building the paraphrase database PPDB, a large-scale collection of paraphrases in 23 languages.

Speaker Biography

Juri’s work has transitioned from working on Language Modeling, to Machine Translation, to Paraphrasing, to Semantic Parsing. His person transitioned from Ukraine to Germany, France, and the U.S. He’ll likely keep transitioning in one way or another.

Advisor: Chris Callison-Burch