Taming Complex Databases through Schema Summaries

Real scientific databases, biological databases in particular, are often very complex: their schemas can comprise thousands of tables, elements, and attributes. Any biologist wishing to interact with such a database first faces the daunting task of understanding its schema. In this talk, I will propose the concept of a schema summary, which provides a succinct overview of the underlying complex schema and significantly reduces the human effort required to understand the database. I will define criteria for good schema summaries and describe efficient algorithms for producing them. A schema summary greatly reduces the user effort needed to locate the schema elements required to construct a structured query, because it lets the user explore only the portions of the schema that are of interest. Nonetheless, as query complexity increases, querying through exploration is no longer viable because a significant percentage of the schema has to be explored. By leveraging the schema summary and a novel schema-based semantics for matching meaningful data fragments with structure-free search conditions, I will propose a novel query model called Meaningful Summary Query (MSQ). The MSQ model allows users to query a complex database through its schema summary, with embedded structure-free conditions. As a result, an MSQ query can be generated with knowledge of the schema summary alone, and yet retrieve highly accurate results from the database.

Speaker Biography

Cong Yu is a Ph.D. candidate in the Department of Electrical Engineering and Computer Science (EECS), University of Michigan, Ann Arbor. Before that, he obtained his Master's and undergraduate degrees in Biology from the University of Michigan and Fudan University, respectively. Cong Yu’s research interests are scientific data management, information integration and retrieval, and database query processing. His main research project is Schema Management for Complex and Heterogeneous Information Sources (http://www.eecs.umich.edu/db/schemasummary/), which addresses various issues involved in managing complex, real-world information sources. He is a founding member of the MiMI project (http://mimi.ncibi.org/), responsible for its overall schema design and the data transformation component. He is also a member of the TIMBER project (http://www.eecs.umich.edu/db/timber), contributing to its indexing and full-text retrieval components. In his spare time, Cong likes to read books on various topics and is an avid Michigan football fan.