Taming Big Data By Streaming

Lin Yang, Johns Hopkins University

Data streams have emerged as a natural computational model for numerous applications of big data processing. In this model, algorithms are assumed to have access to a limited amount of memory and can only make a single pass (or a few passes) over the data, but need to produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real world applications and stimulates new scalable tools for solving important problems in the big data era. In my PhD study, I focus on the following three aspects of the streaming model.

  1. Understanding the capability of the streaming model. For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying n-dimensional vector v (for very large n), we establish nearly tight space bounds on streaming algorithms of approximating functions of the form \sum_{i=1}^n g(v_i) for nearly all functions g of one-variable and l(v) for all symmetric norms l. These results provide a deeper understanding of the streaming computation model.

  2. Tighter upper bounds. We provide better streaming k-median clustering algorithms in a dynamic points stream, i.e., a stream of insertion and deletion of points on a discrete Euclidean space ([\Delta]^d for sufficiently large \Delta and d). Our algorithms use k poly(d \log \Delta) space/update time and maintain with high probability an approximate k-median solution to the streaming dataset. All previous algorithms for computing an approximation for the k-median problem over dynamic data streams required space and update time exponential in d.

  3. Applications in Machine Learning. We develop an efficient algorithm for stochastic data streams, i.e., the data points are drawn from a distribution. Stochastic streaming algorithms have broader applications in machine learning tasks. We present a novel framework that dynamically obtains the spectral clusters of a stochastic stream that is generated from a random walk over a network. We show that our algorithm achieves tight space/sample bounds.

Speaker Biography

Lin Yang is currently a PhD student in the Computer Science Department of Johns Hopkins University. He has defended his Ph.D. degree in Physics in this May. He obtained his Ms.E. degree in Computer Science in 2015 and received his Bacheler’s degree from Tsinghua University, China. In computer science, he works in the area of sublinear time/space algorithms, in particular, algorithms for streaming data analysis. He studies the fundamental complexity of algorithms on data streams as well as designs new and better algorithm for important problems. Lin’s research papers are published in top conferences, such as STOC, ICML and PODS. He will be joining the Department of Operations Research and Financial Engineering of Princeton University, as a postdoc researcher. Lin Yang’s advisors are Professor Vova Braverman and Professor Alex Szalay.

Advisor: Vladimir Braverman