Approximating String Compressibility and the Distribution Support Size

Sofya Raskhodnikova, Penn State University

Imagine having to choose between a few compression schemes to compress a very long file. Before deciding on a scheme, you might want to obtain a rough estimate of how well each one performs on your file. Another motivating scenario comes from areas such as data mining, natural language processing, and genomics, where compression schemes are used as tools for measuring properties of strings, such as similarity and entropy. In these applications, one typically needs only the length of the compressed version of a file, not the output itself.

In this talk, we consider the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We concentrate on run-length encoding and a version of Lempel-Ziv as our example compression schemes. We present algorithms and lower bounds for approximating compressibility with respect to both schemes.
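To make these notions concrete, here is a small illustrative sketch in Python of two natural cost measures: the length of a run-length encoding (under an assumed cost model in which each maximal run contributes one symbol plus a binary counter for its length) and the number of phrases in a simplified greedy Lempel-Ziv parse. These are only sketches of plausible cost models; the exact definitions used in the talk may differ.

    from math import ceil, log2

    def rle_compressed_length(s: str) -> int:
        # Sum, over maximal runs, of one symbol plus a binary counter for the
        # run length (an assumed cost model, for illustration only).
        total, i = 0, 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            run_length = j - i
            total += 1 + ceil(log2(run_length + 1))
            i = j
        return total

    def lz_phrase_count(s: str) -> int:
        # Number of phrases in a greedy parse: each phrase extends the longest
        # prefix of the unparsed suffix that already occurs in the parsed prefix
        # by one new character (a simplified Lempel-Ziv variant).
        phrases, i = 0, 0
        while i < len(s):
            length = 1
            while i + length <= len(s) and s[i:i + length] in s[:i]:
                length += 1
            phrases += 1
            i += length
        return phrases

    print(rle_compressed_length("aaabbbbc"))  # runs of lengths 3, 4, 1
    print(lz_phrase_count("abababab"))        # phrases: 'a', 'b', 'aba', 'bab'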

The second problem we consider is approximating the support size of a distribution from a small number of samples, when each element in the distribution’s support appears with probability at least 1/n. Our motivation is the tight relationship between this problem and estimating compressibility with respect to Lempel-Ziv. However, estimating the distribution support size is also studied in other areas of computer science, along with an important variant: estimating the number of distinct elements in a sequence of length n.
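As a point of reference for the access model (and only as an assumed, naive baseline, not the estimator from the talk), the sketch below queries a handful of uniformly random positions of a length-n sequence and counts the distinct values it sees; with few queries this count can fall far below the true number of distinct elements, which is what makes the estimation problem interesting.

    import random

    def naive_distinct_estimate(seq, num_queries, rng=random.Random(0)):
        # Query num_queries uniformly random positions and count the distinct
        # values observed. A naive baseline, for illustration only.
        positions = (rng.randrange(len(seq)) for _ in range(num_queries))
        return len({seq[p] for p in positions})

    # A sequence of length 10,000 with 1,000 distinct values.
    seq = [i % 1000 for i in range(10_000)]
    print(len(set(seq)))                      # 1000, the true count
    print(naive_distinct_estimate(seq, 100))  # typically far below 1000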

For all three problems (distribution support, distinct elements and Lempel-Ziv compressibility), we prove a nearly linear (in n) lower bound on the query complexity, applicable even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, X and Y, with very different expectations and the following condition on the first k moments: E[X]/E[Y] = E[X^2]/E[Y^2] = … = E[X^k]/E[Y^k]. Our lower bound method is also applicable to other problems. In particular, it gives a new lower bound for the sample complexity of approximating the entropy of a distribution.
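To illustrate what the moment condition says (using a hypothetical helper, not the construction from the paper), the sketch below checks whether the ratios E[X^i]/E[Y^i] agree for i = 1, ..., k on a small hand-made pair of distributions. The toy pair satisfies the condition only for k = 2 and has nearby expectations, whereas the construction behind the lower bound matches many moment ratios while keeping the expectations very different.

    from fractions import Fraction

    def moment(dist, i):
        # E[Z^i] for a discrete distribution given as {value: probability}.
        return sum(p * Fraction(v) ** i for v, p in dist.items())

    def moment_ratios_match(dist_x, dist_y, k):
        # True if E[X^i]/E[Y^i] is the same ratio for all i = 1, ..., k.
        ratios = [moment(dist_x, i) / moment(dist_y, i) for i in range(1, k + 1)]
        return all(r == ratios[0] for r in ratios)

    X = {1: Fraction(6, 7), 3: Fraction(1, 7)}  # E[X] = 9/7,  E[X^2] = 15/7
    Y = {1: Fraction(1, 2), 2: Fraction(1, 2)}  # E[Y] = 3/2,  E[Y^2] = 5/2
    print(moment_ratios_match(X, Y, 2))         # True: both ratios equal 6/7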

Based on joint work with Dana Ron, Ronitt Rubinfeld, and Adam Smith, published in RANDOM 2007, and joint work with Dana Ron, Amir Shpilka, and Adam Smith, published in FOCS 2007.