Bilkent University

Department of Computer Engineering

Sublinear Algorithms on Massive Data Sets

**Tuğkan Batu**

University of Texas at Austin

Technological advances of the last decades such as cheaply available computing power, high-volume data storage, and the omnipresence of networking made it possible to collect and store enormous amounts of data. Many problems that arise in data mining, machine learning applications, and analysis of experimental data in scientific fields require that we understand global statistical properties of data. For this class of problems, one critical parameter that affects the efficiency of the algorithm is the size of the domain over which the data is distributed. Standard statistical techniques perform poorly when the data is over a large domain. We study the sample complexity of various basic statistical inference tasks as a function of the domain size of the underlying discrete probability distributions. All these problems take samples from the distributions as input. The tasks are (1) to distinguish identical pairs of distributions from pairs of distributions that have large statistical distance, (2) to test whether the input distribution is statistically close to an explicitly specified distribution, (3) to test whether the marginal distributions induced by a joint distribution are independent, (4) to estimate the entropy of a distribution. We give algorithms with sublinear sample complexity for each of these problems. We also show that that our results for the first three problems are tight up to polylogarithmic factors. We also investigate special classes of distributions, in which we can solve these problems much more efficiently. In addition to statistical problems mentioned above, many problems from various domains are given sublinear-time approximation algorithms in recent years. To illustrate the surprising power of sublinear-time computation, we show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings, runs in sublinear time and, with high probability, returns ``CLOSE'' if their edit distance is $O(n^c)$, where $c$ is a fixed parameter less than 1, and ``FAR'' if their edit distance is linear. Our algorithm thus provides a trade-off between accuracy and efficiency that is particularly useful when the input data is very large.

**DATE:
July 23, 2004, Friday @ 10:40
PLACE: EA-409**