CMSA Computer Science for Mathematicians: Learning-Based Support Size Estimation in Sublinear Time


View Calendar
February 23, 2021 11:30 am - 12:30 pm
via Zoom Video Conferencing

Shyam Narayanan - MIT

We consider the problem of estimating the number of distinct elements in a large data set from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \epsilon n$ from a sample of size $O(\log^2(1/\epsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to $\log (1/\eps) \cdot n^{1-\Theta(1/\log(1/\eps))}.$ We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.