On the concentration of the missing mass

Daniel Berend (Ben-Gurion University)
Aryeh Kontorovich (Ben-Gurion University)


A random variable is sampled from a discrete distribution. The missing mass is the probability of the set of points not observed in the sample. We sharpen and simplify McAllester and Ortiz's results (JMLR, 2003) bounding the probability of large deviations of the missing mass. Along the way, we refine and rigorously prove a fundamental inequality of Kearns and Saul (UAI, 1998).

Full Text: Download PDF | View PDF online (requires PDF plugin)

Pages: 1-7

Publication Date: January 9, 2013

DOI: 10.1214/ECP.v18-2359


  • Bhattacharyya, Chiranjib; Keerthi, S. Sathiya. Mean-field methods for a special class of belief networks. J. Artificial Intelligence Res. 15 (2001), 91--114 (electronic). MR1884078
  • Dubhashi, Devdatt; Ranjan, Desh. Balls and bins: a study in negative dependence. Random Structures Algorithms 13 (1998), no. 2, 99--124. MR1642566
  • Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 1963 13--30. MR0144363
  • Michael~J. Kearns and Lawrence~K. Saul. Large deviation methods for approximate probabilistic inference. In UAI, 1998.
  • Gábor Lugosi. Concentration-of-measure inequalities, rlhttp://www.econ.upf.es/~lugosi/anu.ps. 2003.
  • Maurer, Andreas. Thermodynamics and concentration. Bernoulli 18 (2012), no. 2, 434--454. MR2922456
  • McAllester, David; Ortiz, Luis. Concentration inequalities for the missing mass and for histogram rule error. J. Mach. Learn. Res. 4 (2004), no. 5, 895--911. MR2076001
  • David~A. McAllester and Robert~E. Schapire. On the convergence rate of good-turing estimators. In COLT, 2000.
  • Andrew~Y. Ng and Michael~I. Jordan. Approximate inference algorithms for two-layer bayesian networks. In NIPS, 1999.
  • XuanLong Nguyen and Michael~I. Jordan. On the concentration of expectation and approximate inference in layered networks. In NIPS, 2003.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.