Reservoir sampling

Table of Contents

Reservoir sampling

Prev: spectral-techniques Next: markov-chains

Reservoir Sampling

How does one sample k uniform random elements from a datastream where Nk, and:

  1. N is huge.
  2. N is unknown.

Resrvoir sampling works like this:

With a number k and a datastream x1,x2,,xn with a length greater than k:

At any time tk, the reservoir, R, consists of a uniformly random subset of k of the entries x1,,xt. This can be proven if ti, Pr[xiR]=|ki|, and xiR is independent of the contents of the reservoir at times t<i.

Basic Probability Tools:

Theorem 2.1 (Markov’s Inequality): For a real-valued random variable X s.t. X0, for any c>0:

Pr[XcE[X]]|1c|

Basically, if we know the expectation of a distribution of numbers, and that is non-negative, Markov’s inequality can tell us a basic fact about the distribution.

For example, the probability that a student’s GPA is more than twice the average GPA is at most |12|.

Theorem 2.2 (Chebyshev’s Inequality): For a real-valued random variable X, and any c>0:

Pr[|XE[X]]cVar[X]]|1c2|

This is useful for:

Example 2.3: How many people must we poll to estimate the percentage of people who support candidate C, where an accuracy of ±1%, with a probability of at least |34|

In order to estimate the expectation of a 0/1 random variable to error ±ϵ, we need roughly O(|1ϵ2|) samples.

This can be used to prove “central limit” style exponential bounds on tail probabilities, like how flipping fewer than 400 heads in 1000 tosses of a fair coin is miniscule.

Importance Sampling

Prev: spectral-techniques Next: markov-chains