Reservoir sampling

Prev: spectral-techniques Next: markov-chains

Reservoir Sampling

How does one sample uniform random elements from a datastream where , and:

  1. N is huge.
  2. N is unknown.

Resrvoir sampling works like this:

With a number and a datastream with a length greater than :

  • Put the first elements of the stream into a set
  • For :
    • With probability replace a random entry of R with .
  • At the end of the stream, return the reservoir R.

At any time , the reservoir, R, consists of a uniformly random subset of of the entries . This can be proven if , , and is independent of the contents of the reservoir at times .

Basic Probability Tools:

Theorem 2.1 (Markov’s Inequality): For a real-valued random variable X s.t. , for any :

Basically, if we know the expectation of a distribution of numbers, and that is non-negative, Markov’s inequality can tell us a basic fact about the distribution.

For example, the probability that a student’s GPA is more than twice the average GPA is at most .

Theorem 2.2 (Chebyshev’s Inequality): For a real-valued random variable X, and any :

This is useful for:

Example 2.3: How many people must we poll to estimate the percentage of people who support candidate C, where an accuracy of , with a probability of at least

In order to estimate the expectation of a 0/1 random variable to error , we need roughly samples.

This can be used to prove “central limit” style exponential bounds on tail probabilities, like how flipping fewer than 400 heads in 1000 tosses of a fair coin is miniscule.

Importance Sampling

Prev: spectral-techniques Next: markov-chains