Reservoir sampling

Reservoir Sampling

How does one sample $k$ elements uniformly at random from a datastream of length $N \gg k$, when:

N is huge.
N is unknown.

Reservoir sampling works like this:

Given a number $k$ and a datastream $x_{1}, x_{2}, \dots, x_{N}$ of length greater than $k$:

Put the first $k$ elements of the stream into a reservoir $R$.
For $i \geq k + 1$:
- With probability $k / i$, replace a uniformly random entry of $R$ with $x_{i}$.
At the end of the stream, return the reservoir $R$.
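
The steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, x in enumerate(stream, start=1):  # i is the 1-based position of x
        if i <= k:
            # The first k elements fill the reservoir.
            reservoir.append(x)
        elif rng.random() < k / i:
            # With probability k/i, overwrite a uniformly random slot with x.
            reservoir[rng.randrange(k)] = x
    return reservoir


# The stream is consumed once and its length is never needed up front.
sample = reservoir_sample(range(1_000_000), 5)
```

Only the $k$ reservoir slots are held in memory, which is the point when $N$ is huge or unknown.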

At any time $t \geq k$, the reservoir $R$ consists of a uniformly random subset of $k$ of the entries $x_{1}, \dots, x_{t}$. This can be proven by induction on $t$, using two facts: at step $i > k$, $\Pr[x_{i} \in R] = \frac{k}{i}$, and the event $x_{i} \in R$ is independent of the contents of the reservoir at earlier times.
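
As a sketch of why every element ends up equally likely to be in the reservoir, write $R_{t}$ for the reservoir after $x_{t}$ has been processed (this notation is an assumption added here). An element already in the reservoir is evicted at step $j$ with probability $\frac{k}{j} \cdot \frac{1}{k} = \frac{1}{j}$, so for $i > k$ and $t \geq i$:

$\Pr[x_{i} \in R_{t}] = \frac{k}{i} \cdot \prod_{j=i+1}^{t} \left(1 - \frac{1}{j}\right) = \frac{k}{i} \cdot \prod_{j=i+1}^{t} \frac{j-1}{j} = \frac{k}{i} \cdot \frac{i}{t} = \frac{k}{t}$

For $i \leq k$ the element starts in the reservoir with probability 1, and the same product taken from $j = k + 1$ gives $\frac{k}{t}$ again. This only checks the marginal probabilities; the full claim about uniformly random subsets needs the induction.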

Basic Probability Tools:

Theorem 2.1 (Markov’s Inequality): For a real-valued random variable X s.t. $X \geq 0$ , for any $c > 0$ :

$\Pr[X \geq c \, E[X]] \leq \frac{1}{c}$

Basically, if we know the expectation of a non-negative random variable, Markov’s inequality gives a basic bound on how much of the distribution can lie far above that expectation.

For example, the probability that a student’s GPA is more than twice the average GPA is at most $\frac{1}{2}$.
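
Markov’s inequality can be checked numerically on any non-negative dataset. The sketch below uses synthetic exponential data and a hypothetical helper `markov_check`; neither comes from the notes.

```python
import random


def markov_check(values, c):
    """Compare the empirical Pr[X >= c * E[X]] with the Markov bound 1/c.

    Assumes every value is non-negative and the mean is positive.
    """
    mean = sum(values) / len(values)
    tail = sum(1 for v in values if v >= c * mean) / len(values)
    return tail, 1 / c


rng = random.Random(0)
# Synthetic non-negative data (an assumption for illustration only).
data = [rng.expovariate(1.0) for _ in range(10_000)]

tail, bound = markov_check(data, 2)
print(f"fraction at least twice the mean: {tail:.3f}, Markov bound: {bound:.3f}")
```

The observed tail is usually well below the bound, since Markov’s inequality uses nothing about the distribution beyond its mean and non-negativity.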

Theorem 2.2 (Chebyshev’s Inequality): For a real-valued random variable X, and any $c > 0$ :

$\Pr\left[|X - E[X]| \geq c \sqrt{Var[X]}\right] \leq \frac{1}{c^{2}}$

This is useful for:

Example 2.3: How many people must we poll to estimate the percentage of people who support candidate C to within $\pm 1\%$, with probability at least $\frac{3}{4}$?

In order to estimate the expectation of a 0/1 random variable to error $\pm \epsilon$, we need roughly $O\left(\frac{1}{\epsilon^{2}}\right)$ samples.
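
One way to arrive at a concrete number for the polling example is to apply Chebyshev’s inequality to the sample mean of $n$ independent 0/1 answers. A single 0/1 variable has variance at most $\frac{1}{4}$, so the sample mean has variance at most $\frac{1}{4n}$, and the failure probability is at most $\frac{1}{4 n \epsilon^{2}}$. The helper name `chebyshev_sample_size` is an assumption for this sketch, which also assumes the poll draws independent samples.

```python
import math
from fractions import Fraction


def chebyshev_sample_size(eps: Fraction, fail_prob: Fraction) -> int:
    """Smallest n with 1 / (4 * n * eps^2) <= fail_prob.

    Derived from Chebyshev applied to the mean of n independent 0/1 samples,
    using the worst-case per-sample variance of 1/4.
    """
    return math.ceil(1 / (4 * eps**2 * fail_prob))


# Accuracy of +/- 1%, success probability at least 3/4 (failure at most 1/4).
n = chebyshev_sample_size(Fraction(1, 100), Fraction(1, 4))
print(n)  # 10000
```

Exact fractions are used so that rounding error cannot push the ceiling up by one.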

Sharper tools, such as Chernoff bounds, give “central limit” style exponential bounds on tail probabilities; for example, the probability of flipping fewer than 400 heads in 1000 tosses of a fair coin is minuscule.
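
To get a feel for how small the coin-flip probability is, it can be computed exactly and compared with what Chebyshev’s inequality alone would guarantee. This is a sketch; the variable names are chosen for this example.

```python
from math import comb

n, threshold = 1000, 400

# Exact Pr[fewer than 400 heads]: count outcomes with h < 400 heads
# and divide by the 2^n equally likely toss sequences.
exact = sum(comb(n, h) for h in range(threshold)) / 2**n

# Chebyshev: the number of heads has mean n/2 and variance n/4, so
# Pr[|X - 500| >= 100] <= Var[X] / 100^2.
variance = n / 4
chebyshev = variance / (n / 2 - threshold) ** 2

print(f"exact tail: {exact:.3e}, Chebyshev bound: {chebyshev:.3f}")
```

The Chebyshev bound comes out to 0.025, while the exact tail is many orders of magnitude smaller, which is the gap that exponential tail bounds are designed to close.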

Importance Sampling
