Reservoir sampling
Prev: spectral-techniques Next: markov-chains
Reservoir Sampling
How does one sample uniform random elements from a datastream where , and:
- N is huge.
- N is unknown.
Resrvoir sampling works like this:
With a number and a datastream with a length greater than :
- Put the first elements of the stream into a set
- For :
- With probability replace a random entry of R with .
- At the end of the stream, return the reservoir R.
At any time , the reservoir, R, consists of a uniformly random subset of of the entries . This can be proven if , , and is independent of the contents of the reservoir at times .
Basic Probability Tools:
Theorem 2.1 (Markov’s Inequality): For a real-valued random variable X s.t. , for any :
Basically, if we know the expectation of a distribution of numbers, and that is non-negative, Markov’s inequality can tell us a basic fact about the distribution.
For example, the probability that a student’s GPA is more than twice the average GPA is at most .
Theorem 2.2 (Chebyshev’s Inequality): For a real-valued random variable X, and any :
This is useful for:
Example 2.3: How many people must we poll to estimate the percentage of people who support candidate C, where an accuracy of , with a probability of at least
In order to estimate the expectation of a 0/1 random variable to error , we need roughly samples.
This can be used to prove “central limit” style exponential bounds on tail probabilities, like how flipping fewer than 400 heads in 1000 tosses of a fair coin is miniscule.
Importance Sampling
Prev: spectral-techniques Next: markov-chains