Much of our knowledge has to do with reasoning from specific cases to the general case.
We might design an experiment that samples part of a population, asks each sampled person a question, and then generalizes the answers to the whole population.
For example, a telephone survey might ask 200 people a question in order to learn the opinion of an entire city.
Our interest is in how the input(s) of a process affect the output(s). Input variables consist of:
For any interesting process, there are inputs such that:
\[ \text{variability in input} \to \text{variability in output} \]
If variability in an input factor \(x\) leads to variability in output \(y\), \(x\) is considered a source of variation.
Information on how inputs affect output can be gained from:
Population: Healthy, post-menopausal women in the U.S.
Input variables:
Output variables:
Question: How does estrogen treatment affect health outcomes?
Observational Study:
Experimental Study (Randomized Controlled):
The design was the following:
\[x=\begin{cases}1&(\text{estrogen treatment})\\0&(\text{no estrogen treatment})\end{cases}\]
Women were of different ages and were treated at different clinics. Women were therefore blocked together by age and clinic, and treatments were randomly assigned within each (age × clinic) block. This is called a randomized block design.
Clinic | Age 50-59 | Age 60-69 | Age 70-79 |
---|---|---|---|
clinic 1 | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) |
clinic 2 | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) |
… | … | … | … |
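The within-block randomization described above can be sketched in a few lines of Python. This is only an illustration, not the study's actual procedure: the unit identifiers, the block sizes, the function name `randomize_within_blocks`, and the choice to treat exactly half of each block are all assumptions made for the example.

```python
import random
from collections import defaultdict

def randomize_within_blocks(units, seed=0):
    """Randomly assign treatment x in {0, 1} within each (clinic, age group) block.

    `units` is a list of (unit_id, clinic, age_group) tuples (hypothetical layout).
    Assumption: half of each block (rounded down) receives x = 1.
    """
    rng = random.Random(seed)

    # Group experimental units into blocks defined by clinic and age group.
    blocks = defaultdict(list)
    for unit_id, clinic, age_group in units:
        blocks[(clinic, age_group)].append(unit_id)

    # Shuffle each block separately and treat the first half of the shuffled order.
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for i, unit_id in enumerate(shuffled):
            assignment[unit_id] = 1 if i < half else 0
    return assignment

# Made-up data: 2 clinics x 3 age groups, 4 women per block.
units = [
    (f"c{clinic}-{age}-{k}", clinic, age)
    for clinic in (1, 2)
    for age in ("50-59", "60-69", "70-79")
    for k in range(4)
]
assignment = randomize_within_blocks(units, seed=1)
```

Because the shuffle happens separately inside each block, every clinic and age group contributes the same number of treated and untreated women, so neither clinic nor age can be systematically associated with treatment.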
Women on treatment had a higher incidence rate of:
And a lower incidence rate of:
Question: Why did the two studies have different conclusions?
Consider the following explanation:
\(x\) = estrogen treatment; \(e\) = “health consciousness” (not measured); \(y\) = health outcomes
\[ x \leftarrow e \rightarrow y, \qquad x \rightarrow y \ \text{(correlation)} \]
Association between \(x\) and \(y\) may be due to an unmeasured variable, \(e\).
\[ e \rightarrow y, \qquad \text{randomization} \rightarrow x, \qquad x \rightarrow y \ \text{(correlation = causation)} \]
Randomization breaks the association between \(e\) and \(x\).
Observational studies can suggest good experiments to run, but cannot show causation.
Randomization can eliminate correlation between \(x\) and \(y\) that is due to a different cause \(e\), called a confounder.
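A small simulation can make the role of the confounder concrete. The sketch below is a toy model, not the estrogen data: it assumes \(x\) has no effect on \(y\) at all, that \(e\) raises \(y\) by one unit, and that in the observational setting women with \(e = 1\) choose treatment with probability 0.8 versus 0.2 otherwise. All of these numbers, and the function name `simulate`, are invented for illustration.

```python
import random
from statistics import mean

def simulate(n, randomized, seed=0):
    """Return mean(y | x = 1) - mean(y | x = 0) in a toy model where x has NO effect on y."""
    rng = random.Random(seed)
    y_treated, y_control = [], []
    for _ in range(n):
        # e: unmeasured "health consciousness" (assumed binary, 50/50).
        e = 1 if rng.random() < 0.5 else 0
        if randomized:
            # Experiment: treatment is assigned by a coin flip, independent of e.
            x = 1 if rng.random() < 0.5 else 0
        else:
            # Observational study: health-conscious women are more likely to take treatment.
            x = 1 if rng.random() < (0.8 if e else 0.2) else 0
        # Outcome depends on e and noise only; x does not appear here.
        y = e + rng.gauss(0, 1)
        (y_treated if x else y_control).append(y)
    return mean(y_treated) - mean(y_control)

obs_diff = simulate(20000, randomized=False)
rct_diff = simulate(20000, randomized=True)
print(f"observational difference: {obs_diff:.2f}")
print(f"randomized difference:    {rct_diff:.2f}")
```

In this toy model the observational comparison shows a clear difference between treated and untreated groups even though treatment does nothing, while the randomized comparison is close to zero, because the coin flip makes \(x\) independent of \(e\).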
“No Causation without randomization”
Replication: More repetition means better inference. The more experimental runs you make, the more precise your inference becomes.
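One way to see why replication helps, under the simplifying assumption (not stated in these notes) that each treatment group has \(n\) independent observations with common variance \(\sigma^2\): the difference in group means, \(\bar{y}_1 - \bar{y}_0\), has variance

\[ \operatorname{Var}(\bar{y}_1 - \bar{y}_0) = \frac{\sigma^2}{n} + \frac{\sigma^2}{n} = \frac{2\sigma^2}{n}, \]

so its standard error is \(\sigma\sqrt{2/n}\). Under this assumption, quadrupling the number of replicates per group halves the standard error of the estimated treatment effect.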
Randomization: Random assignment of treatments to experimental units. This removes potential for systematic bias. It also makes confounding by an unobserved variable unlikely, but not impossible.
Blocking: Randomization within blocks of homogeneous experimental units. This ensures that treatments are evenly distributed across large potential sources of variation.