Theoretical Distribution
Simulation Under H₀
Results
Decision
Hypotheses & Computations
◆ THE BLUEPRINT
The One-Proportion Z-Test

Tests whether a population proportion \(\pi\) equals a hypothesized value \(\pi_0\).

Formulas
$$\hat{p} = \frac{k}{n} \qquad SE = \sqrt{\frac{\pi_0(1-\pi_0)}{n}} \qquad Z = \frac{\hat{p} - \pi_0}{SE}$$
Validity Conditions

The normal approximation requires both \(n\pi_0 \geq 10\) and \(n(1-\pi_0) \geq 10\).

Simulation Approach

Under H\(_0\), each simulated sample draws \(k^* \sim \text{Binomial}(n, \pi_0)\). We compute \(Z^*\) for each sample. The simulated p-value is the fraction of \(|Z^*|\) values as extreme or more extreme than the observed \(|Z|\).
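The simulation above can be sketched in a few lines of NumPy. The counts here (62 successes in 100 trials, testing \(\pi_0 = 0.5\)) are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative inputs (not from the text): k successes in n trials,
# testing H0: pi = pi_0.
n, k, pi0 = 100, 62, 0.5

p_hat = k / n
se = np.sqrt(pi0 * (1 - pi0) / n)       # SE uses pi_0, not p_hat, under H0
z_obs = (p_hat - pi0) / se

# Draw k* ~ Binomial(n, pi_0) under H0 and recompute Z* for each sample.
k_star = rng.binomial(n, pi0, size=10_000)
z_star = (k_star / n - pi0) / se

# Two-sided simulated p-value: fraction of |Z*| at least as extreme as |Z|.
p_sim = np.mean(np.abs(z_star) >= np.abs(z_obs))
```

Note that the standard error is computed from \(\pi_0\), not \(\hat{p}\), because the simulation is done under the null hypothesis.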

Evidence Scale

p > 0.10: Weak evidence against H\(_0\)

0.05 < p ≤ 0.10: Moderate evidence

0.01 < p ≤ 0.05: Strong evidence

p ≤ 0.01: Very strong evidence
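The scale above translates directly into a small helper function (the name `evidence_strength` is ours, not from the text):

```python
def evidence_strength(p: float) -> str:
    """Map a p-value to the evidence scale above."""
    if p > 0.10:
        return "weak evidence against H0"
    if p > 0.05:
        return "moderate evidence"
    if p > 0.01:
        return "strong evidence"
    return "very strong evidence"
```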

◆ THE BLUEPRINT
The One-Sample t-Test

Tests whether a population mean \(\mu\) equals a hypothesized value \(\mu_0\) when the population standard deviation is unknown.

Formulas
$$SE = \frac{s}{\sqrt{n}} \qquad t = \frac{\bar{x} - \mu_0}{SE} \qquad df = n - 1$$
Degrees of Freedom

The t-distribution has \(n - 1\) degrees of freedom. As df increases, the t-distribution converges to the standard normal. With small samples, the heavier tails of the t-distribution account for the extra uncertainty from estimating \(\sigma\) with \(s\).

t vs. Z

Use Z when \(\sigma\) is known (rare). Use t when \(\sigma\) is unknown and estimated by \(s\). For large n, the two are nearly identical.

Validity Conditions

The t-test requires the population to be approximately normal, or \(n\) to be large enough for the CLT to apply. Tintle et al. use \(n \geq 20\) as the guideline. The traditional CLT threshold is \(n \geq 30\).

Tintle, N. et al. Introduction to Statistical Investigations (ISI).

Simulation Approach

Under H\(_0\), we generate \(n\) observations from a normal distribution with mean \(\mu_0\) and standard deviation \(s\) for each simulation. We compute the sample mean, sample SD, and \(t^*\) for each. The simulated p-value is the fraction of \(|t^*|\) values as extreme or more extreme than the observed \(|t|\).
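A minimal sketch of this simulation, using illustrative summary numbers (not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative observed summary (not from the text).
n, xbar, s, mu0 = 25, 10.8, 2.0, 10.0

se = s / np.sqrt(n)
t_obs = (xbar - mu0) / se

# Under H0, draw n observations from a normal with mean mu0 and SD s,
# then recompute t* from each simulated sample's own mean and SD.
sims = rng.normal(mu0, s, size=(10_000, n))
t_star = (sims.mean(axis=1) - mu0) / (sims.std(axis=1, ddof=1) / np.sqrt(n))

# Two-sided simulated p-value.
p_sim = np.mean(np.abs(t_star) >= np.abs(t_obs))
```

Each simulated sample re-estimates its own SD, which is exactly what gives the null distribution its t-shaped (heavier-than-normal) tails.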

◆ THE BLUEPRINT
The Two-Proportion Z-Test

Tests whether two population proportions \(\pi_1\) and \(\pi_2\) are equal.

Why Pool Under H₀?

Under H\(_0\): \(\pi_1 = \pi_2\). If they are equal, our best estimate of that common proportion uses all the data from both groups. The pooled proportion \(\hat{p} = \frac{k_1 + k_2}{n_1 + n_2}\) serves as the shared estimate for the standard error calculation.

Formulas
$$\hat{p} = \frac{k_1 + k_2}{n_1 + n_2} \qquad SE = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{SE}$$
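The pooled computation is short enough to show directly; the counts below are illustrative, not from the text:

```python
import numpy as np

# Illustrative counts (not from the text): k1/n1 and k2/n2 successes per group.
k1, n1 = 45, 100
k2, n2 = 30, 100

p_hat = (k1 + k2) / (n1 + n2)                      # pooled proportion under H0
se = np.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
z = (k1 / n1 - k2 / n2) / se
```

The numerator uses the two separate sample proportions, while the standard error uses the pooled \(\hat{p}\), since H\(_0\) asserts a single common proportion.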
Four Validity Conditions

All four of the following must be \(\geq 10\): \(n_1\hat{p}\), \(n_1(1-\hat{p})\), \(n_2\hat{p}\), \(n_2(1-\hat{p})\).

Independence

The two groups must be independent. If subjects are matched or paired, use the paired t-test instead.

◆ THE BLUEPRINT
Welch's Two-Sample t-Test

Tests whether two population means \(\mu_1\) and \(\mu_2\) are equal. Welch's version does not assume equal variances.

Formulas
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
Welch Degrees of Freedom

The Welch approximation for degrees of freedom accounts for unequal variances and unequal sample sizes.

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$
Welch vs. Pooled

The pooled t-test assumes \(\sigma_1 = \sigma_2\) and uses a single pooled variance estimate. Welch's test relaxes this assumption. Welch is the safer default and is what R's t.test() uses.

Validity Conditions

Each group needs approximate normality or a large enough sample. Tintle et al. use \(n \geq 20\) per group. The traditional CLT threshold is \(n \geq 30\) per group.

Tintle, N. et al. Introduction to Statistical Investigations (ISI).

Simulation Approach

Under H\(_0\) (both groups share a common mean), we simulate one sample from a normal distribution with mean 0 and standard deviation \(s_1\) and the other with mean 0 and standard deviation \(s_2\). For each pair of simulated samples, we compute the Welch t-statistic \(t^*\). The simulated p-value is the fraction of \(|t^*|\) values as extreme or more extreme than the observed \(|t|\).
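A sketch of the Welch statistic and its simulated null distribution; the group sizes and SDs are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def welch_t(x, y):
    """Welch t-statistic and Welch-Satterthwaite df for two samples."""
    v1, v2 = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (len(x) - 1) + v2**2 / (len(y) - 1))
    return t, df

# Illustrative group summaries (not from the text).
n1, s1 = 20, 3.0
n2, s2 = 25, 5.0

# Under H0 both groups share a mean (0 here, without loss of generality);
# the spreads keep their observed values s1 and s2.
t_star = np.array([
    welch_t(rng.normal(0, s1, n1), rng.normal(0, s2, n2))[0]
    for _ in range(5_000)
])
```

Simulating with mean 0 for both groups is fine because the t-statistic depends only on the difference in means, which is zero under H\(_0\).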

◆ THE BLUEPRINT
The Paired t-Test

Tests whether the mean difference \(\mu_d\) between paired observations equals zero. This reduces to a one-sample t-test on the differences.

Why Pairing Helps

Pairing removes between-subject variability. Instead of comparing two independent groups (each with its own variability), we analyze only the within-subject differences. This often reduces the standard error and increases power.

Formulas
$$SE = \frac{s_d}{\sqrt{n}} \qquad t = \frac{\bar{d}}{SE} \qquad df = n - 1$$

where \(\bar{d}\) is the mean of the differences and \(s_d\) is the standard deviation of the differences.

When to Use Paired vs. Two-Sample

Use the paired test when observations come in natural pairs: before/after measurements on the same subject, matched subjects, or repeated measures. Use the two-sample test when the groups are independent.

Validity Conditions

The differences need approximate normality or a large enough sample. Tintle et al. use \(n \geq 20\) pairs. The traditional CLT threshold is \(n \geq 30\).

Tintle, N. et al. Introduction to Statistical Investigations (ISI).

Simulation Approach

Under H\(_0\) (\(\mu_d = 0\)), we generate \(n\) differences from a normal distribution with mean 0 and standard deviation \(s_d\) for each simulation. We compute \(\bar{d}^*\), \(s_d^*\), and \(t^*\) for each. The simulated p-value is the fraction of \(|t^*|\) values as extreme or more extreme than the observed \(|t|\).
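Because the paired test is just a one-sample t-test on the differences, the simulation mirrors the one-sample version. A sketch, with illustrative summary numbers (not from the text):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative summary of the paired differences (not from the text).
n, dbar, s_d = 16, 1.5, 2.4

se = s_d / np.sqrt(n)
t_obs = dbar / se

# Under H0 (mu_d = 0), draw n differences from N(0, s_d) per simulation
# and recompute t* from each simulated sample's mean and SD.
sims = rng.normal(0.0, s_d, size=(10_000, n))
t_star = sims.mean(axis=1) / (sims.std(axis=1, ddof=1) / np.sqrt(n))

# Two-sided simulated p-value.
p_sim = np.mean(np.abs(t_star) >= np.abs(t_obs))
```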