Statistical Inference in a Nutshell

Statistics
Visual and intuitive explanation of statistical inference.
Author

Jiho Kim

Published

May 3, 2024

Single Proportion

Confidence Interval

If the sample size n is sufficiently large so that n \hat{p} \geq 10 and n (1 - \hat{p}) \geq 10 for the sample proportion \hat{p}, then the sampling distribution of \hat{p} is approximately normal, and a confidence interval for the population proportion p can be calculated as:

\hat{p} \pm z^* \sqrt{\dfrac{\hat{p} (1 - \hat{p})}{n}}

where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a given significance level \alpha.

Example

Suppose we are given a sample size of n = 400, a sample proportion of \hat{p} = 0.84, and a significance level of \alpha = 0.05 (or a 1 - \alpha = 0.95 confidence level).

Before relying on the normal approximation, we verify that both n \hat{p} = (400)(0.84) = 336 \geq 10 and n (1 - \hat{p}) = (400)(0.16) = 64 \geq 10 are satisfied. Therefore, we can proceed with calculating the confidence interval:

n <- 400
p_hat <- 0.84
alpha <- 0.05

# Divide alpha by 2 to get the area in one tail, then subtract it from 1 to get
# the area to the left of the upper critical value. qnorm returns the z-score
# that corresponds to this area.
z_star <- qnorm(1 - (alpha / 2))
standard_error <- sqrt((p_hat * (1 - p_hat)) / n)
# Multiplying the margin of error by c(-1, 1) is a convenient way to compute
# the lower and upper bounds of the confidence interval simultaneously.
interval <- p_hat + c(-1, 1) * z_star * standard_error

cat("(", round(interval[1], 4), ", ", round(interval[2], 4), ")", sep = "")
(0.8041, 0.8759)
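To build intuition for what the 0.95 confidence level means, the following is a minimal simulation sketch. It assumes, purely for illustration, that the true population proportion is p = 0.84, and the seed and the number of repetitions (reps) are arbitrary choices that do not come from the example above. It draws many samples of size 400, builds the interval for each one, and records how often the interval contains p.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

p <- 0.84      # assumed true population proportion (illustration only)
n <- 400
alpha <- 0.05
reps <- 10000  # hypothetical number of simulated samples

z_star <- qnorm(1 - (alpha / 2))

# Simulate reps sample proportions, each based on n Bernoulli(p) trials.
p_hat <- rbinom(reps, size = n, prob = p) / n
standard_error <- sqrt((p_hat * (1 - p_hat)) / n)

lower <- p_hat - z_star * standard_error
upper <- p_hat + z_star * standard_error

# Fraction of simulated intervals that contain the true proportion.
coverage <- mean(lower <= p & p <= upper)
coverage
```

The resulting fraction should land close to 1 - \alpha, although it varies from run to run and the normal approximation makes the match inexact.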

Determination of Sample Size

To determine the sample size n required to estimate the population proportion p with a desired margin of error ME at a confidence level of 1 - \alpha, we can use the following formula:

n = \left\lceil \left(\dfrac{z^*}{ME}\right)^2 \tilde{p} (1 - \tilde{p}) \right\rceil

where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha and \tilde{p} is an estimate of the population proportion p. If no estimate is available, use \tilde{p} = 0.5, which maximizes \tilde{p} (1 - \tilde{p}) and therefore gives the most conservative (largest) sample size.
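As a minimal sketch of the sample size calculation, suppose we want a margin of error of 0.03 at \alpha = 0.05 and have no prior estimate of p. The target margin of error and the variable names are assumptions made for illustration.

```r
alpha <- 0.05
margin_of_error <- 0.03  # hypothetical target margin of error
p_tilde <- 0.5           # conservative choice when no estimate is available

z_star <- qnorm(1 - (alpha / 2))

# Round up, because a fractional observation cannot be collected and
# rounding down would exceed the desired margin of error.
n_required <- ceiling((z_star / margin_of_error)^2 * p_tilde * (1 - p_tilde))
n_required
```

With these inputs the formula gives n = 1068. Replacing p_tilde with a prior estimate such as 0.84 lowers the requirement to 574, since \tilde{p} (1 - \tilde{p}) is largest at 0.5.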

Hypothesis Testing

To test the null hypothesis H_0: p = p_0 against the alternative hypothesis H_1: p \neq p_0 (or a one-tailed alternative), we use the following test statistic:

z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}

where n is the sample size and \hat{p} is the sample proportion, provided that the sample size n is sufficiently large so that n p_0 \geq 10 and n (1 - p_0) \geq 10. Under H_0, z then approximately follows the standard normal distribution.

There are two approaches to hypothesis testing: the critical value approach and the p-value approach.

Critical Value Approach

Suppose we test the null hypothesis H_0: p = 0.8 against the alternative hypothesis H_1: p \neq 0.8 with a significance level of \alpha = 0.05. Then we can calculate the critical value as follows:

alpha <- 0.05
z_star <- qnorm(1 - (alpha / 2))
z_star
[1] 1.959964

We collect a sample of size n = 400 with a sample proportion of \hat{p} = 0.84. Before relying on the normal approximation, we verify that both np_0 = (400)(0.8) = 320 \geq 10 and n(1 - p_0) = (400)(0.2) = 80 \geq 10 are satisfied. Therefore, we can proceed with calculating the test statistic:

n <- 400
p_hat <- 0.84
p_0 <- 0.8

z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
z
[1] 2

If the absolute value of the test statistic |z| is greater than the critical value z^*, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

abs(z) > z_star
[1] TRUE

Therefore, we reject the null hypothesis.
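The same test statistic can be used with a one-tailed alternative. The following is a minimal sketch that assumes the right-tailed alternative H_1: p > 0.8, chosen here only for illustration. The whole of \alpha is placed in the upper tail, so the critical value changes while z stays the same.

```r
alpha <- 0.05
n <- 400
p_hat <- 0.84
p_0 <- 0.8

# Right-tailed test: the entire rejection region is in the upper tail,
# so alpha is not divided by 2.
z_star_upper <- qnorm(1 - alpha)

z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)

# Reject H_0 when z exceeds the one-tailed critical value.
reject <- z > z_star_upper
reject
```

Here the critical value is about 1.645 instead of 1.96, and z = 2 still falls in the rejection region. For a left-tailed alternative, the comparison would instead be z < qnorm(alpha).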

P-Value Approach

Suppose we test the null hypothesis H_0: p = 0.8 against the alternative hypothesis H_1: p \neq 0.8 with a significance level of \alpha = 0.05.

We collect a sample of size n = 400 with a sample proportion of \hat{p} = 0.84. Before relying on the normal approximation, we verify that both np_0 = (400)(0.8) = 320 \geq 10 and n(1 - p_0) = (400)(0.2) = 80 \geq 10 are satisfied. Therefore, we can proceed with calculating the test statistic:

n <- 400
p_hat <- 0.84
p_0 <- 0.8

z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
z
[1] 2

The p-value is the probability of observing a test statistic at least as extreme as the one calculated from the collected sample, assuming that the null hypothesis is true. For this two-tailed test, it is twice the area in the tail beyond |z|:

p_value <- 2 * pnorm(-abs(z))
p_value
[1] 0.04550026

If the p-value is less than \alpha, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

alpha <- 0.05
p_value < alpha
[1] TRUE

Therefore, we reject the null hypothesis.
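As a cross-check of the hand calculation, R's built-in prop.test function can run the same test. The sketch below assumes the sample contained 336 successes (the count implied by n = 400 and \hat{p} = 0.84) and turns off the continuity correction so that the result is comparable. In that case the reported X-squared statistic equals z^2, and its p-value matches the two-tailed p-value.

```r
n <- 400
successes <- 336  # assumed count implied by n = 400 and p_hat = 0.84
p_0 <- 0.8

# correct = FALSE disables the Yates continuity correction, which the
# hand-computed z-test does not use.
result <- prop.test(x = successes, n = n, p = p_0,
                    alternative = "two.sided", correct = FALSE)

z_squared <- unname(result$statistic)  # chi-square statistic, equal to z^2
p_value_check <- result$p.value

z_squared
p_value_check
```

Note that the confidence interval printed by prop.test for a single proportion is a score interval, so it will not coincide exactly with the interval computed in the Confidence Interval section.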

Difference in Proportions

Confidence Interval

If the sample sizes n_1 and n_2 are sufficiently large so that n_1 \hat{p}_1 \geq 10, n_1 (1 - \hat{p}_1) \geq 10, n_2 \hat{p}_2 \geq 10, and n_2 (1 - \hat{p}_2) \geq 10 for the sample proportions \hat{p}_1 and \hat{p}_2, then a confidence interval for the difference in population proportions p_1 - p_2 can be calculated as:

(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\dfrac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}

where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a given significance level \alpha.

Example

Suppose we are given sample sizes of n_1 = 400 and n_2 = 600, sample proportions of \hat{p}_1 = 0.84 and \hat{p}_2 = 0.78, and a significance level of \alpha = 0.05 (or a 1 - \alpha = 0.95 confidence level).

Before relying on the normal approximation, we verify that n_1 \hat{p}_1 = (400)(0.84) = 336 \geq 10, n_1 (1 - \hat{p}_1) = (400)(0.16) = 64 \geq 10, n_2 \hat{p}_2 = (600)(0.78) = 468 \geq 10, and n_2 (1 - \hat{p}_2) = (600)(0.22) = 132 \geq 10 are satisfied. Therefore, we can proceed with calculating the confidence interval:

n1 <- 400
n2 <- 600
p1_hat <- 0.84
p2_hat <- 0.78
alpha <- 0.05

# Divide alpha by 2 to get the area in one tail, then subtract it from 1 to get
# the area to the left of the upper critical value. qnorm returns the z-score
# that corresponds to this area.
z_star <- qnorm(1 - (alpha / 2))
standard_error <- sqrt(
    ((p1_hat * (1 - p1_hat)) / n1) + ((p2_hat * (1 - p2_hat)) / n2)
)
# Multiplying the margin of error by c(-1, 1) is a convenient way to compute
# the lower and upper bounds of the confidence interval simultaneously.
interval <- (p1_hat - p2_hat) + c(-1, 1) * z_star * standard_error

cat("(", round(interval[1], 4), ", ", round(interval[2], 4), ")", sep = "")
(0.0111, 0.1089)
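The interval can also be checked against R's built-in prop.test function. This is a minimal sketch that assumes success counts of 336 and 468 (the values of n_1 \hat{p}_1 and n_2 \hat{p}_2 verified above) and disables the continuity correction. For two samples, prop.test then reports a confidence interval for p_1 - p_2 based on the same standard error.

```r
successes <- c(336, 468)     # counts implied by n1 * p1_hat and n2 * p2_hat
sample_sizes <- c(400, 600)

# correct = FALSE disables the continuity correction so that the interval
# matches the formula used in this section.
result <- prop.test(x = successes, n = sample_sizes,
                    conf.level = 0.95, correct = FALSE)

interval_check <- as.numeric(result$conf.int)
round(interval_check, 4)
```

Rounded to four decimal places, the bounds agree with the hand-computed interval. Because the interval lies entirely above 0, the data are consistent with p_1 being larger than p_2 at this confidence level.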

Chi-Square Goodness of Fit

Chi-Square Test for Association

Single Mean

Difference in Means

Paired Difference in Means

Analysis of Variance (ANOVA)

Correlation, Simple Regression
