Statistical Inference in a Nutshell

Statistics

Visual and intuitive explanation of statistical inference.

Author

Jiho Kim

Published

May 3, 2024

Single Proportion

Confidence Interval

If the sample size n is sufficiently large so that n \hat{p} \geq 10 and n (1 - \hat{p}) \geq 10 for the sample proportion \hat{p}, then the sampling distribution of \hat{p} is approximately normal, and a confidence interval for the population proportion p can be calculated as: \hat{p} \pm z^* \sqrt{\dfrac{\hat{p} (1 - \hat{p})}{n}} where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a standard normal variable Z and a given significance level \alpha.

Example

Suppose we are given a sample size of n = 400, a sample proportion of \hat{p} = 0.84, and a significance level of \alpha = 0.05 (or a 1 - \alpha = 0.95 confidence level).

Before relying on the normal approximation, we verify that both n \hat{p} = (400)(0.84) = 336 \geq 10 and n (1 - \hat{p}) = (400)(0.16) = 64 \geq 10 are satisfied. Therefore, we can proceed with calculating the confidence interval:

n <- 400
p_hat <- 0.84
alpha <- 0.05

z_star <- qnorm(1 - (alpha / 2))  # (1)
standard_error <- sqrt((p_hat * (1 - p_hat)) / n)
interval <- p_hat + c(-1, 1) * z_star * standard_error  # (2)

cat("(", round(interval[1], 4), ", ", round(interval[2], 4), ")", sep = "")

1: Divide alpha by 2 to get the area in one tail, then subtract it from 1 to get the cumulative area to the left of the upper critical value. The qnorm function returns the z-score that corresponds to this area.
2: Multiplying the critical value by c(-1, 1) is just a lazy yet convenient way to calculate the lower and upper bounds of the confidence interval simultaneously.

(0.8041, 0.8759)
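As a rough sketch, the same calculation can be wrapped in a small function to see how the confidence level changes the width of the interval. The function name prop_ci and its conf_level argument are hypothetical and chosen only for illustration; the 90% and 99% levels are likewise assumptions, not part of the example above.

```r
# Hypothetical helper (not from the original example): Wald interval for one proportion.
prop_ci <- function(p_hat, n, conf_level = 0.95) {
    # Stop early if the large-sample condition from the text is not met.
    stopifnot(n * p_hat >= 10, n * (1 - p_hat) >= 10)

    alpha <- 1 - conf_level
    z_star <- qnorm(1 - (alpha / 2))
    standard_error <- sqrt((p_hat * (1 - p_hat)) / n)

    p_hat + c(-1, 1) * z_star * standard_error
}

# Compare three confidence levels for the same sample.
for (level in c(0.90, 0.95, 0.99)) {
    interval <- prop_ci(0.84, 400, conf_level = level)
    cat(level, ": (", round(interval[1], 4), ", ", round(interval[2], 4), ")\n", sep = "")
}
```

The 0.95 line reproduces the interval above, and the interval widens as the confidence level increases because z^* grows while the standard error stays fixed.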

Determination of Sample Size

To determine the sample size n required to estimate the population proportion p within a desired margin of error ME at a significance level \alpha, we can use the following formula: n = \left\lceil \left(\dfrac{z^*}{ME}\right)^2 \tilde{p} (1 - \tilde{p}) \right\rceil where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a given significance level \alpha and \tilde{p} is an estimate of the population proportion p (use \tilde{p} = 0.5 if no estimate is available, since this maximizes \tilde{p} (1 - \tilde{p}) and therefore gives the most conservative sample size).
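Example

A minimal sketch of the formula in R. The margin of error of 0.03 is an assumed value picked for illustration, and \tilde{p} = 0.5 is used as if no prior estimate were available.

```r
# Illustrative inputs (assumptions, not taken from the examples in this post).
margin_of_error <- 0.03
alpha <- 0.05
p_tilde <- 0.5  # conservative choice when no prior estimate is available

z_star <- qnorm(1 - (alpha / 2))

# Round up, because rounding down would give a margin of error slightly above the target.
n_required <- ceiling((z_star / margin_of_error)^2 * p_tilde * (1 - p_tilde))
n_required
```

With these inputs the unrounded value is roughly 1067.07, so the required sample size is 1068.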

Hypothesis Testing

To test the null hypothesis H_0: p = p_0 against the alternative hypothesis H_1: p \neq p_0 (or a one-tailed alternative), we use the following test statistic: z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}} where n is the sample size and \hat{p} is the sample proportion. Under H_0, this statistic is approximately standard normal provided that the sample size n is sufficiently large so that n p_0 \geq 10 and n (1 - p_0) \geq 10.

There are two approaches to hypothesis testing: the critical value approach and the p-value approach.

Critical Value Approach

Suppose we test the null hypothesis H_0: p = 0.8 against the alternative hypothesis H_1: p \neq 0.8 with a significance level of \alpha = 0.05. Then we can calculate the critical value as follows:

alpha <- 0.05
z_star <- qnorm(1 - (alpha / 2))
z_star

[1] 1.959964

We collect a sample of size n = 400 with a sample proportion of \hat{p} = 0.84. Before relying on the normal approximation, we verify that both n p_0 = (400)(0.8) = 320 \geq 10 and n (1 - p_0) = (400)(0.2) = 80 \geq 10 are satisfied. Therefore, we can proceed with calculating the test statistic:

n <- 400
p_hat <- 0.84
p_0 <- 0.8

z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
z

[1] 2

If the absolute value of the test statistic |z| is greater than the critical value z^*, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

abs(z) > z_star

[1] TRUE

Therefore, we reject the null hypothesis.
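For a one-tailed alternative, the only change is where the rejection region sits. The sketch below assumes the right-tailed alternative H_1: p > 0.8 with the same data; the variable name z_star_right is made up for illustration.

```r
n <- 400
p_hat <- 0.84
p_0 <- 0.8
alpha <- 0.05

z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)

# For H_1: p > p_0, all of alpha sits in the upper tail, so the critical value
# is the (1 - alpha) quantile instead of the (1 - alpha / 2) quantile.
z_star_right <- qnorm(1 - alpha)
z_star_right

# Reject H_0 only when z falls above the one-sided critical value.
z > z_star_right
```

The one-sided critical value is about 1.645, which is smaller than the two-sided 1.96, so the same test statistic z = 2 also leads to rejection here.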

P-Value Approach

Suppose we test the null hypothesis H_0: p = 0.8 against the alternative hypothesis H_1: p \neq 0.8 with a significance level of \alpha = 0.05.

n <- 400
p_hat <- 0.84
p_0 <- 0.8

z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
z

[1] 2

The p-value is the probability of observing a test statistic at least as extreme as the one calculated from the collected sample, assuming that the null hypothesis is true. For the two-tailed alternative, it is the area in both tails beyond |z| and can be calculated as follows:

p_value <- 2 * pnorm(-abs(z))
p_value

[1] 0.04550026

If the p-value is less than \alpha, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

alpha <- 0.05
p_value < alpha

[1] TRUE

Therefore, we reject the null hypothesis.
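As a sanity check, the same test can be run with R's built-in prop.test function. This is only a cross-check sketch: it assumes the sample contained 336 successes (400 times 0.84), and it disables the continuity correction so that the result lines up with the z-test above.

```r
successes <- 336  # assumed count: n * p_hat = 400 * 0.84
n <- 400
p_0 <- 0.8

# correct = FALSE turns off the continuity correction.
result <- prop.test(x = successes, n = n, p = p_0,
                    alternative = "two.sided", correct = FALSE)

# prop.test reports a chi-squared statistic, which equals z^2 for this test.
sqrt(unname(result$statistic))
result$p.value
```

The square root of the chi-squared statistic recovers |z| = 2, and the p-value agrees with the one computed by hand. Note that the confidence interval reported by prop.test for a single proportion is a Wilson score interval, so it will not match the interval formula used earlier.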

Difference in Proportions

Confidence Interval

If the sample sizes n_1 and n_2 are sufficiently large so that n_1 \hat{p}_1 \geq 10, n_1 (1 - \hat{p}_1) \geq 10, n_2 \hat{p}_2 \geq 10, and n_2 (1 - \hat{p}_2) \geq 10 for the sample proportions \hat{p}_1 and \hat{p}_2, then the sampling distribution of \hat{p}_1 - \hat{p}_2 is approximately normal, and a confidence interval for the difference in population proportions p_1 - p_2 can be calculated as: (\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\dfrac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}} where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a standard normal variable Z and a given significance level \alpha.

Example

Suppose we are given sample sizes of n_1 = 400 and n_2 = 600, sample proportions of \hat{p}_1 = 0.84 and \hat{p}_2 = 0.78, and a significance level of \alpha = 0.05 (or a 1 - \alpha = 0.95 confidence level).

Before relying on the normal approximation, we verify that n_1 \hat{p}_1 = (400)(0.84) = 336 \geq 10, n_1 (1 - \hat{p}_1) = (400)(0.16) = 64 \geq 10, n_2 \hat{p}_2 = (600)(0.78) = 468 \geq 10, and n_2 (1 - \hat{p}_2) = (600)(0.22) = 132 \geq 10 are satisfied. Therefore, we can proceed with calculating the confidence interval:

n1 <- 400
n2 <- 600
p1_hat <- 0.84
p2_hat <- 0.78
alpha <- 0.05

z_star <- qnorm(1 - (alpha / 2))  # (1)
standard_error <- sqrt(
    ((p1_hat * (1 - p1_hat)) / n1) + ((p2_hat * (1 - p2_hat)) / n2)
)
interval <- (p1_hat - p2_hat) + c(-1, 1) * z_star * standard_error  # (2)

cat("(", round(interval[1], 4), ", ", round(interval[2], 4), ")", sep = "")

1: Divide alpha by 2 to get the area in one tail, then subtract it from 1 to get the cumulative area to the left of the upper critical value. The qnorm function returns the z-score that corresponds to this area.
2: Multiplying the critical value by c(-1, 1) is just a lazy yet convenient way to calculate the lower and upper bounds of the confidence interval simultaneously.

(0.0111, 0.1089)
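The interval above can be cross-checked with prop.test, which, for two groups and with the continuity correction turned off, computes its confidence interval from the same unpooled standard error. The success counts 336 and 468 are assumptions derived from the stated sample proportions.

```r
successes <- c(336, 468)     # assumed counts: 400 * 0.84 and 600 * 0.78
sample_sizes <- c(400, 600)

# correct = FALSE removes the continuity correction from the interval.
result <- prop.test(x = successes, n = sample_sizes,
                    conf.level = 0.95, correct = FALSE)

# as.numeric drops the conf.level attribute so only the two bounds are printed.
round(as.numeric(result$conf.int), 4)
```

Because the whole interval lies above zero, the data are consistent with p_1 being larger than p_2 at the 95% confidence level.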
