Visual and intuitive explanation of statistical inference.
Author
Jiho Kim
Published
May 3, 2024
Single Proportion
Confidence Interval
If the population follows a normal distribution or if the sample size n is sufficiently large so that n \hat{p} \geq 10 and n (1 - \hat{p}) \geq 10 for the sample proportion \hat{p}, then a confidence interval for the population proportion p can be calculated as: \hat{p} \pm z^* \sqrt{\dfrac{\hat{p} (1 - \hat{p})}{n}} where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a given significance level \alpha.
Example
Suppose we are given a sample size of n = 400, a sample proportion of \hat{p} = 0.84, and a significance level of \alpha = 0.05 (or a 1 - \alpha = 0.95 confidence level).
Without knowing whether the population follows a normal distribution, we verify that both n \hat{p} = (400)(0.84) = 336 \geq 10 and n (1 - \hat{p}) = (400)(0.16) = 64 \geq 10 are satisfied. Therefore, we can proceed with calculating the confidence interval:
Divide alpha by 2 to determine the area for one tail, then subtract it from 1 to obtain the area up to the right tail. Then find the z-score that corresponds to this area using the qnorm function.
2
Multiplying the critical value by c(-1, 1) is just a lazy yet convenient way to calculate the lower and upper bounds of the confidence interval simultaneously.
(0.8041, 0.8759)
Determination of Sample Size
To determine the sample size n required to estimate the population proportion p with a desired margin of error ME and a significance level \alpha, we can use the following formula: n = \lceil\bigg(\dfrac{z^*}{ME}\bigg)^2 \tilde{p} (1 - \tilde{p})\rceil where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a given significance level \alpha and \tilde{p} is an estimate of the population proportion p (use \tilde{p} = 0.5 if no estimate is available).
Hypothesis Testing
To test the null hypothesis H_0: p = p_0 against the alternative hypothesis H_1: p \neq p_0 (or a one-tailed alternative), we use the following test statistic: z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}} where n is the sample size and \hat{p} is the sample proportion, provided that the population follows a normal distribution or that the sample size n is sufficiently large so that n p_0 \geq 10 and n (1 - p_0) \geq 10.
There are two approaches to hypothesis testing: critical value approach and p-value approach.
Critical Value Approach
Suppose we test the null hypothesis H_0: p = 0.8 against the alternative hypothesis H_1: p \neq 0.8 with a significance level of \alpha = 0.05. Then we can calculate the critical value as follows:
alpha <-0.05z_star <-qnorm(1- (alpha /2))z_star
[1] 1.959964
We collect a sample of size n = 400 with a sample proportion of \hat{p} = 0.84. Without knowing whether the population follows a normal distribution, we verify that both np_0 = (400)(0.8) = 320 \geq 10 and n(1 - p_0) = (400)(0.2) = 80 \geq 10 are satisfied. Therefore, we can proceed with calculating the test statistic:
If the absolute value of the test statistic |z| is greater than the critical value z^*, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.
abs(z) > z_star
[1] TRUE
Therefore, we reject the null hypothesis.
P-Value Approach
Suppose we test the null hypothesis H_0: p = 0.8 against the alternative hypothesis H_1: p \neq 0.8 with a significance level of \alpha = 0.05.
We collect a sample of size n = 400 with a sample proportion of \hat{p} = 0.84. Without knowing whether the population follows a normal distribution, we verify that both np_0 = (400)(0.8) = 320 \geq 10 and n(1 - p_0) = (400)(0.2) = 80 \geq 10 are satisfied. Therefore, we can proceed with calculating the test statistic:
The p-value is the probability of observing a test statistic as extreme as the one calculated from the collected sample, assuming that the null hypothesis is true. The p-value can be calculated as follows:
p_value <-2*pnorm(-abs(z))p_value
[1] 0.04550026
If the p-value is less than \alpha, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.
alpha <-0.05p_value < alpha
[1] TRUE
Therefore, we reject the null hypothesis.
Difference in Proportions
Confidence Interval
If the population follows a normal distribution or if the sample sizes n_1 and n_2 are sufficiently large so that n_1 \hat{p}_1 \geq 10 and n_1 (1 - \hat{p}_1) \geq 10 and that n_2 \hat{p}_2 \geq 10 and n_2 (1 - \hat{p}_2) \geq 10 for the sample proportions \hat{p}_1 and \hat{p}_2, then a confidence interval for the difference in population proportions p_1 - p_2 can be calculated as: (\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\dfrac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}} where z^* is chosen such that P(-z^* < Z < z^*) = 1 - \alpha for a given significance level \alpha.
Example
Suppose we are given sample sizes of n_1 = 400 and n_2 = 600, sample proportions of \hat{p}_1 = 0.84 and \hat{p}_2 = 0.78, and a significance level of \alpha = 0.05 (or a 1 - \alpha = 0.95 confidence level).
Without knowing whether the populations follow a normal distribution, we verify that n_1 \hat{p}_1 = (400)(0.84) = 336 \geq 10, n_1 (1 - \hat{p}_1) = (400)(0.16) = 64 \geq 10, n_2 \hat{p}_2 = (600)(0.78) = 468 \geq 10, and n_2 (1 - \hat{p}_2) = (600)(0.22) = 132 \geq 10 are satisfied. Therefore, we can proceed with calculating the confidence interval:
Divide alpha by 2 to determine the area for one tail, then subtract it from 1 to obtain the area up to the right tail. Then find the z-score that corresponds to this area using the qnorm function.
2
Multiplying the critical value by c(-1, 1) is just a lazy yet convenient way to calculate the lower and upper bounds of the confidence interval simultaneously.