8  Confidence Interval

Before diving into the formulas and theory of Confidence Intervals (CIs), this chapter presents a video designed to help you grasp the concept in a simple and visual way. The video offers a clear picture of why CIs matter and how they operate in data analysis. By understanding the core idea first, readers will be better prepared for the more detailed explanations that follow.

[Video: Confidence Interval]

The video above provides a visual and intuitive introduction to the fundamental ideas of inferential statistics—especially uncertainty, estimation, and significance in data analysis. This overview serves as a systematic guide to understand how each chapter connects before you explore them in detail. The framework and methods align with recent standard references such as [1], [2], and [3], which provide updated discussions on probability distributions, sampling theory, and CI construction and interpretation.

8.1 CI using z-Distribution

When estimating a population mean with a known population standard deviation, or when the sample size is large (typically \(n \ge 30\)), we can use the normal (\(z\)) distribution to construct a Confidence Interval. The \(z\)-distribution (standard normal) has a fixed variance of 1, unlike the \(t\)-distribution, whose variance depends on the degrees of freedom and therefore on the sample size.

Because of this, the \(z\)-distribution is appropriate when the variability of the population is already known or well estimated from a very large dataset.

Confidence Interval for a population mean - σ known

8.1.1 Manual of z-distribution

The analytics team wants to measure the average number of daily clicks on a new application feature. The population standard deviation is treated as known because the historical dataset is very large. In an initial test, a sample of 50 users tried the new feature, and the following summary was obtained:

Summary data:

  • Sample mean:\(\bar{x} = 23.8\ \text{clicks/day}\)
  • Population standard deviation (known):\(\sigma = 4.5\ \text{clicks}\)
  • Sample size:\(n = 50\)

We want to calculate the 95% Confidence Interval for the population mean of daily clicks.

Formula for CI using z-distribution

\[ CI = \bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \]

Sample size

\[ n = 50 \]

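Because \(n \ge 30\), the Central Limit Theorem makes the sampling distribution of \(\bar{x}\) approximately normal, so the z-distribution is a reasonable choice even if the shape of the population distribution is not known.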
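The large-sample argument can be checked informally by simulation. The sketch below is only an illustration under stated assumptions: the skewed exponential population, the seed, and the number of repetitions are hypothetical choices, not part of the click example. It counts how often a 95% z-interval built from \(n = 50\) observations contains the true mean.

```r
# Coverage check by simulation (all settings below are illustrative assumptions)
set.seed(123)                      # arbitrary seed for reproducibility

n      <- 50                       # sample size, as in the example
rate   <- 1 / 4.5                  # hypothetical exponential population with sd = 4.5
mu     <- 1 / rate                 # true mean of that population
sigma  <- 1 / rate                 # true standard deviation, treated as known
z_crit <- qnorm(0.975)             # critical value for a 95% two-sided interval
n_sim  <- 5000                     # number of simulated samples

covered <- replicate(n_sim, {
  x  <- rexp(n, rate = rate)       # draw one skewed sample
  me <- z_crit * sigma / sqrt(n)   # margin of error with known sigma
  abs(mean(x) - mu) <= me          # TRUE if the interval contains mu
})

coverage <- mean(covered)          # share of intervals that contain mu
coverage
```

If the approximation is adequate, `coverage` should land close to 0.95 even though the simulated population is far from normal.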

Sample mean

\[ \begin{array}{rl} \bar{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\[2mm] &= \frac{1}{50} (x_1 + x_2 + \cdots + x_{50}) \\[1mm] &= 23.8 \\[1mm] \end{array} \]

Population standard deviation (known)

\[ \sigma = 4.5 \]

This is the main requirement for using the z-distribution.

Critical value \(z\) for 95% CI

  • Significance level: \(\alpha = 0.05,\quad \alpha/2 = 0.025\)
  • Standard normal distribution table: \(z_{0.025} = 1.96\)
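Instead of reading a table, the critical value can be obtained from the standard normal quantile function. The following short sketch uses base R's `qnorm()`; the three confidence levels are chosen only for illustration.

```r
# Two-sided critical values z_{alpha/2} for several confidence levels
conf_levels <- c(0.90, 0.95, 0.99)   # illustrative confidence levels
alpha       <- 1 - conf_levels       # corresponding significance levels
z_crit      <- qnorm(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)

round(z_crit, 3)                     # 1.645 1.960 2.576
```

The middle value reproduces the \(z_{0.025} = 1.96\) used in this example.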

Standard Error (SE)

\[ \begin{array}{rl} SE & = \frac{\sigma}{\sqrt{n}} \\ & = \frac{4.5}{\sqrt{50}} \\ & = \frac{4.5}{7.071} \\ & \approx 0.6364 \end{array} \]

Margin of Error (ME)

\[ \begin{array}{rl} ME & = z_{\alpha/2} \times SE \\ & = 1.96 \times 0.6364 \\ & \approx 1.247 \\ \end{array} \]

Confidence Interval

\[ \begin{array}{rl} CI_{95\%} & = \bar{x} \pm ME \\ & = 23.8 \pm 1.247 \\ & \approx (22.553,\; 25.047) \\ \end{array} \]

Interpretation (Data Science)

With 95% confidence, the average number of daily clicks from users of the new feature is estimated to lie between:

\[ 22.553 \text{ and } 25.047 \text{ clicks per day} \]

For the same standard deviation and sample size, the confidence interval from the z-distribution is narrower than that from the t-distribution, because \(\sigma\) is known and does not need to be estimated from the sample.
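The width comparison can be made concrete with a small sketch. It assumes, purely for illustration, that a sample standard deviation happened to equal the known value 4.5, so that only the critical value differs between the two intervals.

```r
# Margin of error under z and t for the same spread (illustrative assumption:
# the sample SD is taken to equal the known sigma = 4.5)
n        <- 50
sd_value <- 4.5
alpha    <- 0.05

se   <- sd_value / sqrt(n)                    # standard error, identical in both cases
me_z <- qnorm(1 - alpha / 2) * se             # margin of error with z critical value
me_t <- qt(1 - alpha / 2, df = n - 1) * se    # margin of error with t critical value

round(c(z = me_z, t = me_t), 4)
```

The t-based margin is slightly larger because `qt()` returns a larger critical value than `qnorm()` at the same confidence level. In real data the sample standard deviation varies from sample to sample, so this ordering holds for equal spread, not for every individual sample.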

8.1.2 R Code for z-distribution

# Load libraries
library(knitr)
library(kableExtra)
library(htmltools)

# Data input
xbar <- 23.8                 # sample mean
sigma <- 4.5                 # population standard deviation (known)
n <- 50                      # sample size
alpha <- 0.05                # significance level
z_crit <- qnorm(1 - alpha/2) # Critical z-value for 95% CI
SE <- sigma / sqrt(n)        # Standard Error (SE)
ME <- z_crit * SE            # Margin of Error (ME)
lower_CI <- xbar - ME        # LCI
upper_CI <- xbar + ME        # UCI

# Summary table with formulas (LaTeX)
summary_table <- data.frame(
  Parameter = c("Sample mean (x̄)", 
                "Population SD (σ)", 
                "Sample size (n)", 
                "z critical value", 
                "Standard Error (SE)", 
                "Margin of Error (ME)", 
                "Lower CI", 
                "Upper CI"),
  Value = c(xbar, sigma, n, round(z_crit,4), 
            round(SE,4), round(ME,4), round(lower_CI,3), round(upper_CI,3)),
  Formula = c(
    "$$\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$$",
    "$$\\sigma$$",
    "$$n$$",
    "$$z_{1-\\alpha/2}$$",
    "$$SE = \\frac{\\sigma}{\\sqrt{n}}$$",
    "$$ME = z_{1-\\alpha/2} \\times SE$$",
    "$$\\bar{x} - ME$$",
    "$$\\bar{x} + ME$$"
  ),
  stringsAsFactors = FALSE
)

# Render table in Quarto HTML
kable(summary_table, escape = FALSE, booktabs = TRUE, align = "lcc") %>%
  kable_styling(full_width = FALSE)
Parameter | Value | Formula
Sample mean (x̄) | 23.8000 | $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Population SD (σ) | 4.5000 | $$\sigma$$
Sample size (n) | 50.0000 | $$n$$
z critical value | 1.9600 | $$z_{1-\alpha/2}$$
Standard Error (SE) | 0.6364 | $$SE = \frac{\sigma}{\sqrt{n}}$$
Margin of Error (ME) | 1.2473 | $$ME = z_{1-\alpha/2} \times SE$$
Lower CI | 22.5530 | $$\bar{x} - ME$$
Upper CI | 25.0470 | $$\bar{x} + ME$$

8.2 CI Using t-Distribution

When estimating a population mean from sample data, we often do not know the true population standard deviation. In these situations—especially when the sample size is small—the t-distribution provides a more accurate way to measure uncertainty than the normal (\(z\)) distribution.

The \(t\)-distribution has heavier tails, reflecting the extra variability that comes from estimating the standard deviation directly from the sample.
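The effect of the heavier tails shows up directly in the critical values. The sketch below uses base R's `qt()` and `qnorm()`; the degrees of freedom listed are illustrative choices.

```r
# 95% two-sided critical values: t for several degrees of freedom versus z
df_values <- c(4, 9, 29, 99)              # illustrative degrees of freedom
t_crit    <- qt(0.975, df = df_values)    # t critical values
z_crit    <- qnorm(0.975)                 # z critical value for comparison

data.frame(df     = df_values,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The t critical value is always above the z value and shrinks toward it as the degrees of freedom grow, which is why the two intervals nearly coincide for large samples.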

Confidence Interval for a population mean - t distribution

A Confidence Interval (CI) for the population mean using the \(t\)-distribution is:

\[ \bar{x} \,\pm\, t_{\alpha/2,\, df} \left( \frac{s}{\sqrt{n}} \right) \]

where:

  • \(\bar{x}\) = sample mean
  • \(s\) = sample standard deviation
  • \(n\) = sample size
  • \(df = n - 1\) = degrees of freedom
  • \(t_{\alpha/2,\, df}\) = critical \(t\)-value from the \(t\)-distribution

Using this formula, we create an interval that likely contains the true population mean, while realistically accounting for uncertainty due to limited data.
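As a minimal sketch of the formula, the helper below (the name `t_ci` and the data vector are hypothetical, introduced only for illustration) computes the interval step by step and compares it with the interval reported by base R's `t.test()`.

```r
# Hypothetical helper: t-based confidence interval for a mean
t_ci <- function(x, conf_level = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                            # s / sqrt(n)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # t_{alpha/2, df}
  mean(x) + c(-1, 1) * t_crit * se                     # lower and upper limits
}

x <- c(12.1, 9.8, 11.4, 10.6, 10.9, 12.7, 9.5, 11.0)   # hypothetical measurements

manual  <- t_ci(x)
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual, builtin)   # the two approaches should agree
```

Agreement with `t.test()` is a convenient sanity check before applying the same steps by hand.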

8.2.1 Manual of t-distribution

The product team launched a new recommendation feature and took a small sample of user interactions to measure engagement, specifically the time spent (in minutes) on the feature. We want to estimate the average time spent by all users on this feature with a 95% confidence level.

Sample data (minutes): \(7.2,\; 5.8,\; 6.5,\; 8.0,\; 6.9,\; 7.4,\; 5.5,\; 6.7,\; 7.1,\; 6.3\)

Sample size

\[ n = 10 \]

Sample mean

\[ \begin{array}{rl} \bar{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\[2mm] &= \frac{1}{10} (x_1 + x_2 + \cdots + x_{10}) \\[1mm] &= \frac{1}{10} (7.2+5.8+6.5+8.0+6.9+7.4+5.5+6.7+7.1+6.3) \\[1mm] &= \frac{67.4}{10} \\[1mm] &= 6.74 \ \text{minutes} \end{array} \]

Sample standard deviation

\[ \begin{array}{rl} s & = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \\[2mm] & = \sqrt{\frac{(7.2-6.74)^2 + (5.8-6.74)^2 + \dots + (6.3-6.74)^2}{10-1}} \\[2mm] & = \sqrt{\frac{5.064}{9}} \\[1mm] & \approx 0.750\ \text{minutes} \end{array} \]

Degrees of freedom

\[ df = n - 1 = 9 \]

Critical \(t\) value for 95% CI

Significance level: \[ \alpha = 0.05,\quad \alpha/2 = 0.025 \]

From the t-table (or statistical function): \[ t_{0.025,\,9} \approx 2.262 \]

Standard Error (SE)

\[ \begin{array}{rl} SE & = \frac{s}{\sqrt{n}} \\ & = \frac{0.750}{\sqrt{10}} \\ & \approx 0.2372 \end{array} \]

Margin of Error (ME)

\[ \begin{array}{rl} ME & = t_{\alpha/2,\,df} \times SE \\ & = 2.262 \times 0.2372 \\ & \approx 0.537 \end{array} \]

Confidence Interval

\[ \begin{array}{rl} CI_{95\%} & = \bar{x} \pm ME \\ & = 6.74 \pm 0.537 \\ & \approx (6.203,\; 7.277) \end{array} \]

Interpretation (Data Science)

With 95% confidence, we estimate that the average time spent by users on the recommendation feature is between 6.203 and 7.277 minutes. Even with a small sample, this interval provides a realistic range for the population mean, accounting for uncertainty from estimating the standard deviation.

8.2.2 R Code t-distribution

library(knitr)
library(kableExtra)

# Data
data <- c(7.2,5.8,6.5,8.0,6.9,7.4,5.5,6.7,7.1,6.3)
n <- length(data)
xbar <- mean(data)
s <- sd(data)
df <- n - 1
alpha <- 0.05
t_crit <- qt(1 - alpha/2, df)
SE <- s / sqrt(n)
ME <- t_crit * SE
CI_lower <- xbar - ME
CI_upper <- xbar + ME

# Summary table with formulas
summary_table <- data.frame(
  Parameter = c("Sample size (n)", 
                "Sample mean (x̄)", 
                "Sample SD (s)", 
                "Degrees of freedom (df)", 
                "t critical value", 
                "Standard Error (SE)", 
                "Margin of Error (ME)", 
                "Lower CI", 
                "Upper CI"),
  Value = c(n, round(xbar,3), 
            round(s,3), df, round(t_crit,3),
            round(SE,3), round(ME,3), 
            round(CI_lower,3), round(CI_upper,3)),
  Formula = c(
    "$$n$$",
    "$$\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$$",
    "$$s = \\sqrt{\\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}{n-1}}$$",
    "$$df = n-1$$",
    "$$t_{1-\\alpha/2, df}$$",
    "$$SE = \\frac{s}{\\sqrt{n}}$$",
    "$$ME = t_{1-\\alpha/2} \\times SE$$",
    "$$\\bar{x} - ME$$",
    "$$\\bar{x} + ME$$"
  ),
  stringsAsFactors = FALSE
)

# Render table
kable(summary_table, escape = FALSE, booktabs = TRUE, align = "lcc") %>%
  kable_styling(full_width = FALSE)
Parameter | Value | Formula
Sample size (n) | 10.000 | $$n$$
Sample mean (x̄) | 6.740 | $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Sample SD (s) | 0.750 | $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Degrees of freedom (df) | 9.000 | $$df = n-1$$
t critical value | 2.262 | $$t_{1-\alpha/2, df}$$
Standard Error (SE) | 0.237 | $$SE = \frac{s}{\sqrt{n}}$$
Margin of Error (ME) | 0.537 | $$ME = t_{1-\alpha/2} \times SE$$
Lower CI | 6.203 | $$\bar{x} - ME$$
Upper CI | 7.277 | $$\bar{x} + ME$$

8.3 Determining the Sample Size

Determining the sample size is a crucial step in designing experiments, surveys, and data analyses. The goal is to ensure that the sample is large enough to provide accurate and reliable estimates of population parameters such as the mean \(\mu\) or proportion \(p\). Sample size calculations typically depend on:

  • confidence level
  • acceptable margin of error
  • variability in the data (e.g., standard deviation)
  • whether the population is large or finite

When the population standard deviation \(\sigma\) is known, the minimum required sample size is:

\[ n = \left( \frac{z_{\alpha/2} \cdot \sigma}{E} \right)^2 \]

where:

  • \(z_{\alpha/2}\) = critical value from the standard normal distribution
  • \(\sigma\) = population standard deviation
  • \(E\) = desired margin of error
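The formula can be wrapped in a small function to see how the required sample size reacts to the margin of error. This is a sketch only: the function name `sample_size_mean`, the value \(\sigma = 2\), and the margins of error are hypothetical and chosen for illustration.

```r
# Hypothetical helper: minimum n to estimate a mean with margin of error E
sample_size_mean <- function(sigma, E, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # critical value z_{alpha/2}
  ceiling((z * sigma / E)^2)             # round up to a whole observation
}

sigma    <- 2                            # hypothetical population SD
E_values <- c(0.8, 0.4, 0.2)             # hypothetical margins of error
n_values <- sample_size_mean(sigma, E_values)

data.frame(E = E_values, n = n_values)
```

Because \(E\) enters the formula squared, halving the margin of error roughly quadruples the required sample size.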

Calculating Sample Size (n) to Estimate Population Mean (for Confidence Intervals)

8.3.1 Manual of the Sample Size

A data analytics team wants to estimate the average page loading time in an application. From historical data, the population standard deviation is known to be:

\[ \sigma = 1.8\ \text{seconds} \]

The team wants a margin of error of:

\[ E = 0.3\ \text{seconds} \]

Confidence level:

\[ 95\%, \quad z_{0.025} = 1.96 \]

\[ n = \left( \frac{1.96 \times 1.8}{0.3} \right)^2 \]

\[ \begin{array}{rl} n & = \left( \frac{3.528}{0.3} \right)^2 \\ & = (11.76)^2 \\ & \approx 138.3 \end{array} \]

The sample size must be a whole number, and rounding down would leave the margin of error slightly above \(E\), so we round up:

\[ n = 139\ \text{observations} \]
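The effect of rounding can be checked by computing the margin of error that each candidate sample size actually achieves. The helper name `me_at` below is hypothetical; the inputs are the values from this example.

```r
# Achieved margin of error for n = 138 versus n = 139
sigma <- 1.8    # known population SD (seconds)
E     <- 0.3    # target margin of error (seconds)
z     <- 1.96   # critical value for 95% confidence

me_at <- function(n) z * sigma / sqrt(n)   # margin of error for a given n

round(c(n138 = me_at(138), n139 = me_at(139)), 4)
```

With 138 observations the achieved margin is just above the 0.3-second target, while 139 observations bring it just below, which is why the result is rounded up rather than to the nearest integer.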

8.3.2 R Code (Sample Size for a Mean)

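sigma <- 1.8   # known population standard deviation (seconds)
E <- 0.3       # desired margin of error (seconds)
z <- 1.96      # critical value for 95% confidence

n <- (z * sigma / E)^2   # minimum sample size before rounding
ceiling(n)               # round up to the next whole observation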
[1] 139

8.4 CI for a Proportion

Confidence Interval for a population proportion

8.4.1 Manual of CI Proportion

A data science team wants to estimate the proportion of users who clicked on a new call-to-action (CTA) button during an A/B test. From a sample of users, the team records how many actually clicked the button.

Sample Data

  • Total sample size: \(n = 240\),
  • Number of users who clicked: \(x = 78\)

Sample Proportion

The sample proportion \(\hat{p}\) is:

\[ \hat{p} = \frac{x}{n} = \frac{78}{240} = 0.325 \]

So about 32.5% of sampled users clicked the CTA. Compute a 95% confidence interval for the true population proportion \(p\).

Standard Error (SE)

\[ \begin{array}{rl} SE &= \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \\ & = \sqrt{\frac{0.325(1 - 0.325)}{240}} \\ & = \sqrt{\frac{0.325 \times 0.675}{240}} \\ & \approx 0.03023 \end{array} \]

Critical Value

For a 95% confidence level: \(z_{\alpha/2} = 1.96\)

Margin of Error (ME)

\[ ME = z_{\alpha/2} \times SE = 1.96 \times 0.03023 \approx 0.0593 \]

Confidence Interval

\[ \begin{array}{rl} CI_{95\%} & = \hat{p} \pm ME \\ & = 0.325 \pm 0.0593\\ &\approx (0.266,\; 0.384) \end{array} \]

Interpretation (Data Science)

With 95% confidence, the true proportion of all users who would click the CTA lies between 26.6% and 38.4%. This interval quantifies uncertainty and helps the team decide whether the CTA is performing strongly enough for deployment.
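As a rough cross-check, the normal-approximation interval computed above can be compared with the interval from base R's `prop.test()`. Note the assumption: `prop.test()` with `correct = FALSE` returns a Wilson score interval, which is a different method from the one used in this section, so the limits are expected to be close but not identical.

```r
# Normal-approximation interval versus prop.test() (Wilson score, no correction)
n      <- 240
x      <- 78
p_hat  <- x / n
z_crit <- qnorm(0.975)

# Interval from the formula in this section: p_hat +/- z * SE
wald <- p_hat + c(-1, 1) * z_crit * sqrt(p_hat * (1 - p_hat) / n)

# Interval reported by prop.test(); a different construction, shown for comparison
wilson <- as.numeric(prop.test(x, n, conf.level = 0.95, correct = FALSE)$conf.int)

round(rbind(wald = wald, wilson = wilson), 3)
```

With a sample of this size and a proportion well away from 0 and 1, the two intervals differ only in the third decimal place.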

8.4.2 R Code for CI Proportion

library(knitr)
library(kableExtra)

# Data
n <- 240
x <- 78
p_hat <- x / n
z_crit <- 1.96
SE <- sqrt(p_hat * (1 - p_hat) / n)
ME <- z_crit * SE
CI_lower <- p_hat - ME
CI_upper <- p_hat + ME

# Summary table with formulas
summary_table <- data.frame(
  Parameter = c("Total sample size (n)",
                "Number of successes (x)",
                "Sample proportion (p̂)",
                "Standard Error (SE)",
                "Critical value (z_{α/2})",
                "Margin of Error (ME)",
                "Lower 95% CI",
                "Upper 95% CI"),
  Value = c(n, x, round(p_hat,3), round(SE,4), z_crit, round(ME,4), round(CI_lower,3), round(CI_upper,3)),
  Formula = c(
    "$$n$$",
    "$$x$$",
    "$$\\hat{p} = \\frac{x}{n}$$",
    "$$SE = \\sqrt{\\frac{\\hat{p}(1-\\hat{p})}{n}}$$",
    "$$z_{1-\\alpha/2}$$",
    "$$ME = z_{1-\\alpha/2} \\times SE$$",
    "$$\\hat{p} - ME$$",
    "$$\\hat{p} + ME$$"
  ),
  stringsAsFactors = FALSE
)

# Render table
kable(summary_table, escape = FALSE, booktabs = TRUE, align = "lcc") %>%
  kable_styling(full_width = FALSE)
Parameter | Value | Formula
Total sample size (n) | 240.0000 | $$n$$
Number of successes (x) | 78.0000 | $$x$$
Sample proportion (p̂) | 0.3250 | $$\hat{p} = \frac{x}{n}$$
Standard Error (SE) | 0.0302 | $$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Critical value (z_{α/2}) | 1.9600 | $$z_{1-\alpha/2}$$
Margin of Error (ME) | 0.0593 | $$ME = z_{1-\alpha/2} \times SE$$
Lower 95% CI | 0.2660 | $$\hat{p} - ME$$
Upper 95% CI | 0.3840 | $$\hat{p} + ME$$

8.5 One-Sided CI

One-Sided Confidence Intervals

8.5.1 Manual of One-Sided CI

The product team launched a new recommendation feature and sampled user interactions to measure engagement, specifically the proportion of users who clicked the CTA. We want to estimate the true population proportion with a 95% one-sided confidence interval.

Sample data:

  • Total sample size: \(n = 240\)
  • Number of users who clicked: \(x = 78\)

Sample proportion

\[ \hat{p} = \frac{x}{n} = \frac{78}{240} = 0.325 \]

Standard Error (SE)

\[ \begin{array}{rl} SE & = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \\[1mm] & = \sqrt{\frac{0.325 \times 0.675}{240}} \\[1mm] & \approx 0.03023 \end{array} \]

Critical value for 95% one-sided CI

Significance level: \[ \alpha = 0.05 \]

From z-table (one-sided):

\[ z_{1-\alpha} \approx 1.645 \]

Margin of Error (ME)

\[ \begin{array}{rl} ME & = z_{1-\alpha} \cdot SE \\[1mm] & = 1.645 \cdot 0.03023 \\[1mm] & \approx 0.0497 \end{array} \]

One-Sided Confidence Interval

Upper One-Sided CI:

\[ \begin{array}{rl} CI_{upper} & = \hat{p} + ME \\[1mm] & = 0.325 + 0.0497 \\[1mm] & \approx 0.375 \end{array} \]

Lower One-Sided CI:

\[ \begin{array}{rl} CI_{lower} & = \hat{p} - ME \\[1mm] & = 0.325 - 0.0497 \\[1mm] & \approx 0.275 \end{array} \]

Interpretation (Data Science)

  • Lower one-sided CI: With 95% confidence, at least 27.5% of users would click the CTA.
  • Upper one-sided CI: With 95% confidence, no more than 37.5% of users would click the CTA.

Each bound is a separate 95% statement; reporting both together gives a two-sided interval with only 90% confidence. One-sided estimation quantifies uncertainty in the population proportion and is useful for decision-making when we are only concerned with a minimum or maximum threshold.
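The relationship between one-sided bounds and two-sided intervals can be sketched in a few lines. The variable names below are illustrative; the data are the values from this example.

```r
# One-sided 95% bounds and their link to a 90% two-sided interval
n     <- 240
x     <- 78
alpha <- 0.05

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

z_one       <- qnorm(1 - alpha)     # one-sided critical value
lower_bound <- p_hat - z_one * se   # 95% lower bound: interval [lower_bound, 1]
upper_bound <- p_hat + z_one * se   # 95% upper bound: interval [0, upper_bound]

z_two_90 <- qnorm(1 - 0.10 / 2)     # critical value of a 90% two-sided interval
all.equal(z_one, z_two_90)          # same critical value

round(c(lower = lower_bound, upper = upper_bound), 3)
```

Because the one-sided 95% critical value equals the two-sided 90% critical value, using both bounds at once amounts to a 90% two-sided interval, so in practice only the bound relevant to the decision is reported.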

8.5.2 R Code One-Sided CI

library(knitr)
library(kableExtra)

# Data
n <- 240       # sample size
x <- 78        # number of successes
p_hat <- x/n   # sample proportion
alpha <- 0.05
z_crit <- qnorm(1 - alpha)   # one-sided z critical
SE <- sqrt(p_hat * (1 - p_hat)/n)
ME <- z_crit * SE
CI_lower <- p_hat - ME
CI_upper <- p_hat + ME

# Summary table with formulas
summary_table <- data.frame(
  Parameter = c("Sample size (n)",
                "Number of successes (x)",
                "Sample proportion (p̂)",
                "Significance level (α)",
                "Critical value (z₁₋α)",
                "Standard Error (SE)",
                "Margin of Error (ME)",
                "Lower One-Sided CI",
                "Upper One-Sided CI"),
  Value = c(n, x, round(p_hat,3), alpha, round(z_crit,3), round(SE,4), round(ME,4), round(CI_lower,3), round(CI_upper,3)),
  Formula = c(
    "$$n$$",
    "$$x$$",
    "$$\\hat{p} = \\frac{x}{n}$$",
    "$$\\alpha$$",
    "$$z_{1-\\alpha}$$",
    "$$SE = \\sqrt{\\frac{\\hat{p}(1-\\hat{p})}{n}}$$",
    "$$ME = z_{1-\\alpha} \\times SE$$",
    "$$CI_{lower} = \\hat{p} - ME$$",
    "$$CI_{upper} = \\hat{p} + ME$$"
  ),
  stringsAsFactors = FALSE
)

# Render table
kable(summary_table, escape = FALSE, booktabs = TRUE, align = "lcc") %>%
  kable_styling(full_width = FALSE)
Parameter | Value | Formula
Sample size (n) | 240.0000 | $$n$$
Number of successes (x) | 78.0000 | $$x$$
Sample proportion (p̂) | 0.3250 | $$\hat{p} = \frac{x}{n}$$
Significance level (α) | 0.0500 | $$\alpha$$
Critical value (z₁₋α) | 1.6450 | $$z_{1-\alpha}$$
Standard Error (SE) | 0.0302 | $$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Margin of Error (ME) | 0.0497 | $$ME = z_{1-\alpha} \times SE$$
Lower One-Sided CI | 0.2750 | $$CI_{lower} = \hat{p} - ME$$
Upper One-Sided CI | 0.3750 | $$CI_{upper} = \hat{p} + ME$$

References

[1] Hogg, R. V., Tanis, E. A., and Zimmerman, D. L., Probability and statistical inference, global edition, Pearson Education, London, 2024
[2] Ghahramani, S., Fundamentals of probability, CRC Press, Boca Raton, 2024
[3] Barron, E. N. and Del Greco, J. G., Probability and statistics for STEM: A course in one semester, Springer, Cham, 2024