STAT 142: Introduction to Computational Statistics
2nd Semester, A.Y. 2024-2025
Introduction
Definition 0.1 Computational statistics is defined as a collection of techniques that have a strong “focus on the exploitation of computing in the creation of new statistical methodology.” - Wegman (1988)
Efron and Tibshirani (1991) refer to what we call computational statistics as computer-intensive statistical methods. They give the following as examples of these types of techniques:
- bootstrap methods,
- nonparametric regression,
- generalized additive models, and
- classification and regression trees.
Gentle (2005) also follows the definition of Wegman (1988), stating that computational statistics is a discipline that includes a class of statistical methods characterized by computational intensity…
Example 1
Recall k-means clustering from your Stat 147 class. The assignment of an observation to a cluster has no closed-form solution.
It is based on a computational algorithm (a code sketch follows the steps below).
1. Select the number of clusters $k$.
2. Pick random locations of the clusters as the centroids.
3. Select a data point.
4. Find its nearest centroid (using some distance formula).
5. Repeat steps 3 and 4 for all data points.
6. Reassign the centroids.
7. Repeat steps 3 to 6 until the centroids no longer move.
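As a sketch of how these steps translate into code (illustrative only; the name my_kmeans is ours, and in practice you would use R's built-in kmeans()):

```r
# A minimal sketch of the k-means steps above (Euclidean distance,
# ignoring edge cases such as empty clusters).
my_kmeans <- function(x, k, max_iter = 100) {
  # Step 2: pick k random data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in 1:max_iter) {
    # Steps 3-5: assign every point to its nearest centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # Step 6: reassign each centroid to the mean of its cluster
    new_centroids <- t(sapply(1:k, function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # Step 7: stop once the centroids no longer move
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centroids = centroids)
}

# Example: cluster the iris measurements into k = 3 groups
set.seed(142)
res <- my_kmeans(as.matrix(iris[, 1:4]), k = 3)
table(res$cluster)
```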
Example 2
Recall from your sampling design class. The variance of the estimator for the population proportion under SRSWOR is given by:
$$\operatorname{Var}(\hat{P}_{\text{SRSWOR}}) = \frac{PQ}{n} \cdot \frac{N - n}{N - 1}$$
If the sample size is $n = 100$, the population size is $N = 1{,}000{,}000$, and the population proportion is $P = 1 - Q = 0.5$, what is the theoretical variance of $\hat{P}$, assuming a finite population?
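As a quick check, we can evaluate the formula directly in R (variable names are ours):

```r
# Plug the given values into the SRSWOR variance formula
n <- 100; N <- 1e6; P <- 0.5; Q <- 1 - P
(P * Q / n) * (N - n) / (N - 1)
```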
## [1] 0.002499752
$$\operatorname{Var}(\hat{P}_{\text{SRSWOR}}) = 0.0024998$$
How did we come up with this?
By knowing theorems and deriving them! Recall the hypergeometric distribution.
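To sketch the derivation: let $X$ be the number of successes in the sample. Under SRSWOR, $X$ follows a hypergeometric distribution and $\hat{P} = X/n$, so with $P = M/N$,

$$\operatorname{Var}(\hat{P}) = \frac{\operatorname{Var}(X)}{n^2} = \frac{1}{n^2} \cdot n\,\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1} = \frac{PQ}{n} \cdot \frac{N - n}{N - 1}$$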
Computer Experiment
What if we don’t know the theory?
We can get a close approximation of the value using computer experiments and simulation!
How do we craft the computer experiment?
Step 1: Creating the population
Consider a population of size $N$ with $M$ individuals having a particular characteristic (successes).
There are $N = 1{,}000{,}000$ individuals, 50% of whom have our characteristic of interest. Let's create a vector with 500,000 1s and 500,000 0s.
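One way to build this population in R (a sketch; the names are ours):

```r
# Step 1: the finite population as a vector of 1s (successes) and 0s
N <- 1e6
P <- 0.5
M <- N * P                          # number of successes (500,000)
pop <- c(rep(1, M), rep(0, N - M))  # 500,000 ones followed by 500,000 zeros
```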
Step 2: Sampling from the population
Again, this is sampling from a finite population, so we use sampling without replacement with sample size $n = 100$.
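A single draw could look like this (assuming the pop vector from Step 1):

```r
# Step 2: one SRSWOR sample of size n from the finite population
n <- 100
samp <- sample(pop, size = n, replace = FALSE)
```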
Step 3: Compute for the sample proportion
The sample proportion $\hat{p}$ is just the mean of the sample vector, since this is a vector of 1s and 0s.
$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
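In R, this is simply the mean of the sampled vector (using samp from Step 2):

```r
# Step 3: the sample proportion
p_hat <- mean(samp)
p_hat
```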
Step 4: Repeat steps 2 and 3 a large number of times
We do this to obtain $B$ estimated proportions.
B <- 100            # number of replications
prop <- numeric(B)  # storage for the B sample proportions
for (i in 1:B) {
  samp <- sample(pop, size = n, replace = FALSE)  # one SRSWOR draw
  prop[i] <- mean(samp)                           # its sample proportion
}
## [1] 0.53 0.51 0.52 0.58 0.46 0.48 0.50 0.51 0.48 0.52 0.52 0.45 0.49 0.54 0.46
## [16] 0.55 0.54 0.50 0.39 0.45 0.38 0.41 0.59 0.49 0.48 0.51 0.38 0.56 0.46 0.50
## [31] 0.51 0.45 0.44 0.51 0.50 0.42 0.45 0.54 0.55 0.49 0.54 0.50 0.55 0.47 0.47
## [46] 0.45 0.48 0.49 0.47 0.54 0.62 0.49 0.50 0.52 0.51 0.47 0.52 0.61 0.50 0.52
## [61] 0.58 0.45 0.49 0.40 0.49 0.57 0.50 0.46 0.38 0.43 0.46 0.51 0.46 0.47 0.47
## [76] 0.55 0.57 0.53 0.48 0.61 0.46 0.46 0.58 0.47 0.44 0.37 0.45 0.49 0.62 0.49
## [91] 0.59 0.47 0.53 0.58 0.44 0.57 0.52 0.63 0.43 0.52
Step 5: Compute for the variance of the estimated proportions
Finally, we can now obtain an approximation of the variance of the sample proportion.
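In R, this is one line (using the prop vector from Step 4):

```r
# Step 5: approximate Var(p-hat) by the variance of the B sample proportions
var(prop)
```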
## [1] 0.003150091
Consolidation to a function
We now consolidate everything to a single function.
var_p <- function(n, N, P, B){
  # setting up the population
  M <- N * P
  pop <- c(rep(1, M), rep(0, N - M))

  # empty vector for the estimated proportions
  prop <- numeric(B)

  # sampling B times (SRSWOR, hence replace = FALSE)
  set.seed(1)
  for (b in 1:B) {
    samp <- sample(pop, size = n, replace = FALSE)
    prop[b] <- mean(samp)
  }

  # computation of the variance
  return(var(prop))
}
Here are some results for different numbers of iterations ($B = 10, 100, 1000$), with the theoretical variance repeated at the end for comparison.
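The calls producing these results would look something like the following (the actual calls are not shown in the notes; parameter values follow the running example):

```r
var_p(n = 100, N = 1e6, P = 0.5, B = 10)
var_p(n = 100, N = 1e6, P = 0.5, B = 100)
var_p(n = 100, N = 1e6, P = 0.5, B = 1000)
```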
## [1] 0.001622222
## [1] 0.002609889
## [1] 0.002635851
## [1] 0.002499752
Visualization
We can also see how the simulated value gets close to the theoretical value as the number of iterations increases.
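A sketch of one way to draw such a plot using the var_p() function above (the grid of B values is illustrative):

```r
# Simulated variance for an increasing number of iterations B
Bs <- c(10, 50, 100, 500, 1000, 5000, 10000)
est <- sapply(Bs, function(B) var_p(n = 100, N = 1e6, P = 0.5, B = B))

plot(Bs, est, type = "b", log = "x",
     xlab = "Number of iterations B",
     ylab = "Simulated variance of the sample proportion")
abline(h = 0.002499752, lty = 2)  # theoretical SRSWOR variance
```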
In the real world, the population proportion $P$ is not known and needs to be estimated using our sample. In Computational Statistics, we have what we call "Monte Carlo simulation," where we create a hypothesized population based on our sample.
The following is an example function that implements a Monte Carlo simulation to approximate the variance of the proportion estimator $\hat{P}$.
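A sketch of what such a function might look like (the name mc_var_p and its interface are ours; it builds a hypothesized population from the observed sample proportion and reuses the resampling scheme above):

```r
# Monte Carlo approximation of Var(p-hat) when P is unknown:
# the population is hypothesized from the sample itself.
mc_var_p <- function(samp, N, B) {
  n <- length(samp)
  p_hat <- mean(samp)                  # estimate of P from the sample
  M <- round(N * p_hat)                # hypothesized number of successes
  pop <- c(rep(1, M), rep(0, N - M))   # hypothesized population
  prop <- numeric(B)
  for (b in 1:B) {
    s <- sample(pop, size = n, replace = FALSE)  # SRSWOR draw
    prop[b] <- mean(s)
  }
  var(prop)                            # Monte Carlo variance estimate
}

# Example usage: a 0/1 sample of size 100
set.seed(1)
samp <- rbinom(100, 1, 0.5)
mc_var_p(samp, N = 1e6, B = 1000)
```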
Why Computational Statistics?
The essence of statistics has not changed through the years.
Statistics is the science of learning from data and of measuring, controlling, and communicating uncertainty.
Historically, mathematics has been the sole bedrock of statistics: we use mathematics to develop statistical tools that let us learn new information from our data, in a way where uncertainty is measured, controlled, and communicated properly.
But as computers became more powerful (faster, cheaper, and more common), statisticians learned how to harness this increasing power to do statistics.
Of course, mathematics remains our primary tool and language for studying uncertainty, and it will probably remain so for many years to come, if not forever.
However, more powerful computing has paved the way for statisticians to advance the way we do statistics in three general areas.