Introduction

Definition 0.1 Computational statistics is defined as a collection of techniques that have a strong “focus on the exploitation of computing in the creation of new statistical methodology.” - Wegman (1988)

Efron and Tibshirani (1991) refer to what we call computational statistics as computer-intensive statistical methods. They give the following as examples of such techniques:

  • bootstrap methods,
  • nonparametric regression,
  • generalized additive models, and
  • classification and regression trees.

Gentle (2005) also follows the definition of Wegman (1988), stating that computational statistics is a discipline that includes a class of statistical methods characterized by computational intensity…

Example 1

Recall k-means clustering from your Stat 147 class. The assignment of an observation to a cluster has no closed-form solution.

It is based on a computational algorithm.

  1. Select the number of clusters k.

  2. Pick random locations for the centroids of the clusters.

  3. Select a data point.

  4. Find its nearest centroid (using some distance formula).

  5. Repeat steps 3 and 4 for all data points.

  6. Reassign the centroids.

  7. Repeat steps 3 to 6 until the centroids do not move anymore.
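The steps above can be sketched in R on toy one-dimensional data. The data, the choice of k, and the iteration cap below are illustrative, not part of the original algorithm description:

```r
# A minimal sketch of the k-means steps above on toy 1-D data
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 5))  # two loose groups
k <- 2                                            # step 1: choose k
centroids <- sample(x, k)                         # step 2: random initial centroids
for (iter in 1:100) {                             # step 7: iterate until convergence
    # steps 3-5: assign every point to its nearest centroid
    cluster <- apply(abs(outer(x, centroids, "-")), 1, which.min)
    # step 6: reassign each centroid to the mean of its cluster
    new_centroids <- sapply(1:k, function(j)
        if (any(cluster == j)) mean(x[cluster == j]) else centroids[j])
    if (all(abs(new_centroids - centroids) < 1e-8)) break  # centroids stopped moving
    centroids <- new_centroids
}
sort(centroids)  # one centroid near 0, the other near 5
```

In practice you would call the built-in `kmeans()` function; the loop above only makes the computational (non-closed-form) nature of the assignment explicit.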

Example 2

Recall from your sampling design class that the variance of the estimator for the population proportion under SRSWOR is given by:

\[
\mathrm{Var}(\hat{P}_{SRSWOR}) = \frac{PQ}{n} \cdot \frac{N-n}{N-1}
\]

If the sample size is n = 100, the population size is N = 1,000,000, and the population proportion is P = 1 - Q = 0.5, what is the theoretical variance of \(\hat{P}\) assuming a finite population?

P <- 0.5
Q <- 1 - P
n <- 100
N <- 1000000

var_phat <- ((P*Q)/n)*(N-n)/(N-1)
var_phat
## [1] 0.002499752

\[
\mathrm{Var}(\hat{P}_{SRSWOR}) = 0.0024998
\]

How did we come up with this?

By knowing theorems and deriving them! Recall the hypergeometric distribution.
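Concretely: under SRSWOR, the number of successes X in the sample follows a hypergeometric distribution, and \(\mathrm{Var}(\hat{P}) = \mathrm{Var}(X)/n^2\). We can check the formula numerically from the hypergeometric pmf (a sketch, reusing the values above):

```r
# Numerical check of the variance formula via the hypergeometric pmf
n <- 100; N <- 1000000; P <- 0.5
M <- N * P                        # number of successes in the population
x <- 0:n                          # possible numbers of successes in the sample
pmf <- dhyper(x, M, N - M, n)     # P(X = x), X ~ Hypergeometric(M, N - M, n)
EX <- sum(x * pmf)                # E(X) = nP
VX <- sum((x - EX)^2 * pmf)       # Var(X) = nPQ(N - n)/(N - 1)
VX / n^2                          # Var(p_hat) = Var(X)/n^2
```

This agrees with the closed-form value 0.0024998 computed above.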

Computer Experiment

What if we don’t know the theory?

We can get a close approximation of the value using computer experiments and simulations!

How do we craft the computer experiment?


Step 1: Creating the population

Consider a population of size N with M individuals having a particular characteristic (successes).

There are N = 1,000,000 individuals, 50% of whom have our characteristic of interest. Let’s create a vector with 500,000 1s and 500,000 0s.

M <- N*P
pop <- c(rep(1,M), rep(0,N - M))
Step 2: Sampling from the population

Again, this is sampling from a finite population, so we use sampling without replacement of size n=100.

samp <- sample(pop, size = n, replace = FALSE)
Step 3: Compute the sample proportion

The sample proportion \(\hat{p}\) is just the mean of the sample vector, since this is a vector of 1s and 0s.

\[
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]

p <- mean(samp)
Step 4: Repeat steps 2 and 3 many times

We will do this to obtain B estimated proportions.

B <- 100
prop <- c()
for(i in 1:B){
    samp <- sample(pop, size = n, replace = FALSE)
    prop[i] <- mean(samp)
}
prop
##   [1] 0.53 0.51 0.52 0.58 0.46 0.48 0.50 0.51 0.48 0.52 0.52 0.45 0.49 0.54 0.46
##  [16] 0.55 0.54 0.50 0.39 0.45 0.38 0.41 0.59 0.49 0.48 0.51 0.38 0.56 0.46 0.50
##  [31] 0.51 0.45 0.44 0.51 0.50 0.42 0.45 0.54 0.55 0.49 0.54 0.50 0.55 0.47 0.47
##  [46] 0.45 0.48 0.49 0.47 0.54 0.62 0.49 0.50 0.52 0.51 0.47 0.52 0.61 0.50 0.52
##  [61] 0.58 0.45 0.49 0.40 0.49 0.57 0.50 0.46 0.38 0.43 0.46 0.51 0.46 0.47 0.47
##  [76] 0.55 0.57 0.53 0.48 0.61 0.46 0.46 0.58 0.47 0.44 0.37 0.45 0.49 0.62 0.49
##  [91] 0.59 0.47 0.53 0.58 0.44 0.57 0.52 0.63 0.43 0.52
Step 5: Compute the variance of the estimated proportions


Finally, we can now obtain an approximation of the variance of the sample proportion.

var(prop)
## [1] 0.003150091
Consolidation to a function


We now consolidate everything into a single function.

var_p <- function(n, N, P, B){
    # setting the population 
    M   <- N*P
    pop <- c(rep(1, M), rep(0, N - M))
    # setting empty vector for the p
    prop <- c()
    # sampling B times
    set.seed(1)
    for(b in 1:B){
        samp <- sample(pop, n, replace = FALSE)  # SRSWOR, as in Step 2
        prop[b] <- mean(samp)
    }
    # computation of the variance 
    return(var(prop))
}

Here are some results for different numbers of iterations, B = 10, 100, 1000:

var_p(100, 1000000, 0.5, 10)
var_p(100, 1000000, 0.5, 100)
var_p(100, 1000000, 0.5, 1000)
var_phat
## [1] 0.001622222
## [1] 0.002609889
## [1] 0.002635851
## [1] 0.002499752
Visualization

We can also see how the simulated value gets closer to the theoretical value as the number of iterations increases.
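A sketch of such a convergence plot, redoing the simulation inline so the chunk is self-contained (the grid of B values and the seed are illustrative):

```r
# Simulated variance of p_hat for increasing B, against the theoretical value
set.seed(1)
n <- 100; N <- 1000000; P <- 0.5
pop <- c(rep(1, N * P), rep(0, N * (1 - P)))
Bs <- c(10, 50, 100, 500, 1000)
vars <- sapply(Bs, function(B)
    var(replicate(B, mean(sample(pop, n)))))  # SRSWOR: replace = FALSE by default
plot(Bs, vars, type = "b", log = "x",
     xlab = "Number of iterations B", ylab = "Simulated variance")
abline(h = (P * (1 - P) / n) * (N - n) / (N - 1), lty = 2)  # theoretical value
```

The dashed horizontal line marks the theoretical variance; the simulated points settle around it as B grows.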

In the real world, the population proportion P is not known and needs to be estimated from our sample. In computational statistics, we have what we call “Monte Carlo simulation”, where we create a hypothesized population based on our sample.

The following is an example function that implements Monte Carlo simulation to approximate the variance of the proportion estimator \(\hat{P}\).

var_mc <- function(n, N, p, iter){
    prop <- c()
    for(b in 1:iter){
        set.seed(b)
        # generate hypothesized population 
        pop  <- rbinom(N, 1, p)
        
        # sample from the population
        samp <- sample(pop, size = n, replace = FALSE)
        
        # compute for the proportion
        prop[b] <- mean(samp)
    }
    return(var(prop))
}

Why Computational Statistics?

The essence of statistics has not changed over the years.

Statistics is the science of learning from data and of measuring, controlling, and communicating uncertainty.

Historically, mathematics has been the bedrock of statistics: we use mathematics to develop statistical tools that help us learn new information from our data, in a way where uncertainty is measured, controlled, and communicated properly.

But as computers became more powerful (faster, cheaper, and more common), statisticians learned how to harness this increasing power to do statistics.

Of course, mathematics remains our primary tool and language for studying uncertainty, and it will probably remain so for many years to come, if not forever.

However, more powerful computing has paved the way for statisticians to advance the way we do statistics in three general areas.


References

Most of the content in this book is adapted from Xavier Javines Bilon. Other references are the following: