Part 2 Introduction

This is an astoundingly short tutorial focused on generating some data and observing the properties of the James-Stein estimator. Of course, any theoretical property can be observed through data generation; this tutorial gives just one example of doing so.

2.1 What Is Data Generation

What is data generation, and why is it worth learning? Data generation is the creation of data by your computer rather than from observations of, or experiments on, nature. It is often also called simulation.

Data generation is useful for testing methods, a task that in the past would have required moderate knowledge of analytical techniques.
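As a minimal sketch of the idea, we can generate data whose true mean we chose ourselves and then check how close an estimate comes to it. The seed, mean, standard deviation, and sample size below are arbitrary values picked for illustration, not values used elsewhere in this tutorial.

```r
# Generate data from a distribution whose parameters we set ourselves.
set.seed(1)        # arbitrary seed, only for reproducibility
true_mean <- 5     # illustrative "truth" (an assumption)
true_sd   <- 2     # illustrative spread (an assumption)
n         <- 1000  # illustrative sample size (an assumption)

y <- rnorm(n = n, mean = true_mean, sd = true_sd)

# Because we know the truth, we can measure how far the estimate is from it.
estimate <- mean(y)
error    <- estimate - true_mean
c(estimate = estimate, error = error)
```

With real data the truth is unknown, so the `error` line has no counterpart. That is the main thing simulated data offers.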

2.1.1 Data Generation For Testing Methods

Data generation for the testing of methods has become an important skill in an age where new and complex estimators and algorithms are being introduced to the world at a rate never before seen. Data generation is carried out to see how well methods work under many conditions. This used to be done with paper and pen, and in many cases that is still possible. One can show how to find the confidence interval of an odds ratio via the delta method. We can also use the valuable techniques taught in mathematical statistics to find confidence intervals for simpler estimates. Also, if a Wikipedia page exists on the method of interest, an analyst can read about its properties.

There are a few problems with the above techniques. The first is that the novice may lack the skill to find an analytical (paper-and-pen) solution. Second, even the master analyst may find it hard to express something like the standard error of an estimate analytically. This may be true of the Minimum Covariance Determinant (MCD), a robust estimate of covariance. Finally, when a new paper claims a method's superiority, how can you know if it is true? What kind of data does it work on? Will your data “break” the estimator? In what important ways is your data different from the data the author tested the method on?

In the computer age of statistical inference, data generation for the purpose of testing methods is like litmus paper. It is my hope that the reader will gain the skills to understand the assumptions of a method through data generation and to break those assumptions by changing the data.
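To make this concrete, here is a rough sketch that uses data generation to check the delta-method confidence interval for an odds ratio mentioned above. It relies on the standard delta-method approximation that the log odds ratio of a 2×2 table with cells \(a, b, c, d\) has standard error \(\sqrt{1/a + 1/b + 1/c + 1/d}\). The event probabilities, group size, and number of replications are assumptions chosen only for illustration.

```r
# How often does the nominal 95% delta-method interval cover the true
# log odds ratio? All settings below are illustrative assumptions.
set.seed(2)
p_exposed   <- 0.40   # event probability in the exposed group
p_unexposed <- 0.25   # event probability in the unexposed group
n_per_group <- 200
n_reps      <- 2000

odds <- function(p) p / (1 - p)
true_log_or <- log(odds(p_exposed) / odds(p_unexposed))

covers <- replicate(n_reps, {
  n11 <- rbinom(1, n_per_group, p_exposed)     # exposed, event
  n12 <- n_per_group - n11                     # exposed, no event
  n21 <- rbinom(1, n_per_group, p_unexposed)   # unexposed, event
  n22 <- n_per_group - n21                     # unexposed, no event

  log_or <- log((n11 * n22) / (n12 * n21))
  se     <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)  # delta method

  lower <- log_or - qnorm(0.975) * se
  upper <- log_or + qnorm(0.975) * se
  lower <= true_log_or && true_log_or <= upper
})

coverage <- mean(covers)  # proportion of intervals containing the truth
coverage
```

If the approximation holds for these settings, `coverage` should land near 0.95. Shrinking `n_per_group` or pushing a probability toward 0 or 1 is one way to look for the point where it stops holding; note that this sketch does not handle empty cells, which become likely in those settings.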

2.2 Who Is This For

This is for the skeptical data scientist. I think that David Robinson said it best:

…no matter how much math and proofs there are that show a method is reliable, I really only feel comfortable with a method once I’ve worked with it on simulated data. It’s also a great way to teach myself about the statistical method.

I believe that all data scientists want to be masters of their craft. No one wants to apply a method to their data only to find out later that the data were outside of the method's reach. This can be remedied with at least one simulation as a sanity check, if not with further testing through replicated simulations.
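A replicated simulation of this kind might look like the sketch below. It is not an example from this tutorial: the choice of the sample mean as the method, the Cauchy distribution as the data that falls outside its reach, and all of the settings are assumptions made for illustration.

```r
# Replicated simulation: how much does an estimate vary from sample to
# sample when the data do, and do not, suit the method?
set.seed(3)
n      <- 100    # observations per simulated data set (assumption)
n_reps <- 2000   # number of simulated data sets (assumption)

# Sample mean on normal data, the setting the mean is well suited to.
means_normal <- replicate(n_reps, mean(rnorm(n)))

# Sample mean on Cauchy data, which has heavy tails and no finite mean.
means_cauchy <- replicate(n_reps, mean(rcauchy(n)))

# Sample median on the same kind of heavy-tailed data, for comparison.
medians_cauchy <- replicate(n_reps, median(rcauchy(n)))

# Interquartile range of each estimator across the replications.
spread <- c(
  mean_normal   = IQR(means_normal),
  mean_cauchy   = IQR(means_cauchy),
  median_cauchy = IQR(medians_cauchy)
)
round(spread, 2)
```

A single data set would not reveal the problem, since one sample mean is just one number. The replications show that the mean of Cauchy data stays widely scattered while the median does not.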

2.3 Prerequisites

2.3.1 R

To download R, go to CRAN, the Comprehensive R Archive Network. Use the cloud mirror, https://cloud.r-project.org.

2.3.2 RStudio

Download and install RStudio, an IDE for R, at
http://www.rstudio.com/download.

2.3.3 Tidyverse Package

You can install the complete tidyverse with:

install.packages("tidyverse")

then load it with:

library(tidyverse)

2.3.4 Notation

Converting formulas to code is sometimes a hurdle. Hopefully this note on notation will make it easy to convert formulas to something you can play with on your machine.

Random variables are capitalized and not bolded.

\[\begin{equation*} X \sim N(\mu,\Sigma) \end{equation*}\]

To indicate that an object is a scalar, I will use the notation \(\alpha \in \mathcal{R}\). To indicate that it is a vector of length \(n\), I will use \(\boldsymbol{\alpha} \in \mathcal{R}^n\), with lower-case and bolded letters. I will indicate that an object is an \(r\times c\) (row by column) matrix using \(\textbf{A}\in \mathcal{R}^{r\times c}\), with the \(\textbf{A}\) capitalized and bolded for matrices.

2.3.4.1 Notation Example

Draws from a normally distributed random variable are coded as follows.

set.seed(42)                         # fix the seed so the draws are reproducible
x <- rnorm(n = 3, mean = 0, sd = 1)  # three draws from N(0, 1)
x

## [1]  1.3709584 -0.5646982  0.3631284

The random variable is distributed as

\[X\sim N(0,1)\]

where we have the following scalar observations.

\[ x_1 = 1.37,\; x_2 = -0.56,\; x_3 = 0.36 \]

The vector of observations has length three.

\[ \mathbf{x}\in \mathcal{R}^{3} \]
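The scalar, vector, and matrix conventions from the notation section map onto R objects roughly as sketched below. The object names and values are hypothetical. The last part assumes that \(\Sigma\) in \(X \sim N(\mu,\Sigma)\) denotes a covariance matrix, and it uses a Cholesky factor as one common way to produce such a draw in base R.

```r
# Scalar: alpha in R
alpha <- 0.5
length(alpha)   # 1

# Vector of length n = 3
a <- c(1, 2, 3)
length(a)       # 3

# Matrix with r = 2 rows and c = 3 columns
A <- matrix(1:6, nrow = 2, ncol = 3)
dim(A)          # 2 3

# One draw from N(mu, Sigma), with illustrative values for mu and Sigma.
set.seed(42)
mu    <- c(0, 1)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), nrow = 2, byrow = TRUE)

U <- chol(Sigma)   # upper-triangular factor, t(U) %*% U equals Sigma
z <- rnorm(2)      # two independent N(0, 1) draws
x_draw <- mu + as.vector(t(U) %*% z)
x_draw
```

Since `z` has identity covariance, `t(U) %*% z` has covariance `t(U) %*% U`, which is `Sigma`; adding `mu` shifts the mean.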