Chapter 1 Basic Concepts and Statistics (Section on Jan 27th)

Population and Sample

Definition 1.1 (Population) A population is the complete collection of all elements that are of interest.
Definition 1.2 (Sample) A sample is a sub-collection of members selected from the population, normally by a random method.
Example 1.1 Suppose a researcher is interested in the weight of the deer on our campus. She records the weights of 10 deer living on campus. Then all deer living on our campus form the population for this study, and the 10 deer she weighed form the sample.

Which one is larger? Population or sample?

Answer: Population

Parameters and Statistics

Definition 1.3 (Parameter) A parameter is a numerical measurement describing some characteristic of the population.
Definition 1.4 (Statistic) A statistic is a numerical measurement describing some characteristic of a sample; it serves as a proxy for the parameter.
Example 1.2 Consider the deer example again. The researcher may be interested in the mean weight, the spread of the weights (usually measured by the standard deviation and the variance), and the proportion of deer with small weight, all for the whole population. In statistics, we denote these four quantities by \(\mu\), \(\sigma\), \(\sigma^2\) and \(p\), respectively. These are all unknown but fixed numbers (from the frequentist point of view). Our goal as statisticians is to ESTIMATE these quantities of interest using the information we collect from samples. In this example we may use the sample mean, sample standard deviation, sample variance and sample proportion of deer with small weight, usually denoted by \(\bar{x}\), \(s\), \(s^2\) and \(\hat{p}\). Once the sample has been collected, these are all known quantities.

How do a parameter and a statistic relate to each other?

Answer: A sample statistic is an estimate (approximation) of the corresponding population parameter.
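The relationship can be illustrated with a small simulation. This is only a sketch: the population below is simulated, and its size (500), mean (60) and spread (8) are made-up values, not data from the deer example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated weights of 500 deer.
# These numbers are invented for illustration only.
population = [random.gauss(60, 8) for _ in range(500)]

# Parameter: computed from the whole population (unknown in practice).
mu = statistics.mean(population)

# Sample: 10 deer chosen at random, without replacement.
sample = random.sample(population, 10)

# Statistic: computed from the sample and used to estimate mu.
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

Changing the seed changes the sample and therefore \(\bar{x}\), while \(\mu\) stays fixed for a given population.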

Variable Types

There are two main categories of variables, each of which can be divided into two sub-groups.

  • Qualitative: Name or a description

    • Nominal: categorical data only. Data cannot be arranged in order. Example: city names (e.g., Modesto, Santa Cruz, Monterey)

    • Ordinal: Data can be arranged in order, but differences are meaningless. Example: agreement (e.g., strongly agree, agree, disagree, strongly disagree)

  • Quantitative: Counts or measurements

    • Interval: like ordinal, but differences are meaningful. There is no natural zero starting point. Example: SAT score (e.g., 120)

    • Ratio: similar to the interval level, but with a natural zero starting point, so ratios of values (one value as a fraction or multiple of another) are meaningful. Example: income in dollars and cents (e.g., $20.23)

Central Tendency

There are three main measures of central tendency: mean, median, and mode.

Of the three measures, which measure is affected by outliers?

Answer: Mean

Spread/Dispersion

There are five measures of spread/dispersion: range, inter-quartile range, variance, standard deviation, and coefficient of variation.

Of the five measures, which measures are affected by outliers?

Answer: Standard Deviation, Variance, Range, Coefficient of Variation

Exercise 1.1 The following data represent the number of pop-up advertisements received by 12 families during the past month.

43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43

(a) What is the mean number of advertisements received during the past month?

\[\begin{equation} \bar{x}=\frac{43+37+30+11+41+41+23+33+63+43+16+43}{12}=\frac{424}{12}\approx 35.33 \end{equation}\]
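The same arithmetic can be checked in a few lines of Python. This is a minimal sketch; the variable name `ads` is our own choice for the 12 observations.

```python
ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

total = sum(ads)   # sum of all observations
n = len(ads)       # number of observations
x_bar = total / n  # sample mean

print(total, n)         # 424 12
print(round(x_bar, 2))  # 35.33
```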

(b) What is the median number of advertisements received during the past month?

First, sort the observations from smallest to largest.

##  [1] 11 16 23 30 33 37 41 41 43 43 43 63

Second, is the number of observations odd or even?

If the number of observations (n) is odd, the median is the observation with rank \(\frac{n+1}{2}\).

If it is even, the median is the average of the observations with ranks \(\frac{n}{2}\) and \(\frac{n}{2}+1\). In this case, 12 is even, so the ranks are \(\frac{12}{2}=6\) and \(\frac{12}{2}+1=7\), which correspond to the values 37 and 41.

\[\begin{equation} \frac{37+41}{2}=39 \end{equation}\]
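The odd/even rule can be written out directly. The sketch below follows the two cases described above; the only adjustment is that Python lists are indexed from 0, so rank \(k\) is position `k - 1`.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

ordered = sorted(ads)
n = len(ordered)

if n % 2 == 1:
    # Odd n: the single middle value, rank (n + 1) / 2.
    median = ordered[(n + 1) // 2 - 1]
else:
    # Even n: average of the values with ranks n/2 and n/2 + 1.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered)  # [11, 16, 23, 30, 33, 37, 41, 41, 43, 43, 43, 63]
print(median)   # 39.0

# Cross-check against the standard library.
print(statistics.median(ads))  # 39.0
```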

(c) What is the mode of the number of advertisements received during the past month?

## sample
## 11 16 23 30 33 37 41 43 63 
##  1  1  1  1  1  1  2  3  1

43 appears 3 times, more often than any other value, so the mode is 43.
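The frequency table can be reproduced with a counter. This is a sketch using Python's `collections.Counter`; it returns a list of modes so that ties, if any, would all be reported.

```python
from collections import Counter

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

counts = Counter(ads)       # value -> number of occurrences
top = max(counts.values())  # highest frequency
modes = sorted(v for v, c in counts.items() if c == top)

print(sorted(counts.items()))
print(modes, top)  # [43] 3
```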

(d) Provide range of the sample.

\[\begin{equation} 63 - 11 = 52 \end{equation}\]

(e) What is the inter-quartile range of the sample?

\[\begin{equation} IQR = Q_3 - Q_1 = \frac{43+43}{2}-\frac{23+30}{2}=43-26.5=16.5 \end{equation}\]
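The quartiles here are the medians of the lower and upper halves of the sorted data. The sketch below implements that convention by hand, because library quantile functions (for example `statistics.quantiles`) use interpolation rules by default that can give different quartiles for the same data.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

ordered = sorted(ads)
n = len(ordered)
half = n // 2

lower = ordered[:half]          # smallest half of the data
upper = ordered[half + n % 2:]  # largest half; skips the middle value when n is odd

q1 = statistics.median(lower)   # first quartile
q3 = statistics.median(upper)   # third quartile
iqr = q3 - q1

print(q1, q3, iqr)  # 26.5 43.0 16.5
```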

(f) What is the variance of the sample?

By definition, the sample variance divides the sum of squared deviations from the mean by \(n-1=11\):

\[\begin{equation} s^2=\frac{(43-35.33)^2+(37-35.33)^2+\cdots+(43-35.33)^2}{11}\approx 196.4242 \end{equation}\]
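As a check, the sum of squared deviations can be computed explicitly. This is a minimal sketch that uses the unrounded mean and the divisor \(n-1\), and then compares the result with `statistics.variance`, which uses the same divisor.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

n = len(ads)
x_bar = sum(ads) / n

# Sum of squared deviations from the sample mean.
ss = sum((x - x_bar) ** 2 for x in ads)

# Sample variance: divide by n - 1, not n.
s2 = ss / (n - 1)

print(round(s2, 4))                        # 196.4242
print(round(statistics.variance(ads), 4))  # 196.4242
```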

(g) What is the coefficient of variation of the sample?

\[\begin{equation} \frac{s}{\bar{x}}\times 100\% = \frac{\sqrt{196.42}}{35.33}\times 100\% \approx 39.67\% \end{equation}\]
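The coefficient of variation follows directly from the sample standard deviation and the sample mean. A short sketch using Python's `statistics` module, with the result rounded to two decimals:

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

s = statistics.stdev(ads)     # sample standard deviation (divisor n - 1)
x_bar = statistics.mean(ads)  # sample mean

cv = s / x_bar * 100          # coefficient of variation, in percent

print(f"{s:.2f}")    # 14.02
print(f"{cv:.2f}%")  # 39.67%
```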

(h) Do you detect any outlier in the sample? Argue mathematically.

Visualization of Data

Method 1: For each sample point \(x_i\), compute the Z-score \(Z_i=\frac{x_i-\bar{x}}{s}\), then flag any point with \(Z_i>2\) or \(Z_i<-2\).

TABLE 1.1:
X Z
43 0.55
37 0.12
30 -0.38
11 -1.74
41 0.40
41 0.40
23 -0.88
33 -0.17
63 1.97
43 0.55
16 -1.38
43 0.55
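The Z-scores in the table can be recomputed as follows. This sketch uses the cutoff of 2 from Method 1; the name `flagged` is our own.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

x_bar = statistics.mean(ads)
s = statistics.stdev(ads)

# Z-score of each observation: distance from the mean in units of s.
z_scores = [(x - x_bar) / s for x in ads]

for x, z in zip(ads, z_scores):
    print(f"{x:>3} {z:6.2f}")

# Observations more than 2 standard deviations from the mean.
flagged = [x for x, z in zip(ads, z_scores) if abs(z) > 2]
print(flagged)  # []
```

The largest Z-score belongs to 63 (about 1.97), which is just inside the cutoff.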

Method 2: Compute the lower fence \(LF = Q_1 - 1.5\times IQR\) and the upper fence \(UF = Q_3 + 1.5\times IQR\), then flag any point with \(x_i < LF\) or \(x_i > UF\).

In this exercise, \(LF=\frac{23+30}{2}-1.5\times 16.5=1.75\) and \(UF=\frac{43+43}{2}+1.5\times 16.5=67.75\).

TABLE 1.2:
LF sample UF
1.75 43 67.75
1.75 37 67.75
1.75 30 67.75
1.75 11 67.75
1.75 41 67.75
1.75 41 67.75
1.75 23 67.75
1.75 33 67.75
1.75 63 67.75
1.75 43 67.75
1.75 16 67.75
1.75 43 67.75

Neither method flags an outlier: every Z-score lies between -2 and 2, and every observation lies between LF and UF.
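The fence check of Method 2 can be sketched in the same way. The quartiles are again taken as the medians of the lower and upper halves of the sorted data, which is the convention used in this exercise; other quartile conventions may give slightly different fences.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

ordered = sorted(ads)
half = len(ordered) // 2

q1 = statistics.median(ordered[:half])                 # median of the lower half
q3 = statistics.median(ordered[len(ordered) - half:])  # median of the upper half
iqr = q3 - q1

lf = q1 - 1.5 * iqr  # lower fence
uf = q3 + 1.5 * iqr  # upper fence

outliers = [x for x in ads if x < lf or x > uf]
print(lf, uf, outliers)  # 1.75 67.75 []
```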