Chapter 1 Basic Concept and Statistic (Section on Jan 27th)
Population and Sample
Which one is larger? Population or sample?
Answer: Population
Parameters and Statistics
How do a parameter and a statistic relate to each other?
Answer: A sample statistic is an estimate (approximation) of the corresponding population parameter.
Variable Types
There are two main categories of variables, each of which can be divided into two sub-groups.
Qualitative: Name or a description
Nominal: Categorical data only. Data cannot be arranged in order. Example: County names (e.g., Modesto, Santa Cruz, Monterey)
Ordinal: Data can be arranged in order, but differences between values are meaningless. Example: Agreement (e.g., strongly agree, agree, disagree, strongly disagree)
Quantitative: Counts or measurements
Interval: Like ordinal, but differences are meaningful. There is no natural zero starting point. Example: SAT score (e.g., 120)
Ratio: Like interval, but with a natural zero starting point, which makes ratios meaningful. Example: Income in dollars and cents (e.g., $20.23)
Central Tendency
There are three main measures of central tendency: mean, median, and mode.
Of the three measures, which one is affected by outliers?
Answer: Mean
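One way to see this is to add a single extreme value to a small data set and compare how far each measure moves. The sketch below uses Python's standard `statistics` module; the numbers are made up purely for illustration and do not come from the course material.

```python
import statistics

# Hypothetical data, chosen only for illustration.
base = [10, 12, 12, 13, 15]
with_outlier = base + [100]  # same data plus one extreme value

# How far does each measure of central tendency move?
mean_shift = statistics.mean(with_outlier) - statistics.mean(base)
median_shift = statistics.median(with_outlier) - statistics.median(base)
mode_shift = statistics.mode(with_outlier) - statistics.mode(base)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
print(f"mode shift:   {mode_shift:.2f}")
```

In this made-up example the mean moves by far the most, the median moves only slightly, and the mode does not move at all.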
Spread/Dispersion
There are five measures of spread/dispersion: range, inter-quartile range, variance, standard deviation, and coefficient of variation.
Of the five measures, which ones are affected by outliers?
Answer: Standard Deviation, Variance, Range, Coefficient of Variation
Exercise 1.1 The following data represent the number of pop-up advertisements received by 12 families during the past month.
43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43
(a) What is the mean number of advertisements received during the past month?
\[\begin{equation} \bar{x}=\frac{43+37+30+11+41+41+23+33+63+43+16+43}{12}=\frac{424}{12}\approx 35.33 \end{equation}\]
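As a quick check of the arithmetic, the mean can be recomputed in a few lines of Python. The variable name `ads` is just a label chosen here for the exercise data.

```python
# Exercise 1.1 data: pop-up advertisements received by 12 families.
ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

total = sum(ads)          # sum of all observations
mean = total / len(ads)   # divide by the sample size n = 12

print(total)              # → 424
print(round(mean, 2))     # → 35.33
```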
(b) What is the median number of advertisements received during the past month?
First, sort the observations from smallest to largest.
## [1] 11 16 23 30 33 37 41 41 43 43 43 63
Second, is the number of observations odd or even?
If the number of observations (n) is odd, the median is the observation with rank \(\frac{n+1}{2}\).
If it is even, the median is the average of the observations with ranks \(\frac{n}{2}\) and \(\frac{n}{2}+1\). Here \(n=12\) is even, so the ranks are \(\frac{12}{2}=6\) and \(\frac{12}{2}+1=7\), which correspond to the values 37 and 41.
\[\begin{equation} \frac{37+41}{2}=39 \end{equation}\]
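The odd/even rule above can be sketched as a small Python function. The function name `median_by_rank` and the variable `ads` are names chosen here for illustration; note that rank \(k\) in the text corresponds to zero-based index \(k-1\) in the code.

```python
def median_by_rank(values):
    """Median using the odd/even rank rule described in the text."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: rank (n+1)/2 is zero-based index n//2.
        return ordered[mid]
    # Even n: average the observations with ranks n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]
ads_median = median_by_rank(ads)
print(ads_median)  # → 39.0
```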
(c) What is the mode of the number of advertisements received during the past month?
## sample
## 11 16 23 30 33 37 41 43 63
## 1 1 1 1 1 1 2 3 1
43 appears 3 times, more often than any other value, so the mode is 43.
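The frequency table can be reproduced with `collections.Counter` from the Python standard library. This is only a sketch of the counting step; it collects every value that reaches the highest count, so it would also report ties if a data set had more than one mode.

```python
from collections import Counter

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

counts = Counter(ads)             # value -> number of occurrences
top_count = max(counts.values())  # highest frequency in the sample

# Keep every value that occurs top_count times (handles ties).
modes = sorted(v for v, c in counts.items() if c == top_count)

print(modes, top_count)  # → [43] 3
```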
(d) Provide the range of the sample.
\[\begin{equation} 63 - 11 = 52 \end{equation}\]
(e) What is the inter-quartile range of the sample?
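Take \(Q_1\) as the median of the lower half of the sorted data (11, 16, 23, 30, 33, 37) and \(Q_3\) as the median of the upper half (41, 41, 43, 43, 43, 63). Then
\[\begin{equation} IQR=Q_3-Q_1=\frac{43+43}{2}-\frac{23+30}{2}=43-26.5=16.5 \end{equation}\]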
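The range and inter-quartile range can be checked with the sketch below. It assumes the convention implied by the numbers in this exercise: \(Q_1\) and \(Q_3\) are the medians of the lower and upper halves of the sorted data, and for an odd sample size the middle observation is left out of both halves (that last choice is an assumption made here). Library routines often use other interpolation rules and may return slightly different quartiles.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

ordered = sorted(ads)
n = len(ordered)
half = n // 2

# Split the sorted data into lower and upper halves.
lower = ordered[:half]
upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]

q1 = statistics.median(lower)  # median of the lower half
q3 = statistics.median(upper)  # median of the upper half
iqr = q3 - q1
data_range = ordered[-1] - ordered[0]  # largest minus smallest

print(data_range, q1, q3, iqr)  # → 52 26.5 43.0 16.5
```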
(f) What is the variance of the sample?
By definition, the sample variance divides the sum of squared deviations from the mean by \(n-1=11\):
\[\begin{equation} s^2=\frac{(43-35.33)^2+(37-35.33)^2+\cdots+(43-35.33)^2}{11}\approx 196.4242 \end{equation}\]
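The same calculation can be written out directly in Python and compared with `statistics.variance`, which also uses the \(n-1\) denominator. This is a sketch for checking the arithmetic; it uses the unrounded mean rather than 35.33.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

n = len(ads)
mean = sum(ads) / n

# Sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in ads]
variance = sum(squared_deviations) / (n - 1)

print(round(variance, 4))                  # → 196.4242
print(round(statistics.variance(ads), 4))  # → 196.4242
```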
(g) What is the coefficient of variation of the sample?
\[\begin{equation} \frac{s}{\bar{x}}\times 100\% = \frac{\sqrt{196.42}}{35.33}\times 100\%\approx 39.67\% \end{equation}\]
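A short Python check of the coefficient of variation, using `statistics.stdev` for the sample standard deviation \(s\) and the unrounded mean:

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

s = statistics.stdev(ads)    # sample standard deviation (n - 1 denominator)
mean = statistics.mean(ads)  # sample mean
cv_percent = s / mean * 100  # coefficient of variation, in percent

print(f"s = {s:.3f}, CV = {cv_percent:.1f}%")
```

Because the coefficient of variation is a ratio, it has no units, which makes it convenient for comparing the spread of samples measured on different scales.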
(h) Do you detect any outliers in the sample? Argue mathematically.
Visualization of Data
Method 1: For any sample point \(x_i\), compute \(Z_i=\frac{x_i-\bar{x}}{s}\), then check for samples with \(Z_i>2\) or \(Z_i<-2\).
X | Z |
---|---|
43 | 0.55 |
37 | 0.12 |
30 | -0.38 |
11 | -1.74 |
41 | 0.40 |
41 | 0.40 |
23 | -0.88 |
33 | -0.17 |
63 | 1.97 |
43 | 0.55 |
16 | -1.38 |
43 | 0.55 |
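The Z-score table can be reproduced with the sketch below, which applies the cutoff of 2 used in Method 1 and collects any observation that exceeds it in absolute value.

```python
import statistics

ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

mean = statistics.mean(ads)
s = statistics.stdev(ads)  # sample standard deviation

# Z-score of each observation: distance from the mean in units of s.
z_scores = [(x - mean) / s for x in ads]

# Flag observations with Z > 2 or Z < -2.
flagged = [x for x, z in zip(ads, z_scores) if abs(z) > 2]

for x, z in zip(ads, z_scores):
    print(f"{x:>3} | {z:6.2f}")
print("flagged:", flagged)
```

The largest Z-score in absolute value belongs to 63 (about 1.97), which is just under the cutoff, so the list of flagged observations is empty.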
Method 2: Compute the lower fence \(LF=Q_1-1.5\times IQR\) and the upper fence \(UF=Q_3+1.5\times IQR\), then check for observations with \(x_i < LF\) or \(x_i > UF\).
In this exercise, \(LF=\frac{23+30}{2}-1.5\times 16.5=1.75\) and \(UF=\frac{43+43}{2}+1.5\times 16.5=67.75\).
LF | sample | UF |
---|---|---|
1.75 | 43 | 67.75 |
1.75 | 37 | 67.75 |
1.75 | 30 | 67.75 |
1.75 | 11 | 67.75 |
1.75 | 41 | 67.75 |
1.75 | 41 | 67.75 |
1.75 | 23 | 67.75 |
1.75 | 33 | 67.75 |
1.75 | 63 | 67.75 |
1.75 | 43 | 67.75 |
1.75 | 16 | 67.75 |
1.75 | 43 | 67.75 |
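The fence check in Method 2 can be sketched as follows. The quartile values are taken as given from part (e) rather than recomputed, since different quartile conventions would move the fences.

```python
ads = [43, 37, 30, 11, 41, 41, 23, 33, 63, 43, 16, 43]

# Quartiles from part (e): medians of the lower and upper halves.
q1, q3 = 26.5, 43.0
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # LF
upper_fence = q3 + 1.5 * iqr  # UF

# Flag observations that fall outside the fences.
flagged = [x for x in ads if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, flagged)  # → 1.75 67.75 []
```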
Neither method flags an outlier: every \(Z_i\) lies between -2 and 2, and every observation lies between LF and UF.