Practice 2 Summary Statistics with R
2.1 Directions
In this practice exercise, you will load data into R and summarize that data. You will use the following R functions: read.table, names, head, summary, attach, table, mean, median, var, sd, min, max, range, quantile, cor. Watch the VoiceThread below. Please add comments if you have questions. Once you have watched the VoiceThread, read the instructions and use R to answer the questions below.
2.2 A closer look at the code
The data that will be used in this practice exercise can be found here: http://tiny.cc/econ226/data/LungCapData.txt. The following code can be used to load the data.
LungCapData <- read.table(file="http://tiny.cc/econ226/data/LungCapData.txt", header = TRUE, sep = "\t")
The <- symbol is the assignment operator: it stores the value on its right “into” the variable on its left, so here we are storing the data into LungCapData.
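To see what header and sep do without downloading anything, here is a minimal sketch that reads a tiny tab-separated sample from a string. The three rows and the names sample_text and toy are made up for illustration; only the read.table arguments match the code above.

```r
# Hypothetical three-row sample in the same tab-separated layout.
# "\t" is a tab and "\n" is a line break.
sample_text <- "LungCap\tAge\tSmoke\n6.475\t6\tno\n10.125\t18\tyes\n9.550\t16\tno"

# header = TRUE: the first line holds the variable names
# sep = "\t":    columns are separated by tabs
toy <- read.table(text = sample_text, header = TRUE, sep = "\t")

names(toy)  # variable names taken from the first line
nrow(toy)   # number of data rows (the header line is not counted)
```

With header = FALSE, the first line would be read as a row of data instead of being used for the variable names.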
It is always good practice to make sure the data loaded the way you expect. In R, this can be done with the commands names and head.
names(LungCapData)
## [1] "LungCap" "Age" "Height" "Smoke" "Gender" "Caesarean"
head(LungCapData)
## LungCap Age Height Smoke Gender Caesarean
## 1 6.475 6 62.1 no male no
## 2 10.125 18 74.7 yes female no
## 3 9.550 16 69.7 no female yes
## 4 11.125 14 71.0 no male no
## 5 4.800 5 56.9 no male no
## 6 6.225 11 58.7 no female no
names prints the names of all variables in the data set. A data set like LungCapData is called a data frame in R; head prints the first few rows of a data frame. We can use either attach or $ to access an individual variable within a data frame. For example, we can summarize LungCap, a variable in LungCapData, with either
summary(LungCapData$LungCap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.507 6.150 8.000 7.863 9.800 14.675
or you could use the attach command
attach(LungCapData)
summary(LungCap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.507 6.150 8.000 7.863 9.800 14.675
Once attach(LungCapData) is run, we no longer need to tell R where to look for the variable LungCap.
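The two ways of reaching a variable can be compared on a small example. This is a sketch using a made-up data frame called toy (not the course data); it also shows detach, which undoes attach when you are finished.

```r
# Hypothetical data frame with two of the same variable names
toy <- data.frame(LungCap = c(6.475, 10.125, 9.550),
                  Age     = c(6, 18, 16))

# Option 1: name the data frame every time with $
via_dollar <- mean(toy$LungCap)

# Option 2: attach the data frame, then use the variable name alone
attach(toy)
via_attach <- mean(LungCap)
detach(toy)  # remove toy from the search path again

via_dollar == via_attach  # both approaches read the same column
```

Detaching matters when you work with several data frames that share variable names, because an attached data frame stays on the search path until it is removed.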
The table command counts how often each value of a categorical variable occurs. For example, this data set contains a variable, Smoke, that takes the value “no” for non-smokers and “yes” for smokers. The command
table(Smoke)
## Smoke
## no yes
## 648 77
creates a frequency table of smokers, and
table(Smoke)/length(Smoke)
## Smoke
## no yes
## 0.8937931 0.1062069
creates a proportion table. Note that length returns the number of observations in Smoke.
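Base R also provides prop.table, which turns a table of counts into proportions directly. The sketch below uses a made-up vector called smoke to check that both approaches agree; prop.table is not used in the VoiceThread, so treat it as an optional alternative.

```r
# Hypothetical smoking status for eight people
smoke <- c("no", "no", "yes", "no", "no", "yes", "no", "no")

counts   <- table(smoke)            # 6 "no" and 2 "yes"
by_hand  <- counts / length(smoke)  # divide each count by the number of observations
built_in <- prop.table(counts)      # divide each count by the table total

by_hand
built_in
```

The two results match here because the vector has no missing values. By default table leaves out NA while length counts it, so with missing data the by-hand version would no longer sum to 1.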
A cross tabulation between two variables can be calculated by creating a two-way frequency table. A two-way table that cross-tabulates the number of smokers by gender can be created as follows
table(Smoke,Gender)
## Gender
## Smoke female male
## no 314 334
## yes 44 33
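A two-way table can be indexed by row and column label to pull out a single count. The sketch below uses two short made-up vectors, smoke and gender, rather than the course data; addmargins is a base R function that appends row and column totals.

```r
# Hypothetical data for six people
smoke  <- c("no", "yes", "no", "no", "yes", "no")
gender <- c("male", "female", "female", "male", "male", "female")

tab <- table(smoke, gender)  # rows are smoke, columns are gender

female_smokers <- tab["yes", "female"]  # one cell: row "yes", column "female"
female_smokers

addmargins(tab)  # same table with a Sum row and a Sum column
```

The first argument to table becomes the rows and the second becomes the columns, which is why the index is written as row label first, then column label.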
The rest of the commands are straightforward to use. Here is a complete listing of the code, with comments, used in the VoiceThread.
2.3 R code used in the VoiceThread
# Load data
LungCapData <- read.table(file="http://tiny.cc/econ226/data/LungCapData.txt", header = TRUE, sep = "\t")
# First look at the data
names(LungCapData) # print the names of all variables in the data frame
head(LungCapData) # print the first few rows of the data frame
# Summarize a variable
summary(LungCapData$LungCap) # access a variable within a data frame by using the $
attach(LungCapData) # or use the attach function to avoid retyping the name of the data frame
summary(LungCap)
# Frequency table
table(Smoke) # create a table with the number of smokers and non-smokers
table(LungCapData$Smoke) # this does the same thing without the attach command
# Proportion
table(Smoke)/length(Smoke) # divides each count by the total number of observations to give proportions
table(Smoke)/length(Smoke) * 100 # multiply by 100 to get a percent
# Two-way frequency table
table(Smoke,Gender)
# Mean
mean(LungCap) # Average of the numeric variable
mean(LungCap, trim = 0.10) # trimmed mean, i.e. drops the largest 10% and smallest 10% before averaging
# Median
median(LungCap)
# Variance and Standard Deviation
var(LungCap)
sd(LungCap)
# Min, max and range
min(LungCap) # lowest lung capacity
max(LungCap) # greatest lung capacity
range(LungCap) # returns the min and max lung capacity; use diff(range(LungCap)) for max minus min
# Percentile
quantile(LungCap, probs = 0.90)
quantile(LungCap, probs = c(0.20,0.50,0.90,1.00))
# Correlation
cor(LungCap,Age)
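A few of these functions are easier to understand on numbers small enough to check by hand. The following is a sketch on a made-up vector x (not the lung capacity data) with one extreme value.

```r
# Hypothetical measurements with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean   <- mean(x)               # 14.5, pulled up by the value 100
trimmed_mean <- mean(x, trim = 0.10)  # drops 1 value from each end, averages 2 to 9: 5.5

limits <- range(x)      # a vector of two numbers: the minimum, then the maximum
spread <- diff(limits)  # maximum minus minimum: 99

# The standard deviation is the square root of the variance
sd_matches <- isTRUE(all.equal(sd(x), sqrt(var(x))))
```

Note that range returns both endpoints rather than their difference, so diff is needed to get a single number for the spread.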
2.4 Now you try
Use R and the mtcars data set to answer the following questions. This is just for practice; you do not need to turn anything in.
- How many cars in this sample have 6 cylinders? (cyl is the number of cylinders)
- What percentage of the cars in this sample are 4 cylinders?
- How many cars in the sample have an automatic transmission and a V-shaped engine? ‘am’ is transmission type (0 = automatic, 1 = manual) and ‘vs’ is engine shape (0 = V-shaped, 1 = straight)
- What is the average miles per gallon (mpg) for all cars in the sample?
- What is the correlation between mpg and horsepower (hp)?
Example
data("mtcars")
attach(mtcars)
# Question 1: Store your answer in `a`
a <- table(cyl)
a
# Question 2: Store your answer in `b`
b <- table(cyl)/length(cyl) * 100
b
# Question 3: Store your answer in `c`
c <- table(am,vs)
c
# Question 4: Store your answer in `d`
d <- mean(mpg)
d
# Question 5: Store your answer in `e`
e <- cor(mpg,hp)
e
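The tables above print every category, so you still have to read the answer off the output. As one possible alternative, the sketch below uses with instead of attach and indexes the tables by label to get single numbers. The variable names cyl_counts, six_cyl, pct_four and auto_v are made up for this illustration.

```r
data("mtcars")

# with() evaluates an expression using the columns of mtcars,
# so attach() is not needed
cyl_counts <- with(mtcars, table(cyl))

six_cyl  <- cyl_counts[["6"]]                       # number of 6-cylinder cars
pct_four <- cyl_counts[["4"]] / nrow(mtcars) * 100  # percent of cars with 4 cylinders

# Row "0" is am = 0 (automatic), column "0" is vs = 0 (V-shaped)
auto_v <- with(mtcars, table(am, vs))["0", "0"]

six_cyl
pct_four
auto_v
```

The labels are quoted because table stores category names as text, even when the underlying values are numbers.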