Practice 2 Summary Statistics with R

2.1 Directions

In this practice exercise, you will load data into R and summarize it. You will use the following R functions: read.table, names, head, summary, attach, table, mean, median, var, sd, min, max, range, quantile, cor. Watch the VoiceThread below, and add comments if you have questions. Once you have watched the VoiceThread, read the instructions and use R to answer the questions below.

2.2 A closer look at the code

The data used in this practice exercise can be found here: http://tiny.cc/econ226/data/LungCapData.txt. The following code loads the data.

LungCapData <- read.table(file = "http://tiny.cc/econ226/data/LungCapData.txt", header = TRUE, sep = "\t")

The <- operator assigns the value on its right to the name on its left, so here we are storing the data in LungCapData.
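To see what header and sep do without relying on a network connection, here is a minimal sketch that reads a small tab-separated sample from a string. The names raw_text and sample_data are made up for illustration, and the text argument stands in for the file argument used above.

```r
# A small inline sample: tab-separated, with a header line.
# raw_text and sample_data are illustrative names, not part of the exercise.
raw_text <- "LungCap\tAge\tSmoke\n6.475\t6\tno\n10.125\t18\tyes\n9.550\t16\tno"

# text = reads from a string instead of a file or URL
sample_data <- read.table(text = raw_text, header = TRUE, sep = "\t")

names(sample_data)  # column names come from the first line because header = TRUE
nrow(sample_data)   # number of data rows (the header line is not counted)
```

With header = FALSE the first line would be read as data, and R would assign default column names such as V1, V2, V3.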

It is always good practice to make sure the data loaded the way you expect. In R, this can be done with the commands names and head.

names(LungCapData)

## [1] "LungCap"   "Age"       "Height"    "Smoke"     "Gender"    "Caesarean"

head(LungCapData)

##   LungCap Age Height Smoke Gender Caesarean
## 1   6.475   6   62.1    no   male        no
## 2  10.125  18   74.7   yes female        no
## 3   9.550  16   69.7    no female       yes
## 4  11.125  14   71.0    no   male        no
## 5   4.800   5   56.9    no   male        no
## 6   6.225  11   58.7    no female        no

names prints the names of all variables in the data set. A data set like LungCapData is called a data frame in R, and head prints the first few rows of a data frame. We can use either attach or $ to access an individual variable within a data frame. For example, we can summarize LungCap, a variable in LungCapData, by either

summary(LungCapData$LungCap)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.507   6.150   8.000   7.863   9.800  14.675

or by using the attach command

attach(LungCapData)
summary(LungCap)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.507   6.150   8.000   7.863   9.800  14.675

Once attach(LungCapData) is run, we no longer need to tell R where to look for the variable LungCap.
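Attached variables can be hidden by other objects with the same name, so it can help to know a way of reaching a column that does not depend on attach. The sketch below uses a small made-up data frame (lung is a hypothetical name) and compares $ with the base R function with, which evaluates an expression inside a data frame.

```r
# Small made-up data frame; the column names mirror LungCapData
lung <- data.frame(LungCap = c(6.5, 10.0, 9.5, 11.0),
                   Age     = c(6, 18, 16, 14))

m_dollar <- mean(lung$LungCap)         # explicit: data frame, then $, then column
m_with   <- with(lung, mean(LungCap))  # with() looks up LungCap inside lung

m_dollar == m_with  # both approaches give the same value
```

If you do use attach, detach(LungCapData) undoes it when you are finished.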

The table command counts how often each value of a categorical variable occurs. For example, this data set contains a variable, Smoke, that takes the value “no” for non-smokers and “yes” for smokers. The command

table(Smoke)

## Smoke
##  no yes 
## 648  77

creates a frequency table of smokers, and

table(Smoke)/length(Smoke)

## Smoke
##        no       yes 
## 0.8937931 0.1062069

creates a proportion table. Note that length returns the number of observations in Smoke, so each count is divided by the total number of observations.
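Base R also provides prop.table, which turns a table of counts into proportions directly. The sketch below, on a made-up vector named smoke, suggests that it gives the same result as dividing by length.

```r
# Made-up smoking indicator for ten people
smoke <- c("no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no")

counts   <- table(smoke)            # frequency table
by_hand  <- counts / length(smoke)  # proportions, as in the text
built_in <- prop.table(counts)      # the same proportions via prop.table

by_hand
built_in
```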

A cross tabulation between two variables can be calculated by creating a two-way frequency table. A two-way table that cross-tabulates the number of smokers by gender is created as follows:

table(Smoke,Gender)

##      Gender
## Smoke female male
##   no     314  334
##   yes     44   33
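Two common follow-ups to a two-way table are adding totals and converting counts to proportions within a group. The sketch below rebuilds the table from the counts printed above (so it runs without the data file) and uses the base R functions addmargins and prop.table. The name tab is made up for illustration.

```r
# Rebuild the two-way table from the counts printed above.
# matrix() fills column by column: female (no, yes), then male (no, yes).
tab <- as.table(matrix(c(314, 44, 334, 33), nrow = 2,
                       dimnames = list(Smoke  = c("no", "yes"),
                                       Gender = c("female", "male"))))

addmargins(tab)              # adds row and column totals labelled "Sum"
prop.table(tab, margin = 2)  # proportions within each gender (columns sum to 1)
```

Using margin = 1 instead would give proportions within each smoking status, so that each row sums to 1.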

The rest of the commands are straightforward to use. Here is a complete listing of the code, with comments, used in the VoiceThread.

2.3 R code used in the VoiceThread

# Load data
LungCapData <- read.table(file = "http://tiny.cc/econ226/data/LungCapData.txt", header = TRUE, sep = "\t")

# First look at the data
names(LungCapData) # print the names of all variables in the data frame
head(LungCapData) # print the first few rows of the data frame

# Summarize a variable
summary(LungCapData$LungCap) # access a variable within a data frame by using the $
attach(LungCapData) # or use the attach function to avoid retyping the name of the data frame
summary(LungCap)

# Frequency table
table(Smoke) # create a table with the number of smokers and non-smokers
table(LungCapData$Smoke) # this does the same thing without the attach command

# Proportion
table(Smoke)/length(Smoke) # divides each count by the total number of observations to give proportions
table(Smoke)/length(Smoke) * 100 # multiply by 100 to get a percent

# Two-way frequency table
table(Smoke,Gender)

# Mean
mean(LungCap) # Average of the numeric variable
mean(LungCap, trim = 0.10) # trimmed mean, i.e. drops the largest 10% and smallest 10% of values

# Median
median(LungCap)

# Variance and Standard Deviation
var(LungCap)
sd(LungCap)

# Min, max and range
min(LungCap) # lowest lung capacity
max(LungCap) # greatest lung capacity
range(LungCap) # returns the min and the max; use diff(range(LungCap)) for max minus min

# Percentile
quantile(LungCap, probs = 0.90)
quantile(LungCap, probs = c(0.20,0.50,0.90,1.00))

# Correlation
cor(LungCap,Age)
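A few of these functions are easy to misread, so here is a minimal sketch on a made-up vector x (not the lung capacity data). It shows that range returns two values rather than a difference, checks the trimmed mean by hand, and relates sd to var.

```r
# Made-up numeric sample of ten values, including one outlier
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 45)

range(x)        # returns two values: the minimum and the maximum
diff(range(x))  # the spread, i.e. max(x) - min(x)

mean(x)               # pulled upward by the outlier 45
mean(x, trim = 0.10)  # drops the smallest and largest 10% (one value each here)
mean(sort(x)[2:9])    # the same trimmed mean computed by hand

all.equal(sd(x)^2, var(x))  # sd is the square root of var
```

With ten values and trim = 0.10, one observation is removed from each end before averaging, which is why the by-hand version keeps positions 2 through 9 of the sorted vector.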

2.4 Now you try

Use R and the mtcars data set to answer the following questions (this is just for practice; you do not need to turn anything in).

How many cars in this sample have 6 cylinders? (cyl is the number of cylinders)
What percentage of the cars in this sample have 4 cylinders?
How many cars in the sample have an automatic transmission and a V-shaped engine? ‘am’ is the transmission type (0 = automatic, 1 = manual) and ‘vs’ is the engine shape (0 = V-shaped, 1 = straight)
What is the average miles per gallon (mpg) for all cars in the sample?
What is the correlation between mpg and horsepower (hp)?

Example

data("mtcars")
attach(mtcars)

# Question 1: Store your answer in `a`
a <- table(cyl)
a
# Question 2: Store your answer in `b`
b <- table(cyl)/length(cyl) * 100
b
# Question 3: Store your answer in `c`
c <- table(am,vs)
c
# Question 4: Store your answer in `d`
d <- mean(mpg)
d
# Question 5: Store your answer in `e`
e <- cor(mpg,hp)
e
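The tables above contain the answers, but you still have to read the right cell. As one possible approach, the sketch below pulls single values out of a table by name and avoids attach by using $. The names cyl_counts, n_six, n_six_alt, pct_four, both and n_auto_v are made up for illustration.

```r
data("mtcars")

cyl_counts <- table(mtcars$cyl)    # counts for every cylinder value

n_six     <- cyl_counts[["6"]]     # pick one entry by its name (a character string)
n_six_alt <- sum(mtcars$cyl == 6)  # the same count via a logical comparison

pct_four <- mean(mtcars$cyl == 4) * 100  # share of TRUE values, as a percent

# dnn labels the two dimensions of the table
both <- table(mtcars$am, mtcars$vs, dnn = c("am", "vs"))
n_auto_v <- both["0", "0"]         # row am = 0 (automatic), column vs = 0 (V-shaped)
```

Indexing by name rather than by position keeps the code correct even if the table gains or loses a category.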
