Chapter 3 Data Science

The Data Science approach may be summarized as:

  • Get
  • Clean
  • Organize
  • Model
  • Visualize

In this section, we will focus on each topic in order, before diving deeper into case studies in the next section, Business Analysis.

3.1 Overview



Included in this process it is important to remember Reality. Too often an Analyst suffers Paralysis of Analysis, forgetting the business is depending on the insights the analyst provides to make critical decisions. This is where the feedback loop provides the necessary check against reality to make sure the Analyst stays on track and schedule.

The Feedback Loop Andreas Schierwagen

3.2 Get

Points to remember:

  • Importing data into R is fairly simple.
  • For Stata and Systat, use the foreign package.
  • For SPSS and SAS the Hmisc package is recommended for ease and functionality.

Example of importing data follows.

mydata <- read.csv("mydata.csv")  # read csv file 
mydata <- read.table("mydata.txt")

mydata <- read.xlsx("mydata.xlsx ", sheet = 'sheet1', startRow = 1, colNames = TRUE, skipEmptyRows = TRUE)

Let’s read the first lines of Acme’s sales data.

df <- read.csv('', stringsAsFactors = FALSE)

Table 3.1: Sample Acme Sales
Date Customer_ID Customer Product Value Units
12/1/2017 68 Yosemite Sam Iron Bird Seed 304217.0 283
12/1/2017 68 Yosemite Sam Earthquake Pills 270966.0 299
12/1/2017 68 Yosemite Sam Hi-Speed Tonic 688150.0 609
12/1/2017 68 Yosemite Sam Dehydrated Boulder 334638.7 235
12/1/2017 68 Yosemite Sam Invisible Paint 292643.3 254
12/1/2017 68 Yosemite Sam Anvil 729439.0 579

From a comma delimited text file

  • First row contains variable names, comma is separator
  • Assign the variable id to row names
  • Note the / instead of  on mswindows systems
mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")

From Excel

To read and write data to Excel, we will use the openxlsx package. An example of extracting using the RODBC package is also provided, but openxlsx is preferred.

Openxlsx is described by the authors as:

“Simplifies the creation of Excel .xlsx files by providing a high level interface to writing, styling and editing worksheets. Through the use of ‘Rcpp’, read/write times are comparable to the ‘xlsx’ and ‘XLConnect’ packages with the added benefit of removing the dependency on Java.”

Read from an Excel file or Workbook object


Read data from an Excel file or Workbook object into a data.frame


read.xlsx(xlsxFile, sheet = 1, startRow = 1, colNames = TRUE, rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE, namedRegion = NULL, na.strings = “NA”, fillMergedCells = FALSE)


xlsxFile = An xlsx file, Workbook object or URL to xlsx file.

sheet = The name or index of the sheet to read data from.

startRow = first row to begin looking for data. Empty rows at the top of a file are always skipped, regardless of the value of startRow.

colNames = If TRUE, the first row of data will be used as column names.

rowNames = If TRUE, first column of data will be used as row names.

detectDates = If TRUE, attempt to recognise dates and perform conversion.

skipEmptyRows = If TRUE, empty rows are skipped else empty rows after the first row containing data will return a row of NAs.

skipEmptyCols = If TRUE, empty columns are skipped.

rows = A numeric vector specifying which rows in the Excel file to read. If NULL, all rows are read.

cols = A numeric vector specifying which columns in the Excel file to read. If NULL, all columns are read.

check.names logical. = If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names.

namedRegion = A named region in the Workbook. If not NULL startRow, rows and cols parameters are ignored.

na.strings = A character vector of strings which are to be interpreted as NA. Blank cells will be returned as NA.

fillMergedCells = If TRUE, the value in a merged cell is given to all cells within the merge.


Formula written using writeFormula to a Workbook object will not get picked up by read.xlsx(). This is because only the formula is written and left to be evaluated when the file is opened in Excel. Opening, saving and closing the file with Excel will resolve this.


xlsxFile <- system.file("readTest.xlsx", package = "openxlsx")

df1 <- read.xlsx(xlsxFile = xlsxFile, sheet = 1, skipEmptyRows = FALSE)
sapply(df1, class)

df2 <- read.xlsx(xlsxFile = xlsxFile, sheet = 3, skipEmptyRows = TRUE)

df2$Date <- convertToDate(df2$Date)
sapply(df2, class)

df2 <- read.xlsx(xlsxFile = xlsxFile, sheet = 3, skipEmptyRows = TRUE, detectDates = TRUE)
sapply(df2, class)

wb <- loadWorkbook(system.file("readTest.xlsx", package = "openxlsx"))
df3 <- read.xlsx(wb, sheet = 2, skipEmptyRows = FALSE, colNames = TRUE)
df4 <- read.xlsx(xlsxFile, sheet = 2, skipEmptyRows = FALSE, colNames = TRUE)
all.equal(df3, df4)

wb <- loadWorkbook(system.file("readTest.xlsx", package = "openxlsx"))
df3 <- read.xlsx(wb, sheet = 2, skipEmptyRows = FALSE,
cols = c(1, 4), rows = c(1, 3, 4))

Read an Excel file using the RODBC package

channel <- odbcConnectExcel("c:/myexcel.xls")
mydata <- sqlFetch(channel, "mysheet")

Keyboard Input Usually you will obtain a dataframe by importing it from SAS, SPSS, Excel, Stata, a database, or an ASCII file. To create it interactively, you can do something like the following.

Creating a dataframe from scratch

age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220) 
mydata <- data.frame(age,gender,weight) 

You can also use R’s built in spreadsheet to enter the data interactively, as in the following example.

Enter data using editor

mydata <- data.frame(age=numeric(0),gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
  • Note that without the assignment in the line above, the edits are not saved!

Exporting Data

There are numerous methods for exporting R objects into other formats . For SPSS, SAS and Stata. you will need to load the foreign packages. For Excel, you will need the xlsReadWrite, Openxlsx, or RODBC package.

To A Tab Delimited Text File

write.table(mydata, "c:/mydata.txt", sep="\t") 

To an Excel Spreadsheet

write.xls(mydata, "c:/mydata.xls") 


write.foreign(mydata, "c:/mydata.txt", "c:/", package="SAS")

Viewing Data

There are a number of functions for listing the contents of an object or dataset.

####list objects in the working environment

####list the variables in mydata

####list the structure of mydata

####list levels of factor v1 in mydata

####dimensions of an object

There are a number of functions for listing the contents of an object or dataset.

####class of an object (numeric, matrix, dataframe, etc)

####display mydata 

####display first 10 rows of mydata

####display last 5 rows of mydata

Variable Labels

R’s ability to handle variable labels is somewhat unsatisfying. If you use the Hmisc package, you can take advantage of some labeling features.

label(mydata$myvar) <- "Variable label for variable myvar" 

Unfortunately the label is only in effect for functions provided by the Hmisc package, such as describe(). Your other option is to use the variable label as the variable name and then refer to the variable by position index.

names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable 

To understand value labels in R, you need to understand the data structure factor. You can use the factor function to create your own value labels. Variable v1 is coded 1, 2 or 3 and we want to attach value labels 1=red, 2=blue,3=green

mydata$v1 <- factor(mydata$v1,levels = c(1,2,3),labels = c("red", "blue", "green")) 

Variable y is coded 1, 3 or 5 and we want to attach value labels 1=Low, 3=Medium, 5=High

mydata$v1 <- ordered(mydata$y,levels = c(1,3, 5),labels = c("Low", "Medium", "High")) 

Use the factor() function for nominal data and the ordered() function for ordinal data. R statistical and graphic functions will then treat the data appropriately.

Note: Factor and ordered are used the same way, with the same arguments. The former creates factors and the later creates ordered factors.

Web Scraping


html = read_html("")
population = html_table(html_nodes(html, "table")[1], fill=TRUE)

Exchange Rates

####Retrieving Exchange Rates
eu_fx = head(Quandl("FRED/EXUSEU"),5)
mx_fx = head(Quandl("FRED/DEXMXUS"),5)


Diesel Prices


bls.content <- getURL("")
bls.json <- fromJSON(bls.content)
tmp <-bls.json$Results[[1]][[1]]
diesel <- data.frame(year=sapply(tmp$data,"[[","year"),
diesel <- head(diesel,5)

Global Trade Data from UN Comtrade


string <- ""
reporters <- fromJSON(file=string)
reporters <-$results,rbind)))

get.Comtrade <- function(url=""
  string<- paste(url
                 ,"max=",maxrec,"&" #maximum no. of records returned
                 ,"type=",type,"&" #type of trade (c=commodities)
                 ,"freq=",freq,"&" #frequency
                 ,"px=",px,"&" #classification
                 ,"ps=",ps,"&" #time period
                 ,"r=",r,"&" #reporting area
                 ,"p=",p,"&" #partner country
                 ,"rg=",rg,"&" #trade flow
                 ,"cc=",cc,"&" #classification code
                 ,"fmt=",fmt        #Format
                 ,sep = ""
  if(fmt == "csv") {<- read.csv(string,header=TRUE)
  } else {
    if(fmt == "json" ) {<- fromJSON(file=string)
      validation<- unlist($validation, recursive=TRUE)
      ndata<- NULL
      if(length(data)> 0) {
        var.names<- names(data[[1]])
        data<- sapply(data,rbind)))
        ndata<- NULL
        for(i in 1:ncol(data)){
          data[sapply(data[,i],is.null),i]<- NA
          ndata<- cbind(ndata, unlist(data[,i]))
        colnames(ndata)<- var.names
      return(list(validation=validation,data =ndata))

part 2: download specific data

start downloading:

q8 <- get.Comtrade(r='8',p='all',ps='2015',fmt='csv',cc='480255')

3.3 Clean

Missing Data

In R, missing values are represented by the symbol NA (not available) . Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number).

Unlike SAS, R uses the same symbol for character and numeric data.

Test for Missing Values: # returns TRUE of x is missing
y <- c(1,2,3,NA)
y <- # returns a vector (F F F T) 


Lets recode missing values as 99.

y[] <- 99
Table 3.2: Vector y Recoded

We already did this in a previous example, but a little review never hurts, so lets exclude missing values from our analyses and perform a function on data with missing values.

x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

The function complete.cases() returns a logical vector indicating which cases are complete. Here, we combine x and y into a dataframe and list rows of data that have missing values.

The function na.omit() returns the object with listwise deletion of missing values. Here we create new dataset without missing data

df <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

x y 1 1 0 2 2 10

Table 3.3: Dataset without missing values
x y
1 0
2 10

Advanced Handling of Missing Data

Most modeling functions in R offer options for dealing with missing values. You can go beyond pairwise of listwise deletion of missing values through methods such as multiple imputation. Good implementations that can be accessed through R include Amelia II, Mice, and mitools.

Date Values

Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates. Use as.Date( ) to convert strings to dates

mydates <- as.Date(c("2007-06-22", "2004-02-13"))

Number of days between 6/22/07 and 2/13/04

days <- mydates[1] - mydates[2]
Sys.Date() ####returns today's date.
Date() ####returns the current date and time. 
The following symbols can be used with the format( ) function to print dates.
Table 3.4: Date Formatting
Symbol Description Example
%d day as a number 01-31
%a abbreviated weekday Mon
%A unabbreviated weeekday Monday
%m month 00-12
%b abbreviated month Jan
%B unabbreviated month January
y 2-digit year 07
Y 4-digit year 2007

Print today’s date:

today <- Sys.Date()
format(today, format="%B%d%Y")

3.4 Organize

Once you have access to your data, you will want to massage it into useful form. This includes creating new variables (including recoding and renaming existing variables), sorting and merging datasets, aggregating data, reshaping data, and subsetting datasets (including selecting observations that meet criteria, randomly sampling observation, and dropping or keeping variables).

Each of these activities usually involve the use of R’s built-in operators (arithmetic and logical) and functions (numeric, character, and statistical). Additionally, you may need to use control structures (if-then, for, while, switch) in your programs and/or create your own functions. Finally you may need to convert variables or datasets from one type to another (e.g. numeric to character or matrix to dataframe).

Use the assignment operator <- to create new variables. A wide array of operators and functions are available here.

Three examples for doing the same computations

####Example 1:
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2

####Example 2:
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2

####Example 3:
mydata <- transform( mydata,sum = x1 + x2,mean = (x1 + x2)/2)

In order to recode data, you will probably use one or more of R’s control structures.

Create 2 age categories

mydata$agecat <- ifelse(mydata$age > 70, c("older"), c("younger"))

Another example: Create 3 age categories

mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"


You can rename variables programmatically or interactively.

#####Rename interactively 
fix(mydata) # results are saved on close 

####Rename programmatically 
mydata <- rename(mydata, c(oldname="newname"))

You can re-enter all the variable names in order Changing the ones you need to change. The limitation is that you need to enter all of them!

names(mydata) <- c("x1","age","y", "ses") 


To sort a dataframe in R, use the order( ) function. By default, sorting is ASCENDING. Prepend the sorting variable by a minus sign to indicate DESCENDING order.

Here are some examples using the sales_data dataset.

####Retrieve the sales_data data set
sdata <- read.csv('')
####Sort by Net_Sales_Unit
newdata <- sdata[order(sdata$Net_Sales_Unit),] 
Table 3.5: Sort by Net_Sales_Unit
Customer Net_Sales_Unit
Penelope Pussycat 705
Piggy 736
Inki 760
Playboy Penguin 779
Ralph Wolf 783
Quick Brown Fox and Rapid Rabbit 798

Sort by Net_Sales_Unit and Distribution_Unit

newdata <- sdata[order(sdata$Net_Sales_Unit, sdata$Distribution_Unit),]
Table 3.6: Sort by Net_Sales_Unit and Distribution_Unit
Customer Net_Sales_Unit Distribution_Unit
Penelope Pussycat 705 62
Piggy 736 61
Inki 760 78
Playboy Penguin 779 188
Ralph Wolf 783 85
Quick Brown Fox and Rapid Rabbit 798 56

Sort by Net_Sales_Unit (descending) and Distribution_Unit (ascending)

newdata <- sdata[order(-sdata$Net_Sales_Unit, sdata$Distribution_Unit),]
Table 3.7: Sort by Net_Sales_Unit (descending) and Distribution_Unit (ascending)
Customer Net_Sales_Unit Distribution_Unit
Buddy 1463 54
Beaky Buzzard 1459 57
Hubie and Bertie 1303 72
Marc Antony and Pussyfoot 1279 89
Colonel Shuffle 1263 171
Charlie Dog 1253 49


To merge two dataframes (datasets) horizontally, use the merge function. In most cases, you join two dataframes by one or more common key variables (i.e., an inner join).

####Merge two dataframes by ID
total <- merge(dataframeA,dataframeB,by="ID")

####Merge two dataframes by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country")) 

To merge two dataframes (datasets) horizontally, use the merge function. In most cases, you join two dataframes by one or more common key variables (i.e., an inner join).

####Merge two dataframes by ID
total <- merge(dataframeA,dataframeB,by="ID")

####Merge two dataframes by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country"))

Adding Rows To join two dataframes (datasets) vertically, use the rbind function. The two dataframes must have the same variables, but they do not have to be in the same order. total <- rbind(dataframeA, dataframeB)

If dataframeA has variables that dataframeB does not, then either: - Delete the extra variables in dataframeA or - Create the additional variables in dataframeB and set them to NA (missing) before joining them with rbind.


It is relatively easy to collapse data in R using one or more BY variables and a defined function. Using the sdata data set, let’s aggregate Units_Billed by Customer.

aggdata <- aggregate(sdata["Units_Billed"],by=sdata[c("Customer")], FUN=sum)

When using the aggregate() function, the by variables must be in a list (even if there is only one). The function can be built-in or user provided.

See also:

  • summarize() in the Hmisc package
  • summaryBy() in the doBy package

Type conversions in R work as you would expect. For example, adding a character string to a numeric vector converts all the elements in the vector to character.

  • Use to test for data type foo. Returns TRUE or FALSE
  • Use to explicitly convert it.
  • is.numeric(), is.character(), is.vector(), is.matrix(),, as.numeric(), as.character(), as.vector(), as.matrix(),

As a conglomerate that seems to produce everything imaginable, Acme is involved in significant trade globally. As Acme’s Business Analyst, you have been asked to analyze Acme’s Global Trade flows for Glue and Smoke Screen Bombs at the regional level. Unfortuntately, the data set provided is at the country level and a bit disorganized, so this will require some work on our part first.

####Import Data
trade <- read.csv('', stringsAsFactors = FALSE)

Table 3.8: Trade Data Structure
Variable Class First_Values
ID integer 1, 2, 3, 4, 5, 6
Partner.Country character Algeria, Algeria, Algeria, Angola, Angola, Angola
Reporting.Country character South Africa, Egypt, Egypt, South Africa, South Africa, South Africa
Product character Glue, Smoke Screen Bomb, Glue, Smoke Screen Bomb, Giant Rubber Band, Glue
X2002 double 0, 0, 0, 4.8, 0, 3.4
X2003 double 0, 0, 0, 75.6, 0, 0
X2014 double 0, 0, 0, 409.6, 3.2, 3.6
X2015 double 0, 0, 0, 108, 0, 0.8

That is one messy data set. As mentioned, the data only shows countries, but we need to aggregate to the regional level and the years are in columns with a X attached. Let’s begin cleaning this up by bringing in a region reference table.

####Import Data
region <- read.csv('', stringsAsFactors = FALSE)

Table 3.9: Region Reference
Variable Class First_Values
X integer 1, 2, 3, 4, 5, 6
Country character Afghanistan, Albania, Algeria, Andorra, Angola, Antigua and Barbuda
A_Country character Other.Far.East, Other.Eastern.Europe, Middle.East, Other.Western.Europe, Other.Africa, Other.Latin.America
A_Region character Far.East, Eastern.Europe, Middle.East, Western.Europe, Africa, Latin.America
B_Country character Afghanistan*, Albania, Algeria, Andorra, Angola, Antigua & Barbuda

The region table has many more fields than we require. We are going to merge the “Country” field in the region table with the “Partner.Country” and “Reporting.Country” fields in the trade data set, to return the “Region”. After that, we will deal with the years being in columns and the attached X.

####Rename Partner.Country in Trade dataset
names(trade)[names(trade)=="Partner.Country"] <- "Country"

trade1 <- (merge(region, trade, by = 'Country'))
names(trade1)[names(trade1)=="Region"] <- "Partner.Region"
names(trade1)[names(trade1)=="Country"] <- "Partner.Country"

####Repeat the preceding steps for Reporter.Country
names(trade1)[names(trade1)=="Reporting.Country"] <- "Country

Now, let’s remove columns not needed and use the gsub() function to remove the “X” from year.

####Remove columns not needed
vars <- names(trade1) %in% c("X", "A_Country.x", "A_Region.x", "B_Country.x", "X.x", 
                               "A_Country.y", "A_Region.y", "B_Country.y", "X.y", "Segment") 
trade1 <- trade1[!vars]

#####Aggregate the data, adding a year column
trade2 <- melt(trade1, id=c("Reporting.Country","Reporting.Region", "Partner.Country", "Partner.Region"))
names(trade2)[names(trade2)=="variable"] <- "Year"

####Convert Year to numeric
trade2$Year <- gsub("X", "", paste(trade2$Year))
trade2$Year <- as.numeric(trade2$Year)
trade2$value <- as.numeric(trade2$value)

####Sum by Reporting.Region, Partner.Region and Year
trade.agg <- aggregate(trade2["value"], by=trade2[c("Reporting.Region","Partner.Region", "Year")], FUN=sum, na.rm=TRUE)

Table 3.10: Sample Trade Data
Reporting.Region Partner.Region Year value
ASIA ASIA 2002 95252.8
CEB ASIA 2002 427.6
CIS ASIA 2002 3151.4
MEXICO ASIA 2002 0.0

3.5 Model

Linear Regression

Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship between the predictor variable(s) and the response variable, so that, we can use this formula to estimate the value of the response Y, when only the predictors (Xs) values are known.


The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that we can use this regression model to predict the Y when only the X is known. This mathematical equation can be generalized as follows:

\[\begin{equation} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{equation}\]

where the slope is:

\[\begin{equation} \hat{\beta}_1 = \frac{\sum(X_i - \bar{X}) (Y_i - \bar{Y})} {\sum(X_i - \bar{X})^2} \end{equation}\]

and the intercept is:

\[\begin{equation} \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \end{equation}\]

Collectively, they are called regression coefficients. \(\epsilon_i\) is the error term, the part of Y the regression model is unable to explain.

Example Problem For this analysis, we will use the sales_data dataset that is posted on Github RforBusiness. sales_data is a sample data from Acme. You will find that it consists of 68 observations(rows) and 12 variables (columns), which we will limit to 2 in our example, Units_Billed, and Cont_Margin_Unit. Lets view the first six observations here.

Table 3.11: Sample sales_data
Net_Sales_Unit Cont_Margin_Unit
1237.1175 688.0386
937.0911 501.6672
1459.1544 940.9065
1054.7908 632.4303
1053.0315 623.4967
890.7143 411.4174


Correlation is a statistical measure that suggests the level of linear dependence between two variables, that occur in pair just like what we have here in Net_Sales_Unit and Cont_Margin_Unit. Correlation can take values between -1 to +1. If we observe for every instance where Net_sales_Unit increases, the Cont_Margin_Unit increases along with it, then there is a high positive correlation between them and therefore the correlation between them will be closer to 1. The opposite is true for an inverse relationship, in which case, the correlation between the variables will be close to -1.

A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of variation of the response variable (Y) is unexplained by the predictor (X), in which case, we should probably look for better explanatory variables.

cor(sdata_full$Net_Sales_Unit, sdata_full$Cont_Margin_Unit)  # calculate correlation between Net_Sales_Unit and Cont_Margin_Unit 

[1] 0.8755682

scatter.smooth(x=sdata_full$Net_Sales_Unit, y=sdata_full$Cont_Margin_Unit, main="Cont_Margin ~ Net Sales") 

Build Linear Model

Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, lets see the syntax for building the linear model. The function used for building linear models is lm(). The lm() function takes in two main arguments, namely: 1. Formula 2. Data. The data is typically a data.frame and the formula is a object of class formula. But the most common convention is to write out the formula directly in place of the argument as written below.

linearMod <- lm(Cont_Margin_Unit ~ Net_Sales_Unit, data=sdata_full)  # build linear regression model on full data


Call: lm(formula = Cont_Margin_Unit ~ Net_Sales_Unit, data = sdata_full)

Coefficients: (Intercept) Net_Sales_Unit
-134.8339 0.7017

Now that we have built the linear model, we also have established the relationship between the predictor and response in the form of a mathematical formula for Cont_Margin_Unit as a function for Net_Sales_Unit. For the above output, you can notice the Coefficients part having two components: Intercept: -134.8339, Net_Sales_Unit: 0.7017 These are also called the beta coefficients. In other words,


\[=> Cont_Margin_Unit = -134.8339 + 0.7017*Net_Sales_Unit\]

Linear Regression Diagnostics

Now the linear model is built and we have a formula that we can use to predict the dist value if a corresponding speed is known. Is this enough to actually use this model? NO! Before using a regression model, you have to ensure that it is statistically significant. How do you ensure this? Lets begin by printing the summary statistics for linearMod.


Call: lm(formula = Cont_Margin_Unit ~ Net_Sales_Unit, data = sdata_full)

Residuals: Min 1Q Median 3Q Max -144.79 -34.27 14.96 34.02 150.39

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -134.83395 51.24853 -2.631 0.0106 *
Net_Sales_Unit 0.70168 0.04765 14.724 <2e-16 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 61.99 on 66 degrees of freedom Multiple R-squared: 0.7666, Adjusted R-squared: 0.7631 F-statistic: 216.8 on 1 and 66 DF, p-value: < 2.2e-16

The p Value: Checking for statistical significance

The summary statistics above tells us a number of things. One of them is the model p-Value (bottom last line) and the p-Value of individual predictor variables (extreme right column underCoefficients). The p-Values are very important because, We can consider a linear model to be statistically significant only when both these p-Values are less that the pre-determined statistical significance level, which is ideally 0.05. This is visually interpreted by the significance stars at the end of the row. The more the stars beside the variable p-Value, the more significant the variable.

Null and alternate hypothesis

When there is a p-value, there is a hull and alternative hypothesis associated with it. In Linear Regression, the Null Hypothesis is that the coefficients associated with the variables is equal to zero. The alternate hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).


We can interpret the t-value something like this. A larger t-value indicates that it is less likely that the coefficient is not equal to zero purely by chance. So, higher the t-value, the better.

Pr(>|t|) or p-value is the probability that you get a t-value as high or higher than the observed value when the Null Hypothesis (the β coefficient is equal to zero or that there is no relationship) is true. So if the Pr(>|t|) is low, the coefficients are significant (significantly different from zero). If the Pr(>|t|) is high, the coefficients are not significant.

What this means to us? when p Value is less than significance level (< 0.05), we can safely reject the null hypothesis that the co-efficient \(\beta\) of the predictor is zero. In our case, linearMod, both these p-Values are well below the 0.05 threshold, so we can conclude our model is indeed statistically significant.

It is absolutely important for the model to be statistically significant before we can go ahead and use it to predict (or estimate) the dependent variable, otherwise, the confidence in predicted values from that model reduces and may be construed as an event of chance.

How to calculate the t Statistic and p-Values? When the model co-efficients and standard error are known, the formula for calculating t Statistic and p-Value is as follows:

\[t-Statistic=\frac{\beta-coefficient}{Std. Error}\]

modelSummary <- summary(linearMod)  # capture model summary as an object
modelCoeffs <- modelSummary$coefficients  # model coefficients
beta.estimate <- modelCoeffs["Net_Sales_Unit", "Estimate"]  # get beta estimate for Net_Sales_Unit
std.error <- modelCoeffs["Net_Sales_Unit", "Std. Error"]  # get std.error for Net_Sales_Unit
t_value <- beta.estimate/std.error  # calc t statistic
p_value <- 2*pt(-abs(t_value), df=nrow(cars)-ncol(sdata_full))  # calc p Value
f_statistic <- linearMod$fstatistic[1]  # fstatistic
f <- summary(linearMod)$fstatistic  # parameters for model p-value calc
model_p <- pf(f[1], f[2], f[3], lower=FALSE)

R-Squared and Adj R-Squared

The actual information in a data is the total variation it contains, remember?. What R-Squared tells us is the proportion of variation in the dependent (response) variable that has been explained by this model.


where, SSE is the sum of squared errors given by \(\sum_{i}^n (y_i-\hat{y}_i)^2\) and \(SST=\sum_{i}^n (y_i-\bar{y}_i)^2\) is the sum of squared total. Here, yi^ is the fitted value for observation i and \(\hat{y}_i\) is the mean of Y.

We don’t necessarily discard a model based on a low R-Squared value. Its a better practice to look at the AIC and prediction accuracy on validation sample when deciding on the efficacy of a model.

Now thats about R-Squared. What about adjusted R-Squared? As you add more X variables to your model, the R-Squared value of the new bigger model will always be greater than that of the smaller subset. This is because, since all the variables in the original model is also present, their contribution to explain the dependent variable will be present in the super-set as well, therefore, whatever new variable we add can only add (if not significantly) to the variation that was already explained. It is here, the adjusted R-Squared value comes to help. Adj R-Squared penalizes total value for the number of terms (read predictors) in your model. Therefore when comparing nested models, it is a good practice to look at adj-R-squared value over R-squared.


where, MSE is the mean squared error given by \(MSE=\frac{SSE}{(n-q)}\) and \(MST=\frac{SST}{(n-1)}\) is the mean squared total, where n is the number of observations and q is the number of coefficients in the model.

Therefore, by moving around the numerators and denominators, the relationship between \(R^2\) and \(R^2adj\) becomes:


Standard Error and F-Statistic

Both standard errors and F-statistic are measures of goodness of fit.



where, n is the number of observations, q is the number of coefficients and MSR is the mean square regression, calculated as,

\[MSR=\frac{\sum_{i}^n (\hat{y_{i}} - \bar{y})}{q-1}=\frac{SST-SSE}{q-1}\] AIC and BIC

The Akaike’s information criterion - AIC (Akaike, 1974) and the Bayesian information criterion - BIC (Schwarz, 1978) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.

The AIC is defined as:


where, k is the number of model parameters and the BIC is defined as:


where, n is the sample size.

For model comparison, the model with the lowest AIC and BIC score is preferred.

AIC(linearMod) # AIC => 758.2119 BIC(linearMod) # BIC => 764.8704

How to know if the model is best fit for your data?

The most common metrics to look at while selecting the model are:
Table 3.12: Basic Statistical Criterion
Statistic Criterion
R-Squared Higher the better (>0.70)
Adj. R-Squared Higher the better
F-Statistic Higher the better
Std. Error Closer to zero the better
t-statistic Should be greater than 1.96 for p-value to be less than 0.05
AIC Lower the better
BIC Lower the better
Mallows cp Should be close to the number of predictions in model
MAPE (Mean Absolute Percent Error Lower the better
MSE (Mean Squared Error Lower the better
Min_Max Accuracy => mean(min(actual, predicted)/max(actual,predicted)) Higher the better

Predicting Linear Models

So far we have seen how to build a linear regression model using the whole dataset. If we build it that way, there is no way to tell how the model will perform with new data. So the preferred practice is to split your dataset into a 80:20 sample (training:test), then, build the model on the 80% sample and then use the model thus built to predict the dependent variable on test data.

Doing it this way, we will have the model predicted values for the 20% data (test) as well as the actuals (from the original dataset). By calculating accuracy measures (like min_max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, lets see how to actually do this..

*Step 1: Create the training (development) and test (validation) data samples from original data.

*Step 2: Develop the model on the training data and use it to predict the distance on test data.

*Step 3: Review diagnostic measures.

Call: lm(formula = Cont_Margin_Unit ~ Net_Sales_Unit, data = trainingData)

Residuals: Min 1Q Median 3Q Max -138.910 -39.611 7.582 39.901 161.087

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -103.72031 62.78696 -1.652 0.105
Net_Sales_Unit 0.66761 0.05916 11.285 1.36e-15 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 64.41 on 52 degrees of freedom Multiple R-squared: 0.7101, Adjusted R-squared: 0.7045 F-statistic: 127.4 on 1 and 52 DF, p-value: 1.358e-15

AIC (lmMod)  # Calculate akaike information criterion

AIC = 607.0629

From the model summary, the model p value and predictor’s p value are less than the significance level, so we know we have a statistically significant model. Also, the R-Sq and Adj R-Sq are comparative to the original model built on full data.

*Step 4: Calculate prediction accuracy and error rates

A simple correlation between the actuals and predicted values can be used as a form of accuracy measure. A higher correlation accuracy implies that the actuals and predicted values have similar directional movement, i.e. when the actuals values increase the predicteds also increase and vice-versa.

actuals_preds <- data.frame(cbind(actuals=testData$Cont_Margin_Unit, predicteds=contPred))  # make actuals_predicteds dataframe.

correlation_accuracy <- cor(actuals_preds)  # 94.9%


actuals    predicteds
688.0386   722.1935
501.6672   521.8924
940.9065   870.4278
949.2637   872.9064
596.3764   611.2421
856.9224   732.9666

Now lets calculate the Min Max accuracy and MAPE: \[MinMaxAccuracy=mean\Bigg(\frac{min(actuals,predicteds)}{max(actuals,predicteds)}\Bigg)\] \[MeanAbsolutePercentageError (MAPE)=mean\Bigg(\frac{abs(predicteds-actuals)}{actuals}\Bigg)\]

k- Fold Cross validation

Suppose, the model predicts satisfactorily on the 20% split (test data), is that enough to believe that your model will perform equally well all the time? It is important to rigorously test the model’s performance as much as possible. One way is to ensure that the model equation you have will perform well, when it is built on a different subset of training data and predicted on the remaining data.

How to do this is? Split your data into k mutually exclusive random sample portions. Keeping each portion as test data, we build the model on the remaining (k-1 portion) data and calculate the mean squared error of the predictions. This is done for each of the k random sample portions. Then finally, the average of these mean squared errors (for k portions) is computed. We can use this metric to compare different linear models.

By doing this, we need to check two things:

  1. If the models prediction accuracy isn’t varying too much for any one particular sample, and if the lines of best fit don’t vary too much with respect the the slope and level.
  2. In other words, they should be parallel and as close to each other as possible. You can find a more detailed explanation for interpreting the cross validation charts when you learn about advanced linear model building.

cvResults <- suppressWarnings(CVlm(data=sdata_full, form.lm=Cont_Margin_Unit ~ Net_Sales_Unit, m=5, dots=FALSE, seed=10, legend.pos="topleft",  printit=FALSE, main="Small symbols are predicted values while bigger ones are actuals."));  # performs the CV

attr(cvResults, 'ms')  # => 3961.37 mean squared error

[1] 3961.37

3.6 Visualize

For visualizions, we will rely heavily on the ggplot2 package created by Hadley Wickham. The ggplot2 package offers a powerful graphics language for creating elegant and complex plots. Its popularity in the R community has exploded in recent years. Origianlly based on Leland Wilkinson’s The Grammar of Graphics, ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner. Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots (i.e., conditioning) is relatively simple.

Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful resources). There is a helper function called qplot() (for quick plot) that can hide much of this complexity when creating standard graphs.

qplot() The qplot() function can be used to create the most common graph types. While it does not expose ggplot’s full power, it can create a very wide range of useful plots. The format is:

qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim= xlab=, ylab=, main=, sub=)

where the options are:

Option <- c('')

Description <- c('')

plot_opts <- cbind(Option, Description)

knitr::kable(plot_opts, caption = 'ggplot Options')

ggplot2 cannot be used to create 3D graphs or mosaic plots. Use I(value) to indicate a specific value. For example size=z makes the size of the plotted points or lines proporational to the values of a variable z. In contrast, size=I(3) sets each point or line to three times the default size. Here are some examples using sales data (Net_Sales_Unit, Distribution_Unit, DVC_Unit, PKG_Loss_Unit, etc.) contained in the sdata data frame.

sdata <- read.csv('')


####Plot Net Sales Per Unit vs. Units Billed
qplot(Units_Billed, Net_Sales_Unit, data=sdata, main="Net Sales Per Unit vs. Units_Billed")

####PSelect Top 10 Customers based on Net_Sales_Unit and Plot Customer vs. Net_Sales_Unit
top_10 <- tail(sdata[order(-sdata$Net_Sales_Unit),],10) 
top_10$Net_Sales_Unit <- round(top_10$Net_Sales_Unit,0)

####POrder Net_Sales_Unit for plot
top_10$Customer <- factor(top_10$Customer, levels = top_10$Customer[order(top_10$Net_Sales_Unit)])

qplot(Net_Sales_Unit, Customer, data=top_10, main="Top 10 Customers by Net_Sales_Unit")

####PGrouped by Variance per Unit, where Variance < 4 and Units < 5000
sdata1 <- sdata[ which(sdata$Variance_Unit < 4
                       & sdata$Units_Billed < 5000), ]

####Pcreate factors with value labels 
sdata1$Variance_Unit <- factor(sdata1$Variance_Unit,levels=c(1,2,3,4,5,6,7,8,9,10),
    labels=c("1","2","3", "4","5","6","7","8","9","10"))

qplot(Units_Billed, data=sdata1, geom="density", fill=Variance_Unit, alpha=I(.5), main="Distribution of Variance", xlab="Units Billed", 

####PSeparate regressions of Units_Billed on Net_Sales_Unit where Variance < 4 and Units < 5000
qplot(Units_Billed, Net_Sales_Unit, data=sdata1, geom=c("point", "smooth"), 
   method="lm", formula=y~x, color=Variance_Unit, 
   main="Regression of Units and Net Sales Per unit", 
   xlab="Units", ylab="Net Sales")

Customizing ggplot2 Graphs

Unlike base R graphs, the ggplot2 graphs are not effected by many of the options set in the par( ) function. They can be modified using the theme() function, and by adding graphic parameters within the qplot() function. For greater control, use ggplot() and other functions provided by the package. Note that ggplot2 functions can be chained with “+” signs to generate the final plot.

Let’s use the trade data set reviewed previously and plot the values by year for the reporting regions.

p1_data <- aggregate(trade2["value"], by=trade2[c("Reporting.Region","Year")], FUN=sum, na.rm=TRUE)

ggplot(p1_data) + geom_line(aes(x=Year, y=value, colour=Reporting.Region))

3.7 Review