3 Data Importing and Tidying

Before we begin, the following packages are loaded to help us with all stages of this project.
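
As a minimal sketch, the loading step could look like the block below. The exact package list is an assumption inferred from the functions called in this section (read_html, html_table, filter, mutate, select, full_join, separate, and the %>% pipe); later stages of the project may need additional packages.

```r
# Assumed package set, inferred from the functions used in this section
library(rvest)   # read_html(), html_table()
library(dplyr)   # filter(), mutate(), select(), full_join(), %>%
library(tidyr)   # separate()
```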

We also write this function which will later on help us with table formatting.

The data for this project are harvested from Basketball-Reference.com, a website containing data on NBA players, scores, team statistics, historical records, and much other NBA-related information. It is a great resource for any kind of NBA statistical analysis and is popular throughout the NBA world: many analysts, writers, podcasters, and fans across America turn to it first for information. David Leonhardt of the New York Times once praised Basketball-Reference’s database as “a gift straight from the basketball gods.”

For this particular project, the website is extremely useful, since it has nearly every piece of information about an NBA player, from his statistics and awards to his height, weight, nickname(s), or even his Twitter username. We are interested in Russell Westbrook’s statistics, in particular his triple-doubles, during his last three seasons (2016-17, 2017-18, and 2018-19) with the Oklahoma City Thunder. To that end, we use web scraping to collect Russ’ game-by-game statistics for those three seasons. We take advantage of scraping functions from the rvest package in R and write a function to grab the statistics table for each of Russ’ last 3 OKC campaigns.

This function takes in the ending year of the season, executes some data-harvesting tasks to read in the data from the webpage, and performs some table transformations, including renaming, creating new variables and selecting columns that are needed in further steps.

RussData <- function(Year){
  # get url
  RussURL <- paste("https://www.basketball-reference.com/players/w/westbru01/gamelog/", Year, "/", sep = "")
  # grab all content from the raw html file
  RussHTML <- read_html(RussURL)
  # parse html code into data tables, use index to get our desired table
  RussStats <- html_table(RussHTML, fill = TRUE)[[8]]
  
  # rename the columns
  names(RussStats)[6] <- "side"
  names(RussStats)[8] <- "result"
  RussStats <- filter(RussStats, GS == "1") # filter out the games he didn't play
  RussStats[RussStats == "CHO"] <- "CHA" # correct abbreviation for Charlotte
  
  # create a vector of teams from the Western Conference for further usage
  WestTeams <- c("LAL", "UTA", "LAC", "DEN", "DAL", "HOU", "OKC", "MEM",
                 "SAS", "PHO", "POR", "NOP", "SAC", "MIN", "GSW")
  
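  # make sure all the stats are of numeric type; [[i]] extracts the column as a
  # vector, so this works whether html_table() returns a data frame or a tibble
  for (i in 11:ncol(RussStats)){
    RussStats[[i]] <- as.numeric(RussStats[[i]])
  }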
  
  # transforming the table
  RussStats <- RussStats %>% 
    separate(MP, into = c("Min", "Sec")) %>%  # separate the minutes variable (min:sec)
    mutate(Season = paste(Year - 1, "-", Year, sep = ""), # create season variable
           Side = ifelse(side == "@", "Away", "Home"), # which side OKC is
           Result = ifelse(grepl("W", result), "Win", "Loss"), # game result for OKC
           OppConf = ifelse(Opp %in% WestTeams, "West", "East"), # opponent's conference
           Minutes = round(as.integer(Min) + as.integer(Sec)/60, 2), # Russ' playing time in minutes
           TripDbl = ifelse(PTS >= 10 & AST >= 10 & TRB >= 10, "Yes", "No")) %>%  # triple-double
    select(Season, Result, Side, Opp, OppConf, Minutes,
           FG, FGA, `FG%`, `3P`, `3PA`, `3P%`, FT, FTA, `FT%`,
           ORB, DRB, TRB, PTS, AST, TripDbl,
           STL, BLK, TOV, PF, `+/-`)
  
  # return the tidied table explicitly
  RussStats
}
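
To see what the transformation steps do without hitting the website, here is a small sketch that applies the same separate and mutate logic to a toy table. The table name toy and the MP strings are assumptions for illustration; they were chosen so that the results line up with the three sample rows shown in this section.

```r
library(dplyr)
library(tidyr)

# Toy game log (hypothetical stand-in for the scraped table)
toy <- tibble(
  MP  = c("36:23", "45:19", "33:39"),  # playing time as min:sec
  PTS = c(32, 51, 33),
  TRB = c(12, 13, 11),
  AST = c(9, 10, 16)
)

toy_tidy <- toy %>%
  # split "min:sec" into two character columns
  separate(MP, into = c("Min", "Sec"), sep = ":") %>%
  mutate(
    # convert to decimal minutes, rounded to 2 places
    Minutes = round(as.integer(Min) + as.integer(Sec) / 60, 2),
    # triple-double: at least 10 points, assists, and rebounds
    TripDbl = ifelse(PTS >= 10 & AST >= 10 & TRB >= 10, "Yes", "No")
  ) %>%
  select(Minutes, PTS, TRB, AST, TripDbl)

print(toy_tidy)
```

The first game falls one assist short (9), so it is the only one of the three not flagged as a triple-double.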

Now we want to combine the data from every season into one big data table. To do this, we first initialize the full dataset with the data from the first season, then use a for loop to iterate over the remaining years and append each season's table with a full_join.
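
The loop described above can be sketched as follows. To keep the example runnable offline, a hypothetical stand-in function, toy_season, replaces RussData and returns a tiny made-up table; the names Years and RussFull are also assumptions, not the original code.

```r
library(dplyr)

# Hypothetical stand-in for RussData(): returns a small made-up table
# for one season so the loop can run without scraping
toy_season <- function(Year) {
  tibble(
    Season = paste0(Year - 1, "-", Year),
    PTS    = c(20, 30),
    AST    = c(10, 8)
  )
}

Years <- 2017:2019

# initialize the full dataset with the first season
RussFull <- toy_season(Years[1])

# append each remaining season; joining on every column keeps all rows
# from both tables because no row appears in more than one season
for (Year in Years[-1]) {
  RussFull <- full_join(RussFull, toy_season(Year), by = names(RussFull))
}

print(RussFull)
```

In the real pipeline, toy_season(Year) would be replaced by RussData(Year). Since every season's table has the same columns, dplyr's bind_rows would be a more direct way to stack them, but the full_join version mirrors the approach described in the text.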

The table tidying and transforming tasks are complete, and here’s a quick glimpse of our data table:

Season     Result  Side  Opp  OppConf  Minutes  FG  FGA  FG%    3P  3PA  3P%    FT  FTA  FT%    ORB  DRB  TRB  PTS  AST  TripDbl  STL  BLK  TOV  PF  +/-
2016-2017  Win     Away  PHI  East     36.38    11  21   0.524  1   2    0.500  9   11   0.818  1    11   12   32   9    No       0    0    2    2   10
2016-2017  Win     Home  PHO  West     45.32    17  44   0.386  2   10   0.200  15  20   0.750  3    10   13   51   10   Yes      2    0    5    3   7
2016-2017  Win     Home  LAL  West     33.65    11  21   0.524  5   6    0.833  6   6    1.000  4    7    11   33   16   Yes      1    1    7    3   10


The attributes of this table are:

  • Season: 2016-2017, 2017-2018, 2018-2019
  • Result: whether Russ’ team (OKC) recorded a win or a loss
  • Side: whether OKC is playing at home or on the road
  • Opp: opponent - the team that is playing against OKC
  • OppConf: the conference that the opponent belongs to. The NBA consists of 30 teams, divided into 2 conferences, the Eastern Conference and the Western Conference
  • Minutes: Russ’ total game playing time, in minutes
  • FG: number of Field Goals Russ made in a game. In basketball, a “field goal” made is just a basket scored on any shot, excluding the free throws.
  • FGA: number of Field Goals attempted by Russ
  • FG%: Russ’ Field Goal Percentage in a game (FG/FGA)
  • 3P, 3PA, 3P%: number of 3-pointers made, attempted, and 3-point percentage in a game
  • FT, FTA, FT%: number of free throws made, attempted, and free throw percentage in a game
  • ORB, DRB, TRB: number of Offensive Rebounds, Defensive Rebounds, and Total Rebounds (ORB + DRB) in a game
  • PTS: number of Points Russ scored
  • AST: number of Assists Russ had
  • TripDbl: whether Russ recorded a triple-double
  • STL: number of Steals
  • BLK: number of Blocks
  • TOV: number of Turnovers (losing possession of the ball to the opposing team before a shot is attempted)
  • PF: number of Personal Fouls
  • +/- (plus/minus): a measure of a player’s contribution to his team in a game. It is calculated by counting up points scored by the player’s team and points scored against the player’s team while that player is on the floor, then subtracting points against from points for. A positive +/- value means his team outscored the opponent by that many points while he was on the court, whereas a negative +/- means the opposing team outscored his team by that many points while he was playing.
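
As a quick sanity check on the derived columns, the sketch below recomputes FG%, TRB, and PTS for the three sample rows shown earlier, and works through a plus/minus calculation. The variable names are assumptions for illustration, and the on-court point totals in the plus/minus example are hypothetical, not taken from the data.

```r
# Box-score values from the three sample rows of the table
FG     <- c(11, 17, 11)
FGA    <- c(21, 44, 21)
ThreeP <- c(1, 2, 5)      # 3-pointers made (the 3P column)
FT     <- c(9, 15, 6)
ORB    <- c(1, 3, 4)
DRB    <- c(11, 10, 7)

# Field goal percentage: made / attempted
FGpct <- round(FG / FGA, 3)

# Total rebounds: offensive + defensive
TRB <- ORB + DRB

# Points: 2 per made field goal, 1 extra per made 3-pointer, 1 per free throw
PTS <- 2 * FG + ThreeP + FT

# Plus/minus with hypothetical on-court totals
points_for     <- 80   # points scored by the player's team while he is on the floor
points_against <- 70   # points scored by the opponent over the same stretch
plus_minus <- points_for - points_against

print(FGpct)
print(TRB)
print(PTS)
print(plus_minus)
```

The recomputed values agree with the FG%, TRB, and PTS columns of the sample rows.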

With the data wrangling finished, we are now ready to explore what the data reveal.