3 Data Importing and Tidying
The ultimate data file for this project will simply consist of match final scores of all Premier League games from its inaugural competition, the 1992-93 season, to the last fixture of the last completed 38-matchweek contest, 2018-19 season. The 5 main attributes of this dataset are Season, Home Team, Away Team and the number of goals scored by each team.
The following webpage https://www.football-data.co.uk/englandm.php contains all data of premier league matches from 1992 to 2019. Each season has its own data file; every file is in csv format, and they can be saved easily by clicking on the hyperlink associated with each one of them. We downloaded and named the data file for each season by the “starting year” of the season. For example, if the season is 2018-19, the data is saved as 2018.csv
. This will help us in a later step.
In addition, since we are interested in investigating the relationship between scoring time and the time between goals, data on these two topics for Manchester United, a Premier League club, are collected and stored in a file named muscoringtime.xlsx
. More details on this data file will be discussed once we get to the analysis of time between goals.
Before we begin, the following packages are loaded to help us with all stages of this project.
library(tidyverse)
library(kableExtra)
library(mosaic)
library(readxl)
library(surveillance)
library(knitcitations)
We also write this function which will later on help us with table formatting.
To load the match results data into R, it is a good idea for us to write an importing function. This function takes in the starting year of the season, reads in the data file from its file path, and performs some table transformations including creating a new column to represent the seasons, renaming and selecting variables that are needed in further steps.
epl.season <- function(year){
epl.path <- paste(year, ".csv", sep = "") # get file path
epl.Getdata <- read.csv(epl.path) # read in file
epl.Getdata <- epl.Getdata %>%
mutate(Season = paste(year , "-", year + 1, sep = "")) %>%
rename(Home.Goals = FTHG, Away.Goals = FTAG) %>%
select(Season, HomeTeam, AwayTeam, Home.Goals, Away.Goals)
}
Now we want to combine all the data from every year into one big data table. To do this, we use a for
loop, which allows us to iterate over the year range, then join the tables together. A full_join
is used here, which is a type of outer join in R. An outer join keeps observations that appear in at least one of the tables, and a full join keeps all observation in the tables.
epl.fulldata <- epl.season(1992) # initializes full dataset
for (year in 1993:2018) {
epl.fulldata <- epl.fulldata %>%
full_join(epl.season(year))
}
Here’s a quick glimpse at the last 4 rows of our “mega” data table.
Season | HomeTeam | AwayTeam | Home.Goals | Away.Goals | |
---|---|---|---|---|---|
10503 | 2018-2019 | Man United | Cardiff | 0 | 2 |
10504 | 2018-2019 | Southampton | Huddersfield | 1 | 1 |
10505 | 2018-2019 | Tottenham | Everton | 2 | 2 |
10506 | 2018-2019 | Watford | West Ham | 1 | 4 |
The data importing is complete; we can now proceed to the analysis stage of the project.