3 Data Importing and Tidying

The ultimate data file for this project will simply consist of match final scores of all Premier League games from its inaugural competition, the 1992-93 season, to the last fixture of the last completed 38-matchweek contest, 2018-19 season. The 5 main attributes of this dataset are Season, Home Team, Away Team and the number of goals scored by each team.

The following webpage https://www.football-data.co.uk/englandm.php contains all data of premier league matches from 1992 to 2019. Each season has its own data file; every file is in csv format, and they can be saved easily by clicking on the hyperlink associated with each one of them. We downloaded and named the data file for each season by the “starting year” of the season. For example, if the season is 2018-19, the data is saved as 2018.csv. This will help us in a later step.

In addition, since we are interested in investigating the relationship between scoring time and the time between goals, data on these two topics for Manchester United, a Premier League club, are collected and stored in a file named muscoringtime.xlsx. More details on this data file will be discussed once we get to the analysis of time between goals.

Before we begin, the following packages are loaded to help us with all stages of this project.

We also write this function which will later on help us with table formatting.

To load the match results data into R, it is a good idea for us to write an importing function. This function takes in the starting year of the season, reads in the data file from its file path, and performs some table transformations including creating a new column to represent the seasons, renaming and selecting variables that are needed in further steps.

Now we want to combine all the data from every year into one big data table. To do this, we use a for loop, which allows us to iterate over the year range, then join the tables together. A full_join is used here, which is a type of outer join in R. An outer join keeps observations that appear in at least one of the tables, and a full join keeps all observation in the tables.

Here’s a quick glimpse at the last 4 rows of our “mega” data table.

Season HomeTeam AwayTeam Home.Goals Away.Goals
10503 2018-2019 Man United Cardiff 0 2
10504 2018-2019 Southampton Huddersfield 1 1
10505 2018-2019 Tottenham Everton 2 2
10506 2018-2019 Watford West Ham 1 4

The data importing is complete; we can now proceed to the analysis stage of the project.