Abstract

1. Introduction

2. Background and Set-Up

2.1 The Set-Up
2.2 Caveats

Time Period of Sample
Size of the Sample
Clauses in Contracts
Other Contract Factors

2.3 Data Gathering and Transformations

Player Ability
A Player’s Age
Contract Length
Status of a club

2.4 Other Variables Tested

Physical Stats
Being Premier League Proven
Being a Former Player
Being British/English
Club Wealth
Money Available to Spend

2.5 Unquantifiable Metrics

Club Staff
Interest in a Player
Players Forcing a Move
Commercial Purchases
Desperate Situations

3. Modelling

3.1 Model Basics
3.2 Final Model

Dummies
Age
Contract Length
Player Ability
Champions League since 2010

3.3 Model Interrogation

Transfer Fee
League Position
Playing Position
Age
Contract Length

3.4 OLS Assumptions and Tests
3.5 Model Conclusions

R Squared and Adjusted R-squared
Standard Error of Regression
Final Comments

3.6 Model Equation
3.6 January 2020 Transfer Window

Results

3.7 Jadon Sancho

4. Conclusions and Future Analysis

4.1 Ideas for Future Analysis

Using Different Leagues
Modelling by Position
Specific Playing Characteristics
Use Heatmaps for Positions
Percentage of Games Played
International Caps
Date of Signing
Age Variable
Player History

4.2 Conclusion

5. Appendix

5.1 Available Leagues
5.2 League Rankings
5.3 Jadon Sancho Detailed Example
5.4 Champions League Qualification
5.5 Players in the Sample
5.6 Players Excluded from the Sample

Abstract

This project investigates what factors affect a player’s transfer fee in football. Using a sample of 62 player transfers from the Premier League 2019/20 season, I’ve performed an Ordinary Least Squares Regression. My analysis shows there are four key metrics that affect a player’s transfer fee: Age, Ability, Contract Length and the club the player signs for. This model can now be used to appraise future transfers and evaluate their fee.

1. Introduction

You often see debates, usually on transfer deadline day, with pundits arguing over whether a player is worth ‘x’ amount or not. Each gives opposing views - one naming them the ‘bargain of the window’ while the other reckons ‘his best days are behind him’. While this is often good TV, you never seem to have a solution. You are usually left with a closing statement from the host exclaiming, ‘I guess we will have to wait and see if the player justifies his price tag’.

There are lots of clichés in football but hearing statements like these has always frustrated me. I’ve therefore decided to see if I can cut out the middle man (no more pundits telling me their opinions on transfer fees) and build a model myself that can accurately predict what a player’s transfer value should be and therefore, whether a player was worth the money or not, all by using statistics.

With a background in Econometrics, this project provided me with the interesting task of merging my love of football with my skills in modelling. The project not only contains my proposed solution but also a lot of discussion points brought about by my analysis. I will explain the variables in the model and how one can use my equations to evaluate a Premier League transfer. I have also tested the model against the 2020 Winter Transfer Window and have made some predictions regarding the potential transfer of Jadon Sancho.

2. Background and Set-Up

2.1 The Set-Up

The most time-consuming part of the project was data collection. As there was no magical dataset with all the information I would need, I had to compile my own. Due to the number of variables I wanted to test and the volume of individual player transfers, I limited my scope to the 2019 Summer Transfer Window and to players signing for a Premier League club. In 2019, a total of 139 players signed for a Premier League club, but after removing loan deals and free transfers this number became 77. I then removed a further 15 players for whom reliable data was not available, leaving a final sample of 62. For example, one of my key variables was player performance. The data I used to evaluate player performance came from ‘whoscored.com’; however, this source does not cover every league. Therefore, players like Gabriel Martinelli, who had previously plied his trade in the Brazilian 3rd division for Ituano Futebol Clube, have been removed from the sample. A full list of the leagues available from ‘whoscored.com’ is given in the Appendix, Section 5.1 Available Leagues.

2.2 Caveats

Before we get into the nitty gritty, there are some caveats that need mentioning.

Time Period of Sample

The sample is based on only one transfer window, so the model cannot account for fluctuations in transfer fees from year to year. The transfer market moves year on year, and a player’s value often depends on when the transfer was made. The output from the model therefore only evaluates what a player would be worth arriving in the Premier League in the 2019 Summer Transfer Window. Without a variable for the year of transfer, the model’s estimates are best suited to years close to the date of the sample, i.e. the model will be better at estimating a player’s value in 2020 than in 2010. At the time of writing it is difficult to say what impact COVID-19 will have on transfer fees; with lots of clubs losing revenue, fees may even decrease in the following years.

However, I aim to keep collecting new data and updating the model for future transfer windows. I therefore aim to have a measure of impact of the ‘year of transfer’ in future analysis.

Size of the Sample

Because the sample is based only on the 2019 Summer Transfer Window, my sample size of 62 is small and not ideal. This limits both which variables and, more importantly, how many variables I can add to the model, as too many would risk overfitting such a small sample. Without being cautious, I may end up building a model which perfectly explains the sample I have but gives poor estimates when used to evaluate new players. As with any modelling, the larger the sample size, the more confident you can be in your results. As I mentioned previously, I will continue to add data from future transfers, so the accuracy of the results and the scope to add more explanatory variables will grow in future analysis.

Clauses in Contracts

I have done my best to investigate the transfer of each player in detail. However, I was unable to find a reliable source on whether the transfer was completed through triggering a clause in a player’s contract. That may be either a ‘Buy Out’ clause (common in La Liga), a ‘Release Clause’ or a ‘Buy Back’ clause. Therefore, without a reliable source, I have not included this as a factor in my modelling.

Other Contract Factors

The model only accounts for the overall Transfer Fee. The fee is only one aspect of a transfer - the model does not consider the wages the player has been given at his new club, any contract bonuses, agent fees, contract length at the new club or whether the fee is paid in instalments or in one payment. All of these factors may have some influence on the overall Transfer Fee, but not all of them are available to the public.

2.3 Data Gathering and Transformations

The following two sections focus on the variables which I tested in my model. This section covers all the variables I found to have a statistical effect. Section 2.4 Other Variables Tested covers the other variables I tested, where there was not enough statistical significance to be confident in their effect.

Variables include both individual player metrics and metrics in relation to the buying club. The modelling therefore tests not only factors about a player, but also whether the buying club the player signs for influences the overall transfer fee.

All of the variables in this section are included in the final model. For each one, I set out my own ‘hypothesis’ for why I originally included it, the source of the data I used to test that hypothesis, and how I structured and transformed the data to create variables that best explain its potential effect.

Player Ability

The hypothesis: Higher skilled players will have larger transfer fees.

The data: I’ve used player ratings from ‘whoscored.com’ and league rankings from ‘globalfootballrankings.com’, which use club data from ‘fivethirtyeight.com’.

The source: https://www.whoscored.com, https://www.globalfootballrankings.com, https://fivethirtyeight.com

Preliminary Analysis:

Below, I have charted up the average ratings from ‘whoscored.com’ for player performance over the two previous seasons. On the same chart is the corresponding transfer fee that the player was bought for.

There is a slight positive correlation between the two variables, which implies that players with higher average ratings tend to have higher transfer fees. I did, however, notice a problem with this data: the player ratings do not incorporate the standard of football in the player’s league. On the chart, Trezeguet is player number 22 and has the highest average rating of the sample at 7.49. He was, however, playing his football in the Turkish league, which is known to be of a lower standard than the Premier League. I therefore need a way of accounting for this.

Transformations and manipulation: I’ve taken the average of a player’s ratings across the previous season, the previous 2 seasons and the previous 3 seasons. My model found the average of the previous 2 seasons to be the most significant. To me, this is logical as it gives a large enough sample of games to get a reliable estimate without judging a player’s ability on games from a long time ago.
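To make the three candidate variables concrete, here is a minimal Python sketch. The ratings are hypothetical, and it assumes a simple mean of season-average ratings; the original calculation may instead have averaged over individual matches.

```python
from statistics import mean

# Hypothetical season-average ratings for one player, most recent season first.
season_ratings = [6.9, 6.7, 7.1]

def average_over_last(ratings, n_seasons):
    """Mean rating over the most recent n_seasons seasons."""
    return mean(ratings[:n_seasons])

# One candidate variable per look-back window (1, 2 and 3 seasons).
candidates = {n: average_over_last(season_ratings, n) for n in (1, 2, 3)}
for n, value in candidates.items():
    print(f"previous {n} season(s): {value:.2f}")
```

Each candidate would then be tested in the regression in turn, keeping whichever is most significant.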

However, I did stumble across a problem with using this data: the player ratings given by ‘whoscored.com’ are affected by various factors, the main two being (1) the quality of the player and (2) the league in which the player plays his football. Therefore, I need a way of comparing players across different leagues (and hence different standards of football).

If we were only looking at a player’s rating at face value, you could argue that James Hanson of League 2 Grimsby Town (Season Average Rating 7.47, 2019/20) is a better player than Manchester City’s Sergio Aguero (Season Average Rating 7.20, 2019/20). Therefore, if we want a true reflection of a player’s ability, their ratings need to incorporate the standard of football from the league in which they play.

In order to account for this, I used data from ‘globalfootballrankings.com’, which contains a reliable ranking of the quality of each league. I’ve then upweighted each player’s ratings based on the standard of football of the league in which they play. This means that players playing in the Premier League (No.1 ranked league) have a higher upweighting than players in League 2. Therefore, although James Hanson is having a good season, the adjusted ratings no longer suggest he has been playing better football than Sergio Aguero.

Visualisation:

Below is a snippet of the league rankings, which use the Soccer Power Index (SPI). The league rankings in full and the details and calculations behind them can be found in the Appendix, Section 5.2 League Rankings. The final column of the table below is the percentage upweight given to a player rating, dependent on the league they play in.

League Name	Average SPI	Log of SPI	Percentage Upweight
Barclays Premier League	74.67	1.873146	40.10%
Spanish Primera Division	72.78	1.862012	39.30%
German Bundesliga	70.17	1.846151	38.10%
Italy Serie A	65.08	1.813448	35.70%
French Ligue 1	61.96	1.792111	34.10%
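As a quick check on the table, the ‘Log of SPI’ column appears to be the base-10 logarithm of the ‘Average SPI’ column. The sketch below reproduces it from the values shown; how the percentage upweight is then derived is covered in the Appendix and is not attempted here.

```python
import math

# Average SPI values copied from the table above.
average_spi = {
    "Barclays Premier League": 74.67,
    "Spanish Primera Division": 72.78,
    "German Bundesliga": 70.17,
    "Italy Serie A": 65.08,
    "French Ligue 1": 61.96,
}

# Base-10 log of each league's average SPI, rounded as in the table.
log_spi = {league: round(math.log10(spi), 6) for league, spi in average_spi.items()}
for league, value in log_spi.items():
    print(f"{league}: {value}")
```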

An example of how this is applied can be seen in the stacked chart below. The chart compares two players from my sample, André Gomes and Trezeguet. It shows that even though André Gomes had a lower average rating (6.59) than Trezeguet (7.49), his overall rating after accounting for the standard of the league is greater: 9.23 compared to Trezeguet’s 9.19.

The calculations behind the graph above are as follows:

André Gomes’ average player rating $\cdot$ Percentage upweight from English Premier League

=6.59 $\cdot$ 40.1% = 2.64

André Gomes’ Original Score + André Gomes’ Upweight

=6.59 + 2.64 = 9.23

Trezeguet’s average player rating $\cdot$ Percentage upweight from Turkish Super Lig

=7.49 $\cdot$ 22.8% = 1.70

Trezeguet’s Original Score + Trezeguet’s Upweight

=7.49 + 1.70 = 9.19
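The same calculation can be written as a small Python function. This is only a sketch: the function name is my own, and it uses the rounded percentages quoted above, so the result can differ in the second decimal place from figures computed with unrounded upweights.

```python
def league_adjusted_rating(average_rating, upweight):
    """Original rating plus its league upweight (rating * upweight)."""
    return average_rating + average_rating * upweight

# Rounded upweights taken from the worked example above.
gomes = league_adjusted_rating(6.59, 0.401)      # Premier League
trezeguet = league_adjusted_rating(7.49, 0.228)  # Turkish Super Lig

print(f"André Gomes: {gomes:.2f}")
print(f"Trezeguet:   {trezeguet:.2f}")
```

Even with the rounding caveat, the ordering is the same: André Gomes comes out ahead of Trezeguet once the league is taken into account.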

A Player’s Age

The hypothesis: The older a player, the lower the transfer fee, as a player has fewer years left to keep performing at the highest level.

The data: Raw data on the ages of players at time of transfer was found at ‘transfermarkt.co.uk’.

The source: https://www.transfermarkt.co.uk

Preliminary Analysis:

Below, I have charted up the age of the player at the time of transfer with their corresponding transfer fee. The chart shows a negative correlation, that is, that the older the player, the lower the transfer fee. It also demonstrates that no player over the age of 29 was bought for more than £10M. Furthermore, a large proportion of the lower transfer fees were for players over the age of 27.

Transformations and manipulation: I tested two different variables which account for the age of a player. The first was taking the raw variable - the age of a player at the time of transfer, and applying simple transformations e.g. log, square root and squared.

The second variable breaks player ages down into ranges, for example 17-20, 21-24, 25-28 and 29+, to see if transfer fees differ by age range. My analysis found no evidence that they do. Many people in the sport believe players peak at certain ages, and cite this as a factor in explaining a transfer fee. Yet my model already accounts for a player being in his prime through the player ratings variable. My model therefore suggests that a transfer fee is not affected by a player being in his prime as such, but by the player’s ability at that age.

Visualisation: The variable which best explained the effects of age in the model was age squared. In the graph below you can see the raw variable (Age) plotted against the modified variable (Age Squared). The fact that the modified variable is the most significant indicates that the effect of age on transfer price is not linear, i.e. each additional year that a player gets older has a greater incremental impact on their transfer price than the year before.

It will come as no surprise that this variable has a negative relationship with transfer price: as age increases, transfer price decreases.
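A short Python sketch shows why a squared term produces a growing incremental effect. It only looks at the variable itself (the ages chosen are illustrative and no model coefficient is assumed): the jump in age squared from one birthday to the next gets larger as the player gets older.

```python
ages = [20, 24, 28, 32]

# Change in the age-squared variable when a player gets one year older.
increments = {age: (age + 1) ** 2 - age ** 2 for age in ages}
print(increments)  # {20: 41, 24: 49, 28: 57, 32: 65}
```

With a negative coefficient on age squared, each extra year therefore pulls the predicted fee down by more than the previous year did.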

Contract Length

The hypothesis: If a player is signed up to a long contract, the transfer fee will be higher. This is because the longer the length of the contract, the more secure a player is at a club.

The data: The number of days left on a contract at the time of transfer is found on ‘transfermarkt.co.uk’.

The source: https://www.transfermarkt.co.uk

Preliminary Analysis:

In the chart below, we have the total number of days remaining on a player’s contract at the time of transfer, plotted against the corresponding transfer fee. The chart shows a slight positive correlation; the more days remaining on a player’s contract, the higher the transfer fee. Remarkably, the five lowest transfer fees were all for players with less than a year left on their contract, while the five largest transfer fees were for players with over 2½ years remaining.

Transformations and manipulation: The manipulation here was similar to the manipulation for Player’s Age. I took the overall number of days left on a contract and transformed it with a log, square root and square.

Visualisation: The variable which best explained the effects of Contract Length in the model was Contract Length Logged. In the graph below you can see the raw variable (Contract Length) plotted against the modified variable (Contract Length Logged). What is important is the difference between the patterns of the two lines. The fact that the log variable is the most significant implies that the effects of contract length are best accounted for when the variable is compressed: each additional increment in contract length has a smaller incremental effect on the transfer fee than the one before.
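The compression can be seen with a few lines of Python. This sketch assumes the natural log (a different base would only rescale the numbers) and uses illustrative contract lengths of one to four years.

```python
import math

# Days remaining on a contract: 1, 2, 3 and 4 years.
days = [365, 730, 1095, 1460]

# Gain in the logged variable from each additional year on the contract.
gains = [math.log(later) - math.log(earlier) for earlier, later in zip(days, days[1:])]
print([round(g, 3) for g in gains])  # [0.693, 0.405, 0.288]
```

Going from one year to two adds far more to the logged variable than going from three years to four, which is the diminishing effect described above.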

Status of a club

The hypothesis: ‘Bigger’ clubs will have to pay more for players, simply because the selling club knows that the ‘bigger’ clubs have more disposable money and they can demand more money for their player. Conversely, the same variable tests the alternative hypothesis that ‘bigger’ clubs get better deals (a lower transfer fee). Players may try to force moves when they hear a ‘big’ club is interested, giving the selling club less power and giving the buying club the upper hand in negotiations.

The data: Classifying a club as big is quite difficult, so I approached it in several ways. First, I used ‘myfootballfacts.com’ to look at the number of trophies Premier League teams have won. I looked at three different variables: overall trophies won, trophies won since 2000 and trophies won since 2010. I also investigated Champions League qualifications, i.e. how many times Premier League clubs have qualified for the Champions League group stages since its inception in 1955, since 2000 and since 2010.

The source: http://www.myfootballfacts.com

Preliminary Analysis:

Below is a chart which plots transfer fee against the number of times the buying club of the player has qualified for the Champions League since 2010. The chart shows a positive correlation; the more times a club has qualified for the Champions League, the larger a player’s transfer fee. The chart also shows that every transfer fee over £40M was to a club that qualified for the Champions League at least once since 2010.

Transformations and manipulation: The variables were quite simple - the raw number of trophies, with Log, Square and Square Root transformations. The variable which worked best was the raw number of times a club has qualified for the Group Stages of the Champions League since 2010.

Visualisation:

The table below shows the seven teams that have qualified for the Champions League group stages since 2010. As it was the raw variable that was the most significant in my model, it implies that the more times a club has qualified for the Champions League, the more the club will have to pay for a player. This is logical, as selling clubs know which clubs will have larger revenues due to Champions League appearances and will set their price accordingly.

Club	Number of Champions League Group Stages Qualifications since 2010
Manchester City	9
Chelsea	8
Manchester United	7
Arsenal	7
Tottenham Hotspur	5
Liverpool	4
Leicester City	1

2.4 Other Variables Tested

In this section, I discuss the other variables which were tested in the model. Some variables showed indications of having an effect but not at the confidence level required (95%). Perhaps with a larger sample and further modelling, these factors may turn out to have a confirmed statistical effect on a transfer fee.

Physical Stats

I looked into the height, weight and strong foot of the player and tested hypotheses such as whether taller goalkeepers or defenders are more desirable, or whether left-footed players are of more value. For all of the physical variables, there is no real evidence to suggest that shorter, taller, left-footed or right-footed players command higher transfer fees than others.

Being Premier League Proven

These variables tested the hypothesis that by having played in the Premier League before, you’re less of a risk for the buying club and therefore the buying club are willing to spend more money on the player. I began by creating dummy variables with different threshold levels i.e. players who had played a minimum of 10, 20 and 50 games in the Premier League. These variables did show a positive effect on transfers i.e. if the player has played a minimum of ‘x’ games in the Premier League during their career, their transfer fee will be ‘y’ percentage higher. However, the t-statistics on these variables fell just short of the confidence level desired. These variables would be ones I would like to test again with a larger sample.
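As an illustration of how such threshold dummies can be built, here is a minimal Python sketch. The function and variable names are hypothetical; only the thresholds of 10, 20 and 50 games come from the text.

```python
THRESHOLDS = (10, 20, 50)

def premier_league_proven_dummies(appearances):
    """Return one 0/1 dummy per threshold: 1 if the player has at least that many games."""
    return {f"pl_proven_{t}": int(appearances >= t) for t in THRESHOLDS}

# A hypothetical player with 35 Premier League appearances.
print(premier_league_proven_dummies(35))
# {'pl_proven_10': 1, 'pl_proven_20': 1, 'pl_proven_50': 0}
```

Each dummy would be tested in the model separately, since they overlap heavily with one another.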

Being a Former Player

These were dummy variables for whether a player had previously played for the buying club. The hypothesis is that if a player has already played for the buying club, the club will know his game in more detail. Having seen the player perform not only in matches but also in training, day in day out, the club is better informed about the player it is signing. This variable also encapsulates players who have been loaned to a club and then bought after their loan period, e.g. Mateo Kovacic and Raul Jimenez in our modelling sample. However, as with the Premier League proven variables, the t-statistics fell short of the desired level of confidence.

Being British/English

One variable I was intrigued to investigate was whether being British or English translates to having a higher transfer fee. The hypothesis here was that, since quotas were put in place by the FA in 2010 requiring a minimum of 8 homegrown players in a squad of 25, homegrown players have become more valuable. I tested dummy variables for both being British and, more specifically, being English, and surprisingly there was little evidence to suggest that either had any effect on a player’s value. My theory is that the rule may have had an effect in the years immediately after 2010, when it was introduced, but now, nearly 10 years later, clubs have adapted and are more prepared.

Club Wealth

This variable was difficult to make as nearly all the sources available were incomplete. Both Forbes and Deloitte publish annual rich lists for clubs, each ranking the top 20 richest clubs in the world. Either would have been a reliable dataset to build a variable from, but neither rich list covers all clubs in the Premier League, with usually only the top 5 or 6 clubs featured. Therefore, I had to think of a proxy variable which would reflect Club Wealth.

The first proxy variable I created looked at the average wages of the current players at the buying club, with the idea that ‘the more the club pays its players, the wealthier the club’. I used data from ‘globalsportssalaries.com’, which provided a ranking and an average yearly wage figure for all Premier League clubs.

The second proxy variable I created was by using prize money. There was reliable data available from ‘planetfootball.com’ for the amount of prize money won in the previous Premier League season. This prize money included positional prize money and TV revenue. Liverpool came out top overall, due to larger revenue from TV, even though it was Manchester City who won the league and had the largest amount from positional prize money. The problem here is that I did not have a corresponding dataset for clubs which had been promoted from the Championship. I therefore tested a range of different values of prize money for the promoted clubs.

Neither proxy variable showed a statistically significant effect, as the alternative club variable, ‘Champions League qualification since 2010’, was already explaining the majority of the club variation.

Money Available to Spend

The idea for this variable came from Andy Carroll. When Liverpool bought Andy Carroll in January 2011 for £36.9M, the fee was widely believed to be too high, but it made sense given that Liverpool had received £52.7M from Chelsea for Fernando Torres earlier that day.

The variable I tried to create focused on income minus expenditure. It tests whether clubs that have just sold players for big money then have to pay a premium, because selling clubs know that the buying club has just received a lot of money. I calculated how much money each individual Premier League club had in income and expenditure over the four periods listed below:

1. The current transfer window (Summer 2019)
2. Summer 2019 + Winter 2019
3. Summer 2019 + Winter 2019 + Summer 2018
4. Summer 2019 + Winter 2019 + Summer 2018 + Winter 2018

Each player was then assigned a value (income minus expenditure) for each period (1, 2, 3 and 4), based on the club they signed for, to create four different variables. I then tested these variables along with variable transformations. Unfortunately, none of them had a significant effect in explaining the variation in our sample.
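The four variables are cumulative sums of income minus expenditure over progressively longer look-back periods. A minimal Python sketch, using entirely hypothetical figures for a single buying club, would be:

```python
# Hypothetical (window, income, expenditure) figures in £M for one buying club,
# ordered from the most recent window backwards.
windows = [
    ("Summer 2019", 30.0, 80.0),
    ("Winter 2019", 5.0, 10.0),
    ("Summer 2018", 20.0, 60.0),
    ("Winter 2018", 0.0, 15.0),
]

def cumulative_net_income(windows):
    """Income minus expenditure accumulated over periods 1, 2, 3 and 4."""
    totals = []
    running = 0.0
    for _, income, expenditure in windows:
        running += income - expenditure
        totals.append(running)
    return totals

print(cumulative_net_income(windows))  # [-50.0, -55.0, -95.0, -110.0]
```

Every player signed by that club would be given the same four values, one per period.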

Perhaps this variable does more to explain the amount of money a club may spend in a transfer window than any variation in an individual player’s transfer fee.

2.5 Unquantifiable Metrics

In Section 4.1 Ideas for Future Analysis I discuss variables which I would like to test in future analysis with a larger sample of data. In this section I want to mention some factors which will most likely affect a player’s transfer fee but which I was unable to quantify.

Club Staff

One factor which my model does not quantify, and will always struggle to, is the quality of club staff. Take club scouts first. Our player ability metrics are based on an algorithm which analyses over 200 metrics from a game to give an overall rating for each match. However, these ratings do not capture other qualities which scouts will see, e.g. mental qualities such as leadership, or off-the-ball abilities such as creating space for others. Club scouts will therefore have a better understanding of a player’s true ability, and their estimates of player quality will differ from the data I have access to at ‘whoscored.com’.

Another unquantifiable factor is club negotiators. Some clubs are very savvy and shrewd in the transfer market and will only buy players if they know they are getting a good deal. Tottenham’s Daniel Levy is known to be a tough negotiator who does not like being ripped off, so we may see variation between what different clubs would pay for a player and how much each club believes he is worth. Do some clubs get a better deal than others due to tough negotiators? It would be of great interest to find out, but sadly I am unable to quantify it. With a much larger sample you could potentially test this by having a dummy variable for each club.

Interest in a Player

The amount of interest in a player is another variable which could drive up a transfer price but again, this data is not readily available. The more clubs that are interested in a player, the greater the chance of a bidding war between buying clubs and of the winner paying more than the player is worth. It is also worth mentioning here the ‘ability’ of the player’s agent. It has been reported that some agents attempt to get many clubs interested in their player in order to manufacture a bidding war.

Players Forcing a Move

Another tricky metric to evaluate is on what terms the player left the club. Players can put in transfer requests and often conduct themselves in different ways after doing so. This behaviour I suspect can have an impact on the player’s transfer fee. Players have a lot of power to ‘turn’ a dressing room and often clubs are forced to sell a player as keeping them has become untenable. This then puts the buying club in a position of power, perhaps being able to acquire the player cheaper than under normal circumstances i.e. if the player were to not have handed in a transfer request or if they had behaved in a better manner during their time at the club.

Some of this effect may be picked up by the contract length variable, as players are in a position of power to force a move when there is limited time left on their contract, but how a player conducts himself while under contract is a different matter. A recent example of a player being sold for below his value is Christian Eriksen at Tottenham Hotspur. Eriksen had told the club he wanted to leave in 2019 and it was reported that Daniel Levy then valued him at over £130M. However, he left for Inter Milan in the January 2020 Transfer Window for only £18M. Of course, Eriksen’s contract was expiring in the summer of 2020, but if Eriksen had not publicly said he wanted to leave, Tottenham may have got a better deal.

Commercial Purchases

The commercial reasons behind a purchase could also affect a player’s transfer fee. A football club might buy a ‘commercial’ player without necessarily intending to play him. With many clubs competing for international fans, clubs may buy players from countries with ‘untapped’ fan bases such as Japan, USA or China in order to promote the club’s brand in that country and build ties with its fan base. The benefit of such a purchase may be increased overseas revenue from shirt sales, online subscriptions etc. It is therefore difficult to decide what value to put on that player. Are you judging them on playing ability, or on their ability to help gain international fans and grow the club’s brand?

Desperate Situations

Sometimes clubs end up in situations they might not have foreseen. This could be through players leaving or players getting injured, and as such, they are forced to act quickly in the transfer market to find a replacement. The buying club is then in a difficult situation, as selling clubs know this and can charge a higher fee knowing the buying club might be left with no alternative solution. An example of this was the transfer of Will Grigg to Sunderland in January 2019. The ‘Sunderland till I Die’ TV show on Netflix gave an insight into how Sunderland purchased Will Grigg after top goal scorer Josh Maja moved to Bordeaux earlier in the window. Sunderland identified Will Grigg as the replacement but with Wigan Athletic not wanting to sell and Sunderland in a position of desperation, their offer increased by 210%, from £1M to up to £3.1M, smashing the League One record transfer fee in the process. This shows that desperate situations can influence transfer fees. In Section 4.1 Ideas for Future Analysis, I hypothesise that the point in the window at which the transfer was made could act as a proxy for how desperate the club’s situation is.

3. Modelling

In this section I will discuss my final model and the conclusions that I have drawn from it. I will also interrogate the model from various perspectives, highlighting areas for potential future improvement. I will then go on to show how the model performs against relevant statistical tests and then test the model against a new sample. Finally, I will conclude by making some predictions regarding the potential transfer of Jadon Sancho.

The modelling in this section was performed using EViews and the section contains output from this statistical package.

3.1 Model Basics

The first step before I can start modelling is to look closely at the dependent variable. In order to decide which method of modelling is appropriate, I have charted up a histogram of the number of transfers in each £10M segment, i.e. the number of transfers between £0M and £10M, between £10M and £20M and so on. This histogram (Histogram 1) can be seen below. We can see from Histogram 1 that the dependent variable is highly skewed towards lower transfer fees. Modelling a dependent variable which is highly skewed is problematic - in our case, the model would be prone to giving unreliable estimates for higher transfer fees. One way of getting around the issue is to use a logarithmic transformation on the dependent variable.

Histogram 1 - left, Histogram 2 - right.

If we log our dependent variable, our variable is transformed in a way that reduces the skew. Looking at Histogram 2, (even though the x-axis is now on a logarithmic scale), the variable no longer has the same level of skew. Because of this, it makes statistical sense to use the log of the transfer value as the dependent variable in the model.
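The effect of the log transformation on skew can be illustrated with a short Python sketch. The fee values below are invented purely for illustration (they are not the 62 transfers in the sample), and `skewness` is a helper written for this example rather than anything taken from the EViews output.

```python
import math

def skewness(values):
    """Skewness (third standardised moment); a positive value means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

# Hypothetical transfer fees in £M: many small fees and a few very large ones.
fees_millions = [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 40, 70]
log_fees = [math.log(fee) for fee in fees_millions]

raw_skew = skewness(fees_millions)
log_skew = skewness(log_fees)
print(f"skew of raw fees:    {raw_skew:.2f}")
print(f"skew of logged fees: {log_skew:.2f}")
```

For data shaped like this, the raw fees show a strong positive skew while the logged fees are close to symmetric, which is the pattern described for Histogram 1 and Histogram 2.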

I have decided to use the Ordinary Least Squares method of modelling to perform my regression analysis with a logged dependent variable. Not only does this solve my problem with skewed data, but it gives me the chance to capture multiplicative effects between variables. I will show that the model passes the assumptions of Ordinary Least Squares in Section 3.4 OLS Assumptions and Tests and why it is therefore appropriate.

3.2 Final Model

I am now all set to perform my modelling. I will spend time testing variables in my model and different combinations of these variables together until settling on a solution which I am confident in, making sure that not only are the variables statistically valid, but their real-life interpretation is logical.

Below is the output of my final model:

I will firstly give more detail on each of the variables in the model and the corresponding interpretation.

Dummies

The purpose of the dummy variable is to quite simply ‘dummy out’ players who I have judged to be outliers. These were players who seemed to defy any variables I used to explain their transfer fees. I therefore did not want my model to be influenced by such players and go on to provide unreliable estimates.

Two of the players I have had to individually dummy out were Jordan Ayew and Sam Byram. Their respective transfer fees were £2.52M and £747K. I was unable to find variables which could explain why both transfers were so cheap. In the case of Jordan Ayew, while he was 27 years old at the time of the transfer and with less than a year left on his contract at Swansea, his average player rating was 6.84. This is higher than both Danny Ings (Player Rating - 6.77, Transfer Fee - £19.98M) and Mateo Kovacic (Player Rating - 6.70, Transfer Fee - £40.5M) and goes some way in explaining why my model struggled to explain his low transfer fee. This suggests that £2.52M was a bargain for Crystal Palace.

The reason behind the grouping of the two other dummy variables of Tyrone Mings and Lloyd Kelly, was that I believed their transfer fees to be related. My model struggled to explain why both fees were so high (£20.1M and £13.3M respectively). My hypothesis behind this was with Lloyd Kelly arriving at Bournemouth on July 1st costing £13.3M and Tyrone Mings moving to Aston Villa from Bournemouth a week later for £20.1M, either Bournemouth knew they were getting a large sum of money from the sale of Mings and recognised they could pay more for Kelly, or they had spent a lot on Kelly and hence needed to drive up the price for Mings to recoup money from their earlier investment. Either way, one fee goes some way in explaining the other. It is then no surprise that both of these transfers have come out as outliers in the model with similar size coefficients.

Age

The next variable in my final model is age squared. I tested the raw age variable and also logged, squared and square rooted age variables. It was age squared which was most statistically significant and is in my model.

As the coefficient for this variable is negative, this tells us that the older a player gets, the more the transfer fee will reduce. In the graph above you can see the difference between age and age squared. If it was the raw variable which was the most significant, it would indicate that for each year a player gets older, a set percentage $X$ will be deducted from their transfer fee. However, as age squared was the most significant variable in our model, it implies that the percentage lost from the fee grows with each year the player gets older, i.e. it isn’t a set percentage $X$ lost for each additional year, but an amount that increases as the player ages.

Analysing the Coefficient

The formula $\Delta y = 100 \cdot ( e^{B_1} -1)$ gives an approximation of the expected percentage change in transfer fee for a one-unit change in our explanatory variable ‘age squared’, where $B_1$ is the variable’s coefficient. Using the coefficient from the model above for Age Squared (-0.001341), we can calculate that a one-unit change in Age Squared will result in a reduction of 0.134% in transfer fee.
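Because the explanatory variable is age squared, one extra year of age is not a one-unit change: moving from age $a$ to $a+1$ changes age squared by $2a+1$ units. The following is a minimal Python sketch of what the reported coefficient implies per birthday; the function names are chosen for this example only.

```python
import math

AGE_SQUARED_COEF = -0.001341  # coefficient reported in the model output

def pct_change_per_unit(coef):
    """Percentage change in fee for a one-unit change in the explanatory variable."""
    return 100 * (math.exp(coef) - 1)

def pct_change_for_birthday(age):
    """Percentage change in fee when a player goes from `age` to `age + 1`."""
    delta_age_squared = (age + 1) ** 2 - age ** 2  # equals 2 * age + 1
    return 100 * (math.exp(AGE_SQUARED_COEF * delta_age_squared) - 1)

print(round(pct_change_per_unit(AGE_SQUARED_COEF), 3))  # -0.134
for age in (20, 25, 30):
    print(age, round(pct_change_for_birthday(age), 1))
```

Under these figures, a birthday costs roughly 5% of the fee at age 20, rising to roughly 8% at age 30, all else held constant.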

Contract Length

The variable which best explained the effects of contract length in the model was the number of days left on a contract at the time of transfer logged. In the graph below you can see the raw variable (Contract Length) plotted against the modified variable (Contract Length Logged). The log variable being the most significant, implies that the effects of contract length are best accounted for when the variable is compressed. An incremental increase in contract length does not have the same size incremental effect on transfer fee.

As the variable’s coefficient is positive, this indicates the more days left on a contract, the higher the transfer fee. This is logical as clubs try not to let players run down their contract and try to sign young players up to long term deals.

Analysing the Coefficient

As this explanatory variable is logged, the interpretation of the coefficient is different from that of the other explanatory variables. Log-Log relationships are power relationships. Therefore, the rule we need to follow is that if the explanatory variable (contract length) is multiplied by a factor ‘ $c$ ’, our dependent variable (Transfer Fee) is multiplied by ‘ $c^{B_1}$ ’, where ${B_1}$ is the coefficient 0.943415.

So for example, if there is a 10% increase in a player’s contract length ( $c = 1.1$ ), while holding all other variables constant, the transfer fee would be multiplied by approximately $1.1^{0.943415}=1.09408$ , an increase of 9.4%.
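The same power rule can be written as a short Python sketch. It assumes, as the interpretation above does, that the logarithm of contract length uses the same (natural) base as the logged transfer fee; `fee_multiplier` is a name introduced for this example.

```python
CONTRACT_COEF = 0.943415  # coefficient on logged days remaining, from the model output

def fee_multiplier(contract_ratio, coef=CONTRACT_COEF):
    """Multiplicative change in fee when contract length is scaled by `contract_ratio`."""
    return contract_ratio ** coef

print(round(fee_multiplier(1.10), 4))  # 10% more days remaining
print(round(fee_multiplier(2.00), 4))  # twice as many days remaining
```

Because the coefficient is slightly below 1, doubling the days remaining a little less than doubles the estimated fee.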

Player Ability

The variables which explain the greatest amount of variation in the model are the variables for player ability. These were the average ratings from ‘whoscored.com’ based on the previous two seasons along with an additional upweighting by the standard of football played (based on the league ranking from ‘globalfootballrankings.com’).

In the model, each variable is split out by position. I will talk about the repercussions of doing so in Section 3.4 OLS Assumptions and Tests. The headline finding here is that all variables have positive coefficients, hence the higher the player’s ability, the higher the transfer fee.

Analysing the Coefficient

The formula $\Delta y = 100 \cdot ( e^{B_1} -1)$ gives an approximation of the expected change in transfer fee if there is a change in the explanatory variable ‘player ability’ of one unit. Because of the multicollinearity present between the player rating variables, discussed in Section 3.4 OLS Assumptions and Tests, I am unable to draw conclusions between the coefficients of the player ability variables. Therefore, to analyse the coefficients from the model above I have taken an average coefficient of the four player rating variables. As $e^{1.16049525}=3.1915$ , a one-unit change in Player Ability will result in an approximate increase of 219% in transfer fee.

This implies that if a player had an average rating (after the upweight has been applied for league standard) of 9 and an estimated transfer fee of £10M, then their transfer fee would increase to £31.9M if their player ability rating increased to 10, while keeping all other variables constant.
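The averaging step can be reproduced with a few lines of Python. The four coefficients are the ones listed in Section 3.6 Model Equation; the variable names are chosen for this sketch.

```python
import math

# Player ability coefficients by position, as reported in the model output.
ABILITY_COEFS = {
    "goalkeeper": 1.175252,
    "defender": 1.102124,
    "midfielder": 1.159117,
    "striker": 1.205488,
}

average_coef = sum(ABILITY_COEFS.values()) / len(ABILITY_COEFS)
multiplier = math.exp(average_coef)    # fee multiplier for +1 ability point
pct_increase = 100 * (multiplier - 1)

print(round(average_coef, 8))  # 1.16049525
print(round(multiplier, 4))
print(round(pct_increase))
```

The multiplier applies to any starting fee, so a £10M estimate becomes roughly £31.9M after a one-point rise in the upweighted rating.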

Champions League since 2010

After testing many variables regarding club wealth and club status, I found that the variable which had the highest T-Statistic for a club effect was the number of times the clubs had qualified for the Champions League since 2010. The fact that the coefficient is positive implies that it has a positive effect on transfer price. This tells us that clubs who have qualified for the Champions League since 2010 pay more for the same player than a club who has not. This is somewhat logical as clubs who qualify for the Champions League have increased revenue, therefore the buying clubs will have to pay a premium as the selling club knows that they can afford it.

Analysing the Coefficient

The formula $\Delta y = 100 \cdot ( e^{B_1} -1)$ gives an approximation of the expected change in transfer fee if there is a change in the explanatory variable ‘number of Champions League qualifications since 2010’ by one unit. Using the coefficient from the model (0.077945), we can calculate that $e^{0.077945}=1.081$ , therefore a one-unit change will result in an approximate increase of 8.1% on a transfer fee.

This suggests that Chelsea (8 Champions League qualifications since 2010) pay an additional 8.1% on players over Arsenal (7 Champions League qualifications since 2010).
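A small Python sketch shows how this club effect compounds when two clubs differ by more than one qualification. The qualification counts for Chelsea and Arsenal are those quoted above; `club_premium` is a name introduced for this example.

```python
import math

CLUB_COEF = 0.077945  # coefficient on Champions League qualifications since 2010

def club_premium(qualifications_a, qualifications_b, coef=CLUB_COEF):
    """Ratio of the fee club A is estimated to pay versus club B for the same player."""
    return math.exp(coef * (qualifications_a - qualifications_b))

print(round(club_premium(8, 7), 3))  # Chelsea versus Arsenal
print(round(club_premium(8, 0), 3))  # 8 qualifications versus a club with none
```

The effect is multiplicative, so a gap of eight qualifications implies a premium of roughly 87% rather than eight times 8.1%.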

3.3 Model Interrogation

One way that I have interrogated the model is by reordering all the variables (dependent and explanatory) by different factors. I have then been able to look at my Actual VS Fitted from various perspectives. I have reordered the variables by Transfer Fee, Player Position, Age, Contract Length and the League Position of the buying club from the previous season. While doing this, I reworked the model and now the Actual VS Fitted’s shown below are all from the final model.

The following Actual VS Fitted are all from the same model, I have just reordered the variables in the model for a different perspective.

Transfer Fee

In the plot below you can see the Actual values (Transfer Fee Logged), plotted against the model’s fitted value. The Transfer Fee Logged and all other variables have been reordered by Transfer Fee from lowest to highest. Each highlighted area represents a cross section of the KPI, with the size of each section being dependent on how many players are in each cross section.

Key

Purple – Transfer Fees Below £5M - 12 Players
Green – Transfer Fees Between £5M and £10M - 12 Players
Red – Transfer Fees Between £10M and £20M - 17 Players
Blue – Transfer Fees Between £20M and £30M - 13 Players
Orange – Transfer Fees Above £30M - 8 Players

Actual VS. Fitted

The Actual VS Fitted when ordered by Transfer Fee are satisfactory. However, in the first cross-section (Purple), where the transfer fee was lower than £5M, the model is mostly overestimating and in cross-section four and five (Blue and Orange), the model is mostly underestimating. After seeing the residuals structured in this way, I attempted to create a variable which would explain this effect. I believed the problem was a club effect, that bigger, wealthier clubs pay more for players. However, after creating variables for the number of trophies a club has won, wages of a club and financial size of a club, none of these variables explained this missing variation. I looked at the individual players who were being over and underestimated and to my surprise, both cross sections contained players who had transferred to clubs from both the top and lower end of the league table. Therefore, I concluded that it was not a club effect that I was missing but a player effect.

My conclusion is that the variation I am not explaining is due to the simple economic theory of supply and demand. In the football market there is a large pool of players from which a club can buy. In this large talent pool, the higher the ability of the player, the fewer players there are. My model is overestimating when there is a larger talent pool of players of ‘low quality’ and underestimating when there is a small talent pool of ‘high quality’ players. The supply is high for low ability players but demand is low, therefore the transfer fee is lower. Conversely, supply for high ability players is low and the demand is high, which leads to fees becoming more expensive.

My model does not account for this talent pool effect - I need a variable which accounts for the breakdown of the proportion of players in this hypothetical talent pool. One which shows scarcity of ‘high ability’ players and an abundance of ‘low ability’ players. The hypothesis here is that there are a lot of players to choose from who are lower ability and the buying club therefore has a lot of options and can ‘shop around’ until they find a player whose price they are comfortable paying. However, as there are fewer players of high quality, buying clubs are often competing against other clubs to sign the player which drives up the price they will pay.

I therefore acknowledge that in its current state, the model’s estimates may slightly overestimate lower transfer fees and underestimate higher transfer fees. This will be an area I focus on improving in my next analysis.

League Position

I also looked at the Actual VS Fitted by league position of the buying club from the previous season, with the teams promoted from the Championship being classed as Positions 16 to 20.

Key

Purple – League Positions 1 to 5 - 12 Players
Green – League Positions 6 to 10 - 18 Players
Red – League Positions 11 to 15 - 12 Players
Blue – League Positions 16 to 20 - 20 Players

Actual VS. Fitted

From this perspective, the model performs very well. In the Red section, however, you can see that the model overestimates for four data points in a row. This is not too serious and I believe it to be down to chance. I am therefore happy that the model is not over or underestimating due to the club the player signs for.

Playing Position

I then reordered my Actual VS Fitted by player position. The following cross sections are Goalkeepers, Defenders, Midfielders and Strikers.

Key

Purple – Goalkeeper - 2 Players
Green – Defender - 20 Players
Red – Midfielder - 27 Players
Blue – Striker - 13 Players

Actual VS. Fitted

When we view the model from a player position perspective, we can see that it performs very well. There is no obvious cross section which the model is over or under estimating. I am confident that the model estimates are therefore not over or underestimating by position.

Age

I then reordered my Actual VS Fitted by age - I’ve split the cross sections into age ranges which have similar size samples in each.

Key

Purple – Aged 21 and Below - 15 Players
Green – From 22 to 23 - 17 Players
Red – From 24 to 26 - 19 Players
Blue – Aged 27 and Above - 11 Players

Actual VS. Fitted

The model performs well when analysed from an Age perspective. However, in the blue cross section (players who are aged 27 and above), 7 data points from the sample of 11 are overestimated. This perhaps indicates that a variable which penalises further for older age could help to explain the remaining variation. Despite this, I am again confident in the model in that its estimates are not skewed due to age.

Contract Length

I then reordered my Actual VS Fitted by contract length. I’ve split the cross sections into less than 1 year remaining, between 1 and 2 years remaining, between 2 and 3 years remaining and above 3 years remaining.

Key

Purple – Less than 1 Year - 17 Players
Green – Between 1 and 2 Years - 21 Players
Red – Between 2 and 3 Years - 14 Players
Blue – Above 3 Years - 10 Players

Actual VS. Fitted

This perspective also shows the model performing well and indicates the model is accounting for the variation caused by contract length sufficiently.

3.4 OLS Assumptions and Tests

In order to be confident in the output and estimates from my model, I need to check the model satisfies the assumptions of Ordinary Least Squares (OLS) Modelling. If these assumptions hold, OLS modelling creates the best possible estimates.

I’ll go through each assumption one by one to test the model validity.

Assumption 1: The regression model is linear in the coefficients and the error term.

Model Validity: In statistics, a regression model is linear when all terms in the model are either constant or a parameter multiplied by an independent variable. This restricts the model to be one type:

$Y= \beta_0 + \beta_1X_1+\beta_2X_2+...+\beta_kX_k+\epsilon$

The model fits this type, as the betas ( $\beta_n$ ’s) are linear values and our model contains an error term $\epsilon$ .
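Written in this form, and using the symbols defined in Section 3.6 Model Equation, the final model would look roughly as follows. It is shown for a midfielder, and the outlier dummy terms are collected into $\delta D$ , which is shorthand introduced here rather than notation from the model output:

$\ln(Fee) = C + \beta_1 \cdot PAM + \beta_2 \cdot \log(CL) + \beta_3 \cdot CLUB + \beta_4 \cdot AGE^{2} + \delta D + \epsilon$

Although $AGE^{2}$ and $\log(CL)$ are non-linear transformations of the raw data, each enters multiplied by a single coefficient, so the model remains linear in the coefficients.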

Assumption 2: The error term has a population mean of zero.

Model Validity: The error term accounts for the variation in the dependent variable which the explanatory variables do not explain. For my model to be unbiased, the average value of the error term must equal zero. The constant present in the model forces the mean of the residuals to equal zero, hence our model passes this assumption.

Assumption 3: All explanatory variables are uncorrelated with the error term.

Model Validity: If an explanatory variable and the error term are correlated, then a variable could be missing from the model. Violating this assumption may give biases to coefficient estimates. I have looked at the residuals from various perspectives, tested numerous models and have been unable to find any more variables that should statistically be included in the model. Hence there is no variable bias that I am aware of.

Assumption 4: Observations of the error term are uncorrelated with each other (no autocorrelation).

Model Validity: While autocorrelation is rare in non-time series modelling, I was still wary of this assumption. In Section 3.3 Model Interrogation, I tested for autocorrelation. By viewing the residuals from different perspectives and observing whether there were any periods where the model continually under or overestimated, I found that in most perspectives, no autocorrelation was present. A small problem was seen, however, when I reordered the Actual VS Fitted by Transfer Fee. I could see that for lower fees the model slightly overestimates and for higher transfer fees it slightly underestimates. There is not much I can do with the data currently available apart from being aware of this and looking to address it in future analysis.

Assumption 5: The error term has a constant variance (no heteroscedasticity).

Model Validity: The variance of the errors should be consistent for all observations. If the variance changes, then that is referred to as heteroscedasticity. The easiest way to check this assumption is to create a Residuals vs Fitted value plot, seen below.

Heteroscedasticity appears as a cone shape where the spread of the residuals increases in either direction. As no such cone shape exists, the model passes this assumption.

Assumption 6: No explanatory variable is a perfect linear function of other explanatory variables (no multicollinearity).

Model Validity: This assumption is a test of multicollinearity. Firstly, perfect multicollinearity is when two variables have a correlation of 1. Essentially, one variable can be transformed to the other through a scalar. EViews does not allow two variables with a perfect correlation to enter a model, so I know that the model passes the assumption in this case.

However, having high multicollinearity is also a problem. This is when variables are either highly positively or negatively correlated. If this is present in the model, then it can reduce the precision of the estimates. One way to check for multicollinearity is by checking the Variance Inflation Factors (VIF) of the model.

In the table above, you can see the VIF for my model. A common rule of thumb is that if a VIF is greater than 10, multicollinearity is present in the model. As our four player ability variables have VIFs much higher than 10, we have multicollinearity present in our model. Understanding why the problem exists is quite straightforward. Those four variables are highly negatively correlated with each other. For example, the defender player ability variable has 0 for players from other positions and then a score (usually between 7 and 11) for defenders. Conversely, a midfielder player ability variable has 0 for players from other positions, and then a score for midfielders. It is therefore logical that such variables are highly negatively correlated.

Because of the multicollinearity present in the model, I am unable to draw conclusions between the coefficients of the four player ability variables. As the variables are so similar, it is difficult for the model to determine what size effect each variable is having on the dependent variable. It is not all doom and gloom though; multicollinearity does not reduce a model’s overall predictive power and therefore the final equation is still valid. The solution to the multicollinearity problem is to model by individual position, as then you would only have one player ability variable in each model. However, I am unable to do this because of the size of the sample, e.g. I only have two goalkeepers in my current sample. This is a consideration for future analysis and is discussed in more detail in Section 4.1 Ideas for Future Analysis.
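The negative correlation between position-split variables can be seen in a small Python sketch. The players and scores below are hypothetical and only two positions are used; the VIF line relies on the standard identity VIF = 1 / (1 - R²), which in a model with just two explanatory variables reduces to 1 / (1 - r²). This is an illustration of the mechanism, not a reproduction of the EViews VIF table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical players: (position, ability score). Not taken from the sample.
players = [("DEF", 8.1), ("MID", 9.0), ("DEF", 7.6), ("MID", 8.4),
           ("MID", 9.7), ("DEF", 8.8), ("MID", 7.9), ("DEF", 9.2)]

# Position-split variables: the score for that position, 0 for everyone else.
defender_ability = [score if pos == "DEF" else 0.0 for pos, score in players]
midfielder_ability = [score if pos == "MID" else 0.0 for pos, score in players]

r = pearson(defender_ability, midfielder_ability)
vif = 1 / (1 - r ** 2)  # two-regressor case only
print(f"correlation between the two split variables: {r:.2f}")
print(f"implied VIF in a two-variable model: {vif:.1f}")
```

Whenever one variable is non-zero the other is zero, which is what drives the correlation towards -1 and the VIF well above the rule-of-thumb threshold of 10.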

Assumption 7 (optional): The error term is normally distributed.

Model Validity: Ordinary Least Squares does not require the error terms to follow a normal distribution to produce unbiased estimates. However, by satisfying this assumption my model will also generate reliable confidence intervals. We can use a normal probability plot to determine whether the residuals are approximately normally distributed.

In the plot above, we can see that the residuals lie on the straight line fairly well. As the bulk of the data points follow the line, this implies that the error term is approximately normally distributed. The tails on either end of the line do indicate the model has a slight ‘left skew’ but this is a very mild effect and hence the model passes this assumption.

3.5 Model Conclusions

Now that I am confident my model satisfies the assumptions of OLS regression, I can comment on the model’s performance. Below is the output of the model again:

R Squared and Adjusted R-squared

Firstly, from the output above you can see the model has an R-squared of 83.4% (the R-squared is the percentage of variation explained by the model). A value of 100% indicates the model explains all the variation in the dependent variable. The Adjusted R-squared, however, is a better measure of a model’s performance as it does not assume that every variable contributes to explaining variation in the model, it essentially penalises for having variables which do not improve the model. A good Adjusted R-squared value is quite difficult to define, as a good Adjusted R-squared score in one model may be a bad Adjusted R-squared score in another. This is reliant on the amount of variation present in the dependent variable. Despite this, for the amount of variation present in my dependent variable and being limited in the number of variables I can add, I believe an Adjusted R-squared of 80.2% to be quite good.
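The penalty that the Adjusted R-squared applies can be made concrete with the standard formula, sketched below in Python. The R-squared (83.4%) and sample size (62) come from the text; the count of 10 explanatory variables is an assumption made for this example based on the variables described in Section 3.2, not a figure stated in the model output.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept and `n_regressors` explanatory variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# R-squared and sample size are from the text; 10 regressors is an assumed count.
print(round(adjusted_r_squared(0.834, 62, 10), 3))

# Adding a variable that explains nothing extra lowers the adjusted value.
print(adjusted_r_squared(0.834, 62, 11) < adjusted_r_squared(0.834, 62, 10))  # True
```

With the rounded R-squared as input, this lands at roughly 0.80, in line with the reported 80.2%; the small gap would be consistent with rounding of the inputs.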

Standard Error of Regression

Another test of a model’s performance is by calculating the Standard Error of Regression (this value represents the typical distance a data point falls from the model).

My model has a Standard Error of Regression of 0.461014. As this value is in relation to the values of the dependent variable, the way to evaluate the model’s performance is to compare the Standard Error of Regression (0.461014) to the Mean Dependent Variable (16.34505). Any Standard Error below 5% of the Mean Dependent Variable is performing well.

In my example, 16.345 $\cdot$ 5% = 0.81725, and as 0.461 < 0.81725, the model is performing well.

Final Comments

Due to the overall size of my sample, I am restricted in the number of variables I can include due to fears around overfitting the model. I am happy that the model passes the assumptions of OLS and hence its estimates are valid. I did expect the model to struggle in some areas and I understand its strengths and weaknesses and where I can improve in future analysis.

3.6 Model Equation

Below are the final equations from the model which can be used for evaluating a player’s estimated transfer fee. Firstly, the Key below explains details about the variables and should be read before using an equation.

• Our constant denoted by ‘C’ = 3.734864.

• Number of days left on Contract is denoted by ‘CL’, has the coefficient of 0.943415.

• Player Ability $\cdot$ League Rating (League ratings can be found in Appendix) are denoted by ‘PAG’, ‘PAD’, ‘PAM’ and ‘PAS’ for Goalkeepers, Defenders, Midfielders and Strikers respectively, with corresponding coefficients, 1.175252, 1.102124, 1.159117, 1.205488.

• Age of a player at time of transfer is denoted by ‘AGE’. It enters the model squared, with the coefficient -0.001341.

• The number of times the buying club has qualified for the Champions League since 2010 is denoted by ‘CLUB’, with a coefficient of 0.077945.

I could have included all of the above in one equation, but I have split out by position to keep it simpler and easier to use.

Estimated Transfer Fee for Goalkeepers:
$= e^{(3.734864+1.175252\cdot PAG+0.943415\cdot \log(CL)+0.077945\cdot CLUB-0.001341\cdot AGE^{2})}$

Estimated Transfer Fee for Defenders:
$= e^{(3.734864+1.102124\cdot PAD+0.943415\cdot \log(CL)+0.077945\cdot CLUB-0.001341\cdot AGE^{2})}$

Estimated Transfer Fee for Midfielders:
$= e^{(3.734864+1.159117\cdot PAM+0.943415\cdot \log(CL)+0.077945\cdot CLUB-0.001341\cdot AGE^{2})}$

Estimated Transfer Fee for Strikers:
$= e^{(3.734864+1.205488\cdot PAS+0.943415\cdot \log(CL)+0.077945\cdot CLUB-0.001341\cdot AGE^{2})}$

The equations above will give an estimate of the transfer fee of a player. A full example of how to use this equation can be seen in the Appendix, Section 5.3 Jadon Sancho Detailed Example.
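For readers who prefer code to formulae, the four equations can be wrapped in a single Python function, sketched below. Two assumptions are worth flagging: the text does not state the base of $\log(CL)$, so the natural logarithm is assumed here, and the outlier dummies are left out because they are zero for any player not dummied out. The function name, argument names and example inputs are hypothetical.

```python
import math

CONSTANT = 3.734864
CONTRACT_COEF = 0.943415
CLUB_COEF = 0.077945
AGE_SQUARED_COEF = -0.001341
ABILITY_COEFS = {"goalkeeper": 1.175252, "defender": 1.102124,
                 "midfielder": 1.159117, "striker": 1.205488}

def estimated_fee(position, ability, contract_days, cl_qualifications, age):
    """Estimated transfer fee from the Section 3.6 equations.

    `ability` is the average rating after the league upweighting has been applied.
    Assumption: log(CL) is the natural logarithm; the text does not state the base.
    """
    log_fee = (CONSTANT
               + ABILITY_COEFS[position] * ability
               + CONTRACT_COEF * math.log(contract_days)
               + CLUB_COEF * cl_qualifications
               + AGE_SQUARED_COEF * age ** 2)
    return math.exp(log_fee)

# Hypothetical inputs, chosen only to show the call; not a player from the sample.
base = estimated_fee("midfielder", ability=7.0, contract_days=730,
                     cl_qualifications=0, age=24)
longer = estimated_fee("midfielder", ability=7.0, contract_days=803,
                       cl_qualifications=0, age=24)
print(f"fee ratio for 10% more contract days: {longer / base:.4f}")
```

Comparing two calls that differ in only one input is a quick way to check the coefficient interpretations from Section 3.2.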

3.6 January 2020 Transfer Window

I thought that looking at the 2020 January Transfer Window would be a good way to test my model’s accuracy. As the model has not been built with any of these players involved, it would be interesting to see how it would perform. Firstly, there are three caveats that need mentioning:

1: The model has no variable which accounts for the date of the transfer or any possible monetary inflation between the model sample (Summer Window 2019) and test sample (Winter Window 2020). But as these two periods are only 6 months apart, its estimates should remain reliable.

2: Another caveat is that the model is only based on summer transfers and we may see that the model over or underestimates players who transfer in January. Perhaps clubs are more desperate and pay higher fees or maybe they get a cheap deal on an unhappy player? Any such variation cannot be accounted for by my model as it does not contain any players who transferred in January.

3: My model estimates what it judges to be the correct transfer fee. If the model’s estimate is different to that of the actual transfer fee, it does not necessarily imply poor model performance but it may be that the buying club got a good or bad deal.

Results

Out of the 40 transfers made in January 2020, only 9 players remained after I removed any loan deals and players who I was unable to get the necessary data on. Below is the full list of players I evaluated with my model’s estimated fee, the actual fee and the difference between the two:

Player Name	Predicted Fee	Actual Fee	Numerical Difference	Percentage Difference
Aaron Mooy	£12,175,321	£5,400,000	£6,775,321	125.47%
Bruno Fernandes	£52,998,906	£49,500,000	£3,498,906	7.07%
Darren Randolph	£2,737,384	£4,230,000	-£1,492,616	-35.29%
Giovani Lo Celso	£41,414,126	£28,800,000	£12,614,126	43.79%
Ignacio Pussetto	£16,396,962	£7,200,000	£9,196,962	127.74%
Josh Brownhill	£7,410,125	£9,000,000	-£1,589,875	-17.67%
Lukas Rupp	£2,371,114	£450,000	£1,921,114	426.91%
Sam McCallum	£1,138,568	£3,740,000	-£2,601,432	-69.56%
Steven Bergwijn	£28,563,319	£27,000,000	£1,563,319	5.79%

There are two ways you can look at the table above. The first is by looking at whether the buying club has got a good deal, indicated by the positive difference in the final column. The second is using it to evaluate whether my model performs well and when my model does not perform well, is there an obvious reason why not?
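The two difference columns can be reproduced with the short Python sketch below. It assumes the percentage difference is expressed relative to the actual fee, which matches the figures in the table; `fee_difference` is a name introduced for this example.

```python
def fee_difference(predicted, actual):
    """Return (numerical difference, percentage difference) of predicted versus actual fee."""
    numerical = predicted - actual
    percentage = 100 * numerical / actual
    return numerical, percentage

# Two rows from the table above.
rows = {
    "Aaron Mooy": (12_175_321, 5_400_000),
    "Darren Randolph": (2_737_384, 4_230_000),
}
for name, (predicted, actual) in rows.items():
    numerical, percentage = fee_difference(predicted, actual)
    print(f"{name}: {numerical:,} ({percentage:.2f}%)")
```

A positive result means the model valued the player above the fee paid; a negative result means the club paid more than the model's estimate.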

Before we go in depth about the performance of the model, we need to provide some background on the transfer of Giovani Lo Celso. Tottenham originally paid Real Betis a £14.4M loan fee in the 2019 Summer Transfer Window but opted to sign Lo Celso on a permanent deal in January 2020. Therefore, he technically signs for Tottenham on 1st July 2020 but part of his loan fee will cover the football he plays from January 2020 until the end of 2019/2020 season. If you include his loan fee paid at the beginning of the season with the money spent by Tottenham in January, the model estimate is more in line with the actual transfer fee.

Overall, the results are what I expected. I knew from my Actual VS Fitted interrogation that my model struggled to evaluate players with lower transfer fees and this is an area I aim to improve on next time around. The poorest estimates are those of Aaron Mooy, Ignacio Pussetto and Lukas Rupp, where my model overestimates the actual transfer fee. Digging deeper into the transfer of Mooy, when you investigate his individual statistics you can see why the model predicts a larger transfer fee. Mooy had over 2 years left on his contract and an average rating of 6.73, which is comparable to that of Alex Iwobi (player rating - 6.74, transfer fee - £27.4M) and Leander Dendoncker (player rating - 6.74, transfer fee - £12.2M), but he was 29 years old. What I believe to be happening here is that the transfer exposes two weaknesses in my model at the same time: it overestimates lower transfer fees and it overestimates the transfer fees of older players. The combination of both weaknesses has led to the large overestimation.

One transfer fee which my model underestimated was that of Darren Randolph. This can be explained however by the reasons behind the purchase. West Ham’s first choice keeper, Lukas Fabianski, was out with a long-term injury and the selling club would have known the situation of West Ham needing to sign a replacement. The situation of the buying club is a potential factor for future modelling discussed in more detail in Section 4.1 Ideas for Future Analysis.

It is worth noting that the model also performs very well with its estimates of Bruno Fernandes, Steven Bergwijn and Josh Brownhill, and it is also worth remembering that the fee the club has paid is not necessarily the correct one. This does indicate, however, that there is still room for improvement with the model, especially in predicting players with lower transfer fees.

3.7 Jadon Sancho

To finish off my model analysis, I decided to use my model to make a prediction about the potential transfer of Jadon Sancho. Jadon Sancho is one of the hottest young properties in football right now (he also looks set to be the subject of a bidding war this summer).

Again, the caveat here is that this model was built on transfers from the previous summer, so it does not account for any financial changes year on year or what effect Covid-19 might have on club finances. The estimate is essentially what Sancho would have cost if the transfer happened in 2019 (based on his up to date performances). As the transfer is only a year after the sample period, the model’s prediction should still be reasonably reliable.

With brilliant performances for Dortmund over the two previous seasons (even without the highest upweighting of playing in the Premier League), Sancho’s average player score with Bundesliga upweighting is higher than anything the model has seen previously. Combined with him being 20 years old and having over 2 years left on his contract, the model’s estimate is certainly going to be big. For the purposes of this estimate, I have hypothetically assumed he signs for Manchester United, as my model also incorporates an effect from the buying club due to the ‘number of times a club has qualified for Champions League since 2010’ variable. My model therefore predicts his transfer fee to be £112M. If this is the case, the British transfer record may be broken this summer and, looking at his stats, quite rightly so.

I have also included the Jadon Sancho transfer fee calculations in the Section 5.3 Jadon Sancho Detailed Example.

4. Conclusions and Future Analysis

4.1 Ideas for Future Analysis

During the project I often thought of new variables I would like to test in my model. I could already see, however, that my model performed well and passed statistical tests and furthermore, I knew that with my limited sample size I may be overfitting my model if I added in more variables. Consequently, I decided to make a note of extra variables that I plan to investigate in future analysis when I have a larger sample of players to adequately test the variables against.

I go into detail below about each hypothetical variable and my thought process behind why it could potentially be having an effect on a player’s transfer fee.

Using Different Leagues

Firstly, I would be interested in looking at transfers from different leagues as this would be a good way of increasing the current sample. Data is readily available on Premier League players and if the same level of data is available across different leagues, then this is something I would like to investigate. I could either have simple dummy variables to indicate which league the player has transferred to or have different models entirely for different leagues. I could then test to see if different factors affect a transfer fee based on the league. I might see that the same factors have an effect but with different strengths of coefficient. For example, I could test to see if age has a weaker effect on a player’s transfer fee when signing for a club in Serie A than it does for a player signing for a club in the Premier League.

Modelling by Position

Ideally, modelling by position would be more logical. As referenced in Section 3.4 OLS Assumptions and Tests, the model has a high level of multicollinearity in its current state - this is due to player ability variables being split by player position. Modelling by position would solve this problem but it was not possible with the current sample sizes. Nevertheless, if we had much larger samples then modelling by position would give more insight into other factors such as age. It is often referenced that goalkeepers peak later in their career and if my models were split by position, we could test to see if goalkeeper transfer fees are less affected by older age.

Specific Playing Characteristics

By using the average ratings for player ability from ‘whoscored.com’, I am not testing the importance of any individual player characteristics, only an overall figure. The website has individual breakdowns of each of the statistics they use to calculate their overall player rating. By using statistics such as how a player performs against expected goals (among many other characteristics), we could test and create a model that contains unique characteristics, rather than an overall score. My model would then decide which characteristics indicate larger transfer fees rather than using an all-encompassing score.

Use Heatmaps for Positions

The way football is played is continuously changing and one impact this has had is on traditional players’ positions. A right back from one team might have an entirely different role to that of another team. Trent Alexander-Arnold’s average heatmap over the 2019/20 season shows he spent more time in the opposition half than his own. My variable for player ability groups all Goalkeepers, Defenders, Midfielders and Strikers together within their own categories when, in fact, some individual players may be more closely aligned to a different position. One area I would like to look at is using average heatmaps to group players by where they play on the pitch. The position categories will then contain players who are better suited to each other. Therefore, the characteristic variables (mentioned previously) that I test will have a better chance of explaining variation in the specific model. For example, the characteristic of ‘strong ability in long passes’ might not be statistically significant for an overall defender model which contains both centre backs and full backs. However, if centre backs and full backs were split into two different models (based on their heatmaps), we may see the characteristic now has a statistical effect in the full back model, as part of the modern full back’s role is crossing, switching the play and feeding balls down the line.

Percentage of Games Played

The current variable in the model is an average of a player’s rating over the two previous seasons, which gives a large enough sample for most players. For some players, however, either through not being in the team or through injury, this might only result in a handful of games. The smaller the sample, the less reliable the variable is at predicting the player’s ability. In its current state, the variable could hypothetically be exploited by a player who has made 1 appearance across the previous 2 seasons and happened to score 2 goals, gaining a match rating of 7.5. This rating would then also be their overall average rating. Therefore, to combat this, I would like to create and test a new variable which gives an upweighting for players who have played in a high percentage of games. Alternatively, for players who have featured in a small number of games, I could try to increase their sample by including games from an additional previous season.
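One possible shape for such a variable is sketched below. This is only an illustration and not something tested in the model: the function names, the assumption of 38 league games per season, and the maximum upweight of 10% are all hypothetical choices made for the example.

```python
GAMES_PER_SEASON = 38  # assumption: a 38-game league season, as in the Premier League


def share_of_games_played(appearances: int, seasons: int = 2,
                          games_per_season: int = GAMES_PER_SEASON) -> float:
    """Fraction of available league games the player appeared in (capped at 1)."""
    available = seasons * games_per_season
    return min(appearances / available, 1.0)


def upweighted_rating(avg_rating: float, appearances: int,
                      max_upweight: float = 0.10) -> float:
    """Scale the average rating by up to `max_upweight` (hypothetical value)
    in proportion to the share of games played."""
    share = share_of_games_played(appearances)
    return avg_rating * (1.0 + max_upweight * share)


# The one-appearance player from the text versus a regular starter.
one_off = upweighted_rating(avg_rating=7.5, appearances=1)
regular = upweighted_rating(avg_rating=7.2, appearances=70)
print(f"One appearance: {one_off:.3f}, regular starter: {regular:.3f}")
```

Under these assumed settings the one-appearance player’s rating is left almost unchanged, while the regular starter’s lower raw average is lifted above it. Whether an upweight of this form actually helps would need to be tested against a larger sample.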

International Caps

Another variable which I would like to test is whether players have represented their respective countries at a national level. Players who have played international football may have played in high pressure, important matches with the hopes of a nation on their shoulders. It stands to reason that international players may therefore have a higher transfer fee because of this. The variable could simply be a dummy variable for whether the player has played at international level, but a better variable would also include a weighting dependent on the strength of the country they represented. It is obviously much harder for a player to get an international cap for Belgium (currently 1st in the FIFA rankings) than it is for Anguilla (currently 210th and bottom of the FIFA rankings). Therefore, a variable that uses both the FIFA rankings and the number of international caps would be best at explaining this potential effect.
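As a rough sketch of how caps and rankings could be combined, the example below weights each cap by the ranking position of the country. The function name and the linear weighting are hypothetical and chosen purely for illustration; the total of 210 ranked nations is taken from the Anguilla example above.

```python
def international_score(caps: int, fifa_rank: int, total_ranked: int = 210) -> float:
    """Caps weighted by country strength.

    The weight runs linearly from 1.0 for the top-ranked country down to
    1/total_ranked for the bottom-ranked one (a hypothetical weighting).
    """
    if not 1 <= fifa_rank <= total_ranked:
        raise ValueError("fifa_rank must be between 1 and total_ranked")
    weight = (total_ranked - fifa_rank + 1) / total_ranked
    return caps * weight


# Ten caps for the top-ranked country versus ten caps for the bottom-ranked one.
print(international_score(caps=10, fifa_rank=1))
print(international_score(caps=10, fifa_rank=210))
```

A simple dummy variable would treat both of these players identically, whereas a weighted score separates them. Other weightings (for example, based on ranking points rather than positions) may well be more appropriate.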

Date of Signing

In Section 2.5 Unquantifiable Metrics, I discussed clubs potentially getting into ‘desperate situations’ and referenced Sunderland’s acquisition of Will Grigg. As there is no quantifiable metric on ‘how desperate a club is’, one way of measuring the effect could be by using a proxy variable, e.g. one which looks at what date in the Transfer Window the club signed the player. The idea here is that the closer the date gets to the deadline, the more likely the buying club is to pay a higher fee, as the selling club will be reluctant to sell when it has limited time left to find a replacement. This variable could then potentially explain this desperation effect and test the hypothesis that transfer fees may be inflated due to being a last-minute purchase.
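A minimal sketch of this proxy is shown below. The function names, the three-day threshold for a ‘late’ signing and the example dates are assumptions for illustration only; the deadline is passed in as a parameter so that any window can be used.

```python
from datetime import date


def days_before_deadline(signing_date: date, deadline: date) -> int:
    """Number of days between the signing and the window deadline."""
    return (deadline - signing_date).days


def is_late_signing(signing_date: date, deadline: date, threshold_days: int = 3) -> int:
    """Dummy variable: 1 if signed within `threshold_days` of the deadline, else 0.
    The threshold is a hypothetical choice."""
    remaining = days_before_deadline(signing_date, deadline)
    return int(0 <= remaining <= threshold_days)


# Illustrative dates only.
deadline = date(2019, 8, 8)
signing = date(2019, 8, 5)
print(days_before_deadline(signing, deadline), is_late_signing(signing, deadline))
```

Either the raw number of days or the dummy could be tested in the regression; the raw count keeps more information, while the dummy is easier to interpret.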

Age Variable

An easy alteration for my next round of modelling is to be more specific with the age variable. The raw variable I used was ‘age in years at time of the transfer’. Using ‘age in days at the time of transfer’ is more precise and may therefore provide more reliable estimates.
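The more precise age variable is straightforward to compute from a date of birth and a transfer date, as the sketch below shows. The function names and the example dates are illustrative assumptions, and dividing by 365.25 is just one common convention for converting days into fractional years.

```python
from datetime import date


def age_in_days(date_of_birth: date, transfer_date: date) -> int:
    """Exact age in days at the time of the transfer."""
    return (transfer_date - date_of_birth).days


def age_in_fractional_years(date_of_birth: date, transfer_date: date) -> float:
    """Age in years with a fractional part (365.25 days per year convention)."""
    return age_in_days(date_of_birth, transfer_date) / 365.25


# Illustrative dates only: a player born on 25 March 2000,
# transferring on a hypothetical date of 1 August 2020.
dob = date(2000, 3, 25)
transfer = date(2020, 8, 1)
print(age_in_days(dob, transfer), round(age_in_fractional_years(dob, transfer), 2))
```

With whole years, two players born eleven months apart can share the same age value; the fractional version separates them.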

Player History

There are three variables I would like to test to capture the effect of player history. The first is looking closer at the player’s previous transfer history and whether the player has made a big money move earlier in their career, or more specifically, in their previous transfer. The hypothesis here is that if the selling club spent big on a specific player and now plans to sell them, they will want to recoup most of the money from when they first bought the player. This is also dependent on how recent the purchase was. For example, a club who have bought a player for a £50M fee the previous summer will not be willing to sell the player cheaply in the following window. With the right variable it would be an interesting effect to attempt to capture.

The second variable which would try to capture the effects of a player’s history would look at the selling club and the league they play in. Are there some clubs, or perhaps even some leagues, which charge higher fees for their players? Maybe clubs prefer buying players from leagues which are deemed to be more similar in style to their own, and hence are confident the player will fit into the league’s style of play. For example, the Premier League is often cited as being more physical than other leagues and pundits often question whether certain players will be able to deal with said physicality. With a large enough sample, you could test a dummy variable for every selling club, but more realistic would be a dummy variable for the league the selling club plays in. This variable would also test whether a transfer fee is increased due to a player transferring from a team within the same league.

The final variable for a player’s history is the effect of selling a player to a rival club. Clubs who sell players to rivals (either historic rivals or a club with similar season goals) are likely to demand an inflated transfer fee. I thought of this effect during my modelling period when looking at the transfer of Harry Maguire. He transferred from Leicester City to Manchester United for a fee of £78.3M. My model struggled to explain why his fee was so high. Perhaps, as both Leicester City and Manchester United would have had similar aims at the start of the season (both would have been targeting a Top 4 position to guarantee Champions League football), the resulting transfer fee was inflated. By selling the player to a rival club, you’re not only making your team weaker, you’re also strengthening the rival’s team and decreasing the likelihood of achieving your own club’s targets.

4.2 Conclusion

This project was completed to test my own hypotheses and the many clichés you hear in football. Although the sample size is small, it has given me much insight for evaluating future players. My findings were in line with my initial thoughts about transfer fees and the four key factors I have found that may affect a player’s transfer fee are: Age, Ability of the Player, Contract Length and the Club the player signs for. My next step is to grow my sample through adding in player transfer fees from January and the coming 20/21 Premier League Summer Transfer window. This will give me the opportunity to test the variables mentioned above, and as my sample grows, the more nuances I will be able to capture in my modelling.

It is worth bearing in mind that one aspect which will almost certainly affect my future modelling is Covid-19. It has been widely reported that its impact will produce a ‘buyer’s market’. As clubs struggle with their finances, many will not be able to afford to spend large amounts of money. Furthermore, many clubs may need to balance the books by offloading players for lower than their market value. This suggests that player transfer fees may become cheaper than we have seen in previous seasons. We may even see a shift in strategy for clubs. Perhaps they will prioritise short term solutions rather than a young but expensive, upcoming star in the current economic climate? Going forwards, I will attempt to evaluate all the subtleties of the impact of Covid-19, and in future analysis I look forward to the challenge of accounting for its effect.

5. Appendix

5.1 Available Leagues

The list below shows all the leagues covered by https://www.whoscored.com. I was therefore unable to evaluate any player who did not transfer from one of the leagues listed below.

English Premier League, German Bundesliga, Spanish LaLiga, Italian Serie A, French Ligue 1, Dutch Eredivisie, Turkish Super Lig, American Major League Soccer, Russian Premier League, Brazilian Brasileirão, English Championship, Argentine Superliga, Portuguese Liga NOS, Chinese Super League, English League One, English League Two, Champions League, Europa League.

5.2 League Rankings

The data in the table below is based on the SPI (Soccer Power Index) for clubs from the statistical website https://fivethirtyeight.com. FiveThirtyEight has a ranking of every football club across multiple leagues. Global Football Rankings have then found an average for each league based on the average SPI ratings of the clubs within it. This gives an overall league ranking based on the strength of the teams within it.

I’ve changed the scale through a Log transformation. I have then calculated the percentage upweight from the bottom league to each individual league. The lowest league here is English League One, as that was the lowest league to feature in my sample. For the full list, visit the website at https://www.globalfootballrankings.com/.

League Name	Average SPI	Log of SPI	Percentage Upweight
Barclays Premier League	74.67	1.873146	40.1%
Spanish Primera Division	72.78	1.862012	39.3%
German Bundesliga	70.17	1.846151	38.1%
Italy Serie A	65.08	1.813448	35.7%
French Ligue 1	61.96	1.792111	34.1%
Russian Premier Liga	52.07	1.716588	28.4%
Brasileiro Série A	51.60	1.712650	28.1%
Portuguese Liga	49.56	1.695131	26.8%
English League Championship	49.17	1.691700	26.6%
Mexican Primera Division Torneo Clausura	45.90	1.661813	24.3%
Dutch Eredivisie	45.44	1.657438	24.0%
Swiss Raiffeisen Super League	45.44	1.657438	24.0%
Austrian T-Mobile Bundesliga	44.88	1.652053	23.6%
Belgian Jupiler League	44.26	1.646011	23.1%
Turkish Turkcell Super Lig	43.74	1.640879	22.8%
Argentina Primera Division	43.66	1.640084	22.7%
Chinese Super League	42.01	1.623353	21.4%
Danish SAS-Ligaen	40.19	1.604118	20.0%
Greek Super League	39.75	1.599337	19.7%
Major League Soccer	38.91	1.590061	19.0%
German 2. Bundesliga	34.55	1.538448	15.1%
Japanese J League	33.81	1.529045	14.4%
Spanish Segunda Division	33.75	1.528274	14.3%
Scottish Premiership	32.52	1.512151	13.1%
Swedish Allsvenskan	31.97	1.504743	12.6%
French Ligue 2	30.45	1.483587	11.0%
Australian A-League	29.14	1.464490	9.6%
Norwegian Tippeligaen	29.04	1.462997	9.5%
Mexican Primera Division Torneo Apertura	24.89	1.396025	4.4%
Italy Serie B	24.86	1.395501	4.4%
South African ABSA Premier League	22.06	1.343606	0.5%
English League One	21.71	1.336660	0.0%
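The transformation behind the table above can be sketched as follows. The use of a base-10 logarithm is inferred from the table values (for example, the base-10 log of 74.67 is approximately 1.873146), and the function and variable names are illustrative. Only a subset of leagues is included for brevity.

```python
import math

# Average SPI values copied from the table above (subset only).
AVERAGE_SPI = {
    "Barclays Premier League": 74.67,
    "German Bundesliga": 70.17,
    "English League Championship": 49.17,
    "English League One": 21.71,
}

BASELINE_LEAGUE = "English League One"  # lowest league to feature in the sample


def percentage_upweight(league: str) -> float:
    """Upweight of `league` relative to the baseline league, as a fraction.

    Computed as log10(SPI) / log10(baseline SPI) - 1, which matches the
    Percentage Upweight column (base 10 is an inference from the table).
    """
    log_spi = math.log10(AVERAGE_SPI[league])
    log_baseline = math.log10(AVERAGE_SPI[BASELINE_LEAGUE])
    return log_spi / log_baseline - 1.0


for name in AVERAGE_SPI:
    print(f"{name}: {percentage_upweight(name):.1%}")
```

Taking the ratio of logs rather than the ratio of raw SPI values compresses the gap between leagues, so the top league is upweighted by around 40% rather than by more than 200%.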

5.3 Jadon Sancho Detailed Example

So that you can evaluate a player yourself, I’ve provided an in-depth look at how to use the equations from Section 3.6 Model Equation. For this example, I will use the hypothetical transfer of Jadon Sancho.

To evaluate any player transfer you only need to know six details about them.

Their Age in years at the time of transfer. Found on https://www.transfermarkt.co.uk/.
The number of days remaining on their Contract. Found on https://www.transfermarkt.co.uk/.
Their average ability on https://www.whoscored.com for the previous two seasons.
What League they play in and the Percentage Upweight from Section 5.2 League Rankings.
The club they sign for or the hypothetical club they sign for and the number of times that club has qualified for the Champions League since 2010. Found in Section 5.4 Champions League Qualification. Use a 0 if the club does not appear on this list.
Their Position on the field. This is sometimes tricky to determine; I use https://www.whoscored.com for their positions.

So, at the time of writing, the details for Jadon Sancho are as follows:

Age (AGE) - 20
Number of days left on contract (CL) - 804
Player Ability and position - Average of 7.526333333
League Percentage Upweight - German Bundesliga, 38.1%
Signing Club (CLUB) - Hypothetically, Manchester United (7 Champions League Qualifications since 2010)
Position - Midfielder

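By using the values discussed above, I can calculate the (PAM) as $7.526333333 \times (1 + 38.1\%) = 10.39512908$. The result is consistent with using the unrounded upweight (approximately 38.117%) rather than the rounded 38.1% shown in the table.

So as Sancho is classified as a midfielder, I use the equations from Section 3.6 Model Equation as shown below.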

Estimated Transfer Fee For Midfielders: $= e^{(3.734864+1.159117\cdot PAM+0.943415\cdot \log(CL)+0.077945\cdot CLUB-0.001341\cdot AGE^{2})}$
Hence the equation becomes: $= e^{(3.734864+1.159117(10.39512908)+0.943415\log(804)+0.077945(7)-0.001341(20^{2}))}$
This results in: $= e^{18.53411}$
Which is equal to: $=£112,011,485$
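The calculation above can be reproduced with the short sketch below. The coefficients are copied from the midfielder equation; the function and parameter names are illustrative. Note that the logarithm of contract length is taken as base 10 here, which is an inference: it is the base that reproduces the exponent of 18.53411 shown above.

```python
import math

# Midfielder coefficients copied from the equation above.
INTERCEPT = 3.734864
COEF_PAM = 1.159117
COEF_LOG_CL = 0.943415
COEF_CLUB = 0.077945
COEF_AGE_SQ = -0.001341


def estimate_midfielder_fee(pam: float, contract_days: int,
                            club_cl_count: int, age: float) -> float:
    """Estimated transfer fee in pounds for a midfielder."""
    exponent = (
        INTERCEPT
        + COEF_PAM * pam
        + COEF_LOG_CL * math.log10(contract_days)  # base 10 is an inference
        + COEF_CLUB * club_cl_count
        + COEF_AGE_SQ * age ** 2
    )
    return math.exp(exponent)


# Jadon Sancho inputs from the worked example.
fee = estimate_midfielder_fee(pam=10.39512908, contract_days=804,
                              club_cl_count=7, age=20)
print(f"Estimated fee: £{fee:,.0f}")
```

With these inputs the result should come out at roughly £112M, in line with the figure above. Changing `club_cl_count` to 0 shows how much of the estimate is attributable to the buying club.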

5.4 Champions League Qualification

The table below shows the number of times a club has qualified for the Champions League group stage since 2010. This data is from http://www.myfootballfacts.com.

Club	Number of Champions League Group Stages Qualifications since 2010
Manchester City	9
Chelsea	8
Manchester United	7
Arsenal	7
Tottenham Hotspur	5
Liverpool	4
Leicester City	1
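For anyone applying the model, the table above can be held as a simple lookup that returns 0 for any club not listed, as described in Section 5.3. The dictionary and function names below are illustrative.

```python
# Values copied from the table above.
CL_QUALIFICATIONS = {
    "Manchester City": 9,
    "Chelsea": 8,
    "Manchester United": 7,
    "Arsenal": 7,
    "Tottenham Hotspur": 5,
    "Liverpool": 4,
    "Leicester City": 1,
}


def club_variable(club: str) -> int:
    """Return the CLUB value for the model; clubs not in the table get 0."""
    return CL_QUALIFICATIONS.get(club, 0)


print(club_variable("Manchester United"))
print(club_variable("West Ham United"))
```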

5.5 Players in the Sample

Aaron Wan-Bissaka, Adam Webster, Alex Iwobi, Allan Saint-Maximin, André Gomes, Angeliño, Anwar El Ghazi, Ayoze Pérez, Bailey Peacock-Farrell, Ben Osborn, Björn Engels, Bruno Jordão, Callum Robinson, Che Adams, Craig Dawson, Daniel James, Danny Ings, David Luiz, Dennis Praet, Douglas Luiz, Emil Krafth, Erik Pieters, Ezri Konsa, Fabian Delph, Gonçalo Cardoso, Harry Maguire, Ismaïla Sarr, Jack Clarke, James McCarthy, Jay Rodríguez, Jean-Philippe Gbamin, João Cancelo, Joelinton, Jordan Ayew, Jota, Kortney Hause, Leander Dendoncker, Lloyd Kelly, Luke Freeman, Lys Mousset, Mateo Kovacic, Matt Targett, Moise Kean, Neal Maupay, Nicolas Pépé, Oliver McBurnie, Pablo Fornals, Patrick Cutrone, Pedro Porro, Philip Billing, Raúl Jiménez, Rodri, Ryan Sessegnon, Sam Byram, Sébastien Haller, Sepp van den Berg, Tanguy Ndombélé, Tom Heaton, Trezeguet, Tyrone Mings, William Saliba, Youri Tielemans.

5.6 Players Excluded from the Sample

Albian Ajeti, Arnaut Danjuma, Daniel Adshead, Gabriel Martinelli, Jack Stacey, James Justin, Kieran Tierney, Leandro Trossard, Marvelous Nakamba, Matt Clarke, Moussa Djenepo, Pedro Neto, Ryotaro Meshino, Wesley, Zack Steffen.

A Modelling Analysis of Transfer Fees from the 2019/20 Premier League Season

Callum Littler
