World Cup Elo Part 1: Predictive Efficacy of Prior Results

Part 2 can be found here, and a followup discussing the eventual observed World Cup results is found here.

The FIFA World Cup is the biggest spectacle in the world of sports and entertainment. With tens of billions of dollars bet on its results, it presents a fascinating predictive challenge. Its structure differs from that of domestic sports leagues, which offer regular and consistent competitive results that we can use to gauge future performance. National teams play each other infrequently, and rosters often shift substantially between meaningful competitions. True World Cup predictions have to go far beyond standard sabermetric models and consider a wide array of input information.

This provides an interesting opportunity to consider the practical limits of an analytical model in such a challenging situation. We consider the efficacy of predicting national team strength using solely the World Cup results of the past 20 years. The core of this short project was originally written in five days as part of a curricular project (while knowing little about international football), hence some of the initial restrictions, but it has been expanded since. The original paper goes into more detail; here we highlight some of its main points. This post describes the setup, challenges, and dataset, and Part 2 describes some of the choices made in tuning the model (with Part 3 to come).

Challenges

We start with a brief overview of why common statistical approaches are poorly suited to this problem. This will be fairly obvious to most, but these challenges inform any more successful approach and need to be kept in mind. A naive approach to predicting the outcome of a match might be to consider the past results between the two teams in question. However, the sparsity of the dataset makes this impossible. Of the 48 matches played in the 2018 group stage, only 14 involve a pair of teams that have played even a single game against each other in our dataset. This is not solely due to the vast number of national teams and the infrequency of scheduled matches, but also because the qualifying and group selection process specifically encourages matchup diversity. Qualification is done by region, so a large share of a team’s total games will be against the same consistent competition, while group selection encourages a variety of regional representation.
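The head-to-head sparsity check described above can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs, not the code used for the project: the function names and the toy team labels are hypothetical, and the real dataset would supply the lists of past matches and upcoming fixtures.

```python
from collections import Counter


def pair_key(team_a, team_b):
    """Order-independent key, so (A, B) and (B, A) count as the same pairing."""
    return frozenset((team_a, team_b))


def count_prior_meetings(history, fixtures):
    """For each upcoming fixture, count past matches between the same two teams."""
    seen = Counter(pair_key(a, b) for a, b in history)
    return {(a, b): seen[pair_key(a, b)] for a, b in fixtures}


# Hypothetical toy data (placeholder team labels, not the real dataset).
history = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
fixtures = [("A", "B"), ("B", "D"), ("D", "C")]

meetings = count_prior_meetings(history, fixtures)
with_history = sum(1 for n in meetings.values() if n > 0)
print(f"{with_history} of {len(fixtures)} fixtures have any prior meeting")
```

Run against the real match list, a count like this is what would give the 14-of-48 figure quoted above.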

The closed nature of regional qualification highlights the primary challenge in this task. Almost any measurable result in football matters only relative to its competition. As a toy example, compare the basic performance metrics of Australia and Argentina, split between their qualification matches and their group stage matches in our dataset.

Qualification Performance

	Wins	Draws	Losses	Mean Goals For	Mean Goals Against
Australia	36	11	10	3.14	0.84
Argentina	39	19	11	1.69	0.91

Group Stage Performance

	Wins	Draws	Losses	Mean Goals For	Mean Goals Against
Australia	2	2	6	1.22	2.22
Argentina	9	2	1	1.92	0.58

This is an insultingly trivial example (no one truly expects a simple comparison of average goals scored to be perfectly predictive, and it looks at the dataset as a whole rather than considering temporal shifts), but the magnitude of the result is still striking. In qualification matches, Australia is completely dominant, with their mean result being better than a decisive 3-1 blowout, while Argentina has a consistent but comparatively pedestrian winning record. However, in the group stage, Argentina’s performance actually improves, while Australia becomes a decidedly losing team. At the start of this period, Australia’s route to qualification ran through the Oceania Football Confederation (OFC), which is largely composed of small island countries with minimal football infrastructure (Australia moved to the Asian confederation in 2006), while Argentina has to qualify in the elite CONMEBOL South American region, whose ten members include several top tier programs. The average result for Australia is inflated by matches such as their 31-0 defeat of American Samoa, while Argentina is forced to grind out tough matches against Brazil and Uruguay.
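For readers who want to reproduce this kind of split, here is a minimal Python sketch of the per-stage summary. The tuple layout `(stage, home, away, home_goals, away_goals)` and the function name `summarize` are assumptions made for illustration, and the toy rows are invented placeholders rather than real results.

```python
from collections import defaultdict


def summarize(matches, team):
    """Per-stage W/D/L counts and mean goals for/against for one team."""
    stats = defaultdict(lambda: {"W": 0, "D": 0, "L": 0, "gf": 0, "ga": 0, "n": 0})
    for stage, home, away, home_goals, away_goals in matches:
        if team == home:
            gf, ga = home_goals, away_goals
        elif team == away:
            gf, ga = away_goals, home_goals
        else:
            continue  # match does not involve this team
        s = stats[stage]
        s["W" if gf > ga else "D" if gf == ga else "L"] += 1
        s["gf"] += gf
        s["ga"] += ga
        s["n"] += 1
    return {
        stage: {
            "W": s["W"], "D": s["D"], "L": s["L"],
            "mean_gf": s["gf"] / s["n"], "mean_ga": s["ga"] / s["n"],
        }
        for stage, s in stats.items()
    }


# Invented placeholder rows (teams X, Y, Z), not taken from the dataset.
toy_matches = [
    ("qualifying", "X", "Y", 3, 0),
    ("qualifying", "Z", "X", 1, 1),
    ("group", "X", "Z", 0, 2),
]
summary = summarize(toy_matches, "X")
print(summary)
```

The point of the comparison stands regardless of implementation: the same summary statistic means very different things depending on the opposition that produced it.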

This ultimately presents a circular challenge. Every single recorded aspect of a match result is only meaningful relative to the strength of competition, and yet our goal is to estimate the strength of these teams. This is a task best suited to iteratively updating systems. For this problem, we use the Elo rating system, the most widespread and popular rating system used in competition since its invention. Elo rating systems are far from perfect, and we will see that this is another example of an occasionally problematic fit. But Elo rating systems provide numerous advantages.

Elo is naturally self correcting. A surprising result causes a larger shift in its perception of the strength of each team, so large errors tend to be smoothed out over time, with sufficient data. Elo is not necessarily well suited for delicate inference tasks when the data violates its underlying assumptions, but Elo on the whole is extremely robust to these inaccuracies.
Elo has a vast array of customization. In common parlance, people frequently refer to “the Elo model”, which, depending on context, is probably correct. But it does lead some people to miss that Elo simply defines a class of rating systems. The constructor of the model has to make a large number of important and ultimately subjective choices in order to complete the model. This customizability allows Elo systems to adapt to a wider array of problems, and also makes this a worthwhile project, because it’s difficult to understand the choices inherent in constructing an Elo system until one tries it in action. Constructing this model will also better inform us when we hear about Elo models in other contexts in the future.

A Brief Introduction to Elo

Elo ratings provide a broad framework for continuously updating the relative strength of each national team at the time of each successive match, which we will need in order to use the past results for future predictions. The core setup is as follows. Each team has a rating (which we denote $R$) that is updated after each match they play. If team A and team B (with ratings $R_A$ and $R_B$) play each other in a match, we can calculate the Expected Score for team A ($E_A$, shown in Equation \ref{eqn:eloes}) in that match. Here, score is a function of the match result (not the goals scored), which in its base form is generally $S_A=1$ for a win, $S_A = 1/2$ for a draw, and $S_A=0$ for a loss (we will see in a later section that we can adjust this concept of “score” to account for margin of victory). Then, once we observe the result of the match (which has observed score $S_A$), we can use this, the expected score ($E_A$), and a tuning parameter $K$ weighting the match (which we will discuss in detail in later sections) to update the Elo rating for team A to its new value, $R_A^*$ (shown in Equation \ref{eqn:eloupdate}, with a corresponding update for team B).

\begin{align} E_A &= \frac{1}{1 + 10^{-(R_A - R_B)/400}}\label{eqn:eloes}\\
R_A^* &= R_A + K(S_A - E_A)\label{eqn:eloupdate} \end{align}
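As a concrete illustration, the two equations above can be written as a short Python sketch. The function names and the default $K = 20$ are placeholders chosen for this example only (the value of $K$ is a tuning choice, not something fixed by the equations). The update for team B uses $S_B = 1 - S_A$ and $E_B = 1 - E_A$, so its rating moves by the same amount in the opposite direction.

```python
def expected_score(r_a, r_b, scale=400.0):
    """Expected Score for team A against team B (first equation above)."""
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / scale))


def update(r_a, r_b, s_a, k=20.0):
    """Return the new ratings (R_A*, R_B*) after A scores s_a (1, 0.5 or 0).

    k=20 is an arbitrary placeholder, not a tuned value.
    """
    delta = k * (s_a - expected_score(r_a, r_b))
    # B's score and expectation are the complements of A's, so B moves by -delta.
    return r_a + delta, r_b - delta


# An upset (the 1400-rated team beats the 1600-rated team) ...
upset_a, upset_b = update(1400.0, 1600.0, s_a=1.0)
upset_shift = upset_a - 1400.0

# ... versus the expected result (the 1600-rated team beats the 1400-rated team).
fav_a, fav_b = update(1600.0, 1400.0, s_a=1.0)
expected_shift = fav_a - 1600.0

print(f"upset shift: {upset_shift:.2f}, expected-result shift: {expected_shift:.2f}")
```

With these placeholder numbers the upset moves each rating roughly three times as far as the expected result does.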

This is a naturally self correcting system, because an upset causes a large shift in rating for both teams, while a result that is expected will cause a much smaller shift in ratings. This property is particularly appealing to us, because we know that a primary challenge will be that we naturally expect the strength of national teams to shift in ways different than if we were measuring the rating of an individual. As players retire and new ones join, the team composition itself can change dramatically between competitions, so we want a rating system that works to quickly adjust to data that conflicts with its previous belief. Further, if two teams play each other endlessly with a fixed probability of match outcomes, their Elo ratings will reach equilibrium, rather than endlessly diverging, even if one team has a winning record. In fact, they will reach the equilibrium where that winning percentage corresponds to the Expected Score between this pair of Elo ratings. Certain rating systems tend to blindly reward teams for playing additional matches, while this system only cares about the quality of a team’s results. The $400$ term in Equation \ref{eqn:eloes} determines the scale of the rating system. This is the constant used by the FIDE chess rating system, and it implies that a team with a $100$ point advantage in Elo rating has an Expected Score of about $0.64$, which is a fairly intuitive scale to work with.
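To make the equilibrium claim concrete, here is a short worked derivation. It assumes team A's average observed score against team B is a fixed value $p$ with $0 < p < 1$ (the symbol $p$ is introduced here for illustration). The ratings stop drifting on average when the expected update is zero:

\begin{align} \mathbb{E}[R_A^* - R_A] &= K(p - E_A) = 0 \quad\Longrightarrow\quad E_A = p\\
\frac{1}{1 + 10^{-(R_A - R_B)/400}} &= p \quad\Longrightarrow\quad R_A - R_B = 400\log_{10}\left(\frac{p}{1-p}\right) \end{align}

For example, $p = 0.64$ gives a gap of roughly $100$ points, matching the scale described above, and $p = 0.75$ gives a gap of roughly $191$ points. Note that this is an equilibrium in expectation: with a fixed $K$, the ratings will still fluctuate around this gap from match to match.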

A Brief Overview of Some Issues with Elo

(Quoted from original paper).

Elo rating systems are not without their flaws. For example, one study of Elo ratings in chess showed that the system tends to underrate the chance of an upset in very lopsided matches. A primary reason hypothesized for this is that weaker players tend to improve more quickly between tournaments than stronger players, which means the algorithm will tend to underestimate the weaker opponent's chances in a match. This by itself is not likely to be a major concern for national teams in soccer. In chess, each player tends to improve as their career progresses, while this cannot universally be true among soccer national teams, as the pool of top competitors is largely fixed (besides political shifts in country definitions) and their skill is determined relative to the pool of teams. Thus, we should be concerned that the strength of teams fluctuates between World Cups (which it does), but it is unlikely to be systematically true that all teams tend to improve over time, as team strength is relative to a fixed pool of national teams.

Precise analysis of Elo ratings requires assumptions about the dataset that are unlikely to be exactly true, but the system is somewhat robust against these inaccuracies, due to its self correcting nature. We assume that each team has some true strength at a given moment in time, which we cannot directly measure. The crucial Expected Score calculation (Equation \ref{eqn:eloes}) assumes that each team has the same standard deviation for their observed performance in a given match (which is randomly distributed around their true strength at that time) \cite{glickman}. This is a core assumption that may not precisely fit our data, as it is difficult to rule out, given such limited data, that some national teams have a higher standard deviation of observed performance than others. Further, Elo ratings assume that shifts in the true strength of a team are gradual over time. This depends on the time frame that one considers, but among soccer national teams this is unlikely to always be true. Sometimes a large number of players will retire between World Cups, or for a specific match, a crucial star player may be missing due to injury. Unfortunately, it is entirely possible for a national team to have a rapid shift in true strength. We note that this will prove problematic no matter our approach. It pushes us toward weighting extremely recent matches more heavily rather than taking a broader look at past performance. It is reasonable to place a high weight on recency, but given the sparsity of our dataset, and the inherent randomness involved in soccer, we have to strike a balance. It is trivial to find cases where a team has an excellent match on one day and plays poorly soon after, with no changes to be found between the games, as the results of a soccer match have a relatively high variance.

Thus, we can see that there are elements of Elo rating assumptions that are not precise fits for our data. However, by and large, similar assumptions are unavoidable for any insightful analysis, and an Elo rating system is well equipped to produce reasonable results even with some violation of assumptions. Ultimately, the way to address these concerns is to carefully examine our resulting model, and ensure that the results are intuitive and accurate along the way. Indeed, much of our work will come from trying a variety of Elo based approaches, and analyzing the results.

The Dataset

We consider World Cup results (group stage, knockout, and qualifying) from 2002-2018, with the data compiled by the Rec.Sport.Soccer Statistics Foundation. After removing an assortment of matches without results (qualifying matches are occasionally canceled or annulled for various reasons), we are left with 2964 complete match results. The primary data cleaning challenge is creating a complete mapping from these matches to the national teams involved. Not only does the dataset use a variety of abbreviations for each country, but many political borders have also shifted during the period it covers. Luckily, this largely applies to teams that did not qualify for the 2018 World Cup, but we cannot ignore these teams, because our model is predicated on an accurate assessment of not just the teams that qualified, but also the teams they played in order to qualify (to give their results context). Thus, there is no solution for this but a laborious manual inspection of this messy situation. Ultimately, we compile a list of 115 mappings (applying to 42 different countries), with certain selections being routine (the different ways a country can be abbreviated), and others being more impactful (when countries split). Details can be found in the paper.

The most significant choice involves the Serbian national team, which is qualified for the 2018 World Cup. Up through 2006, it competed as the joint Serbia & Montenegro national team, until Montenegro declared independence. As the majority of players stayed on the Serbian national team, we simply treat Serbia & Montenegro and Serbia as a single continuous team, and consider the Montenegro national team to be “newly formed” in 2006. While the dataset is overall sparse (due to the vast number of teams in FIFA), thankfully the 32 teams for the 2018 World Cup are reasonably well represented. The least prolific is Iceland (with 31 matches recorded), but most countries have 40 to 60 that we can use.
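A name-mapping step of this kind might look like the following Python sketch. Only the Serbia & Montenegro entry reflects a choice described in this post. The other aliases, the function names, and the `known_teams` list are hypothetical examples, and the actual 115 mappings are in the paper.

```python
# Alias -> canonical team name. Only the first entry comes from this post;
# the rest are hypothetical examples of routine abbreviation fixes.
ALIASES = {
    "Serbia & Montenegro": "Serbia",   # treated as one continuous team
    "USA": "United States",            # hypothetical alias
    "Korea Republic": "South Korea",   # hypothetical alias
}


def canonical(name):
    """Normalize whitespace, then map a raw team label to its canonical name."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned, cleaned)


def unmapped(raw_names, known_teams):
    """Canonical names that are still not recognized, for manual inspection."""
    return sorted({canonical(n) for n in raw_names} - set(known_teams))


# Hypothetical raw labels as they might appear in a messy source file.
raw = ["Serbia & Montenegro", "  USA ", "Iceland", "Icelnd"]
known_teams = ["Serbia", "United States", "Iceland"]
leftovers = unmapped(raw, known_teams)
print(leftovers)
```

Listing the leftovers rather than guessing at them keeps the manual inspection step explicit, which matters when a wrong mapping would silently merge two teams' histories.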

In the next post, we will outline the myriad choices required in constructing this model.

Written on December 5, 2017