World Cup Elo Part 2: Tuning the Model
Part 1 can be found here, and a followup discussing the eventual observed World Cup results is found here.
Elo Rating System Choices
As previously described, the Elo rating system is a broad and flexible framework, rather than a monolithic model. In fact, the primary draw of this project was to better understand the subjective choices that any person constructing such a model must make, and how we can use our intuition and domain knowledge to make plausible choices. In this post, I’ll briefly outline how one might approach some of those choices in the context of the World Cup.
Calibrating Draws
The Elo rating system only computes an Expected Score for a given matchup, which does not uniquely determine the likelihood of a draw. We consider draws worth half a point (so Expected Score $= P(\text{Win}) + P(\text{Draw})/2$), but an Expected Score of $1/2$ might correspond to a 50% chance of either team winning (and no chance of a draw), or 100% chance of a draw, or anything in between. Thus, we need to find a formula to convert from Expected Score to the chances of either team winning or a draw. There is no perfect system to do this, the best we can do is examine historical data, and find a plausble mapping.
The details are covered in Section 3.1 of the corresponding paper. In the top four leagues, about 26% of matches end in a draw. In our full dataset, only 21.7% of matches end in a draw, but in the group stage, 25.1% end in a draw. This fits out intuition, as the qualifying stages have significantly less parity, which reduces the chance of a draw. As a final ballpark estimate, we consider the draw percentages of individual matches implicitly predicted by Premiere league sportsbooks, and note that the draw percentage almost never rises above 31%, with a heavy left tail. Thus, for our initial estimate, we choose that two evenly matched teams have about a 29% chance of a draw (about one standard deviation above what we see in top leagues). Then, when we run our Elo model, we can select values of a tuning parameter (which we call $k_d$), which increases the likelihood of a draw, as shown in \ref{eqn:draws}. Again, the messiness of these models lies in the fact that there is no truly rigorous approach. However, this fits our intuition, and can be adjusted until it best fits our variety of historical data.
\begin{equation} P(\text{Draw}) = .29(12ES1/2)^{k_d}.\label{eqn:draws} \end{equation}
Incorporating Margin of Victory (MOV)
So far, when we refer to the “score” in the context of our Elo ratings, we have been referring to the match result (with $1$ denoting a win, $1/2$ a tie, and $0$ a loss). This approach is sensible, but it is possible that we are missing out on additional valuable information by ignoring the margin of victory (which is the difference between the scores of the two respective teams). We have to be careful not to overweight margin of victory, as it is match result that ultimately determines advancement (in qualifying and group stage matches, margin of victory is only ever a tiebreaker), so teams do not have as much incentive to widen a blowout lead as they do to gain the lead. Further, we have to be concerned with the concept of “garbage time”, which is a widely believed phenomena where teams will not necessarily play their hardest once flipping the game result is essentially out of reach (the idea is that if a team is down multiple goals without much time left, they may “give up” and be more likely to concede additional goals, leading to a wider blowout). Thus, we wish to test margin of victory as an alternative approach to our regular score format.
We still need to have the score range from $0$ to $1$, with $1/2$ denoting a draw. However, when using margin of victory, we can now treat narrow victories as not being worth a full point, and the same for narrow defeats being worth more than $0$ points. As mentioned before, we want to minimize the impact of “garbage time” goals on our rating, thus we want to make sure that the difference between a 1 goal win and a draw is substantially more than the difference between a 5 goal win and a 4 goal win. As before, let $S_A$ be the observed score of the match, and let $M_A$ be team A’s margin of victory (which is negative in a loss). Then, we choose our score function so that each additional goal brings the score half the remaining distance to $1$ (or $0$), as shown in Equation \ref{eqn:mov} (for $M_A=0$, we define this sum to be $0$, so $S_A=1/2$). Thus, a draw is worth $1/2$ score, winning by one goal is worth $3/4$ score, and losing by 2 goals is worth $1/8$ score, and so on. Thus, the most significant goal is still the difference between a win and a draw, and each successive goal adds more (but decreasing) weight to the significance of the result. It is not obvious that this approach is superior to that of our original simple score system, and we have to test this approach and see if it provides a better hit to the histrical data.
\begin{equation} S_A = 1/2 + \text{Sign}({M_A})\sum_{i=1}^{M_A} (1/2)^{i+1} \label{eqn:mov} \end{equation}
Weighting Historical Results
Perhaps the most fundamental challenge for national team predictions is that for any given team, the matches are a small sample stretched over a long period of time. The 2018 group stage team with the largest number of total matches in our dataset is Mexico (with 74), but most play far fewer (with the average being $48$ matches played in our dataset). We can see that Mexico’s roster has shifted dramatically during this span (indeed, no players from the 2002 squad are still on the roster), which calls into question the predictive efficacy of looking at results from the distant past. However, the best world cup nations tend to show consistent success even as rosters shift (Germany, Italy, and Brazil combine for 13 World Cup victories, while the rest of the world only has a combined 7), likely due to their infrastructure for the sport. On the other hand, if a collection of quality players mature at the same time, a roster can achieve success that is at odds with the nation’s historical strength,. For instance, in the five World Cups from 1990 to 2010, Belgium did not qualify twice, had one group stage exit, and two round of 16 exits. However, entering the 2014 World Cup, bookmakers gave them the fifth best odds to win the tournament (behind perennial powerhouses Brazil, Argentina, Germany, and Spain). This example demonstrates a major challenge that we face. Historical success is a reliable initial predictor, but it does not capture the rise of a particular roster. Indeed, en route to qualifying for the 2014 World Cup, Belgium only played 6 matches, mostly against opponents who were not strong enough to qualify for the World Cup themselves. It is hard to predict the Belgium squad’s strength based on their historical success, but it is dangerous to weight their small sample size of qualifying victories against lesser opponents too highly. Thus, we want to find a delicate balance between weighting a team’s full set of games and their smaller sample of recent games that better matches their current roster.
The main way that we account for this balance is in the $K$ weights of the Elo system (Equation \ref{eqn:eloupdate}). The shift in Elo after each match result is tuned by the $K$ parameter, with larger values placing higher weight on the match. We can adjust the $K$ values to be small for games far in the past, and large for recent games. However, our weighting need not be limited to simply the date of the match. We will also experiment with giving higher weight to main event (group stage or knockout) matches over qualification matches. This makes intuitive sense, as qualification matches are more likely to be disrupted by rushed travel, and might not be as predictive. There is also a higher likelihood of the match being comparatively unimportant to the team (this year, Brazil secured qualification with several qualification matches remaining on their schedule).
Let $c_t$ be our tuning parameter to weight match recency, and $c_q$ be our tuning parameter to weight matches which are played in the main event. In each case, a value of $1$ denotes that there is no additional weighting, and a value greater than 1 increases the weighting. Let $k_b$ be the base $K$ value (which will generally be initialized to around $10$, leading to intuitive shifts in Elo, but it will vary based on our other tuning parameters), let $Y_m$ be the year of the match we are considering, and $Y_c$ the year we are projecting for (it will ultimately be 2018, but we will test our models on earlier years). Then our weighting formula is given by
\begin{equation} K(k_b, c_t, c_q) = k_b + \frac{5}{8}(c_t1)(16(Y_cY_m)) + k_b (c_q1) \mathbb{1}\{\text{Main Event Match}\}\label{eqn:k}. \end{equation}
We have a broad choice in defining this formula, but this leads to fairly intuitive results. We note that the additional weighting (beyond the $k_b$ base value) is only present if we increase $c_q$ and $c_t$ beyond $1$, and when they are present they are additive to the base value. We construct the time modification so that at the tail end of our dataset (16 years prior), the time bonus is $0$, and it scales up to $10(c_t 1)$ in the present day. The qualifying tuning parameter simply adds a constant based on whether the match was part of the main event of the World Cup. If we include these additive values (and set $k_t, k_q > 1$), we can simply reduce $k_b$ so that the total Elo shifts do not become too dramatic. These tuning parameters will be a core part of our search for optimal parameters (Section \ref{sect:modelanalysis}), and we will select values for which they give us intuitive results.
Resulting Model
With these options laid out, we have to perform a search of the multidimensional parameter space to find the model which is the most reasonable fit with our intuition and the historical data. This process is more substantial, and can be found in the original report (in the future I might createa summary for the webpage). Ultimately, this search is less interesting than the lessons learned in setting up the model (although there is much to be discussed when comparing the efficacy of different parameter choices on the historical data). In the future, I’d like to write that up in more detail. For now, I will simply leave the results of our final model. Our final parameter choices are

For our scaling $K$ factor, we use tuning parameters of $c_q = 2, c_t = 2,$ and $k_b = 10$ (see Equation \ref{eqn:k}).

We incorporate margin of victory into our scoring system (see Equation \ref{eqn:mov}).

We select the tuning parameter for the likelihood of draws $k_d = .5$ (see Equation \ref{eqn:draws}).

We assume that home teams have a rating advantage of 90 points, and that this applies for World Cup hosts as well.
The final estimated Elo ratings entering the 2018 World Cup are as follows. These show remarkable consistency with betting markets leading into the World Cup, showing that the vast majority of our World Cup predictions can be accounted for by a simple analysis of past results, with informatin about shifting rosters and practice rumors and analysis of aging players only a very slight factor.
Team  Elo Rating 

Germany  1413.74 
Brazil  1375.23 
Spain  1351.30 
Argentina  1338.77 
France  1321.57 
Portugal  1311.52 
Colombia  1310.96 
England  1303.28 
Mexico  1301.45 
Sweden  1285.50 
Denmark  1265.82 
Belgium  1265.15 
Uruguay  1257.39 
Costa Rica  1249.41 
Switzerland  1246.65 
Serbia  1239.61 
Croatia  1239.21 
Russia  1232.75 
Japan  1214.17 
Australia  1209.51 
Iran  1204.75 
Nigeria  1194.97 
Saudi Arabia  1193.30 
Egypt  1190.45 
Tunisia  1183.82 
South Korea  1181.74 
Poland  1180.45 
Peru  1177.72 
Morocco  1165.22 
Iceland  1157.48 
Senegal  1154.91 
Panama  1103.44 