Rating players (or teams) is a topic of interest in sports analytics and, more generally, in competitive games (e.g., video gaming). A rating assigns a numerical value to a player or team based on past game results. Statistical models estimate player (or team) skills in order to match players or teams for tournaments and matches. In general, rating dynamics (how players' skills evolve) is a hot topic in competitive games.
🎮 In gaming, a ranking is a numerical ordering of players (an ordinal number), while a rating is a numerical approximation of a player's skill (a real number). If one player quits the ranking list, every player below gains one position in the rank while each player's rating remains the same.
Let me introduce you to a simple and elegant rating method, which laid the foundation of several rating systems for competitive games.
Elo’s Rating system
Árpád Élő (1903–1992) was a Hungarian-born physics professor and master chess player, which led him to create a method to rate and rank chess players. The World Chess Federation (FIDE) approved Elo's system in 1970. A few decades later, Elo's idea is still popular for rating and matching players in sports and competitive games.
Starting with some initial rating for each player, Elo assumed that each player's performance is a normally distributed random variable X (an assumption motivated by the Central Limit Theorem) whose mean (µ) can gradually change over time. Once a player's rating becomes established, the degree to which the player performs above or below their mean changes the rating score. A rating therefore reflects a player's ability relative to the population (the other players).
Elo’s rating formula
Each time a player plays against another player, their old rating R is updated by:

R′ = R + K(S − E)

where K is a constant, S is the actual score, and E is the expected winning probability of the player.
The performance or recent score, S, of a player against another player is a discrete variable that can take three different values: 1 for a win, 0 for a loss, and ½ for a tie. Thus, the update of a rating depends on whether the player wins or loses.
The K-factor is a constant that balances the deviation between actual and expected scores against prior ratings. If K is too large, playing above expectations can generate significant changes in the ratings. If K is too small, even substantial improvements will not be reflected in the ratings. The subjectivity of the K-factor has been criticized because it needs to be adapted for new versus experienced players. On the other hand, it allows you to customize your rating system.
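The update rule and the role of K can be sketched in a few lines of Python (the function name `elo_update` is my own, not part of any library):

```python
def elo_update(rating, expected, score, k=32):
    """One step of Elo's update rule: R' = R + K * (S - E).

    rating   -- the player's current rating R
    expected -- the expected score E (winning probability)
    score    -- the actual score S (1 win, 0 loss, 0.5 tie)
    k        -- the K-factor; 32 is a common choice, not a universal one
    """
    return rating + k * (score - expected)

# A 1500-rated player expected to win only 9% of the time beats the odds:
new_rating = elo_update(1500, expected=0.09, score=1, k=32)  # 1500 + 32 * 0.91
```

A larger `k` makes the same upset worth more rating points, which is exactly the trade-off discussed above.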
As more chess statistics became available, FIDE reconsidered Elo's initial premise (that a player's performance follows a normal distribution) and instead modeled the expected score as a logistic function of the rating difference between two players.
The expected winning probability (E)
With the new assumption, the expected score between two players or teams is a logistic function of their rating difference, and chess ratings employ its base-10 version.

The Elo chess expected score of a player A against player B is:

E_A = 1 / (1 + 10^((R_B − R_A)/400))

where 400 is the logistic parameter (c); the value c = 400 comes from the chess world.
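The base-10 logistic expected score is easy to compute directly (a minimal sketch; the helper name is my own):

```python
def expected_score(r_a, r_b, c=400):
    """Expected score of player A (rating r_a) against player B (rating r_b).

    c is the logistic parameter; 400 is the traditional chess value.
    """
    return 1 / (1 + 10 ** ((r_b - r_a) / c))

# Equal ratings give a 50/50 expectation; both expectations always sum to 1.
even = expected_score(1500, 1500)  # 0.5
```

Note that `expected_score(a, b) + expected_score(b, a)` is always 1, so the two players' expectations are complementary.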
💕 In the past, Tinder used a proprietary algorithm based on Elo's rating system. Yes, your profile had a secret "Elo score" deciding your popularity on Tinder. Now they use a new rating system which only they know.
In this context, the logistic function maps any real number (the rating difference) to the [0, 1] interval. To think beyond chess, let's consider teams instead of players. If the rating difference between teams A and B is zero, the probability that team A wins is exactly 0.5, as shown in the figure below. If the difference is greater or lesser than zero, then the expected winning probability becomes >0.5 or <0.5, respectively.
For example, if Bob, an average chess player, has a rating of 1500 and a stronger player, Alice, has a rating of 1900, then Bob's chance of winning is 9%:

E_Bob = 1 / (1 + 10^((1900 − 1500)/400)) = 1/11 ≈ 0.09

whereas Alice's chance of winning is 91%:

E_Alice = 1 / (1 + 10^((1500 − 1900)/400)) = 10/11 ≈ 0.91

Therefore, the reward for Bob if he wins (S = 1 in Elo's rating update formula) against Alice is:

R′_Bob − R_Bob = K(1 − 0.09) = 0.91K

while the reward for Alice, if she beats the average player, is only:

R′_Alice − R_Alice = K(1 − 0.91) = 0.09K
Playing a bit with the K-factor, it is easy to observe that Elo's formula rewards a weaker player far more for beating a stronger player.
import numpy as np
from matplotlib import pyplot as plt

K = np.arange(10, 50, 1)  # different K-factor values
E_Bob = 0.91 * K    # Bob's rating update if he wins
E_Alice = 0.09 * K  # Alice's rating update if she wins
E = np.column_stack([E_Bob, E_Alice])  # one column per player

fig, ax = plt.subplots(figsize=(6, 2.5))
ax.set(title='Reward to a player for beating another player')
ax.plot(K, E, marker='o', markersize=6, lw=0)
ax.legend(title='Players', labels=['Bob beating Alice', 'Alice beating Bob'])
ax.set_ylabel("Rating update (R' − R)")
The home-advantage factor (H) can easily be added to the expected score formula by shifting the rating difference in favor of the home team:

E_home = 1 / (1 + 10^((R_away − R_home − H)/c))
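A minimal sketch of an expected score with a home-advantage shift; the helper name and the value H = 100 are illustrative assumptions, not standard constants:

```python
def expected_score_home(r_home, r_away, h=100, c=400):
    """Expected score for the home team, shifting the rating
    difference by a home-advantage factor H.

    h = 100 is an illustrative value; real leagues calibrate it from data.
    """
    return 1 / (1 + 10 ** ((r_away - r_home - h) / c))

# With equal ratings, home advantage tips the expectation above 0.5:
e_home = expected_score_home(1500, 1500)  # ≈ 0.64
```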
The logistic parameter
The logistic parameter affects the spread of the ratings. Given the winning probability for team A,

E_A = 1 / (1 + 10^((R_B − R_A)/c))

and the winning probability for team B, calculated in the same way,

E_B = 1 / (1 + 10^((R_A − R_B)/c))

we can compare both winning probabilities by applying a bit of algebra:

E_A / E_B = 10^((R_A − R_B)/c)
It is easy to notice that for every c rating points of advantage, the probability that team A beats team B is ten times the probability that team B manages to defeat team A. Adjusting the logistic parameter c is a way to fine-tune the system for a particular sport or competition.
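The ten-to-one relationship can be checked numerically, assuming the base-10 logistic expected score used throughout this article:

```python
def expected_score(r_a, r_b, c=400):
    """Base-10 logistic expected score with logistic parameter c."""
    return 1 / (1 + 10 ** ((r_b - r_a) / c))

c = 400
e_a = expected_score(1900, 1500, c)  # team A is exactly c points stronger
e_b = expected_score(1500, 1900, c)
ratio = e_a / e_b  # ≈ 10: A's winning probability is ten times B's
```

Repeating the check with, say, `c = 200` and a 200-point gap gives the same factor of ten, which shows that c only sets the scale of the rating axis.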
Due to its simplicity, Elo's rating system is still widely used. With new computation capabilities, new rating systems have appeared.
Glicko rating system
Mark Glickman, a statistics professor at Harvard University, developed the Glicko rating system (Glickman, 1995) to improve on Elo's ideas. The Glicko system introduced a reliability measure for a player's rating, the "rating deviation" (RD), a standard deviation in statistical terms. RD decreases as a player competes more often and increases during inactivity.
The RD of a player grows with inactivity as:

RD = min(√(RD_old² + c²·t), 350)

where t is the amount of time (rating periods) since the last competition and 350 is assumed to be the RD of an unrated player. The constant c governs how fast RD grows over time.
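The RD growth under inactivity can be sketched as follows; the helper name and the example value c ≈ 34.6 (chosen so an RD of 50 returns to 350 after roughly 100 idle rating periods) are illustrative:

```python
import math

def rd_after_inactivity(rd_old, c, t):
    """Glicko rating deviation after t idle rating periods,
    capped at 350 (the assumed RD of an unrated player)."""
    return min(math.sqrt(rd_old ** 2 + c ** 2 * t), 350)

# An active player's low RD creeps back up the longer they stay away:
rd_soon = rd_after_inactivity(50, c=34.6, t=1)    # ≈ 61
rd_long = rd_after_inactivity(50, c=34.6, t=200)  # capped at 350
```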
The Glicko-2 rating system (Glickman, 2013) appeared as an improvement on the original Glicko system. Glicko-2 introduced a new measure, the rating volatility σ, which measures the inconsistency of a player's performance (because we are humans and do not always have the same mood). The volatility is low when the player performs at a consistent level and higher when not.
Glicko has been implemented in Pokémon Showdown, Dota Underlords, CS:GO, TF2, Go, Chess.com, to name a few.
Glicko’s rating formula
Simplifying the formulation, the new rating of a player is computed from the old rating, the rating deviations of the player and their opponents, and the game outcomes.
As the mathematical formulation of Glicko's rating system is more complicated than Elo's, I encourage you to visit Mark Glickman's website.
TrueSkill rating system
TrueSkill is a proprietary rating algorithm developed by Microsoft Research in 2005 (Herbrich et al., 2006) for matchmaking on Xbox Live, which improves on Elo's ideas. The team created TrueSkill specifically to rate individual players in team games and to treat game outcomes as a permutation (ranking) of teams or players rather than merely a winner and a loser.
The TrueSkill rating system uses a Bayesian approach to predict player ability and assumes player skill is normally distributed. Besides being proprietary, one of its main drawbacks is its complex formulation, which makes its parameters hard to fine-tune.
Key ingredients to designing a Rating Algorithm
In general, no algorithm is better than another; the only requirement is that it is optimal for your particular needs. Elo's system is simple and easy to customize. Glicko-2 is useful when players don't play regularly. To deal with teams, TrueSkill can be a good option. Glicko-2 and TrueSkill can model uncertainty (cold start). But what do all of them have in common?
1. Players skills probability distribution.
2. Parameters which can be tuned.
3. Rating update rule.
And the secret ingredient 🍳 shall be, make it simple.
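These shared ingredients, a skill model, tunable parameters, and an update rule, can all be seen together in a minimal Elo-style simulation (the 75% true win rate for Alice is an illustrative assumption):

```python
import random

def expected_score(r_a, r_b, c=400):
    """Ingredient 1: a probabilistic model of who wins."""
    return 1 / (1 + 10 ** ((r_b - r_a) / c))

def elo_update(rating, expected, score, k=32):
    """Ingredient 3: the update rule, with K as a tunable parameter."""
    return rating + k * (score - expected)

random.seed(42)  # reproducible illustration
r_bob, r_alice = 1500.0, 1900.0
for _ in range(200):
    e_bob = expected_score(r_bob, r_alice)
    s_bob = 1.0 if random.random() < 0.25 else 0.0  # Alice truly wins 75%
    r_bob = elo_update(r_bob, e_bob, s_bob)
    r_alice = elo_update(r_alice, 1 - e_bob, 1 - s_bob)

# Because the two updates are symmetric, total rating points are conserved:
total = round(r_bob + r_alice)  # 3400
```

Over many games, Bob's rating drifts toward the level where his expected score matches his true 25% win rate, while the sum of both ratings stays fixed, a simple sanity check for any zero-sum update rule.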