Elo Scores and Squash Ratings

By Jasmine Siswandjo

May 18, 2024

Elo scores

Elo scores are most well-known from chess, where ratings run from below 1000 for beginners to 2700+ for the world's best. Most people have heard of Chess Grandmasters, who must reach a rating of at least 2500 (along with norm requirements) to attain the title, which they then hold for life. Something interesting: as of 2024, 42 women have been awarded the GM title, out of about 2000 Grandmasters in total.

But enough about chess, because this post is about squash! Elo scores can be used in any context that compares the relative skill of players, which includes the head-to-head competition setting of squash matches. Elo rating systems have a built-in scaling constant, which means the ratings can be placed on any chosen range. The Club Locker or Universal Squash Rating (USR) system used in the US goes from 1.0 for beginners to beyond 7.0 for pros.
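
To make the scaling constant concrete, here is a quick illustration (my own, not part of either rating system's official documentation): chess Elo predicts the higher-rated player's expected score as \(\frac{1}{1+10^{-(R_A-R_B)/400}}\), which is the same logistic curve as the win-probability formula introduced later in this post, just with \(k = \ln(10)/400\).

# chess Elo expected score for player A against player B
elo_expected <- function(r_a, r_b) {
  1 / (1 + 10^(-(r_a - r_b) / 400))
}

# the same curve, written as a logistic with k = log(10) / 400
k <- log(10) / 400
elo_expected(1650, 1500)  # ~0.70
plogis(k * (1650 - 1500)) # identical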

US Squash used Elo scores up until 2014; the current system was redesigned between 2012 and 2014 by Elder Research, and its accuracy keeps improving as the algorithm is continually updated. When released, the new ranking system showed a 10% gain in predictive accuracy. It weights the most recent games more heavily, specifically those in the last 15 months, and only takes into account matches up to 45 months old (that's 3 years and 9 months). This recency weighting, along with a reliability score, makes the USR different from a pure Elo system.
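
The exact weighting scheme is proprietary (more on that below), but a toy version of the recency idea might look like this: full weight for matches in the last 15 months, a linear fade after that, and zero weight past 45 months. The function and its shape are my own illustration, not the actual USR formula.

# hypothetical recency weight: 1 for the most recent 15 months,
# a linear fade down to 0 at 45 months, and 0 beyond that
recency_weight <- function(age_months) {
  ifelse(age_months <= 15, 1,
    ifelse(age_months <= 45, (45 - age_months) / 30, 0)
  )
}
recency_weight(c(3, 20, 45, 50)) # 1.000 0.833 0.000 0.000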

US Squash recommends that players only compete against opponents within ±0.5 of their own rating, and will even disregard results from games with more than a 0.5 rating differential if the higher-rated player wins, because of the potentially negative impact on the higher-rated player's rating. An analysis of ~80,000 matches found that each 0.1 gap in rating raises the probability of the higher-rated player winning by 15%, but only when the two players' ratings are within 0.25 of each other. With a rating difference of more than 0.5, the result is pretty much a foregone conclusion.

Although the algorithm currently used for the USR is proprietary, we can build our own Elo system to understand how such ratings work.

Current state of women’s squash

The current world number one women's squash player is Nour El Sherbini.

[Image: Nour El Sherbini celebrates a win]

And her rating:

[Image: El Sherbini's rating]

Calculate the probability of player A winning

The probability that A defeats B given their respective skill levels (with \(k\) as a scaling factor):

\(P(\text{A defeats B} \mid \theta_A, \theta_B) = \frac{1}{1+e^{-k(\theta_A-\theta_B)}}\)
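
In R, this is just the logistic (inverse-logit) function, which base R provides as plogis. A small helper (the name win_prob is mine) makes the formula concrete:

# probability that A beats B, given their ability levels and scaling factor k
win_prob <- function(theta_a, theta_b, k = 1) {
  plogis(k * (theta_a - theta_b))
}
win_prob(4.0, 3.5) # closely matched: ~0.62
win_prob(6.5, 2.0) # large gap: ~0.99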

To illustrate this, let’s simulate some squash matches.

From now on, I will use \(\theta\) and ability level interchangeably.

library(tidyverse)
library(kableExtra)

Simulate data for 100 players, and give them each a true ability level, which will be used later to calculate the probability of winning a match.

players <- data.frame(
  id = 1:100,
  # inverse-CDF sampling from a Normal(3.5, 2) truncated to [1, 7]:
  # draw a uniform value between the CDF at 1 and the CDF at 7,
  # then map it back through the normal quantile function
  theta = qnorm(p = runif(
    n = 100,
    min = pnorm(1, 3.5, 2),
    max = pnorm(7, 3.5, 2)
  ), mean = 3.5, sd = 2)
)
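
The same trick, wrapped as a reusable helper in case the nested calls are hard to read (rtrunc_norm is my own name, not a function from any package):

# inverse-CDF sampling from a normal truncated to [lower, upper]
rtrunc_norm <- function(n, mean, sd, lower, upper) {
  u <- runif(n, min = pnorm(lower, mean, sd), max = pnorm(upper, mean, sd))
  qnorm(u, mean = mean, sd = sd)
}
range(rtrunc_norm(1e5, mean = 3.5, sd = 2, lower = 1, upper = 7)) # stays inside [1, 7]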

players is a dataframe with two columns, id and theta.

head(players)
##   id    theta
## 1  1 4.038305
## 2  2 1.081590
## 3  3 2.764817
## 4  4 2.689292
## 5  5 5.188098
## 6  6 2.609840
players |>
  arrange(-theta) |>
  slice(1:6) |>
  ggplot(aes(y = theta, x = reorder(id, -theta))) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = round(theta, 2)), vjust = 1.8) +
  labs(x = "Player ID", y = "theta", title = "Top 6 players by theta")

Take note of the best players: 22 and 88.

Next, we simulate outcomes of 1000 matches. Using the equation above, we can give rbinom the probability that Player A wins the match, given the \(\theta\) of both Player A and Player B.

ids <- 1:100
matches <- data.frame(
  player_1 = sample(x = ids, size = 1000, replace = T),
  player_2 = sample(x = ids, size = 1000, replace = T)
) |>
  filter(player_1 != player_2) |>
  rowwise() |>
  # win probability from the equation above, with k set at 1
  mutate(result = rbinom(
    n = 1, size = 1,
    prob = 1 / (1 + exp(-1 * (players[player_1, "theta"] - players[player_2, "theta"])))
  ))
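
As an aside, the rowwise() step can be avoided entirely: rbinom and the logistic function are both vectorised, so the same simulation works in one shot. A sketch (matches_vec is just an illustrative name, with plogis standing in for the hand-written logistic):

# vectorised equivalent: look up both players' thetas at once,
# then draw all the match outcomes in a single rbinom call
matches_vec <- data.frame(
  player_1 = sample(x = ids, size = 1000, replace = TRUE),
  player_2 = sample(x = ids, size = 1000, replace = TRUE)
) |>
  filter(player_1 != player_2) |>
  mutate(result = rbinom(
    n = n(), size = 1,
    prob = plogis(players$theta[player_1] - players$theta[player_2])
  ))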

To confirm that players with higher ability levels are winning more of their matches, we calculate each player's win rate. Since the original matches dataframe has one row per match, with result being 1 if Player 1 won, matches_full duplicates each match with the player ids flipped, so that every player appears in the player_1 column and we can calculate their win rate across all of their matches.

matches_flipped <- matches |>
  mutate(
    player_1_new = player_2,
    player_2 = player_1,
    result = 1 - result
  ) |>
  select(-player_1) |>
  rename(player_1 = player_1_new) |>
  select(player_1, player_2, result)
matches_full <- matches |> bind_rows(matches_flipped)
ranked_players <- matches_full |>
  group_by(player_1) |>
  summarise(win_rate = mean(result)) |>
  arrange(-win_rate)

win_rates <- inner_join(ranked_players, players, by = join_by(player_1 == id)) |>
  arrange(-win_rate)

head(win_rates) |>
  kable(col.names = c("Player", "Win Rate", "Theta"), caption = "Top 6 players", digits = 2) |>
  kable_styling(position = "center", full_width = F)
Table 1: Top 6 players

Player  Win Rate  Theta
41      1.00      6.15
88      1.00      6.80
22      0.96      6.93
9       0.95      6.26
60      0.92      4.79
17      0.88      6.53

As expected, the top few players have the highest win rates!

ggplot(win_rates, aes(x = theta, y = win_rate)) +
  geom_point() +
  labs(y = "win rate", title = "Win rate by theta")

The win rate by theta plot shows that win rate increases with theta, which is what we expect. However, there is visible scatter: ratings will never perfectly predict results. A higher-rated player does not always beat a lower-rated player, although the outcome becomes predictable once the difference in ratings is large enough.

Use optim to estimate players' real ability levels

In the real world, we don't have each player's real ability level; we only have their match outcomes. We can use optim to maximise the log-likelihood and estimate the ability level of each player.
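
Concretely, write \(y_i = 1\) if player 1 won match \(i\) (and 0 otherwise), and let \(p_i\) be the win probability from the equation above with \(k = 1\). The log-likelihood of the observed results is then the standard Bernoulli one:

\(\log L(\theta) = \sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]\)

This is exactly what the function below computes; plogis is base R's logistic (inverse-logit) function.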

LogLikMatch <- function(data, theta) {
  # win probability for player 1 in each match (plogis is the logistic function)
  p <- plogis(theta[data$player_1] - theta[data$player_2])
  # Bernoulli log-likelihood of the observed results
  ll <- sum(data$result * log(p) + (1 - data$result) * log(1 - p))
  return(ll)
}
starting_theta <- rep(0.5, 100)

solution <- optim(starting_theta, LogLikMatch,
  data = matches,
  # this method generally works well for box-constrained optimization
  method = "L-BFGS-B",
  lower = 1, upper = 7,
  # fnscale = -1 makes optim maximise instead of minimise;
  # maxit allows for more iterations
  control = list(fnscale = -1, maxit = 1e6)
)
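
Before trusting the estimates, it's worth checking that the optimiser reports success (convergence and value are standard fields of optim's return value):

solution$convergence # 0 means the optimiser converged successfully
solution$value       # the maximised log-likelihood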

players <- players |> bind_cols(est_theta = solution$par)

ggplot(players, aes(x = theta, y = est_theta)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, lty = "dotted") +
  labs(y = "estimated theta", title = "Estimated theta vs theta")

And directly compare estimated \(\theta\) with the real \(\theta\) we originally created:

Table 2: Estimated theta for players

Player ID  Theta  Estimated Theta
20         4.89   5.17
69         5.14   4.53
3          2.76   2.70
26         5.30   5.37
8          5.85   5.38
35         3.59   3.99

The estimated \(\theta\) values are pretty close to the originals! They are not on exactly the same scale, though. The likelihood depends only on the differences \(\theta_A - \theta_B\), so the estimates are identified only up to an additive constant; it is the [1, 7] bounds on the maximum likelihood estimator that pin them down. That, plus the fact that the original simulated \(\theta\) distribution is not perfectly normal, puts the estimates on a slightly different scale from the original.

range(players$theta)
## [1] 1.060522 6.934806
range(players$est_theta)
## [1] 1 7
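
Because the estimates are only pinned down by the bounds, a shift-invariant comparison such as the correlation between true and estimated \(\theta\) is a fairer check than the raw ranges (a quick extra check, not in the original analysis):

# correlation is unaffected by a constant shift in the estimates
cor(players$theta, players$est_theta)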