SERVAT wrote: I myself don’t think ELO is the best way to measure skill in party games. Have you considered implementing other ranking systems, such as trueskill? You could look it up and think it’s bad or hard to implement, but at least give it a shot.
Trueskill was designed by Microsoft for a specific purpose, and for games with binary win/loss outcomes (i.e. only one team wins, but one team always wins). It would need serious surgery to work for ToS, because a) the number of teams in play can vary and b) it's possible for more than one nominally opposing team to secure a win (e.g. NE + Town).
There are other major complications with transferring its model to ToS:
1) It assumes that each "team" fundamentally has an equal chance of winning from the start, before player skill is taken into account. This is not a valid proposition for ToS (ie, the goal of balancing is not to see 33% winrates for Town, Mafia and NK respectively).
2) Trueskill's equivalent of placement games requires the player to be pitted against players who are known, with a high degree of system confidence, to be average. That doesn't really play nicely with ToS' decision to partially reset ELO each season.
3) Trueskill's matchmaking system is built around variable queueing times, as it heavily prioritises trying to get players of similar rank into the same game over ensuring a game begins in a certain timeframe. ToS does the reverse, ensuring games begin within (pretty much) a maximum 3 minute period provided there are enough players to launch the game.
4) Trueskill's calculation is designed for a maximum of 8 players evenly distributed across teams (e.g. 8 free for all, or 8 split into 4 teams of 2). Greater numbers become computationally intensive. ToS has greater numbers, uneven numbers within the teams and, in some cases, a varying number of teams in play (e.g. Legacy Ranked's Any role could alter the number of factions in play).
On the plus side, what Trueskill was aiming to achieve is broadly the same thing Flake's suggestion is trying to achieve:
1) Attempting to better reflect the handicapped nature of faction winrates (ie, even the best players are never going to get 50%+ winrates with NK roles)
2) Attempting to introduce a faster placement system to offset ELO's tendency for slow movements up and down the scale
3) Attempting to introduce a level of "confidence" in someone's ranking; potentially factoring into more significant changes in ELO for wins and losses as the system tries to understand where best to place you
4) Using confidence as a mechanism for rewarding a minor but significant uptick to someone who has consistently performed at high levels for longer (ie. a player at 2,100 ELO after 100 games is less proven at that level than a player with 2,100 ELO after 200 games) as opposed to rewarding this via direct ELO gain, which could open the door to ELO being grindable.
IMO, the one bit which could be helpfully lifted from Trueskill relates to point 4, which is its "conservative skill estimate". Paraphrased into ToS terms, this means your stated ELO would be the lowest limit of your calculated ELO uncertainty range. In principle, it means that it's 95% likely that your skill is at least this ELO.
What this would do is create a numerical "penalty" behind Flake's significance levels. This would be an alternative way of answering rick's hypothetical question of whether we should give the same ELO to someone with a winrate of 75% over 200 games as we would to someone with a winrate of 75% over 500 games.
In Flake's current model, the answer is "Yes, but the one with more games has more bragging rights due to a higher significance level".
In this approach, the answer is "No, uncertainty penalties apply, so the person with over 500 games would have a higher stated ELO. These penalties become pretty minor at higher numbers of games played, but would typically still be enough to slightly separate the ELO of two players with the same winrate but a few hundred games difference until we're into 1,000+ games played."
For the non-statsy people, the example below may illustrate conservative skill estimation more effectively:
Player A
200 matches played this season.
Calculated ELO:
2,000 ELO (this is the ELO you currently see in ToS)
Uncertainty factor: +/- 6.93% (estimate based on 95% confidence interval for a 200 sample of an infinite population)
Uncertainty penalty: - 6.93% (the lowest limit of the uncertainty range)
Conservative skill estimate:
1,861 ELO ... 2000 * (1-0.0693) ... ie, 93.07% of the calculated ELO score
Player B
500 matches played this season
Calculated ELO:
2,000 ELO
Uncertainty factor: +/- 4.38% (estimate based on 95% confidence interval for a 500 sample of an infinite population)
Uncertainty penalty: - 4.38%
Conservative skill estimate:
1,912 ELO ... ie, 95.62% of calculated ELO score
Example uncertainty thresholds using this model (all numbers rounded for simplicity):
10 games = +/- 31%
20 games = +/- 22%
50 games = +/- 14%
100 games = +/- 10%
200 games = +/- 7%
500 games = +/- 4%
1000 games = +/- 3%
2000 games = +/- 2%
5000 games = +/- 1%
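For anyone who wants to check the arithmetic, here's a rough Python sketch. It assumes the uncertainty factor is the standard 95% margin of error for a proportion at its worst case (p = 0.5, z = 1.96, no finite-population correction), which is an assumption on my part but does reproduce the figures above. The function names are made up for illustration.

```python
import math

Z_95 = 1.96  # assumed: two-sided 95% z-score under the normal approximation


def uncertainty_factor(games: int, p: float = 0.5) -> float:
    """Margin of error for a proportion p over `games` samples.

    p = 0.5 is the worst case and is assumed here because it matches
    the percentages quoted in the post.
    """
    return Z_95 * math.sqrt(p * (1 - p) / games)


def conservative_elo(calculated_elo: float, games: int) -> float:
    """Calculated ELO reduced by the uncertainty factor (the lower limit)."""
    return calculated_elo * (1 - uncertainty_factor(games))


# Rebuild the threshold list.
for games in (10, 20, 50, 100, 200, 500, 1000, 2000, 5000):
    print(f"{games:>5} games = +/- {uncertainty_factor(games):.0%}")

# Player A and Player B from the example.
print(round(conservative_elo(2000, 200)))  # 1861
print(round(conservative_elo(2000, 500)))  # 1912
```

Because the factor shrinks with the square root of games played, you need roughly four times as many games to halve the penalty, which is why the list flattens out so quickly past 1,000 games.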
This exact model is probably unduly impactful for ToS, and too computationally demanding to track on a game-by-game basis, but the same principle could be applied with fixed modifiers at confidence thresholds; most likely on a flatter curve, so that the effects are less fierce (e.g. taking lower quartile values rather than the minimum value as a starting point). It would also need some different language to that which I've used, because no-one likes to hear the word "penalty" in the same sentence as "applied to your ELO". I've just used it for clarity in explaining the mechanic.
The positive of this is that it does produce an ELO "reward" for playing more games (and your ELO therefore being more accurate), but the reward is applied separately from the ELO calculation and becomes less and less impactful over time, which avoids the issue of ELO being grindable.
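To show what "fixed modifiers at confidence thresholds" might look like in practice, here's a minimal Python sketch. The thresholds and modifier values are entirely hypothetical numbers I've picked to give a flatter curve than the full 95% model; they aren't from ToS or from Flake's proposal. The point is only that a lookup table is cheap compared with recalculating a confidence interval after every game.

```python
import bisect

# HYPOTHETICAL tiers: minimum games played and the modifier applied from
# that point on. The values are illustrative only, not a real tuning.
THRESHOLDS = [0, 50, 100, 200, 500, 1000]
MODIFIERS = [0.93, 0.95, 0.965, 0.975, 0.985, 1.0]


def stated_elo(calculated_elo: float, games: int) -> int:
    """Apply the fixed modifier for the highest tier the player has reached."""
    tier = bisect.bisect_right(THRESHOLDS, games) - 1
    return round(calculated_elo * MODIFIERS[tier])


print(stated_elo(2000, 199))   # still in the 100+ tier
print(stated_elo(2000, 200))   # just reached the 200+ tier
print(stated_elo(2000, 1500))  # no modifier past 1,000 games
```

With a table like this the modifier only changes when a player crosses a threshold, so nothing needs recalculating on a game-by-game basis.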