Note: for the latest news on Zwift Racing Score, see our post What’s happening with Zwift Racing Score?
Zwift officially launched their Zwift Racing Score (ZRS) metric this week, and as one might expect, there’s been no shortage of feedback from the racing community. I’ve found myself in an interesting position, hearing feedback from the community but also conversing regularly with the team inside Zwift responsible for creating and continuing to improve ZRS.
Feedback from the community isn’t easy to summarize. On the one hand, we did an opinion poll where 57% of racers said ZRS events have been better than standard category enforced races they’ve recently joined. (21% said “about the same” and 21% said they were worse.) On the other hand, comments on ZRS-related posts here on Zwift Insider have largely been negative, with a few repeating themes:
- The range of abilities in each ZRS band is too wide (“I’m a C racing A and B riders!”)
- Heavier riders and sprinters are disadvantaged while lighter riders are advantaged with ZRS
- “My seed score is too high and racing isn’t any fun because I get dropped every time”
- Riders returning to Zwift are seeded too low due to their seed score being based on 90-day power bests
Meanwhile, over at ZHQ I’ve been asking for details about how ZRS works, and why the team built it the way they did. As I’ve taken the time to hear and understand their reasons for doing things the way they have, I’ve generally agreed with the approach they’ve taken.
Both the community and ZHQ have valid points of view. But some comments from the community have simply been inaccurate – people “guessing” as to how things are working, with limited data. ZHQ, for their part, hasn’t answered or corrected many of those comments, because they are keeping some of the details of ZRS under wraps since it is a proprietary engine and they do have real competitors these days.
But I’m a big believer in how true, shared information keeps communities strong. So I asked Zwift if I could get some more detailed answers to questions about how and why ZRS works the way it does. That’s what you’ll find below. Here are my questions, and Zwift’s answers…
Did Zwift consider different approaches before deciding on ZRS? Why not just use ZwiftRacing.app’s algorithm?
Yes, we considered and even implemented several alternatives to Zwift Racing Score, but none made it past internal testing. These included using Machine Learning models to calculate seed scores, variations in score volatility, caps on score changes, and different score decay rates. We also explored automatically moving a user up or down after placing in the top or bottom y% of racers x times in their current category.
We partnered with Tim Hanson, the creator of ZR.app, to discuss our proposed scoring system, and we continue getting feedback from him and from the community about ZRS. While ZR.app has a lot of very interesting features, and is not hard to grasp by people with some experience in cycling races, we were looking for a solution that would be extremely easy to understand by total beginners, as in a rating scale that ranges from 0 to 1000 (versus a typical ELO rating whose range is not just unbounded but also subject to inflation over time).
ZRS has been a few years in the making. What took so long?
Initially, we implemented a purely results-based solution, using USAC’s points system, similar to what ZwiftPower had done. It soon became clear that our USAC-style points system had problems. Points are earned every time one participates in a race, so people who race more frequently tend to have more points than others who race more seldom, but the former are not necessarily stronger than the latter. In short, points may work great as an incentive for engagement, not so much as a criterion for pen categorization.
The project to switch from the USAC points system to ZRS started on 09/29/2023. Its central piece is a complex algorithm that has required plenty of analysis and testing. We also developed multiple tweaks during public testing (e.g. podium bonuses and UI improvements). In addition to developing and testing the scoring system, we overhauled our events logic and tooling to be flexible to the new categorization system and enhanced our race results user experience to display scores.
What are the trickier problems you are trying to solve with ZRS, and how have you gone about attacking each of them?
1: How to best categorize someone who has never raced?
We considered some Machine Learning techniques, but those models proved significantly challenging to maintain and evolve. We therefore evaluated an approach similar to that in ZwiftRacing.app, having worked with Tim Hanson himself, who has been immensely supportive in our quest.
Ultimately ZRS predictions are not too far apart from ZR.app’s “compound score”-based solution in essence. We still employ a user’s recent critical power for some interval lengths (30s and 10min), both absolute and relative to weight, to obtain our predictions. The basic difference is that we were able to fine-tune the model a little bit further. Namely, we experimented with different relative weightings of those parameters, as well as different critical power interval lengths, to make our predictions provide a somewhat better fit in our tests, which we conducted on a massive amount of historical data.
2: How do we avoid Tanking and Sandbagging?
- We use 85% of the predicted score as a racing score floor
- We will soon start considering the player score at join time (rather than at signup, which may be done a long time in advance, in certain cases)
- We are considering discarding low-quality race participation (the player score will not decrease if we detect that there were no significant power efforts)
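The first measure above, the 85% floor, can be sketched in a few lines. This is only an illustration of the idea as described here, not Zwift's code: the function name and the notion of a separate "raw" results-based score are assumptions made for the example. Only the 85% figure comes from Zwift's answer.

```python
FLOOR_FRACTION = 0.85  # from the article: the floor is 85% of the predicted (seed) score


def apply_score_floor(raw_score: float, seed_score: float) -> float:
    """Return the racing score after applying the seed-based floor.

    raw_score  -- hypothetical score produced by race results alone
    seed_score -- power-based predicted score for the same rider
    """
    floor = FLOOR_FRACTION * seed_score
    # A rider's score may drop after poor results, but never below the floor.
    return max(raw_score, floor)
```

Under these assumptions, a rider with a seed of 400 who keeps finishing last would bottom out at a score of about 340 rather than continuing to fall.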
How did you test your ZRS algorithm(s) for accuracy? Was real Zwift racing data used in the analysis?
We used anonymized racing and power data from every distance-based race that has ever happened on Zwift to train and validate our scoring solution and the different score prediction models that we tested.
How did you decide on the 10min and 30s intervals used to compute seed score?
Intuitively, among the main factors that affect one’s ability to perform well in a cycling race are their skill/strength in sustained efforts (say, for breakaways and climbing) and in sprinting (for mass finishes). While skill (namely, being well-positioned in the peloton, having an accurate sense of timing as to when to launch an attack, being able to decide whether to respond to someone else’s move, pacing yourself well when time-trialing or escaping solo, etc.) cannot be inferred from power figures, raw sprinting power and high sustainable power are certainly crucial for success and can be measured by short- and long-interval critical power, respectively.
To arrive at our final interval lengths for the seed formula, we considered different power intervals and their correlation with the actual scores obtained from race results. The 30s interval was the one yielding the smallest RMSE (root mean squared error) overall. Among the long(ish) intervals, 10 minutes yielded good results and is more likely to be available and accurate in one’s recent activities (versus, say, a 30- or 45-minute interval), owing to the “sliding windows” nature of the critical power computation (smaller intervals are more abundant) and to the fact that, while training, most people tend not to keep their maximum sustainable power for a long period except in the rarer cases where their workout specifically tells them to.
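For readers unfamiliar with RMSE, here is a minimal sketch of the comparison described above: compute the error between predicted and actual scores for each candidate interval, then keep the interval with the smallest error. The function names and the dictionary layout are assumptions for illustration; Zwift has not published its analysis code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_interval(predictions_by_interval, actual_scores):
    """Return the interval label whose predictions have the lowest RMSE.

    predictions_by_interval -- hypothetical mapping such as
        {"30s": [...], "60s": [...]}, one predicted score per rider
    actual_scores -- the scores those riders actually earned from results
    """
    return min(
        predictions_by_interval,
        key=lambda label: rmse(predictions_by_interval[label], actual_scores),
    )
```

A lower RMSE means the predictions for that interval sit closer, on average, to the scores riders earned in real races.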
How did you decide on the ratio between 10min and 30s in the final model?
When developing the seed score formula, we tested thousands of different combinations of power duration and rider weight with varying weights to each factor to find what best predicted the outcome of historical race data. What this revealed is 30s power, 10min power, and rider weight were the best predictors, with more emphasis on 10min power and less emphasis on 30s power. This even outperformed the 5min power interval used by traditional compound score formulas.
We are using the results of this analysis as our starting point for the seed score formula fully realizing it may need to be refined in the future as we shift racing categorization away from power-based categories.
What factors might cause an inaccurate seed score, and what have you done to mitigate these?
Sure enough, the seed score is but a best-effort attempt to estimate ZRS based on some input parameters (power and weight). It will never be 100% accurate. The error is inherent to the statistical regression method.
Let’s look into the most obvious problems that may arise and how we mitigate them:
- The outcome of a statistical regression is a curve (i.e., a mathematical function) that adequately maps the input (in our case, the critical power for the chosen interval lengths and the rider weight) onto the intended ZRS seed. For Zwifters who have already completed a scored race, we want to use the seed score as a floor in order to preclude sandbagging. However, since the curve was produced out of a “cloud” of actual scores from existing users (during the regression), some users will naturally fall above, and some below the curve that was found to be the best possible fit. If we were to use the curve itself (i.e., the very ZRS seed) as a score floor, then those users who sit a little below the curve for those input parameters would end up being overseeded (a floor would be set too high). To mitigate that, we use just 85% of the seed as our ZRS floor.
Note that, for someone who has never raced before on Zwift, we do not want to risk underseeding them too badly, because an underseeded race participant will likely spoil the party of a number of other players by winning a race that was too easy (whereas an overseeded participant may have a bad personal experience but will not ruin the fun of anybody else). That is why after the first ZRS race, when we already have at least one instance of power effort and scored race result, we let the player’s score go 15% below the seed. When the score of a user touches the floor (e.g., after performing poorly in a scored race), of course there are reasons to believe that the score of that user must indeed go down; we just don’t want it to go down too much to avoid abuse. A 15% deduction looked reasonable after gauging the number of Zwifters that would touch the floor.
- Another possible problem may arise, so to speak, from the very user attitude to training. If, in the last 90 days, the user has failed to perform close to their peak in the relevant critical power intervals (namely, 30 seconds and 10 minutes), their seed score will be lower than it should have been. We are planning to mitigate this by establishing a threshold based on their last year performance. If their performance in the last 90 days falls below the threshold, we will use the full one-year critical power figures to generate a seed score.
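The planned 90-day fallback could look something like the sketch below. Zwift has not said how the threshold will be defined, so the 90% figure used here is purely a placeholder assumption, as are the function and parameter names.

```python
def power_for_seeding(best_90d_watts: float,
                      best_1y_watts: float,
                      threshold_fraction: float = 0.9) -> float:
    """Pick which critical power figure feeds the seed score.

    threshold_fraction is a hypothetical value: the article only says a
    threshold will be based on the rider's performance over the last year.
    """
    threshold = threshold_fraction * best_1y_watts
    if best_90d_watts < threshold:
        # Recent efforts are well below last year's best, so fall back
        # to the full one-year figure.
        return best_1y_watts
    return best_90d_watts
```

With this placeholder threshold, a returning rider whose 90-day best is 250 W against a one-year best of 300 W would be seeded from the 300 W figure.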
Once you have your seed score, what are the main variables in the model that impact progression up and down?
- Race results (win/loss)
- Field quality in the races you participate in (how strong are your opponents compared to you, in terms of ZRS)
- Variations in Critical Power (adjusts your ZRS floor)
How have you been tuning these, and do they currently feel accurate?
- We run analyses on anonymized data comparing power and ZRS, for all race participations (and also on sampled production data).
- We check ZRS distributions, and also how many Zwifters are sitting on their floor. Depending on the value, it could indicate that we should tune thresholds and other scoring model parameters.
- Based on results from ranked races we feel confident that ZRS is making racing at Zwift a fairer and more enjoyable experience.
What are the main things you learned during the public testing period that began in June 2024?
- Mechanisms that accelerate score progression are important to increase racing fairness. We implemented a podium bonus system after reviewing test outcomes and the feedback we received from the community during the testing period.
- The daily score decay applied after the user’s last race shouldn’t be too aggressive, or it might impact the score progression of Zwifters who don’t race too often. We tuned our decay rate during the testing period.
- The straightforward multiplication of absolute and relative power does not necessarily produce the best-fitting curve to map critical power intervals and weight onto a racing score that is meant to reflect a user’s likelihood to perform well in a Zwift race. Refined combinations of “weights” (exponents) of the input parameters do indeed provide better results.
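To make the last point concrete, here is a toy sketch of what searching for exponents could look like. It assumes a model of the form score ≈ k × watts^a × (watts per kg)^b and tries a small grid of candidate exponents. The model form, the candidate values, and all names are assumptions for illustration only; the actual ZRS formula and its weights are not public.

```python
import itertools


def fit_exponents(samples, candidates=(0.5, 1.0, 1.5, 2.0)):
    """Grid-search exponents (a, b) for score ~ k * watts**a * wkg**b.

    samples -- list of (watts, watts_per_kg, observed_score) tuples
    Returns (a, b, k) for the combination with the lowest squared error.
    """
    observed = [score for _, _, score in samples]
    best = None
    for a, b in itertools.product(candidates, repeat=2):
        features = [watts ** a * wkg ** b for watts, wkg, _ in samples]
        # Closed-form least-squares scale factor for the model y = k * f.
        k = (sum(y * f for y, f in zip(observed, features))
             / sum(f * f for f in features))
        sse = sum((y - k * f) ** 2 for y, f in zip(observed, features))
        if best is None or sse < best[0]:
            best = (sse, a, b, k)
    return best[1], best[2], best[3]
```

Setting a = b = 1 corresponds to the straightforward multiplication mentioned above. The search simply checks whether some other pair of exponents fits the observed scores better.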
How do the current ZRS boundaries differ from the previous category boundaries?
In the past, category boundaries were based on fixed power (to weight) thresholds, regardless of the number of users that would belong in each category. The way it is now, we rank the users by ZRS and impose moving thresholds meant to establish cutoff ZRS levels that will distribute the number of users into each category according to a predefined Gaussian curve (fewer people in the extremes of the spectrum, more people towards the “central” categories).
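A rough sketch of such moving thresholds: sort all scores, then place each cutoff where the cumulative share of riders reaches the target for that category. The share values a caller passes in are hypothetical (Zwift has not published its target distribution), and the function names are made up for this example.

```python
from bisect import bisect_right


def category_cutoffs(scores, shares):
    """Return the lowest score belonging to each category above the first.

    scores -- every rider's current score
    shares -- target fraction of riders per category, lowest category
              first; must sum to 1 (e.g. small at the ends, larger in
              the middle for a bell-shaped split)
    """
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    ranked = sorted(scores)
    n = len(ranked)
    cutoffs = []
    cumulative = 0.0
    for share in shares[:-1]:
        cumulative += share
        index = min(n - 1, round(cumulative * n))
        cutoffs.append(ranked[index])
    return cutoffs


def assign_category(score, cutoffs):
    """Return the 0-based category index (0 = lowest) for a score."""
    return bisect_right(cutoffs, score)
```

Because the cutoffs are recomputed from the ranked population, they shift as riders' scores change, which is the key difference from fixed power-to-weight thresholds.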
Another advantage of using ZRS for category boundaries is that it is much more flexible for customization by our event organizers.
Do our current boundaries please everyone? Zwifters who happen to be near the boundaries might feel like they are at a disadvantage. The difference is that, now, if you are in a category that feels too strong for you, you will soon have the opportunity to move down, since your scores will likely reflect the fact that you are not obtaining good relative performances in that former category.
Will Zwift Racing Score ever be ‘finished’? What’s next?
Rating systems need ongoing monitoring and occasional maintenance. We also anticipate continued constructive feedback and feature requests from the community, so we plan to keep evolving ZRS for the foreseeable future.
Future developments include implementing anti-tanking and anti-sandbagging measures, as mentioned above, as well as incorporating course profiles into our scoring system. We’re also exploring ways to improve our category ranges, including increasing the number of pens and also dynamic ranges.
Are you planning to reset scores regularly in the future?
No, unless the scores start to appear inaccurate for some reason. In that case, we would not only reset the scores but also investigate the issue and adjust the model as needed.
We might explore the idea of “seasons,” where scores reset at the beginning of each season, similar to other MMOs. However, this is not currently part of our roadmap.
What are some of the common misconceptions you’re seeing about ZRS, and how would you answer them?
A major misconception about ZRS is the overemphasis on the seed score’s impact on individual scores. While the seed score does set a baseline (which isn’t intended to be very restrictive), most Zwifters are not expected to remain at this score. Their scores will naturally fluctuate up and down as they participate in race events.
Another misconception is that the 30-second critical power used to calculate seeds may inflate certain scores, suggesting other intervals should be used instead. However, our correlation analysis between multiple power intervals and scores found that using both the 30-second and 10-minute intervals provides the best fit, with the 30-second interval having less influence than the 10-minute interval in the formula.
My Takeaways
I think there are a few key takeaways from Zwift’s answers above which speak to some of the recurring concerns I’m hearing from the community.
- First, their explanation of why they didn’t just use the ZR.app algorithm. Basically: Zwift wanted to make ZRS Zwifty. That means keeping it simple.
“While ZR.app has a lot of very interesting features, and is not hard to grasp by people with some experience in cycling races, we were looking for a solution that would be extremely easy to understand by total beginners, as in a rating scale that ranges from 0 to 1000 (versus a typical ELO rating whose range is not just unbounded but also subject to inflation over time).”
- I’ve seen comments like, “Did Zwift even test this? Why did they choose 30s and 10-minute intervals, because those seem like the wrong ones to use!” But Zwift chose those intervals after lots of analysis.
“We used anonymized racing and power data from every distance-based race that has ever happened on Zwift to train and validate our scoring solution and the different score prediction models that we tested.”
- It sounds like Zwift is working to include more historic data in the seed score calculation, which will be welcomed by just about everyone.
“We are planning to mitigate this by establishing a threshold based on their last year performance. If their performance in the last 90 days falls below the threshold, we will use the full one-year critical power figures to generate a seed score.”
While Zwift’s current implementation of ZRS isn’t perfect, I’ll say what I’ve said in other posts: it’s much better than the old system. Things may feel a bit messy right now as riders haven’t done enough scored races for their ZRS to accurately reflect their abilities, but over time I think we’ll see scores settle in and races become more competitive and fun for everyone.
Questions or Comments?
Hopefully, this deep dive has clarified some things for the Zwift community. Still got questions or comments? Share below!