Put simply: when Zwift started, it was not designed as a competitive system. It was designed to make riding indoors *fun*, and that it does.

But ever since the dawn of Zwift, people have raced on Zwift. And there has been one consistent challenge: can we have racing that is fair and fun please?

To split races into categories for a more engaging and competitive experience, a categorisation system emerged: A, B, C, and D, set (effectively) by the rider’s maximum 20-minute w/kg power. This was a very good start, but an ideal next step would be a ranking system which scores each rider, say between 0 and 1000, with 10% of riders in each band of 100. In this article and the next, a reasonable way of doing that will be explored. Let’s dive in!

## The Data

From data that is now easily available online through ZwiftPower, it is possible to see which of eight power measures correlates most strongly with race results. Those eight measures are: 15-second power, 1-minute power, 5-minute power, and 20-minute power, each measured both in watts and in watts per kilogram (w/kg).

Furthermore, a wonderful team of DIRT riders recently put on the Dirt Racing Series. This provided a series of races by the same group of riders. Using the rider power data and the race results allows us to look at how a rider’s power metric links to their race placing:

The left chart shows the 20-minute power (in w/kg) of the riders by finishing place for the lowest category (combined C and D) in the crit-style race of the series. The right chart shows the 1-minute power (in watts) of the riders by place in the same race. The two charts use very different vertical scales, which makes comparison hard. To make things clearer, each data set can be scaled so that its highest power value becomes 1, with the other values scaled in proportion. The data then looks like this:

Clearly, the second one (1-minute power in watts) influenced the outcome more than the first one (20-minute power in w/kg).
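The scaling step described above can be sketched in a few lines of Python. This is only an illustration: the function name `normalise_by_max` and the power figures are made up for the example and are not taken from the Dirt Racing Series data.

```python
def normalise_by_max(values):
    """Scale values so the largest becomes 1.0 and the rest keep their ratios."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("expected at least one positive power value")
    return [v / peak for v in values]


# Hypothetical figures, listed in finishing order (1st place first).
one_minute_watts = [520.0, 480.0, 455.0, 390.0]
twenty_minute_wkg = [3.1, 3.2, 3.0, 2.9]

# After scaling, both series run from 0 to 1 and can share one axis.
scaled_watts = normalise_by_max(one_minute_watts)
scaled_wkg = normalise_by_max(twenty_minute_wkg)
```

Dividing by the maximum keeps the ratios between riders intact, so only the axis changes and not the shape of the data.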

## The Analysis

There are a number of ways to measure the correlation between an input (in this case, each of the eight power measures) and a result (in this case, race rank). In this analysis, three were picked for comparison: the slope of the line of best fit, the Pearson coefficient, and the Spearman coefficient. On the graphs above you can see these for the data (as well as the covariance, which was included as a sanity check on the results).

They are negative (as expected) because the line slopes downwards (*ie* the less power, the higher (in number) the finishing position). The bigger (*ie* more negative) the measures are, the more strongly that input (the rider’s power data) is linked to the output (their finishing position in the race).
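For readers who want to see how the three measures are computed, here is a minimal sketch using only the Python standard library. It assumes finishing place is on the horizontal axis and normalised power on the vertical axis, as in the charts; the data values are invented for illustration, and the author's own tooling is not known.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def slope(x, y):
    """Least-squares slope of the best-fit line of y against x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson(x, y):
    """Pearson coefficient: strength of the linear relationship, from -1 to 1."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: the Pearson coefficient of the ranked data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical race: finishing place and normalised power for five riders.
place = [1, 2, 3, 4, 5]
power = [1.0, 0.9, 0.95, 0.8, 0.7]

fit_slope = slope(place, power)
r = pearson(place, power)
rho = spearman(place, power)
```

All three come out negative for this made-up race, because power falls as the place number rises. The slope depends on the scale of the data, which is one reason the normalisation matters, whereas Pearson and Spearman always lie between -1 and 1.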

One race is interesting, but much more can be learned from combining information from *lots* of races. As an example, using the “B” group in this race series, all of the race results can be combined to get a set of graphs. Here they are for 1-minute power in watts and 20-minute power in w/kg (with normalisation of the power applied so that the scales are comparable), combining all the information from all of the B group races (except the Team Time Trial, see postscript 1):

For each data collection, there are three possible values (slope, Pearson, and Spearman) for each of the eight measures. To simplify things, the actual values can be turned into a “percentage contribution”. For the Pearson coefficient for all of the C+ category races, here’s what that means:

Becomes:

And this means we now have an idea of the “percentage contribution” that a particular power measure makes to the rider’s result. Those percentages can then be averaged across all of the power categories. Pearson, again, now for all races, all race categories:

To balance out differences between the three measures (Pearson, slope, and Spearman), they can be averaged, which gives:

This now gives an idea of the *influence of each power measure on the race outcome*. In conclusion: every power measure is linked to the outcome, but some more than others. Not surprisingly, 20-minute power in watts is the least influential, whilst 5-minute and 1-minute power in w/kg (accelerations and short climbs) are the most influential.
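The article does not spell out the formula behind the “percentage contribution”, so the sketch below rests on an assumption: each measure's share is its absolute value divided by the sum of the absolute values for all eight measures. The three sets of shares (slope, Pearson, Spearman) are then averaged. All coefficients are invented, and the function names are hypothetical.

```python
MEASURES = ["15s W", "1min W", "5min W", "20min W",
            "15s W/kg", "1min W/kg", "5min W/kg", "20min W/kg"]


def percentage_contribution(scores):
    """Convert {measure: coefficient} into {measure: % share of total magnitude}.

    Assumption: the share is |value| / sum(|values|). Taking the absolute
    value ignores the sign, so this only makes sense when every coefficient
    points the same way (negative, in this analysis).
    """
    total = sum(abs(v) for v in scores.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    return {name: 100.0 * abs(v) / total for name, v in scores.items()}


def average_tables(tables):
    """Average several {measure: %} tables measure by measure."""
    return {name: sum(t[name] for t in tables) / len(tables)
            for name in tables[0]}


# Hypothetical coefficients for one set of races, one list per method.
raw = {
    "slope": dict(zip(MEASURES,
                      [-0.10, -0.14, -0.12, -0.06, -0.11, -0.16, -0.15, -0.08])),
    "pearson": dict(zip(MEASURES,
                        [-0.30, -0.42, -0.38, -0.20, -0.33, -0.47, -0.45, -0.25])),
    "spearman": dict(zip(MEASURES,
                         [-0.28, -0.40, -0.36, -0.22, -0.31, -0.46, -0.44, -0.27])),
}

shares = {method: percentage_contribution(vals) for method, vals in raw.items()}
combined = average_tables(list(shares.values()))
```

Because each table sums to 100 before averaging, the combined table also sums to 100, so the result can still be read as a set of percentages.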

So that’s the analysis. The next article will explore how to use this information to create a clear way of ranking riders that should give good categories which can be determined easily by race organisers to suit the riders they expect in the race.

### Postscript 1: A Couple of Questions You Might Ask

The categories which riders signed up for in the races used to create the charts above were set using 20-minute w/kg power, meaning that there is less scope for variation in that power than the others. This means that this analysis might be biased *against* its influence. Indeed, it *does* have a bit more influence in the top and bottom categories (A/A+ and combined C/D) than the middle ones, which agrees with this concern. However, it is not very significant, and the combining of all the categories together will come close to resolving that.

Secondly: what about different types of racing? This analysis combines different types of racing (an individual time trial, two general races, and a crit). It doesn’t include the TTT data as no simple way was found to link those team results to each individual’s performance that is consistent with the rest of the analysis. However, examining the race data by race type shows what we would expect given everything else so far: the iTT races are more influenced by the longer power measures (5 and 20-minute), crit racing by the shorter ones (15s and 1-minute), but not to the exclusion of the others. Overall, the data from these races suggests that, for an overall “racer assessment”, the ratios given above remain sensible, although there is one nuance that will be addressed in the next article.

### Postscript 2: Data Sources

The power data used in this analysis is available to any Zwift rider who connects with ZwiftPower; it was anonymised during processing, as can be seen. The race data is from the Dirt Racing Series (who have explicitly given permission for anonymised data use for this analysis).

## Questions or Comments?

Share below!

Were the power numbers used from the races only or from historic data? In other words, if a racer could average 3.4 w/kg (historic data) but only did 2.8 w/kg in the races, which number would be used?

Interesting question, my best 20 minute efforts (pretty much all from short races that are essentially just Innsbruck KoM efforts), are significantly higher (maybe 0.5 W/kg) than my typical race average power for scratch races, points races, and everything other than iTT and TTT races. As someone who tries to survive at the back of packs to save energy for hills (my weakness) and sprints, the average power I see at the end of a race compared to my FTP always makes me think that, looking at the numbers, I should have been able to do more/better, but the…

The power number used was the rider’s (90 day) data from ZwiftPower. It would be interesting to look at actual race data, but as the ultimate goal was a categorisation scheme, it had to use data which would be available before the race!

Categorization based on more than 20 minute power is an improvement on just straight 20 minute W/kg, but a results-based ranking will be the best outcome and result in the most interesting races.

I do think that results-based ranking could be great; the question I wanted to explore was whether a system could be developed that used the data currently available and was an improvement on simply using 20 minute w/kg.

I get that, and it’s cool that you’re thinking about it. It sounds sort of like you’re trying to find a data-driven approach to choose a new set of parameters for Critical Power. That’s cool as that seems sort of a black box to me, so it’s nice to see how one would derive something like that.

The challenge with only using a results-based ranking is that riders may heavily favour courses that suit their strengths (big guys (100kg) like me tend to avoid races with hard climbs). I think results-based rankings work in other video games (at least the ones I’ve played), because the game usually cycles you through all the maps/options randomly, and you’d often play 10-40+ games a week, which provides a nice sample size. I’m lucky to race 6-8 times a month in Zwift. In other words, given we have the freedom to choose the races we prefer, which I’d assume correlates with…

I can sort of see that: if you and I (104 kg) only do flat races and crits, and then decide we want to try a hilly race, we’d be in way over our heads because we’d have artificially inflated our ranking by racing our strengths. I tend to race series, so I do whatever random races the organizers call up next, so I tend to have a fairly good mix (though usually light on iTT races), but I realize that’s not true for everyone.

I wonder how much of the 1 and, to a lesser extent, the 5 min power numbers are contributing because of the insane watts you have to put out to stay with the front group in the opening minutes, to then stand a chance of actually finishing well.

The data suggests that the shorter powers (15s, 1 and 5 minute) have a greater impact on race result than 20 minute power, which makes sense, because Zwift races are generally determined by critical phases of about those lengths.

I did something similar as captain for my ZRL team. I took the teams’ best efforts (using the 8 parameters you did, from their ZP best efforts) and created a simple weighted model for the different races we were doing – flat, punchy climbs, long climbs. This was how we did the selection each week.

Similar conclusions to you i.e. 15s/1min power is critical for flat races, 5 min power becomes more influential for punchy courses, and 20 min power only really has an impact if you’re doing Innsbruck/Epic KOM or bigger.

Great minds think alike 🙂

Very interesting convergence! Perhaps the conclusion regarding the greatest influence of 1min and 5min power on your data set of Zwift race results is partly to do with the make-up of the Zwift races themselves. Perhaps, following on from Adam’s observation, most Zwift races are ‘punchy’, so lending themselves to this. Or, put another way, if Zwift races were all on Ventoux or Alpe DZ, then of course the 20min power would be the best correlator. So, your observations on power vs race result are also dependent on another factor: the type of courses most used in Zwift races.

The course and race type do affect things: if you isolate the iTT data it leans more towards 5 and 20 minute power; if you isolate the crit race it leans more towards 15s and 1 minute power. You are absolutely right that if all races were up AdZ or Ventoux, then 20 minute power would be more rewarded. Ideally you would have 3 or 4 different category scores for each rider, one for each race type, with different weightings as @Adam suggests. The risk is that you end up with lots of different ratings, and everything gets very complex.

I think Zwift drafting needs to be made more realistic. Also realistic cornering, and things such as auto-braking, so the race reacts like a real race, before they start getting all sciency about the data they’re collecting.

Interesting! I think the biggest concern with any power -> Category correlation is sandbagging though.

correlation like this is ok if it’s a quick, easy, short term thing.

but for races, I think categories should be based on a rider’s recent historical actual race results/placing

I think many people would be very happy with a well thought out race-result-based ranking system; maybe this little contribution can be a stepping stone between the present and the ideal?

Sounds like this would lend itself to a machine learning approach, so looking at the data for all racers would make sense. 80% of the data to train the model, 20% unseen data to test the model. I’m not a practitioner BTW, but I do work with some bright people who are and given enough training data results can be very good.

Interesting proposal! It could well be worth looking at that approach if the ideas gain traction …

Good points. (1) I agree that there is doubtless correlation between the different measures which should be accounted for in a fuller treatment, but I would hypothesise that some of the 8 measures have more correlation with race result than others, and that that will be related to the correlations seen here. (2) I would be interested in a methodology for this, as I couldn’t come up with a clear one – if you come 1st (say) in the B+ category race, what place in the A race is that equivalent to?

I’m spitballing here, but on (2) could you use “E” races to set the weights for the formula?

Thanks for your work on this.

I think that’s a good idea – although I suspect that “E” races are less entered by C or D category riders, so it might be biased towards B-A+ riders. I’m more than happy to work with Zwift or others with more data – the DRS series guys had some great, clean, data for a proof of concept, and they were happy to work with me on it. I’d love to see it done (intelligently) with the data on Zwift Power …

I’m not a statistics expert, but it doesn’t seem too surprising that 20-min w/kg isn’t as highly correlated as other measures when it is already accounted for once through the existing categorization scheme. It might be interesting to look at a series of races where riders from multiple categories start together, and rank by total time.

That’s a really good idea. Looking at the top and bottom categories (where there was no upper and lower limit on 20 min w/kg respectively) suggests that the other powers are still more significant, but your idea makes a lot of sense.

Guys,

You could try looking at the Masters series…these races are based on age not power and should provide a good comparison to the present power based categories

Interesting idea – I must admit I’ve not tried one as I know that I’m not that special for my age bracket, but maybe I should try one out!

Exactly this. It seems very likely in a mass start event that 20 min or longer power will be far and away the biggest predictor of race result. Someone who can do 4w/kg for the duration of the race will always finish mins ahead of someone who can do 2w/kg. It’s the shorter duration power that decides the finishing order in a group of riders who get close to the finish together. So while finessing the ranking based on shorter duration powers will allow different people to win it will always be necessary to primarily rank by FTP or similar…

The stronger the draft effect gets, the less impact the longer duration power has, because in the draft someone with lower FTP can sit behind someone putting out much more power. This is the same reason Cav can beat Pog in a flat stage but has no chance on a mountain stage. But Average Joe would finish hours behind both of them on any type of stage. So on Zwift any type of power based categorisation, no matter how nuanced, will always mean whoever lucks out to be at the top of the category is always more likely to win.…

Somehow I missed these when initially posted, sorry. In essence, whoever is at the top of the ranking bracket (however you determine it) is more likely to win, by definition – assuming the ranking has any merit at all. 20min power is a component – as you say it is what’s needed to stay with the pack. But purely using 20 minute power essentially guarantees that the shorter powers must win. By using a spread of powers, you start to push those with only short power but sufficient FTP to compete up a bit (ie actually get tested during the…