Due to some combination of massive traffic increases, service provider problems, and much-needed code/database updates, Zwift has experienced several planned and unplanned outages in the past few months.
Here’s a quick email conversation I had with Zwift CEO Eric Min about what Zwift is doing to reduce outage issues, and what the Zwift community can do to help.
Zwifters experienced a big “outage” on Tuesday, January 17, right as the KISS Europe/GCN Takeover race was starting at 8PM UTC. I personally wasn’t able to pull up the login form to even get into Zwift around that time, while others said they suddenly found themselves alone on course partway into the race, which had over 1100 participants. Can you tell me more about what caused the outage?
We experienced massive delays accessing our database from our cloud service provider, which under normal circumstances would have been fine, but Tuesday also happens to be our busiest day. We suspect our service provider was moving around storage at the worst possible time. The delays caused users to get queued up, and in many cases their sessions timed out. This was just bad service timing, but we are putting in measures (this week) to combat this scenario in the future.
I know you use Amazon cloud services for most or all of your computing/hosting architecture. Can you give us any more details about how that is set up, so people get a better understanding of what it takes to power something like Zwift?
I can’t go into great detail about our Amazon services, but our setup is nothing out of the ordinary. If you know AWS, you know it’s highly configurable and comes with its own set of headaches! But I will tell you that every aspect of our architecture can scale with more servers.
There have been a few outages in the last few months: a few unplanned, and a couple that were announced beforehand. What is your view of planned and unplanned outages? Are they bound to happen in an environment like Zwift, or do you think they can be eliminated entirely?
No one likes unplanned outages! In the past, the planned outages were needed to update our database or because there was a risk of a service disruption, but we’re quickly moving to an architecture that allows Zwift to be updated without the service ever being unavailable. This is our goal.
What is Zwift doing to try to reduce or eliminate unplanned outages in the future?
We believe Zwift is already scalable. We’re now working to ensure that it’s also highly available. This is our current focus.
Zwift is a growing company, but still not huge in terms of its support team. When outages happen, what should Zwifters do to make sure the situation is being taken care of without overtaxing Zwift support?
We’ll be pushing out a status.io page so everyone can go to one place to get the service status. The Zwift Riders administrators have been very supportive in managing the discussion threads when we do experience issues. This helps enormously to streamline the information on Facebook.
Anything else you want to add?
We are still a small company, and while we may have recently raised investor capital, it takes time to grow the team, but it is our second-highest priority. While we are all excited about the new features on our roadmap, our immediate priority is to ensure that Zwift is always available.