Our weekly release process has stayed mostly the same for quite some time. Every week at night (PST) a maintenance banner would go up, we’d roll out new versions of the server, client and Studio, and we’d kick every single player out of the game when restarting the servers. Additionally, while the server grid was rebooting, new join attempts would frequently go to a server that was just about to die and you’d get the dreaded ID-17 error. Also, with the addition of Team Create, we would shut down your active edit session, which could lead to data loss.
Over the years we’ve made some important improvements to the system, reducing the time it took to do the rollout and the probability that the rollout would go badly and have to be reverted, but the general shape of the process stayed the same. I’m happy to tell you that we have migrated to a completely new process and have done one successful release using it - this is what we will be using going forward.
Here’s how the new process works:
1. Every week at night (PST), we release a new server binary and start deploying it to our server grid. We run smoke tests to make sure you can still launch games without issues, and we monitor metrics.
2. Once we do that, newly started games (private servers, or fresh servers that get created if there is no room on existing servers) use the new server binary; all existing games and TC sessions proceed as is.
3. After the binary is deployed to the entire grid, we mark the old version as “bleed-off”. This means that our matchmaking algorithm will prefer putting people who join games on the new version rather than the old version; you can still join old servers under some circumstances.
4. ~12 hours later, we forcibly shut down servers that are still running the old version. Due to how the matchmaking algorithm works, this results in very minimal disruption - it kicks ~0.3% of players from their game sessions (they might be bots! Or people cheating in Weight Lifting Simulator. We don’t know).
5. Right after that we release new desktop clients and Studio. This might change at some point, but for now it makes sure that the new servers can’t get connections from clients-from-the-future. This forces you to update your client next time you click Play.
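To make the bleed-off idea above a bit more concrete, here is a minimal sketch in Python of how a matchmaker might prefer new-version servers while still honoring friend joins. This is not the actual matchmaking code: the `Server` fields, the `pick_server` function, and the tie-breaking rule (fill the fullest server first) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Server:
    server_id: str
    version: str             # server binary version (hypothetical field)
    players: int
    capacity: int
    bleed_off: bool = False  # True once this server's version is marked "bleed-off"

    def has_room(self) -> bool:
        return self.players < self.capacity


def pick_server(servers: Sequence[Server],
                friend_server_id: Optional[str] = None) -> Optional[Server]:
    """Return a server to join, or None if a fresh server should be started.

    A fresh server would run the new binary, as in the steps above.
    """
    # Assumption: joining a friend targets their server directly,
    # even when that server is on a bleed-off version.
    if friend_server_id is not None:
        for server in servers:
            if server.server_id == friend_server_id and server.has_room():
                return server

    candidates = [s for s in servers if s.has_room()]
    if not candidates:
        return None

    # Prefer servers that are NOT bleeding off (False sorts before True).
    # Assumption: among equals, fill the fullest server first.
    return min(candidates, key=lambda s: (s.bleed_off, -s.players))
```

With one old server and one new server that both have room, a regular join lands on the new one, while a join with `friend_server_id` set to the old server's id still lands on the old one.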
When things go well (which hopefully will be most of the time! short of us releasing a bug), only one step here forcibly terminates your gameplay session - step 4, and it only happens if you’ve been keeping the server alive for >12 hours. Most other steps are not disruptive at all; step 5 is somewhat disruptive to desktop players, who are forced to download the new version, but isn’t disruptive to anybody else. The process works the same for Team Create - we will only terminate your TC session if it’s been kept alive for >12 hours during the release night, which should dramatically reduce the effect of releases on TC - we will continue to improve in this area to eliminate it completely.
We are still tweaking the process and implementing more safeguards so that we can interrupt and roll back the process before serious bugs start affecting users/developers.
Huge thanks to the team that worked on this - many people were involved, including engineers from the client team who had to change client/server installers to make rollouts like this safer, engineers from the scalability team who had to build the new gradual bleed-off mechanism, engineers from the release team who had to completely rework the automated deployment system with the new flow in mind, and many others who helped manage, guide and test the process. I’m just a messenger.
You can’t get the new client while your friend is still in an old server, because the new client is only released at the +12h point of the release, when we close old servers. If you join your friend after step 3, you will get into an old server - the matchmaking algorithm takes that into account.
But real glad to see that servers won’t go into a mass panic shutdown whenever the orange banner pops up! What happens if someone who has a new version tries to join a friend in an older server, though?
See my reply to @TheGamer101 - it works, matchmaking respects that. You can theoretically keep the server alive with short gaming sessions if you keep joining your friends and then they leave and other people join you etc., but it’s unlikely that this can be sustained over 12 hours. And if it does happen, we’ll shut this server down the following morning.
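As a rough illustration of the "keep the server alive with short sessions" point, the sketch below takes a list of play sessions and finds when the server first becomes empty. It assumes, purely for the example, that a server stops as soon as nobody is connected and that times are hours since the server started; the `empty_at` helper is hypothetical.

```python
from typing import List, Tuple


def empty_at(sessions: List[Tuple[float, float]]) -> float:
    """Return the hour at which the server first has no players.

    Each session is (join_hour, leave_hour), measured from server start.
    Assumption: the server stops as soon as it is empty.
    """
    if not sessions:
        return 0.0
    ordered = sorted(sessions)
    end = ordered[0][1]
    for join, leave in ordered[1:]:
        if join > end:
            # Nobody was connected between `end` and `join`: the chain breaks here.
            break
        end = max(end, leave)
    return end


# Three sessions, but the third starts after the server is already empty.
print(empty_at([(0, 2), (1.5, 4), (5, 6)]))  # 4
```

The server only reaches the 12-hour cutoff if this value is at least 12, which requires every session to overlap the next one for the whole night.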
It’s 11pm PST on a Tuesday. I’m writing some crazy machine learning code that will change the world
Servers shut down and my TC dies and I lose my code (maybe cause the server doesn’t exist anymore?)
I just lost an entire night’s worth of work (this has happened to @Sharksie and me on multiple occasions)
We typically have our team create servers running forever since we are a studio with multiple full-time contractors. I’ll be halfway through a block of code and it will lose connection.
I’m not sure what the solution is to this (fixing Team Create saving is of course critical) but a dedicated release time, or just a Studio prompt that gives a warning, would help remedy this issue. Do TC servers pull WIP scripts from clients when they shut down? (The code recovery tool is a nightmare because it says you lost work on identical diffs)
The new release flow should make this specific scenario impossible. The way it can go wrong with the new flow is:
It’s 9 PM PST on a Tuesday. You start a fresh TC session and start working on it.
In 1 hour, we release the new server version, but your server keeps running on the old version
In 4 more hours, you are tired but @Sharksie joins you; you continue writing code for an hour and then sign off and leave
In 5 more hours, @Sharksie is kinda tired as well and wants to stop but luckily you woke up and are eager to continue writing your machine learning code that changes the world, so you sign back in
In 3 more hours, it’s 10 AM PST on a Wednesday and we kill your TC session and you lose some amount of unsaved progress + trigger possible TC bugs. Note that we kill your TC session because it was started on the old server version and it has been 12 hours since the new server version became available - if the TC session had started after 10 PM PST it would be on the new version and we wouldn’t touch it until next week.
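The timeline above boils down to a small piece of date arithmetic, sketched here in Python. The `forced_shutdown_time` helper and the exact 12-hour constant are assumptions for illustration (the post says "~12 hours"), and the sketch treats the release as a single instant even though the real deployment is gradual. The dates are arbitrary; 2 January 2024 happens to be a Tuesday.

```python
from datetime import datetime, timedelta
from typing import Optional

# "~12 hours" in the post; the exact value is an assumption.
BLEED_OFF_WINDOW = timedelta(hours=12)


def forced_shutdown_time(session_start: datetime,
                         release_time: datetime) -> Optional[datetime]:
    """Return when this release would force-close a session, or None.

    All times are naive local (PST) datetimes. The result only matters
    if the session is still alive at that moment.
    """
    if session_start >= release_time:
        # Started on the new binary: untouched until the next weekly release.
        return None
    return release_time + BLEED_OFF_WINDOW


release = datetime(2024, 1, 2, 22, 0)  # Tuesday, 10 PM

# Session started at 9 PM, one hour before the release.
print(forced_shutdown_time(datetime(2024, 1, 2, 21, 0), release))
# 2024-01-03 10:00:00

# Session started at 10:30 PM, after the release.
print(forced_shutdown_time(datetime(2024, 1, 2, 22, 30), release))
# None
```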
We currently don’t have specific stats on TC session shutdowns so we don’t know - this might be happening right now. It would definitely be great to add some kind of notification for this. In theory you should only lose 5 minutes of progress, but there are some fixes the Studio team is actively working on (check with @Silent137 for details) that should address data loss that can occur due to disconnects/shutdowns like this.
Could TC sessions get notifications on when a forced shutdown will happen? If you’re working on something all night or forget what day of the week it is, you may not realize a new update is rolling out. At least this way anybody working on something can anticipate when a shutdown will occur, save their work, or start a new updated server instance in advance.
Something as simple as a dialog box for all connected clients would suffice.