:SetAsync() calls are getting stuck at scale - yields thread indefinitely & no datastore error messages reported

Both services share the same HTTP thread pool, so it is possible that actions taken on one will impact the other.

As for your issue with simultaneous writes from different servers – the only way I can see this happening is if a player is rapidly joining and leaving games, which seems possible.


Thanks, Seranok. The only time I can see rapid joining/leaving happening is in the case of teleports, where the player joins the new server before the data has had a chance to save. To avoid this, we always save before the teleport happens, so the issue is likely due to something else.
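For reference, the save-before-teleport flow is roughly this shape (a simplified sketch, not our actual code; savePlayerData and TARGET_PLACE_ID are placeholder names):

```lua
local DataStoreService = game:GetService("DataStoreService")
local TeleportService = game:GetService("TeleportService")

local playerStore = DataStoreService:GetDataStore("PlayerData")
local TARGET_PLACE_ID = 0 -- placeholder place id

-- Save the player's data and only teleport once the save has succeeded.
local function savePlayerData(player, data)
	return pcall(function()
		playerStore:SetAsync("Player_" .. player.UserId, data)
	end)
end

local function saveThenTeleport(player, data)
	local ok, err = savePlayerData(player, data)
	if ok then
		TeleportService:TeleportAsync(TARGET_PLACE_ID, { player })
	else
		warn("Save failed before teleport for", player.UserId, err)
	end
end
```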

We track every purchase externally, and at this point in our game there's no way for the player to remove items from their inventory. We have a script that re-gives a player an item when it's not in their inventory but they have a purchase record for it, and I'm seeing these re-gives happen about once a minute, which is a very tiny fraction of our total traffic, but still occurring nonetheless. After monitoring these cases, I've seen that certain users are re-given items for a while, and then it stops for them. I've replicated these users' data and have been unable to reproduce the issue, so it's not something strange in their data.

What I think might be happening is one of two things: either the SetAsync calls were getting stuck, so the server was unable to clear those players from the data cache, or PlayerRemoving did not fire in these cases. Either way, the old server thinks the player is still present, so when they join a different server, their data gets overwritten by the autosave in the stale server they're no longer in. I have no concrete evidence for that at this point, but I will try to gather evidence that there might be cases of two game servers thinking the same player is in both and get back to you. Thank you for everything!
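For illustration, a guard against that kind of stale overwrite could look something like the sketch below (not our actual save code; the cache table and lastSaveTime field are assumptions made just for the example):

```lua
local DataStoreService = game:GetService("DataStoreService")
local Players = game:GetService("Players")

local playerStore = DataStoreService:GetDataStore("PlayerData")
local cache = {} -- [userId] = data table this server is holding for the player

local function autosave(userId)
	-- If PlayerRemoving never fired but the player is gone, drop the entry
	-- instead of letting a stale autosave overwrite another server's data.
	if not Players:GetPlayerByUserId(userId) then
		cache[userId] = nil
		return
	end

	local data = cache[userId]
	if not data then
		return
	end

	local ok, err = pcall(function()
		playerStore:UpdateAsync("Player_" .. userId, function(old)
			-- If another server saved more recently than the copy this server
			-- loaded, cancel the write rather than clobbering it.
			if old and old.lastSaveTime and data.lastSaveTime
				and old.lastSaveTime > data.lastSaveTime then
				return nil -- returning nil cancels the UpdateAsync write
			end
			data.lastSaveTime = os.time()
			return data
		end)
	end)

	if not ok then
		warn("Autosave failed for user", userId, err)
	end
end
```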

Hi Seranok

I suspected a while ago that something else might be causing DataStore requests to be queued when they shouldn't be (i.e., not as a result of calling SetAsync on the same key within six seconds, nor the SetIncrementAsync budget being exhausted), but I didn't really have anything solid to show at the time other than a couple of specific cases. This morning, though, I saw DataStore requests being queued across multiple game servers at the same time. These also corresponded with the HTTP 429 "too many requests" errors you can see in the screenshot below. It seems that the two are related and that requests in one server can be queued based on things happening outside of that game server.
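For clarity, the budget check I'm referring to is something like the following (a minimal sketch, not our actual code; the one-second retry loop is just illustrative):

```lua
local DataStoreService = game:GetService("DataStoreService")

-- SetAsync and IncrementAsync both draw from the SetIncrementAsync budget.
local function waitForWriteBudget()
	while DataStoreService:GetRequestBudgetForRequestType(
		Enum.DataStoreRequestType.SetIncrementAsync
	) <= 0 do
		task.wait(1)
	end
end

local function budgetedSet(store, key, value)
	waitForWriteBudget()
	return pcall(function()
		store:SetAsync(key, value)
	end)
end
```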

I saw the same thing today: requests being queued across game servers at the same time. This correlated with a large number of reported stuck DataStore requests, which makes sense for the reasons @Unholykowboy described (these requests are added to the queue when SetAsync is called). However, the requests should never be queuing, as we never write to the same key within six seconds and we make sure we never exceed the DataStore budget. The fact that the spikes happen at the same time across different JobIds (the place instance id column in the screenshots below) is another indication that something strange is going on. The first screenshot shows the most recent spike of queued requests across different game servers; the second shows the next spike, hours later. These spikes are not happening consistently, nor in the same server, but across multiple game servers at the same time.
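For context, one way to flag a stuck SetAsync and tag it with the server's JobId for this kind of cross-server comparison could look like the sketch below (reportToStats, the 30-second threshold, and the print-based stub are hypothetical placeholders, not our actual tracking code):

```lua
-- Placeholder for whatever external analytics/logging pipeline is in use.
local function reportToStats(payload)
	print("[stats]", payload.event, payload.key, payload.jobId, payload.durationSeconds)
end

local function timedSet(store, key, value)
	local started = os.clock()
	local finished = false

	-- Flag the request as stuck if it hasn't returned after 30 seconds.
	task.spawn(function()
		task.wait(30)
		if not finished then
			reportToStats({
				event = "SetAsyncStuck",
				jobId = game.JobId,
				key = key,
			})
		end
	end)

	local ok, err = pcall(function()
		store:SetAsync(key, value)
	end)
	finished = true

	reportToStats({
		event = ok and "SetAsyncCompleted" or "SetAsyncFailed",
		jobId = game.JobId,
		key = key,
		durationSeconds = os.clock() - started,
	})

	return ok, err
end
```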