:SetAsync() calls are getting stuck at scale - yields thread indefinitely & no datastore error messages reported

Ice7 · August 24, 2019, 5:43pm

Since Thursday morning I have tracked 3,887 cases of :SetAsync() calls getting stuck - not returning anything. These correspond to 90 unique cases. By unique cases, I am referring to 90 different user ids. As 90 may not seem like many, I am very concerned as to why this is happening for these particular user ids.

How do we know they are getting stuck?

Two reasons:

I am setting a flag after our datastore retry function completes. The function accepts a datastore function, such as SetAsync, or GetAsync, and that datastore function is called inside a pcall. It only retries 5 times, or until there is a successful response. If the flag has not been set after all datastore retries complete, then I am logging this event to our external backend. In other words, the :SetAsync() is getting stuck with no error messages.
Another reason I know this is happening is because we are unsetting the player from the session data cache as soon as our datastore function completes. These players are never getting unset from the data cache, meaning that even after they leave, our autosaving function continuously attempts to save their data. I can get around this by checking to see if the player is actually in the server before it tries to save their data, but it won’t change the fact that the calls to :SetAsync() are getting stuck,

What I can gather from all of this is that its happening several times for the same players, so the datastore servers are timing out for these keys (user ids).

The game is Royale High, so the total number of :SetAsync() calls overall in all servers collectively is several thousands per minute. Of these millions of :SetAsync() calls since Thursday, there are only 3,887 cases where they are getting stuck - so it’s an issue that’s only noticeable at this scale.

Unholykowboy · August 24, 2019, 6:05pm

@Ice7 - We will investigate and follow up.

Ice7 · August 24, 2019, 10:31pm

Thank you! Let me know if you need the user ids and job ids for these cases.

Unholykowboy · August 25, 2019, 4:08pm

@Ice7 - Thank you for the additional info, we will review this internally and follow up with you with an update or with a request for additional details.

Ice7 · August 26, 2019, 6:14pm

Hey @Unholykowboy - just letting you know we are seeing a large spike in 502 datastore fails just now.

Seranok · August 26, 2019, 7:09pm

@Ice7 Can you show the full error message? Some of it is clipped.

Ice7 · August 26, 2019, 7:14pm

Sure thing: These errors seemed to all be the following: 502: API Services rejected request with error. HTTP 0 (HTTP 403 (HTTP 403 (Forbidden)))

These started at 17:00 UTC time and went on until around 17:28. There were 952 of these failures during this time (all 403 Forbidden errors).

Seranok · August 26, 2019, 9:50pm

We are looking into both issues (SetAsync calls getting stuck + some Data Store operations resulting in 403s).

OutsideInSM · August 27, 2019, 5:00pm

I’ve also been asked to help and will be DM’ing you for more info shortly.

Seranok · September 10, 2019, 4:43pm

Hey Ice7, we looked into both issues.

For the 403 issue, it’s definitely a bug. We’re adding additional logging so if it ever happens again we’ll have more details. Please let us know if you see additional errors.

For the SetAsync yielding, are you seeing the following warning in the console?

DataStore request was added to queue. If request queue fills, further requests will be dropped. Try sending fewer requests.Key = TestKey

If so, this is intended behavior. If you perform too many SetAsync() calls on the same key, you will see this warning.

Ice7 · September 10, 2019, 6:56pm

Hi Seranok, Thank you - I will definitely continue to monitor this on our end and let you know if we see any more spikes.

I’m preventing :SetAsync() calls from getting called for the same key within six seconds. Essentially I am storing a table where the index is the players id and the value is the last time :SetAsync() was called for that player id. Before I call :SetAsync(), I’m checking to see if was it was called in the last 6 seconds - if that’s the case, then bailing out. I am logging any times when the following warning message gets displayed:

DataStore request was added to queue. If request queue fills, further requests will be dropped. Try sending fewer requests.Key = %

This shouldn’t be occurring at all in our case since I’m checking to see when we called :SetAsync() last, and if it was in the last six seconds, then bailing out. However my logs are indicating I’m seeing it for one key (player id), 3763207098 in particular it is occurring hundreds of times in different place instance ids as well, which is really strange.

I noticed the “AutomaticRetry” property in DataStoreService - not sure if it has anything to do with passive retries. I’m not sure why it’s trying to make the request over and over again (since the warning is occuring). I don’t see how it could be triggered on our end (since we’re bailing out if the last call was less than 6 seconds ago, so the request should never queue), That’s why I’m suspecting it might have something to do with passive retrying, if DataStoreService even enables that?

I also crossed reference this to these warnings with data store fails for this key and I saw the following. This key was failing in particular due to a HTTP 429 error:

There were two fails at exactly the same time (rows with ids 1002186 and 1002187 in the following screenshot). But they were in different place instance ids as well.

I only saw this many request queue warning messages for this key in particular, and there were only 14 in total since yesterday, but as you can see from the first screenshot, the rest were failing either one or two times.

buildthomas · September 10, 2019, 7:27pm

Do you set the last time before or after the SetAsync call, or both?

If you only set it before, it might get into trouble when i.e. the call completes 2 seconds later (so from that point, you have 6 seconds) and you call it 4 seconds later, and the delay will escalate over time because requests will be yielding.

If you only set it after, then you get into trouble when your code can accidentally call the SetAsync several times before the first one in the sequence finishes, also leading to yielding.

Ice7 · September 10, 2019, 7:29pm

Immediately before the call to :SetAsync() (if i were to set it after, theres a possibility of two or more requests calling :SetAsync() if the successive ones are called before the call to the first completes).

buildthomas · September 10, 2019, 7:32pm

So an engineer can correct me if I’m wrong, but AFAIK the 6 seconds count from the time the call returns. That might be the issue that would escalate the amount of warning messages over time because the yielding calls can stack up that way if you assume it’s timed from right before the SetAsync.

Ideally you’d “lock” the datastore for that player from saving while a setasync is in progress, then time from after the call:

if lock[userId] or tick() - (timer[userId] or 0) < 6 then
   return
end
lock[userId] = true
-- do SetAsync call
timer[userId] = tick()
lock[userId] = nil

Seranok · September 16, 2019, 5:42pm

@Ice7 Can you confirm if you are assuming the 6 seconds starts when the call is first made or when the call completes? Would like to confirm whether this is due to the client queueing throttled requests or if there is an issue we need to investigate further.

Ice7 · September 17, 2019, 4:07pm

The 6 seconds starts before the :SetAsync() call is made. Are you referring to the case where if it were to be set afterwards, two or more calls could potentially be made before the lock is set? Also, the requests that were throttled as shown in the screenshot above were the only ones that day (September 10th) - and one user id had 454 throttled requests. I also cross referenced that user id (376327098) with the datastore fails, and as you can see in the screenshot immediately below there were two fails for that user (502 API services rejected Http. 429) at exactly the same time (2019-09-10 17:39:28) in different place instance ids - it seems astronomically unlikely that this could happen for the same user id at exactly the same time. I will check the status on the request throttles today and get back to you.

EchoReaper · September 18, 2019, 7:31pm

FWIW, as someone who was contracted to fix another front page game (Zombie Rush)'s DataStore congestion and the resulting data loss, saving every six seconds is seriously overkill. It is common for developers to encounter DataStore issues and think “I will just save more often” and end up saving every six seconds, saving each version of player data as a new key, keeping track of the entire history in an OrderedDataStore, or backing up player data to some webserver. This just makes your game more prone to error.

When you are treading limits like this, you run into issues like the one you’re encountering now where being off by a small margin has catastrophic consequences for your game. There’s no reason to save this often, and it minimizes the amount of free DataStore requests you have to spare. Instead of haphazardly spam saving constantly, you only need to save once every few minutes (Zombie Rush does five) or in a few select cases like when the player leaves the game (since they may then join another server and need their data asap) or a Robux purchase that you want to be able to return the correct ProductPurchaseDecision to. This will also make your game extremely resistant to failure under events like a large # of players leaving at once (e.g. shutdown) since you have a large reserve of requests and aren’t waiting in BindToClose for 30 seconds hoping you’ll regenerate enough. Queueing up requests like this also makes it easy to handle errors + retry since the code modifying data never needs to handle this itself.

Ice7 · September 18, 2019, 7:37pm

I am not saving every 6 seconds… The six seconds discussed above is specifically referring to a prevention mechanism implemented to prevent requests from queuing - if SetAsync was called last then six seconds ago for the same key, don’t make the call again, so that the request doesn’t queue. What I’m describing is a scenario where my data is showing rare cases of specific keys getting stuck.

Seranok · September 24, 2019, 12:55am

Given that we’ve only received a single report about this issue, I’m tempted to say this is an implementation-specific issue and not a Roblox bug. Happy to revisit if we can get solid evidence that this is an issue, e.g. a simple reproduction case or reports that other games are experiencing the same issue.

Ice7 · September 26, 2019, 6:46pm

Hi Seranok - Thank you for looking into this. We have outgoing Http Requests in our game - is there any way that the Http Budget could be affecting the DataStore budget? Are these two services in any way intertwined?

I just have one final question for now. How is this scenario possible, two datastore fails at exactly the same time in different place ids as show in the screen shot (look at the entries with the timestamp 2019-02-10 17:39:28, the 2nd last row and the 3rd last row) Again, I’m not seeing any requests being queued - it would make sense if the request was made and was queued, player joined a different server, and the datastore request in that server was also queued, and they both failed at the same time - however, I am not seeing any logs of requests queueing, since I implemented my prevention mechanism for that (by checking the DS budget and preventing :SetAsync() from being called)