About an hour ago, we began receiving reports of data loss in our game, even though we haven’t updated it since yesterday.
The issue started at 12:55 PM EST. We’ve temporarily made our game private while we investigate the cause.
A few weeks ago, we switched to a new datastore solution. Previously, we were using DataStore2, but now we suspect Roblox may be returning incorrect timestamps for keys.
Here’s a code snippet we believe is causing the problem. This code determines whether to use our old datastore system or the new one we developed internally. Our best lead so far is that this code somehow changed its behavior about 1.5 hours ago, despite the game itself not having been updated in roughly 24 hours.
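Simplified, the decision logic looks roughly like this (store names are placeholders, and error handling is omitted):

```lua
local DataStoreService = game:GetService("DataStoreService")

-- Placeholder store names; the real code wraps these in our own modules.
local newStore = DataStoreService:GetDataStore("PlayerDataV2")
local legacyStore = DataStoreService:GetDataStore("LegacyPlayerData")

local function loadMostRecent(userId: number)
	local key = tostring(userId)

	-- GetAsync returns the value plus a DataStoreKeyInfo with server-side timestamps.
	local newData, newKeyInfo = newStore:GetAsync(key)
	local legacyData, legacyKeyInfo = legacyStore:GetAsync(key)

	-- Pick whichever copy was written most recently.
	if legacyKeyInfo and (newKeyInfo == nil or legacyKeyInfo.UpdatedTime > newKeyInfo.UpdatedTime) then
		print("Chose DataStore2 data (more recent)") -- the log line we saw during the incident
		return legacyData
	end
	return newData
end
```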
I'm from the Data Store team. We did not make any changes to the service, and we don't see any errors on our dashboards or reported from other experiences.
Can you describe what data you are storing in your Data Stores?
Can you provide some Data Store keys, the value you expect to be stored, and the value you are actually retrieving that differs from your expectations?
Hi team, could you post a summary here of what issues you saw with timestamps?
E.g. at 9:22 AM, you noticed DataStoreKeyInfo not being populated correctly - were all fields corrupted, or only certain ones? Which ones did you use in your switch logic?
Hey, thanks so much for the rapid response. We deployed a fix and a rollback, both of which were successful.
I'm still investigating what happened. I haven't been able to find any evidence of corrupted or incorrect data from DataStoreService. The thing that's really confusing to me is that, at the time of the incident, all of our data storage logic had gone unchanged for over a week, and zhongbot mentioned that DataStores, as we have been using them, had been unchanged for a week as well.
That sounds to me like a feature flag flipping, or a backend service changing or failing. However, we confirmed that our own services didn't experience any issues at or around the time of the incident. So the last thing I can think of is that maybe there was an FFlag in DataStoreService in the engine (if there are any at all) that flipped?
Either that, or an extremely rare race condition with our code occurred in a few servers at the same time. Very strange.
Our code uses DataStoreKeyInfo.UpdatedTime to decide which is the most recent data. If a server ever decides to use DataStore2 data over our new system, we emit logs indicating that. During the incident, we saw logs saying it chose DataStore2 data because it was more recent, but never any logs indicating we wrote to DataStore2 at any point.
I've narrowed it down to this sequence:
1. A few seconds before the log indicating we overwrote fresh data with stale DataStore2 data, something somewhere reads that stale DataStore2 data.
2. Whatever read it writes it back, which refreshes its UpdatedTime.
3. The server the player joined compares timestamps between our system's last write and DataStore2, sees that DS2 is more recent, and overwrites fresh data with DS2's "fresh" stale data.
I’ve gone over all of the relevant code several times and haven’t been able to come up with a plausible way this sequence could happen. Regardless, we fixed it by just completely ripping that bit out since we’re confident now in our new system.
Hi, I’d like to bring up some information that may help.
I recently ran into a problem that I don't think many people know about.
The issue is that despite waiting for a datastore save/update to complete in one server, then teleporting to another server for gameplay, the datastore can return old data. I suspect this is database replication lagging behind, and it seems to show up more frequently if the player is loaded into a server in another region.
My reproduction and confirmation steps were as follows:
1. Generate a GUID and put it into the player's data; also save that GUID into a MemoryStore, as it's more guaranteed to be up to date.
2. Save the player's data and wait for that to complete fully before teleporting.
3. Teleport the player to the new server.
4. Load the GUID from the MemoryStore.
5. Load the player's data; if the GUID in the data does not match, warn and retry until it does.
I've seen it take up to two GetAsync loads until the data matches correctly, for which I have an exponential backoff: it waits 2 seconds after the first inconsistent read, then 4 seconds before the second retry. A rough sketch of the repro follows below.
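Roughly, the repro looks like this (simplified; store and map names are placeholders, and error handling is omitted):

```lua
local DataStoreService = game:GetService("DataStoreService")
local MemoryStoreService = game:GetService("MemoryStoreService")
local HttpService = game:GetService("HttpService")

local playerStore = DataStoreService:GetDataStore("PlayerData") -- placeholder name
local saveMarkers = MemoryStoreService:GetSortedMap("SaveMarkers") -- placeholder name

-- Origin server: stamp the save with a GUID, then store the GUID
-- in MemoryStore, which updates more reliably than the datastore.
local function saveBeforeTeleport(userId: number, data)
	local guid = HttpService:GenerateGUID(false)
	data.saveGuid = guid
	playerStore:SetAsync(tostring(userId), data)
	saveMarkers:SetAsync(tostring(userId), guid, 300) -- expire after 5 minutes
end

-- Destination server: retry with exponential backoff until the GUID
-- in the loaded data matches the marker from MemoryStore.
local function loadAfterTeleport(userId: number)
	local expected = saveMarkers:GetAsync(tostring(userId))
	local delaySeconds = 2
	for attempt = 1, 4 do
		local data = playerStore:GetAsync(tostring(userId))
		if expected == nil or (data and data.saveGuid == expected) then
			return data
		end
		warn(("Stale data for user %d (attempt %d); retrying"):format(userId, attempt))
		task.wait(delaySeconds)
		delaySeconds *= 2 -- 2s, 4s, 8s, ...
	end
	return playerStore:GetAsync(tostring(userId)) -- give up and take the last read
end
```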
The reason I think it's replication is that I can keep loading the same key and eventually get newer results.
I've also talked with @Raspy_Pi and he said this is an issue as well. I couldn't seem to find people talking about this problem, but as stated, I believe this is whatever database Roblox uses having slow replication (slow as in 1-10 seconds for all regions to be updated).
@zhongbot @ivorycastle
I'd like to ping you to maybe get more details on the possibility of this happening, and to report that I've seen it happening.
According to loleris, this is an issue with GetAsync; from testing with purely UpdateAsync, it seems to be improved, but there's still more testing and analytics to gather.
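For anyone who wants to try the same workaround, here's a minimal sketch of reading through UpdateAsync by returning the current value unchanged (store name is a placeholder; note this still performs a write, so it counts against Data Store write limits and shouldn't replace GetAsync for every read):

```lua
local DataStoreService = game:GetService("DataStoreService")
local store = DataStoreService:GetDataStore("PlayerData") -- placeholder name

-- Read via UpdateAsync by writing the current value back unchanged.
local function readViaUpdate(key: string)
	local value = store:UpdateAsync(key, function(current)
		return current -- no modification; just forces a read-modify-write
	end)
	return value
end
```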