X in 2^24: A tale of the birthday problem, and how you can avoid the same mistake

Summary

For the past year, I’ve been working on and off on a remake of Roblox’s old building place, called “Welcome To Roblox Building”.

In an effort to allow for much bigger structures, I used a lot of compression techniques to pack the save data down as much as possible.

Despite my best efforts to make sure everything was handled smoothly without the possibility of data corruption, I made a few critical mistakes along the way which resulted in an elusive bug occasionally corrupting people’s data in my game.

Having finally discovered the flaw that caused this bug, I wanted to tell the story of how this ended up happening, and how you might be able to avoid the same fate should you go down a route of data compression as well.


Save Data: Phase 1

Models in WTRB are built using a catalog of assets which can be stamped onto your baseplate. Each asset is assigned a unique ID.

For every unique stamped object on your baseplate, I assigned a GUID, so that each object could be identified for modification, deletion, serialization, etc., on the server.

When saving the data, I would map the save data into a dictionary like so:

{
	Objects = 
	{
		[GUID] = AssetId;
		...
	}
	
	CFrames = 
	{
		[GUID] = CFrame;
		...
	}
}

This made it very easy to read and write the data I needed to assemble baseplates, but it was very wasteful:

  • AssetIds were written multiple times and were usually 7-9 digits long.
  • The GUIDs were 30 characters long each.
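
For a concrete sense of that redundancy, here’s a hypothetical single entry in that format (the GUID and AssetId below are invented for illustration):

{
	Objects = 
	{
		["a3f29b7c41d08e55c6729fd013ab64"] = 119103786; -- 30-character GUID key, 9-digit AssetId
	}
	
	CFrames = 
	{
		["a3f29b7c41d08e55c6729fd013ab64"] = CFrame.new(12, 0.5, -36); -- the same 30-character GUID repeated
	}
}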

So it’s clear I can optimize away the redundancy in this data. Let’s try a few things!

The Critical Mistake

One of the first optimizations I made to the data was reducing the GUIDs to 6 hexadecimal digits.
This would cut down on the space consumed by the GUIDs used to index the dictionaries of my save data.

At a glance this doesn’t seem to be much of a problem. Using 6 hex digits gives you 2^24, or 16777216, unique IDs to work with. Under the assumption that I had fully safeguarded against any possible case of two IDs sharing the same value, it seemed unlikely that people could place that many objects without hitting the 260 KB data limit of DataStores anyway. I was in the clear with that optimization, right?
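
As a rough sketch (not necessarily my exact code), generating such an ID boils down to formatting a random number below 2^24 as hex:

-- Sketch: generate a random 6-hexadecimal-digit ID (2^24 possible values).
local function generateShortId()
	return string.format("%06x", math.random(0, 2^24 - 1))
end

print(generateShortId()) --> e.g. "0f3a9c"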

Well, yeah! The number of IDs wasn’t actually the problem here. The problem was that I had made it far more likely for an ID collision to cause damage, should I fail to check for collisions correctly. This is eventually what happened, but we’ll get to that in a moment.

Further Optimization

The more I saw people building crazy structures that were pushing my optimizations to their limits, the further I wanted to compress my data. That’s all perfectly reasonable, but the flaw I introduced above would eventually come back to bite me.

To further optimize the data, I employed several techniques. I won’t go into the specifics of how I implemented them, but I’ll summarize what I was doing to the data to help illustrate where the problem comes in.

The Techniques:

  • Packed lists of integers into their raw bytes, with leading zeros removed, as base64 strings.
  • Used accumulative arrays with the packed integer lists so I only had to store the differences of each successive value, thus introducing more leading zeros to be optimized away.
  • Stored all used AssetIds in a sorted table, to be indexed by the object list.
  • Sorted the object lists by AssetId, and used RLE (run-length encoding) to crunch down the linkage of objects to AssetIds (see the sketch after this list).
  • Eliminated the need for object IDs to be shortened, because they didn’t need to be stored!
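
To make the delta and RLE ideas concrete, here is a minimal sketch of those two transforms (simplified; my actual serializer also packed the results down into base64 bytes as described above):

-- Sketch: delta-encode a sorted list of AssetIds, and run-length encode a sorted list of asset indices.
local function deltaEncode(sortedIds)
	local deltas, previous = {}, 0
	for _, id in ipairs(sortedIds) do
		table.insert(deltas, id - previous) -- successive differences are small, giving more leading zeros to strip
		previous = id
	end
	return deltas
end

local function runLengthEncode(assetIndices)
	-- e.g. {1, 1, 1, 2, 2, 3} -> { {1, 3}, {2, 2}, {3, 1} }
	local runs = {}
	for _, index in ipairs(assetIndices) do
		local lastRun = runs[#runs]
		if lastRun and lastRun[1] == index then
			lastRun[2] = lastRun[2] + 1
		else
			table.insert(runs, {index, 1})
		end
	end
	return runs
end

print(table.concat(deltaEncode({119103786, 119103790, 119103800}), ",")) --> 119103786,4,10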

The Collision & Ensuing Catastrophe

When I introduced the RLE optimization to the object list, it became critical that all objects were mapped correctly with no missing data. This is where the 6-hexadecimal digit flaw came back to bite me.

Consider a scenario where someone has stamped 2000 objects onto their baseplate. If a collision bug was active in my code, there would be a 2000 / 16777216 chance of two objects sharing the same ID, roughly 1 in 8388.

Although those odds might seem slim, they are absolutely unacceptable for data integrity.
Players could load their baseplate saves at any time, and the game would save their baseplate every 15 seconds.

Due to the aforementioned optimizations, I would generate new random IDs for them to use every time an object was loaded or stamped onto the baseplate. In turn, there was a non-zero chance for a collision to occur, and it was more likely to happen if you had more objects on your baseplate.

The possibility of this corruption was CATASTROPHIC for my optimization algorithms. Since I was using ipairs to iterate over the object lists when generating their RLE string, it was possible for the loop to terminate early and fail to map all of the objects out.
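
For anyone unfamiliar with the footgun here: ipairs walks an array from index 1 and stops at the first nil it finds, so a single hole silently truncates the loop. A minimal demonstration:

-- ipairs stops at the first nil index, silently skipping everything after the hole.
local objects = {"Brick", "Wedge", nil, "Truss", "Spawn"}

for index, name in ipairs(objects) do
	print(index, name)
end
--> 1 Brick
--> 2 Wedge
-- Indices 4 and 5 are never visited, so "Truss" and "Spawn" would be silently dropped from the save.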

This is… unfortunately what happened to me. The probability of the bug occurring was extremely low, so I couldn’t reproduce it on my end. Its effects were destructive, and it ruined a lot of really awesome builds, to my absolute bewilderment as to how or why.

Mitigation

I was able to mitigate the bug to some extent. I identified the ipairs problem relatively fast, which stopped data from being destroyed on a large scale, but it was very difficult to trace down the root problem due to the relatively low probability of it happening.

The amount of data I had to comb through when I occasionally did have some sort of repro wasn’t very helpful either. Because the data was already crunched down and optimized by the time I viewed it in the Lua debugger, it was very difficult for me to draw any conclusions about what was happening.

Eventually I resorted to using uncompressed RLE data whenever I detected misalignment in the serialized object list. This didn’t stop corruption from happening, but at most it would only affect one object, at the cost of the data growing about 10-20% in size.
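
Roughly speaking, the check works like this (a sketch of the idea, not my exact code): before trusting the compressed form, verify that the RLE runs account for every object, and fall back to the plain list if they don’t.

-- Sketch: sanity-check the RLE output before saving it.
local function runsAccountForEverything(runs, objectCount)
	local total = 0
	for _, run in ipairs(runs) do
		total = total + run[2] -- run = {assetIndex, count}
	end
	return total == objectCount
end

local function serializeObjects(runs, plainList)
	if runsAccountForEverything(runs, #plainList) then
		return { RLE = runs }
	end
	-- Misalignment detected: store the uncompressed list instead, which grows
	-- the data by roughly 10-20% but limits any damage to a single object.
	return { Raw = plainList }
end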

Retrospective

This bug left me dumbfounded for months, because it was happening seemingly at random and I had no way of reproducing it. I always had a nagging feeling in the back of my head that it had something to do with the random IDs, but I never felt convinced it was the issue’s source.

See, I’m used to dealing with absolutes when it comes to bugs in my code. That mindset steered me away from considering this as the problem, up until today. This discovery is definitely an eye-opener for me.

So what’s the moral of this story? Use hash tables responsibly. If you need to uniquely identify a large collection of objects, make sure there is absolutely no chance of two objects sharing the same ID in systems where it is critical that this never happens.
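
In practice, that can be as simple as rerolling until the ID is actually unused (or skipping randomness entirely with a counter). A minimal sketch of the reroll approach:

-- Sketch: never hand out an ID that is already in use.
local usedIds = {}

local function allocateId()
	while true do
		local id = string.format("%06x", math.random(0, 2^24 - 1))
		if not usedIds[id] then
			usedIds[id] = true
			return id
		end
	end
end

local function releaseId(id)
	usedIds[id] = nil
end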

Be very careful with data optimization. Always assume something can go wrong, and make sure to cover all your bases against the possible ramifications of how you’re implementing your code, including the scenario I just described above.

Dealing with cases like these is not fun. The odds are minuscule enough to prevent clear reproduction, but significant enough to have huge ramifications. I hope you guys found this story informative and useful! Best of luck in designing your game’s save data :sunglasses:!

73 Likes

This looks like the birthday problem (i.e. given X people with random birth dates, what is the chance that all of the birthdays are different), so this might shock you, but the actual probability that all 2000 objects would have unique IDs out of 16777216 is only 88.77%, and so the chance of at least one pair of objects sharing the same ID is 11.23%, or about 1 in 9, not 1 in 8388.
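
For anyone who wants to verify that number, the exact probability is the product of (1 - k / 2^24) for k from 0 to 1999, which takes a few lines to compute:

-- Birthday problem check: probability that 2000 random IDs out of 2^24 are all unique.
local probabilityAllUnique = 1
for k = 0, 1999 do
	probabilityAllUnique = probabilityAllUnique * (1 - k / 2^24)
end

print(probabilityAllUnique)     --> ~0.8877 (all IDs unique)
print(1 - probabilityAllUnique) --> ~0.1123 (at least one collision)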

17 Likes

Someone mentioned that analogy to me earlier, which makes a lot more sense.
I saw this error come up in my larger builds far more frequently than I expected.

7 Likes

Why don’t we just use incrementing numbers to assign unique identifiers? In my opinion, it’s the best option. GUIDs will always have a minuscule collision chance no matter what. This is what I do in my forum engine: each post gets the current maxposts value in the datastore, incremented by 1, as a unique identifier for the post.

I don’t see the point of GUIDs. Another idea: since you’re only using catalog items, you could assign an array for each catalog item placed, and inside that array put all the objects placed with that item id. I think this one would consume more memory though. I might be wrong about only using catalog items.

I feel like there can be a lot of concurrency issues with that. Especially since there are multiple servers running, things can easily become misplaced.

IncrementAsync() is like UpdateAsync(): it always returns the latest value. Note that I am not using GetAsync() and SetAsync(). If I were using SetAsync(), that would be a problem; instead, I’m using IncrementAsync().

If you want to read more about it here:

Didn’t realize this was a thing! Not a bad solution then. Just have to make a key keeping track of the last version.
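
Something like this, as a rough sketch (the store and key names are just placeholders):

-- Sketch: allocate sequential, collision-free IDs from a single DataStore counter key.
local DataStoreService = game:GetService("DataStoreService")
local counterStore = DataStoreService:GetDataStore("ObjectIdCounter") -- hypothetical store name

local function allocateNextId()
	-- IncrementAsync is atomic, so two servers can never receive the same value.
	local ok, newId = pcall(function()
		return counterStore:IncrementAsync("LastObjectId", 1)
	end)
	if ok then
		return newId
	end
	return nil -- handle the failure (retry, queue, etc.) however your game prefers
end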

1 Like