Massive increase in client crash rate in the last 24 hours without updating our game

Reproduction Steps

My game Apocalypse Rising 2 (Apocalypse Rising 2 - Roblox) has had a massive increase in client crash reports in the last 24 hours. We have not updated our game since the 6th of November, and we believe a Roblox change is causing the issue.

Our developer stats confirm the “can only play for about 15 minutes” user stories we’re receiving from our players:
[image: developer stats]

Our game has historically run with very high memory usage and has been prone to memory leaks in the past that have caused crashing. Usually this doesn’t happen to our players until the ~1 hour mark.

Expected Behavior

We expect players not to crash within 15 minutes of pressing Play.

Actual Behavior

Players are crashing within ~15 minutes of play due to out-of-memory issues. Users who have been able to record their sessions report “Out of memory” errors in the client log and a number of visual symptoms of memory exhaustion (invisible meshes, textures not loading, game features breaking).

Workaround

No known workaround, we cannot diagnose the issue ourselves.

Issue Area: Engine
Issue Type: Crashing
Impact: Critical
Frequency: Constantly
Date First Experienced: 2022-11-09 00:11:00 (-05:00)
Date Last Experienced: 2022-11-10 00:11:00 (-05:00)

22 Likes

At ~4:50PM EST we set EnableDynamicHeads and LoadCharacterLayeredClothing from Default to Disabled to see whether the crashing persists. Our game does not use either feature and we have no reason to assume they’re related to the crashing. I will update this post if we see any improvement.


Update @ 8:17PM EST: no improvement after setting EnableDynamicHeads and LoadCharacterLayeredClothing to Disabled.


We will be taking steps to reduce memory usage over the next 24 hours. We’re not sure how this will affect the issue on Roblox’s side. The version of the game that is crashing can be privately made available upon request.

7 Likes

Still no progress. Despite spending the day trying solutions ourselves, we cannot figure out the cause of the increased crash rate.

We’ve tried reverting to builds of the game from before the one that is currently crashing, and we still have the same issue.

We did not update our game, and then out of the blue it started crashing like crazy. We are seriously dead in the water and need support. Please help.

2 Likes

I have not seen this issue with any other game. What is reportedly taking up all the memory?

I work for a game called Pembroke Pines and we have had the exact same issue with a random increase in crash rates. We have been forced to lower the player count per server until we fix the issue.

3 Likes

Hi, we rolled back a change internally that was the root cause behind this problem. You should be good now!

6 Likes

When should we start seeing improvements? After ~19 hours we’re still getting a lot of crash reports and our average play time hasn’t recovered at all.

We’ll be testing the changes we’ve made to the game since posting this bug report to confirm whether the issue is on our end - I’ll update this post once we have results.


Update: We tested the build that was live when this bug report was made and we still get crashes after the ~13 minute mark. Running the same test with our updated production build gives us similar results.

It doesn’t appear that the rollback was effective in solving the issue for us. If anything it’s getting worse:
[image: developer stats]

1 Like

Hi, we’re taking another look into this - we fixed one issue that was causing memory growth but there’s still another memory leak.

8 Likes

Any updates? I’d really appreciate an estimate on when we can expect a fix.

This issue has ruined our Halloween event and I can’t afford to wait on a fix for much longer.

1 Like

We are still in progress - I hope to have narrowed down the specific change that caused the regression shortly. The instance/gui and instance/object counts are the culprits here.

1 Like

Thank you for the update! In the meantime is there anything we can do to mitigate the leaks in our game, or should we just be waiting for the fix?

I don’t believe we’ll be able to fix this before the weekend but I hope to have something to help you mitigate the problem today.

3 Likes

We found the root cause but it will require a patch to fix. What I’ve noticed is that the fewer players there are in a server, the longer it takes to crash. I’ll see if we can find a better strategy to use.

5 Likes

We have narrowed down the cause: this crash increase is caused by a memory leak that makes the client run out of memory. The leak happens when an Instance is destroyed from inside an event handler attached to one of its own Events.

We have found a temporary workaround: attaching and immediately disconnecting a dummy event handler from the same Event prior to destroying the Instance will prevent the leak.
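
In isolation, the pattern looks roughly like this (a sketch with placeholder names, not code taken from the demo place linked below):

someInstance.SomeEvent:Connect(function ()
	-- Workaround: connect and immediately disconnect a dummy handler on the SAME
	-- event this handler is attached to, just before destroying the instance.
	someInstance.SomeEvent:Connect(function () end):Disconnect()
	someInstance:Destroy()
end)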

Here is a minimal reproduction of the issue, along with a demo of the workaround: Event leak demo - Roblox

Here’s how to use it:

  1. Launch the place using either the PC or Mac client.
  2. Open the Micro Profiler: Roblox button → Settings → Micro Profiler → On
  3. In the Micro Profiler menu in the top left, hover over “Mode” and click “Counters”.
  4. Expand/watch the counter under memory/instance/gui. (You can view a graph by left-clicking twice on the right side of the “gui” row.)
  5. Click the “Leak 500” button to reproduce the problem. You’ll notice that the counter increases and does not decrease even after waiting a bit.
  6. Click the “Leak 500 w/ workaround” button to test the same code with the workaround inserted. Notice that the memory counter temporarily increases, but eventually decreases after waiting a bit.

Here’s what the Micro Profiler should look like after pressing the “Leak 500” button a few times:
[image: Micro Profiler counters after several “Leak 500” presses]

This place is copy-enabled, so you can open it in Studio to see what’s going on. Here’s the code that’s used for the “workaround” version:

local function leak2()
	local testButton = Instance.new("TextButton")

	-- These 500 connections are never called, they're just here to make the leak worse.
	for i=1, 500 do
		testButton.Changed:Connect(function ()
			print("Not printed")
		end)
	end
	
	-- This is where the leak actually happens.
	testButton.Changed:Connect(function ()
		print("Destroying testButton")
		-- Begin workaround
		-- Note that the Event used here (Changed) is the same as the one we're subscribed to.
		local c = testButton.Changed:Connect(function()
			assert(false, "Not reached")
		end)
		c:Disconnect()
		-- End workaround		
		testButton:Destroy() -- Without the above workaround, this would leak all event handlers.
		print("Button destroyed")
	end)
	
	-- Trigger testButton:Changed
	testButton.Text = "Delete me"
end

We are working on an engine change to resolve this leak, but expect it to take a while longer to finish.

9 Likes

While investigating this, we also found this to be the most efficient set of steps to reproduce this issue in your Apocalypse Rising 2 experience, @LMH_Hutch:

  1. Join an AR2 instance with at least 25 players. (Fewer players work too, but it seems to worsen with more players.)
  2. Open the Micro Profiler counters view and expand memory/instance/gui (so you can see when the issue is occurring).
  3. Hit “Join”.
  4. Run around in the world for a couple seconds.

After completing these steps, you can watch the memory/instance/gui graph continue to grow until the client eventually runs out of memory and crashes. Here’s what the graph looks like after standing idle for a while:
[image: Micro Profiler showing the leak occurring in AR2]

This issue also reproduces in Studio, so you can use Studio to test implementing the workaround.
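
If you want a rough signal without keeping the Micro Profiler open while testing in Studio, here is a sketch (not an official tool - it just logs values from the Stats service) that you could drop into a LocalScript while exercising the code path with and without the workaround:

local Stats = game:GetService("Stats")

-- Periodically print total client memory and the Instances category so runs with
-- and without the workaround can be compared side by side.
task.spawn(function ()
	while true do
		local totalMb = Stats:GetTotalMemoryUsageMb()
		local instancesMb = Stats:GetMemoryUsageMbForTag(Enum.DeveloperMemoryTag.Instances)
		print(string.format("Memory - total: %.1f MB, instances: %.1f MB", totalMb, instancesMb))
		task.wait(5)
	end
end)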

8 Likes

Awesome, thank you for the insights. We’ll definitely be giving this workaround a try and paying more attention to event connections.

Can we still expect a patch to solve this, or is it mostly in our hands now?

2 Likes

We will be patching the engine to resolve this leak, but we expect the patch to take a while to be released. This workaround is only intended to be temporary and can be removed once the engine is patched.

4 Likes

Could we get clarification on whether this is a gui-specific memory leak, or whether we should be applying this workaround to other object types too?

Also, is the workaround event-specific, or is connecting and disconnecting only .Changed enough?

1 Like

I believe it applies to all object types. It’s just more prevalent with gui objects in your Experience.

4 Likes

This issue happens with all object types and all events. It happens when any event handler destroys its parent object.

The event you connect/disconnect from in the workaround should be the same event whose handler is about to destroy the object. You don’t need to connect/disconnect every event on the object - just the one you’re about to destroy it from. For example:

local button = script.Parent -- assume it's a TextButton

button.Changed:Connect(function ()
	print("Button changed") -- doesn't destroy the button; no workaround needed.
end)

button.MouseButton1Click:Connect(function ()
	print("Clicked")
	button:Destroy() -- this handler destroys its parent, so it needs to have the workaround
end)

And in this case, since the MouseButton1Click Event is the one calling Destroy(), the workaround would look like this:

-- Edited version of the above event handler to apply the workaround.
button.MouseButton1Click:Connect(function ()
	print("Clicked")
	-- Connect to the same event as this handler.
	local connection = button.MouseButton1Click:Connect(function ()
		-- This function isn't ever called, you don't really need anything here.
		assert(false, "Not reached")
	end)
	connection:Disconnect()
	button:Destroy()
end)

You can even make the workaround a one-liner if you want:

-- Shorter edited version.
button.MouseButton1Click:Connect(function ()
	print("Clicked")
	button.MouseButton1Click:Connect(function () end):Disconnect() -- one-liner version
	button:Destroy()
end)

Even though we also use the Changed event on this button, we don’t need to apply the workaround to it because it doesn’t destroy the button.
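
To make the “all object types” point concrete, here’s a sketch of the same pattern applied to a non-gui object - a Part that destroys itself from its own Touched handler (the part name here is hypothetical):

local part = workspace:WaitForChild("TrapPart") -- hypothetical Part somewhere in the Workspace

part.Touched:Connect(function (hit)
	print(hit.Name .. " touched " .. part.Name)
	-- Same workaround, applied to Touched because this handler is the one calling Destroy().
	part.Touched:Connect(function () end):Disconnect()
	part:Destroy()
end)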

5 Likes