Finding the elusive cause of an inconsistent client-side crash

#1

About a month ago, Shard Seekers started crashing, freezing, or throwing “bad allocation” errors at random during play-solo. Sometimes I can go an entire day without a crash, so it is very inconsistent and difficult to repro. It’s also been crashing on the live game at a similar rate:


Sometimes the client freezes indefinitely with no error messages. I assumed one of my 1500 ModuleScripts might have an edge case that causes something to recurse or loop infinitely, so I wrote a script that automatically transforms the Lua source of every script in my game right as play-solo starts. It reads tokens and inserts a test around every block and statement, so that if the client freezes or memory usage skyrockets, it will throw errors and I will know about it:
image

I continued to develop normally with this active and the crash continued to occur.
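
For readers who want to try something similar, here is a minimal sketch of the kind of test such a pass could insert before each statement. It is not the actual instrumentation used in Shard Seekers: the `Guard` table, its method names, and the limits are all hypothetical, and it assumes only `os.clock` and `collectgarbage("count")` are available.

```lua
-- Hypothetical guard; an instrumenting pass could insert a call to
-- guard:check(...) before every statement it finds.
local Guard = {}
Guard.__index = Guard

function Guard.new(maxSeconds, maxKilobytes)
	return setmetatable({
		maxSeconds = maxSeconds,     -- longest allowed stretch without a reset
		maxKilobytes = maxKilobytes, -- Lua heap limit, in KB
		startedAt = os.clock(),
	}, Guard)
end

-- Call whenever the script yields or a new event handler starts, so only
-- uninterrupted stretches of execution count against the time budget.
function Guard:reset()
	self.startedAt = os.clock()
end

function Guard:check(scriptName, line)
	if os.clock() - self.startedAt > self.maxSeconds then
		error(("%s:%d ran for more than %.2f s without yielding")
			:format(scriptName, line, self.maxSeconds), 2)
	end
	if collectgarbage("count") > self.maxKilobytes then
		error(("%s:%d pushed Lua memory past %d KB")
			:format(scriptName, line, self.maxKilobytes), 2)
	end
end

-- Usage: a loop that would otherwise spin forever now raises an error.
local guard = Guard.new(0.05, 1024 * 1024)
local ok, message = pcall(function()
	local i = 0
	while true do
		guard:check("Example", 3)
		i = i + 1
	end
end)
print(ok, message)
```

A guard like this only catches runaway Lua code. If the freeze happens inside the engine, no check ever runs, which would be consistent with the crash continuing while the instrumentation was active.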

I’ve spent hours digging through the Roblox log dumps for answers. The crash is often preceded by “bad allocation” and other memory errors, then a delay, then the crash:
image

image

image

I even wrapped the entire Roblox API (mainly Instances and RBXScriptSignals) in tables so I can monitor what’s happening and dump a log into the output once a predictable error message appears. I can also throw errors when deprecated members are used, or when a script attempts to interact with a destroyed instance.
I’ve been wanting to do this for a while, and it’s generally useful when testing. It just requires replacing the Instance, game, workspace, script, typeof, type, and require globals with something that wraps and unwraps the values. :grinning:
I can visualize CFrame changes just by writing this:



More importantly I can write all recent API calls to a log, so I can see if a specific API is causing the crash:
image
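
A rough sketch of how wrapping plus a rolling log can work, for anyone curious. This is an illustration only, not the wrapper from the game: `wrap`, `record`, `dumpLog`, and `LOG_SIZE` are made-up names, and a plain table stands in for a real Instance so the snippet runs outside Roblox.

```lua
local LOG_SIZE = 256          -- hypothetical ring-buffer capacity
local log, nextSlot = {}, 1

-- Overwrite the oldest entry once the buffer is full.
local function record(entry)
	log[nextSlot] = entry
	nextSlot = nextSlot % LOG_SIZE + 1
end

-- Return the logged entries, oldest first.
local function dumpLog()
	local lines = {}
	for offset = 0, LOG_SIZE - 1 do
		local entry = log[(nextSlot - 1 + offset) % LOG_SIZE + 1]
		if entry then
			lines[#lines + 1] = entry
		end
	end
	return lines
end

-- Wrap `real` in a proxy table that logs every read, write, and method call.
local function wrap(real, name)
	local proxy = {}
	return setmetatable(proxy, {
		__index = function(_, key)
			record(name .. "." .. tostring(key) .. " (read)")
			local value = real[key]
			if type(value) == "function" then
				return function(self, ...)
					record(name .. ":" .. tostring(key) .. "()")
					-- Methods are called as proxy:Method(...), so unwrap self.
					if self == proxy then
						self = real
					end
					return value(self, ...)
				end
			end
			return value
		end,
		__newindex = function(_, key, value)
			record(name .. "." .. tostring(key) .. " = " .. tostring(value))
			real[key] = value
		end,
	})
end

-- Usage with a stand-in object.
local fakePart = { Name = "Part", Transparency = 0 }
function fakePart:GetFullName()
	return "Workspace." .. self.Name
end

local part = wrap(fakePart, "Part")
part.Transparency = 0.5
local fullName = part:GetFullName()
for _, line in ipairs(dumpLog()) do
	print(line)
end
```

A full version would also have to wrap return values and unwrap arguments, which is why the globals listed above need replacing; the sketch skips that part.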

I can also check how many instances are currently being referenced, and everything cleans up normally, so I don’t think it’s caused by an instance memory leak.
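
One way a reference count like that could be kept is with weak-keyed tables, sketched below under a few assumptions: the function names are hypothetical, plain tables stand in for wrapped instances, and the forced `collectgarbage("collect")` is plain Lua only (as far as I know, Roblox only allows the "count" option, so there the count would simply lag behind the collector).

```lua
-- Weak keys: these tables never keep a wrapper alive on their own.
local tracked = setmetatable({}, { __mode = "k" })
local destroyed = setmetatable({}, { __mode = "k" })

local function track(wrapper)
	tracked[wrapper] = true
	return wrapper
end

-- Flag a wrapper so later use can be reported as an error.
local function markDestroyed(wrapper)
	destroyed[wrapper] = true
end

local function assertUsable(wrapper)
	if destroyed[wrapper] then
		error("attempt to use a destroyed instance", 2)
	end
end

-- Count the wrappers that something still references.
local function countTracked()
	local count = 0
	for _ in pairs(tracked) do
		count = count + 1
	end
	return count
end

-- Usage: one wrapper stays referenced, one is dropped immediately.
local kept = track({ Name = "Kept" })
local function spawnTemporary()
	track({ Name = "Temporary" })
end
spawnTemporary()

collectgarbage("collect") -- plain Lua only; see the note above
print(countTracked())

markDestroyed(kept)
print(pcall(assertUsable, kept))
```

If the count keeps climbing across play sessions, something is holding references; if it settles, as reported here, a Lua-side instance leak becomes less likely.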

Even with this I can’t seem to get anything meaningful, so I decided to simply take a look at what code I changed before the crash started happening. The first crash was on April 26 at 7:28 pm.

I wrote something that finds differences between place files, and there were a few dozen scripts that were mostly refactored. Mainly it seems I replaced the invisible anchored dummy character with player.ReplicationFocus. I replicate all character physics in Lua, so it made sense for me to make the switch and set it myself.
image

Over a year ago I had similar confusing crashes when using workspace.StreamingEnabled with Custom Characters. I tweaked how I was spawning and cframing dummy characters and the crashes stopped.

Right now Shard Seekers has streaming enabled set to false because terrain would often never load further than the character’s immediate area, and being able to see under the map is just a bad experience.

The server puts a local map with NPCs in workspace.CurrentCamera so it can run physics locally. I designed my game to be very scalable, and to do this I need to replicate everything manually on a need-to-know basis. The server reuses the same character physics and map code that the client uses, so locally simulated server parts are essential. The crash happens in play-solo, so I’m not sure if this is the cause, but it’s something to note. The ReplicationFocus parts are also located in the server’s local camera, because they don’t work when parented to nil and it doesn’t make sense to needlessly replicate them to everyone.

I think it’s most likely related to terrain loading/streaming with respect to ReplicationFocus, but Shard Seekers is a huge project and it could still be anything.

I’ve been working on Shard Seekers full-time for over 2 years and I’m not sure what to do. I haven’t made a post until now because I haven’t been able to find a repro, but I’ve been getting lots of complaints recently :persevere:


#2

What is your studio’s memory usage when idling in play solo?

I hit lots of bad allocation errors if I exceed Studio’s memory limit (4 GB).


#3

It’s usually around 700 MB.


#4

When the crash ends up happening, do you get to 3.9-4 GB of memory usage on Roblox? It sounds to me like a table is rapidly growing without being cleared, or a set of instances is being created rapidly and not getting destroyed.
Edit: Also, what is your VRAM usage? Noticed there was a VRAM allocation error as well.


#5

My first impression when looking at this report was that it was an issue running out of memory, but after looking at a few uploaded crashes the clients had normal amounts of RAM usage reported (does not look like a standard out-of-memory situation). We are looking a bit further.


#6

The test function checks memory usage at every line in the script using collectgarbage("count") and will error if Lua memory goes too high at any point. According to collectgarbage, the memory used by Lua hovers around 60 MB (and that is while storing a huge log of recent API accesses, with every instance and connection wrapped in a table for the extra tests I’m doing to get info on the crash). When I “compile” the game, debug code is removed and the Lua source is simplified, so it’s even less when live.

When I check memory usage after the crash happens it seems pretty normal.

Using a weak table I can keep track of references to wrapped instances. I made instances become “Toxic” once they’re destroyed, and it seems everything cleans up normally.
I have it set up to do this if it comes across similar errors again:


#7

Hmm… I can tell just by looking at the dragons in your game that there might be a problem with the number of parts in the game.

What is the total number of parts currently in your game?


#8

The dragon definitely uses more parts than it needs (each tooth is an individual mesh lol), but it only loads nearby character graphics, and they utilize a level-of-detail system so extra details aren’t loaded until players are very close:

It wasn’t an issue before so I think it’s okay.

The map uses a spatial partitioning level-of-detail system, so it varies (and is even less on mobile):


#9

It looks well done and realistic. There is also another theory that what lags games may not necessarily be the bricks but the scripts; too many of them can slow down a player’s game or crash it altogether.

Do you have a lot of normal loose scripts in your workspace and serverscriptstorage?


#10

Thanks :slightly_smiling_face: There are 1 script and 2 local scripts (the second one is for displaying “the game is updating” messages in case the first one breaks), and everything uses signals so scripts only run when they need to. I have a system for connecting to varying changes in the camera’s position, so faraway trees may only update their quality when the camera moves 512 studs, for example. I avoid ‘while true do wait()’ loops at all costs, so there’s a lot of Lua activity, but it’s reasonable.
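
To make the camera idea concrete, here is a small sketch of distance-thresholded callbacks. It only illustrates the pattern and is not the game’s code: `onCameraMoved` and `updateCamera` are hypothetical names, and positions are plain `{x, y, z}` tables rather than Vector3 values so it runs in plain Lua.

```lua
local listeners = {}

-- Register a callback that fires when the camera has moved at least
-- `threshold` studs since the last time this callback fired.
local function onCameraMoved(threshold, callback)
	listeners[#listeners + 1] = { threshold = threshold, callback = callback, last = nil }
end

local function distance(a, b)
	local dx, dy, dz = a.x - b.x, a.y - b.y, a.z - b.z
	return math.sqrt(dx * dx + dy * dy + dz * dz)
end

-- Call once per camera change; each listener compares against the
-- position it last reacted to, so slow drift still accumulates.
local function updateCamera(position)
	for _, listener in ipairs(listeners) do
		if listener.last == nil or distance(position, listener.last) >= listener.threshold then
			listener.last = position
			listener.callback(position)
		end
	end
end

-- Usage: distant trees refresh every 512 studs, nearby detail every 64.
local treeUpdates, detailUpdates = 0, 0
onCameraMoved(512, function() treeUpdates = treeUpdates + 1 end)
onCameraMoved(64, function() detailUpdates = detailUpdates + 1 end)

for _, x in ipairs({ 0, 100, 600, 700, 1200 }) do
	updateCamera({ x = x, y = 0, z = 0 })
end
print(treeUpdates, detailUpdates)
```

Every listener fires once on the first update so it has a starting position; after that the expensive far-distance work runs far less often than the cheap near-distance work.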


#11

I think the combination of all your bricks and a reasonable number of scripts, on top of a 100-player server, would crash anyone without a good enough PC.

As a starter solution I would try to decrease your game’s player count to around 30-40 players and see how it goes.


#12

It’s set to a max of 100 players, but it’s also set to 30 preferred so it usually doesn’t get higher than that. The game only replicates nearby characters so it’s as if there are a few dozen at most. It also crashes in studio while I’m the only player and CPU usage isn’t too high, so it’s probably an edge-case in the engine. Thankfully it isn’t really a problem with lag.


#13

Well I sincerely hope your issue gets fixed. Good luck and keep on doing what you’re doing.


#14

Thanks! I’ve considered rolling back to before it started crashing but I’ve done so much in the past month.


#15

I know how you feel. I recently did that for my Viking game. I’d say make a copy of your game in its current state, then revert the main game. Then slowly start adding some of your new assets back, checking whether they interfere with the game or work just fine.


#16

I save-as every time I save, so I have a log of all the edits I’ve done. I can do this, but the problem is that I can’t be sure whether I’m adding back what caused it, because it happens so rarely. It is an engine-level crash, so hopefully an engineer can track down the cause or point me in the right direction.


#17

Found a repro thanks to @LordRugdumph :smiley:


#18

Through experimenting more with this code snippet, I’ve found this modification is enough to inconsistently crash studio.

I also caught this in my output one time it didn’t outright crash, but upon ending the simulation Studio crashed.

image



Smells like memory corruption. I better stop playing with this.


#19

I ran the script you had there about 15 times or so in the command bar. Crash logs had some interesting values. (Edit for others: It crashed after those runs)

The INI file had this entry, which really stuck out: UsedMemory=431435776


#20

I can’t help but notice a bug in the CameraScript:

local ClickToMove = FFlagUserNoCameraClickToMove and nil or require(script:WaitForChild('ClickToMove'))()

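Lua will end up evaluating the expression to "require(script:WaitForChild('ClickToMove'))()" no matter what the value of FFlagUserNoCameraClickToMove happens to be: when the flag is true, "FFlagUserNoCameraClickToMove and nil" yields nil, which is falsy, so the "or" branch runs anyway.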
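
A quick way to see this in plain Lua. In this sketch `loadClickToMove` is a hypothetical stand-in for the require call, and `fixed` is just one possible rewrite, not necessarily what the CameraScript authors intended.

```lua
-- Stand-in for require(script:WaitForChild('ClickToMove'))().
local loadCount = 0
local function loadClickToMove()
	loadCount = loadCount + 1
	return "ClickToMove module"
end

-- Same shape as the CameraScript line: `flag and nil` is never truthy,
-- so the `or` operand is always evaluated.
local function buggy(flag)
	return flag and nil or loadClickToMove()
end

-- One possible fix: branch explicitly instead of using and/or.
local function fixed(flag)
	if flag then
		return nil
	end
	return loadClickToMove()
end

print(buggy(true), buggy(false)) -- both load the module
print(fixed(true), fixed(false)) -- only the second call loads it
print(loadCount)
```

The `a and b or c` idiom only behaves like a ternary when `b` can never be nil or false, so it cannot be used to select nil.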