My 5-year-old open-world RPG (3M+ visits) is failing due to server lag and crashes over time

My game Dystovia has a huge issue with server-side lag that builds up after hours of uptime, which makes it REALLY hard to test any single change. I’ve tried everything in my scripts and in the game to solve this and I can’t find a solution. These are some server-side stats showing how much “Navigation” and “Physics” take up:




From my reading, Navigation is (and should be) mostly related to pathfinding. Through stress testing I’ve found that the engine itself adds a few MB of navigation memory simply from calling Path:ComputeAsync, and that memory does not get garbage collected, even after fully destroying the path object, IF you make a lot of complicated calls (the game world is huge). The topic where I found this says that at around 2-3 GB of memory the engine garbage collects a whole chunk and that the growth is merely visual, but from my findings the game does get laggier over time and the server does not hold up.

For more info:

Details about the game and the bug:

  • Mobs are spawned and despawned based on whether a player is in range. While waiting for a player, each spawn point checks every 4-5 seconds and can then spawn its mob.
  • LuaHeap, Scripts, etc. do not increase significantly in memory.
  • The number of models and the number of mobs do not increase over time in the game.
    (Therefore, from my understanding, if nothing is being kept in the heap and nothing extra is found in Workspace, I don’t know how there could be more pathfinding calls.)
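
The range check described in the first bullet could be sketched roughly as below. This is not the game's actual code: `anyPlayerInRange` and its parameters are hypothetical names, and the positions only need `X`, `Y` and `Z` fields (a Roblox `Vector3` has them).

```lua
-- Sketch: decide whether a spawn point should be active.
-- `playerPositions` is an array of positions, `range` is in studs.
local function anyPlayerInRange(playerPositions, spawnPosition, range)
	local rangeSquared = range * range
	for _, position in ipairs(playerPositions) do
		local dx = position.X - spawnPosition.X
		local dy = position.Y - spawnPosition.Y
		local dz = position.Z - spawnPosition.Z
		-- Compare squared distances to avoid a square root per player.
		if dx * dx + dy * dy + dz * dz <= rangeSquared then
			return true
		end
	end
	return false
end
```

On Roblox this would be called from a loop that waits 4-5 seconds between checks, spawning the mob when the result turns true and despawning it when it turns false.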

Here are the essential parts/pseudocode of the pathfinding code for mobs, including everywhere the path object is used in the function:

local function PathFind(Params,Humanoid,Origin,Target)
	local Path = PathfindingService:CreatePath({Params})
	local Success, ErrorMessage = pcall(function()
		Path:ComputeAsync(Origin.Position,Target.Position)
	end)
	if Success and Path.Status == Enum.PathStatus.Success then
		local WaypointList = Path:GetWaypoints()
		for PathNum,PathPoint in ipairs(WaypointList) do
			local PathPointDist = (PathPoint.Position - Target.Position).Magnitude
			if Conditions then
				if Conditions0 then -- placeholder: this inner check was left out of the post
					break
				end
				Humanoid:MoveTo(PathPoint.Position)
				repeat
					if Conditions then Path:Destroy() return end
					NewMob.PathCounter += 1
					if NewMob.PathCounter > MaxPathTries or Conditions2 then
						Path:Destroy()
						return true
					end
				until Conditions3
			else
				break
			end
		end
	else
		if Conditions4 then Path:Destroy() return end
		repeat
			Tries += 1
			if Conditions5 then
				break
			end
			if Conditions6 then
				Path:Destroy()
				CallFunction1()
			else
				CallFunction2()
			end
			if Conditions7 then
				Success, ErrorMessage = pcall(function()
					Path:ComputeAsync(MyRoot.Position,PlayerTarget.Position)
				end)
				Path:Destroy()
			end
			RunService.Heartbeat:Wait()
		until Conditions8 or Tries >= 3
	end
	if Path then
		Path:Destroy()
	end
end

So what approach should I take to fixing this?

All I can tell from these stats is that System and Navigation use the most memory on average. The analytics are nice to use sometimes when viewing overall game health, but they’re only partially useful for isolating problems in specific servers. How many servers are “bad” and how many are “good” in that graph? We don’t know.

Roblox’s documentation for the pathfinding system is horrible. They don’t specify how it should be used. We also don’t know what’s going on under the hood. I think there is some amount of caching or they pushed an update that broke pathfinding (this isn’t the first time).

Can you provide the following information:

  1. When did it start happening?
  2. What do you mean by lag? Are your players experiencing high ping? Low FPS? Are their devices/clients crashing? Is the server crashing? Is the server shutting down on players?
  3. Have you actually entered a “laggy” server to gather data?
  4. Have you done a server-side microprofile in a “laggy” server?
  5. Have you taken a snapshot of the Luau heap in a “laggy” server?

Looks like a memory leak somewhere. You should go through the scripts on one of those laggy servers using the Luau heap to see what is really going on.

EDIT: I forgot to add that I’ve never seen a game that uses Roblox’s built-in pathfinding for its AI. You can tell by looking at them: they usually use custom pathfinding or simple lerping. You should consider writing your own pathfinding with raycasts and algorithms, so it depends 100% on your own knowledge.

Noticeably, old servers have mobs that don’t move, simply because their AI is throttled so much that it runs slower: it goes from very responsive, instantaneous reactions to the player to waiting 10 seconds before performing any action whatsoever.

I can maybe provide graphs for what happens in game. Just know that navigation memory ticks up to 10x its original usage even with a super high amount of mobs active (the amount of active mobs varies), and the server performance shown in the microprofiler goes down with time. I can’t test this properly, since you have to play in a server for 3+ hours with others before the server noticeably tanks in performance.

  1. It began a long time ago, but I thought it was related to other stuff, so it may date back even a year, 6 months for sure.
  2. High ping. FPS isn’t impacted as much, and the client is smooth.
  3. Yep, and it says navigation is high and physics is high, but a lot of the data is confusing and leads in different directions. The only clear thing seems to be navigation.
  4. & 5. Yes, and through that I’ve fixed potential memory leaks and found out that the “model” count doesn’t go up, i.e. the number of models/mobs in game doesn’t increase with time.

The Luau heap is SUPER low, and how would I figure out what the Luau heap is spent on?

Pathfinding is tricky because I am using Roblox’s terrain, so any other pathfinding algorithm I come up with will 100% be less efficient in cost, worse at navigating, and take a lot of time to build. If this is the only solution then I guess I’ll do it, but navigating Roblox’s navmesh isn’t the easiest.

Here is some more data for a server that has only been up for a few hours (the server lag isn’t that bad yet):







Here is a microprofiler for a really old server (Super varying things that throttle from microprofiler to microprofiler??)



Here is data from a new server (it is a bit better than if there were more players but not by a huge amount)









In case you don’t know, the Luau heap view is a graph that shows every script and so on, so you can determine by comparison what is lagging.

Isn’t a snapshot like I’ve shown in my latest reply?

Ok, so high ping mostly. I can see that your server’s tick rate is dropping. You want to keep that under 16ms per “tick” or “frame”. The image below shows a healthy server.

Your laggy server looks like it takes anywhere between 24-51ms to complete a frame, we need to figure out why it’s doing that.
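
To watch this number without opening the microprofiler each time, a small tracker along these lines might help. It is only a sketch: `newFrameTracker`, `attach` and `FRAME_BUDGET` are made-up names, and the 0.016 s budget is just the ~16 ms figure mentioned above.

```lua
-- Sketch: count how often a server frame runs over its time budget.
local FRAME_BUDGET = 0.016 -- seconds; assumption taken from the ~16 ms target

local function newFrameTracker(budget)
	budget = budget or FRAME_BUDGET
	local tracker = { frames = 0, slowFrames = 0, worst = 0 }

	-- Record one frame's delta time in seconds.
	function tracker.record(dt)
		tracker.frames = tracker.frames + 1
		if dt > budget then
			tracker.slowFrames = tracker.slowFrames + 1
		end
		if dt > tracker.worst then
			tracker.worst = dt
		end
	end

	-- Fraction of recorded frames that ran over budget (0 when empty).
	function tracker.slowFraction()
		if tracker.frames == 0 then
			return 0
		end
		return tracker.slowFrames / tracker.frames
	end

	return tracker
end

-- Feeds the tracker from a Heartbeat-style signal that passes delta time.
local function attach(tracker, heartbeatSignal)
	return heartbeatSignal:Connect(function(dt)
		tracker.record(dt)
	end)
end
```

On a Roblox server the signal to pass would be `game:GetService("RunService").Heartbeat`. Printing `tracker.worst` and `tracker.slowFraction()` every minute or so would show whether frame times creep up as the server ages.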

Looks like major contributors are this “sleep” task and marshalling. Unfortunately there’s not much information to go beyond this. Some of the task names aren’t publicly documented.

Here’s what I recommend you try:

1. Instead of creating a new path every time, create 1 path at the start.

Updated

local Path = PathfindingService:CreatePath({Params})
local function PathFind(Params,Humanoid,Origin,Target)
	local Success, ErrorMessage = pcall(function()
		Path:ComputeAsync(Origin.Position,Target.Position)
	end)

I’m 50% sure this should fix your system.
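
A slightly fuller sketch of the same idea, with one reusable Path per mob rather than one shared by every mob. `ComputeAsync` yields, so a Path shared between mobs that pathfind at the same time could, as far as I can tell, have its waypoints overwritten between calls. `newMobPathfinder` and `agentParams` are hypothetical names; the service is passed in as a parameter.

```lua
-- Sketch: create the Path once per mob and reuse it for every compute.
local function newMobPathfinder(pathfindingService, agentParams)
	-- Created once; every later compute call reuses this object.
	local path = pathfindingService:CreatePath(agentParams)

	-- Returns the waypoint list, or nil when no path was found.
	local function compute(originPosition, targetPosition)
		local ok = pcall(function()
			path:ComputeAsync(originPosition, targetPosition)
		end)
		if ok and path.Status == Enum.PathStatus.Success then
			return path:GetWaypoints()
		end
		return nil
	end

	-- Call this once, when the mob despawns.
	local function destroy()
		path:Destroy()
	end

	return compute, destroy
end
```

On Roblox, `pathfindingService` would be `game:GetService("PathfindingService")`. Each mob calls `newMobPathfinder` when it spawns, keeps `compute` for its whole lifetime, and calls `destroy` on despawn.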

2. Insert a lot of microprofile markers.

As an example I will show you my recent debugging experience. I got vague markers of what the process is doing, so I started injecting a lot of custom descriptive markers to my scripts and let it run until the issue pops up. Then I was able to diagnose the problem. See debug.profileend() and debug.profilebegin()
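
A minimal sketch of what such a marker can look like, assuming the wrapped work does not yield (as far as I know a label should be opened and closed within the same resumption of the script). `profiled` is a made-up helper name; `debug.profilebegin` and `debug.profileend` are the Roblox functions mentioned above.

```lua
-- Sketch: run `work` inside a named microprofiler scope.
-- Only the first return value of `work` is passed through.
local function profiled(label, work, ...)
	debug.profilebegin(label)
	-- pcall so the scope is always closed, even if `work` errors.
	local ok, result = pcall(work, ...)
	debug.profileend()
	if not ok then
		error(result, 0)
	end
	return result
end
```

A call such as `profiled("MobAI:ChooseTarget", chooseTarget, mob)` (both names hypothetical) would then show up under that label in the server microprofiler. Since each call opens and closes its own scope, this should also be usable from a ModuleScript that many mobs run.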

3. Focus on reliably replicating the problem.

You can’t really fix stuff unless you know exactly what triggers it. Do everything in your power to get it to behave like the old servers. Once you do that, try different fixes until you cannot trigger it anymore.
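
For the navigation memory specifically, one possible way to attempt a reproduction in a test place is to hammer ComputeAsync and compare the Navigation memory tag before and after. The sketch below assumes that approach; `measureNavigationGrowth`, `readNavigationMb` and `pointPairs` are hypothetical names.

```lua
-- Sketch: compute many paths and report how much navigation memory grew.
-- `readNavigationMb` is a function returning the current usage in MB.
-- `pointPairs` is an array of { startPosition, endPosition } pairs.
local function measureNavigationGrowth(pathfindingService, readNavigationMb, pointPairs)
	local before = readNavigationMb()
	for _, pair in ipairs(pointPairs) do
		local path = pathfindingService:CreatePath()
		pcall(function()
			path:ComputeAsync(pair[1], pair[2])
		end)
		path:Destroy()
	end
	return readNavigationMb() - before
end

-- Possible wiring on a Roblox server:
-- local Stats = game:GetService("Stats")
-- local growth = measureNavigationGrowth(
-- 	game:GetService("PathfindingService"),
-- 	function()
-- 		return Stats:GetMemoryUsageMbForTag(Enum.DeveloperMemoryTag.Navigation)
-- 	end,
-- 	pointPairs
-- )
```

If the returned growth keeps climbing across repeated runs with long paths, that would at least give a trigger that can be reproduced in minutes instead of hours.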

You did try doing it but I’m skeptical of it. Memory problems usually lead to the server crashing or the client crashing. You said people were having ping issues and the microprofiler is showing long CPU processing times. This is a different issue. Find the problem or ask your community for help.

Use the graph view to see every script and compare it to other snapshots over time.

I’m afraid this is the closest to a solution, I will test the path thing first and let you know my findings. The annoying part is that through everything I’ve tested it points in 10 different directions making me think I solved the issue, while only doing a small server optimization. I’ve tried to make it consistently happen first but I cannot replicate it within 10 minutes so far.

I have. The graph view is bugged and links to the wrong things, so it doesn’t help at all. For example, it will say “Deer mobrunner 2”, which just calls their AI; all it really tells me is that it is related to mobs.

Why do you think it’s broken? I use it often and it helps a lot. I also did some research, and the pathfinding service is broken… so there are a few possible options:

  • Remove pathfinding and simply have mobs walk to the player with a little help from raycasts
  • Shut down the server every few hours to clean it up
  • Make your RPG round-based (many games that use pathfinding only come in this format); I don’t recommend that though
  • Create your own custom pathfinding (it can be controlled but is performance-heavy)

It may not be a “bug”, but this data helps me 0%. I’ve looked through it all and it doesn’t help me solve anything. Take a look at it.

  1. The game relies on complicated AI, I cannot just remove pathfinding for this type of game.
  2. I could make some other system, but again, that isn’t trivial for navigating Roblox’s navmesh, and in the end it would probably create more frustration than it solves. There is also the chance that the issue isn’t even related to pathfinding, in which case I’ve wasted time on a system that makes the AI worse and doesn’t solve the issue.
  3. I already shut down servers.
  4. The game is not turn based nor can be, you should try it out to get an idea - this is not feasible.
  5. Addressed this earlier

See, Roblox’s pathfinding, as you know, doesn’t garbage collect its data, so there is no other option. You can see many devs talking about this on the dev forum too. Sadly, the only thing we can do is wait for Roblox to fix it, or you need to replace Roblox’s pathfinding with something else. Anyway, I wish you good luck with that.

Try this first, there’s a good chance it’ll fix it.

The whole memory debugging thing is the hardest route to take out of all of your choices. Take my advice and focus on placing debug markers for the microprofiler. There’s a good chance that if you fix the microprofiler problem you’ll also fix the memory problem, because they’re linked.

If you give up, try a substitution. Your code is really easy to sub out with SimplePath. I recommend this one. It’s an open-source module that has been battle-tested in real games and used by multiple developers, including myself.

Some more details in the process of debugging.

I have a lot of unique references that are being saved for players, referencing this:


Again upvalue in this spot:


I cannot call debug.profilebegin and debug.profileend because the script is a ModuleScript being run by several mobs simultaneously. Instead, I’ll try deactivating pathing to see if that fixes it. Any change I make can be checked ~8-12 hours later in the main game, once a server has been up for a while. I’ve also tried a few other small changes/fixes related to what I’ve seen, but I’m not expecting much. There are 2 major things the microprofiler and other stats point to: Navigation and Humanoids (either the amount or something related to animation).

I’ve now disabled pathfinding and that wasn’t the issue: server lag still accumulates over time. Memory is reduced, but that didn’t have an impact on the bigger picture:


What takes the longest is “Thread”, and I have no idea what it is.

It might actually be better. I was jumping the gun a bit: this different issue is not as severe and the game might actually be fine. The problem now is finding a complete replacement for pathfinding, as using the built-in one will not work…