Roblox servers becoming unresponsive to all in-game clients

Speaking from experience: the last time this happened while I was recording microprofiles, the page never confirmed that the microprofile was taken. The “Start” button stayed locked for minutes, and the profile was not added to my logs folder until after the server recovered.

Here is the microprofile that was added to my folder after the server had recovered from the lag spike.
log_F2DA9_microprofile_20200525-165432.html (405.2 KB)

1 Like

Your game has a ‘Spread’ script, so I would consider removing that.
Be sure to check through any models you use!

4 Likes

Thank you for communicating with us about this issue. To give you more information on the question you asked, I recorded a video during one of our lag spikes.

In addition to my other response, here is a summary of this video, in which I recorded two microprofiles. The first, requested at approximately 4:49 PM CST, was added to my logs folder at 4:54 PM CST, when the server began to recover. The second was requested at 4:54 PM CST and was added to my folder at 4:55 PM CST, the exact moment the server returned to normal. You can see this occurring in the video below.

And here are the two microprofiles that it recorded.
log_ACD3E_microprofile_20200527-165443.html (385.2 KB)
log_ACD3E_microprofile_20200527-165522.html (374.2 KB)

3 Likes

About a week or two ago, Roblox made some update that has caused a lot of lag spikes I haven’t seen before either. The most common pattern I see with people having this issue is that Simulation, Run Job, and Heartbeat are usually the biggest bars. I have yet to hear of a fix or any investigation being done.

Here is my latest dump of server microprofiles.

All I notice is a spike in Replicator SendData and MegaJobs when this issue appears. I’ll keep spamming the microprofiler ‘Start’ button whenever it happens again.

log_F14EC_microprofile_20200603-164002.html (372.6 KB)
log_F14EC_microprofile_20200603-164406.html (829.4 KB)
log_F14EC_microprofile_20200603-164352.html (389.6 KB)
log_F14EC_microprofile_20200603-164213.html (388.3 KB)
log_F14EC_microprofile_20200603-164209.html (392.1 KB)
log_F14EC_microprofile_20200603-164139.html (393.0 KB)
log_F14EC_microprofile_20200603-164127.html (403.6 KB)
log_F14EC_microprofile_20200603-164121.html (406.4 KB)
log_F14EC_microprofile_20200603-164116.html (358.1 KB)

Here is another video clip showing exactly what’s been happening.

https://streamable.com/tpj52p

Edit:
Here are some more microprofile dumps of another long server frame, taken after I changed the ‘Frames per second’ setting, etc.

log_F14EC_microprofile_20200603-165332.html (731.6 KB)
log_F14EC_microprofile_20200603-165654.html (774.7 KB)
log_F14EC_microprofile_20200603-165612.html (788.0 KB)
log_F14EC_microprofile_20200603-165606.html (822.8 KB)
log_F14EC_microprofile_20200603-165556.html (738.3 KB)
log_F14EC_microprofile_20200603-165526.html (696.9 KB)
log_F14EC_microprofile_20200603-165420.html (684.0 KB)

I’m noticing quite frequent chunks of these taking up a long frame.

Is this something to worry about? I do notice that TimeScript will need optimization, and I’ll work on that ASAP.

Here are a few server MicroProfiles that may need a look.

log_27891_microprofile_20200604-172040.html (3.3 MB)
log_27891_microprofile_20200604-171940.html (4.7 MB)
log_27891_microprofile_20200604-171932.html (4.5 MB)
log_27891_microprofile_20200604-171903.html (4.3 MB)

ServerJoinSnapshot?
log_27891_microprofile_20200604-171817.html (3.4 MB)
log_27891_microprofile_20200604-171759.html (4.2 MB)
log_27891_microprofile_20200604-161958.html (1.7 MB)

PhysicsSteppedSpike?
log_27891_microprofile_20200604-171656.html (7.4 MB)

Server with lots of PhysicsStepped that goes back to being a ‘healthy server’?
log_27891_microprofile_20200604-165332.html (5.1 MB)

‘Healthy server’ with PhysicsStepped starting again?
log_27891_microprofile_20200604-165339.html (3.7 MB)

Lots of PhysicsStepped and DisconnectCleanup?
log_27891_microprofile_20200604-165405.html (5.9 MB)

I’m going to try one fix that I believe would resolve this kind of ‘Physics’ abuse and report back if it solves my issue.


Disconnect Cleanup / Write Marshalled long frame

Some more MicroProfiles from today…

log_E7A5F_microprofile_20200608-175718.html (1.7 MB)

Another video of the issue; the server has been frozen for the past 3 minutes.
(Heads up: Discord sounds in the audio.)
https://streamable.com/ra55fh

Will add more later today; life suddenly got busy.

Not sure if this is related, but coincidentally, at the same time I’m getting a lot of HTTP 503 (Service Unavailable) errors, causing servers in my games to come to a halt.

It started occurring about a day ago, at random intervals. I have done nothing on my end that would’ve caused this, but the game is consistently struggling with broken servers due to failed HTTP requests. It just stops working.

Is there anything you could suggest to try to resolve this issue, or anything for us to test? Do any of my microprofiles show anything that would let the engineering team have us toggle something? We’ve been having this issue for the past few weeks, and it seems we’re running out of solutions.

  • Logging the use of ClickDetectors for click spamming isn’t working.
  • Logging the use of the Touched event isn’t working.
  • Making another ClickDetector as a honeypot isn’t working.
  • Making another Touched event as a honeypot isn’t working.
  • Calling SetNetworkOwner(nil) on all unanchored, non-character parts so they’re handled by the server only isn’t working (a sketch of this sweep follows the list).
  • Lowering MaximumMessageLength in ChatSettings and kicking players who go over the lowered limit isn’t working.
  • Disabling public admin commands for non-staff users isn’t working.
  • Making sure there are zero unanchored parts in the game except for characters isn’t working.
  • Renaming the ChatService ‘SayMessageRequest’ remote, with a honeypot under the old name to throw off exploits that FireServer by event name, isn’t working.
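For reference, here is roughly what that network-ownership sweep looks like; a minimal sketch of my approach, not an official pattern (the pcall is there because SetNetworkOwner errors on anchored or welded-to-anchored parts):

local Players = game:GetService("Players")

-- Minimal sketch: force the server to own every unanchored, non-character part
local function isCharacterPart(part)
	for _, player in pairs(Players:GetPlayers()) do
		local character = player.Character
		if character and part:IsDescendantOf(character) then
			return true
		end
	end
	return false
end

for _, part in pairs(workspace:GetDescendants()) do
	if part:IsA("BasePart") and not part.Anchored and not isCharacterPart(part) then
		pcall(function()
			part:SetNetworkOwner(nil) -- nil = server-owned simulation
		end)
	end
end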

I’m currently going to add a server-side detection that kicks players who say “:lag all”, since apparently some people are saying that and the server always seems to stop responding for a bit afterwards. That should help confirm whether it’s some LocalScript admin command added by an exploit. A minimal sketch of what I mean is below.
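This sketch assumes the legacy Player.Chatted event still fires in this setup; the matched command string is just my guess at what the exploit uses:

local Players = game:GetService("Players")

-- Minimal sketch: kick anyone whose chat contains the suspected command
Players.PlayerAdded:Connect(function(player)
	player.Chatted:Connect(function(message)
		if string.find(string.lower(message), ":lag", 1, true) then
			player:Kick("Suspicious admin command detected")
		end
	end)
end)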

Detecting when clients haven’t pinged the server for over 30 seconds used to work, but it doesn’t anymore, so I need another method of detecting whether the server is lagging. I can’t read Avg. Ping from the developer console in a server script because it’s CoreScript-only.
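Since Avg. Ping isn’t scriptable, the rough server-side substitute I’m considering is timing Heartbeat gaps; a minimal sketch:

local RunService = game:GetService("RunService")

local STALL_THRESHOLD = 1 -- seconds; a gap this long means a frozen frame

-- Minimal sketch: warn whenever the server itself skips a long frame
local lastBeat = tick()
RunService.Heartbeat:Connect(function()
	local now = tick()
	local gap = now - lastBeat
	lastBeat = now
	if gap > STALL_THRESHOLD then
		warn(("Server stalled for %.2f seconds"):format(gap))
	end
end)

The catch is that this only detects stalls of the simulation itself; if the server keeps stepping normally while replication is jammed, it will show nothing.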

It’s getting tiring that Roblox doesn’t provide an option in the Developer Console to automatically start a microprofile when average ping goes high. I’m having to spam-click the ‘Start Recording’ button, because the request to start a server microprofile HAS to arrive BEFORE this exploiter/lagger freezes the server; otherwise the results just look ‘normal’, since the server doesn’t receive my request until AFTER the lag has stopped.

2 Likes

RemoteEvents are sometimes targeted by exploiters, so adding logging for remote events (especially those related to chat) may provide useful clues.

Also, if you can retain the game server IP & game instance ID from a session where there was a lot of lag, I can check the internal server logs to see if there are any other clues. The information will be in a client log line that looks like this:

1591719232.50393,7670,6 ! Joining game '77bc3959-c1bd-4b5b-8009-f78aa071e57e' place 606849621 at 128.116.54.198

Will edit this reply to include other days this has happened once I get onto my other computer, but for now, here is the one I have on my current computer.
[Jun 15] Again today…

We’re getting a LOT of

1592260852.65039,4f10,10 RakPeer has distributed 894 packets to plugins since last debug time
1592260852.65039,4f10,10 PacketReturnQueue is empty, no work to do

in our client logs…

[Jun 10] Just today, lagged again.

1591825071.25897,1f7c,6 ! Joining game '775b9260-07a7-4395-9056-c8eb835c439f' place 2698066019 at 128.116.42.70

Other days

1591223427.53896,2178,6 ! Joining game '7c516762-0873-4afa-a681-5ca99d9de10d' place 2698066019 at 128.116.54.78

1591392902.27006,19d4,6 ! Joining game '7b5eea3b-e821-4b99-bfb7-c336390c7553' place 2698066019 at 128.116.32.75

1591220395.36002,17ac,6 ! Joining game '19d9a285-a007-42ae-8e8d-7b5769b81535' place 2698066019 at 128.116.43.156

I created this script to monitor the default chat’s remote events.
It was running during a laggy session and did not reveal any abuse of the chat remotes.

local Players = game:GetService("Players")
local ReplicatedStorage = game:GetService("ReplicatedStorage")
local CollectionService = game:GetService("CollectionService")

-- Rolling count of remote fires per player over the last second
local numFires = {}

Players.PlayerAdded:Connect(function(player)
	numFires[player.UserId] = 0
end)

-- Clear the entry when a player leaves so the table doesn't grow forever
Players.PlayerRemoving:Connect(function(player)
	numFires[player.UserId] = nil
end)

local folder = ReplicatedStorage:WaitForChild("DefaultChatSystemChatEvents")

-- Wait until the default chat system has created all 14 of its remotes
repeat
	wait()
until #folder:GetChildren() >= 14

-- Bump the player's count, then undo it a second later,
-- so numFires holds the fires within the last second
local function count(userId)
	numFires[userId] = numFires[userId] + 1
	wait(1)
	if numFires[userId] ~= nil then
		numFires[userId] = numFires[userId] - 1
	end
end

-- Tag every RemoteEvent in the chat folder
for _, child in pairs(folder:GetChildren()) do
	if child:IsA("RemoteEvent") then
		CollectionService:AddTag(child, "remote")
	end
end

-- Kick anyone firing the chat remotes six or more times in one second
for _, remote in pairs(CollectionService:GetTagged("remote")) do
	remote.OnServerEvent:Connect(function(player)
		if numFires[player.UserId] ~= nil then
			if numFires[player.UserId] >= 6 then
				player:Kick("Kicked for chat spam!")
				print(player.Name .. " was kicked for chat spam!")
			end
			count(player.UserId)
		end
	end)
end

I’m starting to think that this is a RakPeer issue.

It’s apparent that Roblox uses RakNet, and RakNet has options for peer-to-peer traffic such as congestion control, etc.

Is an exploit overloading server traffic to the point that it’s hitting a buffer limit, forcing us to wait the buffer out?

http://www.jenkinssoftware.com/raknet/manual/Doxygen/structRakNet_1_1RakNetStatistics.html#7e8881dd2f72099037a69ba3cd0b989d

I’m unable to tell, since I don’t think there is even Lua access on a Roblox client to see whether isLimitedByCongestionControl is true.

If you truly want to get rid of exploiters DoSing your game at the application layer, you won’t get around auditing your entire server code, looking for things that can be spammed and are quite expensive.
Look out for:

  • Backdoors (anything requiring a module you don’t know about)
  • OnServerEvent connections
  • OnServerInvoke connections
  • Touched events
  • InvokeClient occurrences (never do this; delete them)
  • ClickDetector events
  • GuiButton mouse events which are connected on the server (yes, that works)
  • Scripts that interact with Instance changes inside characters or player backpacks
  • Scripts that interact with Humanoid properties and events, including animations
  • Scripts that interact with Accessories and Tools which are children of workspace
  • Sound playback if RespectFilteringEnabled is disabled

Add debounces and usage trackers, and/or rewrite badly performing code; a sketch of a simple rate limiter follows below. It’s important to note that “spamming” a signal is not strictly necessary to cause a DoS, as it is also possible to send malformed data that leads to very long or infinite loops or similar.
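As a starting point, here is a minimal per-player rate limiter that can wrap any OnServerEvent handler; the remote name, limits, and helper are illustrative, not from any particular library:

local Players = game:GetService("Players")

-- Wrap a handler so each player gets at most maxCalls calls per window
-- (in seconds); calls over the limit are simply dropped
local function makeRateLimited(maxCalls, window, handler)
	local buckets = {}

	Players.PlayerRemoving:Connect(function(player)
		buckets[player.UserId] = nil -- don't leak state for players who left
	end)

	return function(player, ...)
		local now = tick()
		local bucket = buckets[player.UserId]
		if not bucket or now - bucket.start >= window then
			bucket = {start = now, calls = 0}
			buckets[player.UserId] = bucket
		end
		bucket.calls = bucket.calls + 1
		if bucket.calls > maxCalls then
			return -- over the limit; drop it (or log/kick here)
		end
		return handler(player, ...)
	end
end

-- Usage: at most 10 calls per second per player on a hypothetical remote
local remote = game.ReplicatedStorage:WaitForChild("SomeRemote")
remote.OnServerEvent:Connect(makeRateLimited(10, 1, function(player, ...)
	-- the real, potentially expensive work goes here
end))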

2 Likes

This isn’t anything being abused on the server, as far as I can see.

The server microprofiler shows a completely healthy server across the HUNDREDS of profiles I’ve recorded, aside from occasional big SendData and Disconnect frames. The server has also been reliably pinging a Discord webhook via tick() every 60 seconds sharp.

I’m full-on betting this is a RakNet bug, or some kind of RakNet issue being abused, that’s causing this insane queueing.

Our lag has been happening over… and over… and over… It’s becoming a daily occurrence, and there’s nothing we can do because we don’t have any RakNet logs. We can’t tell who’s sending the most RakNet data to the Roblox server because we simply can’t get that data…

Here’s yet another video showing everyone complaining about lag: the server freezing like crazy, character positions desyncing, chat being chunky.

https://streamable.com/7okxeq
Apparently the Roblox screen recorder doesn’t capture the Network Connection Health stats (CTRL+SHIFT+F4/F6), so I’ll have to record those separately next time.

I’ve completely modified ChatService to not rely on Player.Chatted by removing the legacy C++ fire event, and I’ve renamed a few of the ChatService remote events, so this isn’t a chat spam or command issue.

I’ve checked that the game has zero unanchored parts. I’ve made a script to force SetNetworkOwner(nil) on all non-character parts so they’re server-owned. I’ve checked Server Scripts in the DevConsole, sorted by Rate (/s) to see what’s firing the most, and nothing is abnormal. I’ve reduced my use of RemoteEvents, even though the server logs say nothing about remotes being fired too often by a player. I’ve even swapped between using only Adonis or only Basic Admin Essentials; neither makes a difference.

I’ve exhausted all my efforts; I’m blaming this lag issue on a RakNet buffer queue.

2 Likes

I really wish I could slap [ROBL CRITICAL] on this issue, because it is still happening to multiple games.

We can’t do anything about it, because I have a strong suspicion that this is either a DoS or a RakNet exploit. The Roblox server is just fine, without even a single frame longer than 10 seconds, but every client in-game is frozen for upwards of 10 minutes, completely unable to play the game…

If Roblox staff want information about my captured packets from the game host IP, please DM me within 10 hours of this post; otherwise I’ll just dump the zip files of all the captured RakNet packets here.

7 Likes

This issue is very negatively affecting my game, making it almost unplayable due to these exploits. Please fix this as soon as possible. Exploiters are able to shut down servers in my game through this, and it seems there’s nothing I can do.

5 Likes

My game is being severely impacted by this exploit. Servers are crashing constantly.

5 Likes

Having exhausted every effort, I can finally conclude that this is someone using a DoS attack or RakNet exploit against the game servers we’re playing on.

The only fix is either for Roblox engineers to add DoS prevention measures like Cloudflare Spectrum, or for them to find a way to work around this unwanted traffic jam.

Alternatively, game developers can work on split-server cross-gameplay: splitting players between multiple Roblox servers and using either MessagingService, or HttpService with your own server, to relay data between the servers so that all the characters and game data still show up. A sketch of the MessagingService approach is below.
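For the MessagingService route, a minimal sketch of the relay idea; the topic name and payload shape are made up for illustration, and real snapshots would have to stay under the service’s message size and rate limits:

local MessagingService = game:GetService("MessagingService")
local Players = game:GetService("Players")

local TOPIC = "CrossServerSnapshot" -- hypothetical topic name

-- Receive snapshots published by sibling servers
MessagingService:SubscribeAsync(TOPIC, function(message)
	local snapshot = message.Data
	if snapshot.jobId ~= game.JobId then
		-- here you would spawn/update proxy characters from snapshot.players
		print(("Got %d players from server %s"):format(#snapshot.players, snapshot.jobId))
	end
end)

-- Publish our own player list every few seconds (PublishAsync is rate limited)
while true do
	local names = {}
	for _, player in pairs(Players:GetPlayers()) do
		table.insert(names, player.Name)
	end
	pcall(function()
		MessagingService:PublishAsync(TOPIC, {jobId = game.JobId, players = names})
	end)
	wait(10)
end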

4 Likes

We have now been having this issue for over 40 days.

Here is the entire list of game job IDs I have available where the game lagged, for Roblox staff to look through.

1591825071.25897,1f7c,6 ! Joining game '775b9260-07a7-4395-9056-c8eb835c439f' place 2698066019 at 128.116.42.70

1592254411.40699,4f0c,6 ! Joining game '7b8c3158-4946-4e90-a8b5-2103c9a3a508' place 2698066019 at 128.116.4.103

1592342524.64920,3f30,6 ! Joining game 'f8afb967-ef90-4112-84fd-24308e0d5b1e' place 2698066019 at 128.116.24.153

1592601368.21216,1dfc,6 ! Joining game '345ba639-e6f0-49a9-b726-706be203e46b' place 2698066019 at 128.116.34.25

1592860096.39523,0798,6 ! Joining game '0743b4a1-88c5-455d-a0dc-496d904e2595' place 2698066019 at 209.206.42.108
pcapng for lag events 1 through 3:
6-22-2020-1-3 lag events.zip (2.9 MB)

1592950723.75232,3f8c,6 ! Joining game '39fb1318-c90f-4aa0-8099-f5f1ece59768' place 2698066019 at 128.116.35.88


This is still a game-breaking issue, as nobody can play our game.

2 Likes