How we reduced bandwidth usage by 60x in Astro Force (Roblox RTS)

Atrazine · May 3, 2021, 10:30pm

Summary

Hello! This is a summary of some of the bandwidth optimizations we made in Astro Force, our Roblox Accelerator project (link here)! Astro Force is a real-time strategy (RTS) game @loravocado and I are working on, and the goal is to have a system efficient enough to handle hundreds of independent units. While a lot of these bandwidth optimizations are very specific to our game, I thought it’d still be cool to share what we did! This is also my first time doing bandwidth optimizations, so feel free to share your thoughts and suggestions!

Here’s a quick video of 400 units moving around: Roblox RTS - 400 Unit Stress Test - YouTube

First, let’s start off with some stats using 100 moving units to benchmark!

The bandwidth was measured by using the menu that pops up when you press Shift+F3. As a general rule of thumb, bandwidth usage below 40-50 KB/s is exceptional. For comparison, a game of Phantom Forces or Arsenal typically uses around 50-80 KB/s. As you can see from the image above, we brought our bandwidth usage down about 60 times since V1. Here’s how we did it!

Version 1

Version 1 involved representing units using a part on the server and having the server CFrame them to a goal position. The client then rendered the character models on top of this part. Essentially, we let Roblox take care of all the replication for us. This was simple, but took an enormous amount of bandwidth as seen in the previous image. In general, CFraming parts is extremely costly in terms of bandwidth.

Furthermore, the core loop of the game ran each heartbeat cycle, which meant parts were being CFramed about 60 times a second. This contributed to the insane bandwidth usage we saw earlier.

Version 2

We scrapped V1 and rebuilt the game from the ground up. We stopped using parts on the server entirely, instead storing the positions of units on the server inside of a script and manually sending positioning data to each client. We also created a fully custom collision system specifically adapted for our game, which resulted in a massive reduction in CPU usage going from V1 to V2. With this new system, we had a much higher degree of control over everything, including replication.

Reducing rate of replication

One of the biggest ways V2 saved bandwidth was by reducing the number of times replication happened. Instead of CFraming (and sending the corresponding data) 60 times a second, we made replication only happen 10 times a second, and smoothed out the movement on the client using linear interpolation.
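
A rough sketch of the throttling and smoothing idea, not our actual code: `makeThrottle` and `lerp` are hypothetical helper names, and in a real game the step function would be driven by a Heartbeat connection on the server and the interpolation by a per-frame update on the client.

```lua
-- Hypothetical sketch: run a callback at a fixed rate regardless of frame rate,
-- and interpolate between two values on the receiving side.

-- Returns a step function. Call it every frame with that frame's delta time;
-- it invokes `callback` once for each full `interval` that has elapsed.
local function makeThrottle(interval, callback)
	local accumulated = 0
	return function(dt)
		accumulated = accumulated + dt
		while accumulated >= interval do
			accumulated = accumulated - interval
			callback()
		end
	end
end

-- Linear interpolation from a to b; alpha is clamped to [0, 1].
local function lerp(a, b, alpha)
	alpha = math.max(0, math.min(1, alpha))
	return a + (b - a) * alpha
end

-- Usage idea: step = makeThrottle(1 / 10, SendPositioningData), then call
-- step(dt) every frame; the client draws lerp(lastX, targetX, alpha).
```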

Sending less data

A huge part of the bandwidth savings also came from not sending data we can determine on the client. For example,

The Y-coordinate of each unit can be determined on the client. Since the terrain in our game conforms to a grid and there aren’t any places where two heights are possible, we were able to make a heightmap where the Y-coordinate of the terrain can be queried based on the X and Z coordinate of the unit. Raycasting would have also worked; however, our heightmap is about 10x faster than raycasting.
We only need one coordinate for the orientation of the unit rather than all three since units in our game only rotate about the Y axis. So we don’t even need to send the X and Z coordinates of orientation!

So, instead of sending 6 numbers to update the position of a unit (XYZ coordinates and XYZ orientation), we only need to send three numbers: the XZ coordinates and the Y orientation. This allows us to send 2x less data, and hence reduces bandwidth usage by 50%!
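
To illustrate the heightmap lookup, here is a minimal sketch. The table layout, the `CELL_SIZE` value, the sample heights, and the `getHeight` name are all assumptions made for the example, and the map corner is assumed to sit at the origin.

```lua
-- Hypothetical heightmap: heights[row][column] is the terrain height of a cell.
local CELL_SIZE = 4 -- studs per heightmap cell (assumed value)

local heights = {
	{ 0, 0, 2 },
	{ 0, 1, 2 },
	{ 1, 1, 3 },
}

-- Returns the terrain height under world position (x, z),
-- or nil when the position lies outside the map.
local function getHeight(x, z)
	local column = math.floor(x / CELL_SIZE) + 1
	local row = math.floor(z / CELL_SIZE) + 1
	local rowData = heights[row]
	return rowData and rowData[column]
end
```

Because the lookup is just two divisions and two table reads, the client can recompute Y every frame instead of receiving it over the network.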

Only sending needed position updates

In order to avoid sending unnecessary data, the server only sends data about units that are currently moving. The replication loop for positioning data looks something like this:

local function SendPositioningData()
	-- This function is called about 10 times a second.

	local Packet = {}
	
	for _, Unit in pairs(Units) do
		if Unit.PositionChanged then
			Packet[#Packet + 1] = Unit:GetPositionData()
			Unit.PositionChanged = false
		end
	end
	
	if #Packet > 0 then
		PositionChangedEvent:FireAllClients(Packet)
	end
end

where Unit:GetPositionData() returns a table that looks something like this:

function Unit:GetPositionData()
    return {
        self.Hash,
        self.Orientation,
        self.X,
        self.Z
    }
end

Essentially, if the unit’s position changed since the last replication step, we send the unit’s positioning data: the unit’s hash (a unique identifier so that the client knows which unit the server is trying to move), its orientation, and its XZ coordinates, for a total of 4 numbers.
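
On the receiving side, the client could apply such a packet roughly like this. It is only a sketch: `unitsByHash`, `applyPositionPacket`, and the `Target*` field names are hypothetical, and in the game the function would be connected to the event's OnClientEvent.

```lua
-- Hypothetical client-side lookup from unit hash to the unit's render state.
local unitsByHash = {}

-- Applies one packet. Each entry follows the order returned by
-- Unit:GetPositionData(): { Hash, Orientation, X, Z }.
local function applyPositionPacket(packet)
	for _, data in ipairs(packet) do
		local hash, orientation, x, z = data[1], data[2], data[3], data[4]
		local unit = unitsByHash[hash]
		if unit then
			-- Store the new target; rendering code interpolates toward it.
			unit.TargetOrientation = orientation
			unit.TargetX = x
			unit.TargetZ = z
		end
	end
end
```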

Using Vector2int16s and Vector3int16s

The last thing we did to optimize bandwidth in V2 was using some obscure Roblox types: Vector2int16 and Vector3int16 (credit to this post for this idea!). This is definitely a micro-optimization, but it brought another 70% bandwidth reduction. The 100 unit moving test consumes 35 KB/s without this optimization, and 10 KB/s with.

By default, when we send numbers using RemoteEvents, they’re sent as 64-bit floating point numbers. So, for each position update, the four numbers (Hash, Orientation, X, and Z) add up to around 256 bits, plus a bit of overhead for the table. How do we reduce the number of bits we send?

Our answer lies with the use of Vector2int16s and Vector3int16s. A Vector2int16 is capable of storing two int16s (which are 16-bit signed integers). A Vector3int16 is the same thing except it stores three int16s. An int16 can only store integers, and its range is [-32768, 32768).

So, we can probably store the hash of each unit inside a Vector2int16 since the hash is always an integer. We also don’t expect there to be more than a couple thousand unique units in the game at once, so the hash should always fit in the range of an int16. But what about the X, Z, and orientation, which are more than likely decimal values?

In Astro Force, all computations such as distance checks, collisions, etc. that require accurate numbers are done on the server. The client doesn’t technically need to be super accurate with the position or orientation. So we’re okay if the unit’s position on the client is accurate within 0.2 studs and the orientation is accurate within 0.01 radians.

What we can do is multiply each of the numbers we want to send on the server by some multiplier, round the result to the nearest integer, send it to the client, and then simply have the client divide the number it receives by that same multiplier! To make this process easy, we wrote an encoder and decoder function just for this purpose. It looks something like this:

local COORD_MULTIPLIER = 5 -- Numbers have accuracy within 0.2
local ORIENTATION_MULTIPLIER = 100 -- Numbers have accuracy within 0.01

function Encoder.EncodePositioningData(Hash, Orientation, X, Z)
	-- Orientation has range [0, 2pi)
	Orientation = math.floor(ORIENTATION_MULTIPLIER * Orientation + 0.5)
	X = math.floor(COORD_MULTIPLIER * X + 0.5)
	Z = math.floor(COORD_MULTIPLIER * Z + 0.5)

	return {
		Vector2int16.new(Hash, Orientation),
		Vector2int16.new(X, Z)
	}
end

function Encoder.DecodePositioningData(PositioningData)
	local Block1 = PositioningData[1]
	local Block2 = PositioningData[2]

	local Hash = Block1.X
	local Orientation = Block1.Y / ORIENTATION_MULTIPLIER
	local X = Block2.X / COORD_MULTIPLIER
	local Z = Block2.Y / COORD_MULTIPLIER

	return Hash, Orientation, X, Z
end

As you can see, instead of sending four 64-bit numbers, we now send four 16-bit numbers. In theory, this is a reduction from 256 bits to 64 bits – a 75% theoretical reduction! In-game, we saw this as a 70% bandwidth reduction – we assume that it’s not a perfect 75% reduction due to a bit of overhead from the Vector2int16s we used.
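
To make the precision loss concrete, here is a small round trip of a single coordinate. The `quantize` and `dequantize` names are hypothetical, and the sketch leaves out Vector2int16 so that it runs in plain Lua as well as Luau.

```lua
local COORD_MULTIPLIER = 5 -- same multiplier as the encoder above

-- Scales a value and rounds it to the nearest integer.
local function quantize(value, multiplier)
	return math.floor(multiplier * value + 0.5)
end

-- Reverses the scaling; the rounding error cannot be recovered.
local function dequantize(encoded, multiplier)
	return encoded / multiplier
end

local original = 123.456
local encoded = quantize(original, COORD_MULTIPLIER)  -- 617
local decoded = dequantize(encoded, COORD_MULTIPLIER) -- 123.4
```

Since the value is rounded to the nearest step, the decoded number is off by at most half a step, which is 0.1 studs with a multiplier of 5.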

Of course, doing this has some limitations and downsides:

This is not a good way to achieve accurate numbers. In our case, we can get away with it since we don’t need accuracy.
The range of valid positions is now limited to [-32768 / COORD_MULTIPLIER, 32768 / COORD_MULTIPLIER), which in our case is equivalent to about [-6553, 6553). This works for Astro Force since most maps never exceed around 3000x3000 studs, but may not work for many other games.
Overall readability of the code goes down a bit.
If Roblox removes Vector2int16s and Vector3int16s, then we’ll be sad.
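
One way to guard against the range limitation is to check the scaled value before it goes into an int16. This is a hedged sketch rather than something from our encoder; `encodeCoordinateChecked` is a hypothetical name.

```lua
local COORD_MULTIPLIER = 5
local INT16_MIN, INT16_MAX = -32768, 32767

-- Scales and rounds a coordinate, raising an error if the result
-- would not fit in a signed 16-bit integer.
local function encodeCoordinateChecked(value)
	local scaled = math.floor(COORD_MULTIPLIER * value + 0.5)
	assert(scaled >= INT16_MIN and scaled <= INT16_MAX, "coordinate out of int16 range")
	return scaled
end
```

Failing loudly here is a design choice for the example; clamping to the valid range would be an alternative if a wrong position is preferable to an error.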

Despite the downsides, at the end of the day, a 70% bandwidth reduction outweighed the downsides and we decided to use this hacky method to our advantage.

Version 3

We were already quite happy with the bandwidth usage of V2: 100 units moving around only consumed 10 KB/s. But we wanted to take it to the next level with another 50% reduction in bandwidth.

Warning: this section of optimizations requires a bit of knowledge about how integers are represented.

Thanks to the bit32 library now being on Roblox, we used a technique called bit packing in order to save data by manipulating bits.

In V2, each positioning update for a unit looks something like this on the bit level:

But we can do much better. We came to a few logical conclusions to save bits:

We don’t expect to have more than 2000 unique units in a single game. We could just use 11 bits (allowing 2^11 = 2048 unique units) and save 5 bits.
Our orientation doesn’t need so much precision – we’re happy if the orientation is within a few degrees of its actual value on the server. Let’s dedicate 7 bits to the orientation, giving us accuracy within 2.8 degrees (2^7 = 128 unique values; 360/128 gives us intervals of 2.8 degrees). This saves 9 bits.
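
As an illustration of the 7-bit orientation, an angle in radians could be mapped to one of 128 steps like this. The function names are hypothetical, and rounding to the nearest step (with wraparound at a full turn) is an assumption of this sketch.

```lua
local ORIENTATION_BITS = 7
local ORIENTATION_STEPS = 2 ^ ORIENTATION_BITS -- 128 distinct values
local TWO_PI = 2 * math.pi

-- Maps an angle in radians to an integer in [0, 127].
local function encodeOrientation(radians)
	local normalized = (radians % TWO_PI) / TWO_PI -- now in [0, 1)
	-- Round to the nearest step; the final modulo wraps step 128 back to 0.
	return math.floor(normalized * ORIENTATION_STEPS + 0.5) % ORIENTATION_STEPS
end

-- Maps an integer in [0, 127] back to an angle in [0, 2pi).
local function decodeOrientation(encoded)
	return encoded / ORIENTATION_STEPS * TWO_PI
end
```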

However, we run into a problem with the X and Z coordinates: we have little room to shave off bits without losing even more precision or severely limiting map size. But what if we can somehow reduce the size of the position coordinates? Instead of sending the global position of each unit for a position update, what if we just send the displacement from some given point – say, the displacement from the corner of a grid cell a unit is currently inside? This could result in sending much smaller numbers!

To do this, we first place a grid over the map where each cell spans 8x8 studs. We then enumerate each of the grid cells with a unique ID. This process is done on both the server and client.

Now, a position coordinate’s displacement from the corner of its grid cell always lies in the range [0, 8).

If we multiply the displacement of the coordinate by 8 and take the floor, we’re guaranteed that the value is 63 or less (if the displacement were 8 or greater, the unit would be in the next grid cell). This is perfect: the range of a 6-bit number is [0, 63]! For example, if the unit’s X displacement from the corner of the grid cell is 6.51, we multiply this value by 8 to get 52.08. We then take the floor and get 52, which can be represented by just 6 bits.

We can transmit this number (52) over to the client, have the client divide 52 by 8, and we get 6.5 (which is indeed very close to the original 6.51)! Using this method, we achieve a precision of 0.125 studs, which is actually more precise than V2!
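
The same arithmetic as a small sketch for one axis. `splitCoordinate` and `joinCoordinate` are hypothetical names, and the grid is assumed to start at the origin.

```lua
local CELL_SIZE = 8               -- studs per grid cell
local DISPLACEMENT_MULTIPLIER = 8 -- gives 1/8 stud precision

-- Splits a world coordinate into its cell index and a 6-bit offset in [0, 63].
local function splitCoordinate(coordinate)
	local cellIndex = math.floor(coordinate / CELL_SIZE)
	local displacement = coordinate - cellIndex * CELL_SIZE -- in [0, 8)
	local offset = math.floor(displacement * DISPLACEMENT_MULTIPLIER)
	return cellIndex, offset
end

-- Rebuilds the (slightly less precise) world coordinate.
local function joinCoordinate(cellIndex, offset)
	return cellIndex * CELL_SIZE + offset / DISPLACEMENT_MULTIPLIER
end
```

For a world X of 22.51, this gives cell index 2 and offset 52, and joining them again yields 22.5.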

Now we can bring everything together. If we dedicate 11 bits to the hash, 7 bits to the orientation, and 6 bits each to the X and Z displacements from the corner of the grid cell the unit is currently in, we have the following:

As we can see, a position update now fits in 32 bits instead of 64 (11 + 7 + 6 + 6 = 30 bits of data). Instead of using two Vector2int16s, we can use a single Vector2int16, resulting in half the bandwidth usage!
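
Here is one possible way to pack those four fields into two 16-bit words. The bit layout below is an assumption chosen for the example and may differ from the arrangement we actually use. It relies on plain arithmetic so that it also runs outside Roblox; in Luau, bit32.lshift and bit32.extract would do the same job.

```lua
-- Hypothetical layout:
--   word1: bits 0-10 hash, bits 11-15 low 5 bits of orientation
--   word2: bits 0-1 high 2 bits of orientation, bits 2-7 X, bits 8-13 Z
-- Inputs are assumed to be integers already within their bit ranges.
local function packUpdate(hash, orientation, x, z)
	local orientationLow = orientation % 32
	local orientationHigh = math.floor(orientation / 32)
	local word1 = hash + orientationLow * 2048 -- 2048 = 2^11
	local word2 = orientationHigh + x * 4 + z * 256
	return word1, word2
end

local function unpackUpdate(word1, word2)
	local hash = word1 % 2048
	local orientationLow = math.floor(word1 / 2048)
	local orientationHigh = word2 % 4
	local x = math.floor(word2 / 4) % 64
	local z = math.floor(word2 / 256) % 64
	return hash, orientationLow + orientationHigh * 32, x, z
end
```

Both words are unsigned values in [0, 65535]. When stored in a Vector2int16 they come back as signed int16s, so the receiver has to reinterpret them as unsigned before unpacking.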

The last thing we need to deal with is how to detect which grid cell the unit is currently in. We essentially wrote some code to detect when a unit changes grid cells. When a unit does change grid cells, we simply append the ID of the grid cell the unit is moving into. Since changing grid cells happens relatively infrequently, we rarely have to do this – a vast majority of position updates do not involve grid changes. When there is a grid cell change, we send a Vector3int16 instead of the usual Vector2int16 for a position update:

As visible, we dedicate 18 bits to the grid ID if there is a grid change. This means we can have maps with up to 2^18 = 262,144 grid cells. Assuming our map is square, that is 512x512 cells of 8 studs each, so our map can be up to 4096x4096 studs large, which is plenty for all our maps in Astro Force. If we ever have larger maps, the grid cell size can easily be tuned to be 10x10 studs or even 16x16 studs at the cost of precision.
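
A sketch of how grid cells could be numbered so that every ID fits in 18 bits. The row-major numbering, the `getGridId` name, and the map corner sitting at the origin are all assumptions made for the example.

```lua
local CELL_SIZE = 8                         -- studs per grid cell
local MAP_SIZE = 4096                       -- studs per side of a square map
local CELLS_PER_SIDE = MAP_SIZE / CELL_SIZE -- 512

-- Returns a row-major cell ID in [0, 262143] for a position on the map.
local function getGridId(x, z)
	local column = math.floor(x / CELL_SIZE)
	local row = math.floor(z / CELL_SIZE)
	return row * CELLS_PER_SIDE + column
end
```

The largest ID is 511 * 512 + 511 = 262,143, which is exactly 2^18 - 1.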

Bit packing example

To end off this thread, here’s an example of how you could pack two 8-bit integers into a larger 16-bit integer!

Let’s say we have two 8-bit unsigned integers, x = 32 and y = 145. In our 16-bit result, we can dedicate the lower 8 bits to x and the upper 8 bits to y. Here’s a visualization of what that looks like:

So we can now set up some initial constants! We can use the 0b prefix on a number to tell Luau that we’re inputting a binary number.

local X_BITS = 0b0000_0000_1111_1111
local Y_BITS = 0b1111_1111_0000_0000

First, let’s deal with x. Since x occupies the lowest 8 bits, we don’t need to shift the bits for x. All we need to do is perform bit32.band to ensure x does not overflow into bits dedicated for y! This is what this looks like:

local x = 32
local y = 145

local result = 0
result += bit32.band(x, X_BITS)
print(result) -- Prints 32!

Now we need to deal with y. It’s a little bit more work now since we need to “shift” the bits of y over 8 bits. We use bit32.lshift for this! Essentially, this will shift the bits of y over 8 places so that the bits of y occupy the upper 8 bits of the result. After we shift the bits of y, we also make sure to perform bit32.band on y’s shifted bits to ensure we only affect the bits dedicated to y.

result += bit32.band(bit32.lshift(y, 8), Y_BITS)
print(result) -- Prints 37152!

Now, we can send this to the client inside a Vector2int16 for example. If we had something else to store inside the Vector2int16, we could also store it in the second int16 slot.

exampleEvent:FireAllClients(Vector2int16.new(
	result, 
	0 -- We could store something else here! :)
))

Now the fun part, decoding the result! We first retrieve the result from the Vector2int16 as follows:

exampleEvent.OnClientEvent:Connect(function(Data)
	local result = Data.X
	print(result) -- Prints -28384!
end)

Notice the number is negative. This is because the range of an int16 is [-32768, 32768), so the original result (which was bigger than 32767) wraps around. However, this doesn’t matter since all we care about are the bits. Essentially, we’ll treat the most significant bit of the integer (which is normally the sign bit) as just a regular bit, as if it were an unsigned 16-bit integer.
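
The wraparound can be made explicit with two small helpers. This is only an illustration with hypothetical function names; it uses modulo arithmetic so that it runs in plain Lua as well as Luau.

```lua
-- Reinterprets a signed int16 value as its unsigned 16-bit bit pattern.
local function toUnsigned16(value)
	return value % 65536
end

-- Reinterprets an unsigned 16-bit value as the signed int16 it is stored as.
local function toSigned16(value)
	value = value % 65536
	if value >= 32768 then
		return value - 65536
	end
	return value
end
```

With these, toSigned16(37152) gives -28384 and toUnsigned16(-28384) gives 37152 again, so no bits are lost in the round trip.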

We can now “unpack” the bits. For this, we use the bit32.extract method! To extract bits, we need three arguments:

The result,
The starting position to extract bits, and
The number of bits to extract from the starting point.

For example, we know that within the result, x starts at bit 0 and takes up 8 bits. So, extracting x looks something like this:

local x = bit32.extract(result, 0, 8)
print(x) -- Prints 32! :D

To finish up, let’s extract y! We know y starts at bit 8 and also takes up 8 bits, so:

local y = bit32.extract(result, 8, 8)
print(y) -- Prints 145! :)))

Hopefully this example gave you a good idea on how to perform bit packing! In my bit arrangement from earlier, you may have noticed that the orientation bits are split over two int16s. To split the bits over two integers, all you need to do is use bit32.extract and extract the parts of the bits you want to store in each integer.

Conclusion

Hope you enjoyed reading! We did a lot of hacky stuff, but the bandwidth savings were definitely worth it. Hopefully in the future, Roblox will implement more control over networking – such as being able to specify that you’re sending an int16 instead of always sending numbers as 64-bit floating point numbers!

Definitely let me know if you think of any more optimizations or if you see a mistake!

Also, feel free to check out this article on how we’re implementing fog of war: Fog of War in Astro Force (RTS)

NINJAMASTR999 · May 3, 2021, 10:49pm

This post is excellent - a great showcase on optimizing and bringing additional complexity into Roblox games while maintaining a smooth experience across all platforms.

If I recall correctly, Roblox recently rolled out a new Vector3 update which claims to be much quicker. Have you tested it compared to using Vector3int16s (or maybe they also updated Vector3int16s? I haven’t seen them widely used in the wild so I can’t imagine they receive much update love, and it seems like much of the load is on sending data and not vector calculations).

boatbomber · May 3, 2021, 10:53pm

This was a very interesting read, thank you for taking the time to so thoroughly document your work. I really enjoyed seeing all the clever ways you stripped excess data. I’m loving Astro Force so far!

The Vector3 performance update doesn’t have any impact on Atrazine’s use case- the update makes mathematical operations on Vector3s faster, but they aren’t doing any math on these Vectors. They’re using them as storage only, so using the int16 version is more useful here.

Atrazine · May 3, 2021, 10:54pm

Thanks! I haven’t tested the performance of Vector2int16s, but I believe the difference is very minimal. :)))

Atrazine · May 3, 2021, 10:56pm

Thanks! Glad you like Astro Force

exxtremestuffs · May 3, 2021, 10:57pm

Very impressive, nice to see a roblox game optimize replication this well. I can’t imagine bit packing significantly aided performance but the serversided rendering- or lack thereof, definitely did.

CmdrRaine · May 3, 2021, 11:19pm

This is a genius post. Thanks, much appreciated.

Glaring · May 3, 2021, 11:21pm

One of the most interesting posts I’ve read here in a while, the forum needs more of this. Awesome job and good luck with your project!

Elocore · May 3, 2021, 11:23pm

Finally, something interesting to read on the DevForums. This is incredible! I think my team will adopt some of these methods in our own game. I think you will want to post this in community-resources though

VegetationBush · May 3, 2021, 11:26pm

Reading through this thread I see you’ve done a lot of research and testing. I was wondering if it’s even more efficient to send strings. For example, instead of sending a number that might be 32 or 64 bit, why not send a string?:

local number = 123456789
local stringNumber = tostring(number) -- sending this value

local convertedNumber = tonumber(stringNumber) -- convert "stringNumber" once sent

I read on the lua documentation that string are 8-bit so would theoretically help your problem:
“Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros.”

Do you think that this can be another workaround for reducing bandwidth usage? Note that I’m not familiar with “hashes” or “bytecode” or things like that, it’s just something I thought of that might work. I also might’ve misinterpreted the information in the documentation.

Atrazine · May 3, 2021, 11:29pm

Ooh this is a super interesting idea! I’ll definitely need to try this out

Atrazine · May 3, 2021, 11:31pm

Actually, I believe each character is 8 bits, not the whole string. So if you had 123, that’d already be 3x8 = 24 bits.
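
A quick way to check this reasoning is to count the characters of the decimal string. The sketch below uses a hypothetical helper name and ignores any extra overhead Roblox may add when sending a string.

```lua
-- Number of bits needed for the characters of a number's decimal string,
-- at 8 bits per character.
local function decimalStringBits(n)
	return #tostring(n) * 8
end

local smallBits = decimalStringBits(123)       -- 3 characters
local largeBits = decimalStringBits(123456789) -- 9 characters
```

So 123456789 as a decimal string takes 72 bits for its characters alone, which is already more than the 64 bits of a regular number.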

Fluffmiceter · May 3, 2021, 11:40pm

The jump from gen 1 to gen 2 is something I can strongly relate to. At the beginning of RoKarts, the kart assembly consisted of many welded parts, several constraints (springconstraint and prismaticconstraint mainly) and many moving parts. All of this heavy physics data would get replicated to the server, even though none of it was wanted. I remember having something insane like 60 KB/s send. The story only got worse, as I would have AI racers run on the server, and each AI had its entire kart exist in workspace, physics objects and everything. This meant that the server would replicate all of this physics data back to the clients, resulting in something absurd like 400 KB/s at 6 AIs simulated (and something like 1000 KB/s for the full 11 AIs).

The solution, obviously, was to transition the movement system to something that would require nothing in workspace, so on the server, you could have a perfectly empty environment with just the racetrack itself, and on the client, you would only draw the core ingredients of the karts for visuals sake.

Right now, I have 3 KB/s send, and up to 50 KB/s with a full lobby of 12 players. The baseline for receive is actually 20 KB/s because I need to send some special data back to the client for server authoritative magic. For every additional player in the server, it costs about 8 KB/s. Then you might ask, how do I manage to get just 50 KB/s? If each player costs 8, and there can be up to 11 players, then it should be 108 KB/s right? The trick that I use is, on the server, I do distance checks from each client to every other client. Then, I use this distance information to inform which karts are highest priority to be replicated, and pick those, somewhat like a rate-limit but smarter. I cap the number of karts I can replicate to just 4 per frame, and figure out what are the optimal 4 karts I can send to keep everything looking as smooth as possible. Karts close to you should obviously be updated quicker, while karts far away can be updated only once every 6 frames or slower while being almost imperceptible.
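
A simplified sketch of the distance-based selection described above. It only picks the nearest karts and does not model the slower update rate for distant ones; the `X`/`Z` fields and both function names are hypothetical.

```lua
-- Squared distance on the XZ plane (no square root needed for comparisons).
local function distanceSquared(a, b)
	local dx, dz = a.X - b.X, a.Z - b.Z
	return dx * dx + dz * dz
end

-- Returns up to `limit` karts, nearest to `origin` first.
local function pickNearest(origin, karts, limit)
	-- Copy first so the caller's list keeps its order.
	local sorted = {}
	for index, kart in ipairs(karts) do
		sorted[index] = kart
	end
	table.sort(sorted, function(a, b)
		return distanceSquared(origin, a) < distanceSquared(origin, b)
	end)

	local picked = {}
	for index = 1, math.min(limit, #sorted) do
		picked[index] = sorted[index]
	end
	return picked
end
```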

Oh by the way, a cool trick for anyone reading. If you have to represent some sort of state for a player, an object, et cetera, you might end up with a large table of booleans to represent the state. For example, with a character controller, you might have booleans like IsFalling or IsSitting or IsCollidingWall. For these cases, you should definitely make use of bit32 to pack the bools into one or a few characters. You can put 8 booleans into one character! I do believe remotes have some inefficiency in order to communicate what type of data each argument is, so having one giant string to describe everything you need to replicate should be leaner than having the data all separated. At least I saw a fairly significant savings from that.
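
For example, eight booleans could be packed into a single character roughly like this. It is a sketch with hypothetical function names that uses plain arithmetic so it runs in any Lua; in Luau the bit32 functions would work equally well.

```lua
-- Packs up to 8 booleans (flags[1] to flags[8]) into a one-character string.
-- flags[1] ends up in the lowest bit; missing entries count as false.
local function packFlags(flags)
	local byte = 0
	for index = 1, 8 do
		if flags[index] then
			byte = byte + 2 ^ (index - 1)
		end
	end
	return string.char(byte)
end

-- Reverses packFlags and returns a table of 8 booleans.
local function unpackFlags(character)
	local byte = string.byte(character)
	local flags = {}
	for index = 1, 8 do
		flags[index] = math.floor(byte / 2 ^ (index - 1)) % 2 == 1
	end
	return flags
end
```

Packing { IsFalling = true, IsSitting = false, IsCollidingWall = true } as flags 1 to 3 would produce the single byte 5.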

ThePoinball · May 4, 2021, 12:39am

I wasn’t expecting the Optimisation V3 with bit manipulation! Thanks a lot for sharing!!

SnakeWorl · May 4, 2021, 1:16am

brb stealing this optimization idea

crazyblocks234 · May 4, 2021, 1:36am

This is very helpful information indeed, but I believe a topic like this belongs in #resources:community-resources.

Great work, though! Thanks!

SourceShahb · May 4, 2021, 6:52am

This is the best post I have ever read on DevForum. Gotta love how you document every detail of your progress since V1 and I gotta say, that is very smart thinking the way you optimized your codes into bits to save time, and data.

I will definitely try your methods for my game with my team since I will be using loads of data.

Keep up the great work Atrazine!

Ukendio · May 4, 2021, 8:37am

Good to see that posts like these are surfacing. I had a similar scenario, where I ran the simulation of “ai agents” completely on the server with no visual component and sent necessary data over remotes. My experience has been to only send these events when “operator” reacts to state changes instead of sending data every lifecycle/network step. Would you mind sharing what you opted for?

I don’t think you mentioned whether you used humanoids or not, I cannot imagine that you would use them given the scale. And looking at the fact that they’re clothless, also with the developer-implemented collision system, most likely no humanoid? I would like to share that if aesthetic is very important to you and you want clothes on your npcs but do not want to use humanoids, I strongly suggest remapping the UV of the npc’s bodyparts. Now you only need to take the assetid of a shirt and put it directly onto the meshpart’s textureid field. Looks great at the very least .

Additionally, I think it is awesome that you are shedding light on how useful it is to learn binary numbers. 3b1b has tons of videos that explain advanced concepts based on similar mechanics.

Atrazine · May 4, 2021, 8:53am

Hello!

I’m not 100% sure what you mean by only sending events when “operator” reacts to state changes. What we do is we only send positioning data for units that are moving. To do this, whenever a unit moves, we mark it as “position changed”. Then, when we perform replication, we replicate units that have the “position changed” flag set. We then mark the “position changed” flag as false again. Hope this answers your first question!

We do not use humanoids! Everything is custom (custom collisions, raycasts, etc.) for best performance and control over how everything works. The UV map idea you mentioned is definitely something we’d love to do! Thanks for the suggestion, I didn’t know such a thing was possible.

I’ll have to check out 3b1b’s videos! I didn’t know they had videos on binary numbers. Ty for the suggestion!

Atrazine · May 4, 2021, 9:09am

Hahaha if we can’t figure out the UV stuff, I’ll hit you up

I’ll take a look at his videos! Thanks for the link.