The jump from gen 1 to gen 2 is something I can strongly relate to. At the beginning of RoKarts, the kart assembly consisted of many welded parts, several constraints (mainly SpringConstraint and PrismaticConstraint) and many moving parts. All of this heavy physics data would get replicated to the server, even though none of it was wanted. I remember having something insane like 60 KB/s send. The story only got worse, as I would have AI racers run on the server, and each AI had its entire kart exist in workspace, physics objects and everything. This meant that the server would replicate all of this physics data back to the clients, resulting in something absurd like 400 KB/s with 6 AIs simulated (and something like 1000 KB/s for the full 11 AIs).
The solution, obviously, was to transition the movement system to something that would require nothing in workspace, so on the server, you could have a perfectly empty environment with just the racetrack itself, and on the client, you would only draw the core ingredients of the karts for visuals' sake.
Right now, I have 3 KB/s send, and up to 50 KB/s receive with a full lobby of 12 players. The baseline for receive is actually 20 KB/s because I need to send some special data back to the client for server authoritative magic. Every additional player in the server costs about 8 KB/s. Then you might ask, how do I manage to get just 50 KB/s? If each player costs 8 and there can be up to 11 other players, then on top of the 20 KB/s baseline it should be 20 + 11 × 8 = 108 KB/s, right? The trick that I use is, on the server, I do distance checks from each client to every other client. Then, I use this distance information to decide which karts are highest priority to be replicated, and pick those, somewhat like a rate limit but smarter. I cap the number of karts I replicate to each client at just 4 per frame, and figure out which 4 karts are the best ones to send to keep everything looking as smooth as possible. Karts close to you should obviously be updated more often, while karts far away can be updated only once every 6 frames or slower, and the difference is almost imperceptible.
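To make the idea concrete, here is a minimal sketch of one way to pick the 4 karts per frame. It is not my exact code: the function name `pickKartsToSend`, the `1 / (1 + distance)` scoring, and the plain `{x, y, z}` position tables are all assumptions for illustration (in Roblox you would use Vector3 and `.Magnitude`). Each kart builds up priority every frame, nearby karts build it up faster, and a kart's priority resets once it has been sent, so far karts still get a turn eventually.

```lua
-- Hypothetical sketch: names, scoring formula, and constants are assumptions.
local MAX_KARTS_PER_FRAME = 4

-- viewerId:    id of the client we are replicating to
-- positions:   map of kartId -> {x, y, z}
-- accumulated: map of kartId -> priority carried over between frames (mutated)
local function pickKartsToSend(viewerId, positions, accumulated)
    local viewerPos = positions[viewerId]
    local candidates = {}

    for kartId, pos in pairs(positions) do
        if kartId ~= viewerId then
            local dx = pos[1] - viewerPos[1]
            local dy = pos[2] - viewerPos[2]
            local dz = pos[3] - viewerPos[3]
            local distance = math.sqrt(dx * dx + dy * dy + dz * dz)
            -- Nearby karts gain priority faster, so they win a slot more often.
            accumulated[kartId] = (accumulated[kartId] or 0) + 1 / (1 + distance)
            table.insert(candidates, kartId)
        end
    end

    -- Highest accumulated priority first; ids break ties deterministically.
    table.sort(candidates, function(a, b)
        if accumulated[a] ~= accumulated[b] then
            return accumulated[a] > accumulated[b]
        end
        return a < b
    end)

    local chosen = {}
    for i = 1, math.min(MAX_KARTS_PER_FRAME, #candidates) do
        local kartId = candidates[i]
        chosen[i] = kartId
        accumulated[kartId] = 0 -- sent this frame, so start accumulating again
    end
    return chosen
end
```

You would keep one `accumulated` table per client and call this once per network frame for each client. With this scoring, the 4 closest karts get updated almost every frame, and a distant kart only steals a slot once its built-up priority passes one of theirs.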
Oh by the way, a cool trick for anyone reading. If you have to represent some sort of state for a player, an object, et cetera, you might end up with a large table of booleans to represent the state. For example, with a character controller, you might have booleans like IsFalling or IsSitting or IsCollidingWall. For these cases, you should definitely make use of bit32 to pack the bools into one or a few characters. You can put 8 booleans into one character! I do believe remotes have some overhead to communicate what type of data each argument is, so having one giant string to describe everything you need to replicate should be leaner than having the data all separated. At least I saw fairly significant savings from that.
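Here is a minimal sketch of the packing trick, assuming a made-up `FLAGS` table that assigns each boolean a bit index from 0 to 7. The names `packState` and `unpackState` are just for illustration. It only uses `bit32.bor`, `bit32.lshift`, `bit32.btest`, `string.char` and `string.byte`.

```lua
-- Hypothetical flag layout: each boolean gets its own bit index (0..7).
local FLAGS = {
    IsFalling = 0,
    IsSitting = 1,
    IsCollidingWall = 2,
}

-- Packs up to 8 booleans from `state` into a single one-character string.
local function packState(state)
    local byte = 0
    for name, bit in pairs(FLAGS) do
        if state[name] then
            byte = bit32.bor(byte, bit32.lshift(1, bit))
        end
    end
    return string.char(byte)
end

-- Reverses packState: reads the character back into a table of booleans.
local function unpackState(packed)
    local byte = string.byte(packed, 1)
    local state = {}
    for name, bit in pairs(FLAGS) do
        state[name] = bit32.btest(byte, bit32.lshift(1, bit))
    end
    return state
end
```

For example, `packState({ IsFalling = true, IsCollidingWall = true })` sets bits 0 and 2, so it gives the single character with byte value 5. If you have more than 8 flags, you can do the same thing per group of 8 and concatenate the characters onto the rest of your string.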