Are there any DataStores/MemoryStoreService experts out there who can provide input on best practices for implementing a global matchmaking system? I do have a working system, but it’s hard to tell how reliable it is without being able to do large-scale tests, so I need some input on my current methods.
Parameters:
- A match can have a minimum of 10 players and a maximum of 24
- A match should start within a given period of time once reaching minimum requirement
- No ranked/skill based matchmaking yet
- No external services for now
- It does not use a leader based system
- Matches are found exclusively within a lobby
- I want to avoid pre-existing modules as their functionality is different to what I need
Above Explanation
Since I want to start light I want to avoid a matchmaking based on rank or skill level and keep it to just one singular map. With better understanding on how to make a matchmaking system I might expand on this in the future, but for now this is fine.
I also want to avoid external HTTP services for now as it seems like that dives into another layer of complexity that I might not be able to comprehend yet. I know there are pre-existing modules out there that can do this for me, but their functionality is not ideal for what I have in mind so I want to refrain from relying on them.
One key thing I want this matchmaking system to do is keep players in a lobby place until the queue throws them all into a match at the same time. This lets players easily back out of a queue and continue doing things in the lobby in case the queue takes a while. I noticed that some other games that use a matchmaking service teleport players into a match first, then wait for the server to fill up before closing it off to new users (I don’t know exactly how it works, it’s just based on observations). I want to avoid that if possible, but let me know if there is a reason doing it that way might be better.
I am not using a leader system because they are generally very difficult to implement. I know Roblox has a leader system module and I plan to somehow utilize it, but for now I want to try this method and see how it goes.
The service I used is MemoryStoreService, in tandem with its MemoryStoreSortedMap. I did not use MemoryStoreQueue for my current system, but I will get to why later.
The sequence in which I track players in a queue is as follows:
- When a player wants to find a match, the server first retrieves pre-existing queue groups using GetRangeAsync() with the full 200 range. If there are more than 200 groups in the queue, I cycle through the next pages of 200 and repeat until it finds a group that has room. If it cycles through all groups and they are all full, it creates a new one.
- When creating or joining a group, I call UpdateAsync() on the group ID returned by the cycle function. If no group ID is returned, I create a new group stored as follows:
    local queueGroup = {List = {player.UserId}}
- If given a group ID, I simply add to the List table using table.insert(queueGroup.List, player.UserId)
- This means that after UpdateAsync() returns, the stored entry will look something like this:
    [1] = {
        ["key"] = 1,
        ["value"] = {
            List = {
                [1] = 12345, -- Two players are waiting in group 1
                [2] = 67890
            },
            -- room for other information pertaining to this group
        }
    }
- When a player calls UpdateAsync() and the queueGroup reaches the minimum player threshold, within that same update function I add a new field called “Time”, which stores a tick() value within the queueGroup.
- Similarly, when a player calls UpdateAsync() and the queueGroup reaches the maximum player threshold, I store another field within that group called “ServerCode”, which stores the value returned by TeleportService:ReserveServer(placeId).
- A separate function runs every 5 seconds, calling GetRangeAsync() once (or more, depending on how many pages there are) and performing the following checks:
  - It compares a new tick() value with the stored one. If the difference is greater than a given wait time limit, it adds a ServerCode early and waits until the next step.
  - It checks whether the queueGroup has an existing ServerCode, and if it does, it uses that to teleport all the players within the group to a reserved server.
  - The group is then cleaned up and removed from the MemoryStoreSortedMap using RemoveAsync().
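To make the 5-second check easier to discuss, here is a rough sketch of the decision it makes for a single group. It is written as plain Lua with no Roblox calls so it can be tested in isolation. WAIT_LIMIT and nextAction are hypothetical names, the 30-second value is only a placeholder, and the actual ReserveServer, teleport, and RemoveAsync() calls are left to the caller.

```lua
-- Hypothetical wait limit in seconds; the real value is not given in this post.
local WAIT_LIMIT = 30

-- Decide what the periodic check should do with one queueGroup.
-- `group` is the stored value ({List = {...}, Time = ..., ServerCode = ...}).
-- `now` is the current time on the same clock that was used to set group.Time.
-- Returns "teleport", "reserve", or "wait".
local function nextAction(group, now)
	-- A ServerCode already exists: teleport everyone, then remove the group.
	if group.ServerCode ~= nil then
		return "teleport"
	end

	-- Minimum was reached earlier and the wait limit has passed:
	-- reserve a server early even though the group is not full.
	if group.Time ~= nil and now - group.Time > WAIT_LIMIT then
		return "reserve"
	end

	-- Below the minimum, or still inside the wait window.
	return "wait"
end
```

In this sketch a group that gets a ServerCode on one pass is teleported on the next pass, which is one reading of "waits until the next step" above.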
Leaving functionality is similar to joining, but instead of adding to the List it just removes from it. If the List is empty, it deletes the queueGroup entirely using RemoveAsync(). If the List falls below any given threshold, it removes the respective variables and continues to wait.
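The leave step can be sketched the same way, as a plain Lua function that could serve as the body of the UpdateAsync() callback. The names leaveGroup, MIN_PLAYERS, and MAX_PLAYERS are hypothetical. The threshold handling is one interpretation of the description above: ServerCode is cleared when a full group drops below the maximum, and both Time and ServerCode are cleared when the group drops below the minimum.

```lua
-- Limits taken from the parameters at the top of the post.
local MIN_PLAYERS = 10
local MAX_PLAYERS = 24

-- Remove userId from the group and clear fields tied to thresholds
-- the group no longer meets. Returns the updated group and a flag
-- telling the caller whether the group is now empty (so it can be
-- deleted with RemoveAsync()).
local function leaveGroup(group, userId)
	local before = #group.List

	for index, id in ipairs(group.List) do
		if id == userId then
			table.remove(group.List, index)
			break
		end
	end

	local after = #group.List

	-- The group was full and no longer is: the "full" ServerCode is stale.
	if before >= MAX_PLAYERS and after < MAX_PLAYERS then
		group.ServerCode = nil
	end

	-- Below the minimum: the countdown and any early ServerCode no longer apply.
	if after < MIN_PLAYERS then
		group.Time = nil
		group.ServerCode = nil
	end

	return group, after == 0
end
```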
Key feedback factors
- Is storing a “Time” variable within a queueGroup after the minimum threshold was reached a reliable way to check if a certain amount of time has passed for a specific group?
- Is storing a “ServerCode” variable within a queueGroup after the maximum threshold was reached a reliable way to retrieve teleport data that all players within a group can use?
  - If not, what are the edge cases I can expect to encounter? What can I do to improve reliability?
- Is calling GetRangeAsync() once every 5 seconds, and once every time someone tries to join or leave a queue, a bad practice? Will I encounter any bottlenecks or throttling by doing this?
- Will I encounter bad edge cases with a potentially larger player base? How reliable is the scaling?
- What are the potential issues I can run into if UpdateAsync() is called concurrently by many other players? Will this system run into issues with storing things like Time and ServerCode, or will the method handle this type of load well?
  - I do know that UpdateAsync() checks that the key it is trying to update has not changed, and if it has, it retrieves the new data and tries again. I just don’t know how reliable this is, especially on a large scale.
- When, if ever, should I double check that a queueGroup is in fact full when running into concurrent UpdateAsync() calls? Will there ever be a case where two users retrieve a group with 23 players (one short of the maximum) from GetRangeAsync() and both attempt to join it at the same time? If so, do I perform a check within the UpdateAsync() function again to make sure that it has room, and if not, attempt to join another group?
  - If so, would it cause issues with ServerCode?
  - I want to avoid a leader system for this because I heard that MessagingService can sometimes be unreliable when sending messages cross-server, so I figured retrieving it straight from the queueGroup might be a more reliable way of getting a teleport call. Let me know otherwise.
- If multiple servers detect the minimum player threshold and they all attempt to add a Time variable to their group, will that cause any issues? Similarly, if multiple servers detect that the Time difference is greater than the wait time limit and attempt to create a ServerCode, will that cause any issues?
  - My assumption is that it might ultimately settle on one ServerCode for all players, but not without running that code multiple times before settling. Essentially, what I am worried about here is server performance.
  - I know that a leader system would solve this particular problem, but making a system like that is quite a challenge. Any insights on that are appreciated.
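For the concurrent-join question, this is roughly what a capacity re-check inside the UpdateAsync() callback could look like. It is a sketch in plain Lua, not tested at scale. The function name joinGroup is hypothetical, and it assumes (as I understand the documentation) that returning nil from the callback cancels the write, so the caller can move on to another group.

```lua
-- Limits taken from the parameters at the top of the post.
local MIN_PLAYERS = 10
local MAX_PLAYERS = 24

-- Intended as the body of the UpdateAsync() callback.
-- `group` is the current stored value, or nil if the key does not exist yet.
-- Returns the new value, or nil when the group is already full.
local function joinGroup(group, userId, now)
	group = group or {List = {}}

	-- A repeated join must not add a duplicate entry.
	for _, id in ipairs(group.List) do
		if id == userId then
			return group
		end
	end

	-- Re-check capacity against the latest stored data, not the
	-- possibly stale snapshot that GetRangeAsync() returned.
	if #group.List >= MAX_PLAYERS then
		return nil
	end

	table.insert(group.List, userId)

	-- Stamp Time only once, when the minimum is first reached,
	-- so later joins do not reset the countdown.
	if #group.List >= MIN_PLAYERS and group.Time == nil then
		group.Time = now
	end

	return group
end
```

Because the check runs on whatever value the callback receives, two players racing for the last slot of a 23-player group would not both be added as long as the store retries the callback on conflict; the second one would see 24 and get nil.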
MemoryStoreQueue Explanation
The reason I did not use MemoryStoreQueue is that it has limited functionality. I had ideas on how to use it, but I ultimately decided to stick with MemoryStoreSortedMap since it seemed to work better when scaling.
What I was also considering, though, was using it to put players on a JoinQueue wait list, periodically checking whether anyone is on it, and then throwing them into a queueGroup using the above methods. This would drastically reduce UpdateAsync() calls and prevent most concurrent call issues. Obviously it would not eliminate them completely, but if I get a leader system to work properly I would be able to reduce it to one call every few seconds across the entire game, or at least one per server or group.
Similarly, to leave a queueGroup, players would be put into another MemoryStoreQueue, a LeaveList, and when the next cycle comes around to check for them, it uses that data to remove the players.
Let me know if that method might prove more reliable and robust, with and without a leader system. I only want to build a leader system if concurrent UpdateAsync() calls are actually a problem at scale.
This turned out to be a really long post, but if anyone can give me some insight into my current system (which works on a small scale at least), or suggest a new and better one, please let me know. I am open to improving and changing my system if necessary.