Parallel Scheduler - Parallel lua made easy and performant

Tomi1231 · August 18, 2023, 9:10pm

/!\ - Recently (no idea when precisely), a 3 thread limit was imposed on parallel lua on the client, for unknown reasons (servers need a high number of players per server to get more threads). The consequence is that the already rare use cases for parallel lua are even rarer now, and I would actually discourage the use of parallel lua unless you benchmark your code to measure the benefits

My goal with this module was to make a Worker module that encourages proper use of parallel lua while being easy to use and with minimal overhead

This is a 72x128 screen (rendered using 9216 raycasts) that I made using the module

Simple example on how to use it
Script

How to use the module

After requiring the module, use ParallelScheduler:LoadModule(ModuleScript), where ModuleScript is a Module Script that returns a function directly (this is needed to load the function in a different lua VM)
The module script is cloned and placed into a folder under ParallelScheduler, so using script inside it will NOT return where the module script originally was

-- Module Script's structure
return function()
	
end

The ParallelScheduler:LoadModule() method will return a table containing 5 methods:

ModuleTable:ScheduleWork(…)
This method is used to schedule a task to run when ModuleTable:Work() is called. Arguments passed to this method will be passed to the function from the module script when it is run.
Note: An additional argument, TaskIndex, is passed to the function (as the last argument) when it runs. That argument represents which task the function is running (task #1 is the first call of ModuleTable:ScheduleWork(…), task #2 is the second call, etc)

ModuleTable:Work()
This method runs all the scheduled tasks and yields until all the tasks are done. It will return an array with the results from every task, in the order they were scheduled. If the function returns values as a tuple, they will be packed into a table (with table.pack(…))
ModuleTable:Work() SHOULD NOT be called while the previous ModuleTable:Work() is still waiting for every task to be done (it is still allowed, because if the function errors, the previous ModuleTable:Work() will yield forever)

ModuleTable:SetMaxWorkers(number)
For performance reasons, the module will assign multiple tasks to the same worker (if MaxWorkers is smaller than the number of tasks)
This method changes how many workers the tasks will use (default 24 on the client and 48 on the server), and deletes workers above the new limit
You don’t really need to use this method, unless you want to modify MaxWorkers during runtime
If multiple modules have been loaded and are in use, it could be beneficial to reduce MaxWorkers as there would be more workers overall

ModuleTable:GetStatus()
This returns a table containing 4 values
ScheduledTasks - How many tasks have been scheduled with ModuleTable:ScheduleWork(…)
Workers - How many workers (aka actors) have been created
MaxWorkers - The current maximum number of workers
IsWorking - true if tasks are being run, false otherwise

ModuleTable:Destroy()
Deletes everything. Call this when you don’t need the function anymore
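
To make the ScheduleWork/Work contract described above concrete, here is a small serial stand-in written in plain Lua. It is only a sketch and not the module's code: NewSerialModuleTable is a hypothetical helper, nothing runs in parallel, and clearing the scheduled tasks after Work() is an assumption. It only mirrors the documented behavior: TaskIndex is appended as the last argument, results come back in scheduling order, and tuples are packed with table.pack.

```lua
-- Serial stand-in for the documented ModuleTable behavior (hypothetical helper,
-- NOT the module's implementation: no actors, no parallel phase).
local function NewSerialModuleTable(Function)
	local ModuleTable = {}
	local Scheduled = {} -- one packed argument list per scheduled task

	function ModuleTable:ScheduleWork(...)
		Scheduled[#Scheduled + 1] = table.pack(...)
	end

	function ModuleTable:Work()
		local Results = {}
		for TaskIndex, Args in ipairs(Scheduled) do
			-- TaskIndex is passed as the last argument, after the scheduled ones
			Args[Args.n + 1] = TaskIndex
			local Packed = table.pack(Function(table.unpack(Args, 1, Args.n + 1)))
			if Packed.n > 1 then
				Results[TaskIndex] = Packed -- a tuple stays packed in a table
			else
				Results[TaskIndex] = Packed[1] -- a single value is stored as-is
			end
		end
		Scheduled = {} -- assumption: scheduled tasks are cleared after running
		return Results
	end

	return ModuleTable
end

-- A function returning a tuple: each result is a packed table
local Squares = NewSerialModuleTable(function(Value, TaskIndex)
	return Value * Value, TaskIndex
end)
Squares:ScheduleWork(3)
Squares:ScheduleWork(4)
local TupleResults = Squares:Work()
print(TupleResults[1][1], TupleResults[2][1]) --> 9	16

-- A function returning one value: each result is that value
local Doubler = NewSerialModuleTable(function(Value)
	return Value * 2
end)
Doubler:ScheduleWork(3)
Doubler:ScheduleWork(4)
local SingleResults = Doubler:Work()
print(SingleResults[1], SingleResults[2]) --> 6	8
```

With the real module the calls look the same, except that the function lives in a ModuleScript passed to LoadModule and Work() yields while the actors run.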

How it works

I have tested many different approaches: using bindable events/bindable functions to send and receive info, the actor messaging API, and shared tables

I ended up using bindable events to fire a Work event and a Result event (start work, and work finished) and Shared Tables (after finding out that they exist…) for sharing information.

However, Shared Tables are much slower than normal tables, especially when you have nested Shared Tables (from some testing). So, to improve performance, the number of actors is limited and tasks are merged together so they can run on the same actor. All the parameters for 1 actor are merged into the same table, with the first index being used to tell which parameter belongs to which task.
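
As a rough illustration of the merging idea, here is one possible layout in plain Lua. The post does not show the module's actual layout, so this is an assumption: MergeTasks and SplitTasks are hypothetical names, index 1 holds the parameter count of each task, and the parameters follow flattened from index 2. It also assumes no parameter is nil.

```lua
-- Hypothetical layout (not taken from the module's source):
-- Merged[1] = { parameter count of task 1, of task 2, ... }
-- Merged[2..] = every parameter of every task, flattened in order
local function MergeTasks(Tasks)
	local Counts = {}
	local Merged = { Counts }
	for _, Args in ipairs(Tasks) do
		Counts[#Counts + 1] = #Args
		for _, Value in ipairs(Args) do
			Merged[#Merged + 1] = Value
		end
	end
	return Merged
end

-- Rebuilds the per-task argument lists on the worker side
local function SplitTasks(Merged)
	local Tasks = {}
	local Position = 2
	for TaskNumber, Count in ipairs(Merged[1]) do
		local Args = {}
		for Offset = 0, Count - 1 do
			Args[Offset + 1] = Merged[Position + Offset]
		end
		Tasks[TaskNumber] = Args
		Position = Position + Count
	end
	return Tasks
end

local Merged = MergeTasks({ { 1, 2 }, { "a" }, {} })
local Restored = SplitTasks(Merged)
print(#Merged, #Restored) --> 4	3
```

The point of a flat layout like this is that a worker reads one table with a single level of nesting, instead of one nested table per task.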

I set the actor limit to 24 on the client (per loaded module) (a multiple of 2, 4, 6, 8 and 12, so hopefully it works well with most CPUs) and 48 for the server (probably way too high. I have no idea how many cores are assigned to a Roblox server)
For some reason, the micro profiler shows only 8 workers when I have 12 threads

Once the Work event is fired, every actor runs the function with the provided parameters and inserts the results into a table holding the results of all the tasks that worker was assigned. That table is then put into another Shared Table.
A Shared Table is also used to count down the remaining tasks; when it gets to 0, the Result event is fired. The module then puts all the results in a neat little table that is then returned by ModuleTable:Work()

When using ParallelScheduler:LoadModule(ModuleScript), the ModuleScript is cloned and inserted in a folder under the module. It would work if it required the original module… Hopefully that will become useful for modifying objects when in parallel, by putting them as a child of the module script

Hopefully, Shared Tables could get some optimization and that would reduce some of the overhead with the module

It might be possible to optimize it further by merging parameters into 1 big table, though I think that will have diminishing returns

Performance Tips

Don’t bother with it if the work you are doing is not that heavy
For example, I was testing with a script that had 120 tasks creating 11 CFrames with CFrame.lookAt and doing 10 CFrame multiplications, and using the module was about the same as running it in serial
Reduce the number of tasks, and make the tasks do more work/combined work
While the module is made to combine tasks to reduce overhead, the overhead is not entirely gone. Having a couple hundred tasks should still be fine, but if you have a lot more than that, combine more work into 1 task
Send fewer parameters to the function
With the use of the TaskIndex argument, it is possible to avoid sending arguments. This can reduce the overhead when calling ModuleTable:Work(). It has a bigger impact when there are a lot of scheduled tasks
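
A minimal sketch of the TaskIndex tip, in plain Lua. It assumes a screen like the one above, with one task per row and 128 as the width (the post does not say which dimension is which). RenderRow is a hypothetical name for the function a ModuleScript would return, and the per-pixel work is a placeholder instead of a raycast.

```lua
local WIDTH, HEIGHT = 128, 72 -- assumption: 128 wide, 72 tall

-- The function receives no scheduled arguments, so TaskIndex arrives as its
-- only argument and is enough to work out which row to compute.
local function RenderRow(TaskIndex)
	local Row = TaskIndex -- task #1 handles row 1, task #2 row 2, etc.
	local Pixels = {}
	for Column = 1, WIDTH do
		-- Placeholder for the real per-pixel work (a raycast in the demo)
		Pixels[Column] = (Row - 1) * WIDTH + Column
	end
	return Pixels
end

-- With the module, this would be HEIGHT calls to ModuleTable:ScheduleWork()
-- with no arguments, followed by one ModuleTable:Work().
local LastRow = RenderRow(HEIGHT)
print(#LastRow, LastRow[WIDTH]) --> 128	9216
```

This way 72 tasks cover all 9216 pixels without writing any parameters for them.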

I left a lot of debug.profilebegin() and debug.profileend() calls commented out. You can uncomment them to take a look at the performance and the overhead of the module in the micro profiler

Some nice parallel utilization

Parallel Scheduler Model
Uncopylocked test place
(The test place contains the raycasting script in StarterGui, and the other CFrame test script in StarterPlayerScripts)

If you encounter any bug, post them as a reply and I will fix them
Hope this module is useful!

Tomi1231 · August 18, 2023, 10:02pm

Additional Notes:

The functions running in parallel are allowed to yield, just make sure to not call ModuleTable:Work() before the last one has returned the results. (after yielding, if it was from task.wait(), task.delay(), etc, it should continue in parallel, but you might want to make sure it does)
task.synchronize() and task.desynchronize() can be used but it’s not really recommended? If you do use them, don’t abuse them I guess

IISato · August 18, 2023, 10:12pm

This Module seems really cool. I would love to use it to truly create a Single Script environment, that allows for all the things that need to be run in parallel constantly, to start doing so from one script.

With that said, if you wouldn’t mind posting more examples of its use in the form of Code that would be sweet.

Tomi1231 · August 18, 2023, 11:13pm

LocalScript

ModuleScript

I did a quick and dirty job at converting a chairlift script to run in parallel
The code being run in parallel calculates the CFrame of the chair, and it is then used with BulkMoveTo()

Something that is neat is that the Results table can be used directly with BulkMoveTo

However, this is slower than running it in serial. I got around 0.72ms in serial compared to 0.94ms in parallel

The code being run in parallel is not heavy enough for the benefits to outweigh the overhead

A bit of time is spent scheduling the tasks, a decent amount is spent writing to the Shared Table and then firing the event
There seems to be some overhead during the execution of the function in parallel as that phase alone doesn’t seem faster than running it in serial
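
For comparisons like the serial vs parallel timings above, a small timing helper is usually enough. This is a generic sketch in plain Lua, not code from the module: AverageSeconds and SerialVersion are hypothetical names, and the workload is a stand-in.

```lua
-- Runs Function several times and returns the average time per run in seconds
local function AverageSeconds(Function, Iterations)
	local Start = os.clock()
	for _ = 1, Iterations do
		Function()
	end
	return (os.clock() - Start) / Iterations
end

-- Stand-in workload; replace it with the serial version of your own code
local function SerialVersion()
	local Sum = 0
	for i = 1, 1000 do
		Sum = Sum + math.sqrt(i)
	end
	return Sum
end

local SerialAverage = AverageSeconds(SerialVersion, 100)
print(string.format("serial: %.3f ms per run", SerialAverage * 1000))
```

To time the parallel version the same way, the timed function would schedule the tasks and call ModuleTable:Work(), so that the scheduling and Shared Table overhead is included in the measurement.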

Uses for parallel lua are still quite limited. It might become more useful if the properties of certain objects become write safe. Changing the CFrame of a part in parallel would benefit this particular script

There are the raycasting example and the other test I did on the uncopylocked place if you haven’t seen them

Deb0rahAliWilliams · August 18, 2023, 11:20pm

That seems really interesting to use, can’t wait to see what comes out of using it!

IISato · August 18, 2023, 11:32pm

Is it okay to Yield in a runservice event? I always avoid doing that with the assumption it would be bad otherwise.

You can have multiple Schedulers loaded returning ModuleTables, running work at the same time correct? And the Scheduler will divide up the work between Actors for various different modules accordingly?

Also is calling the ModuleTable:Work() Method whilst it’s already running something that you think can happen easily and needs to be checked for or nah?

Tomi1231 · August 18, 2023, 11:50pm

The yield happens for a fraction of a frame (if the function from the module doesn’t yield)
I don’t think yielding in a RunService event does anything? The connected function runs whenever RunService.Heartbeat is fired

Yes, however, every loaded module has its own set of actors and will not share them with other loaded modules. If you call ModuleTable:Work() at the same time (like if you use Heartbeat) for different ModuleTables, the tasks should run in the same parallel phase

ModuleTable:Work() yields, so as long as you don’t avoid the yield, that won’t happen. Otherwise, if the function from the module doesn’t yield, the work will be completed by the next serial phase, so it is very unlikely that you run into such an issue.
It could happen if your code errors, but at that point, you won’t receive the results ever so… yeah. If it does happen, there will be a warning in the output

IISato · August 19, 2023, 2:26pm

I really appreciate how responsive you’re being. Thank you.

IISato · August 19, 2023, 2:28pm

How many actors per module? The total of 24 or something? Two different Modules running work would run on the Same parallel phase, but on different threads if they have their own actors…? Right.

Tomi1231 · August 19, 2023, 4:31pm

yeah, the default is 24, in the module there’s a DEFAULT_MAX_WORKERS variable at the beginning of the module, currently set to 24 for the client and 48 for the server

They do run on their own threads, but then roblox distributes those threads onto your cpu threads (so like let’s say you have 24 actors doing work, on a 4 thread cpu, roblox will distribute the work of those actors to the 4 cpu threads). This is why lowering that value can be beneficial if you have multiple modules doing work in parallel, to reduce overhead

I don’t know exactly how the Roblox serial and parallel system works, but there can be multiple parallel phases/serial phases during a frame. I would guess that after a serial phase, it starts a parallel phase with whatever was told to run in parallel, then serial, back to parallel if needed…

welololol · May 6, 2024, 3:11pm

I am 90% sure this is due to this update, Deferred Engine Events: Rollout Update, but now if you schedule a bunch of work using :ScheduleWork and then immediately use :Work(), this module breaks. I think it is because the actors are created on the fly, but their :ConnectParallel doesn’t work until the frame after, and thus the first :ScheduleWork will break since the event is just nonexistent. Every :ScheduleWork after the first broken one works, however, since it’s just reusing the Actors from the previous attempt, which have existed for longer than a frame and do have their :ConnectParallel set. If you put a task.wait() in between, it also fixes itself.

I edited the module script myself to make it generate all the actors as soon as it is required() (if you require and instantly schedulework it might still break, but in my use case that never happens, so I don’t care)

welololol · May 6, 2024, 3:50pm

This is one of the weirdest bugs I’ve ever seen. Every four times you use this function, it yields. No matter if you unroll the loop or not. Really strange. It has something to do with ResultEvent getting its :Wait() interrupted one frame late, but why I have no idea

Edit: if you yield the script yourself using something like task.wait(), the script no longer yields, suggesting that there is a limit to the number of times you can wait on the bindable (in a single frame) before it forces the script to yield. Very strange indeed. Maybe someone could get around this by using multiple bindables, however for me this doesn’t matter cause I’m not gonna be using this more than once a frame, it just so happens that when I was testing performance, I was calling the parallel scheduler multiple times per frame, causing this chain of events (pun unintended).

Tomi1231 · May 7, 2024, 1:16am

That is definitely odd, I’ll have to look into it

Tomi1231 · May 8, 2024, 10:36pm

I GOT A FIX :D

I really did not expect the fix to be this easy, the scripts not initializing before :Work() was one issue, but then, for some reason, the shared table with the results was empty (something I didn’t get to investigate before I found the easy fix)

It was indeed due to Deferred mode… Um, I think deferred mode also makes the initialization of scripts deferred, meaning the scripts don’t initialize in time, but only after the thread (that called :ScheduleWork()) finishes

As you have said. However, it breaks because the Work event is fired before the scripts connect to the event with :ConnectParallel(). They connect way before the next frame, and actually start working right after, so using task.defer(coroutine.running()) fixes it, no need to wait a full frame (though, then the shared table for the results is empty…)

So the fix

Deferring WorkEvent:Fire(), like this

(This is the equivalent of task.defer(function() WorkEvent:Fire() end))
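
The ordering problem can be shown with a toy model in plain Lua. This is not Roblox's scheduler and not the module's code: NewToyRuntime and its Connect, Fire, Defer and Flush functions are hypothetical stand-ins. The model assumes that script initialization is queued and runs after the current thread finishes, and that a deferred call runs after work that was queued before it.

```lua
-- Toy model: an event only reaches listeners that are connected when it fires,
-- and deferred callbacks run later, in the order they were queued.
local function NewToyRuntime()
	local Listeners, Queue = {}, {}
	local Runtime = {}

	function Runtime.Connect(Callback)
		Listeners[#Listeners + 1] = Callback
	end

	function Runtime.Fire()
		for _, Callback in ipairs(Listeners) do
			Callback()
		end
	end

	function Runtime.Defer(Callback)
		Queue[#Queue + 1] = Callback
	end

	-- Runs everything that was deferred, including work queued while flushing
	function Runtime.Flush()
		local Index = 1
		while Index <= #Queue do
			Queue[Index]()
			Index = Index + 1
		end
		Queue = {}
	end

	return Runtime
end

local function Run(DeferTheFire)
	local Runtime = NewToyRuntime()
	local WorkStarted = 0

	-- The worker script's initialization (its connection) is itself deferred
	Runtime.Defer(function()
		Runtime.Connect(function()
			WorkStarted = WorkStarted + 1
		end)
	end)

	if DeferTheFire then
		Runtime.Defer(Runtime.Fire) -- fires after the worker has connected
	else
		Runtime.Fire() -- fires immediately: nobody is listening yet
	end

	Runtime.Flush()
	return WorkStarted
end

print(Run(false), Run(true)) --> 0	1
```

In the toy model the immediate fire is lost because the listener connects later, while the deferred fire lands after the connection, which matches the behavior described above.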

This seems to fix everything, and the results are in the shared table. (I thought maybe the issue for that was that I put task.defer(coroutine.running()) before the place where the shared table is cleared, but putting it after doesn’t work either…)

Thank you for pointing this out. I’m somewhat surprised that I didn’t notice this before, seems like none of the games I’ve used this in are in deferred mode

The test place and the model have been updated

I’m too lazy to test that, but can you tell me if you are still running into the weird bug you are having? (it yielding every four times)
It might have to do with task.synchronize and task.desynchronize only being able to be used 4 times per frame, idk. I thought there was no limit for that

smashman65 · May 17, 2024, 3:43am

Hello, great module you’ve made!

I’m curious if you could give an example of you using it on the server as opposed to just the client

I’m not quite proficient enough with lua to interpret this as easily as I’d like to, and I’m currently trying to implement parallel lua into my game, which has a pretty heavy physics load. I’d like to transfer this load to multiple threads/workers for better performance but am at a roadblock in the implementation.

Thanks in advance!

Tomi1231 · May 17, 2024, 6:30pm

It works the same way on the server
What is the “physics load” in question? If you are talking about the roblox physics engine, you cannot make that parallel, roblox would have to do it themselves by adding a :SetThreadOwnership(), kind of like :SetNetworkOwnership()

Anyway, here is an example where I use it on the server, to deserialize a table that can get pretty big

local GameDatabase = Games[Task.TaskIndex]

if not GameDatabase then 
	local Data = DatastoreFunctions:GetGamesData(Task.TaskIndex) or tostring(GamesSerializer:GetLastestVersion())
	Data = string.split(Data,"_")

	local _version = tonumber(table.remove(Data,1))
	local DataLenght = GamesSerializer:GetDataLenghtFromVersion(_version)

	local GamesAmount = table.maxn(Data)/DataLenght
	local Index = 0

	for i = 0, Settings.ServerThreads -1 do 
		local Tasks = math.floor(GamesAmount/(Settings.ServerThreads-i))
		GamesAmount -= Tasks

		local a = Index + 1
		local b = Index + Tasks*DataLenght

		Index += Tasks*DataLenght

		SerDeserModules.Games.Deser:ScheduleWork(table.concat(Data,"_",a,b),_version)
	end

	local FinalData = {}

	Data = SerDeserModules.Games.Deser:Work()
	for i, v in ipairs(Data) do
		table.move(v,1,table.maxn(v),table.maxn(FinalData)+1,FinalData)
	end
	Data = nil -- TODO -- what the hell is happening here, why is this needed

	local DataSize = table.maxn(FinalData)

	Games[Task.TaskIndex] = {
		Data = FinalData,
		DataSize = DataSize,
		Index = math.random(1,math.max(DataSize,1)),
	} 
	GameDatabase = Games[Task.TaskIndex]
end

Here is the Deser module

local Settings = require(game.ReplicatedStorage.Settings)
local TasksPerThreads = Settings.SponsorsPerDatastore/Settings.ServerThreads
if TasksPerThreads ~= math.round(TasksPerThreads) then error("Invalid settings, cannot spread tasks evenly") end

local Serializer = require(game.ServerScriptService.ServerTasks.TaskScript.Sponsors.SponsorsSerializer)

return function(String, _version, TaskIndex : number)
	return Serializer:DecompressArray(String, _version)
end

The DecompressArray method is for decompressing multiple elements at once. I wrote my code to specifically schedule work once for every thread available (aka the number of actors, which is set in the settings for the module), instead of calling ScheduleWork for every element separately and having the ParallelScheduler merge them. It’s better performance-wise to do it like this. It does complicate things a bit though

It is much simpler when not merging tasks together, though if you are getting into the territory of maybe 300-500 smallish tasks or more, you should merge them
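
The manual merging in the server example comes down to splitting a flat array into one contiguous range per thread. Here is that step isolated in plain Lua, as a sketch: SplitRanges is a hypothetical name, and the numbers in the usage are made up for illustration.

```lua
-- Splits ItemCount items, each ItemLength fields long, into one contiguous
-- range of field indices per thread. Dividing what remains by the threads
-- that are left spreads any remainder instead of dumping it on the last one.
local function SplitRanges(ItemCount, ItemLength, Threads)
	local Ranges = {}
	local Remaining = ItemCount
	local Index = 0
	for i = 0, Threads - 1 do
		local Items = math.floor(Remaining / (Threads - i))
		Remaining = Remaining - Items
		Ranges[i + 1] = { Index + 1, Index + Items * ItemLength }
		Index = Index + Items * ItemLength
	end
	return Ranges
end

-- 10 items of 3 fields each, spread over 4 threads
local Ranges = SplitRanges(10, 3, 4)
for Thread, Range in ipairs(Ranges) do
	print(Thread, Range[1], Range[2])
end
--> 1	1	6
--> 2	7	12
--> 3	13	21
--> 4	22	30
```

Each range can then be turned into a single argument (table.concat(Data, "_", a, b) in the example above) and scheduled as one task.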

smashman65 · May 21, 2024, 5:17am

Much appreciated!

Unfortunately, looking deeper into my Micro Profiler, most of my lag is from default roblox physics and only about 4-6ms is from my computations.
I’ve confirmed the Micro Profiler Physics report looks exactly the same in other physics based games

Any idea why roblox is such an un-performant platform?
Looking at all other platforms, doing something as simple as what I’m doing would never cause any lag, but on roblox everything screams at the mere sight of physics and parts needing to interact with the world around them.

Tomi1231 · May 21, 2024, 3:13pm

It seems like most of it is coming from aerodynamics, try disabling that to see if things improve. I don’t know why roblox is so unperformant, even reading and especially modifying properties of parts is quite slow. It can’t be the C/lua boundary for physics as physics are fully written in C++ I’m pretty sure

foilplays · June 1, 2024, 3:14pm

When I use LoadModule, it requires two parameters, self and then the modulescript? What does this mean? Fyi, I’m calling the scheduler inside a module script to call another script that uses runservice heartbeat. I’m inexperienced with parallel luau, sorry! (ps, this is in serverscriptservices)

Tomi1231 · June 1, 2024, 6:43pm

self is part of the syntactic sugar lua provides with the : notation

local Table = {}

function Table:Method()
	print(self) --> The contents of table will be printed
end

Table:Method()

self is a hidden first argument in this case. We can see that by using . notation instead

local Table = {}

function Table.Method(self)
	print(self)
end

Table:Method() -- Table is passed implicitly when using : notation, it's the same as doing Table.Method(Table)
Table.Method(Table)

To use LoadModule, call it with the : notation and pass only the module script (here, assuming the module script is a child of script). With the : notation, you don’t have to pass self. You can rely on the autocomplete to figure out what you have to pass to the functions. Every function uses the : notation

To make a function run with the Parallel Scheduler, you need to have a module (the one you pass to LoadModule) return a function, like shown in this figure (at the bottom, where it says Module Script)
The script calling the ParallelScheduler can be a LocalScript, or a ServerScript, doesn’t matter
Script