Parallel Luau [Version 2 Release]

I just started learning Parallel Luau, and it definitely improved the performance of my game. I have a few questions, but they’re probably answered in the documentation. Regarding SharedTable, I think this would be a nice addition to the interface:

SharedTable.new().Property = true -- Original
SharedTable { -- New
    Property = true
}

Both would behave the same, but the second form lets you define a table more concisely. Having both at the same time would be fine!

1 Like

With the SharedTable thing, I think this would be a nice addition to the interface…

SharedTable.new() takes an optional table argument, so you can say, for example,

local st = SharedTable.new({PropertyA=true, PropertyB=false})
2 Likes

It would be lovely to be able to use WorldRoot:BulkMoveTo(...) with SharedTables.
My use case here (and I’m not sure if this is bad practice) would be having a task module that computes, in parallel, a CFrame for every point inside a 2D grid and then returns those CFrames to a script that moves the points.

local SharedTableRegistry = game:GetService('SharedTableRegistry')
local Task = require(path.to.module)
local Points = { ... } -- This would be a list of points in that grid

Task.Run() -- This would run some computation that in the end outputs a result in SharedTableRegistry
local Results = SharedTableRegistry:GetSharedTable('Results')


-- workspace:BulkMoveTo(Points, Results)
-- This would be beautiful, but unfortunately we have to do:

local ResultsTable = {}
for i, v in Results do
	ResultsTable[i] = v
end
workspace:BulkMoveTo(Points, ResultsTable)

You can minimize the data-exchange overhead by implementing the point-moving logic in the actor scripts themselves. That way you don’t have to constantly shuffle data around, and you get to use it immediately.

1 Like

Excuse my ignorance on the topic, but assume a scenario where I want to move X parts and have Y actors. Let’s also assume that the ratio of actors to cores is 1, which means all actors can run in parallel with one another.

This means each actor would be responsible for the computation of X/Y CFrames, and it can do this in parallel.

After completion, we will be left with X CFrames and will have to synchronize in order to move the parts. Since each actor has X/Y parts to move, adding actors lowers the number of parts each actor moves, but in the end that does not increase performance: you are still moving X parts in total, and with the data distributed like that, you can’t use the bulk-move functions.

Isn’t there a point where X is so big that moving the parts individually is slower than collecting the data and using WorldRoot:BulkMoveTo(...)? And if so, what is that point?

2 Likes

I’m talking more about the costs of transferring data, like how you have to copy the data from the SharedTable into a regular table. Bandwidth is a real issue to consider; there will be a point where it becomes a bottleneck that outweighs the cost of CFraming parts. My original suggestion assumes the data can all be processed independently, i.e. you wouldn’t need to access any data under another actor’s authority. Otherwise a synchronization would be needed anyway, and you can ignore everything I said.

To actually answer your question, though: it depends, and you need to do some testing yourself. Keep in mind that there are already costs to building a massive array of parts and CFrames for :BulkMoveTo, as I mentioned above.

Also, who said you couldn’t run :BulkMoveTo individually in each actor script after it re-enters serial execution?
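That per-actor approach could look roughly like this (a sketch with illustrative names, not anyone’s actual code; `points` is assumed to be the slice of parts a given actor owns):

```lua
-- Inside each actor script (illustrative sketch)
local actor = script:GetActor()

actor:BindToMessageParallel("ComputeAndMove", function(points)
	-- Parallel phase: compute a CFrame for each point this actor owns
	local cframes = table.create(#points)
	for i, point in points do
		cframes[i] = CFrame.new(point.Position) -- placeholder computation
	end

	-- Re-enter serial execution before touching the DataModel
	task.synchronize()

	-- Each actor bulk-moves only its own slice of parts
	workspace:BulkMoveTo(points, cframes)
end)
```

Because each actor only ever touches its own parts and CFrames, no SharedTable round-trip is needed at all.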

1 Like

Actually, after some testing, calling :BulkMoveTo in the actor scripts does seem more performant than I originally thought. My guess was that, since each actor script has far fewer parts to move, :BulkMoveTo wouldn’t speed things up substantially.
But anyway, you’re right; I’ll have to do the benchmarks myself to figure out what works best in my case. Thank you for all the new knowledge :grinning:

We have a ticket to investigate this request. Hopefully in a future update we will add support for TweenService:GetValue. Thank you for the suggestion!

2 Likes

Hi @colbert2677, it looks like Bone.TransformedCFrame is already safe to read in parallel:

Is it possible you actually meant that you need Bone.TransformedWorldCFrame to be readable in parallel?

Oh, that’s my mistake! That would actually be correct, yes. Our team uses a bit of both properties in our experience and I picked up the wrong one from our bug report involving bone CFrames.

Hi @colbert2677, I just wanted to follow up that our physics team will be looking into whether TransformedWorldCFrame can be made parallel-read safe. Their initial impression is that it likely can be, with some work, but they need to do a bit of investigation before they can be certain.

Thanks for raising this API as important for your use of Parallel Luau.

4 Likes

Hi @rickje139,

This is a good question. If you had some “workers” filling the table at one parallel resumption point, then you would be guaranteed that all the workers complete before the next resumption point. So you could safely use the data in a future resumption point. I mention this just so anyone reading my reply doesn’t assume a more complicated solution is always required.

However, you are asking specifically about running a function that depends on the data in the same parallel step. In that situation you need some way to either signal that the work is complete, or have the function that requires the data do some sort of polling (i.e. “yielding/deferring” until the work is complete). I think the best option is likely to track how much work is completed and then notify the function when all the work is completed.

Here is an example script that takes this approach. It uses Actor messaging to send a message so that the function that requires the data can execute once all of the results have been written to the table. It also uses SharedTable.increment to safely increment a counter from multiple actors.

-- This script assumes it has not been parented to an Actor (it automatically parents clones of itself to Actors)
local numWorkers = 8
local actor = script:GetActor()
if actor then
	
	actor:BindToMessageParallel("DoWork", function(workIndex, resultTable, actorToMessageOnCompletion)
		-- Do some work to compute a "result"
		-- For this very simple example a string is generated.  In theory any data that could be stored in a
		-- shared table is possible.
		local result = "result:" .. workIndex * workIndex	
		resultTable[workIndex] = result

		-- Increment 'numResults' to track how much work has been completed.
		-- SharedTable.increment returns the value before the increment, so add 1 to get the new count.
		local resultCount = SharedTable.increment(resultTable, "numResults", 1) + 1
		-- If all work has been completed, then have the last worker signal 'actorToMessageOnCompletion'
		-- indicating the work is complete.
		if resultCount == numWorkers then
			-- Send a message that all work is complete
			actorToMessageOnCompletion:SendMessage("ProcessingComplete", resultTable)
		end	
	end)
	
	actor:BindToMessageParallel("ProcessingComplete", function(resultTable)
		-- When this callback is called, all of the workers will have completed their work
		print("Received table with results from parallel actors")
		assert(resultTable.numResults == numWorkers)
		print(resultTable)
	end)

else
	-- This is the codepath for the "main" script.  It will perform initialization and create workers 
	
	-- Create a shared table with a counter for the amount of work completed initially set to 0.	
	local resultTable = SharedTable.new({numResults = 0})
		
	-- Create child actors to do the work
	local workerActors = {}
	for i = 1,numWorkers do
		local workerActor = Instance.new("Actor")
		workerActor.Parent = workspace
		local cloneScript = script:Clone()
		cloneScript.Parent = workerActor
		table.insert(workerActors, workerActor)
	end
	
	-- Send a message to all workers so they begin working.  We select the first child (arbitrarily) as the one that will
	-- be sent a message when all the work is complete.  Some other actor, even one that is not a worker, could be used instead.
	for index, childActor in workerActors do
		childActor:SendMessage("DoWork", index, resultTable, workerActors[1])
	end		
end

To make it easy to test this for yourself, I put all the code in a single script. But in practice it would probably be cleaner to split this into at least two separate scripts.

Also, although I used Actor messaging to notify the function that all the work is complete, other mechanisms could be used (e.g. a BindableEvent). Please consider this as an example and not the only possible solution to this problem.
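For completeness, the BindableEvent variant might be sketched like this (hypothetical names; the key constraint is that BindableEvent:Fire is not parallel-safe, so the worker must re-enter serial execution first):

```lua
-- Sketch only: signaling completion with a BindableEvent instead of Actor messaging
local doneEvent = Instance.new("BindableEvent")

doneEvent.Event:Connect(function(resultTable)
	-- Runs once some worker reports that all work is finished
	print("All workers finished")
end)

-- In whichever worker finishes last, after its parallel work:
-- task.synchronize()          -- Fire must happen in serial execution
-- doneEvent:Fire(resultTable)
```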

7 Likes

If SharedTables automatically fetched only newly added or newly changed data and cached the rest, it could reduce the performance impact a lot and make SharedTables perform much better when reading from them. Because of this, I noticed BindableEvents work a lot better for my situation, since they don’t require constantly re-reading every index in the table.

I will definitely make sure to use SharedTables for quick data that won’t be read as often or doesn’t involve much data, or for single increments, because they’re really helpful in those situations.

Thank you for the explanation!

1 Like

We need the ability to query the number of threads/cores available on a system so we can avoid relying on Roblox’s parallel scheduler for task counts that could reach 1,000-10,000. It is much more efficient to create, say, 8 parallel workers that each handle 1/8th of the work than to lean on Roblox’s scheduler when the work pool is that large. Currently we have 8 of our own “schedulers” that each handle 1/8th of the work, but this is inefficient when the core count is lower or higher than 8.
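The fixed-worker-pool pattern described above can be sketched like this; `numWorkers` is hard-coded at 8 precisely because there is currently no API to query the core count (all names here are illustrative):

```lua
local numWorkers = 8       -- ideally this would come from a core-count API
local totalTasks = 10000   -- e.g. 10,000 units of work

-- Split the task indices into numWorkers contiguous ranges
local chunkSize = math.ceil(totalTasks / numWorkers)
for w = 1, numWorkers do
	local firstIndex = (w - 1) * chunkSize + 1
	local lastIndex = math.min(w * chunkSize, totalTasks)
	-- Each worker actor processes the range [firstIndex, lastIndex]:
	-- workerActors[w]:SendMessage("DoRange", firstIndex, lastIndex)
end
```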

4 Likes

This. Even though the whole idea of partitioning tasks into actors is to encourage granularity, there is a point where the cost of parallel-serial data exchange outweighs the benefit of having many actors to distribute the workload more evenly. There needs to be a way to determine the exact number of cores/threads in the underlying hardware so we know exactly how many actors to create.

If you have 8 actors perfectly balanced by your own distribution algorithm on a 6-thread CPU, two actors are left out; they hold back everyone else and double the processing time of each parallel resumption. If you spread the work across 40 individual actors instead, 4 actors are left out, which still creates overhead but has a smaller overall impact because each actor is doing less. However, 40 individual actors are far more complicated and expensive to manage, and possibly to synchronize.

Currently the only way around this is to ask the player to manually input the number of cores their PC has, but what if they don’t know either?

6 Likes

What about scripts that are parents of Actors? The descendant model is atrocious for usability and inconsistent with the rest of the engine. Meanwhile, the unintended behavior of desynchronize was actually useful and well-defined, even if unintended, and had meaningful benefits over task.wait()

1 Like

The addition of SharedTables is great! I just found out yesterday that they existed, and they actually solve an issue I had (avoiding firing a ton of remote events).
It would be great if more of the functions that normal tables have, like table.pack, table.unpack, and table.sort, were added to SharedTables. Cool to see that SharedTables have their own unique functions, though. It would also be nice if we could easily create normal tables that are clones of SharedTables without needing to write a for loop
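Until something built-in exists, the conversion can at least be wrapped in a small helper (a sketch; this copies one level only, so any nested SharedTables stay shared):

```lua
-- Shallow-copies a SharedTable into a regular Luau table
local function toRegularTable(sharedTable)
	local result = {}
	for key, value in sharedTable do
		result[key] = value
	end
	return result
end
```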

I am unable to use a for loop on any SharedTable

Just loop directly over the table, omit the ipairs. Even for normal tables the ipairs is no longer needed in contemporary Luau code (outside of some very rare niche scenarios).
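For reference, generalized iteration over a regular table looks like this:

```lua
local t = {10, 20, 30}

-- Generalized iteration: no ipairs/pairs needed in modern Luau
for index, value in t do
	print(index, value)
end
```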

Good to know, I guess

Doesn’t change anything with the issue though