ComputeLua - Make Parallel Luau easier using buffers!

Latest Stable Release: 1.3.0 (9/24/2024)
ComputeLua Module
GitHub
Documentation
API

External resources used

Promise


Why ComputeLua?

With ComputeLua you can easily dispatch a group of workers to run a function in parallel, have them work on shared data, and return the results to the main thread.

ComputeLua is great for large computations that need to happen quickly, such as:

  • Terrain generation
  • Editing the vertices of an EditableMesh
  • Simulation like waves

ComputeLua is fully and statically typed, so autocompletion works and errors will be thrown if incorrect types are passed in.


Limits

Only the following data types can be sent over to workers. This is due to the limitations of SharedTables, but these types should cover everything you need.

  • Vector2
  • Vector3
  • CFrame
  • Color3
  • UDim
  • UDim2
  • number
  • boolean
  • string

No, you cannot have functions in these buffers.

To make everything run faster, the keys of a ComputeBuffer must be numbers. If the keys are all numbers, a worker can easily find the entry it is currently working on within the large table by using its dispatch ID.
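
For illustration, here is a minimal sketch of what that layout looks like (the buffer name, size, and contents are made up):

-- Numeric keys only: entry i is the data that the worker given dispatch ID i will process
local heightData = table.create(1000, 0) -- indices 1 through 1000, all starting at 0

-- Inside a worker, the dispatch ID doubles as the index into this table,
-- so no searching or string-key lookups are needed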


How do I use this?

A detailed description can be found on the documentation site
Getting Started

Any more information you may need can be found on the API site
API
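
If you just want a quick taste before heading to the docs, here is a minimal dispatcher-side sketch adapted from the 1.3.0 benchmark example near the end of this thread (the buffer name, thread name, worker count, and data are illustrative; the matching worker script is shown in the 1.3.0 release notes below):

local ReplicatedStorage = game:GetService("ReplicatedStorage")
local ComputeLua = require(ReplicatedStorage.ComputeLua)

-- 64 workers cloned from a worker template script parented to this script
local Dispatcher = ComputeLua.CreateDispatcher(64, script.Worker)

-- A numerically-indexed buffer: 1000 entries, all set to 9
Dispatcher:SetComputeBuffer("numbers", table.create(1000, 9))

-- Run the "SquareRoot" thread once per buffer entry; :await() yields until every worker has finished
local _, data = Dispatcher:Dispatch("SquareRoot", 1000):await()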


How would you rate ComputeLua?

  • Amazing
  • Great
  • Good
  • Bad
  • Very Bad


31 Likes

ComputeLua - Stable Release 1.1.0 (4/24/2024)

  • Created a GitHub repository with documentation and an API page
  • Added ComputeLua.CreateThread(actor: Actor, threadName: string, callback: (number, VariableBufferDataType) -> ())
  • Changed Dispatcher:Dispatch() to return a Promise.defer()
  • Variable Buffer is now a SharedTable
2 Likes

ComputeLua - Patch 1.1.1 (4/26/2024)

  • Performance increase
1 Like

ComputeLua - Stable Release 1.2.0 (4/29/2024)

  • Added the ability to edit the batch size
  • Fixed batch size not working correctly
  • Fixed batches breaking when there were fewer batches than threads
  • Fixed typo in documentation
1 Like

I don’t quite understand how to use this

ComputeLua.CreateThread(actor, "CalculatePositions", function(id, variableBuffer)
	local value = variableBuffer[1] -- Get the first variable within the Variable Buffer
end)

I would assume this code would go inside the worker script, right? If so, how is the data buffer accessed?
edit: Oh, I'm sorry, I am quite blind

Also, I have a question: does this split the work between all the workers, or do I have to do that myself?
For example, given 100 points of data and 4 workers, each worker would have to calculate 25 data points.

2 Likes

Are you asking: “is this multi-threaded?”

1 Like

No, I understand that this is multi-threaded. What I am asking is: does it split the work across the workers automatically, or do I have to do that manually myself?

1 Like

Yes, that CreateThread function is placed inside a worker script. It is a separate script, placed anywhere the script that dispatches the workers can access and clone it.

You can get a ComputeBuffer's data by running ComputeLua.GetComputeBufferData(bufferName: string) inside a worker; more detailed information can be found in the documentation (Worker | Getting Compute Buffer data).
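
For example, a worker's thread function might look something like this (a sketch using the API described above; the buffer and thread names are made up, and it assumes the module lives in ReplicatedStorage as in the later examples in this thread):

local ReplicatedStorage = game:GetService("ReplicatedStorage")
local ComputeLua = require(ReplicatedStorage.ComputeLua)

local actor = script:GetActor()
if actor == nil then
	return
end

ComputeLua.CreateThread(actor, "CalculatePositions", function(id, variableBuffer)
	-- Read the "positions" ComputeBuffer; the dispatch ID doubles as the index
	local positions = ComputeLua.GetComputeBufferData("positions")
	local myPosition = positions[id]
	-- do the work for this entry here
end)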

When you dispatch a group of workers, ComputeLua will automatically divide the work among them. The most important value each worker is given is the dispatchId, or in your case, the id in the function you passed to CreateThread. You can use this id to find which entry of the buffer the worker is currently editing. Of course, you can always access any other entry of the buffer by changing the index. You can see an example of this in the documentation (Worker | Example).

The only thing ComputeLua doesn't do for you is put the ComputeBuffers' data into a format that can be used correctly. If that becomes a problem, I can create some helper functions to help put the data together.

The latest release on GitHub includes a complete example of the module as a .rbxl file; if you need help, I suggest going there: Releases · blorbee1/ComputeLua · GitHub

Just go to the Dispatcher script in ServerScriptService and check out the format of the data it puts into a ComputeBuffer (no explicit keys; each element stands on its own and is not a nested table).


To be more specific to your question: the script will not split the work up like that right now. You can always use the method of giving a stride value. A stride value is how many indices of a list one element takes up. For you, that would be a stride of 25, and you would use the dispatchId to figure out the start index of the element and the end index (25 + start index).

Here’s an example

-- In this example, there are 3 points as Vector2s
-- (of course you could use Vector2.new(), but to show the stride, I will do this)
local points = {
   100, 300, 
   50, 10, 
   60, 10
}

-- Each point is 2 numbers, so 1 element takes up 2 indices, therefore a stride of 2
local stride = 2

-- Imagine this is the worker function
local function worker(dispatchId)
   -- Lua table indices start at 1, and so does the dispatchId,
   -- so subtract 1 before multiplying by the stride
   local pointX = points[((dispatchId - 1) * stride) + 1]
   local pointY = points[((dispatchId - 1) * stride) + 2]
   -- At dispatchId = 1, the index into "points" is 1
   -- ((1 - 1) * 2) + 1 = 1
   -- At dispatchId = 2, the index is 3
   -- and so on
end

Of course, you could always put each element into its own table (in your case, 25 points in one table), since nested tables are allowed to be sent, but that isn't encouraged. From my testing, nested tables slow some workers down.


In short, ComputeLua will not automatically split the data you put into the Compute Buffers to match the number of workers. Each worker handles one dispatch ID at a time and is then called again with a new ID, so you will have to manage multiple entries being changed by one worker call if your dispatch amount is smaller than your data size.

But ComputeLua works best with large amounts of data. For example, I used it to calculate noise values on a sphere mesh with around 10k vertices, so using it sped up the calculation by a lot (1 worker per vertex).

If you have any more questions, just ask.

1 Like

Place1.rbxl (112.2 KB)
Could you please review my implementation? I am trying to get a mesh-deformed ocean simulation working, but it crashes. The script is located in StarterPlayer.StarterPlayerScripts. The code is quite disgusting, I'm sorry.

You are trying to dispatch 4 workers for around 4000 data entries. Since you didn't add a stride system like the one I described in my last post, the workers just stop after the 4th entry, everything else doesn't get updated, and it then throws an error saying Vector3 instead of CFrame.

The crashing is actually my fault; there was an issue with the module's code, and I've gone ahead and made a patch to fix it.

I've made some changes to your code; it now uses 256 workers with a dispatch size of #WaveData. This way every element in WaveData will be processed. Even though there are only 256 workers, each worker processes multiple entries, just in separate calls.
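
Roughly, the dispatch side of that change looks like this (a sketch; WaveData stands in for the wave table in your place, and the worker template and thread name are placeholders):

local ReplicatedStorage = game:GetService("ReplicatedStorage")
local ComputeLua = require(ReplicatedStorage.ComputeLua)

local WaveData = table.create(4000, 0) -- stand-in for your ~4000 wave entries

-- 256 workers is plenty; the dispatch size is the amount of data, not the worker count
local Dispatcher = ComputeLua.CreateDispatcher(256, script.Worker)

-- One "CalculateWave" call per entry in WaveData; the 256 workers are reused
-- until every dispatch ID from 1 to #WaveData has been processed
Dispatcher:Dispatch(#WaveData, "CalculateWave"):andThen(function()
	-- read back the updated buffer data here
end)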

A more detailed explanation of how this works:

When the Dispatcher is dispatched, a for loop is started (if serial dispatch is disabled). It will pick a random worker to run its function; this could be a worker that was already called in a previous iteration, or a brand new one that is sitting around doing nothing. Since CreateThread binds your function to a parallel actor message, workers that get called more than once during the dispatch won't pause between calls.

This allows you to have far fewer workers than the data size requires; for you, that's 256 workers processing around 4000 data entries.

If you really want to have fewer workers (which isn't recommended), then you are going to have to create your own method of getting the 4 workers to process all of the data instead of stopping at the 4th entry.

Place1.rbxl (112.3 KB)
If you have any more questions, just ask.

1 Like

I'm reworking my ant simulation project to be more performant, since the previous version lags a lot with 1 hour's worth of pheromones (I'd say about ~2k pheromones for each pheromone group). A pheromone can be a home pheromone or a food pheromone. Each has a strength value that decays over time, with a Vector2 position associated with it.

This is the previous version’s pheromones structure:

{
	HOME = {
		[Vector2.zero] = 4; -- [position] = strength
		[Vector2.yAxis] = 6;
		[Vector2.xAxis] = 2;
		[Vector2.new(1, 2)] = 0.2
	};
	FOOD = {
		[Vector2.one] = 2; -- [position] = strength
		[Vector2.zero] = 1; -- pheromone groups can overlap other groups
		[-Vector2.xAxis] = 3
	}
}

(The pheromones are changing constantly; the set grows as the simulation runs, and pheromones are removed if their strength goes below 0 to save some space and time in the falloff function.)

I don't really know how to manage my buffers with this structure, since I need quick access to the pheromone at a given position. I'm currently using 2 separate dispatchers that run sequentially for the Pheromone workers and the Ant workers. Should I just stick to one dispatcher and one worker script instead of 2?


I just slapped 3 compute buffers together and I'm pretty sure I'm doing this wrong.

The ant workers are handling steering and behavioral logic, while the pheromone workers are handling pheromone decay and pheromone falloff strength.

1 Like

Using one buffer for the positions and a different one for the strengths works great as long as you match the indices of the two (position buffer index 1 should use the strength at strength buffer index 1), if you aren't doing that already.

You can use two dispatchers as long as you know two things:

  • ComputeBuffers are global; they are not tied to a dispatcher, so the data can be accessed from any worker under any dispatcher
  • If you dispatch the same thread while another dispatch of it is still running, you are going to get incorrect data

I suggest you use 1 dispatcher and just 2 threads; you can create multiple threads in one worker script.
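
For example, a single worker script can register both threads, and the one dispatcher can dispatch either of them by name (a sketch; the thread names are made up):

local ReplicatedStorage = game:GetService("ReplicatedStorage")
local ComputeLua = require(ReplicatedStorage.ComputeLua)

local actor = script:GetActor()
if actor == nil then
	return
end

ComputeLua.CreateThread(actor, "DecayPheromones", function(id, variableBuffer)
	-- pheromone decay / falloff work for entry `id`
end)

ComputeLua.CreateThread(actor, "UpdateAnts", function(id, variableBuffer)
	-- ant steering / behaviour work for entry `id`
end)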


Now, for the format of the buffers: the best thing to do is keep the size of each buffer fixed so that reading and writing happen quickly.

The worst thing that can happen is workers adding data to the buffers (new elements/new indices). The buffers use SharedTables, and growing a SharedTable is very slow, so you want to avoid it at all costs.

As long as you don't let any worker add new elements, you should be fine. Appending to regular Luau tables is much faster, so you can build the table before dispatching.

I'm going to assume the pheromone groups are the HOME and FOOD keys in the original table. To replace them you can use an identifier bit in the position buffer. If you will only ever have two types of group, this bit would be 1 for one group and 0 for the other. If you have more than 2, just use 1, 2, 3, etc. and check for that in the worker.

The reason I'm using a number instead of a string is that strings take longer to process.

To store this bit alongside the positions you have three options: the easy, slow way; the fast, slightly more complicated way; and the fast, fairly easy way.

Let's look at the easy way first: instead of having one position per index of the position buffer, you use nested tables, where index 1 of each inner table is the position and index 2 is the group bit.

This is slow because nested tables take longer to send over

{
    {Vector2.zero, 1},
    {Vector2.one, 0}
}

The slightly more complicated way is using a stride value. That way you won't have any nested tables; it's just each position followed by its group bit. You can find an example in my previous post.
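
Here is a sketch of that stride layout (the values and the 0/1 group assignment are illustrative; strengths would live in their own buffer, matched by index):

-- Stride of 2: each pheromone is its position followed by its group bit (0 = HOME, 1 = FOOD)
local pheromonePositions = {
	Vector2.zero, 0,
	Vector2.one, 1,
	-Vector2.xAxis, 1,
}
local STRIDE = 2

-- Inside the worker:
-- local base = (dispatchId - 1) * STRIDE
-- local position = pheromonePositions[base + 1]
-- local group = pheromonePositions[base + 2]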

The last way is what you are currently doing: simply adding another buffer. Just make sure the indices match so you get the correct data.


Now, dispatching. For the number of workers, 256 is usually a good starting point and works in most cases. As for how many threads to dispatch, that's just the size of the data you have.

For example, if you have 1000 positions, you will dispatch 1000 threads. You can just take the size of the table you put into the buffer to figure this out. The number of threads to dispatch should always be the same as the size of your data.

Now the most important part. Since this is running every tick, you want to wait for the dispatcher to finish before dispatching again, otherwise the buffer data will get messed up. Dispatching returns a Promise, so you can just call :await() on it and it will wait for the dispatch to finish.

If you are going to use 2 dispatchers, just track when both of them finish using something like a counter; once it reaches 2 you can move on. To wait for both of them you can use a repeat loop:

repeat
    task.wait() -- could be task.wait() or RunService.Heartbeat:Wait(), whichever you prefer
until counter >= 2
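
Putting that together, the two-dispatcher case might look something like this (a sketch; the dispatcher variables, thread names, and counts are illustrative):

local finished = 0

-- pheromoneDispatcher and antDispatcher were created earlier with ComputeLua.CreateDispatcher
pheromoneDispatcher:Dispatch(pheromoneCount, "DecayPheromones"):andThen(function()
	finished += 1
end)
antDispatcher:Dispatch(antCount, "UpdateAnts"):andThen(function()
	finished += 1
end)

-- Wait until both dispatches have resolved before this tick's work continues
repeat
	task.wait()
until finished >= 2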

The number of workers and the batch size (default is 50) will really depend on how fast your calculation code is, so there isn't a generic number that works for everything. But here are some tips that will help the code run faster:

  • Never loop through big tables. This takes a while and pauses the worker, so instead pre-calculate anything that requires looping through a big table before dispatching. For example, when I used this for noise sphere generation, I needed the vertices adjacent to the current vertex. I did this by looping through every vertex beforehand to find its neighbours, then saved that into a buffer as indices pointing into the vertex position buffer.
  • Keep your calculations fast. This whole system is modelled on Unity's compute shaders, which work best with small tasks that need to happen many times. One individual calculation should take very little time; doing it serially for thousands of elements is what takes a long time, which is why you split the work into small tasks that all run at the same time.
  • Pre-calculate any data you may need that will never change and put it in a buffer (or, if it isn't a lot of data, the variable buffer).
2 Likes

Thanks for the info! I still have a question regarding the pheromones

What do you mean by adding onto a regular Luau table? How should I add new pheromones to the grid?

Also, is it okay to create more than 2 threads for a single dispatcher (I use 3 as of right now)? It needs to decay the pheromones, calculate the falloff strength of all of the ants' sensors, and then do the rest of the ant stuff.

1 Like

The buffers use a SharedTable created by calling SharedTable.new(data). A regular table is just t = {}.

You want to add everything to the buffer you plan to modify before you dispatch the dispatcher. Setting the buffer data is done by giving it a regular table (see the sketch below).

You can create as many threads as you want
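
In other words, something like this runs on the main thread each tick before dispatching (a sketch; the activePheromones structure and names are made up):

-- Your live pheromone list, kept as a regular Luau table (cheap to add to and remove from)
local activePheromones = {
	{position = Vector2.zero, strength = 4},
	{position = Vector2.one, strength = 2},
}

-- Flatten it into plain arrays right before dispatching
local positions = {}
local strengths = {}
for i, pheromone in ipairs(activePheromones) do
	positions[i] = pheromone.position
	strengths[i] = pheromone.strength
end

-- Hand these finished tables to your two ComputeBuffers before dispatching,
-- so no worker ever has to grow a SharedTable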

2 Likes

just wanted to share what I made with this resource:


(sorry for the low quality)
Without OBS, it hovers at ~16 fps; I can't seem to improve it at higher resolutions (above is 60x60).

	Dispatcher:Dispatch(ResolutionSquared, 'RaycastScreen'):andThen(function()
		Dispatcher:Dispatch(ResolutionSquared, 'DrawTextOnScreen'):andThen(function()
			local BufferData = CharacterComputeBuffer:GetData()
			for i = 1, ResolutionSquared do
				labels[i].Text = BufferData[i]
			end
		end):expect()
	end):expect()

Is there a way to make this faster? (The raycast is the slowest part of it all.)
Used the native and optimize flags and now it's around ~24 fps at 60x60, and 60 fps at 30x30.

edit: added an edge detection thing, but I don't know how to quantize the gradient down to /, \, |, and -, so I guess color will suffice

figured it out, albeit hacky

2 Likes

I am BEGGING you to release this code. I have been looking for a way to do edge detection to ASCII for so long.

It's nothing too fancy, just a glorified Gaussian blur and Sobel filter to get the gradient (don't worry, I believe in open-source supremacy, so I'll likely open-source it once I'm satisfied).

1 Like

I've just implemented your module into a gravitational fractal renderer of mine. I'm wondering if I've done anything wrong, since the non-threaded version finishes faster than the threaded one.

ComputeLua:

Without using ComputeLua:

In the version using ComputeLua, I added a task.wait() in the worker template script so that it wouldn’t trigger script timeout.

So what I'm doing is dispatching resolution^2 tasks across 32 workers, each of which handles 50 pixels at a time (the batch size).

Each worker calculates the color of a pixel, which is then stored back into a buffer.

local resolution = 100
local num_workers = 32

local dispatcher = compute_lua.CreateDispatcher(num_workers, worker_template)

dispatcher:Dispatch(resolution^2, 'pixel', 50):andThen(function()
	local data = pixel_buffer:GetData()

	for x = 0, resolution-1 do
		for y = 0, resolution-1 do
			local color = data[y*resolution+x] ~= nil and bodies[data[y*resolution+x]].color or Color3.new(0, 0, 0)
			canvas:DrawCircle(Vector2.new(x, y), obj_radius, color, 0, Enum.ImageCombineType.BlendSourceOver)
		end
		task.wait()
	end

end):expect()

While on the worker template:

compute_lua.CreateThread(actor, 'pixel', function(id, variable_buffer)	
	local buffer_data = compute_lua.GetComputeBufferData('pixel')
	local body_data = compute_lua.GetComputeBufferData('body')

	local resolution = variable_buffer[1]
	local body_count = variable_buffer[5]
	local max_iter = variable_buffer[6]

	----------------------------------------------------------------

	local x = (id-1)%resolution
	local y = (id-1)//resolution

	--for y = 0, resolution-1 do
	local obj_position = Vector2.new(x, y)
	local obj_velocity = Vector2.zero
	local delta_force = Vector2.zero

	local collided_index = nil

	for i = 1, max_iter do
		if collided_index then break end
		
		for body_n = 1, body_count do
                          --// calculate force for each body
                          --// if certain criteria is met then collided_index = body_n
		end
		
		obj_velocity += delta_force
		obj_position += obj_velocity
		--if i % 20 == 0 then task.wait() end
	end
	
	buffer_data[(y*resolution)+x] = collided_index
	--end
end)

I'm not sure if I've given enough info, so please ask if you want more.

Keep in mind I know NOTHING about threading, including its practical usage.

1 Like

ComputeLua - Stable Release 1.3.0 (9/24/2024)

After a couple of months of working on the module, I managed to increase the performance by a lot! Here is the changelog

  • Huge performance increases
  • Reworked how workers return their data
  • ComputeBuffers are now linked to Dispatchers
  • The VariableBuffer has been removed
  • Added ComputeLua.GetBufferDataKey: (bufferName: string) -> number
  • Removed ComputeLua.CreateComputeBuffer
  • Added Dispatcher.SetComputeBuffer: (self: Dispatcher, bufferName: string, bufferData: ComputeBufferDataType) -> ()
  • Added Dispatcher.DestroyComputeBuffer: (self: Dispatcher, bufferName: string) -> ()
  • Removed Dispatcher.SetVariableBuffer

Benchmarks

Got this benchmark from (Parallelizer - A Packet-based Parallelism Module)
Processing the square root of 2, 8192 times and repeating that 100 times

Using 256 workers with a batch size of 32 (8192 / 256)

13th Gen Intel(R) Core™ i9-13900F @ 2.00 GHz

  • 1.3.0: 20ms average; 2s total; 50fps average
  • 1.2.1: 40ms average; 4s total; 40fps average

11th Gen Intel(R) Core™ i7-1195G7 @ 2.92 GHz

  • 1.3.0: 30ms average; 3s total; 45fps average
  • 1.2.1: 70ms average; 7s total; 17fps average
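
Benchmark code (the dispatcher script, followed by the worker script):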
local ReplicatedStorage = game:GetService("ReplicatedStorage")

local ComputeLua = require(ReplicatedStorage.ComputeLua)

local worker = script.Worker
local numWorkers = 256

local Dispatcher = ComputeLua.CreateDispatcher(numWorkers, worker)

Dispatcher:SetComputeBuffer("buffer", table.create(8192, 2))

local total = 0
local count = 0
local function process()
    local start = os.clock() -- os.clock() for sub-second timing precision
    
    local _, data = Dispatcher:Dispatch("ProcessSquareRoot", 8192):await()
    total += os.clock() - start
    count += 1
    
    if count < 100 then
        process()
    else
        print(`Average time to process: {(total / count) * 1000}ms`)
        print(`Total time to process: {total}s`)
    end
end

process()
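
-- Worker script (the script.Worker template referenced above):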
local ReplicatedStorage = game:GetService("ReplicatedStorage")

local actor = script:GetActor()
if actor == nil then
    return
end

local ComputeLua = require(ReplicatedStorage.ComputeLua)

local BUFFER_KEY = ComputeLua.GetBufferDataKey("buffer")

ComputeLua.CreateThread(actor, "ProcessSquareRoot", function(id: number, bufferData: SharedTable)
    local value = bufferData[BUFFER_KEY][id]
    return {BUFFER_KEY, math.sqrt(value)}
end)
1 Like