Using ComputeLua, you can easily dispatch a group of workers to run a function in parallel, have them work on some data, and send the results back to the main thread.
ComputeLua is great for large computations that need to happen quickly, such as:
Terrain generation
Editing the vertices of an EditableMesh
Simulation like waves
ComputeLua is fully and statically typed, so autocompletion will work and errors will be thrown if incorrect types are passed in.
Limits
The data that can be sent to workers is limited to the following data types. This is due to the limitations of SharedTables, but these types should cover everything you need.
Vector2
Vector3
CFrame
Color3
UDim
UDim2
number
boolean
string
No, you cannot have functions in these buffers.
To make everything run faster, the keys of ComputeBuffers must be numbers. If the keys are all numbers, a worker can easily access the data it is currently working on within the large table by using its dispatch ID.
How do I use this?
A detailed description can be found on the documentation site: Getting Started
Any more information you may need can be found on the API site: API
ComputeLua.CreateThread(actor, "CalculatePositions", function(id, variableBuffer)
local value = variableBuffer[1] -- Get the first variable within the Variable Buffer
end)
I would assume this script would be inside the worker script, right? If so, how is the data buffer accessed?
edit: Oh I’m sorry i am quite blind
Also, I have a question: does this split the work between all workers, or do I have to do this myself?
For example, given 100 points of data and 4 workers, each worker would have to calculate 25 data points.
No, I understand that this is multithreaded. What I am asking is: does it split the work across the workers automatically, or do I have to do this manually?
Yes, that CreateThread function is placed inside a worker script. The worker is a separate script that can live anywhere the dispatching script can access it, so it can be cloned.
You can simply get a ComputeBuffer by running ComputeLua.GetComputeBufferData(bufferName: string) inside a worker; more detailed information can be found in the documentation (Worker | Getting Compute Buffer data).
When you dispatch a group of workers, ComputeLua will automatically distribute the work among them. The most important variable each worker is given is the dispatchId, or in your case, the id in the function you passed to CreateThread. You can use this id to determine which entry of the buffer the worker is currently editing. Of course, you can always access any other entry of the buffer by simply changing the index. You can see an example of this in the documentation (Worker | Example).
The only thing ComputeLua doesn’t do for you is arrange the ComputeBuffers’ data into a format that can be used correctly. If that becomes a problem, I can create some helper functions to help put the data together.
In the latest release on GitHub, there is a complete example of the module as a .rbxl; if you need help, I suggest going there: Releases · blorbee1/ComputeLua · GitHub
Just go to the Dispatcher script in ServerScriptService and check out the format of the data it puts into a ComputeBuffer (no keys; each element is on its own, not in a nested table).
To be more specific to your question: the script will not split it up like that right now. You could always use the method of giving a stride value. A stride value is how many indices of the list one element takes up. For you, that would be a stride of 25, and you would use the dispatchId to figure out the start index of the element and the end index (start index + 24).
Here’s an example
-- In this example, there are 3 points as Vector2s
-- (of course you could use Vector2.new(), but to show the stride, I will do this)
local points = {
100, 300,
50, 10,
60, 10
}
-- Each point is 2 numbers, so 1 element takes up 2 indices, therefore a stride of 2
local stride = 2
-- Imagine this is the worker function
function(dispatchId)
-- Lua tables are 1-indexed and the dispatchId starts at 1,
-- so we subtract 1 before multiplying by the stride
local point = points[((dispatchId - 1) * stride) + 1]
-- At dispatchId = 1, the index put into "points" would be 1
-- ((1 - 1) * 2) + 1 = 1
-- DispatchId = 2, the index would be 3
-- and so on
end
Of course, you could always put each element into its own table (in your case, 25 points per table), since nested tables are allowed to be sent, but this is not encouraged. In my testing, nested tables slow down some workers.
In short, ComputeLua will not automatically split the data you put into the ComputeBuffers to match the number of workers. Each worker handles one dispatch ID at a time and will be called again with a new ID; you will have to manage multiple data entries per worker yourself if your dispatch count is smaller than your data size.
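Applied to the earlier 100-points/4-workers question, the stride idea might look like the following sketch (the function and table names are illustrative, not part of the API):

```lua
-- 100 points, 4 dispatch IDs, stride of 25: each call handles one block
local POINTS_PER_ID = 25

local function processBlock(dispatchId, points)
	local startIndex = (dispatchId - 1) * POINTS_PER_ID + 1
	local endIndex = startIndex + POINTS_PER_ID - 1
	for i = startIndex, endIndex do
		-- work on points[i] here
	end
end
-- dispatchId = 1 covers points 1..25, dispatchId = 2 covers 26..50, and so on
```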
But ComputeLua works best with large amounts of data. For example, I used it to calculate noise values on a sphere mesh with around 10k vertices, so using this sped up the calculation by a lot (1 worker per vertex).
Place1.rbxl (112.2 KB)
Could you please review my implementation? I am trying to get this to work as a mesh-deformed ocean simulation, but it crashes. The script is located at StarterPlayer.StarterPlayerScripts. The code is quite disgusting, I’m sorry.
You are trying to dispatch 4 workers for around 4000 data entries. Since you didn’t add a stride system like the one I described in my last post, the workers just stop after the 4th entry and everything else doesn’t get updated; it then throws an error saying Vector3 instead of CFrame.
The crashing is actually my fault; there was an issue with the module, and I’ve gone ahead and released a patch to fix it.
I’ve made some changes to your code; it now uses 256 workers with a dispatch size of #WaveData. This way every element in WaveData will be processed. Even though there are only 256 workers, each worker processes multiple entries, just in separate calls.
A more detailed explanation of the above:
When the Dispatcher is dispatched, a for loop is started (if serial dispatch is disabled). It will pick a random worker to run its function; this could be a worker that was already called in a previous iteration, or a brand-new one that is sitting around doing nothing. Since CreateThread binds your function to a parallel actor message, workers that get called more than once during the dispatch won’t pause between calls.
This allows you to have far fewer workers than the data size requires; for you, that’s 256 workers processing around 4000 entries.
If you really want fewer workers (which isn’t recommended), then you are going to have to create your own method of getting the 4 workers to process all of the data instead of stopping at the 4th entry.
Place1.rbxl (112.3 KB)
If you have any more questions, just ask.
I’m reworking my ant simulation project to be more performant, since the previous version lags a lot with an hour’s worth of pheromones (I’d say about ~2k pheromones for each pheromone group). A pheromone can be a home pheromone or a food pheromone. Each has a strength value that decays over time, with a Vector2 position associated with it.
This is the previous version’s pheromones structure:
{
HOME = {
[Vector2.zero] = 4; -- [position] = strength
[Vector2.yAxis] = 6;
[Vector2.xAxis] = 2;
[Vector2.new(1, 2)] = 0.2
};
FOOD = {
[Vector2.one] = 2; -- [position] = strength
[Vector2.zero] = 1; -- pheromone groups can overlap other groups
[-Vector2.xAxis] = 3
}
}
(the pheromones are changing constantly; the set grows as the simulation runs, and pheromones are removed when their strength drops below 0 to save some space and time in the falloff function)
I don’t really know how to manage my buffers with this structure, since I need quick access to the pheromone at a given position. I’m currently using 2 separate dispatchers that run sequentially for the Pheromone workers and the Ant workers; should I just stick to one dispatcher and one worker script instead of 2?
Using one buffer for the positions and a different one for the strengths works great as long as you match the indices of the two (position buffer index 1 should use the strength at strength buffer index 1).
(if you haven’t already)
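A minimal sketch of that matched-index layout, assuming the two buffers are named "position" and "strength" (the names and values here are illustrative):

```lua
-- Two parallel buffers; entry i of one pairs with entry i of the other
local positions = { Vector2.zero, Vector2.one, -Vector2.xAxis }
local strengths = { 4, 2, 3 }

-- Inside a worker, the same dispatchId indexes both buffers:
-- local pos = ComputeLua.GetComputeBufferData("position")[dispatchId]
-- local str = ComputeLua.GetComputeBufferData("strength")[dispatchId]
```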
You can use two dispatchers as long as you know two things:
ComputeBuffers are global; they ignore dispatchers, so you can access the data from any worker under any dispatcher.
If you dispatch the same thread while another dispatch is still running, you are going to get incorrect data.
I suggest you use 1 dispatcher and just 2 threads. You can create multiple threads in one worker
Now, for the format of the buffers: the best thing to do is keep the size of the buffer fixed so reads and writes happen quickly.
The worst thing that can happen is the workers adding to the buffers’ data (new elements/new indices). The buffers use SharedTables, and growing a SharedTable is very slow, so you want to avoid it at all costs.
As long as you don’t let any worker add new elements, you should be fine. Adding to regular Luau tables is much faster, so build the table before dispatching.
I’m going to assume the pheromone groups are the HOME and FOOD keys in the original table. To replace this, you can use an identifier bit in the position buffer. If you will only ever have two types of group, this bit would be 1 for one group and 0 for the other. If you have more than 2, just use 1, 2, 3, etc. and check for that in the worker.
The reason I’m using a number instead of a string is that strings take longer to process.
To store this bit with the positions, you have 3 options: the easy, slow way; the fast, slightly more complicated way; and the fast, fairly easy way.
Let’s see the easy way first:
Instead of having 1 position per index of the position buffer, you could use nested tables: index 1 of each inner table would be the position, and index 2 would be the group bit.
This is slow because nested tables take longer to send over.
{
{Vector2.zero, 1},
{Vector2.one, 0}
}
The slightly more complicated way is using a stride value. That way you won’t have any nested tables; it’s just the position followed by the group bit. You can find an example in my previous post.
The last way is what you are currently doing: simply adding a new buffer. Just make sure the indices match so you get the correct data.
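For the stride variant, the position buffer could interleave each position with its group bit. A sketch, where the table name and the meaning of the bit values are assumptions:

```lua
-- Stride of 2: position, then group bit (1 = HOME, 0 = FOOD)
local positionData = {
	Vector2.zero, 1,
	Vector2.one, 0,
	-Vector2.xAxis, 0,
}
local STRIDE = 2

-- In a worker, dispatchId picks out one (position, bit) pair:
local function getEntry(dispatchId)
	local base = (dispatchId - 1) * STRIDE
	return positionData[base + 1], positionData[base + 2]
end
```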
Now, dispatching. For the number of workers, 256 is usually a good starting point and works in most cases. The number of threads to dispatch is simply the size of the data you have.
For example, if you have 1000 positions, you will dispatch 1000 threads. You can just take the size of the table you put into the buffer to figure this out; the number of threads to dispatch should always equal the size of your data.
Now the most important part. Since this runs every tick, you want to wait for the dispatcher to finish before dispatching again, otherwise the buffer data will get messed up. Dispatch returns a Promise, so you can just call :await() on it and it will wait for it to finish.
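A minimal per-tick loop under that rule might look like this (the dispatcher, thread name, and data size are placeholders; the Dispatch argument order follows the examples shown elsewhere in this thread):

```lua
-- Wait for each dispatch to finish before starting the next tick
while simulationRunning do
	Dispatcher:Dispatch(dataSize, "UpdatePheromones"):await()
	task.wait()
end
```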
If you are going to use 2 dispatchers, track when both of them finish by using something like a counter; once it reaches 2, you can move on. To wait for both, you can use a repeat loop:
repeat
task.wait() -- could be task.wait() or RunService.Heartbeat:Wait(), depends on you
until counter >= 2
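Putting the counter together with the two dispatches, a sketch (dispatcher names, thread names, and sizes are assumed) could be:

```lua
local counter = 0
pheromoneDispatcher:Dispatch(pheromoneCount, "DecayPheromones"):andThen(function()
	counter += 1
end)
antDispatcher:Dispatch(antCount, "UpdateAnts"):andThen(function()
	counter += 1
end)
-- Block until both dispatches have resolved
repeat
	task.wait()
until counter >= 2
```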
The number of workers and the batch size (default is 50) will really depend on how fast your calculation code is, so there isn’t a generic number that works for everything. But here are some tips that will help the code run faster:
Never loop through big tables. This takes a while and pauses the worker; instead, pre-calculate anything that requires looping through a big table before dispatching. For example, when I used this for noise sphere generation, I needed the vertices adjacent to the current vertex. I looped through every vertex beforehand to find its neighbors, then saved that into a buffer as indices pointing into the vertex position buffer.
Keep your calculations fast. This whole system is based on Unity’s compute shaders, which work best with small tasks that need to happen many times. One individual calculation should take very little time, but doing thousands of them serially would take a long time; that’s why you split the work into small tasks that all run at the same time.
Pre-calculate any data you may need that will never change, and put it in a buffer (or, if it’s not a lot of data, the variable buffer).
Thanks for the info! I still have a question regarding the pheromones
What do you mean by adding onto a regular Luau table? How should I add new pheromones to the grid?
Also, is it OK to create more than 2 threads for a single dispatcher (I use 3 as of right now)? It needs to decay the pheromones, calculate the falloff strength of all of the ants’ sensors, and then do the rest of the ant logic.
The buffers use a SharedTable created with SharedTable.new(data). A regular table is just: t = {}
You want to add everything to the buffer you intend to modify before you dispatch the dispatcher. Setting the buffer data is done by giving it a regular table.
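As a sketch, that means building the table in plain Luau first and setting the buffer once (the buffer name, size, and values here are illustrative):

```lua
-- Fast: grow a regular Luau table, then hand it to the buffer in one call
local strengths = table.create(2048, 0)
for i = 1, 2048 do
	strengths[i] = math.random()
end
Dispatcher:SetComputeBuffer("strength", strengths)
-- Slow: letting workers insert new indices into the SharedTable mid-dispatch
```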
Just wanted to share what I made with this resource:
(sorry for the low quality)
Without OBS, it hovers at ~16 fps; I can’t seem to improve it at higher resolutions (above is 60x60).
Dispatcher:Dispatch(ResolutionSquared, 'RaycastScreen'):andThen(function()
Dispatcher:Dispatch(ResolutionSquared, 'DrawTextOnScreen'):andThen(function()
local BufferData = CharacterComputeBuffer:GetData()
for i = 1, ResolutionSquared do
labels[i].Text = BufferData[i]
end
end):expect()
end):expect()
Is there a way to make this faster? (the raycast is the slowest part of it all)
Used the native and optimize flags, and now it’s around ~24 fps at 60x60, and 60 fps at 30x30.
edit: added an edge detection thing, but I don’t know how to quantize the gradient down to /, \, |, -, so I guess color will suffice
It’s nothing too fancy; it’s just a glorified Gaussian blur and Sobel filter to get the gradient (don’t worry, I believe in open-source supremacy, so I’ll likely open-source it once I’m satisfied)
I’ve just implemented your module into a gravitational fractal renderer of mine. I’m wondering if I’ve done anything wrong, since the elapsed time for the non-threaded version is shorter than for the threaded one.
ComputeLua:
Without using ComputeLua:
In the ComputeLua version, I added a task.wait() in the worker template script so that it wouldn’t trigger the script timeout.
So what I’m doing is dispatching resolution^2 tasks across 32 workers, each of which handles 50 pixels (the batch size).
Each worker calculates the color of a pixel, which is then stored back into a buffer.
local resolution = 100
local num_workers = 32
local dispatcher = compute_lua.CreateDispatcher(num_workers, worker_template)
dispatcher:Dispatch(resolution^2, 'pixel', 50):andThen(function()
local data = pixel_buffer:GetData()
for x = 0, resolution-1 do
for y = 0, resolution-1 do
local color = data[y*resolution+x] ~= nil and bodies[data[y*resolution+x]].color or Color3.new(0, 0, 0)
canvas:DrawCircle(Vector2.new(x, y), obj_radius, color, 0, Enum.ImageCombineType.BlendSourceOver)
end
task.wait()
end
end):expect()
While on the worker template:
compute_lua.CreateThread(actor, 'pixel', function(id, variable_buffer)
local buffer_data = compute_lua.GetComputeBufferData('pixel')
local body_data = compute_lua.GetComputeBufferData('body')
local resolution = variable_buffer[1]
local body_count = variable_buffer[5]
local max_iter = variable_buffer[6]
----------------------------------------------------------------
local x = (id-1)%resolution
local y = (id-1)//resolution
--for y = 0, resolution-1 do
local obj_position = Vector2.new(x, y)
local obj_velocity = Vector2.zero
local delta_force = Vector2.zero
local collided_index = nil
for i = 1, max_iter do
if collided_index then break end
for body_n = 1, body_count do
--// calculate force for each body
--// if certain criteria is met then collided_index = body_n
end
obj_velocity += delta_force
obj_position += obj_velocity
--if i % 20 == 0 then task.wait() end
end
buffer_data[(y*resolution)+x] = collided_index
--end
end)
I’m not sure if I’ve given enough info, so please ask if you want more.
Keep in mind I know NOTHING about threading, including its practical usages.
Using 256 workers with a batch size of 32 (8192 / 256)
13th Gen Intel(R) Core™ i9-13900F @ 2.00 GHz
1.3.0: 20ms average; 2s total; 50fps average
1.2.1: 40ms average; 4s total; 40fps average
11th Gen Intel(R) Core™ i7-1195G7 @ 2.92 GHz
1.3.0: 30ms average; 3s total; 45fps average
1.2.1: 70ms average; 7s total; 17fps average
local ReplicatedStorage = game:GetService("ReplicatedStorage")
local ComputeLua = require(ReplicatedStorage.ComputeLua)
local worker = script.Worker
local numWorkers = 256
local Dispatcher = ComputeLua.CreateDispatcher(numWorkers, worker)
Dispatcher:SetComputeBuffer("buffer", table.create(8192, 2))
local total = 0
local count = 0
local function process()
local start = os.time()
local _, data = Dispatcher:Dispatch("ProcessSquareRoot", 8192):await()
total += os.time() - start
count += 1
if count < 100 then
process()
else
print(`Average time to process: {(total / count) * 1000}ms`)
print(`Total time to process: {total}s`)
end
end
process()
local ReplicatedStorage = game:GetService("ReplicatedStorage")
local actor = script:GetActor()
if actor == nil then
return
end
local ComputeLua = require(ReplicatedStorage.ComputeLua)
local BUFFER_KEY = ComputeLua.GetBufferDataKey("buffer")
ComputeLua.CreateThread(actor, "ProcessSquareRoot", function(id: number, bufferData: SharedTable)
local value = bufferData[BUFFER_KEY][id]
return {BUFFER_KEY, math.sqrt(value)}
end)