lore
The other day I was revisiting my ASCII Shader project, which uses ComputeLua. ComputeLua uses a SharedTable called "ComputeBuffer" to get data from parallel back to serial. I was only getting a ~300 ms refresh rate on my shader, which is not at all playable - so after careful consideration and research, I decided to make my own with a twist.
fast travel
Usage
Main Idea
Download
Benchmarks & Comparisons
Credits & Footnotes
Usage
Place the module somewhere both the main script and the actor script can access, like ReplicatedStorage.
Main Script:
```lua
local Parallelizer = require(path.to.module)
local Scheduler = Parallelizer:CreateNewJobScheduler(script.Job, 256) -- 256 workers/actors

task.wait(1) -- Give the actor scripts time to fully initialize

-- Calculate the root of 1 to 4096, with each actor doing an equal number of root operations
Scheduler:DispatchWithBatches('CalculateRoot', 4096, function(result)
	print(result) -- The results, in an array
	Scheduler:Destroy() -- Destroy the scheduler when it's no longer needed (only for memory cleanup purposes)
end, {2}) -- {2} is the argument table you want to pass into parallel land, preferably for constants

-- The Dispatch function is asynchronous, meaning it won't yield - so the code after it will run unobstructed
```
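My assumption of how DispatchWithBatches splits the work, based on its name and arguments (not verified against the module's internals):

```lua
-- Assumed batching arithmetic: 4096 threads spread over 256 actors
local threadCount = 4096
local workerCount = 256
local batchSize = threadCount // workerCount -- 16 root operations per actor
```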
Actor/Job Script (under the main script):
```lua
local Actor = script:GetActor()

-- Don't run if this script isn't under an Actor
if not Actor then
	return
end

local Parallelizer = require(path.to.module)

-- id is the index of the thread, which is in the range [1, threadCount]
Parallelizer:CreateThread(Actor, 'CalculateRoot', function(id, instruction)
	return id ^ (1/instruction[1])
end)
```
Note: the CreateThread callback is expected to return a value.
The code above calculates the square root of every integer from 1 to 4096. The 2 in the instruction table is passed into the parallel workload as an argument, so each thread computes id ^ (1/2).
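For reference, the parallel job above produces the same values as this plain serial loop (just spread across 256 actors):

```lua
-- Serial equivalent of the CalculateRoot job above
local instruction = {2}
local result = table.create(4096)
for id = 1, 4096 do
	result[id] = id ^ (1 / instruction[1]) -- sqrt(id)
end
```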
Oh, why use a callback, you may ask? Well, it's the only approach I could think of that doesn't add extra delay to the job processing. If I used polling, I would need to add a delay between poll iterations, which proved inefficient. And if I used promises, they would act as a middleware layer that adds yet another delay (and they're also a bit bloated).
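To illustrate the polling problem: a poll-based scheduler has to sleep between checks, and every sleep adds latency. A hypothetical sketch of what the callback design avoids (not anything the module actually does):

```lua
-- Hypothetical polling loop; 'state' is an assumed shared status table
local function waitForResults(state)
	while not state.done do
		task.wait() -- each iteration waits at least one heartbeat, adding delay
	end
	return state.results
end
```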
What the hell is the difference?
It uses a BindableEvent to send packets of data from parallel to serial instead of a SharedTable - SharedTables are, frankly, quite slow to deal with.
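As a rough illustration, here's a minimal sketch of that parallel-to-serial handoff pattern. The event, actorId, firstId, batchSize, and compute names are all hypothetical - this is not the module's actual internals:

```lua
-- Hypothetical sketch of a BindableEvent handoff, not Parallelizer's real code
local ResultEvent = script.Parent.ResultEvent -- assumed BindableEvent owned by the scheduler

-- Parallel phase: compute this actor's batch
local results = table.create(batchSize)
for i = 1, batchSize do
	results[i] = compute(firstId + i - 1) -- 'compute' stands in for the thread callback
end

-- Serial phase: resynchronize, then fire the packet back to the scheduler
task.synchronize()
ResultEvent:Fire(actorId, results)
```

The scheduler then only has to collect one packet per actor and stitch them into the final array, rather than having every thread write through a SharedTable.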
Download
Parallelizer.lua (2.4 KB)
Version History
0.1.3: Parallelizer.lua (2.4 KB)
0.1.2: Parallelizer.lua (2.2 KB)
0.1.1: Parallelizer.lua (2.1 KB)
0.1.0: Parallelizer.lua (2.4 KB)
Release: Parallelizer.lua (2.2 KB)
Benchmarks & Comparisons
Benchmark settings:
- 256 actors
- 8192 threads
- 32 tasks assigned per actor
- 100 iterations
- Non-native environment
Hardware (the only stuff I have):
- Intel(R) Core™ i7-4770K CPU @ 3.50GHz
Parallelizer Benchmark Code
```lua
local Parallelizer = require(script.Parallelizer)
local Scheduler = Parallelizer:CreateNewJobScheduler(script.Job, 256)

task.wait(1)

local DeltaTimeSum = 0
local Count = 0

local function Benchmark()
	local Start = os.clock()
	Scheduler:DispatchWithBatches('CalculateRoot', 8192, function(result)
		local DeltaTime = os.clock() - Start
		DeltaTimeSum += DeltaTime
		Count += 1

		if Count < 100 then
			Benchmark()
		else
			print(`Average: {DeltaTimeSum/Count}`)
		end
	end, {2})
end

Benchmark()
```
ComputeLua Benchmark Code
```lua
local ComputeLua = require(script.ComputeLua)
local Dispatcher = ComputeLua.CreateDispatcher(256, script.Worker)
Dispatcher:SetVariableBuffer({ 2 })

local ComputeBuffer = ComputeLua.CreateComputeBuffer('Root')
ComputeBuffer:SetData(table.create(8192, 0))

local DeltaTimeSum = 0
local Count = 0

local function Benchmark()
	local Start = os.clock()
	Dispatcher:Dispatch(8192, 'CalculateRoot', 8192//256):expect()
	ComputeBuffer:GetData()

	local DeltaTime = os.clock() - Start
	DeltaTimeSum += DeltaTime
	Count += 1

	if Count < 100 then
		Benchmark()
	else
		print(`Average: {DeltaTimeSum/Count}`)
	end
end

Benchmark()
```
Parallel Scheduler Benchmark Code
I hope I’m doing this right?
```lua
local Scheduler = require(script.ParallelScheduler)
local ModuleTable = Scheduler:LoadModule(script.mod)
ModuleTable:SetMaxWorkers(256)

local DeltaTimeSum = 0
local Count = 0

local function Benchmark()
	for i = 1, 8192 do
		ModuleTable:ScheduleWork(2)
	end

	local Start = os.clock()
	ModuleTable:Work()

	local DeltaTime = os.clock() - Start
	DeltaTimeSum += DeltaTime
	Count += 1

	if Count < 100 then
		Benchmark()
	else
		print(`Average: {DeltaTimeSum/Count}`)
	end
end

Benchmark()
```
| Task | Parallelizer | ComputeLua | Parallel Scheduler |
|---|---|---|---|
| Sqrt | 20ms | 81ms | 26ms |
(the ComputeLua benchmark is not up to date)
Credits & Footnotes
WARNING: Messing with parallel code in general is prone to crashes - save a backup or publish the place to avoid losing your progress.
I don’t really plan to maintain this project seriously unless I have a reason to.
Also, could anyone suggest a stress test method I could use? I'm trying to test this module's capabilities before I redo my shader.
Thanks to ComputeLua once again for inspiring me and giving me hope to make the shader - and this. Most of the API is similar to ComputeLua's (and also a bit of Unity's).