Parallelizer - A Packet-based Parallelism Module

lore

The other day I was revisiting my ASCII Shader project, which uses ComputeLua. ComputeLua uses a SharedTable called “ComputeBuffer” to get data from parallel back to serial. I was only getting a 300 ms refresh time on my shader, which is not at all playable - so after careful consideration and research, I decided to make my own module with a twist.

fast travel

Usage
Main Idea
Download
Benchmarks & Comparisons
Credits & Footnotes

Usage

Place the module somewhere both the main script and the actor script can access, like ReplicatedStorage.

Main Script:

local Parallelizer = require(path.to.module)
local Scheduler = Parallelizer:CreateNewJobScheduler(script.Job, 256) -- 256 workers/actors
task.wait(1) -- Ensure all the scripts ran fully

-- Calculate the square root of every integer from 1 to 4096, with the work
-- split evenly across the actors
Scheduler:DispatchWithBatches('CalculateRoot', 4096, function(result)
	print(result) -- The result is an array
	Scheduler:Destroy() -- Destroy when no longer used (only for memory cleanup purposes)
end, {2}) -- {2} holds the arguments you want to pass into parallel land, preferably constants
-- The Dispatch function is asynchronous, meaning it won't yield - so the succeeding code runs unobstructed

Actor/Job Script (under the main script):

local Actor = script:GetActor()

-- Bail out if this script isn't running under an Actor
if not Actor then
	return
end

local Parallelizer = require(path.to.module)

-- id is the index of the thread, which is in the range [1, threadCount]
Parallelizer:CreateThread(Actor, 'CalculateRoot', function(id, instruction)
	return id ^ (1/instruction[1])
end)

:exclamation:Notice: The CreateThread callback function must return a value.

The code above calculates the square root of every integer from 1 to 4096. The 2 in the instruction table serves as an argument passed into the parallel workload.
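
For illustration, multiple constants can travel through the same instruction table. A hypothetical example in the same shape as the API above (the 'ScaledRoot' job and both constants are made up, not part of the module):

-- Actor/Job script side: hypothetical job reading two constants
Parallelizer:CreateThread(Actor, 'ScaledRoot', function(id, instruction)
	local n, scale = instruction[1], instruction[2] -- exponent and scale factor
	return (id * scale) ^ (1 / n)
end)

-- Main script side: pass both constants through the instruction table
Scheduler:DispatchWithBatches('ScaledRoot', 4096, function(result)
	print(result[1]) -- (1 * 10) ^ (1/3), roughly 2.154
end, {3, 10})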

Oh, why use a callback, you may ask? Well, it's the only way I could think of that wouldn't add more delay to the job processes. If I used polling, I would need to add a delay between poll iterations, which proved inefficient. And if I used promises, they would act as middleware that adds yet another delay (and they're also kind of bloated).
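
That said, if you prefer a yielding style, you can build it on top of the callback yourself. A minimal sketch, assuming the API above (DispatchAwait is a made-up helper name, not part of the module):

-- Hypothetical wrapper: turns the callback-style dispatch into a yielding
-- call by signaling completion through a temporary BindableEvent
local function DispatchAwait(scheduler, jobName, threadCount, instruction)
	local done = Instance.new("BindableEvent")
	scheduler:DispatchWithBatches(jobName, threadCount, function(result)
		done:Fire(result)
	end, instruction)
	local result = done.Event:Wait() -- yields until the job fires the callback
	done:Destroy()
	return result
end

local roots = DispatchAwait(Scheduler, 'CalculateRoot', 4096, {2})
print(roots[16]) -- 4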

What the hell is the difference?

It uses a BindableEvent to send packets of data from parallel to serial instead of a SharedTable - SharedTables are frankly quite slow to deal with.
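
The packet idea looks roughly like this - a minimal sketch of the concept, not Parallelizer's actual internals (packet, heavyWork, offset, and batchSize are all made-up names):

-- Serial side: collect packets as they arrive
local packet = Instance.new("BindableEvent")
local results = {}
packet.Event:Connect(function(offset, batch)
	for i, value in batch do
		results[offset + i] = value
	end
end)

-- Parallel side (inside an actor's thread): compute while desynchronized,
-- then hop back to serial, since BindableEvents can't be fired in parallel
task.desynchronize()
local batch = {}
for i = 1, batchSize do
	batch[i] = heavyWork(offset + i)
end
task.synchronize()
packet:Fire(offset, batch)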

Download

Parallelizer.lua (2.4 KB)

Version History

0.1.3: Parallelizer.lua (2.4 KB)
0.1.2: Parallelizer.lua (2.2 KB)
0.1.1: Parallelizer.lua (2.1 KB)
0.1.0: Parallelizer.lua (2.4 KB)
Release: Parallelizer.lua (2.2 KB)

Benchmarks & Comparisons

Benchmark settings:

  • 256 actors
  • 8192 threads
  • 32 tasks assigned per actor
  • 100 iterations
  • Non-native environment

Hardware (the only stuff I have :sob:):

  • Intel(R) Core™ i7-4770K CPU @ 3.50GHz

Parallelizer Benchmark Code

local Parallelizer = require(script.Parallelizer)
local Scheduler = Parallelizer:CreateNewJobScheduler(script.Job, 256)
task.wait(1)

local DeltaTimeSum = 0
local Count = 0
local function Benchmark()
	local Start = os.clock()
	
	Scheduler:DispatchWithBatches('CalculateRoot', 8192, function(result)
		local DeltaTime = os.clock()-Start
		DeltaTimeSum += DeltaTime
		Count += 1
		
		if Count < 100 then
			Benchmark()
		else
			print(`Average: {DeltaTimeSum/Count}`)
		end
	end, {2})
end

Benchmark()

ComputeLua Benchmark Code

local ComputeLua = require(script.ComputeLua)
local Dispatcher = ComputeLua.CreateDispatcher(256, script.Worker)
Dispatcher:SetVariableBuffer({ 2 })

local ComputeBuffer = ComputeLua.CreateComputeBuffer('Root')
ComputeBuffer:SetData(table.create(8192, 0))

local DeltaTimeSum = 0
local Count = 0
local function Benchmark()
	local Start = os.clock()

	Dispatcher:Dispatch(8192, 'CalculateRoot', 8192//256):expect()
	ComputeBuffer:GetData()
	
	local DeltaTime = os.clock()-Start
	DeltaTimeSum += DeltaTime
	Count += 1

	if Count < 100 then
		Benchmark()
	else
		print(`Average: {DeltaTimeSum/Count}`)
	end
end

Benchmark()

Parallel Scheduler Benchmark Code

I hope I’m doing this right?

local Scheduler = require(script.ParallelScheduler)

local ModuleTable = Scheduler:LoadModule(script.mod)

ModuleTable:SetMaxWorkers(256)

local DeltaTimeSum = 0
local Count = 0
local function Benchmark()
	for i = 1, 8192 do
		ModuleTable:ScheduleWork(2)
	end

	local Start = os.clock()
	ModuleTable:Work()

	local DeltaTime = os.clock()-Start
	DeltaTimeSum += DeltaTime
	Count += 1

	if Count < 100 then
		Benchmark()
	else
		print(`Average: {DeltaTimeSum/Count}`)
	end
end

Benchmark()

Task    Parallelizer    ComputeLua    Parallel Scheduler
Sqrt    20ms            81ms          26ms

(the ComputeLua benchmark is not up to date)

Credits & Footnotes

:warning: WARNING: Parallel code in general is prone to crashing Studio, so save a backup or publish the place to avoid losing your progress.

I don’t really plan to maintain this project seriously unless I have a reason to.

Also, could anyone suggest a stress-test method I could use? I'm trying to test the capabilities of this module before I redo my shader.

Thanks to ComputeLua once again for inspiring me and giving me hope to make the shader - and this. Most of the API is similar to ComputeLua's (and also a bit of Unity's).

#resources:community-resources
PS: Change the topic to this ^^^

I was about to say I wasn't really going to maintain this project that often, but I guess it doesn't matter.

Added a :DispatchWithBatches helper function that dispatches and calculates the BatchSize automatically for you (it divides the work into equal parts - see the sketch below).
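
My reading of how such an even split can work (a sketch of the behavior, not the module's actual code):

-- Sketch: each actor gets an equal share, and any remainder (when
-- threadCount isn't divisible by actorCount) is handed out one extra
-- thread at a time so the total stays exact
local function splitWork(threadCount, actorCount)
	local base = threadCount // actorCount
	local remainder = threadCount % actorCount
	local ranges, index = {}, 1
	for actor = 1, actorCount do
		local size = base + (actor <= remainder and 1 or 0)
		if size > 0 then
			table.insert(ranges, {start = index, finish = index + size - 1})
			index += size
		end
	end
	return ranges
end

print(#splitWork(8192, 256)) -- 256 ranges of 32 threads each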

Made it so you can pass arguments directly into both dispatch functions, since I realized it would get tedious to set the Instruction table repeatedly.

Oh yeah, and it no longer breaks when your thread count is not divisible by your actor count.

Also added benchmarks and stuff

Fixed a silly if-statement oversight: previously a false return value would be flagged as a missing return value.

I'm pretty sure I did the benchmarks wrong - they should be intensive repeated tasks. I'm gonna go fix the benchmark section now.

Updated the benchmark section; it now compares averages instead of single attempts.

Fixed an issue where batchSize could be 0 (caused by the thread count being smaller than the actor count), which resulted in the actor loop not running.
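
The fix is presumably a clamp along these lines (my guess at the shape, not the actual diff):

-- With e.g. 10 threads over 256 actors, 10 // 256 == 0, so the per-actor
-- loop never ran; clamping to at least 1 guarantees every dispatch does work
local batchSize = math.max(1, threadCount // actorCount)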
