Measure the player's CPU for optimizing Parallel Luau

NOVEIGMA · January 22, 2024, 3:49am

Please read
After some testing with the community (more is still needed), the working assumption is that Roblox imposes a hard limit of 8 parallel threads. As such, this tool does not exactly measure your CPU’s core/thread count. Thanks to @Tomi1231 and @RuizuKun_Dev for pointing this out. However, the tool is still useful for finding a local limit lower than 8 on other types of devices, such as mobile ones.
Another disclaimer: Parallel Luau also turns out not to work well on efficiency-oriented devices, which includes devices running on battery power and low-power CPUs. The results on these devices are surprisingly inconsistent.

Demo:

Live demo:

This is a pretty hacky workaround, but from my testing it seems to be pretty accurate and stable for the most part. All it does is run a test to measure the number of usable RBX Workers for parallelization in the task scheduler.

Why does this matter

Your PC has a certain number of cores (or threads; the terms are used interchangeably in this post) that can be used for parallelization. You can divide the workload in your code among these cores to cut computation time.

Let’s just say, you have 4 cores on your PC. Your workload can be divided up among 4 Actors, so that’s one Actor per core.

You can double your number of Actors to 8, and the work can still be divided equally among the cores. This time it’s 2 Actors per core. Each core finishes processing its Actors at around the same time, and the parallel work takes 6 milliseconds in total.

But what if you accidentally made one more Actor, bringing the total to 9? The extra Actor randomly lands on one of the cores, and that core needs more time to finish its extended assignment. Because the parallel resumption phase must wait for all cores to finish, the phase stretches to 9 milliseconds!
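The numbers above can be reproduced with a tiny model. This is only a sketch under assumptions that are not from Roblox: every Actor costs the same (3 ms, inferred from "2 Actors per core = 6 ms"), and the scheduler processes Actors in rounds of one Actor per core. The function name `phaseTimeMs` is made up for illustration.

```lua
-- Simplified model (an assumption, not the real task scheduler):
-- equal-cost Actors are processed in rounds, one Actor per core per round,
-- and the parallel phase ends only when the last round ends.
local function phaseTimeMs(actorCount, coreCount, perActorMs)
	local rounds = math.ceil(actorCount / coreCount)
	return rounds * perActorMs
end

print(phaseTimeMs(4, 4, 3)) -- 3: one round
print(phaseTimeMs(8, 4, 3)) -- 6: two rounds
print(phaseTimeMs(9, 4, 3)) -- 9: the ninth Actor forces a third round
```

The jump from 6 to 9 comes entirely from the `math.ceil`: a single leftover Actor costs a whole extra round.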

This is a big problem for your game’s performance. The frametime gets bigger because of that one Actor holding back the parallel resumption phase. But this can be easily mitigated by creating more and more Actors so that each of them gets smaller and individually takes less time. That way, any extra Actors that don’t fit waste less time and your performance loss shrinks. In fact, this is the solution that Roblox wants you to use… with an unpleasant side effect that will be explained below.
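To see why finer granularity shrinks the loss, here is a rough sketch using the same round-based model. It assumes a fixed total workload (24 ms of core time, i.e. the 8 Actors × 3 ms from the example) split evenly across however many Actors you create. `splitPhaseTimeMs` and both constants are hypothetical names and values for illustration.

```lua
-- Assumed model: a fixed workload is split evenly across the Actors,
-- and Actors are processed in rounds of one Actor per core.
local TOTAL_WORK_MS = 24 -- assumption: 8 Actors x 3 ms from the example above
local CORE_COUNT = 4     -- assumption: the 4-core PC from the example above

local function splitPhaseTimeMs(totalWorkMs, actorCount, coreCount)
	local perActorMs = totalWorkMs / actorCount
	local rounds = math.ceil(actorCount / coreCount)
	return rounds * perActorMs
end

-- Every count except 8 leaves exactly one leftover Actor.
for _, actorCount in ipairs({8, 9, 17, 33, 65}) do
	local elapsed = splitPhaseTimeMs(TOTAL_WORK_MS, actorCount, CORE_COUNT)
	print(string.format("%d Actors: %.2f ms", actorCount, elapsed))
end
```

In this model the leftover Actor always costs one extra round, but the round itself gets shorter as the Actors get smaller, so the phase time creeps back toward the ideal 6 ms.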

[Image: quoted excerpt]

As quoted above, this granularity has a problem. Each Actor is an instance that contains a script. If you have hundreds of these Actors, it can easily inflate your memory usage and also make it an absolute nightmare to manage.

Because parallelization is meant for mathematical workloads (it’s like the only thing it’s good at), you can imagine needing to move a lot of data between your Actors and a central consolidation point. After all, you must know the result of your calculations! This means firing tons of BindableEvents, sending tons of Actor messages, and making tons of SharedTable accesses. All of this adds even more performance degradation, and eventually it can become slower than just doing everything in serial. This phenomenon is called parallel slowdown and is something you need to watch out for.

Parallel slowdown is typically the result of a communications bottleneck. As more processor nodes are added, each processing node spends progressively more time doing communication than useful processing (Wikipedia).
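A rough way to picture parallel slowdown is to add a per-Actor communication cost to the round-based model. Everything here is an assumption for illustration: the 24 ms workload, the 4 cores, the 0.05 ms overhead per Actor, and the idea that this overhead is paid one Actor at a time at the consolidation point. None of these are measured Roblox figures.

```lua
-- Assumed model: compute time shrinks with parallelism, but each Actor adds a
-- fixed communication cost that is paid serially when results are consolidated.
local function frameCostMs(totalWorkMs, actorCount, coreCount, perActorOverheadMs)
	local compute = math.ceil(actorCount / coreCount) * (totalWorkMs / actorCount)
	local communication = actorCount * perActorOverheadMs
	return compute + communication
end

local SERIAL_MS = 24 -- assumption: cost of doing all the work on one thread

for _, actorCount in ipairs({4, 8, 64, 512}) do
	local cost = frameCostMs(SERIAL_MS, actorCount, 4, 0.05)
	local verdict = cost < SERIAL_MS and "faster than serial" or "slower than serial"
	print(string.format("%d Actors: %.2f ms (%s)", actorCount, cost, verdict))
end
```

With these made-up numbers, a handful of Actors is a clear win, while hundreds of Actors spend more time talking than computing and end up slower than the serial version.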

So, the smart person would try to minimize the number of Actors they create. But they also want to make enough Actors to fully utilize the player’s CPU, without going past the limit. And so you have to measure and know their CPU core count beforehand to make the right amount of Actors, which is the goal of this tool.

How it works

It’s pretty simple. Referring back to my explanation above, eventually you will create too many Actors and one of them will be an outlier that holds back everyone else. This can be used to find the “breaking point” by measuring the duration of the parallel resumption phase. If the time looks like it has doubled after adding one more Actor, it’s more than likely that you’ve reached the thread limit and the scheduler has to start stacking the Actors.
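The detection step can be sketched as a pure function over a list of recorded phase durations, one entry per Actor count. This mirrors the comparison against the running average that the main script performs, but `findBreakingPoint` and the sample timings are invented for illustration and are not real measurements.

```lua
-- timings[i] is the phase duration measured with i Actors.
-- Returns the largest Actor count before the duration deviates from the
-- running average by at least `threshold`, or nil if no jump is found.
local function findBreakingPoint(timings, threshold)
	local sum = 0
	for i, elapsed in ipairs(timings) do
		if i > 1 then
			local average = sum / (i - 1)
			if math.abs(elapsed - average) >= threshold then
				return i - 1
			end
		end
		sum = sum + elapsed
	end
	return nil
end

-- Made-up sample: four Actors fit side by side, the fifth has to stack.
local sample = {0.020, 0.021, 0.020, 0.020, 0.041}
print(findBreakingPoint(sample, 0.02 * 0.75)) -- 4
```

The threshold of 75% of the test duration matches `minimumIncrementFlag` in the main script: small jitter is ignored, while a near-doubling trips the check.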

The attached rbxm file simplifies the entire process and outputs the calculated result as an attribute under workspace. I’ve also added the original demo so you can test it out immediately.
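Consuming the result might look something like the sketch below. The attribute name comes from the main script; everything else is an assumption: the helper names, the fallback of 4 Actors when no measurement is available, and capping at 8 based on the suspected hard limit mentioned at the top of the post.

```lua
local ATTRIBUTE_NAME = "LocalplayerCPUCoreCount" -- written by the main script when the test ends

-- Turn the raw attribute value into a usable Actor count.
-- `fallback` and `hardCap` are assumptions, not values defined by Roblox.
local function pickActorCount(measured, fallback, hardCap)
	if type(measured) ~= "number" or measured < 1 then
		return fallback
	end
	return math.min(math.floor(measured), hardCap)
end

-- Poll workspace until the test has written its result or the timeout passes.
-- Only meaningful inside Roblox, where `workspace` and `task` exist.
local function waitForMeasuredCoreCount(timeoutSeconds)
	local deadline = os.clock() + timeoutSeconds
	while os.clock() < deadline do
		local value = workspace:GetAttribute(ATTRIBUTE_NAME)
		if type(value) == "number" then
			return value
		end
		task.wait(0.1)
	end
	return nil
end
```

A caller would then do something like `local actorCount = pickActorCount(waitForMeasuredCoreCount(10), 4, 8)` and clone that many Actors. The 10-second timeout is arbitrary; the test itself needs a stable framerate before it even starts, so a generous timeout plus a fallback seems safer than blocking forever.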

Note: The live place demo has a slightly different source code, but it is uncopylocked.

CPU core count finder.rbxm (4.0 KB)

measure CPU core count.rbxl (56.0 KB)

And the scripts themselves, if you’re too lazy to open the files:

Main script

local testLimit: number = 32 --number of test cycles; also the maximum limit of cores to be reported
local testDuration: number = 0.02 --yield time of each Actor script
local minimumIncrementFlag: number = testDuration * .75 --threshold for concluding the test
local stableFrametimeThreshold: number = 1/58 --see section below

local send = script:WaitForChild('Send')
local receive = script:WaitForChild('Receive')
local exActor = script:WaitForChild('ExampleActor')
type Worker = typeof(exActor)

do --wait for the framerate to settle, because an unstable task scheduler WILL ruin the test
	warn('waiting for FPS to stabilize...')
	local stableFramesCount: number = 0
	local connection: RBXScriptConnection = game:GetService('RunService').Heartbeat:Connect(function(dt: number)
		if dt <= stableFrametimeThreshold then --a frame counts as stable if its instantaneous FPS reaches the threshold
			stableFramesCount += 1
		end
	end)
	
	repeat task.wait() until stableFramesCount >= 60 --wait until 60 stable frames have been counted (not necessarily consecutive)
	connection:Disconnect() --stop counting so this listener doesn't keep running during the test
end

local record: {number} = {}
local i: number = 0
while i < testLimit do i += 1
	
	do --create a new Actor for this next cycle of the test
		local ready: boolean
		receive.Event:Once(function()
			ready = true
		end)
		
		local new = exActor:Clone()
		new.Name = tostring(i) --Name is a string; the Actor script parses it back with tonumber
		new.Parent = script
		new:WaitForChild('Worker').Enabled = true
		
		while not ready do task.wait() end
	end
	
	warn(`testing with {i} Actor(s)`)
	local now: number = os.clock()
	local received: number = 0
	local breakOut: boolean
	local onReceive: RBXScriptConnection onReceive = receive.Event:Connect(function()
		received += 1
		
		if received == i then --at the conclusion of this test cycle
			onReceive:Disconnect()
			onReceive = nil
			
			local timeTaken: number = os.clock() - now
			print(timeTaken)
			
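			local average: number = timeTaken --first cycle has no history, so compare against itself (never a discrepancy)
			if #record > 0 then --avoid dividing by zero when no earlier cycles were recorded
				average = 0
				for _, v in record do average += v end
				average /= #record
			end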
			
			if math.abs(timeTaken - average) >= minimumIncrementFlag then
				warn(`discrepancy at i={i}; average time is {average}, this cycle took {timeTaken}`)
				breakOut = true
				return
			end
			table.insert(record, timeTaken)
		end
	end)
	send:Fire(testDuration)
	
	while received ~= i do task.wait() end
	if breakOut then
		print(`LocalPlayer CPU core count measured to be {i-1}`)
		break
	end
	
	task.wait()
end

workspace:SetAttribute('LocalplayerCPUCoreCount', i-1) --output
script.Parent:Destroy()

Actor script

local actor = script.Parent
local main = actor.Parent
local send = main:WaitForChild('Send')
local receive = main:WaitForChild('Receive')
local id: number = assert(tonumber(actor.Name), `Unable to format ID "{actor.Name}"`)

send.Event:ConnectParallel(function(testDuration: number)
	local goal: number = os.clock() + testDuration
	repeat until os.clock() >= goal --busy-wait (spin, not yield) so the worker stays occupied for the whole duration
	receive:Fire()
end)

receive:Fire() --let the central script know that this Actor is ready

NOVEIGMA · January 22, 2024, 4:38am

Also here’s what it may look like in production code:

This is my raycaster minimap

nothing_1649 · January 22, 2024, 4:50am

this is pretty neat, but how would i go about implementing it? i’ve never really understood the concept of cloning multiple actors to achieve a common goal but i really want to optimise my code that runs in parallel

NOVEIGMA · January 22, 2024, 4:56am

Whatever’s in this thread is only gonna be useful if you already know about parallelization. You can start with these two resources; for the rest, you will have to do the digging and experimenting yourself

RuizuKun_Dev · January 22, 2024, 5:41am

Good work! You’re very clever to use this method and it seems to be pretty accurate

one of the best resources I’ve found so far this year, keep it up!

Tomi1231 · January 22, 2024, 7:54am

This can definitely be very useful, especially for workloads that are purely mathematical and not tied to a Roblox object (like an NPC or whatever). I made a module to make working with Parallel Luau for this kind of workload easier, and I just set the actor count to 24 (since it’s a value that should work nicely with 2, 4, 6, 8, 12 and 24 threads). I did think about making something similar, but never did.
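The "works nicely" claim for 24 is just divisibility: an Actor count splits evenly across a thread count when it leaves no remainder. A quick check, with `evenlyDivisibleCoreCounts` being a made-up helper name:

```lua
-- List every thread count that divides the Actor count with no leftover Actors.
local function evenlyDivisibleCoreCounts(actorCount)
	local result = {}
	for cores = 1, actorCount do
		if actorCount % cores == 0 then
			table.insert(result, cores)
		end
	end
	return result
end

print(table.concat(evenlyDivisibleCoreCounts(24), ", ")) -- 1, 2, 3, 4, 6, 8, 12, 24
```

Counts that are missing from the list, such as 5, 10 or 16 threads, would leave some threads with one more Actor than the others.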

One thing that is interesting is that I have a 6 core 12 thread cpu (ryzen 5 5600g), but in the micro profiler, I could only see 8 (or 10?) parallel threads running

Same result from your module

NOVEIGMA · January 22, 2024, 2:41pm

Check the microprofiler!

Edit: I have no idea what I’m saying

This seems to be because some CPUs have these so-called efficiency cores that run at a fraction of the performance of a regular core and are meant to be used on battery power, and thus can’t really be used like a regular one. They are ignored by the task scheduler. If you go beyond 8 Actors, it’s still going to start stacking even though those extra logical processors are registered in the scheduler as additional RBX Workers.

NOVEIGMA · January 22, 2024, 2:47pm

It technically just counts the number of cores that are useable in Roblox for parallelization. This can exclude special cases like efficiency cores.

minkmink · January 22, 2024, 3:35pm

you do know this doesnt matter, right? roblox automatically handles multithreading, you can have 1000 actors and it’d be the same as if you used 8 (if you have a core count of 8)

NOVEIGMA · January 22, 2024, 3:53pm

I hope you read my thread before making this reply. Creating too many Actors will bloat your memory usage and severely undermine parallelization’s benefits for mathematical workloads because you will run into communication bottlenecks. Imagine firing thousands of BindableEvents or Actor messages every frame just because you wanted the Actors to do something.

Tomi1231 · January 22, 2024, 6:02pm

The ryzen 5 5600g doesn’t have any efficiency cores, they are all equivalent (other than silicon lottery) and it also is a desktop cpu in a desktop computer with no battery.
The micro profiler only shows 8 RBX Workers, going from 0 to 12, and skipping 4, 6, 9, 10 and 11. Well 10 is shown sometimes but it has nothing. No idea what the logic is…

NOVEIGMA · January 22, 2024, 6:42pm

I was talking about efficiency cores because it’s relevant to my CPU. And actually I’m not even sure about what I said earlier anymore lol. The 8 cores available to me could be because it actually does still use efficiency cores as regular cores:

OR it could be that it’s using hyperthreading, which is only available to the performance cores (4 performance cores, 2 threads per core = 8 threads that can be used to parallelize)

But then neither of these can explain your case. It’s got 6 regular cores, but only 8 threads can be used???

At least it is consistent, the task scheduler will only utilize a set number of cores/threads for parallelization. Regardless, this only makes this core-finding tool more useful because it can find the exact number of cores that can be used for parallelization

minkmink · January 23, 2024, 4:08am

while technically correct, its practically just… not really something that happens (and if it does then too many actors is your least issue)

NOVEIGMA · January 23, 2024, 7:22am

You were the one who gave an example of making a thousand Actors, not me lol. You talking about practicality now is kinda contradictory. And also, firing an event every frame for each Actor is a practice more common than you think!

Even if you were conservative with how many Actors you make, you still have to be careful because an unbalanced parallelization where the Actors aren’t evenly distributed can effectively double the frametime and halve your performance. It is still more than reasonable to make just enough Actors, hence the purpose of measuring just how many, which is the sole purpose of this thread.

Eternity_Devs · January 24, 2024, 1:08am

CPU: i5-12450H. I added task.wait(6) to let everything else (especially the Roblox cores) load first, and power saving mode was enabled.

NOVEIGMA · January 24, 2024, 2:38am

Check the microprofiler. There was a frametime jump in a weird spot. But even then the numbers aren’t right. Basically, the frametime should double once the Actors have saturated all the cores and start stacking. In your case it’s a 33% jump. And, assuming you haven’t changed the settings, the numbers should average at around 0.02, not 0.03.
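For reference, the jump can be expressed as a relative increase over the previous average. The sketch below uses illustrative values only: 0.02 to 0.04 for a healthy run, and 0.03 to 0.04 as one pair of numbers that would produce a 33% jump. `relativeJump` is a hypothetical helper, not part of the module.

```lua
-- Relative increase of `after` over `before`; 1.0 means the time doubled.
local function relativeJump(before, after)
	return (after - before) / before
end

print(string.format("%.0f%%", relativeJump(0.02, 0.04) * 100)) -- 100%: clean doubling
print(string.format("%.0f%%", relativeJump(0.03, 0.04) * 100)) -- 33%: not a real saturation point
```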

I’ve done some testing too on my Surface Pro and it’s got similar results. Weirdly enough, these types of devices (efficiency > performance) consistently report inconsistent results. One such example is when the Actors don’t even start at the same time when the parallel resumption phase starts, which screws up the test result:

And the numbers also averaged closer to 0.03 instead of 0.02 just like your case:

Beloathed · January 24, 2024, 3:23am

I have a 16-core CPU.

NOVEIGMA · January 24, 2024, 5:43am

16 threads or 16 cores? Your test results are pretty normal, the elapsed time pretty much perfectly doubled after saturating the RBX Workers.

Also, check the microprofiler! That’s the only way to prove if something’s Roblox’s fault or this module’s fault. If it consistently only fills up 8 RBX Workers, then that’s the hard limit.

Beloathed · January 24, 2024, 5:28pm

NOVEIGMA · January 24, 2024, 6:34pm

show me yo microprofiler and screenshot the area where it starts stacking them Actors

Actually just screenshot the entire extent of the test