Parallel Luau is just as slow as serial Luau

I’m working on a terrain generator, and it takes approximately 0.6 seconds to generate a chunk.

I’m passing the generation jobs to 24 actors at a time, so I’d expect it to make 24 chunks in 0.6 seconds, but it seems to take the same time as serial (14.4 seconds for 24 chunks).

This is the code I’m using to generate an individual chunk:

local PerlinNoise = require(game.ReplicatedStorage.Modules.PerlinNoise)
local HttpService = game:GetService("HttpService")

--task.desynchronize()

return function (chunk, position, extras)
	task.desynchronize()
	local chunkSize, seed, scale, amplitude, cave_scale, cave_amplitude = unpack(extras)
	local blocks = table.create(chunkSize.X)
	
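	-- pass 1: assign a material to every voxel from the noise fields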
	for x = 1, chunkSize.X + 1 do
		if not blocks[x] then blocks[x] = table.create(chunkSize.Z) end	
		local real_x = position.X * chunkSize.X + x

		for z = 1, chunkSize.Z + 1 do
			if not blocks[x][z] then blocks[x][z] = table.create(chunkSize.Y) end
			local real_z = position.Z * chunkSize.Z + z

			for y = 75, chunkSize.Y + 1 do
				local real_y = position.Y * chunkSize.Y + y

				local cave_density = PerlinNoise.new({real_x, real_y, real_z, seed}, cave_scale) * cave_amplitude
				local density = y + chunk.splineValue + PerlinNoise.new({real_x, real_y, real_z, seed}, scale, 3) * amplitude

				local block = {
					--position = Vector3.new(real_x, real_y, real_z),
					material = "unknown",
				}

				--if density < 130 then
				if cave_density > 25 then
					block.material = "air"
				else
					if density < 130 or (density > 110 and density < 145) then
						block.material = "stone"
					else
						--block.light = 0
						block.material = "air"
					end
				end

				blocks[x][z][y] = block
			end

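			-- below y = 75: barrier floor at the bottom, otherwise caves vs. stone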
			for y = 1, 75 do
				local real_y = position.Y * chunkSize.Y + y

				local cave_density = PerlinNoise.new({real_x, real_y, real_z, seed}, cave_scale) * cave_amplitude

				local block = {
					--position = Vector3.new(real_x, real_y, real_z),
					material = "unknown",
				}

				if y < 3 then
					block.material = "barrier"
				else
					if cave_density > 25 then
						block.material = "air"
					else
						block.material = "stone"
					end
				end

				blocks[x][z][y] = block
			end
		end

		if x % 10 == 0 then
			task.wait()
		end
	end
	
	task.desynchronize()
	
	task.synchronize()

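	-- pass 2: mark every block that touches an air block as visible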
	for x = 1, chunkSize.X + 1 do
		for z = 1, chunkSize.Z + 1 do
			for y = 1, chunkSize.Y + 1 do
				local block = blocks[x][z][y]

				if block.material ~= "air" then
					--[[local topBlock = blocks[x][z][y + 1]
					if not topBlock or topBlock.material == "air" then
						block.visible = true
					end]]
				else
					for neighbor_x = -1, 1 do
						for neighbor_y = -1, 1 do
							for neighbor_z = -1, 1 do
								if neighbor_z ~= 0 or not (neighbor_z == neighbor_x and neighbor_z == neighbor_y and neighbor_x == neighbor_y) then
									local real_x = x + neighbor_x
									local real_y = y + neighbor_y
									local real_z = z + neighbor_z

									if blocks[real_x] and blocks[real_x][real_z] then
										local neighbor = blocks[real_x][real_z][real_y]
										if neighbor then
											neighbor.visible = true
										end
									end

								end
							end
						end
					end
				end
			end
		end

		if x % 10 == 0 then
			task.wait()
		end
	end
	
	task.synchronize()
	
	return blocks
end

I’m loading this code on 24 different actors, each with its own separate script, so I would expect this to work, but it seems it doesn’t.

3 Likes

Why are you doing

task.desynchronize()

task.synchronize()

between the two for loops? That’s unnecessary, since it forces the code into serial mode when it works fine in parallel.

The task.synchronize() at the end is unnecessary too, considering that “return blocks” works just fine without serial mode.

I tested the code with the fixes applied (not sure if it’s the same way you ran it, but it should be close), and it gave me a result of ~0.6 seconds per chunk.

TL;DR:
Delete all the mentions of task.synchronize()/task.desynchronize() except for the first one, since they are unnecessary and only slow down the script.
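In sketch form, the whole function shrinks to this (generation loops abbreviated):

return function (chunk, position, extras)
	task.desynchronize() -- switch to parallel once, at the start
	local blocks = {}

	-- ...both generation passes and the visibility pass run here, fully in parallel...

	return blocks -- returning a plain table needs no task.synchronize()
end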

EDIT: fixed a typo

2 Likes

The overhead of having actors is extremely high, and I got similarly awful performance. I was able to get better-than-serial performance by grouping generation into multi-chunk “jobs” and only starting more actors if there were more than some number of jobs to generate.

1 Like

I saw a comment on the Parallel Luau thread from 2021 saying that Lua waits for all the desynchronized threads to finish, and then it continues working.

I’ve removed all of them except the first one, and it’s still just as slow.

Is there any example of how to do this? From my understanding, you’re saying that you generate chunks on the main thread, and only use parallel threads if there are more jobs that need to be generated? Is this correct?

Any time I need to generate more terrain, I get all the chunks that need to be generated, put them in blocks of 32, and start one generator thread for each block. So if I need to generate 100 chunks, then 4 threads will start with 32 / 32 / 32 / 4 chunks in them. The threads return when they are done. The main thread does not itself generate chunks because it has strict timing constraints in my specific use case.
Also note that having threads running, even if they are just waiting for a condition, costs a lot of performance, because the game has to wake and then sleep them every tick. I had an issue with this where a bunch of threads would pile up waiting for their turn to run, and the generation would gradually get slower and slower.
The amount of “chunking” that gives you the best performance is hard to know; you’ll have to try different “block sizes” and maximum thread counts.
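A rough sketch of that batching, with placeholder names (pendingChunks and startGeneratorActor are stand-ins for my actual code):

local JOB_SIZE = 32

-- split the pending chunk list into jobs of at most JOB_SIZE chunks
local function splitIntoJobs(pendingChunks)
	local jobs = {}
	for i = 1, #pendingChunks, JOB_SIZE do
		local job = {}
		for j = i, math.min(i + JOB_SIZE - 1, #pendingChunks) do
			table.insert(job, pendingChunks[j])
		end
		table.insert(jobs, job)
	end
	return jobs
end

-- 100 pending chunks -> jobs of 32 / 32 / 32 / 4, so only 4 actors start
for _, job in splitIntoJobs(pendingChunks) do
	startGeneratorActor(job) -- stand-in for however you message your actors
end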

1 Like

Are the chunk generators running in actors, or on the main core? My generating function uses 60% of a core, so it’s impossible for more than 2 chunks to generate at the same time on the same core.

They are running in actors. I have a shared pool of 24 actors and if at any point all of them are running then the main thread stops producing jobs until one is available. This is the only part of my code that “waits” for anything.
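In rough sketch form (the pool setup and messaging names are placeholders, not my exact code):

local idleActors = {} -- filled at startup with 24 pre-created Actor instances
local pendingJobs = {}

local function dispatch()
	while #pendingJobs > 0 do
		local actor = table.remove(idleActors)
		if not actor then
			break -- every actor is busy; stop producing until one reports back
		end
		actor:SendMessage("Generate", table.remove(pendingJobs, 1))
	end
end

-- each actor fires doneEvent when it finishes and rejoins the pool
doneEvent.Event:Connect(function(actor, result)
	table.insert(idleActors, actor)
	dispatch()
end)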

1 Like

That’s exactly what I’m trying to do; however, it seems that if one job is started, all other jobs stop and wait for the first job to finish. I can post all the code for the thread manager here if it would help us debug it.

I’ve tried to debug it further, but it seems like the scripts running inside actors block every other core for some reason. I’ll continue debugging tomorrow.

I’ve tried debugging it further, but I found no fix; could this be a bug in Roblox’s multithreading?

Still no fix found. Can anyone help me, or am I doing something fundamentally wrong with Actors?

Could this be the problem,

		if x % 5 == 0 then
			task.wait()
		end

making it jump to serial instantly? How else am I supposed to use task.wait?

You should avoid task.wait in actors

1 Like

Is there any way to fix this? Adding task.desynchronize() after the task.wait() makes it 200 ms slower, and wait() switches to serial Lua the same as task.wait(); Heartbeat may work. How can I prevent my loops from crashing without waits?

You still want task.desynchronize() at the start; try just removing the wait. I have seen other instances of parallel Luau not stalling under high CPU usage. If it does stall and kill the thread, try splitting your chunks into smaller pieces and distributing them across more actors.
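Roughly, assuming the same loop shape as your snippet:

task.desynchronize() -- once, at the start of the job

for x = 1, chunkSize.X + 1 do
	-- ...column generation, with no task.wait() in the hot loop...
end

-- if a yield ever becomes truly unavoidable, note that task.wait()
-- resumes the thread in serial, so re-enter parallel right after it:
-- task.wait()
-- task.desynchronize()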

1 Like

I’ve changed the size of chunks to 6x320x6; however, it still seems like the client has a ping of 4000 when generating chunks. Does that mean one of the actors is on the main thread?

Also, it takes 0.06 seconds now to generate a chunk, and ~0.8 seconds to generate 16 chunks, but shouldn’t it take ~0.1 seconds, since it generates all the chunks at the same time? Or am I thinking about it wrong?

Generating this many parts will still cause a high receive rate and thus high ping; the server has to update the client about each part you are modifying. If all the chunks are in separate actors, they should run at the same time, assuming you have the CPU cores for it.

1 Like

I’m not sending the parts to the client; this is just for generating chunk data. After all of this generation it sends parts to the client, but what I want to optimise is the chunk generation.

Had the same issue and couldn’t find a fix. I have a nice algorithm, but it takes 6 minutes to calculate the geometry, with or without coroutines.

1 Like

The first thing you should do, if you haven’t yet, is open the MicroProfiler and see whether the tasks are actually being distributed among different threads. It should look something like this, where there are multiple layers of processes:

If you don’t see something similar to that, continue reading. Otherwise, it’s probably a problem with parallelization itself, and I’ll still help you with that.


Calling task.desynchronize() inside the module is a red flag. Parallelization is on a per-actor basis, and using it inside the module itself could be the problem. I’ve used Parallel Luau before, and the way I did it was by putting the entire to-be-parallelized code inside the actor script, not in a module. This also means that the thread is desynchronized in the script directly, not from a module function.

For comparison, here’s my own multithreaded terrain generator that has a similar framework to yours:

And at the bottom is a snippet of the actor scripts in my project:

...
instructEvent.Event:ConnectParallel(function(instruction: taskInstruction)
	print(`Processor {id} received a new task`)
	local encodeds: {string} = {}
	
	for k, corner: Vector3 in instruction.corners do
		local voxelDim: number = wCFG.lodVoxelDims[k]
		local chunkDim: number = wCFG.lodChunkDims[k]
		local scalarField: Tensor<boolean> = Tensor.new() --these are random names lol
		local surfaceMap: Tensor<boolean> = Tensor.new()
		
		local function getVoxel(x: number, y: number, z: number): boolean
			local filled: boolean? = scalarField:get(x, y, z)
			if filled == nil then --if the voxel doesn't exist, make one and store it
				filled = perlin.noiseBinary(corner.X+x*voxelDim, corner.Y+y*voxelDim, corner.Z+z*voxelDim)
				scalarField:set(x, y, z, filled::boolean)
			end
			return filled::boolean
		end
		
		local function isSurfaceVoxel(x: number, y: number, z: number): boolean --just check to see if the voxel is next to air (nothing)
			if not getVoxel(x+1, y, z) then return true end
			if not getVoxel(x-1, y, z) then return true end
			if not getVoxel(x, y+1, z) then return true end
			if not getVoxel(x, y-1, z) then return true end
			if not getVoxel(x, y, z+1) then return true end
			if not getVoxel(x, y, z-1) then return true end
			return false
		end
			
		for x = 1, chunkDim do
			for y = 1, chunkDim do
				for z = 1, chunkDim do
					if getVoxel(x, y, z) and isSurfaceVoxel(x, y, z) then
						surfaceMap:set(x, y, z, true)
					end
				end
			end
		end
		encodeds[k] = tensorEncoder.encodeBinaryTensor(surfaceMap)
	end
	
	local _r: taskResult = {
		id = instruction.id,
		encodedTensors = encodeds
	}
	returnEvent:Fire(_r)
end)
...

As you can see, everything is in the actor script, aside from the bare essential modules like the Perlin noise generator and the Tensor data structure I’m using to store voxels.

Next point:

This can also be a problem, if it wasn’t already. You’re not supposed to use task.wait in desynchronized threads, because of how the task scheduler works: it will involuntarily put the thread back into synchronized mode. If your parallelized code depends on task.wait to function, then I’m afraid you’ll have to rewrite it so that it doesn’t.

@ClientCooldown also has a good point; you should minimize changing the synchronization state of the threads as that can be a big bottleneck in your code. Perform all the parallel tasks together at the same time, temporarily store their results, and then resynchronize the thread and do what you need to do in serial.
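In skeleton form, the shape I’m suggesting (event and function names are placeholders):

workEvent.Event:ConnectParallel(function(job)
	-- parallel phase: pure computation only; no Instance writes, no task.wait
	local results = computeChunkData(job) -- stand-in for the heavy generation

	task.synchronize() -- one switch back to serial, at the very end

	-- serial phase: anything that touches the DataModel or fires results
	resultsEvent:Fire(results)
end)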

3 Likes

I’m running it on the server, so I have a limited MicroProfiler, but from what I remember from debugging a few days ago, there was only one process layer, and it said “12 Scripts” (or something like that). I had 12 actors, so I assumed it was that, but I’ll look again tomorrow and tell you.

I’ve fixed this and removed all task.waits from my code.

I’ve moved all of the code, including my Perlin noise module, into the actor script, but it seems to perform the same. Note that each actor already had an independent module script, so they were not sharing the same module, but I still decided to try the usual way.

I tried to implement this. I’m not sure if I did it correctly, but I will post the new code and the server MicroProfiler tomorrow. Thank you for your help!