Parallel Luau Developer Preview

Even if I need to spam scripts, it’s possible to have a module that interfaces with multithreading, so that I have just one example script that gets cloned to spawn jobs by requiring modules. The biggest hurdle for me would be changing my game’s data system so that some data can be loaded by different threads (right now it requires modules/data once when the server replicates it, then dereferences it).

I think I would spawn threads using actors until require starts returning a cached value within an actor’s script (not all at once, just as jobs are created). You could think of each thread almost like a client without FilteringEnabled: there’s no reason to run the same script on the same client multiple times when its jobs could be handled by a single script, especially if your game is already modular. From there I would set up a lightweight job manager on each thread, and allocate jobs to each thread by firing BindableEvents.
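The allocation side of such a job manager can be sketched independently of the engine. This is only a minimal sketch, not the poster's implementation: `JobManager`, `dispatch`, `complete` and the `send` callback are hypothetical names. In Roblox, `send` might fire a BindableEvent that a script inside the target Actor has connected with ConnectParallel.

```lua
-- Minimal job manager sketch (all names here are hypothetical).
-- `send(worker, jobId, jobName, payload)` delivers a job to a worker; in
-- Roblox it might fire a BindableEvent owned by that worker's Actor.
local JobManager = {}
JobManager.__index = JobManager

function JobManager.new(workers, send)
	local pending = {}
	for index = 1, #workers do
		pending[index] = 0
	end
	return setmetatable({
		workers = workers,
		send = send,
		pending = pending, -- unfinished jobs per worker
		owner = {}, -- jobId -> worker index
		nextJobId = 1,
	}, JobManager)
end

-- Hand the job to the worker with the fewest unfinished jobs.
function JobManager:dispatch(jobName, payload)
	local best = 1
	for index = 2, #self.workers do
		if self.pending[index] < self.pending[best] then
			best = index
		end
	end
	local jobId = self.nextJobId
	self.nextJobId = jobId + 1
	self.pending[best] = self.pending[best] + 1
	self.owner[jobId] = best
	self.send(self.workers[best], jobId, jobName, payload)
	return jobId
end

-- Call this when a worker reports that a job has finished.
function JobManager:complete(jobId)
	local index = self.owner[jobId]
	if index then
		self.pending[index] = self.pending[index] - 1
		self.owner[jobId] = nil
	end
end

-- Usage: two placeholder workers; `send` just prints what it would deliver.
local manager = JobManager.new({ "actorA", "actorB" }, function(worker, jobId, jobName)
	print(worker, jobId, jobName)
end)
manager:dispatch("animateTree")
manager:dispatch("animateFlame")
```

Choosing the least-loaded worker rather than plain round-robin keeps one slow job from piling more work onto the same thread, at the cost of tracking completions.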

Here are a few use cases that would benefit from multithreading:

  • Simple, predictable CFrame animations. This includes things like tree animations, candle flames, custom water graphics, cloth movement, etc. A BulkMoveTo equivalent for bones would really help with performance for these uses. This is the first thing I would multithread.
  • NPC AI and character physics on the server. NPC responsiveness in relation to players might take a hit if its inputs are off by a few frames though.
  • Skeletal animation systems in Lua. I would need to make a lot of changes in my game to multithread the game’s current characters, but it has a lot of potential. My main worry would be that multithreading local character graphics might cause events or animations to be off by a frame. This would also benefit from a bulk method for bones.

Multithreading can really help make your systems scalable. For games with huge worlds and high player counts, the server has the most to gain.


So, if the microprofiler is to be trusted, the current build will only utilize a maximum of 8 logical cores, even though more are available.
My system uses a Ryzen 3600 so I should have 12 logical cores available to me.

This is what the microprofiler looks like when starting an arbitrary (in this case 17) number of equal synthetic workloads:

Some things of note:

-All 17 tasks got completed in 8 threads (again, 12 logical cores are available to my system). This is most likely a bug.
-(Not seen in the screenshot, but tested) Apparently in Play Solo one fewer logical core is available to “Parallel” Lua because one thread gets “blocked” by the renderer (maybe this is a bug caused by using RenderStepped?)
-Within a frame, “Parallel” and “Normal” Lua always run separately from each other, and “Parallel” Lua will always run first
-“Parallel” Lua will always run at the start of a frame, whereas “Normal” Lua will always run afterwards
-As such, using task.desynchronize() will give you a delay of 1 frame, whereas task.synchronize() will complete in the same frame
-“Parallel” Lua is only as fast as its slowest thread, so splitting your workload into more than 8 chunks is actually helpful, as it reduces the idle time of your fastest threads, though with diminishing returns.
-For the same reason, though, splitting your workload into more chunks can, in rare cases, decrease your overall performance (e.g. splitting the workload into 9 equal chunks is probably slower than splitting it into 8)
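
The chunk-count trade-off in the last two points can be illustrated with a toy cost model. This is only a sketch under stated assumptions: chunks are exactly equal, there is no scheduling overhead, and the 8-thread limit is the one observed above. `estimateFrameTime` is a hypothetical helper, not an engine API.

```lua
-- Estimated parallel-phase time when `totalWork` (arbitrary time units) is
-- split into `chunks` equal chunks that run on `threads` worker threads.
-- Assumption: no scheduling overhead; chunks run in "waves" of `threads`.
local function estimateFrameTime(totalWork, chunks, threads)
	local waves = math.ceil(chunks / threads)
	return waves * (totalWork / chunks)
end

-- Compare a few chunk counts for 8 units of work on 8 threads.
for _, chunks in ipairs({ 8, 9, 16, 17 }) do
	print(chunks, estimateFrameTime(8, chunks, 8))
end
```

In this model 9 equal chunks take about 1.78 times as long as 8, because the ninth chunk needs a second wave while seven threads sit idle. With perfectly equal chunks, going past 8 never beats exactly 8 here; the benefit of finer chunks only shows up when chunk durations are uneven, which this model does not capture.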


Is there any chance that in the future we will be able to use GPU power?
As far as I know the GPU is only used for rendering, but we could do some heavy tasks that require a much larger amount of cores than a typical CPU has.


What’s the overhead for having parallel code execute on-demand? task.desynchronize/task.synchronize seem to behave similarly to wait(), spawn(), delay(), etc.

I know this system is designed for optimizing “doing many unrelated small tasks in parallel” over “splitting one big task into multiple parallel threads”, but when doing frame-perfect code, the latter is exactly the kind of thing I was hoping to optimize with Parallel Luau.

Seems like this sort-of works for things like state machine NPCs, but calling task.synchronize() to safely affect state in other modules will defer my code affecting that module state to the next frame’s task resume area (the same block of time that resumes code when you call wait(), spawn(), etc.). That’s unacceptable even for code involving NPCs/state machines, at least for the way I’m structuring my game!

Resuming asynchronous threads that call task.synchronize() should happen immediately after every block of time that parallel code executes in, right? I understand why task.desynchronize() might need to defer execution, but why should task.synchronize() do the same thing?


Ignore the lengths in the microprofiler; I was printing out a message to the console for each thread after they synchronized, so the print statements take a long time, but they happen in the same thread as parallel code called on Heartbeat, immediately after task.synchronize() is called.

Place file: StringReverseMultithreadedTest.rbxl (23.2 KB). I was attempting to write a multithreaded string reverse algorithm just to learn how to use the API. All the debug stuff (print/profilebegin/profileend) can be removed to show what I was intending to code here, though.

Bottom line is… seems like the only output possible through parallel code comes out much, much later than needed—during the next frame or later.

Maybe I’m missing something, but why can’t task.synchronize() seem to be able to resume my code immediately after the block of parallel execution ends? Is there any particular overhead to this? Is there any way I can get immediate output from multithreaded code (synchronizing then reading output)?

This seems pretty useless to apply to a game, unless you’re fine with your output being deferred a few frames. Most of the code I want to optimize needs to complete and return output immediately, without yielding.


Should be the same frame, not next frame. If right now it runs on the next frame, this is a bug - not the intended behavior.
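
One rough way to check which behavior a given build shows is to time the gap across task.synchronize(). This is a minimal sketch, assuming the script is parented to an Actor and runs inside Roblox; the busy loop is a hypothetical stand-in for real parallel work.

```lua
-- Assumption: this script is parented to an Actor, so ConnectParallel is allowed.
local RunService = game:GetService("RunService")

RunService.Heartbeat:ConnectParallel(function()
	local startedAt = os.clock()

	-- Hypothetical stand-in for real parallel work.
	local total = 0
	for i = 1, 100000 do
		total = total + i
	end

	-- Switch back to the serial phase before touching shared state.
	task.synchronize()

	-- Prints every frame, so use this only for a quick check.
	local elapsedMs = (os.clock() - startedAt) * 1000
	print(string.format("serial resume after %.2f ms", elapsedMs))
end)
```

At 60 frames per second a frame lasts roughly 16.7 ms, so a gap well below that suggests the serial resume happened in the same frame, while a gap of a full frame or more suggests it was deferred.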


Ah, good to know! I hope the place file I linked above is useful in fixing this issue.

Sidenote—does this have anything to do with the fact that even though I’m calling ConnectParallel on a utility module parented to an Actor instance, the actual callback I pass to that module is not parented to an Actor? Or would that not affect anything?

Found another bug: passing a function to a BindableEvent that is connected in parallel will produce a crash 100% of the time, as long as the script passing the function to the event and the script connecting to it are running in separate Actors (which makes sense, I suppose).

Script1:

local event = workspace.BindableEventTest
local crash = function() end -- any function value passed as an argument
wait() -- give Script2 time to connect first
event:Fire(crash)

Script2, running in a separate Actor from Script1:

local event = workspace.BindableEventTest
-- The parallel connection receives the function fired by Script1
event.Event:ConnectParallel(function() end)

Added reflection and surface light support to my path tracer. Without parallel luau and the recent luau optimizations, this would’ve taken over 24 hours to render. Instead it only took about 3.

1250 samples per pixel, then denoised externally.

Edit: Here’s another render, only 600 samples this time since the scene is much more complex and thus takes longer to render.


One day, this will be rendered in real time.


Boy, I’m gonna have fun once this is released…

Fractals, anyone?

It’s amazing how this can run in realtime! Really shows what multithreaded Luau is capable of.


I’m impressed, a multi-threaded render engine, written in Lua inside ROBLOX?
I’m astonished, that’s amazing.

This is a step in the right direction, although I’m struggling to see whether it’d be of any use for my game, which uses a core loop to update tons of objects per frame (it creates a new coroutine for each object so each object can be updated independently). The player can move certain objects around on the map, and it also updates static objects on the map per frame as well. The issue currently is that script activity tends to go way up due to everything being run “on the queue”, with some delays as well, rather than that work being split up accordingly. It’s an absolute nightmare trying to optimize a game like this.

Do you see there being a light in the future for something like this?


I’m thinking of Kolmogorov Complexity

Does this mean that games can use more cores?

Wait!!! This came out in December and my what’s new banner just lit up! Jaw drop! So is it close to being a true feature or are there still gobs of testing to be done? I will need a ton of help understanding how to take advantage of multithreading.

If done correctly, could we get away from StreamingEnabled in places with like a million parts? Sorry, this is a long thread. It will take a while to read every comment.

StreamingEnabled is meant to stream in/out parts for the client to see and limit the amount of RAM needed at a given time.
Parallel Lua lets you run multiple calculations on the CPU at the same time (instead of doing 1+1 then 1+2, it does 1+1 and 1+2 at the same time). If you have thousands of moving parts controlled by scripts at the same time, using Parallel Lua can greatly increase your performance.
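
To make the "thousands of moving parts" case concrete, one common first step is dividing the items into contiguous index ranges, one per worker. This is a minimal sketch in plain Lua; `splitRange` is a hypothetical helper, and how each range reaches a worker (for example, a script inside an Actor) is left out.

```lua
-- Split `count` items into at most `workers` contiguous, near-equal ranges.
-- Returns a list of { first = ..., last = ... } tables with 1-based indices.
local function splitRange(count, workers)
	local ranges = {}
	local base = math.floor(count / workers)
	local extra = count % workers -- the first `extra` ranges get one more item
	local first = 1
	for worker = 1, workers do
		local size = base
		if worker <= extra then
			size = size + 1
		end
		if size > 0 then
			ranges[#ranges + 1] = { first = first, last = first + size - 1 }
			first = first + size
		end
	end
	return ranges
end

-- Usage: 10 parts across 4 workers gives ranges of 3, 3, 2 and 2 items.
for _, range in ipairs(splitRange(10, 4)) do
	print(range.first, range.last)
end
```

Each worker would then update only the parts in its own range, so no two workers touch the same part.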


Got the terrain colors!


Just began experimenting with parallel lua, it’s super cool! I’ve noticed there are some functions/methods missing that I’d love to be whitelisted for use within parallel lua functions.
One specifically is ScreenPointToRay - it would make something like raycasting (such as in your example) much easier.

It’s possible to do it by manually calculating the 2d point to 3d position, however that is quite difficult. Since it is possible to do manually, I see no issue with whitelisting it for parallel lua.
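
For reference, the manual calculation can be sketched as follows. This is a minimal sketch, not the engine's implementation: it assumes a standard perspective camera that looks down -Z with +Y up, a vertical field of view in degrees, and no GUI inset. `viewportPointToDirection` is a hypothetical helper name.

```lua
-- Camera-space unit direction for a viewport pixel (x, y).
-- Assumptions: perspective camera looking down -Z with +Y up, vertical
-- field of view in degrees, pixel (0, 0) at the top-left, no GUI inset.
local function viewportPointToDirection(x, y, viewportWidth, viewportHeight, verticalFovDegrees)
	local tanHalfFov = math.tan(math.rad(verticalFovDegrees) / 2)
	local aspect = viewportWidth / viewportHeight

	-- Map pixels to [-1, 1]; screen Y grows downward, so flip it.
	local ndcX = 2 * x / viewportWidth - 1
	local ndcY = 1 - 2 * y / viewportHeight

	local dx = ndcX * aspect * tanHalfFov
	local dy = ndcY * tanHalfFov
	local dz = -1

	local length = math.sqrt(dx * dx + dy * dy + dz * dz)
	return dx / length, dy / length, dz / length
end

-- Usage: the center of an 800x600 viewport points straight ahead.
print(viewportPointToDirection(400, 300, 800, 600, 70))
```

In Roblox, the camera-space direction could then be rotated into world space with `camera.CFrame:VectorToWorldSpace(Vector3.new(dx, dy, dz))`, using the camera position as an approximate ray origin. The built-in method may differ in details such as the ray origin and inset handling, so treat this as an approximation.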

Please direct all feedback to Parallel Lua Beta - this thread will be closed and the old preview builds are going to be deleted in favor of the official beta.