Parallel Luau Developer Preview

Oops yeah sorry, this is a regression from memory leak fixes in V3 :sweat: There’s going to be a V4 after all!

5 Likes

A new version (V4) of the builds has been uploaded, and the links in the original post have been updated.

Changes:

  • Fix random crashes on exit / when game session is stopped
  • Add the task library to autocomplete, along with proper type information
  • Unlock Part.GetMass, IsGrounded, GetConnectedParts, GetRootPart, GetJoints for use from parallel code
  • Increase microprofile render buffer size to make massive multithreaded profiles a bit easier to look at in Studio

Please note that the build version was set to 0.459 instead of 0.460 by our build process; here’s the latest build’s version - don’t be alarmed!

[image: screenshot of the build version]

This is going to be the last build of the year. Please continue experimenting with the builds and sharing feedback in this thread but be ready for that feedback to be ignored for the next two weeks as we get some needed (and dare I say deserved?) rest.

Hope y’all have a good holiday break and see you in 2021!

16 Likes

I made a system a few years ago that distributes Roblox Studio rendering workloads to local clients in my studio test-server. It gives me a ton more performance.

I rendered this image of a map at 512x512 with 16 ambient-occlusion samples and 16 shadow samples from the sun (the only light source).

Then I re-rendered this using my multi-client system and plotted the data.

I’ve got a 12c/24t 4.1GHz Ryzen 3900X and 32GB of 3200MHz RAM, and I’m only able to utilize half of the cores with multi-threading (green dot on the graph). I also run into some sort of bottleneck when I run multiple clients - after 8 or so cores, the performance gains really start to diminish. I’d like to note that across all of these tests, CPU usage never exceeded 45%.

Learnings:

-Multi-threading doesn’t scale as well as I’d hoped. I had a 6x performance gain in this benchmark, but in a lot of other benchmarks it only gave a 4-4.5x improvement.

-Running a ton of Studio clients has diminishing returns because of some sort of bottleneck. Perhaps RAM speed or something?

I’m now tempted to see what happens if I combine multi-threading with multi-clients :eyes:

24 Likes

We’ve never profiled this on an AMD system so we may be misconfiguring the scheduler. You also might be running into some other bottleneck, hard to be sure. We’ll look into this next year. A microprofile capture or two would be welcome though.

17 Likes

Wow! I can see the difference between the non-multithreaded version of the voxel terrain generation in BLOX and this version. I will definitely consider trying this build of Studio out.

1 Like

Got bored so I upgraded the rayTracer example to support shadows, fog, light emission and raytraced reflections.

It’s cool that we can run these things in real time now

38 Likes

Cool Stuff.
I made something like that raytracer a little under a month ago.

Right now I’m working on parts in viewport frames; I’ma see if I can port my code to run in parallel

Really excited for this to come out into beta - it opens up so many possibilities and enables lightning-fast processing for multiple operations. :astonished:

1 Like

Out of curiosity, has anyone been able to practically use this for anything that isn’t contrived or ray-based? I know that @Elttob had to scrap their plans to multithread their terrain generation because of the cost of data transfer (or at least that’s what they said on Twitter) and I know that people like @Tomarty aren’t going to get very much use out of it because of its relationship with single-script setups… So what are people actually using this for?

I mean heck, I can’t even use this to speed up my plugin without refactoring it entirely because it’s set up to use one script for everything. I’m not usually one to dismiss new features just because I don’t like their design but… Does this actually fulfill anyone’s use case?

8 Likes

Oh heck yea, multi-threading?
I’ve been waiting for this ever since I learned it was a thing in coding. If I can use this to handle lots of game objects and entities and such, it will be very useful to me.

This is actually a very good point. I think they mentioned above that they are working on reducing the cost of data transfer, so these cases could potentially become viable. I’m planning to try to implement optimised matrix multiplication myself by the end of the year, which I think will be useful for implementing some ML techniques on Roblox - I believe this has been tried before, but not yet with multithreading.
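As a rough sketch of how that matrix multiplication might be partitioned across actors: each worker could be handed a band of rows of A to multiply against B. Everything here - the Job/Result events, the plain arrays-of-rows matrix format, the band split - is an assumption for illustration, not a real API.

```lua
-- Hypothetical worker script placed inside an Actor instance.
-- A and B are plain Lua matrices (arrays of row arrays) sent via the event.
local jobEvent = script.Parent.Job -- assumed BindableEvent under the Actor

jobEvent.Event:ConnectParallel(function(A, B, firstRow, lastRow)
	local n = #B      -- inner dimension (columns of A == rows of B)
	local m = #B[1]   -- columns of B
	local C = {}
	for i = firstRow, lastRow do
		local row = table.create(m, 0)
		local Ai = A[i]
		for k = 1, n do
			local a = Ai[k]
			local Bk = B[k]
			for j = 1, m do
				row[j] += a * Bk[j]
			end
		end
		C[i] = row
	end

	-- Hand the finished band back on the serial timeline.
	task.synchronize()
	script.Parent.Result:Fire(firstRow, lastRow, C) -- assumed result event
end)
```

Note that firing tables through BindableEvents copies them, so each worker pays for its own copy of A and B - exactly the data-transfer cost discussed above.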

From my understanding of how the actors work so far, I can definitely see some encoding/decoding jobs becoming much faster - object serialisation, for example, which is useful for state saving in something like a building game. Even without specific tasks being optimised for multithreading, if most scripts are put inside actors there will likely be significant speed improvements even for scripts outside of actors, since any processing that can be moved onto another thread frees up time on the main thread. I’d also expect anything using some form of custom physics to improve significantly: beyond raycasts themselves being multithreaded, a simulation like a hover bike often performs thousands of calculations per second, or even per frame, and all of these could be parallelised.

From what I’ve read online, another advantage could be that the UI stays responsive even while other calculations are taking up the CPU.

EDIT:

Just to add two examples from the front page: the RTS games “Rise of Nations” and “The Conquerors 3” could both very obviously benefit greatly from multithreading - all units have to be moved on the server to avoid exploits, so almost all of the significant lag comes from many units being in combat simultaneously. This is a non-raycast use case, mostly just basic vector math across thousands of parts. I imagine even just moving the distance checks onto a different thread would massively lower the performance cost.
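A rough sketch of that idea, as a hypothetical script inside an Actor: do the read-only distance checks in the parallel phase and synchronize only to apply the results. `workspace.Units`, the engage range, and `engage` are all stand-ins for game-specific things.

```lua
-- Hypothetical Actor script: parallel distance checks, serial combat updates.
local RunService = game:GetService("RunService")

local units = workspace.Units:GetChildren() -- assumed folder of unit parts
local ENGAGE_RANGE = 50

RunService.Heartbeat:ConnectParallel(function()
	-- Parallel phase: only reads part positions, which is safe here.
	local engagements = {}
	for i = 1, #units do
		for j = i + 1, #units do
			local d = (units[i].Position - units[j].Position).Magnitude
			if d < ENGAGE_RANGE then
				table.insert(engagements, {units[i], units[j]})
			end
		end
	end

	-- Serial phase: mutating combat state requires synchronizing first.
	task.synchronize()
	for _, pair in ipairs(engagements) do
		engage(pair[1], pair[2]) -- placeholder for the game's combat logic
	end
end)
```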

3 Likes

Even if I need to spam scripts, it’s possible to have a module that interfaces with the multithreading, so I just have one example script that gets cloned to spawn jobs by requiring modules. The biggest hurdle for me would be changing my game’s data system so that some data can be loaded by different threads (right now it requires modules/data once when the server replicates them, then dereferences them).

I think I would spawn threads using actors until require starts returning a cached value within an actor’s script (not all at once, just as jobs are created). You could think of each thread almost like a client without FilteringEnabled: there’s no reason to run the same script on the same client multiple times when its jobs could be handled by a single script, especially if your game is already modular. From there I would set up a lightweight job manager on each thread, and allocate jobs to threads by firing BindableEvents.
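The setup described above might look something like the following sketch. Every name here (the WorkerScript template, the Job event, the thread count) is illustrative, not a real API.

```lua
-- Sketch: build a small pool of Actor-hosted workers, then dispatch jobs
-- round-robin by firing each actor's BindableEvent.
local ServerStorage = game:GetService("ServerStorage")
local ServerScriptService = game:GetService("ServerScriptService")

local workerTemplate = ServerStorage.WorkerScript -- hypothetical template script
local NUM_THREADS = 8

local jobEvents = {}
for i = 1, NUM_THREADS do
	local actor = Instance.new("Actor")

	local job = Instance.new("BindableEvent")
	job.Name = "Job"
	job.Parent = actor

	local worker = workerTemplate:Clone()
	worker.Parent = actor
	actor.Parent = ServerScriptService

	jobEvents[i] = job
end

-- Allocate jobs to threads as they are created.
local nextWorker = 1
local function dispatch(payload)
	jobEvents[nextWorker]:Fire(payload)
	nextWorker = nextWorker % NUM_THREADS + 1
end
```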

Here are a few use cases that would benefit from multithreading:

  • Simple, predictable CFrame animations. This includes things like tree animations, candle flames, custom water graphics, cloth movement, etc. A BulkMoveTo equivalent for bones would really help with performance for these uses. This is the first thing I would multi-thread.
  • NPC AI and character physics on the server. NPC responsiveness in relation to players might take a hit if its inputs are off by a few frames though.
  • Skeletal animation systems in Lua. I would need to make a lot of changes in my game to multithread the game’s current characters, but it has a lot of potential. My main worry would be that multithreading local character graphics might cause events or animations to be off by a frame. This would also benefit from a bulk method for bones.

Multithreading can really help make your systems scalable. For games with huge worlds and high player counts, the server has the most to gain.
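The first use case above - simple, predictable CFrame animation - could be sketched like this: compute the CFrames in the parallel phase, then apply them all at once with BulkMoveTo after synchronizing. The part selection and the sway function are made up for the example.

```lua
-- Hypothetical Actor script: animate "tree" parts with a simple sway.
local RunService = game:GetService("RunService")

local parts = workspace.Trees:GetChildren() -- assumed folder of parts
local basePivots = {}
for i, part in ipairs(parts) do
	basePivots[i] = part.CFrame
end

RunService.Heartbeat:ConnectParallel(function()
	-- Parallel phase: pure math, no writes to the DataModel.
	local t = os.clock()
	local cframes = table.create(#parts)
	for i = 1, #parts do
		local sway = math.sin(t + i) * 0.05 -- made-up sway function
		cframes[i] = basePivots[i] * CFrame.Angles(0, 0, sway)
	end

	-- Serial phase: apply every move in a single bulk call.
	task.synchronize()
	workspace:BulkMoveTo(parts, cframes)
end)
```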

5 Likes

So if the microprofiler is to be trusted, the current build will only utilize a maximum of 8 logical cores even though more are available.
My system uses a Ryzen 3600, so I should have 12 logical cores available to me.

This is what the microprofiler looks like when starting an arbitrary (in this case 17) number of equal synthetic workloads:

Some things of note:

-All 17 tasks were completed across only 8 threads (again, 12 logical cores are available on my system). This is most likely a bug.
-(Not seen in the screenshot, but tested) In Play Solo, one fewer logical core is apparently available to “Parallel” Lua, because one thread gets “blocked” by the renderer (maybe this is a bug caused by using RenderStepped?)
-Within a frame, “Parallel” and “Normal” Lua always run separately from each other, and “Parallel” Lua always runs first
-“Parallel” Lua always runs at the start of a frame, whereas “Normal” Lua always runs afterwards
-As such, task.desynchronize() incurs a delay of one frame, whereas task.synchronize() completes in the same frame
-“Parallel” Lua is only as fast as its slowest thread, so splitting your workload into more than 8 chunks is actually helpful, as it reduces the idle time of your fastest threads, though with diminishing returns.
-For the same reason, in rare cases splitting your workload into more chunks can decrease overall performance (e.g. splitting the workload into 9 equal chunks is probably slower than splitting it into 8)
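The over-partitioning point above can be sketched as a dispatcher that splits a workload into more chunks than there are threads. `jobEvents` is assumed to be an array of BindableEvents, one per Actor-hosted worker, and the counts are illustrative.

```lua
-- Over-partition a workload so fast threads handle extra chunks
-- instead of idling while the slowest thread finishes one big chunk.
local ITEMS = 10000
local THREADS = 8
local CHUNKS = THREADS * 4 -- more chunks than threads reduces idle time

local chunkSize = math.ceil(ITEMS / CHUNKS)
for chunk = 1, CHUNKS do
	local first = (chunk - 1) * chunkSize + 1
	local last = math.min(chunk * chunkSize, ITEMS)
	-- Round-robin assignment of [first, last] ranges to the workers.
	jobEvents[(chunk - 1) % THREADS + 1]:Fire(first, last)
end
```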

4 Likes

Is there any chance that in the future we will be able to use GPU power?
As far as I know the GPU is only used for rendering, but we could run some heavy tasks that need a much larger number of cores than a typical CPU has.

6 Likes

What’s the overhead of having parallel code execute on demand? task.desynchronize/task.synchronize seem to behave similarly to wait(), spawn(), delay(), etc.

I know this system is designed for optimizing “doing many unrelated small tasks in parallel” over “splitting one big task into multiple parallel threads”, but when doing frame-perfect code, the latter is exactly the kind of thing I was hoping to optimize with Parallel Luau.

Seems like this sort of works for things like state-machine NPCs, but calling task.synchronize() to safely affect state in other modules defers my code to the next frame’s task-resume window (the same block of time that resumes code when you call wait(), spawn(), etc.). That’s unacceptable even for code involving NPCs/state machines, at least for the way I’m structuring my game!

Resuming asynchronous threads that call task.synchronize() should happen immediately after each block of time that parallel code executes in, right? I understand why task.desynchronize() might need to defer execution, but why should task.synchronize() do the same?


Ignore the lengths in the microprofiler; I was printing a message to the console for each thread after it synchronized, so the print statements take a long time - but they happen on the same thread as the parallel code connected to Heartbeat, immediately after task.synchronize() is called.

Place file: StringReverseMultithreadedTest.rbxl (23.2 KB) . I was attempting to do a multithreaded string reverse algorithm just to learn how to use the API. All the debug stuff (print/profilebegin/profileend) can be removed to show what I was intending to code here though.

Bottom line is… it seems like the only output possible through parallel code comes out much, much later than needed - during the next frame or later.

Maybe I’m missing something, but why can’t task.synchronize() seem to be able to resume my code immediately after the block of parallel execution ends? Is there any particular overhead to this? Is there any way I can get immediate output from multithreaded code (synchronizing then reading output)?

This seems pretty useless to apply to a game unless you’re fine with your output being deferred a frame or more. Most of the code I want to optimize needs to complete and return output immediately, without yielding.
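For concreteness, the pattern I’m asking about is roughly the following (doExpensiveWork and the Output value are placeholders, not real APIs):

```lua
-- Parallel work on Heartbeat, with a serial write-back after synchronizing.
local RunService = game:GetService("RunService")

RunService.Heartbeat:ConnectParallel(function()
	local result = doExpensiveWork() -- placeholder parallel computation

	task.synchronize()
	-- The question: should this serial section run later in the SAME frame
	-- as the parallel phase above, or get deferred to the next frame?
	workspace.Output.Value = result
end)
```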

5 Likes

Should be the same frame, not next frame. If right now it runs on the next frame, this is a bug - not the intended behavior.

2 Likes

Ah, good to know! I hope the place file I linked above is useful in fixing this issue.

Sidenote - does this have anything to do with the fact that even though I’m calling ConnectParallel on a utility module parented to an Actor instance, the actual callback I pass to that module is not parented to an Actor? Or would that not affect anything?

Found another bug: passing a function to a BindableEvent that is connected in parallel produces a crash 100% of the time, as long as the script firing the event and the script connecting to it are running in separate actors (which makes sense, I suppose).

Script1:

local event = workspace.BindableEventTest
local crash = function() end
wait()
event:Fire(crash) -- passing a function across actors triggers the crash

Script2, running in a separate Actor from Script1:

local event = workspace.BindableEventTest
event.Event:ConnectParallel(function() end)

1 Like

Added reflection and surface light support to my path tracer. Without Parallel Luau and the recent Luau optimizations, this would’ve taken over 24 hours to render. Instead it took only about 3.

1250 samples per pixel, then denoised externally.

Edit: Here’s another render, only 600 samples this time since the scene is much more complex and thus takes longer to render.

64 Likes

One day, this will be rendered in real time.

10 Likes