Parallel Luau Developer Preview

My game is 100% ModuleScripts, with 1 Script and 1 LocalScript that initialize the game

I feel most people who come from a computer engineering background will follow the same approach, including me.

From my point of view, the devs on Roblox who would put in the extra effort to implement parallel processing in their games are usually already familiar with industry-standard multithreading and multiprocessing, which are drastically different from this “instance-based” parallel computing structure.

So I’m curious what the intuition was to make parallel computing this way in the first place…

I was excited to run this on my own game too, but after reading the documentation I realized it would be fairly impractical to implement, given that my entire codebase is modularized and any architecture is only replicated or scaled at runtime (usually in the form of JSON/metatables that can’t interact with the Actor subtree).

20 Likes

Try out the library I posted above - it should be much more ergonomic to use with a modulescript-heavy code base (I’m currently using it in Blox, for example)

A less instance-centric API would be nice to have in the standard library tho :slightly_smiling_face:

2 Likes

I feel the same way about what I’ve read so far on this API implementation.

The architecture I’ve developed for my current and future projects is the same as yours: a single ServerController (Script) and a single ClientController (LocalScript). Everything else is ModuleScripts.

I’ve always frowned upon a multi-script architecture such as where you have the same script in every NPC or Entity.

Even before ModuleScripts came out I never liked that approach and would instead write global controllers that would control all entities of a certain type (like Systems in an ECS).

This implementation of Parallel Luau seems to take a step backwards and encourage the “Script Per Entity/Model” architecture that a lot of us threw out years ago.

We’ll need to develop a robust Jobs/Tasks Library (I wish they’d just done that in the first place) to wrap around this API to make it cleaner and easier to use.

7 Likes

I’ll just leave this quoted verbatim from our tech. design document:

We’ve considered various designs around upvalue-less functions but they are all rather surprising. It might be that a ModuleScript-centric parallel API is a better fit, smth like task.dispatch(ModuleScript, function-name-to-invoke, arglist) (which would run require(MS)["function"](arglist) in a separate VM). Or maybe some other design is good.

All of these have restrictions and all of these can be implemented in user space, which is why we aren’t starting from it.
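To make the shape of that ModuleScript-centric idea concrete, here is a rough analogy in Python rather than Luau, so it can run outside Studio. It is only a sketch of the "module name + function name + argument list" pattern from the quote, not the actual design: `dispatch` and `_CHILD_SOURCE` are hypothetical names, a separate interpreter process stands in for the separate VM, and the call blocks until the result comes back, which a real dispatch API presumably would not do. Arguments and results cross the boundary as JSON, so they are copied rather than shared.

```python
import json
import subprocess
import sys

# Source run by the child interpreter (the stand-in for a separate VM):
# import the module, look the function up by name, call it with the
# decoded argument list, and print the result as JSON.
_CHILD_SOURCE = (
    "import importlib, json, sys\n"
    "module_name, function_name, raw_args = sys.argv[1:4]\n"
    "fn = getattr(importlib.import_module(module_name), function_name)\n"
    "print(json.dumps(fn(*json.loads(raw_args))))\n"
)


def dispatch(module_name, function_name, args):
    """Call module_name.function_name(*args) in a fresh interpreter.

    Hypothetical helper: everything is serialized to JSON on the way in
    and on the way out, so only plain data can cross the boundary.
    """
    completed = subprocess.run(
        [sys.executable, "-c", _CHILD_SOURCE,
         module_name, function_name, json.dumps(args)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child interpreter exits with an error
    )
    return json.loads(completed.stdout)
```

For example, `dispatch("math", "sqrt", [16.0])` runs `math.sqrt(16.0)` in the child process and returns the decoded result. Note that no closures or upvalues are involved: the caller can only name a function, which is the restriction the quote alludes to.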

25 Likes

Alright, that makes sense :slightly_smiling_face:

The only real pain point with implementing this kind of API ourselves right now is data transfer. Often your only option is to copy the data you need into the actor where it’s needed (usually implicitly, via a BindableFunction or BindableEvent).

Especially if you’re working with large data structures, it can be expensive copying everything over. The overhead of those copying operations is holding me back from using multithreading for many systems in my game.

Do the future plans for shared storage allow for easily and efficiently reading from large data structures without having to perform an expensive copy?

9 Likes

The builds were updated to a new version (you will need to re-download the builds, the links in the original post have been updated).

Changes:

  • CollectionService.GetTagged/GetTags/HasTag are now safe to use from parallel code
  • Workspace.FindPartsInRegion3 is now safe to use from parallel code
  • Improve task dispatch performance and scalability a bit, esp. for systems with low numbers of cores
  • Fix thread memory leaks in ConnectParallel / task.desynchronize
  • Fix VM memory leaks when closing the game

This is likely going to be the last version for this year, unless important issues are reported in which case we might do one final build on Friday. Please be aware that we’re going on Christmas/New Year break starting next week, so we’ll have very limited ability to answer questions and no new builds will happen during that time. However don’t let this dissuade you from exploring this system’s capabilities and limitations :slight_smile:

If you want to know if you’re running the latest build, this is what the version should look like in v3:
[image: version string shown by the v3 build]

13 Likes

This is definitely possible. Another dev I know, who goes by the name of Vilksian, and I have done some separate implementations of exactly that in normal single-threaded Lua. He’s building a Roblox terrain editor version of procedural generation, which is going to be way faster than mine. Mine uses MeshPart triangles and Marching Cubes instead. However, even at decent render distances, my version still runs at 40-60 FPS, and my computer is barely mid tier, so it’s absolutely feasible to get decent speeds out of procedural generation. On top of that, with the implementation of parallel processing units, I’m sure we could get massive boosts to performance.

Can you possibly add :GetTouchingParts() to the whitelist? It’s fine if it’s after break. Querying and processing the world like this is important.

1 Like

Oh yeah, one more question, does :ConnectParallel() reuse Lua threads like :Connect() does without wait()? How about if we call synchronize()?

1 Like

Right now it doesn’t. We may introduce a small thread pool there later.

It seems odd to me that the debugger doesn’t work at all in Actor-parented scripts in this developer build. I understand it not working in parallel or desynchronized units but shouldn’t it still function in normal code?

My best guess is that it just disables it outright because there’s no way to know ahead of time what code will and won’t be multithreaded and afaik the debugger actually changes the bytecode so it’s an either/or situation.

It would certainly be less work to make it work in non-parallel code, it just requires attaching the debugger to all actor-specific VMs and we haven’t quite done that yet either.

Hi, I have a little question which might seem out of place, but since you’re currently looking into raytracing with 2D grid canvases composed of ScreenGui Frames, wouldn’t it be more efficient to have a new instance, something like a UVCanvas to do rendering on? Since we already have UIGradients to do partial pixel / canvas editing, wouldn’t this be something you can just take and make its own instance of? This would open up the door to do raytracing and even shader engine creation on an even larger scale.

Edit: In case you’re worried about the moderation aspect, just like builds in game can’t be moderated until reported, games like Free Draw allow you to draw on a canvas without passing through Roblox’s image moderation system. Or you can introduce it to a handful of developers just like with the video feature.

8 Likes

They sort of answered this question in earlier posts: they can’t have something like this because it would let users bypass the image moderation system. Making the API harder to understand might reduce the chance of that happening, and I hope that would be enough for them to allow it.

2 Likes

So I’ve been doing some interesting testing with a raytracer program I wrote. It’s really simplistic, just casting rays outwards and getting the color of whatever object it hits and setting the frame’s color accordingly. For the multithreaded part, it gets the number of threads I want to generate and divides the pixels up amongst them into groups, and has each thread render them independently. If you’d like to download the place file here it is:

RaytracemultiThread.rbxl (26.1 KB)

I’ve set up the program with an IntValue in the workspace that controls the number of individual tasks I’m dispatching (basically, the number of individual threads I’m generating per frame). Something I’ve found is that with my current system, which dispatches a new set of tasks every single frame, the general trend is that fewer threads means more performance. If I run 100 threads overall, I get anywhere from 0.5 to 5 FPS less on average than if I run 16, which is the thread count my computer is rated for. If I run 4 threads, I get a difference of anywhere from 0 to 0.5 FPS compared to 16 threads. That may sound small, but keep in mind my FPS was extremely stable over the entire experience: keeping my camera still, I would barely get a ±0.1 wiggle. It is definitely enough to be significant. If I ran 1 thread instead of 4, I would find no difference in performance, even though each of the 4 threads is apparently processing only 2500 rays, instead of 10000 rays in one thread.
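The "divide the pixels into groups" step can be sketched like this. It is written in Python rather than Luau so it runs outside Studio, and `split_pixels` is a hypothetical helper, not code from the place file. It only computes which index range each task would own, assuming pixels are numbered from 0.

```python
def split_pixels(pixel_count, group_count):
    """Return one (start, stop) index range per group.

    Group sizes differ by at most one pixel, so no task ends up with a
    noticeably larger share of the rays than the others.
    """
    base, extra = divmod(pixel_count, group_count)
    ranges = []
    start = 0
    for group in range(group_count):
        # The first `extra` groups take one leftover pixel each.
        size = base + (1 if group < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

With the numbers from the post, `split_pixels(10000, 4)` gives four ranges of 2500 pixels each.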

Overall, I think this brings up an interesting question about the intended usage of this API. Should we be using it the way I demonstrated in my place file, where we generate a new thread for every set of processes we run, every time we want to run them? Or is it smarter to create threads through some hacky process, like running a while loop inside a ConnectParallel function, keeping them handy in the background, and maybe pushing objects to a compute buffer for those loops to chew up and spit out? I would have originally assumed the former, because that’s what this API seems to support, and the latter is very hacky and makes very little sense to me. On the other hand, in my test cases I can’t find any good evidence that the former is actually more performant. If anything, at higher thread counts it’s actually less performant, and the difference between multiple threads and a single thread is negligibly small, if it exists at all.
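For comparison, the second option (long-lived workers fed from a buffer) has roughly this structure. This is a generic Python sketch, not Parallel Luau code: `WorkerPool` and `run_frame` are hypothetical names, and Python threads share one interpreter lock, so it shows the structure only and says nothing about the speedup you would see in Roblox.

```python
import queue
import threading


class WorkerPool:
    """Workers are created once and reused for every frame's jobs."""

    def __init__(self, worker_count):
        self.jobs = queue.Queue()     # the buffer workers pull from
        self.results = queue.Queue()  # finished (job_id, value) pairs
        self.threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(worker_count)
        ]
        for thread in self.threads:
            thread.start()

    def _run(self):
        # Each worker loops for the pool's lifetime instead of being
        # recreated every frame; None is the shutdown signal.
        while True:
            job = self.jobs.get()
            if job is None:
                break
            job_id, fn, args = job
            self.results.put((job_id, fn(*args)))

    def run_frame(self, fn, arg_list):
        """Queue one job per argument tuple and return results in order."""
        for job_id, args in enumerate(arg_list):
            self.jobs.put((job_id, fn, args))
        collected = [self.results.get() for _ in arg_list]
        return [value for _, value in sorted(collected, key=lambda pair: pair[0])]

    def close(self):
        for _ in self.threads:
            self.jobs.put(None)
        for thread in self.threads:
            thread.join()
```

The per-frame cost here is pushing jobs and collecting results, rather than creating threads, which is the trade-off the question above is asking about.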

5 Likes

Alright, finally had the time and energy tonight to give this a try.
Here’s the end result!

ParallelTerrain.rbxl (39.1 KB)

It works by managing an object pool of “chunk-generation” actors. Each actor defines a Size and Position for a chunk of terrain to be generated at.

This is how everything is laid out:

  • Configuration for the physical Terrain generation is defined in the TerrainConfig module, under ServerStorage.

  • The ParallelTerrain script contains parameters for controlling the chunk generation and the number of initial actors allocated.
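The object-pool idea described above can be sketched roughly as follows. This is a Python illustration, not the code in the place file: `ChunkWorker` and `ChunkWorkerPool` are hypothetical stand-ins for the chunk-generation actors and whatever manages them, and growing the pool when it runs dry is an assumption on my part.

```python
class ChunkWorker:
    """Stand-in for one chunk-generation actor."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.size = None      # chunk size currently assigned
        self.position = None  # chunk position currently assigned


class ChunkWorkerPool:
    def __init__(self, initial_count):
        # Pre-allocate a fixed number of workers up front.
        self._free = [ChunkWorker(i) for i in range(initial_count)]
        self._next_id = initial_count

    def acquire(self, size, position):
        """Hand out a free worker configured for one chunk."""
        if self._free:
            worker = self._free.pop()
        else:
            # Assumption: create an extra worker when none are free.
            worker = ChunkWorker(self._next_id)
            self._next_id += 1
        worker.size, worker.position = size, position
        return worker

    def release(self, worker):
        """Return a worker to the pool once its chunk is generated."""
        worker.size = worker.position = None
        self._free.append(worker)
```

Reusing workers this way avoids creating and destroying one per chunk as the generated area moves.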

Shoutout to the RIDE team for making it extremely easy to convert the existing terrain generator into an infinite terrain chunk generator!

(P.S. This crashes Roblox Studio when the DataModel is closing :wink:)

28 Likes

Raycast minimaps are cool. I have implemented some texture to the grass and concrete with noise, which looks nice, but blurs at higher zooms. I’ve noticed there are some memory leaks because over time it gets very laggy but is initially fast.

edit:
Just to add, there are 5284 pixels with 331 actors. Each actor is filled with 16 ObjectValues, each linking to a different pixel; the actor calculates the raycast for each of its pixels and changes the color appropriately. The last actor doesn’t fill up entirely and manages only 4 pixels.
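Those numbers check out: 330 full actors cover 330 × 16 = 5280 pixels, and the remaining 4 go to actor 331. As a small Python sketch (`actor_layout` is a hypothetical helper, not part of the minimap code):

```python
def actor_layout(pixel_count, pixels_per_actor):
    """Return (actor_count, pixels_in_last_actor) for a fixed group size."""
    full_actors, remainder = divmod(pixel_count, pixels_per_actor)
    # A partially filled actor is needed only when pixels are left over.
    actor_count = full_actors + (1 if remainder else 0)
    last_actor_pixels = remainder if remainder else pixels_per_actor
    return actor_count, last_actor_pixels
```

Calling `actor_layout(5284, 16)` returns `(331, 4)`, matching the counts above.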

20 Likes

Most recent version significantly improved core utilization on my 6-core CPU. Was previously getting 65-70% utilization, now I’m getting 80-90% utilization.

1 Like

Oops yeah sorry, this is a regression from memory leak fixes in V3 :sweat: There’s going to be a V4 after all!

5 Likes