Parallel Lua Beta

I found a really weird bug that causes some sort of memory leak that persists between playtesting sessions:

LocalScript inside of an actor in StarterPlayerScripts:

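local RunService = game:GetService("RunService")

-- Connect 250 parallel handlers that immediately switch back to the serial phase
for i = 1,250 do
	RunService.RenderStepped:ConnectParallel(function()
		task.synchronize()
	end)
end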

LuaHeap will continue to rise until it reaches 2000+ MB, but the weirdest part is that if you stop the game and play again, LuaHeap will keep rising from wherever it left off. I tested this in the developer preview build and it doesn't happen there, only on the new beta in the live version of Studio.

ParallelLuaBugRepro.rbxl (22.3 KB)

So far I have managed to completely crash Windows with it several times :sunglasses:
When run on a more… competent machine, it dramatically accelerated my terrain generation system.

I have tripled the speed of my terrain generation system while making it more stable, and it no longer makes the server laggy while generating planets. It takes 5.313 seconds to fully initialize and spawn the terrain of a 2048 x 512 x 2048 section of land.

What exactly can we expect from this proposed shared storage? Will it be instance based or will we be able to write tables to it for read only purposes?

Currently I have a custom grid and pathfinding algorithm, and being able to multithread this would be a great help. However, from my understanding, the pathfinding algorithm currently wouldn't be able to access the grid data if run in parallel. The only way I could transfer this information at the moment without the massive overhead of sending it through bindables would be by making the grid out of parts and trying to convey information through that and value objects. This would be incredibly unwieldy for obvious reasons.

Some kind of service that we can write to from the main thread and read from in parallel threads would be greatly beneficial. Can we expect anything similar to that, or can you give us any other idea of what to expect?


Do servers have access to multiple threads? I ask because I want to know if it's useful to implement Parallel Lua on the server.

I noticed that parallel tasks are always processed before gameStepped in the frame pipeline. This means that when we call ConnectParallel on Heartbeat, the actual processing is delayed until the next frame. Are there any plans to have an option to start a WaitingParallelScriptsJob at any point in the pipeline? It would be nice to have some background tasks running in Heartbeat that won't delay physics or code that we have bound to gameStepped. Alternatively, we might have stuff we want to process in parallel during PreRender, like calculations for visual effects to be applied before rendering.

Ultimately giving us more control over where to run tasks in parallel will allow us to make the most of this powerful tool.

Will we ever have access to the number of threads available? And will we ever have access to manually assigning jobs to specific threads?

In my situation, I want to evenly distribute a set number of jobs that take a varied amount of time (which I have a rough estimate for). Back on the main thread, it will yield until all jobs are completed and then handle the processed data. I'm not exactly sure how Roblox determines what runs on what thread, but I don't believe it is going to have better distribution than me (considering I know how long each job will take). The main problem with this is that I could have 20/28 jobs already completed on 11/12 threads, while that 1 final thread still needs to process 8 more jobs.

Are there any plans to address a situation like this?

EDIT: Just to clarify, each job isn't taking 4-5 seconds and yielding inside; it's performing calculations, and I have a good estimate of how long it will take to process.
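
For what it's worth, here is a rough sketch of the kind of distribution described above: a greedy "longest job first" split, which is a standard scheduling heuristic and not anything Roblox provides. The function name `distribute` and the sample estimates are made up for illustration. It also assumes each bucket ends up on its own thread, which nothing in this thread guarantees.

```lua
-- Greedy "longest job first" split (a generic heuristic, not a Roblox API).
-- estimates: array of estimated job durations; workerCount: number of buckets.
local function distribute(estimates, workerCount)
	-- Sort job indices by estimated duration, longest first.
	local order = {}
	for i = 1, #estimates do
		order[i] = i
	end
	table.sort(order, function(a, b)
		return estimates[a] > estimates[b]
	end)

	local buckets, loads = {}, {}
	for w = 1, workerCount do
		buckets[w], loads[w] = {}, 0
	end

	-- Hand each job to whichever bucket currently has the least total work.
	for _, job in ipairs(order) do
		local best = 1
		for w = 2, workerCount do
			if loads[w] < loads[best] then
				best = w
			end
		end
		table.insert(buckets[best], job)
		loads[best] = loads[best] + estimates[job]
	end
	return buckets, loads
end

-- Hypothetical estimates (arbitrary units) split across 3 buckets.
local buckets, loads = distribute({ 8, 7, 6, 5, 4, 3, 2, 1 }, 3)
print(table.concat(loads, ", ")) -- 13, 12, 11
```

With estimates this even the buckets finish within a couple of units of each other, instead of one bucket being left with 8 jobs at the end.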

Technically each Actor is a thread, but if you have more actors than threads, Roblox will start doing some dynamic distribution? This means that if we have access to the number of threads on the user's system, we also have the ability to control job distribution?

I just want to add that it would be super helpful to have a RunService:IsThread() function, or at least something similar to see what environment the code is currently running in.

Behind the scenes, Actors are bucketed into Luau VMs. Because scripts can't move between VMs, there's no way to granularly assign tasks to threads, as the runtime has to ensure that each VM is only ever accessed on a single thread. Because of this, it is possible to get really unlucky and have all the actors doing expensive stuff put on the same VM, but because currently the VMs are allocated ahead of time, there's easy way around this.

In my situation, I was thinking more of sending information to actors via BindableEvents (since there is no real way to communicate yet), then having those actors do the computations and spit out a result. If we could see which thread an actor is running on (via a read-only property) and how many threads are available to be allocated to actors, would my goal be achievable?

If you generate a random number in a module script, it will serve as a sort of ID for the VM that actor is on, as each VM has a separate view of module scripts.

That's smart, I can create a bunch of actors until they end up viewing the same cached ModuleScript. This would definitely work, but you can see how it seems "hacky." As Parallel Lua is still in beta, there is definitely room for new features and design choices that can make this more straightforward.
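
A minimal sketch of how that trick could be used, under a few assumptions: the ModuleScript name `VmId`, the helper `groupByVm`, and the sample IDs are all hypothetical, and each actor is assumed to report its ID back to the main thread somehow (e.g. through a BindableEvent). Two VMs could in principle draw the same random number, so treat the ID as a hint only.

```lua
-- The hypothetical ModuleScript "VmId" would contain just:
--     return math.random(1, 2 ^ 30)
-- A module body runs once per VM, so every actor on the same VM gets the
-- same number back from require().

-- Hypothetical helper: given { actorName = vmId } reports collected on the
-- main thread, bucket the actor names by VM.
local function groupByVm(reports)
	local groups = {}
	for actorName, vmId in pairs(reports) do
		groups[vmId] = groups[vmId] or {}
		table.insert(groups[vmId], actorName)
	end
	return groups
end

-- Made-up IDs standing in for what three actors might report.
local groups = groupByVm({ ActorA = 101, ActorB = 202, ActorC = 101 })
print(#groups[101], #groups[202])
```

Here ActorA and ActorC would be sharing a VM, so giving both of them heavy work would not run it on separate threads.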

It's finally out, a feature I have been desperate for, for years on end. However, it needs significant improvement.

Firstly, I would like to be able to access all methods, including setting the position of parts. I do not care if it's unsafe; I accept the potential dangers of this and would still like it to be possible. Maybe include an additional flag to say you would like to execute unsafe methods. I have created a demonstration place to show the need for this.
Demonstration.rbxl (22.3 KB)

Secondly, I would like mutators and delegations that let me control system I/O and core counts.

Thirdly, I would like a way to synchronize code using parallelism, maybe using idempotency keys, in an efficient manner. Currently _G is a good way to do this, but I would like a more efficient system.

Thank you for finally implementing this after years and years of desperate need for it.


The limitation is probably in place not because of developers writing potentially broken code but because race conditions can introduce vulnerabilities into software. Whether it be the client or server, writing to a property from 2 different threads will cause undefined behavior.
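
To illustrate one kind of race, here is a small sketch of a lost update. It uses plain Lua coroutines to force a particular interleaving, so it only simulates what two real threads might do (real threads interleave nondeterministically). The `shared` table is a stand-in for a property, not an actual Instance, and this says nothing about how the engine behaves internally.

```lua
-- Shared state standing in for an Instance property.
local shared = { value = 0 }

-- Each writer reads the value, pauses (the point where a real thread could
-- be interrupted), then writes back what it read plus one.
local function writer()
	local seen = shared.value
	coroutine.yield()
	shared.value = seen + 1
end

local a = coroutine.create(writer)
local b = coroutine.create(writer)

-- Forced interleaving: both writers read 0 before either one writes.
coroutine.resume(a)
coroutine.resume(b)
coroutine.resume(a)
coroutine.resume(b)

print(shared.value) -- 1, not 2: one of the increments was lost
```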


Race conditions can easily be dealt with and are not my concern. I want this feature implemented, and it's relatively easy to do. I develop multithreaded code. If two threads set the same data at the same time, depending on how the virtual machine is implemented, typically a kernel overflow occurs dropping either the most recent or oldest dataset for the object. Other times it's just a memory recourse, i.e. the same property is accessed in memory but the instruction is overridden by the most recent thread.


Not sure if I understand correctly, but let's say that 1000 parts are made each Heartbeat. Would separating it into 4 parallel tasks that each create 250 parts be 4x faster, if not more, on a 4-core system for example?


I agree, I want to be able to access/set with no regards to bugs that might come from it.
I understand that there would be some frustration from users not knowing what's going on.
Properties and methods should be marked with a safety level, and by default you should not be able to change them unless you toggle a setting.


After playing around with this feature for a bit, I've come across a fatal memory leak: parallelized VMs never have their memory freed, and they persist across playtests. This will quickly cripple your Studio experience if you create large amounts of data per playtest on parallelized VMs. The only way to reclaim this memory is to restart Studio, which is a hassle of course.

Steps to reproduce, 100% success rate for me (Windows 10):

  • Create a bunch of VMs (client or server)
  • Fill each VM with a large amount of memory
  • Destroy each VM in-game, observe
  • Stop the playtest, observe

This does not happen with synchronous code (e.g. replacing Actors with Models)

Here's a quick repro place for testing. The entry point is located in ServerScriptService > Main; use the variable debug_USE_THREADS to control the behaviour.

threads.rbxl (22.1 KB)


Overall, I'm completely in love with this feature, this is truly the start of a new era for Roblox. The possibilities with a system like this are endless!

Ergo, naturally, I wrote up a quick Perlin terrain generator that generates 24x24 chunks of 64x64x64 voxels (~151M voxels), which is an unnecessary amount of terrain for any purpose. It's really big. Bigger than a really big thing.

Sequentially it took 2 minutes 41 seconds to process, while in parallel it only took 44 seconds! Most of that time was spent calling WriteVoxels() (which isn't yet safe to run desynchronized). Unfortunately I couldn't go any bigger due to this pesky memory leak; the terrain itself takes around 1.5 GB, while all the dead VMs take 18 GB!
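
For anyone who wants to check the numbers, the arithmetic behind the figures above works out like this. It only uses the values quoted in this post; the variable names are just for illustration.

```lua
-- Voxel count: 24x24 chunks, each 64x64x64 voxels.
local chunks = 24 * 24
local voxelsPerChunk = 64 * 64 * 64
local totalVoxels = chunks * voxelsPerChunk

-- End-to-end speedup: 2 min 41 s sequential vs 44 s parallel.
local sequentialSeconds = 2 * 60 + 41
local parallelSeconds = 44
local speedup = sequentialSeconds / parallelSeconds

-- Leaked memory relative to the terrain itself: 18 GB vs 1.5 GB.
local leakRatio = 18 / 1.5

print(totalVoxels)                       -- 150994944
print(string.format("%.2fx", speedup))   -- 3.66x
print(string.format("%.0fx", leakRatio)) -- 12x
```

The speedup is for the whole run, including the time spent in WriteVoxels().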


This is really great! Glad to see this coming. I wouldn't mind a few more examples of use cases, though. Personally, without the ability to edit things from parallel code (if I am understanding correctly, which I probably am not), I can't really see many cases to use this for.


This is awesome and lets me generate cool images like this really quickly.
image

I'm curious if there's a good way to figure out actor utilization. I'm trying to balance max actors and chunk size (essentially the size of the work each actor does). I've found it hard to balance these numbers and have just been fine-tuning via trial & error. Is there a better way I can figure out the best way to balance a large parallel task like this?


This is based on my interpretation of the microprofiler:
Each task you run in parallel gets assigned to a task queue, and all tasks within such a task queue are essentially treated as one task as far as the task scheduler is concerned.
Since the task scheduler doesn't (can't) know how long each task will run, tasks are assigned to the task queues randomly, with priority given to the task queues that have the fewest tasks already assigned to them.
The maximum number of task queues appears to be hard limited to exactly 24. Ideally, then, you want at most 24 actors running code in parallel to minimize the overhead you get from multithreading: any additional tasks you try to run in parallel just get added to one of the existing task queues, and you get no additional benefit from the task scheduler spreading the workload more evenly across your CPU threads.
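
A toy model of that interpretation, to show why going past 24 parallel tasks would not help. Everything here is an assumption taken from the description above: the cap of 24 is only what the microprofiler seemed to show, and this sketch always picks the first least-loaded queue, whereas the real scheduler reportedly picks among them randomly. `assign` and `maxLoad` are made-up helper names.

```lua
local MAX_QUEUES = 24 -- cap inferred from the microprofiler, not a documented value

-- Assign each task to the queue that currently holds the fewest tasks.
local function assign(taskCount)
	local queues = {}
	for q = 1, MAX_QUEUES do
		queues[q] = 0
	end
	for _ = 1, taskCount do
		local least = 1
		for q = 2, MAX_QUEUES do
			if queues[q] < queues[least] then
				least = q
			end
		end
		queues[least] = queues[least] + 1
	end
	return queues
end

-- Largest number of tasks sharing a single queue.
local function maxLoad(queues)
	local most = 0
	for _, count in ipairs(queues) do
		if count > most then
			most = count
		end
	end
	return most
end

print(maxLoad(assign(24))) -- 1: every task gets its own queue
print(maxLoad(assign(30))) -- 2: six queues now run two tasks back to back
```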


@EthicalRobot I think it's weird that HttpService:GenerateGUID cannot be called in parallel

image

Is there a reason for this, or is it an oversight (did someone forget to whitelist this)?
