New (Parallel) vs. Old (Single-Threaded)
Over the last two weeks in my free time, I’ve been putting together a procedural 3D particle system to help a friend out with his fighting game. At this point, it would be helpful to see how it scales across CPUs with different core counts. My system runs a Ryzen 7 5800X3D, so I don’t have a great sense of how it performs on other, lower-end devices. Let me know how it runs on your end!
Mobile support is disabled for now because I’m too lazy to add mobile controls; I’ve been at this for a while. Here’s the game link for anyone interested in torturing their CPU themselves:
Do note that I’m not done here. I still have more to implement in parallel, and this is just a lightly quality-assured version. I’ve been tinkering with the system over time, so here’s what’s to come.
Quick update, because it was making me mad that my repeated instantiation was still slow. My solution was more methods catered to performance. I also massively improved the parallel processing implementation as I figured out how to use Parallel Luau more effectively.
I intend to release this to the public eventually, as it’s probably a good learning resource for more than just me. I’m finalizing the API before any sort of public release, though; things have gotten pretty messy over the many iterations of my code. It’ll be a while before this module is battle tested and “public ready”, however. As for what it looks like right now:
These changes are live at the game link if you wanna give it a whirl yourself. It runs purely on the client at the moment, but the implementation of Parallel Luau opens the door to server-side use. I haven’t tested that yet, so it’s on the back burner. Enjoy!
I’m not a super creative person, so my example cases aren’t very wide-ranging, but I have a few. So far, I’ve been able to do things like impact effects, rain, fireworks, atmospheric effects (think old Minecraft void fog), etc. The module was built with a purpose: being used in my friend’s fighting game. From talking to some friends, their use cases included spells and various other atmospherics.
It’s intended to be combined with other VFX to provide a dynamic, yet highly tunable, visual quality increase over static prefabs in some cases. The system is flexible: any BasePart, including MeshParts, is a valid prefab. That opens the door to some pretty interesting opportunities, since the random selection can be uniform or weighted.
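As a sketch of what weighted selection over prefabs could look like (this is illustrative Luau on my part, not the module’s actual internals):

```lua
-- Hypothetical helper: pick a prefab where each BasePart template maps to a
-- relative weight. A weight of 0 means the prefab is never chosen.
local function pickWeighted(prefabs: {[BasePart]: number}): BasePart
	local total = 0
	for _, weight in prefabs do
		total += weight
	end

	local roll = math.random() * total
	for prefab, weight in prefabs do
		roll -= weight
		if roll <= 0 then
			return prefab
		end
	end

	-- Floating-point edge case fallback: return any prefab.
	local prefab = next(prefabs)
	return prefab
end
```

Passing equal weights for every prefab degenerates to uniform selection, so one code path can cover both modes.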
This is what the demo looks like after recent work. It’s been a few days since my last post, so a lot has changed; just know I re-implemented emittance in a way that makes more sense for memory efficiency and parallelization. Suffice it to say, it runs shockingly well! My desktop PC runs a Ryzen 7 5800X3D, which manages to instance the 500 parts, update them, and clean them up at 240 Hz (~2 ms Heartbeat).
The colors you see are a visualization option I added for the chunking behavior. Each color represents a chunked group of 25 particles under a single Actor; it shows you the unit of work a particle is processed under. The machinery for chunk visualization is built into the module. On release, this will be available for debugging purposes via the ChunkManager singleton, which contains a Visualize property.
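For reference, toggling that debug view might look something like this once released (the module path and require style are assumptions on my part; ChunkManager and its Visualize property are as described above):

```lua
local ReplicatedStorage = game:GetService("ReplicatedStorage")
local FX3D = require(ReplicatedStorage.FX3D) -- assumed location of the module

-- Tint each 25-particle chunk with a color identifying the Actor that owns it.
FX3D.ChunkManager.Visualize = true
```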
I also enabled mobile support! With the help of some testers (to whom I am very thankful), I improved performance in the end phase, which means particle cleanup is no longer as expensive. The module no longer suffers from the frame-time spikes that previously affected all three phases of execution. The new stability makes it viable on lower-powered devices like smartphones.
Here’s the game link again for redundancy’s sake (and to save a tiny amount of scrolling):
No idea when I’ll publicly release this module, as it still hasn’t been battle tested, but I’m inching closer as time goes on. I plan on adding a collision check option, as that is also relatively easy to do in parallel and would make atmospherics more useful in the context of a game. Stay tuned!
It takes about 3 milliseconds on my system (Ryzen 5 5600G), so it runs smoothly, but that could be improved.
From what I am seeing in the MicroProfiler, there is still a large amount of work being done in serial, mainly changing the size of parts and using BulkMoveTo. For BulkMoveTo, you might be able to group more parts into a single call; however, I don’t actually know if that would improve performance. As for the size, I don’t know why it has to be called. If you could get rid of whatever part of the script is changing the size, that would improve performance by a good amount.
The first frame in which the particles are instantiated takes up more time, which can create some small lag spikes. Reusing old parts could probably help as well.
Even just parenting parts takes up time, though. I think there is a module that allows the “caching” of parts by moving them really far away or something.
Mainly changing the size of parts and using BulkMoveTo. For BulkMoveTo, you might be able to group more parts into a single call; however, I don’t actually know if that would improve performance.
The point you make about BulkMoveTo is actually very valid; the chunk size is hand-tuned, so adjusting the number of calls per frame is as simple as changing a number in a table. The chunks are the units of work, however, so larger chunks mean more opportunity to block if balanced incorrectly. It’s a bit of a process finding the sweet spot.
As for the size, I don’t know why it has to be called. If you could get rid of whatever part of the script is changing the size, that would improve performance by a good amount.
The size-change code is old and hasn’t been reconsidered except for when I parallelized what I already had. When I first started working on the module, it was useful for blending the transitions between the effect phases while the code base was in its infancy. I am aware of the relative cost incurred by resizing; personally, I like how it looks, so I chose to keep it. It would be easy to make it optional to curb frame-time issues.
The first frame in which the particles are instantiated takes up more time, which can create some small lag spikes. Reusing old parts could probably help as well.
Though even parenting parts takes up time.
I actually already implement instance pooling at a few levels! The first frame the particles are instantiated in does have a spike, but it’s a one-time cost per EffectInstance, assuming you don’t invalidate the cache. I can’t really make every effect pool over the same group of Instances in all cases, as the prefabs are up to the user to implement and the distribution of particle types in the pool is not always the same.
I also permanently pool the Actor instances that are created when a new chunk is generated. 60 Actors are automatically instantiated when the module loads, and more can be dynamically generated as needed. The particles can be preloaded via EffectInstance:Prealloc(nPrefabs, batchSize), where nPrefabs is a number evenly divisible by batchSize; this instantiates batchSize particles every frame until nPrefabs have been generated.
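To make the preloading behavior concrete, usage might look like this (the constructor and prefab table are hypothetical; only Prealloc and its parameters come from what I described above):

```lua
-- Hypothetical effect construction; only Prealloc's signature is real here.
local effect = FX3D.EffectInstance.new(prefabs)

-- Preload 500 particles in batches of 50: one batch per frame for 10 frames.
-- nPrefabs (500) must divide evenly by batchSize (50).
effect:Prealloc(500, 50)
```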
I don’t really get this part. BulkMoveTo cannot be used in parallel, and is used in serial. Why is it separated into different chunks, and how can it block?
It’s separated into chunks because while I can calculate particle states completely independently from one another, Actors are still a large abstraction and treating a single particle as a unit of work isn’t profitable. BulkMoveTo happens in the serial blocks of those chunks because when I first started writing the system, I was under the impression that I could mutate the state of the datamodel under the Actor in parallel.
Instead of rewriting a bunch of code to combine the state of the chunked data, I just decided to bulk-set the CFrames of whatever belonged to the chunk in serial execution and call it a day. If my chunk size is too large, I won’t have enough units of work to efficiently parallelize the custom physics calculations. The workload would begin to block, as the program doesn’t re-enter the serial state until parallel execution is finished.
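The parallel-then-serial pattern I’m describing can be sketched like this. This is a minimal illustration of the technique under an Actor, not the module’s real code; the particle fields, chunk tables, and gravity constant are invented for the example:

```lua
-- Runs inside a script parented under an Actor so Parallel Luau can
-- schedule the parallel phase off the main thread.
local RunService = game:GetService("RunService")

local GRAVITY = Vector3.new(0, -30, 0)

-- Populated per chunk elsewhere (one Actor owns ~25 particles).
local chunkParts: {BasePart} = {}
local chunkParticles = {} -- each: { position: Vector3, velocity: Vector3 }
local chunkCFrames: {CFrame} = {}

RunService.Heartbeat:ConnectParallel(function(dt: number)
	-- Parallel phase: each chunk updates its particle states independently,
	-- writing only to its own tables.
	for i, particle in chunkParticles do
		particle.velocity += GRAVITY * dt
		particle.position += particle.velocity * dt
		chunkCFrames[i] = CFrame.new(particle.position)
	end

	-- Serial phase: mutating the datamodel isn't allowed in parallel, so
	-- synchronize and apply the whole chunk in one BulkMoveTo call.
	task.synchronize()
	workspace:BulkMoveTo(chunkParts, chunkCFrames, Enum.BulkMoveMode.FireCFrameChanged)
end)
```

One BulkMoveTo per chunk per frame is the knob being tuned: fewer, larger chunks mean fewer serial calls but fewer units of parallel work to balance.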
Particles tend to be small anyway, so for now it’s good and cheap enough to justify keeping.
The size interpolation is now optional, as previously discussed. The demo has it disabled, and it reasonably improves performance. The collision detection is done in parallel and is relatively cheap due to reasonable ray sizes. Note that this is all relative: performance will vary from system to system, and it is up to the developer to implement level of detail. The API FX3D provides currently makes that job pretty easy.
Various other bug fixes and optimizations have been put in place across the entire code base as I correct my previous mistakes and find better ways to do things. I still have no idea how this performs on a wide variety of devices, so if you do play, please press Shift + F4 and take a look at the Heartbeat time.
The first release candidate for this module is nearing completion, as I’m nearly done implementing the features I want / need for a public release. The release candidate will be tested by more people than just me before it gets put into the hands of the public, however. I want to make sure things are coherent before I put it out there. Enjoy for now; I’ll be back with more later!