Part Instancing - pre-release announcement

zeuxcg · May 29, 2018, 11:45pm

We released instancing support for mesh & CSG parts last year, and we are getting ready to release part instancing, which extends the concept to other part types. This post is a heads-up that this will happen in a few weeks, and it contains some technical details for inquisitive minds.

Note: part instancing is designed to just work, and requires no user interaction. If you stick to basic parts and reuse meshes/textures more (and make sure Lighting.Outlines is disabled), you’re on the good side of performance, and you can skip reading the rest.

Posting this on behalf of @maxvee who spent lots of time implementing various part rendering features and trying to make sure they match the existing rendering pipeline.

What is instancing?

Graphics engine internals: utilizes hardware capabilities to draw many similar objects at once. Has been around since ca. 2005, so it’s about time.
Instancing for Meshes and CSGs was shipped in (about) October 2017. If you haven’t noticed anything, I’ll take it as a compliment.

Advantages:

saves a lot of graphics memory used by not making many copies of the same piece of geometry;
eliminates expensive part re-clustering caused by moving parts, dynamic updates, etc., potentially shaving off 4-1000000 ms per frame;
(Previously, e.g. changing color of a single part triggered re-generation of everything in the general vicinity of the offender.)
(almost) immediate part updates;

Disadvantages:

batching is less efficient and highly depends on the number of unique pieces of geometry used;
requires graphics hardware support (D3D11, GL3, GLES 3, Vulkan, Metal). Android support is the worst: a lot of devices are too old, and a lot are too buggy to support it reliably.

There are two new “metrics” (*) here to be (mildly) aware of: batching efficiency and update performance.

Batching efficiency

A very high number of draw calls is a CPU bottleneck in rendering code: it is the time it takes for our code to talk to e.g. D3D11, which in turn talks to the driver, and that cost is paid every frame. We can mitigate the issue by submitting many similar parts per single draw call.

Similar in this case means:

parts have the same geometry (to a certain extent).
parts have the same shader (material)
parts have the same texture (if plastic)

For example:

Many meshes that refer to the same assetid will share their geometry. I.e. there will be only one vertex buffer with exactly one copy of the mesh, and it will be used to render all of them. There is no more duplication of meshes when forming geometry clusters.
If, say, 100 meshes use the same material and texture, they will be rendered in a single draw call, unless other properties disagree (see below). If half of your meshes use bricks, one quarter uses plastic with one texture, and the remaining quarter uses plastic with another texture, there will be three draw calls, unless other properties disagree.
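To make the counting rule concrete, here is a minimal Python sketch (not engine code) that groups parts by an assumed batching key of geometry, material and texture, and counts one draw call per group. The `Part` type and its field names are hypothetical, and the sketch ignores the other properties and the cluster boundaries discussed elsewhere in this thread.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Part(NamedTuple):
    # Hypothetical stand-in for a rendered part; not a Roblox API type.
    geometry: str            # e.g. the mesh asset id
    material: str            # e.g. "Brick" or "Plastic"
    texture: Optional[str]   # only relevant for plastic in this sketch


def count_draw_calls(parts):
    """Count one draw call per unique (geometry, material, texture) group."""
    groups = Counter((p.geometry, p.material, p.texture) for p in parts)
    return len(groups)


# The example from the post: half bricks, a quarter plastic with one
# texture, a quarter plastic with another texture.
parts = (
    [Part("mesh_a", "Brick", None)] * 50
    + [Part("mesh_a", "Plastic", "tex_1")] * 25
    + [Part("mesh_a", "Plastic", "tex_2")] * 25
)
print(count_draw_calls(parts))  # 3
```

If all 100 parts shared one material and texture, the same function would return 1.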

You can gauge batching efficiency by pressing Ctrl+Shift+F2 and looking at the bottom. One line should read, e.g. (Village template):

Clusters: fw 0c 0p; dyn 0c 0p; hum ...c ...p; inst 51c 561e 4559i

0c 0p - means there are no FastClusters, everything is instanced;
hum … - that’s for humanoids, we won’t touch those for now;
inst 51c 561e 4559i - means there are 51 instanced clusters with 561 render entities and 4559 instances.

The number of entities is essentially the number of batches (draw calls) it would take to render all 4559 parts in the scene. This tells us that on average there will be about 8 instances per draw call (4559 / 561 ≈ 8.1, which is not great…).

So the bigger the ratio of instances to entities, the better the batching efficiency is. The theoretical maximum is about 512 (for now).
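As a rough illustration, the ratio can be computed straight from the stats line. The sketch below is plain Python, not a Roblox tool: the function name is made up, the line format is taken from the single sample above, and treating 512 as a hard per-draw-call cap is an assumption based on the "about 512" figure.

```python
import math
import re


def parse_inst_stats(line):
    """Extract (clusters, entities, instances) from the 'inst ...' field."""
    match = re.search(r"inst (\d+)c (\d+)e (\d+)i", line)
    if match is None:
        raise ValueError("no 'inst' field found in stats line")
    return tuple(int(group) for group in match.groups())


line = "Clusters: fw 0c 0p; dyn 0c 0p; hum ...c ...p; inst 51c 561e 4559i"
clusters, entities, instances = parse_inst_stats(line)

# Batching efficiency: average instances per render entity (draw call).
ratio = instances / entities
print(f"{ratio:.1f} instances per draw call")  # 8.1 instances per draw call

# Assumed cap of 512 instances per draw call gives a best-case lower bound.
MAX_INSTANCES_PER_CALL = 512
print(math.ceil(instances / MAX_INSTANCES_PER_CALL))  # 9
```

So under that assumption the Village template could in principle be drawn in as few as 9 calls instead of 561, if every part were "similar" to every other.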

The following properties do not incur batching costs (i.e. different parts count as similar):

For parts:

CFrame (position/rotation)
Size(+)
Color/BrickColor
UsePartColor
Reflectance

For SpecialMeshes:

Offset
Scale
VertexColor

For Decals:

Color3
Transparency
StudsPerTileU/V

+ - the following exceptions for Size apply:

trusses have to have the same number of segments, otherwise a different piece of geometry is generated for every unique number of segments.
elongated head SpecialMeshes will turn into cylinders, which will effectively split the batch into two

The following are known notorious Batch Wreckers:

MeshType
MeshId
TextureId
Material
Stud configuration - this one will generate a slightly different copy per part type, per face, per stud type. There are about 2700 different combinations of just those, so be careful with studs. Stud configuration has no effect on MeshPart or CSG batching.
Transparency - this one is the worst. Since OIT is still expensive (ask me again in 5 years), we have to force a single instance per draw call for each transparent part. Does not pertain to decal transparency.
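A back-of-the-envelope way to see how much the transparency rule hurts: the Python sketch below groups opaque parts by some of the "wrecker" properties listed above and gives every transparent part its own draw call. It is only an estimate under stated assumptions: parts are plain dictionaries (a hypothetical representation), stud configuration and cluster boundaries are ignored, and the real engine's grouping is certainly more involved.

```python
def estimate_draw_calls(parts):
    """Estimate draw calls: one per opaque group, one per transparent part."""
    opaque_groups = set()
    transparent_count = 0
    for part in parts:
        if part["Transparency"] > 0:
            # Per the post: a single instance per draw call when transparent.
            transparent_count += 1
        else:
            opaque_groups.add(
                (part["MeshType"], part["MeshId"],
                 part["TextureId"], part["Material"])
            )
    return len(opaque_groups) + transparent_count


def make_part(transparency):
    # Hypothetical part record; keys mirror the property names in the list.
    return {
        "MeshType": "Brick",
        "MeshId": "",
        "TextureId": "",
        "Material": "Plastic",
        "Transparency": transparency,
    }


all_opaque = [make_part(0.0) for _ in range(100)]
some_glass = [make_part(0.0) for _ in range(90)] + [make_part(0.5) for _ in range(10)]

print(estimate_draw_calls(all_opaque))  # 1
print(estimate_draw_calls(some_glass))  # 11
```

Making just 10 of 100 otherwise identical parts transparent takes the estimate from 1 draw call to 11.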

Properties not explicitly mentioned here, like Name or Velocity, have no effect on graphics.

Note on decals: internally, decals use a separate geometry piece that closely follows the object that they’re mapped on top of. They are rendered with transparency on all the time, but it’s not as ridiculous as one per draw call.

Update performance

Batching efficiency alone is a good indicator of static performance, i.e. it is the “base cost” of just rendering so many things. When parts are dynamic, though, additional performance considerations come into play.

Relative costs of updates, from faster to slower

Nothing - does not incur any dynamic costs. Static objects are not updated at all.
CFrame (position/rotation) of meshes, CSGs, blocks/cylinders/balls with no specialmeshes.
This is “the fast path”. As cheap as patching a few floats in a struct that the renderer sends to the GPU.
Color, UsePartColor, Size, Reflectance.
Triggers a full update for the part and a bbox update for the cluster.
SpecialMeshes.
Approx. 10x slower to update than basic parts, also triggers bbox update for the cluster. Also, there is no ‘fast path’ for SpecialMeshes.
CFrame, moving across cluster boundaries.
If a position update moves the part too far to a different cluster, internally this triggers part handover logic, which involves bumping of a few lists, etc. Will trigger bbox updates for two clusters.
Transparency.
Changing transparency from nonzero to nonzero is the same as color/size/etc. However, transitioning between zero and nonzero always involves creation/destruction of a few internal graphics objects.
Changing anything else (graphics-related, doesn’t include Name or Velocity) triggers re-creation of internal graphics objects. This also includes any changes to object’s decals and child SpecialMesh properties. If it had any decals, the decals are also re-created. Expect memory allocations, extending lists, updates to clusters.

Note that multiple property updates are handled properly, and graphics objects are updated (almost) only once.
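One way to keep the ordering above in mind is as a lookup from "what changed" to a cost tier. The Python sketch below is just my reading of the list, not how the engine is implemented: the tier names are invented, and taking the most expensive tier when several properties change at once is an assumption.

```python
from enum import IntEnum


class UpdateCost(IntEnum):
    # Tier names are illustrative; a higher value means a slower update.
    NONE = 0
    FAST_PATH = 1            # CFrame only, no SpecialMesh
    FULL_PART = 2            # Color, UsePartColor, Size, Reflectance
    SPECIAL_MESH = 3         # anything on a part with a SpecialMesh
    CLUSTER_HANDOVER = 4     # CFrame change that crosses a cluster boundary
    TRANSPARENCY_TOGGLE = 5  # zero <-> nonzero transparency
    RECREATE = 6             # any other graphics-related property

CHEAP_PROPS = {"Color", "UsePartColor", "Size", "Reflectance"}
NON_GRAPHICS_PROPS = {"Name", "Velocity"}


def classify_update(changed, has_special_mesh=False, crosses_cluster=False,
                    old_transparency=0.0, new_transparency=0.0):
    """Return the most expensive tier triggered by a set of changed properties."""
    base = UpdateCost.SPECIAL_MESH if has_special_mesh else UpdateCost.FULL_PART
    cost = UpdateCost.NONE
    for prop in changed:
        if prop in NON_GRAPHICS_PROPS:
            tier = UpdateCost.NONE
        elif prop == "CFrame":
            if crosses_cluster:
                tier = UpdateCost.CLUSTER_HANDOVER
            elif has_special_mesh:
                tier = UpdateCost.SPECIAL_MESH
            else:
                tier = UpdateCost.FAST_PATH
        elif prop in CHEAP_PROPS:
            tier = base
        elif prop == "Transparency":
            toggles = (old_transparency == 0) != (new_transparency == 0)
            tier = UpdateCost.TRANSPARENCY_TOGGLE if toggles else base
        else:
            tier = UpdateCost.RECREATE
        cost = max(cost, tier)
    return cost


print(classify_update({"CFrame"}).name)              # FAST_PATH
print(classify_update({"CFrame", "Color"}).name)     # FULL_PART
print(classify_update({"Material"}).name)            # RECREATE
```

The practical takeaway is the same as in the list: moving plain parts around is cheap, while toggling transparency or changing materials on many parts per frame is not.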

Other noteworthy changes

Head SpecialMeshes no longer “expand” as before, they are simply scaled up to a certain size, and then replaced with a cylinder, with decals disabled. (see https://devforum.roblox.com/t/potential-deprecation-of-non-uniform-head-scaling-feedback-welcome/101768/7)
Torso SpecialMeshes are rendered as boxes. (see SpecialMesh.MeshType=Enum.MeshType.Torso will be deprecated soon)
Outlines are not supported. Turning on outlines inhibits part instancing for the entire place file. (Meshes and CSGs are unaffected.)
For wedge parts, studs on slant faces will look a bit “non-Euclidean” when at 45 degrees, due to non-uniform scaling.

(*) - not actually metrics.

tbradm · May 30, 2018, 12:44am

I’m uncertain of the use of the word “texture” in the context “plastic with texture”. I think it refers to SurfaceType (e.g. Smooth, Studs, etc.), but please clarify if I’m wrong.

Mikastrae · May 30, 2018, 12:45am

@maxvee Thank you.

Maximum_ADHD · May 30, 2018, 1:10am

The amount of edge cases that had to be covered for this must have been insane.
Massive kudos to @maxvee for pulling through.

Mistertitanic44 · May 30, 2018, 1:38am

loving it

cosmonomical · May 30, 2018, 1:48am

mcdonalds jingle

chesse20 · May 30, 2018, 1:49am

Batching Efficiency? Add this please

Coeptus · May 30, 2018, 1:55am

Great thread! Really interesting to read.

One question regarding batching efficiency…

Is this limited to each individual cluster?

If I have two identical MeshParts in close proximity, they are both rendered in one single draw call. But, if I position them further apart, they’re split into different clusters and are rendered in two separate draw calls.

Will the latter have a negative performance impact?

Ben_Est · May 30, 2018, 2:20am

This is a gold mine of efficiency knowledge. This information should be put on a wiki page for easy reference in the future.

richard702 · May 30, 2018, 3:21am

Will instancing apply to adornments?

Revlayz · May 30, 2018, 6:48am

Finally!

zeuxcg · May 30, 2018, 5:33pm

Currently we only instance within one cluster so if two meshes are far apart they will render as separate draw calls. This will likely change later. Note that the memory for mesh geometry is shared across clusters - we just don’t batch draw calls across clusters.

codes4breakfast · May 30, 2018, 6:49pm

How close do two parts/meshes need to be for them to be considered in the same cluster? Is there a certain threshold or do they work like “chunks”?

ConfidentCoding · May 30, 2018, 8:00pm

Thank you for such an informative post

Hexadecagons · May 30, 2018, 8:49pm

This is great, seeing the rendering engine become more optimised is something I’m always happy to see.

SelDraken · May 30, 2018, 9:14pm

This is the type of post I really enjoy seeing, very full of details without getting too technical, but enough to get my mind working on how to be more efficient in my building.

zeuxcg · June 1, 2018, 11:38pm

They work like chunks with size 128x64x128 studs. FWIW I would not advise designing your levels for this or anything - these details may change and shouldn’t be critical to performance with further improvements to the system (such as cross-chunk draw call merging)
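For the curious, chunk membership with those dimensions can be sketched in a few lines of Python. This is purely illustrative: it assumes the chunk grid is aligned to the world origin and that a part is assigned by a single position, neither of which is stated here, and as noted these details may change.

```python
import math

# Chunk dimensions in studs (x, y, z), as quoted in this thread.
CHUNK_SIZE = (128, 64, 128)


def cluster_index(x, y, z):
    """Map a position to an integer chunk coordinate.

    Assumes the grid starts at the world origin (an assumption, not a
    documented engine detail).
    """
    return tuple(
        math.floor(coord / size) for coord, size in zip((x, y, z), CHUNK_SIZE)
    )


def same_cluster(a, b):
    """True if two positions fall into the same chunk under that assumption."""
    return cluster_index(*a) == cluster_index(*b)


print(cluster_index(10, 10, 10))               # (0, 0, 0)
print(cluster_index(130, 10, -5))              # (1, 0, -1)
print(same_cluster((127, 0, 0), (129, 0, 0)))  # False
```

Note that two parts only 2 studs apart can still land in different chunks if they straddle a boundary, which is one more reason not to design levels around this.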

zeuxcg · June 1, 2018, 11:39pm

Yeah this is correct. Each unique combination of stud types on surfaces results in unique internal geometry and we can’t instance that. So if you have a block smooth-on-all-sides and a block that has one surface marked as studs, we won’t render them in one draw call. You might want to use smooth-on-all-sides blocks for other reasons (such as performance of place loading due to internal physics operations that run on place load).

zeuxcg · July 12, 2018, 1:15am

Want to update the thread to mention the progress:

Part Instancing
We’ve been going through the rollout, discovering a few small bugs and fixing them along the way. It’s really close as far as we know - it’s currently disabled, but we’ll try to enable it in the coming weeks on desktop. Didn’t quite make RDC US, but will definitely make RDC EU.
Inter-cluster instancing
We’ve implemented a feature that allows merging objects from different clusters into a single draw call dynamically. This is live on desktop as of right now and currently means that any Mesh/CSG parts that are clones of each other and are in the view will be rendered with a single draw call regardless of where they are. This also opens up opportunities for significant optimization of transparent part rendering - with part instancing and this combined, we’re seeing less overhead for transparent parts, which is awesome.
macOS performance
We’ve discovered a performance regression with instancing (also reported here https://devforum.roblox.com/t/roblox-critical-metal-graphics-mode-on-mac-cause-drops-in-fps-to-rendering-many-parts) that was affecting both part instancing (that was enabled for ~1 hour last week) and mesh/csg instancing, and was only impacting NVIDIA GPUs on Mac. This issue has since been fixed, so on macOS/NVIDIA mesh & csg parts should render faster now.
Part instancing and levels with really high part count
The initial release of part instancing uses “dynamic” instance data submission - the data for each part that describes the visual appearance of the part is uploaded to the GPU every frame. This works fine for levels with a reasonable number of visible parts - e.g. 10-20k - but starts hitting performance issues for really aggressive scenes, e.g. 100k visible parts. We’re working on a change that, on a per-cluster basis, caches the part instance data in GPU buffers and updates those. These updates should be much faster than the reclustering operations that could happen in the previous system, which hopefully should make the new system work well even on levels with millions of blocks.

wravager · July 12, 2018, 5:02am

Will the live time CSG on the roadmap work with instancing as well?