Part Instancing - pre-release announcement

We released instancing support for mesh & CSG parts last year, and we are now getting ready to release part instancing, which extends the concept to other part types. This post is a heads-up that this will happen in a few weeks, and it contains some technical details for inquisitive minds.

Note: part instancing is designed to just work, and requires no user interaction. If you stick to basic parts and reuse meshes/textures more (and make sure Lighting.Outlines is disabled), you’re on the good side of performance, and you can skip reading the rest.

Posting this on behalf of @maxvee who spent lots of time implementing various part rendering features and trying to make sure they match the existing rendering pipeline.

What is instancing?

Instancing is a graphics engine technique that uses hardware capabilities to draw many similar objects at once. It has been around since ca. 2005, so it’s about time.
Instancing for Meshes and CSGs was shipped in (about) October 2017. If you haven’t noticed anything, I’ll take it as a compliment.

Advantages:

  • saves a lot of graphics memory by not making many copies of the same piece of geometry;
  • eliminates expensive part re-clustering caused by moving parts, dynamic updates, etc., potentially shaving off 4-1000000 ms per frame;
    (Previously, e.g. changing color of a single part triggered re-generation of everything in the general vicinity of the offender.)
  • (almost) immediate part updates;

Disadvantages:

  • batching is less efficient and highly depends on the number of unique pieces of geometry used;
  • requires graphics hardware support (D3D11, GL3, GLES 3, Vulkan, Metal). Android support is the worst: a lot of devices are too old; and a lot are too buggy to support it reliably.

There are two new “metrics” (*) here to be (mildly) aware of: batching efficiency and update performance.

Batching efficiency

A very high number of draw calls is a CPU bottleneck in rendering code: it is the time our code takes to talk to e.g. D3D11, which in turn talks to the driver, and that cost is paid every frame. We can mitigate the issue by submitting many similar parts per single draw call.

Similar in this case means:

  1. parts have the same geometry (to a certain extent).
  2. parts have the same shader (material)
  3. parts have the same texture (if plastic)

For example:

  1. Many meshes that refer to the same assetid will share their geometry. I.e. there will be only one vertex buffer with exactly one copy of the mesh, and it will be used to render all of them. There is no more duplication of meshes when forming geometry clusters.

  2. If, say, 100 meshes use the same material and texture, they will be rendered in a single draw call, unless other properties disagree (see below). If half of your meshes use bricks, one quarter uses plastic with one texture, and the remaining quarter uses plastic with another texture, there will be three draw calls, unless other properties disagree.
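The draw call count in the example above can be reproduced with a small sketch. This is only an illustration of the grouping idea, not the engine's code: the dictionary keys `mesh_id`, `material` and `texture` and the helper `count_draw_calls` are hypothetical names, and the "other properties" mentioned above are assumed to agree.

```python
from collections import Counter


def count_draw_calls(parts):
    """Group parts by (geometry, material, texture); each group is one batch.

    Assumption: every other batch-relevant property is identical across parts.
    """
    batches = Counter((p["mesh_id"], p["material"], p["texture"]) for p in parts)
    return len(batches), batches


# 100 meshes sharing one asset: half bricks, a quarter plastic with one
# texture, a quarter plastic with another texture.
parts = (
    [{"mesh_id": "mesh-A", "material": "Brick", "texture": None}] * 50
    + [{"mesh_id": "mesh-A", "material": "Plastic", "texture": "tex-1"}] * 25
    + [{"mesh_id": "mesh-A", "material": "Plastic", "texture": "tex-2"}] * 25
)

draw_calls, batches = count_draw_calls(parts)
print(draw_calls)  # 3
```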

You can gauge batching efficiency by pressing Ctrl+Shift+F2 and looking at the bottom. One line should read, e.g. (Village template):

Clusters: fw 0c 0p; dyn 0c 0p; hum ...c ...p; inst 51c 561e 4559i

0c 0p - means there are no FastClusters, everything is instanced;
hum … - that’s for humanoids, we won’t touch those for now;
inst 51c 561e 4559i - means there are 51 instanced clusters with 561 render entities and 4559 instances.

The number of entities is essentially the number of batches (draw calls) it would take to render all 4559 parts in the scene. This tells us that on average there will be about 8 instances per draw call (4559 / 561 ≈ 8.1), which is not great…

So the bigger the ratio of instances to entities, the better the batching efficiency is. The theoretical maximum is about 512 (for now).
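If you want to track this ratio over time, the stats line can be parsed by hand. The sketch below assumes the line keeps the exact `inst <n>c <n>e <n>i` layout shown above; `parse_instancing_stats` is a hypothetical helper, not part of any Roblox API.

```python
import re


def parse_instancing_stats(line):
    """Extract instanced cluster/entity/instance counts from the stats line."""
    match = re.search(r"inst\s+(\d+)c\s+(\d+)e\s+(\d+)i", line)
    if match is None:
        raise ValueError("no 'inst <n>c <n>e <n>i' group found")
    clusters, entities, instances = map(int, match.groups())
    # Instances per entity approximates instances per draw call.
    ratio = instances / entities if entities else 0.0
    return {
        "clusters": clusters,
        "entities": entities,
        "instances": instances,
        "instances_per_draw_call": ratio,
    }


line = "Clusters: fw 0c 0p; dyn 0c 0p; hum ...c ...p; inst 51c 561e 4559i"
stats = parse_instancing_stats(line)
print(round(stats["instances_per_draw_call"], 1))  # 8.1
```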

The following properties do not incur batching costs (i.e. parts that differ only in these still count as similar):

For parts:

  • CFrame (position/rotation)
  • Size(+)
  • Color/BrickColor
  • UsePartColor
  • Reflectance

For SpecialMeshes:

  • Offset
  • Scale
  • VertexColor

For Decals:

  • Color3
  • Transparency
  • StudsPerTileU/V

+ - the following exceptions for Size apply:

  • trusses have to have the same number of segments, otherwise a different piece of geometry is generated for every unique number of segments.
  • elongated head SpecialMeshes will turn into cylinders, which will effectively split the batch into two

The following are known notorious Batch Wreckers:

  • MeshType
  • MeshId
  • TextureId
  • Material
  • Stud configuration - this one will generate a slightly different copy per part type, per face, per stud type. There are about 2700 different combinations of just those, so be careful with studs. Stud configuration has no effect on MeshPart or CSG batching.
  • Transparency - this one is the worst. Since OIT is still expensive (ask me again in 5 years), we have to force a single instance per draw call for each transparent part. Does not pertain to decal transparency.

Properties not explicitly mentioned here, like Name or Velocity, have no effect on graphics.
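As a rough mental model, the two lists above can be read as "which properties go into the batch key". The sketch below is an assumption-laden illustration, not the renderer's logic: the `Part` dataclass and `estimate_draw_calls` are hypothetical, the Size exceptions for trusses and heads are ignored, and so are cluster boundaries and the per-batch instance limit.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Part:
    # Batch wreckers: a difference in any of these splits the batch.
    mesh_type: str = "Block"
    mesh_id: str = ""
    texture_id: str = ""
    material: str = "Plastic"
    studs: tuple = ("Smooth",) * 6  # one surface type per face
    transparency: float = 0.0
    # Free properties: these do not affect batching.
    color: str = "Medium stone grey"
    size: tuple = (4.0, 1.0, 2.0)


def estimate_draw_calls(parts):
    """Opaque parts batch by key; each transparent part costs one draw call."""
    opaque_keys = set()
    transparent = 0
    for p in parts:
        if p.transparency > 0:
            transparent += 1
        else:
            opaque_keys.add(
                (p.mesh_type, p.mesh_id, p.texture_id, p.material, p.studs)
            )
    return len(opaque_keys) + transparent


base = Part()
scene = [
    base,
    replace(base, color="Bright red", size=(2.0, 1.0, 2.0)),
    replace(base, color="Bright blue", size=(8.0, 1.0, 2.0)),
]
same_batch = estimate_draw_calls(scene)  # 1: color and size are free
with_brick = estimate_draw_calls(scene + [replace(base, material="Brick")])  # 2
with_glass = estimate_draw_calls(scene + [replace(base, transparency=0.5)] * 3)  # 4
print(same_batch, with_brick, with_glass)
```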

Note on decals: internally, decals use a separate geometry piece that closely follows the object that they’re mapped on top of. They are rendered with transparency on all the time, but it’s not as ridiculous as one per draw call.

Update performance

Batching efficiency alone is a good indicator of static performance, i.e. it is the “base cost” of just rendering so many things. When parts are dynamic, though, additional performance considerations come into play.

Relative costs of updates, from faster to slower

  1. Nothing - does not incur any dynamic costs. Static objects are not updated at all.

  2. CFrame (position/rotation) of meshes, CSGs, blocks/cylinders/balls with no SpecialMeshes.
    This is “the fast path”. As cheap as patching a few floats in a struct that the renderer sends to the GPU.

  3. Color, UsePartColor, Size, Reflectance.
    Triggers a full update for the part and a bbox update for the cluster.

  4. SpecialMeshes.
    Approx. 10x slower to update than basic parts, also triggers bbox update for the cluster. Also, there is no ‘fast path’ for SpecialMeshes.

  5. CFrame, moving across cluster boundaries.
    If a position update moves the part too far to a different cluster, internally this triggers part handover logic, which involves bumping of a few lists, etc. Will trigger bbox updates for two clusters.

  6. Transparency.
    Changing transparency from nonzero to nonzero is the same as color/size/etc. However, transitioning between zero and nonzero always involves creation/destruction of a few internal graphics objects.

  7. Changing anything else (graphics-related, so not Name or Velocity) triggers re-creation of internal graphics objects. This also includes any changes to the object’s decals and child SpecialMesh properties. If it had any decals, the decals are also re-created. Expect memory allocations, extending lists, updates to clusters.

Note that multiple property updates are handled properly, and graphics objects are updated (almost) only once.
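The cost list above can be turned into a quick lookup when reviewing scripts that animate parts. This is one reading of the list, not engine behavior: `update_tier` is a hypothetical helper, it assumes any update to a part with a SpecialMesh lands in tier 4 at best, and it assumes the slowest changed property dominates when several change at once.

```python
BASIC_PROPS = {"Color", "UsePartColor", "Size", "Reflectance"}
NON_GRAPHICS = {"Name", "Velocity"}


def update_tier(changed, *, has_special_mesh=False, crosses_cluster=False,
                old_transparency=0.0, new_transparency=0.0):
    """Return the cost tier (1 = cheapest, 7 = slowest) for a set of changes."""
    tiers = [1]  # nothing graphics-related changed: no dynamic cost
    for prop in changed:
        if prop in NON_GRAPHICS:
            continue
        if prop == "CFrame":
            if crosses_cluster:
                tiers.append(5)  # part handover between clusters
            else:
                tiers.append(4 if has_special_mesh else 2)
        elif prop in BASIC_PROPS:
            tiers.append(4 if has_special_mesh else 3)
        elif prop == "Transparency":
            toggles = (old_transparency == 0) != (new_transparency == 0)
            if toggles:
                tiers.append(6)  # zero <-> nonzero recreates objects
            else:
                tiers.append(4 if has_special_mesh else 3)
        else:
            tiers.append(7)  # anything else: full re-creation
    return max(tiers)


print(update_tier({"CFrame"}))                                  # 2
print(update_tier({"Transparency"}, new_transparency=0.5))      # 6
print(update_tier({"Material", "Color"}))                       # 7
```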

Other noteworthy changes

  1. Head SpecialMeshes no longer “expand” as before, they are simply scaled up to a certain size, and then replaced with a cylinder, with decals disabled. (see https://devforum.roblox.com/t/potential-deprecation-of-non-uniform-head-scaling-feedback-welcome/101768/7)
  2. Torso SpecialMeshes are rendered as boxes. (see the thread “SpecialMesh.MeshType=Enum.MeshType.Torso will be deprecated soon”)
  3. Outlines are not supported. Turning on outlines inhibits part instancing for the entire place file. (Meshes and CSGs are unaffected.)
  4. For wedge parts, studs on slant faces will look a bit “non-Euclidean” when at 45 degrees, due to non-uniform scaling.

(*) - not actually metrics.

I’m uncertain of the use of the word “texture” in the context “plastic with texture”. I think it refers to SurfaceType (e.g Smooth, Studs, etc.), but please clarify if I’m wrong.

@maxvee Thank you.

The amount of edge cases that had to be covered for this must have been insane.
Massive kudos to @maxvee for pulling through.

loving it

mcdonalds jingle

Batching Efficiency? Add this please

Great thread! Really interesting to read.

One question regarding batching efficiency…

Is this limited to each individual cluster?

If I have two identical MeshParts in close proximity, they are both rendered in one single draw call. But, if I position them further apart, they’re split into different clusters and are rendered in two separate draw calls.

Will the latter have a negative performance impact?

This is a gold mine of efficiency knowledge. This information should be put on a wiki page for easy reference in the future.

Will instancing apply to adornments?

Finally!

Currently we only instance within one cluster so if two meshes are far apart they will render as separate draw calls. This will likely change later. Note that the memory for mesh geometry is shared across clusters - we just don’t batch draw calls across clusters.

How close do two parts/meshes need to be for them to be considered in the same cluster? Is there a certain threshold or do they work like “chunks”?

Thank you for such an informative post :smile:

This is great, seeing the rendering engine become more optimised is something I’m always happy to see.

This is the type of post I really enjoy seeing, very full of details without getting too technical, but enough to get my mind working on how to be more efficient in my building.

They work like chunks with size 128x64x128 studs. FWIW I would not advise designing your levels for this or anything - these details may change and shouldn’t be critical to performance with further improvements to the system (such as cross-chunk draw call merging)
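For the curious, here is a minimal sketch of what chunk assignment could look like with those dimensions. It assumes the grid is aligned to the world origin and that a part is assigned by a single position point; neither is confirmed above, and `chunk_of` and `same_chunk` are hypothetical names.

```python
import math

CHUNK_SIZE = (128, 64, 128)  # studs along X, Y, Z, as quoted above


def chunk_of(position):
    """Map a world position to integer chunk coordinates (origin-aligned grid assumed)."""
    return tuple(math.floor(c / s) for c, s in zip(position, CHUNK_SIZE))


def same_chunk(a, b):
    """True if both positions fall in the same chunk under these assumptions."""
    return chunk_of(a) == chunk_of(b)


print(chunk_of((10, 10, 10)))                 # (0, 0, 0)
print(chunk_of((130, 10, 10)))                # (1, 0, 0)
print(same_chunk((10, 10, 10), (120, 60, 5)))  # True
```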

Yeah this is correct. Each unique combination of stud types on surfaces results in unique internal geometry and we can’t instance that. So if you have a block smooth-on-all-sides and a block that has one surface marked as studs, we won’t render them in one draw call. You might want to use smooth-on-all-sides blocks for other reasons (such as performance of place loading due to internal physics operations that run on place load).

Want to update the thread to mention the progress:

  1. Part Instancing
    We’ve been going through the rollout, discovering a few small bugs and fixing them along the way. It’s really close as far as we know - it’s currently disabled, but we’ll try to enable it in the coming weeks on desktop. Didn’t quite make RDC US, but will definitely make RDC EU :wink:

  2. Inter-cluster instancing
    We’ve implemented a feature that allows merging objects from different clusters into a single draw call dynamically. This is live on desktop as of right now and currently means that any Mesh/CSG parts that are clones of each other and are in view will be rendered with a single draw call regardless of where they are. This also opens up opportunities for significant optimization of transparent part rendering - with part instancing and this combined, we’re seeing less overhead for transparent parts, which is awesome.

  3. macOS performance
    We’ve discovered a performance regression with instancing (also reported here https://devforum.roblox.com/t/roblox-critical-metal-graphics-mode-on-mac-cause-drops-in-fps-to-rendering-many-parts) that affected both part instancing (which was enabled for ~1 hour last week) and mesh/CSG instancing, and only impacted NVIDIA GPUs on Mac. This issue has since been fixed, so on macOS/NVIDIA, mesh & CSG parts should render faster now.

  4. Part instancing and levels with really high part count
    The initial release of part instancing uses “dynamic” instance data submission - the data that describes the visual appearance of each part is uploaded to the GPU every frame. This works fine for levels with a reasonable number of visible parts - e.g. 10-20k - but starts hitting performance issues for really aggressive scenes, e.g. 100k visible parts. We’re working on a change that, on a per-cluster basis, caches the part instance data in GPU buffers and updates those. These updates should be much faster than the reclustering operations that could happen in the previous system, which hopefully should make the new system work well even on levels with millions of blocks.
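To make the difference in item 4 concrete, here is a toy model of per-cluster caching. It only counts uploads and is not how the engine is implemented: `ClusterInstanceCache` and its methods are hypothetical, and the "GPU buffer" is just a dictionary.

```python
class ClusterInstanceCache:
    """Toy per-cluster cache: only parts changed since the last frame are re-uploaded."""

    def __init__(self):
        self.cpu_data = {}    # latest instance data per part
        self.gpu_buffer = {}  # stand-in for a persistent GPU buffer
        self.dirty = set()    # parts changed since the last flush

    def set_part(self, part_id, data):
        self.cpu_data[part_id] = data
        self.dirty.add(part_id)

    def flush(self):
        """Upload dirty entries only; return how many were uploaded this frame."""
        uploaded = len(self.dirty)
        for part_id in self.dirty:
            self.gpu_buffer[part_id] = self.cpu_data[part_id]
        self.dirty.clear()
        return uploaded


cache = ClusterInstanceCache()
for i in range(1000):
    cache.set_part(i, {"color": "grey"})

first_frame = cache.flush()   # everything is new, so everything uploads
idle_frame = cache.flush()    # nothing changed, nothing uploads
for i in (3, 7, 42):
    cache.set_part(i, {"color": "red"})
busy_frame = cache.flush()    # only the three changed parts upload
print(first_frame, idle_frame, busy_frame)  # 1000 0 3
```

Under the "dynamic" scheme described above, every frame would behave like the first one in this model, uploading data for all visible parts.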

Will the live time CSG on the roadmap work with instancing as well?
