Luau, Optimizations and Using them consciously

Disclaimer, because I am sure that the content of this post may be controversial for some:

  • I’m in no way a professional at developing software; I’m just a guy who can read C++ comments and who works on his own Studio executor in his free time. I am simply sharing what I believe to be good, useful knowledge.

  • This is a small tutorial focused on optimizing your code style, design, and overall structure so we can cheat our way into slightly better performance and memory usage, taking advantage of things consciously instead of unconsciously.

Like a great mind once said…

If you would be a real seeker after truth, it is necessary that at least once in your life you doubt, as far as possible, all things.

  • René Descartes (1596-1650)

So doubt everything in this post and test it yourself; the results could be different in your case!


Lexicon

Up-References/Up-Values:

  • They are references made to variables outside of the scope of a function.
local dummyVariable = 0
local function test() --- nups (number of upvalues): 1.
	print(
		dummyVariable -- Captures dummyVariable as an Up-Reference
	)
end

test()
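
To make the definition a bit more concrete, here is a minimal sketch (makeCounter is a hypothetical name used only for illustration). It shows that an up-value is the captured variable itself rather than a copy, so each closure keeps its own state between calls:

```lua
-- Hypothetical helper for illustration: each call creates a fresh `count`
-- variable, and the returned function captures it as an up-value.
local function makeCounter()
	local count = 0
	return function()
		count = count + 1 -- mutates the captured variable, not a copy
		return count
	end
end

local first = makeCounter()
local second = makeCounter()

print(first())  -- 1
print(first())  -- 2
print(second()) -- 1 (separate up-value, separate state)
```

This per-closure state is exactly why a function that captures a fresh local cannot simply be shared between objects, which matters for the DUPCLOSURE section later on.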

Virtual Machine (programming context):

  • It is software that interprets a specific format of ‘bytecode.’

Bytecode:

  • Optimized representation of source code that a virtual machine can interpret to do operations.

Operation Code (or Op. Code):

  • It is an instruction that a virtual machine or CPU can interpret. We will be focusing on the first one.
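
As a rough illustration of how these three terms fit together, here is a toy register-based interpreter. The op code names and the instruction layout are made up for this sketch and are NOT Luau’s real bytecode format:

```lua
-- Toy VM for illustration only; opcode names and layout are assumptions.
local function run(bytecode)
	local registers = {}
	local pc = 1 -- program counter: index of the next instruction
	while true do
		local instruction = bytecode[pc]
		local op = instruction[1]
		if op == "LOADN" then
			-- LOADN target, value: put a number into a register
			registers[instruction[2]] = instruction[3]
		elseif op == "ADD" then
			-- ADD target, left, right: add two registers
			registers[instruction[2]] = registers[instruction[3]] + registers[instruction[4]]
		elseif op == "RETURN" then
			-- RETURN source: stop and hand back one register
			return registers[instruction[2]]
		else
			error("unknown op code: " .. tostring(op))
		end
		pc = pc + 1
	end
end

-- "Bytecode" for the source expression `2 + 3`
local program = {
	{ "LOADN", 0, 2 },
	{ "LOADN", 1, 3 },
	{ "ADD", 2, 0, 1 },
	{ "RETURN", 2 },
}

print(run(program)) -- 5
```

The `if`/`elseif` chain is the "virtual machine", the `program` table is the "bytecode", and each first element is an "op code".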

Where to begin?

We should begin by understanding why Luau was even conceived. Luau came up as an alternative VM, whose full rollout was announced on Faster Lua VM Released | Roblox Developer Forum in 2019. Up to that point, we had been using Lua 5.1, whose performance, while good, was not perfect, and ROBLOX had started noticing its shortcomings. The rollout also had the unintended side effect of breaking cheats, since the new VM came with brand-new op codes and a new bytecode format.

The performance

Luau comes packed with performance. As the engineer wrote 6 years ago, some scripts ran almost twice as fast! These optimizations show up all over the place; however, we can still squeeze out more by consciously taking the most optimal paths.

The fast-path(s)

The Luau VM has many implementations for its many opcodes. These have multiple ‘paths.’ These paths are normally labeled ‘fast-path’ and ‘slow-path.’ We want to MAXIMIZE the number of ‘fast paths’ we take in order to improve our performance and efficiency.

A key to taking these fast paths is making sure our environment never loses its ‘safe’ flag. In short, the ‘safe’ flag marks the environment that is currently running as one whose globals have not ‘changed’ uncontrollably while executing. You can only lose the ‘safe’ flag in one of the following ways (which I know of at least!):

  • Using getfenv
  • Using setfenv
  • Using loadstring

These functions are either deprecated or their usage is not recommended.

The flag, however, allows the interpreter to make some assumptions about the environment, such as the fact that once a non-mutable global is retrieved, it is guaranteed to not change throughout the script’s lifetime. This takes us to our first op code.
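
As a hedged sketch of how to stay away from the functions listed above: a common reason to reach for setfenv is injecting values into a function, and passing them as an ordinary argument does the same job without touching the environment. The names `greet` and `settings` are hypothetical:

```lua
-- Pattern to avoid (left as a comment so nothing here touches the
-- environment): injecting values through the function environment.
--   setfenv(1, { print = print, settings = { greeting = "Hello" } })
--   print(settings.greeting)

-- Alternative sketch: pass the same data as an ordinary argument.
-- `greet` and `settings` are hypothetical names used for illustration.
local function greet(settings, name)
	return settings.greeting .. ", " .. name .. "!"
end

local settings = { greeting = "Hello" }
print(greet(settings, "world")) -- Hello, world!
```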

GETIMPORT

GETIMPORT is an operation code that is basically an optimized GETGLOBAL.

When the bytecode is loaded into the Luau VM using the luau_load function, it does a few operations:

  • Check bytecode version
  • Check types version
  • Load string table
  • Load userdata type mappings
  • Load all the functions in the bytecode.
  • Find the ‘main’ function
  • Create a Closure object that represents the Luau function that was just loaded.

Simple and easy! However, we are interested in the 5th step, Load all the functions in the bytecode. During this step, the Luau VM will begin loading the type information, instructions, constants, and debugging information (if available). The loading of constants is where the magic happens.

There are different kinds of constants, namely:

  • nil
  • boolean
  • number
  • vector
  • string
  • table
  • closure

And there is another one, which is specific to GETIMPORT: the import constant.

This constant is what gives GETIMPORT its speed. The globals that are marked non-mutable are resolved and saved into the constants table.

Then, when GETIMPORT’s interpreter implementation is called, it will first check if the import resolution was successful, and if it was and the environment is ‘safe,’ it retrieves the imported global from the constants table, skipping a table lookup that could have potentially caused a __index call, saving precious execution time.
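
The logic described above can be modelled in a few lines of Luau. This is a conceptual sketch only: the field names (`safe`, `env`, `resolved`, `path`) and function names are hypothetical, and the real implementation lives in the VM’s C++ code, not in anything like this:

```lua
-- Conceptual model only; every name here is an assumption for illustration.

-- Slow path: walk the environment one key at a time, e.g. { "math", "floor" }.
local function resolveSlow(env, path)
	local value = env
	for _, key in ipairs(path) do
		value = value[key] -- a real lookup could trigger __index here
	end
	return value
end

-- Load time: resolve the import once and cache it in the constant.
local function loadImportConstant(env, path)
	return { path = path, resolved = resolveSlow(env, path) }
end

-- Run time: use the cached value only if it was resolved and the
-- environment is still marked safe; otherwise fall back to the lookup.
local function getImport(state, constant)
	if state.safe and constant.resolved ~= nil then
		return constant.resolved
	end
	return resolveSlow(state.env, constant.path)
end

local state = { safe = true, env = { math = { floor = math.floor } } }
local floorImport = loadImportConstant(state.env, { "math", "floor" })

print(getImport(state, floorImport)(3.7)) -- 3

-- Once the environment stops being safe, changes to globals are observed
-- again because every access goes through the slow path.
state.safe = false
state.env.math.floor = math.ceil
print(getImport(state, floorImport)(3.7)) -- 4
```

The point of the model is the `if` in `getImport`: as long as the flag holds, the nested lookups never run.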


DUPCLOSURE

DUPCLOSURE simply reuses or duplicates a closure if certain, specific conditions are met.

To simplify, a Closure object represents an instance of a function that the Luau interpreter can call at any point. These have two underlying implementations: lua_CFunction and Proto. The first are functions exposed from C/C++, while the latter are Luau functions exposed from bytecode.

The compiler will emit the DUPCLOSURE op. code only when it is sure that the function can be reused. When every instance of the Closure object would have its own unique up-value, as with a function that captures a local created on each call, the NEWCLOSURE op. code is emitted instead, which guarantees a closure allocation.

So, if we use, say, this code:

local lib = {}

function lib.create()
	local createdObject = {}
	function createdObject.printObject()
		print(createdObject)
	end
	return createdObject
end
print(lib.create().printObject)
print(lib.create().printObject)
print(lib.create().printObject)
print(lib.create().printObject)

Then the disassembly of lib.create, when compiled at optimization level 0 or higher, is the following:

NEWTABLE R0, 1, 0 ; Create a new table and store it into R0. The size of the hash-table will be '1' and the size of the array will be '0'.
NEWCLOSURE R1, P0 ; Create a new closure from P0 (proto 0), and save it into R1. 
CAPTURE VAL R0 ; Capture the value at R0 'createdObject'
SETTABLEKS R1 R0 K0 ; Set the index K0 'printObject' in the table present at R0 'createdObject' to whatever is in R1 (the function we just created).
RETURN R0, 1 ; Return n (1) registers from R0

This is all to say that we will create a new function that will capture (hold an up-reference to) the table we created. Then we will set that function to an index and return the table we created originally.

Unfortunately, this means we create a new closure every single time we call this function.

This can be proven just by loading the code, calling the function a few times, and looking at what we obtain on our print:
[Image: Studio output of the four print calls]
Different functions every. single. call.

However, if we slightly modify the implementation to the following:

local lib = {}

function lib.create()
	local createdObject = {}
	function createdObject.printObject(self)
		print(self)
	end
	return createdObject
end
print(lib.create().printObject)
print(lib.create().printObject)
print(lib.create().printObject)
print(lib.create().printObject)

Then the DUPCLOSURE op code is emitted by the compiler instead of NEWCLOSURE.

Running the code on Studio yields us the following output:
[Image: Studio output of the four print calls]

Just like magic, we are using the same function over and over again; we are not sacrificing memory for performance like people continue to believe.
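
If you would rather not compare printed addresses by eye, a small sketch like the following checks the same thing with `==`. The helper names are hypothetical, and the functions return their object instead of printing it so the output stays short. The result for the self-passing variant depends on the compiler and VM you run it on, so test it in your own environment:

```lua
-- Variant 1: the method captures `object` as an up-value.
local function makeCapturing()
	local object = {}
	function object.printObject()
		return object
	end
	return object
end

-- Variant 2: the method receives the object as `self`, no up-values.
local function makeSelfPassing()
	local object = {}
	function object.printObject(self)
		return self
	end
	return object
end

-- Two functions compare equal only if they are the very same closure.
local capturingShared = makeCapturing().printObject == makeCapturing().printObject
local selfShared = makeSelfPassing().printObject == makeSelfPassing().printObject

print("capturing variant shares one closure:", capturingShared)
print("self-passing variant shares one closure:", selfShared)
```

The capturing variant can never share a closure, since each one has to hold a different `object`.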



These are the two op codes I wanted to go over in this community tutorial, really, mainly since I have encountered more than one person who has gone out of their way to say, ‘Localize everything!’ or ‘You will be wasting memory doing this!’ This is to simply prove to you, dear reader, that no. You are not wasting resources doing this; you’re simply deciding not to use metatables and making sure to take the fastest VM paths possible where applicable.

I hope to release a follow-up post that disproves the ‘advantage’ of metatables on some aspects; however, I wish to obtain feedback on this whole post, what I could improve, and perhaps even users to run their own tests. However, it is fairly clear that if no functions are created, we don’t waste time creating the object, allocating it, etc., which is quicker regardless!

Let me know your thoughts down below :smiley:

17 Likes

speaking from experience, the dot operation causes your script to be slower.
the following code will run slightly slower

print(game.workspace.Part.Attachment.WorldPosition)

than this code (faster):

local pos = game.workspace.Part.Attachment.WorldPosition
print(pos)
1 Like

Alrighty:

The first sample has this disassembly:

-- Disassembled with Konstant V2.1's disassembler, made by plusgiant5
-- Disassembled on 2025-04-29 12:14:58
-- Luau version 6, Types version 3
-- Time taken: 0.000190 seconds

[0] #1 [0x00000041]          PREPVARARGS 0                     ; -- Prepare for any number (top) of variables as ...
[1] #2 [0x0001000C]          GETIMPORT 0, 1 [0x40000000]       ; var0 = print
[3] #3 [0x0003050C]          GETIMPORT 5, 3 [0x40200000]       ; var5 = game
[5] #4 [0xCA05040F]          GETTABLEKS 4, 5, 202 [4]          ; var4 = var5.Workspace
[7] #5 [0x4404030F]          GETTABLEKS 3, 4, 68 [5]           ; var3 = var4.Part
[9] #6 [0x0D03020F]          GETTABLEKS 2, 3, 13 [6]           ; var2 = var3.Attachment
[11] #7 [0xE002010F]         GETTABLEKS 1, 2, 224 [7]          ; var1 = var2.WorldPosition
[13] #8 [0x01020015]         CALL 0, 2, 1                      ; var0(var1)
[14] #9 [0x00010016]         RETURN 0, 1                       ; return

while your second sample has the following disassembly:

-- Disassembled with Konstant V2.1's disassembler, made by plusgiant5
-- Disassembled on 2025-04-29 12:15:58
-- Luau version 6, Types version 3
-- Time taken: 0.000368 seconds

[0] #1 [0x00000041]          PREPVARARGS 0                     ; -- Prepare for any number (top) of variables as ...
[1] #2 [0x0001040C]          GETIMPORT 4, 1 [0x40000000]       ; var4 = game
[3] #3 [0xCA04030F]          GETTABLEKS 3, 4, 202 [2]          ; var3 = var4.Workspace
[5] #4 [0x4403020F]          GETTABLEKS 2, 3, 68 [3]           ; var2 = var3.Part
[7] #5 [0x0D02010F]          GETTABLEKS 1, 2, 13 [4]           ; var1 = var2.Attachment
[9] #6 [0xE001000F]          GETTABLEKS 0, 1, 224 [5]          ; var0 = var1.WorldPosition
[11] #7 [0x0007010C]         GETIMPORT 1, 7 [0x40600000]       ; var1 = print
[13] #8 [0x00000206]         MOVE 2, 0                         ; var2 = var0
[14] #9 [0x01020115]         CALL 1, 2, 1                      ; var1(var2)
[15] #10 [0x00010016]        RETURN 0, 1                       ; return

This is a disassembly of the bytecode produced by a ROBLOX-like compiler (one that accounts for ROBLOX’s mutable globals and compiler settings).

I really don’t understand why one sample would execute slower than another. However, I did gloss over a small detail during my editing for the final release of this post, being the following:

The Luau compiler is really smart. It will try to use GETIMPORT on virtually everything as long as it is not defined as a mutable global. Mutable globals cannot be GETIMPORTed, since their state changes! Because of this, it could affect the time of resolution slightly. What are your testing conditions for this? Because, by the looks of it, an additional MOVE will not improve performance, possibly decreasing it! So if you could provide the test and the methodology for what you mean, I’d be very glad to test it and understand why such could be.

Following the Luau VM code, it really only seems like you would lose performance rather than gain it, since you’re making a somewhat unnecessary MOVE.

For reference, the settings ROBLOX compiles with, at least in Studio, are the following:

const char *mutableGlobals[] = {
    "Game", "Workspace", "game", "plugin", "script", "shared", "workspace",
    nullptr
};

auto compileOpts = Luau::CompileOptions{};
compileOpts.optimizationLevel = 2;
compileOpts.debugLevel = 1;
compileOpts.vectorLib = "Vector3";
compileOpts.vectorCtor = "new";
compileOpts.vectorType = "Vector3";
compileOpts.mutableGlobals = mutableGlobals;
return Luau::compile(szLuauCode.data(), compileOpts);

I’d also remark that calling print is slower than you may think, so that’s not really a good comparison point!

1 Like

Following up on this, I have just compared the code, and the difference between the two is within the margin of error.

Even then, most of the cost (time spent on the code) is likely on the call to print, and not on obtaining the variable from the game global!


Even simply removing the print call yields us near sub-microsecond performance; Benchmarker itself says the following:


However, after removing the prints and replacing them with a stub call to a function that is empty and running 2000 iterations of it, we receive the following outputs:

Very much equal, with B coming out behind, unlike what you proposed before.

And when going to extremes, such as 20000 iterations, the difference becomes more pronounced.

!!! To be noted, however, is that I have done these tests in --!optimize 1. Roblox now compiles using --!optimize 2, last I checked, and at that level the two samples would, ideally, produce almost the same, if not THE SAME, bytecode when a Luau function is used as the stub. This is because --!optimize 2 enables inlining optimisations, which --!optimize 1 does not. That makes --!optimize 1 perfect for this test: in an effort to maintain as much accuracy as possible, the call stays in the bytecode, made to a ‘function’ that has no weight in our test (unlike print).

Benchmarker test:

--!optimize 1
--[[
This file is for use by Benchmarker (https://boatbomber.itch.io/benchmarker)

|WARNING| THIS RUNS IN YOUR REAL ENVIRONMENT. |WARNING|
--]]

local function f(p) end

return {
	ParameterGenerator = function()
		return
	end,

	Functions = {
		["A"] = function(Profiler)
			for i = 1, 20000 do
				f(game.workspace.Part.Attachment.WorldPosition)
			end
		end,

		["B"] = function(Profiler)
			for i = 1, 20000 do
				local pos = game.workspace.Part.Attachment.WorldPosition
				f(pos)
			end
		end,
	},
}

As you can see, sample B ends up being slower. The extra MOVE instruction is likely playing against it.

Disassembly of ‘A’

-- Disassembled with Konstant V2.1's disassembler, made by plusgiant5
-- Disassembled on 2025-04-29 12:46:28
-- Luau version 6, Types version 3
-- Time taken: 0.000409 seconds

[0] #1 [0x00000041]          PREPVARARGS 0                     ; -- Prepare for any number (top) of variables as ...
local function f() -- Line 2
end
[1] #2 [0x00000040]          DUPCLOSURE 0, 0                   ; var0 = f
[2] #3 [0x00010304]          LOADN 3, 1                        ; var3 = 1
[3] #4 [0x4E200104]          LOADN 1, 20000                    ; var1 = 20000
[4] #5 [0x00010204]          LOADN 2, 1                        ; var2 = 1
[5] #6 [0x000D0138]          FORNPREP 1, 13                    ; for var3 = var3, var1, var2 do -- If loop shouldn't start (var3 > var1) then goto [19]
::18::
[6] #7 [0x00000406]          MOVE 4, 0                         ; var4 = var0
[7] #8 [0x0002090C]          GETIMPORT 9, 2 [0x40100000]       ; var9 = game
[9] #9 [0xEA09080F]          GETTABLEKS 8, 9, 234 [3]          ; var8 = var9.workspace
[11] #10 [0x4408070F]        GETTABLEKS 7, 8, 68 [4]           ; var7 = var8.Part
[13] #11 [0x0D07060F]        GETTABLEKS 6, 7, 13 [5]           ; var6 = var7.Attachment
[15] #12 [0xE006050F]        GETTABLEKS 5, 6, 224 [6]          ; var5 = var6.WorldPosition
[17] #13 [0x01020415]        CALL 4, 2, 1                      ; var4(var5)
[18] #14 [0xFFF30139]        FORNLOOP 1, -13                   ; var3 += var2; if var3 <= var1 then goto [6] end
::5::
[19] #15 [0x00010016]        RETURN 0, 1                       ; return

Disassembly of B:

-- Disassembled with Konstant V2.1's disassembler, made by plusgiant5
-- Disassembled on 2025-04-29 12:45:59
-- Luau version 6, Types version 3
-- Time taken: 0.000231 seconds

[0] #1 [0x00000041]          PREPVARARGS 0                     ; -- Prepare for any number (top) of variables as ...
local function f() -- Line 2
end
[1] #2 [0x00000040]          DUPCLOSURE 0, 0                   ; var0 = f
[2] #3 [0x00010304]          LOADN 3, 1                        ; var3 = 1
[3] #4 [0x4E200104]          LOADN 1, 20000                    ; var1 = 20000
[4] #5 [0x00010204]          LOADN 2, 1                        ; var2 = 1
[5] #6 [0x000E0138]          FORNPREP 1, 14                    ; for var3 = var3, var1, var2 do -- If loop shouldn't start (var3 > var1) then goto [20]
::19::
[6] #7 [0x0002080C]          GETIMPORT 8, 2 [0x40100000]       ; var8 = game
[8] #8 [0xEA08070F]          GETTABLEKS 7, 8, 234 [3]          ; var7 = var8.workspace
[10] #9 [0x4407060F]         GETTABLEKS 6, 7, 68 [4]           ; var6 = var7.Part
[12] #10 [0x0D06050F]        GETTABLEKS 5, 6, 13 [5]           ; var5 = var6.Attachment
[14] #11 [0xE005040F]        GETTABLEKS 4, 5, 224 [6]          ; var4 = var5.WorldPosition
[16] #12 [0x00000506]        MOVE 5, 0                         ; var5 = var0
[17] #13 [0x00040606]        MOVE 6, 4                         ; var6 = var4
[18] #14 [0x01020515]        CALL 5, 2, 1                      ; var5(var6)
[19] #15 [0xFFF20139]        FORNLOOP 1, -14                   ; var3 += var2; if var3 <= var1 then goto [6] end
::5::
[20] #16 [0x00010016]        RETURN 0, 1                       ; return

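As you can see, the main difference simply lies in B’s extra MOVE instruction: B has two MOVEs (#12 and #13), while A only has one (#7). The MOVE at #13 copies pos into the argument register, which A does not need, likely making B slightly less performant than A, although the difference is practically within the margin of error.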
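
If you want to try a comparison like this without Benchmarker, a rough standalone sketch follows. Note the assumptions: `fakeGame` is a plain table standing in for `game` so the code runs outside Roblox, and `stub`, `sampleA`, `sampleB`, and `bestTime` are hypothetical names. Real Instances are userdata, so absolute numbers will not match Studio:

```lua
-- Stand-in for `game.workspace.Part.Attachment.WorldPosition`. Plain tables
-- are an assumption made so this runs outside Roblox.
local fakeGame = { workspace = { Part = { Attachment = { WorldPosition = 42 } } } }

local function stub(value) end -- empty function, like `f` in the test above

local ITERATIONS = 20000

local function sampleA()
	for i = 1, ITERATIONS do
		stub(fakeGame.workspace.Part.Attachment.WorldPosition)
	end
end

local function sampleB()
	for i = 1, ITERATIONS do
		local pos = fakeGame.workspace.Part.Attachment.WorldPosition
		stub(pos)
	end
end

-- Runs `fn` several times and keeps the best time to reduce noise.
local function bestTime(fn, runs)
	local best = math.huge
	for _ = 1, runs do
		local start = os.clock()
		fn()
		local elapsed = os.clock() - start
		if elapsed < best then
			best = elapsed
		end
	end
	return best
end

print(string.format("A: %.6f s", bestTime(sampleA, 20)))
print(string.format("B: %.6f s", bestTime(sampleB, 20)))
```

Expect the two numbers to be close and to swap places between runs; with a difference this small, a single run proves very little either way.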

3 Likes

First off, this is a great in-depth analysis of how upvalue optimizations within Luau work! I saw another post of yours explaining the concept, and when I saw you had also made a full post about it, I knew I had to read the whole thing.

This statement is too true. I believe a high percentage of devs, including myself before reading this post, would believe that each object creation would also create a new function, even without up values.

I think that comes down more to not understanding the optimizations Luau has over Lua. When I made my tutorial about OOP and memory efficiency (2020?), I believe Luau was still new(ish) to Roblox. I don’t even think I knew what it was at that point! I was still following old conventions. I really need to update that.

I still don’t know about all the optimizations Luau provides to Roblox. I hope you, and other devs, continue to dive into these types of optimizations, for those of us that either don’t have the time, or the knowledge to deep dive into the language itself.

Though I think metatables still have a place within OOP implementations. For… inheritance (Am I the only one that still does this?). But it sure lowers the performance quite a bit!

Also, this is an amazing mindset:

Following the advice, I tested it, and everything was true! Cool stuff!

1 Like