Optimizing native code performance with TYPE HINTS and MAGIC?!

INTRODUCTION

Hello guys, did you know that in Luau’s --!native mode, your type annotations aren’t just for linting? They also influence how the JIT compiler generates machine code.

If you “lie” to the compiler with incorrect type hints, your code can become significantly slower than if you had used no hints at all.

THE EXPERIMENT

In the script below, we compare two functions with identical logic but different type hints:

  • bad_func : Lies about its parameter being a buffer (it actually receives a number).
  • normal_func : Correctly annotates its parameter as number.

--!native
--!optimize 2
--!strict

local ITERATIONS = 100000000
local rand_num = Random.new():NextInteger(1, 10000)
local _1 = 1
local _2 = 1

-- Incorrect annotation: logic expects a number, hint says buffer
local function bad_func(num: buffer)
    for i = 1, ITERATIONS do
        _1 *= _1 + num
    end
end

-- Correct annotation
local function normal_func(num: number)
    for i = 1, ITERATIONS do
        _2 *= _2 + num
    end
end

-- Timing results
local start = os.clock()
bad_func(rand_num)
local elapsed = os.clock() - start
print("BAD:", math.round(elapsed * 1000), "ms")

-- Reuse the existing locals instead of shadowing them
start = os.clock()
normal_func(rand_num)
elapsed = os.clock() - start
print("NORMAL:", math.round(elapsed * 1000), "ms")

Typical Results

When running under --!native, the “bad” hint makes the function roughly 1.4x slower (about 180 ms on this run):

BAD: 603 ms
NORMAL: 423 ms
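
A single timed run can be noisy, so it may help to repeat each measurement and keep the fastest round. Below is a minimal sketch of that idea. The best_of helper and the stand-in workload are hypothetical names that do not appear in the original script, and the round count of 5 is an arbitrary assumption.

```luau
--!native
--!optimize 2

-- Hypothetical helper (not from the original post): runs `fn` several
-- times and returns the fastest wall-clock time in seconds.
local function best_of(rounds: number, fn: (number) -> (), arg: number): number
    local best = math.huge
    for _ = 1, rounds do
        local start = os.clock()
        fn(arg)
        best = math.min(best, os.clock() - start)
    end
    return best
end

local acc = 1

-- Stand-in workload with a correct annotation; swap in bad_func or
-- normal_func from the script above to compare them.
local function workload(num: number)
    for _ = 1, 1000000 do
        acc = (acc + num) % 1000003
    end
end

local seconds = best_of(5, workload, 7)
print("BEST OF 5:", math.round(seconds * 1000), "ms")
```

Taking the minimum filters out rounds that were disturbed by other work on the machine. It does not remove systematic effects such as which function happens to run first.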

WHY DOES THIS HAPPEN?

When you use --!native, the Luau JIT uses your type annotations to specialize the generated machine code.

Below is a quote from Luau.org documenting this behavior. The text was first added to the website on Feb 19, 2025.

“… Luau JIT takes into account the type annotations present in the source code to specialize code paths …”

Explanations

When you tell the JIT a variable is a number, it generates optimized CPU instructions for floating-point math. However, it also generates a “guard.”

If you provide a buffer hint but pass a number, the following happens:

  1. The JIT creates a “specialized code path” optimized for a buffer.
  2. During execution, the “guard” checks the input and sees it is actually a number.
  3. The machine code path is rejected, and the code falls back to being interpreted.

Pseudocode of the JIT logic

local function normal_func(num: number)
    if typeof(num) == "number" then
        -- NATIVE PATH: Executing raw machine code
        fast_native_path(num)
    else
        -- FALLBACK: Bytecode interpretation
        -- This happens when the argument does not match the hint
        for i = 1, ITERATIONS do
            _2 *= _2 + num
        end
    end
end

The Proof

If you remove the --!native flag, the performance difference disappears. Both functions take roughly 600 ms because both run in the interpreter. A “bad” hint only hurts when you are trying to go faster with --!native mode.

BAD: 605 ms
NORMAL: 603 ms

DEEP DIVE

In more complex benchmarks, things can get even crazier! (See below)

The “magic” of typeof and any

We discovered that using any or typeof() can sometimes be faster than manual type definitions for no apparent reason:

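┌──────────────────────────────┬────────────────────────┬──────────────┐
│ Type Annotation Style        │ Average Execution Time │ Performance  │
├──────────────────────────────┼────────────────────────┼──────────────┤
│ Inferred using any or typeof │ 411.3 ns               │ Fastest      │
│ Standard Type Definitions    │ 466.8 ns               │ 1.13x slower │
│ Incorrectly Typed            │ 1.723 µs               │ 4.19x slower │
└──────────────────────────────┴────────────────────────┴──────────────┘

Benchmark Code
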
-- Must enable native code generation to see the performance
-- impact of type hints!
-- 
--!native
--!optimize 2
--!strict


-- The use of types inferred using "typeof" can magically
-- boost performance by 1.1x in this benchmark!
-- 
type StandardTable = { _n: number, _p: number, _r0: number, _pr: number, _odds: number }
type MagicTable    = typeof({ _n = 0, _p = 0, _r0 = 0, _pr = 0, _odds = 0 })
type StandardArray = { number }
type MagicArray    = typeof({ 0, 0, 0, 0, 0 })
type HalfMagicArray = MagicArray | StandardArray


-- Each workload function has the same core logic, the only
-- main difference is the type hint used by each function.
-- 
-- Trust me, I didn't put any additional code into the functions
-- to affect their performance. (other than table unpacking code)
-- 
local function args_workload(rd: Random, n: number, p: number, r0: number, pr: number, odds: number): number
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

-- Every type hint of this function is "any"
local function args_workload_with_any_type_hints(rd: any, n: any, p: any, r0: any, pr: any, odds: any): any
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

-- Every type hint of this function is intentionally incorrect
local function args_workload_with_awful_type_hints(rd: number, n: buffer, p: Random, r0: any, pr: {}, odds: boolean): Vector3
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

local function standard_table_workload(rd: Random, tbl: StandardTable): number
    local n, p, r0, pr, odds = tbl._n, tbl._p, tbl._r0, tbl._pr, tbl._odds
    
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

local function magic_table_workload(rd: Random, tbl: MagicTable): number
    local n, p, r0, pr, odds = tbl._n, tbl._p, tbl._r0, tbl._pr, tbl._odds
    
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

local function standard_array_workload(rd: Random, tbl: StandardArray): number
    local n, p, r0, pr, odds = tbl[1], tbl[2], tbl[3], tbl[4], tbl[5]
    
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

local function magic_array_workload(rd: Random, tbl: MagicArray): number
    local n, p, r0, pr, odds = tbl[1], tbl[2], tbl[3], tbl[4], tbl[5]
    
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

local function half_magic_array_workload(rd: Random, tbl: HalfMagicArray): number
    local n, p, r0, pr, odds = tbl[1], tbl[2], tbl[3], tbl[4], tbl[5]
    
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end






-- Testing code (nothing magical beyond this point!)
--
local function format_duration(seconds: number): string
	return if     seconds >= 1.0      then string.format("%.3f s" , seconds             )
		   elseif seconds >= 0.001    then string.format("%.3f ms", seconds * 1000      )
		   elseif seconds >= 0.000001 then string.format("%.3f µs", seconds * 1000000   )
		   else                            string.format("%.3f ns", seconds * 1000000000)
end


local function format_throughput(ops: number): string
	return if     ops >= 1000000000 then string.format("%.3f G/s", ops / 1000000000)
		   elseif ops >= 1000000    then string.format("%.3f M/s", ops / 1000000   )
		   elseif ops >= 1000       then string.format("%.3f K/s", ops / 1000      )
		   else                          string.format("%.3f /s" , ops             )
end


local function dataframe_to_string(df      : {{[any]: any}},
                                   cols    : {any},
                                   col_fmt : {[any]: (string | (any) -> string)?}): string
	local formatted_data = table.create(#df)
	local n_cols = #cols
	local widths = table.create(n_cols, 0)
	
	for _, row in ipairs(df) do
		local formatted_row = table.create(n_cols)
		for i, key in ipairs(cols) do
			local value = row[key]
			local fmt = col_fmt[key]
			local text
			if fmt then
				if type(fmt) == "string" then
					text = string.format(fmt, value)
				else
					text = fmt(value)
				end
			else
				text = tostring(value)
			end
			formatted_row[i] = text
			widths[i] = math.max(widths[i], utf8.len(text))
		end
		table.insert(formatted_data, formatted_row)
	end
	
	local col_names = table.create(n_cols)
	for i, key in ipairs(cols) do
		local name = tostring(key)
		col_names[i] = name
		widths[i] = math.max(widths[i], utf8.len(name))
	end
	
	local function pad_strings(string_arr: {string})
		for i, str in ipairs(string_arr) do
			string_arr[i] = " " .. str .. string.rep(" ", widths[i] - utf8.len(str)) .. " "
		end
	end
	
	pad_strings(col_names)
	local top = "┌"
	local hdr = "│" .. table.concat(col_names, "│") .. "│"
	local mid = "├"
	local btm = "└"
	for i = 1, n_cols do
		-- Use the computed display width plus padding, not the byte length
		local w = widths[i] + 2
		top ..= string.rep("─", w) .. (if i ~= n_cols then "┬" else "┐")
		mid ..= string.rep("─", w) .. (if i ~= n_cols then "┼" else "┤")
		btm ..= string.rep("─", w) .. (if i ~= n_cols then "┴" else "┘")
	end
	for i, row in ipairs(formatted_data) do
		pad_strings(row)
		formatted_data[i] = "│" .. table.concat(row, "│") .. "│"
	end
	
	formatted_data[#formatted_data + 1] = btm
	formatted_data[ 0] = mid
	formatted_data[-1] = hdr
	formatted_data[-2] = top
	return table.concat(formatted_data, "\n", -2)
end


local function run_benchmark(n_rounds: number,
							 n_calls: number,
							 benchmark: {[any]: any}, ...: any)
	local results = {}
	
	for key, functor in benchmark do
		local name = (
			if     type(key    ) == "string"   then key
			elseif type(functor) == "function" then "[" .. tostring(key) .. "] " .. debug.info(functor, "n")
			else                                    "[" .. tostring(key) .. "]"
		)
		table.insert(results, {
			["Functor"] = functor,
			["Name"] = name,
			["Time Records"] = table.create(n_rounds),
			["Total Time"] = 0.0,
		})
    end
    
	for r = 1, n_rounds do
		for i, result in results do
			local functor = result["Functor"]
            
			local st = os.clock()
			for _ = 1, n_calls do
				functor(...)
			end
			local ed = os.clock()

			local elapsed_time = ed - st
			result["Time Records"][r] = elapsed_time
			result["Total Time"] += elapsed_time
        end
	end
	
	for i, result in results do
		local total_time = result["Total Time"]
		local time_records = result["Time Records"]

		table.sort(time_records)
		result["Best Time"] = time_records[1] / n_calls
		result["Worst Time"] = time_records[#time_records] / n_calls
		result["Avg. Time"] = total_time / (n_rounds * n_calls)
		result["Throughput"] = (n_rounds * n_calls) / total_time
		result["Rank"] = #results

		local sum = 0
		for _, t in time_records do
			sum += t
		end
		local mean = sum / #time_records

		local var = 0
		for _, t in time_records do
			var += (t - mean) ^ 2
		end
		var /= #time_records

		local stddev = math.sqrt(var)
		local cv = stddev / mean
		result["%RSD"] = cv * 100
	end

	for i, lhs in results do
		for j, rhs in results do
			if i ~= j then
				if lhs["Best Time"] <= rhs["Worst Time"] then
					-- Since we don't have a statistical library, we use this simple heuristic,
					-- which shows the best case ranking of a function.
					-- If multiple functions have the same rank, then there probably isn't enough
					-- evidence that one of them is more performant than the other
					lhs["Rank"] -= 1
				end
			end
		end
	end

	table.sort(results, function(a, b)
		return a["Total Time"] < b["Total Time"]
	end)

	local min_total_time = results[1]["Total Time"]
	for i, result in results do
		result["Factor"] = result["Total Time"] / min_total_time
	end

	local cols = {"Name", "Avg. Time", "%RSD", "Throughput", "Factor", "Rank", "Best Time", "Worst Time"}
	print("\n" .. dataframe_to_string(results, cols, {
		["Avg. Time" ] = format_duration,
		["Best Time" ] = format_duration,
		["Worst Time"] = format_duration,
		["Total Time"] = format_duration,
		["Throughput"] = format_throughput,
		["Factor"    ] = "%.3fx",
		["%RSD"      ] = "%.2f",
	}))
end









-- Test cases:
-- 
local tbl_data = { _n = 10000, _p = 0.3, _r0 = 3000, _pr = 0.008705361364930851, _odds = 0.4285714285714286 }
local arr_data = { 10000, 0.3, 3000, 0.008705361364930851, 0.4285714285714286 }

local function sample_using_standard_args(rd: Random)
	return args_workload(rd, 10000, 0.3, 3000, 0.008705, 0.42857)
end

local function sample_using_args_tagged_as_any(rd: Random)
    return args_workload_with_any_type_hints(rd, 10000, 0.3, 3000, 0.008705, 0.42857)
end

local function sample_using_awful_args(rd: Random)
	return args_workload_with_awful_type_hints(rd, 10000, 0.3, 3000, 0.008705, 0.42857)
end

local function sample_using_standard_table(rd: Random)
	return standard_table_workload(rd, tbl_data)
end

local function sample_using_magic_table(rd: Random)
	return magic_table_workload(rd, tbl_data)
end

local function sample_using_standard_array(rd: Random)
	return standard_array_workload(rd, arr_data)
end

local function sample_using_magic_array(rd: Random)
	return magic_array_workload(rd, arr_data)
end

local function sample_using_half_magic_array(rd: Random)
	return half_magic_array_workload(rd, arr_data)
end


local rd = Random.new()
run_benchmark(20, 100000, {
    using_standard_args      = sample_using_standard_args,
    using_args_tagged_as_any = sample_using_args_tagged_as_any,
    using_awfully_typed_args = sample_using_awful_args,
    using_standard_array     = sample_using_standard_array,
    using_standard_table     = sample_using_standard_table,
    using_magic_array        = sample_using_magic_array,
    using_magic_table        = sample_using_magic_table,
    using_half_magic_array   = sample_using_half_magic_array,
}, rd)

Detailed Benchmark Results

┌──────────────────────────┬────────────┬──────┬─────────────┬────────┬──────┬────────────┬────────────┐
│ Name                     │ Avg. Time  │ %RSD │ Throughput  │ Factor │ Rank │ Best Time  │ Worst Time │
├──────────────────────────┼────────────┼──────┼─────────────┼────────┼──────┼────────────┼────────────┤
│ using_args_tagged_as_any │ 411.397 ns │ 1.00 │ 2.431 M/s   │ 1.000x │ 1    │ 404.314 ns │ 421.085 ns │
│ using_magic_array        │ 411.963 ns │ 0.90 │ 2.427 M/s   │ 1.001x │ 1    │ 406.846 ns │ 421.481 ns │
│ using_half_magic_array   │ 413.424 ns │ 1.01 │ 2.419 M/s   │ 1.005x │ 1    │ 406.927 ns │ 420.875 ns │
│ using_magic_table        │ 415.501 ns │ 0.93 │ 2.407 M/s   │ 1.010x │ 1    │ 409.878 ns │ 423.415 ns │
│ using_standard_args      │ 466.897 ns │ 1.07 │ 2.142 M/s   │ 1.135x │ 5    │ 461.447 ns │ 480.484 ns │
│ using_standard_array     │ 468.936 ns │ 0.97 │ 2.132 M/s   │ 1.140x │ 5    │ 464.366 ns │ 481.593 ns │
│ using_standard_table     │ 470.348 ns │ 1.02 │ 2.126 M/s   │ 1.143x │ 5    │ 466.596 ns │ 482.509 ns │
│ using_awfully_typed_args │ 1.723 µs   │ 1.01 │ 580.439 K/s │ 4.188x │ 8    │ 1.693 µs   │ 1.759 µs   │
└──────────────────────────┴────────────┴──────┴─────────────┴────────┴──────┴────────────┴────────────┘
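
The Factor column is each entry's total time divided by the fastest entry's total time. Because every entry makes the same number of calls, the same ratio falls out of the average times. A small sketch of that arithmetic, using three averages copied from the table; the avg_ns table and factor helper are illustrative names, not part of the benchmark script.

```luau
-- Average times copied from the results table, in nanoseconds.
local avg_ns = {
    using_args_tagged_as_any = 411.397,
    using_standard_args      = 466.897,
    using_awfully_typed_args = 1723, -- shown as 1.723 µs in the table
}

-- The fastest entry is the baseline.
local fastest = math.huge
for _, t in avg_ns do
    fastest = math.min(fastest, t)
end

-- Formats the slowdown relative to the baseline, like the Factor column.
local function factor(name: string): string
    return string.format("%.3fx", avg_ns[name] / fastest)
end

print(factor("using_standard_args"))      -- 1.135x
print(factor("using_awfully_typed_args")) -- 4.188x
```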

The Cursed Realm

By changing just one parameter from number to any, we can magically shift the performance of the following function:

Function Code

local function func(rd: Random, n: number, p: number, r0: number, pr: number, odds: number): number
	if n == 0 or p == 0 then return 0 end
	if p == 1 then return n end

	local u, pu, pd, ru, rd_ = rd:NextNumber() - pr, pr, pr, r0, r0
	if u < 0 then return r0 end

	while true do
		local active = false
		if rd_ >= 1 then
			pd *= rd_ / (odds * (n - rd_ + 1))
			u -= pd
			active = true
			if u < 0 then return rd_ - 1 end
		end
		if rd_ ~= 0 then rd_ -= 1 end
		ru += 1
		if ru <= n then
			pu *= (n - ru + 1) * odds / ru
			u -= pu
			active = true
			if u < 0 then return ru end
		end
		if not active then return 0 end
	end
end

Function Signature Comparisons

Original Function Signature

-- Every numeric parameter of this function is correctly annotated as "number".
-- However, the compiler does not like it very much :(
local function no_magic(rd: Random, n: number, p: number, r0: number, pr: number, odds: number): number

Optimized Function Signature

-- The last parameter ("odds") is now annotated as "any" instead of "number"
-- The compiler is LOVING this change! 😍🥰
local function white_magic(rd: Random, n: number, p: number, r0: number, pr: number, odds: any): number

De-optimized Function Signature

-- Here "pr" is annotated as "any"; every other numeric parameter stays "number".
-- The compiler absolutely HATES this parameter list and wants to destroy it!!
local function black_magic(rd: Random, n: number, p: number, r0: number, pr: any, odds: number): number

Performance Comparisons

The following results were obtained on a different PC than the one used above:

┌─────────────────────────┬────────────┬──────┬────────────┬────────┬──────┬────────────┬────────────┐
│ Name                    │ Avg. Time  │ %RSD │ Throughput │ Factor │ Rank │ Best Time  │ Worst Time │
├─────────────────────────┼────────────┼──────┼────────────┼────────┼──────┼────────────┼────────────┤
│ Magically Optimized!    │ 661.954 ns │ 0.67 │ 1.511 M/s  │ 1.000x │ 1    │ 656.346 ns │ 674.783 ns │
│ No Magic :(             │ 721.910 ns │ 0.52 │ 1.385 M/s  │ 1.091x │ 2    │ 714.817 ns │ 728.863 ns │
│ Magically De-Optimized! │ 793.683 ns │ 0.51 │ 1.260 M/s  │ 1.199x │ 3    │ 787.313 ns │ 804.044 ns │
└─────────────────────────┴────────────┴──────┴────────────┴────────┴──────┴────────────┴────────────┘

Remarks

I have no explanation for this. It is possibly a compiler quirk or bug that will eventually get patched.

CONCLUSIONS

Don’t Guess Types

In --!native mode, a wrong hint is worse than no hint. If you are unsure of a type, use any.

Don’t Lie to Silence the Linter

Don’t use a fake type just to get rid of the warnings. It can degrade the runtime speed!
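
One alternative to a fake annotation is to declare what the function really accepts and narrow the value at runtime. The sketch below assumes a hypothetical scale function that receives either a number or a numeric string; neither the function nor its inputs come from the benchmark above. Whether this is faster than any on a given build is something to measure, not assume.

```luau
--!strict

-- Hypothetical example: `scale` is not from the original post.
-- The annotation states what callers really pass (number or string).
local function scale(value: number | string, factor: number): number
    local n: number
    if type(value) == "number" then
        -- Refined to number inside this branch
        n = value
    else
        -- tonumber returns nil for non-numeric strings; fall back to 0
        n = tonumber(value) or 0
    end
    return n * factor
end

print(scale(21, 2))    -- 42
print(scale("1.5", 2)) -- 3
```

The linter stays quiet because the annotation is truthful, and the number branch does its arithmetic on a value that has already been checked.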

Check Types During Refactors

If your code mysteriously slows down after a refactor, check your type definitions. They may have accidentally triggered some de-optimizations.
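
A development-time check at the call boundary is one way to spot a mismatch between an annotation and the value actually passed. This is a minimal sketch; expect_type is a hypothetical helper, not a Luau or Roblox API, and the check adds its own cost, so it probably belongs in tests rather than hot loops.

```luau
-- Hypothetical debug helper (not part of Luau or Roblox): raises an error
-- when the runtime type of `value` differs from what the annotation claims.
local function expect_type(value: any, expected: string, label: string)
    local actual = type(value)
    if actual ~= expected then
        error(string.format("%s: expected %s, got %s", label, expected, actual), 2)
    end
end

-- Same mistake as bad_func: the hint would say buffer, the value is a number.
local ok, message = pcall(expect_type, 123, "buffer", "num")
print(ok, message)

-- A matching value passes silently.
expect_type(123, "number", "num")
```

Note that type() reports "userdata" for Roblox types such as Random, so typeof() is the better choice when checking those.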

Performance Hacks

For extremely compute-heavy tasks, try experimenting with different type hints. In some scenarios, the compiler seems to prefer table shapes and types inferred using typeof() or any over manually written ones.
Please don’t get too paranoid about this micro-optimization though.

Very interesting find, thanks for sharing this.
