Q_rsqrt - Fast inverse square root ported to luau

This is a proof-of-concept code that shows how to make C-like bitwise manipulations using only native luau features. Despite the name, it is actually slower and less precise than if you were to use 1/math.sqrt(). So don’t use it seriously.

--!strict
--!native

local evil = buffer.create(4)

local function Q_rsqrt(number:number) : number
	local x2:number = number * 0.5
	buffer.writef32(evil,0,number)	-- evil floating point bit manipulation
	buffer.writei32(evil,0,0x5f3759df - bit32.rshift(buffer.readi32(evil,0),1)) -- what the ####?
	local y:number  = buffer.readf32(evil,0)
	y  = y * ( 1.5 - ( x2 * y * y ) ) -- 1st iteration
	--y  = y * ( 1.5 - ( x2 * y * y ) )	-- 2nd iteration, this can be removed
	return y
end

This also retains the original comments. I would like to hear what you think of it.

5 Likes

This is actual way faster

--[[
local evil = buffer.create(4)
buffer.writef32(evil,0,#)
]]

local nan:number = 0/0
local evil:buffer = buffer.create(4)
local function Q_rsqrt(number:number) : number
	if number<0 then
		return nan
	end
	buffer.writef32(evil,0,number)	-- evil floating point bit manipulation
	buffer.writei32(evil,0,0x5f3759df - bit32.rshift(buffer.readi32(evil,0),1)) -- what the ####?
	local y:number  = buffer.readf32(evil,0)
	buffer.fill(evil,0,0)
	return 1/y*(1.5-(number*.5*y*y))
end
local ti:number=0
local start:number
local n:number
for i:number=0,1000 do
	start=os.clock()
	n=math.sqrt(i)
	ti+=os.clock()-start
end
warn(ti/1000)
ti=0
for i:number=0,1000 do
	start=os.clock()
	n=Q_rsqrt(i)
	ti+=os.clock()-start
end
warn(ti/1000)

benchmark on average
math.sqrt(i) >>> 3.5234875
Q_rsqrt(i) >>> 2.0055625

its just that it shoots up and down in a range of [0-3], with out buffer.fill(evil,0,0), the range instead becomes [0~9.5]

That’s a weird benchmark you showed me, and it’s wrong too for the reason that it does not properly replicate the original algorithm:

1/y*(1.5-(number*.5*y*y))

The original algorithm did not include any division operations for the exact reason it was created in the first place: to avoid it.

I made a benchmark that tries to be more fair:

--!optimize 2
--!strict
--!native

local asd:number = math.random(9,9)
local TEST_AMOUNT:number = 1000000

local evil:buffer = buffer.create(4)

--original implementation
local function Q_rsqrt(number:number) : number
	local x2:number = number * 0.5
	buffer.writef32(evil,0,number)	-- evil floating point bit manipulation
	buffer.writei32(evil,0,0x5f3759df - bit32.rshift(buffer.readi32(evil,0),1)) -- what the ####?
	local y:number  = buffer.readf32(evil,0)
	y  = y * ( 1.5 - ( x2 * y * y ) ) -- 1st iteration
	--y  = y * ( 1.5 - ( x2 * y * y ) )	-- 2nd iteration, this can be removed
	return y
end

--apparently faster
local function fQ_rsqrt(number:number) : number
	buffer.writef32(evil,0,number)	-- evil floating point bit manipulation
	buffer.writei32(evil,0,0x5f3759df - bit32.rshift(buffer.readi32(evil,0),1)) -- what the ####?
	local y:number  = buffer.readf32(evil,0)
	--buffer.fill(evil,0,0) -- no need to clean the stack simulation, it will be overriden in next iteration anyway
	return y * ( 1.5 - ( number * 0.5 * y * y ) )
end

local a:number = 0
local t:number = 0

for i = 1,TEST_AMOUNT do
	a = Q_rsqrt(asd)
	t += os.clock()
end

print("Q_rsqrt:",os.clock()-t/TEST_AMOUNT,"\nValue:",a)

task.wait() -- break cache to make it more fair

t = 0
a = 0

for i = 1,TEST_AMOUNT do
	a = fQ_rsqrt(asd)
	t += os.clock()
end

print("fQ_rsqrt:",os.clock()-t/TEST_AMOUNT,"\nValue:",a)

task.wait()

t = 0
a = 0

for i = 1,TEST_AMOUNT do
	a = 1/math.sqrt(asd)
	t += os.clock()
end

print("Default:",os.clock()-t/TEST_AMOUNT,"\nValue:",a)

It depends from machine to machine, but generally speaking, the native syntax one 1/math.sqrt() is the fastest, the second being fQ_rsqrt and the last one being the original implementation. Yes, there are probably cases where if the device is old enough, the fQ_rsqrt might be able to be on par with syntax implementation. But it is extremely niche and, honestly, not worth it.

So in short: no, the algorithm remains slower than the native syntax.

1 Like