That’s a weird benchmark you showed me, and it’s wrong too for the reason that it does not properly replicate the original algorithm:
1/y*(1.5-(number*.5*y*y))
The original algorithm did not include any division operations for the exact reason it was created in the first place: to avoid it.
I made a benchmark that tries to be more fair:
--!optimize 2
--!strict
--!native
local asd:number = math.random(9,9)
local TEST_AMOUNT:number = 1000000
local evil:buffer = buffer.create(4)
--original implementation
local function Q_rsqrt(number:number) : number
local x2:number = number * 0.5
buffer.writef32(evil,0,number) -- evil floating point bit manipulation
buffer.writei32(evil,0,0x5f3759df - bit32.rshift(buffer.readi32(evil,0),1)) -- what the ####?
local y:number = buffer.readf32(evil,0)
y = y * ( 1.5 - ( x2 * y * y ) ) -- 1st iteration
--y = y * ( 1.5 - ( x2 * y * y ) ) -- 2nd iteration, this can be removed
return y
end
--apparently faster
local function fQ_rsqrt(number:number) : number
buffer.writef32(evil,0,number) -- evil floating point bit manipulation
buffer.writei32(evil,0,0x5f3759df - bit32.rshift(buffer.readi32(evil,0),1)) -- what the ####?
local y:number = buffer.readf32(evil,0)
--buffer.fill(evil,0,0) -- no need to clean the stack simulation, it will be overriden in next iteration anyway
return y * ( 1.5 - ( number * 0.5 * y * y ) )
end
local a:number = 0
local t:number = 0
for i = 1,TEST_AMOUNT do
a = Q_rsqrt(asd)
t += os.clock()
end
print("Q_rsqrt:",os.clock()-t/TEST_AMOUNT,"\nValue:",a)
task.wait() -- break cache to make it more fair
t = 0
a = 0
for i = 1,TEST_AMOUNT do
a = fQ_rsqrt(asd)
t += os.clock()
end
print("fQ_rsqrt:",os.clock()-t/TEST_AMOUNT,"\nValue:",a)
task.wait()
t = 0
a = 0
for i = 1,TEST_AMOUNT do
a = 1/math.sqrt(asd)
t += os.clock()
end
print("Default:",os.clock()-t/TEST_AMOUNT,"\nValue:",a)
It depends from machine to machine, but generally speaking, the native syntax one 1/math.sqrt()
is the fastest, the second being fQ_rsqrt
and the last one being the original implementation. Yes, there are probably cases where if the device is old enough, the fQ_rsqrt
might be able to be on par with syntax implementation. But it is extremely niche and, honestly, not worth it.
So in short: no, the algorithm remains slower than the native syntax.