Is it faster to use the bit32 library instead of arithmetic operations?

I understand that bitwise operations, especially in compiled languages such as C and C++,
can be used as a form of optimization for some expressions, such as the following:

local a : number = 82384 / 64

can also be computed (rounded down) using:

local a : number = bit32.rshift(82384, 6)

Or in a language like C:

int a = 82384 >> 6;
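
(A right shift by n is equivalent to a floor division by 2^n, so >> 6 matches / 64 for integers, with the fractional part truncated. A quick Luau check of that equivalence, for illustration:)

for _, x in ipairs({ 82384, 12345, 7 }) do
	print(bit32.rshift(x, 6) == math.floor(x / 64)) --> true for non-negative 32-bit integers
end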

For those who are not aware: bitwise operations map closely onto the CPU’s hardware.
Depending on factors like the compiler you use and the CPU architecture,
some bitwise operations can be performed in a single CPU cycle, while some
arithmetic operations (integer division in particular) can take dozens of cycles,
which can be a significant optimization for some programs.

My thought process: Roblox uses a fork of Lua (Luau), which is an interpreted language with its own virtual machine that runs on your hardware and “interprets” a script, so it may not have the same benefits you receive from a compiled language. However, the Roblox engine is written in C++, and the bit32 library itself is implemented in C++, so using that library may be able to take advantage of the optimization if it doesn’t run through the Lua VM.

What have I tried?: I’ve tried optimizing expressions with bitwise operators in the past in several compiled languages that support them, and found it to be a valid optimization. I’ve also looked in several places, such as the Luau documentation (Library - Luau) and the Roblox API reference (bit32 | Roblox Creator Documentation),
although both only provide usage-related information.


It’s easy to test. If you do, it’d be interesting to see your results.


Lua 5.4 has the same bitwise operators as C/C++ (they were introduced in Lua 5.3). Lua 5.4’s bitwise operators should be quite fast, since Lua is written in C, so they’re basically just a port of C’s operators to Lua (I think). I’ll do a comparison between C, C++, C#, Lua 5.4, and bit32 to see which one is faster. I’ll edit this reply when I get the results.

The test will be 1 << 2, which should be 4.

Edit: After looking through Stack Overflow for a while, I finally got C and C++'s results. They were the same, and I have no clue what they mean.

Here are the results:

C: 40 (It just says 40; I don’t know anything about time.h, so if you do, please say what it means exactly)

C++: 40 (Same as C)

C#: 0.188s (Done on dotnetfiddle.net, probably really dumb, but it worked)

Lua 5.4: 0.0010000000000048 (Using my benchmarking module)

Luau w/ bit32: 0.0004785000346601

Source code:

C / C++:

#include <stdio.h>
#include <time.h>

int main() {
    // Record the start time in seconds
    double startTime = (double)clock() / CLOCKS_PER_SEC;

    printf("%d\n", 1 << 2);

    // Note: the original code printed the time with %d, which is undefined
    // behavior for a floating-point argument; %f is required here.
    printf("%f\n", (double)clock() / CLOCKS_PER_SEC - startTime);
    return 0;
}

C#:

using System;

namespace BitThing {
	public static class Program {
		public static void Main(string[] args) {
			Console.WriteLine(1 << 2);
		}
	}
}

Lua 5.4:

bm = time.benchmark() --// This just returns os.clock()

print(1 << 2)

time.mark(bm) --// prints out os.clock() - the benchmark
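
(For anyone without that module: since time.benchmark and time.mark just wrap os.clock per the comments above, a self-contained equivalent would be:)

local bm = os.clock()

print(1 << 2)

print(os.clock() - bm)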

Luau w/ bit32:

local bm = os.clock()

print(bit32.lshift(1, 2))

print(os.clock() - bm)

The results are just quick tests; they are not scientific. Here are some things that would make this test better:

  1. For C / C++, actually knowing time.h before using it.
  2. For C#, not using dotnetfiddle.net, and using VS.
  3. For Lua 5.4, not using a NuGet package to interpret the code and not using a module.
  4. For Luau w/ bit32, instead of using Luau, use normal Lua with the bit32 library added.

Once again, this test was made out of my own interest, not for anything set in stone. I’ll let someone else make a better test, since I don’t have the resources or the knowledge to do so.

EDIT: In my tests, C# seemed super slow; this was because I was using .NET 4.7.2 instead of Roslyn 4.0 or .NET 6.

With Roslyn 4.0 you get 0.045s, and on .NET 6 you get 0.012s!

I’m going to try to replicate the 0.045 second result that Roslyn 4.0 got with the Roslyn 4.0 csc.exe binary that VS has.

EDIT 2:
I tried to replicate it, but when I benchmarked my code, it either said “0”, or a negative number, which kind of sucks.


You can’t notice the difference. If you’re looking to optimize something, you should look at lowering the time complexity, not the raw computation speed.


Usually this would be correct, but slight optimizations can still be important, especially when a program runs on a per-frame basis and its time complexity is already low.
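
(For illustration, a hypothetical per-frame workload, not taken from any real game, where per-operation cost accumulates against the frame budget:)

local RunService = game:GetService("RunService")

-- Hypothetical workload: thousands of operations every Heartbeat.
-- At this scale each operation's cost adds up against the ~16 ms
-- frame budget at 60 FPS, so per-operation savings can matter.
RunService.Heartbeat:Connect(function()
	local sum = 0
	for i = 1, 5000 do
		sum += bit32.rshift(i, 5) -- or i / 32; any per-call difference accumulates here
	end
end)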

Yeah, well, I’m speaking from practical experience here. I have a program which outputs something every second, and doing those micro-optimizations only increased FPS by 1 or 2. But check this out: when I tried a different algorithm entirely,
it went from 60 to almost 200. It is night and day.


I used arithmetic addition and bit32-based addition side by side. One operation does not show much difference, but many might…


18 is from the arithmetic (9 + 9).
5544 is from bit32 (4123 + 1421).
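
(For context: bit32 has no addition function, so “bit32 addition” here presumably means something like the classic XOR/carry adder. A minimal Luau sketch of that idea, assuming that is what was tested:)

-- Add two non-negative 32-bit integers using only bitwise operations.
-- XOR sums the bits without carrying; AND + left shift computes the carry.
local function bit32Add(a, b)
	while b ~= 0 do
		local carry = bit32.band(a, b)
		a = bit32.bxor(a, b)
		b = bit32.lshift(carry, 1)
	end
	return a
end

print(bit32Add(9, 9)) --> 18
print(bit32Add(4123, 1421)) --> 5544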

Here, I benchmarked it for you:

Running either rshift(x, 5) or x / 32 for 100,000,000 iterations gives:

bit32 w/ lookup: 2251.6ms
bit32 localized: 2444.9ms
division: 1707.6ms

(The second one is just bit32.rshift brought into local scope to avoid the cost of a table lookup. It should be faster; the only reason it wouldn’t be is if Roblox is doing some trickery and directly inlining calls that look like “bit32.rshift”.)


First of all, you should only consider this if you have an insanely tight loop that you absolutely cannot optimize anymore and you are certain that it is a bottleneck in your program. Otherwise you’re probably wasting your time! Note that in the benchmark that’s 100 million operations in about 2 seconds. If you’re not on that scale, don’t worry about it.


In C, it is either the same speed or slower to do a bitshift:

  • If you know the divisor at compile time, your compiler will optimize any power-of-two division into a bitshift automatically.
  • If you don’t know the divisor at compile time, the cost of checking whether it’s a power of two will greatly dwarf the cost of the division itself.

In Lua(u), division is faster, possibly because the bit32 function call has some overhead.

Benchmark code
local function test(name, func)
	local start = os.clock()
	for i = 1, 100000000 do
		func(i)
	end
	local dur = os.clock() - start
	print(string.format("%s: %0.1fms", name, 1000*dur))
end

wait(2)

-- where we know it's a power of two
test("bit32 w/ lookup", function(i) i = bit32.rshift(i, 5) end)

-- localize the function to avoid table lookup cost
local rshift = bit32.rshift
test("bit32 localized", function(i) i = rshift(i, 5) end)

-- just divide normally
test("division", function(i) i = i / 32 end)

Speaking from practical experience, you are misrepresenting micro-optimizations.
Yes, reducing the complexity of an algorithm is a larger improvement, but when that is not enough, once you have reached the lowest complexity, micro-optimizations become a whole lot more important.
They have their place and their use cases, and simply disregarding them entirely is unjustified.