Hi everyone,
I was working on a neural network library, and the classic matrix multiplication was simply too slow.
So I worked out an algorithm that multiplies F32 matrices 5-10x faster than normal matrix multiplication by using Native Luau SIMD (via vectors).
Code:
Code for Normal:
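-- Naive triple-loop multiply over row-major f32 buffers.
-- Ab is N x K, Bb is K x M, and the result Cb is N x M.
local FLOAT_SIZE = 4 -- bytes per f32
local function matmulNormal(Ab, Bb, N, K, M) -- wrapper name added for illustration
    local Cb = buffer.create(N * M * FLOAT_SIZE)
    for i = 0, N - 1 do
        for j = 0, M - 1 do
            local sum = 0
            for k = 0, K - 1 do
                local a = buffer.readf32(Ab, (i * K + k) * FLOAT_SIZE)
                local b = buffer.readf32(Bb, (k * M + j) * FLOAT_SIZE)
                sum += a * b
            end
            buffer.writef32(Cb, (i * M + j) * FLOAT_SIZE, sum)
        end
    end
    return Cb
end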
Code for Optimized:
(This will be opensourced soon alongside the rest of my library)
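Until the real code is posted, here is a rough sketch of what "SIMD via vectors" could look like. To be clear, this is not the optimized implementation from this post, just one plausible approach. It assumes the same row-major f32 buffers as the normal version, pre-packs B into 3-wide built-in vectors, and lets each vector multiply-add cover three output columns at once. The function name and the packing layout are made up for illustration, and whether it gets anywhere near the numbers below is untested.

```luau
--!native
-- Illustrative sketch only; NOT the author's unreleased implementation.
-- Assumes row-major f32 buffers: A is N x K, B is K x M, C is N x M.
local FLOAT_SIZE = 4 -- bytes per f32

local function matmulVectorSketch(Ab: buffer, Bb: buffer, N: number, K: number, M: number): buffer
	local groups = math.ceil(M / 3) -- built-in vectors carry 3 lanes

	-- Pack each row of B into 3-wide vectors once, zero-padding the tail.
	local Bv = table.create(K * groups)
	for k = 0, K - 1 do
		for g = 0, groups - 1 do
			local j = g * 3
			local base = (k * M + j) * FLOAT_SIZE
			local x = buffer.readf32(Bb, base)
			local y = if j + 1 < M then buffer.readf32(Bb, base + FLOAT_SIZE) else 0
			local z = if j + 2 < M then buffer.readf32(Bb, base + 2 * FLOAT_SIZE) else 0
			Bv[k * groups + g + 1] = vector.create(x, y, z)
		end
	end

	local Cb = buffer.create(N * M * FLOAT_SIZE)
	local acc = table.create(groups, vector.zero)
	for i = 0, N - 1 do
		-- Reset the accumulators for row i of C.
		for g = 1, groups do
			acc[g] = vector.zero
		end
		-- Broadcast A[i][k] across a whole packed row of B.
		for k = 0, K - 1 do
			local a = buffer.readf32(Ab, (i * K + k) * FLOAT_SIZE)
			local row = k * groups
			for g = 1, groups do
				acc[g] += Bv[row + g] * a -- three multiply-adds per vector op
			end
		end
		-- Unpack the accumulators into C, skipping padded lanes.
		for g = 0, groups - 1 do
			local v = acc[g + 1]
			local j = g * 3
			local base = (i * M + j) * FLOAT_SIZE
			buffer.writef32(Cb, base, v.x)
			if j + 1 < M then
				buffer.writef32(Cb, base + FLOAT_SIZE, v.y)
			end
			if j + 2 < M then
				buffer.writef32(Cb, base + 2 * FLOAT_SIZE, v.z)
			end
		end
	end
	return Cb
end
```

One thing to watch: vector lanes are f32, so the running sums here accumulate in single precision, while the normal version accumulates in a regular (double) number and only rounds on the final write. Results can differ in the last few bits for large K.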
Results (1024x1024 Matrix Mult, ~6.4x faster):
=== Benchmark Results (fastest → slowest) ===
> Optimized: 0.717437 s
> Normal: 4.581731 s
Results with 4D Vectors (1024x1024 Matrix Mult, ~7.7x faster than Normal):
NOTE: 4D Luau vectors are only available in a custom Luau build and do not work with Native Luau, so every timing in this run is slower overall.
=== Benchmark Results (fastest → slowest) ===
> 4D Optimized: 3.275472 s
> Optimized: 6.301843 s
> Normal: 25.338067 s
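For anyone who wants to run a comparison like this themselves, here is a minimal timing harness, assuming each variant is wrapped in a zero-argument function. It is a sketch, not the harness that produced the numbers above; `timeIt` and `report` are made-up names, and it does a single run per case with no warm-up or averaging.

```luau
-- Hypothetical helpers for illustration; os.clock() is the built-in timer.
local function timeIt(fn: () -> ()): number
	local start = os.clock()
	fn()
	return os.clock() - start
end

local function report(cases: { { name: string, fn: () -> () } })
	local results = {}
	for _, case in cases do
		table.insert(results, { name = case.name, seconds = timeIt(case.fn) })
	end
	-- Sort ascending so the fastest case is printed first.
	table.sort(results, function(a, b)
		return a.seconds < b.seconds
	end)
	print("=== Benchmark Results (fastest → slowest) ===")
	for _, r in results do
		print(string.format("> %s: %.6f s", r.name, r.seconds))
	end
	return results
end
```

The speedup is then just the slower time divided by the faster one, for example `results[#results].seconds / results[1].seconds`.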