8x faster Matrix Multiplication Algorithm using SIMD

bitsplicer · March 22, 2026, 1:02am

Hi everyone,

I was working on a neural network library, and classic matrix multiplication was simply too slow.

So I worked on an algorithm that multiplies f32 matrices 5-10x faster than the normal triple-loop matrix multiplication by using SIMD in native Luau (via vectors).

Code:

Code for Normal:

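local FLOAT_SIZE = 4 -- bytes per f32

-- Ab is an N x K matrix and Bb is a K x M matrix, both stored row-major as f32 buffers.
local function matmulNormal(Ab, Bb, N, K, M)
    local Cb = buffer.create(N * M * FLOAT_SIZE)

    for i = 0, N - 1 do
        for j = 0, M - 1 do
            local sum = 0

            -- Dot product of row i of A with column j of B
            for k = 0, K - 1 do
                local a = buffer.readf32(Ab, (i * K + k) * FLOAT_SIZE)
                local b = buffer.readf32(Bb, (k * M + j) * FLOAT_SIZE)
                sum += a * b
            end

            buffer.writef32(Cb, (i * M + j) * FLOAT_SIZE, sum)
        end
    end

    return Cb
end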

Code for Optimized:

(This will be open-sourced soon alongside the rest of my library)
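
Until that code is published, here is a minimal sketch of one way Luau vectors could be applied to the loop above. It is not the library's implementation, and the function name `matmulVec3` is made up for illustration. It assumes the same row-major f32 buffer layout as the normal version and the standard 3-wide `vector` type: for each row of A it computes three adjacent output columns at once, with a scalar loop for any leftover columns. Note that vector components are f32, so the accumulation happens in single precision and results can differ slightly from the scalar version on non-integer data.

```luau
local FLOAT_SIZE = 4 -- bytes per f32

-- Hypothetical helper: A is N x K, B is K x M, both row-major f32 buffers.
local function matmulVec3(Ab, Bb, N, K, M)
    local Cb = buffer.create(N * M * FLOAT_SIZE)
    local Mvec = M - M % 3 -- number of columns that can be handled three at a time

    for i = 0, N - 1 do
        -- Vector part: C[i, j..j+2] = sum over k of A[i, k] * B[k, j..j+2]
        for j = 0, Mvec - 1, 3 do
            local acc = vector.create(0, 0, 0)

            for k = 0, K - 1 do
                local a = buffer.readf32(Ab, (i * K + k) * FLOAT_SIZE)
                local off = (k * M + j) * FLOAT_SIZE
                local b = vector.create(
                    buffer.readf32(Bb, off),
                    buffer.readf32(Bb, off + FLOAT_SIZE),
                    buffer.readf32(Bb, off + 2 * FLOAT_SIZE)
                )
                acc += b * a -- one vector multiply-add covers three output columns
            end

            local out = (i * M + j) * FLOAT_SIZE
            buffer.writef32(Cb, out, acc.x)
            buffer.writef32(Cb, out + FLOAT_SIZE, acc.y)
            buffer.writef32(Cb, out + 2 * FLOAT_SIZE, acc.z)
        end

        -- Scalar tail: columns left over when M is not a multiple of 3
        for j = Mvec, M - 1 do
            local sum = 0
            for k = 0, K - 1 do
                local a = buffer.readf32(Ab, (i * K + k) * FLOAT_SIZE)
                local b = buffer.readf32(Bb, (k * M + j) * FLOAT_SIZE)
                sum += a * b
            end
            buffer.writef32(Cb, (i * M + j) * FLOAT_SIZE, sum)
        end
    end

    return Cb
end
```

Walking along a row of B in the inner loop also keeps the three reads contiguous in memory. Whether this sketch gets anywhere near the numbers below will depend on native code generation being enabled and on the matrix sizes, so treat it as a starting point to benchmark rather than a tuned kernel.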

Results (1024x1024 Matrix Mult, ~6.4x faster):

=== Benchmark Results (fastest → slowest) ===
>       Optimized: 0.717437 s
>          Normal: 4.581731 s

Results 4D Vectors (1024x1024 Matrix Mult, ~8x faster):

NOTE: 4D Luau vectors are only available in a custom Luau build and do not support native code generation, so all timings in this run are slower overall.

=== Benchmark Results (fastest → slowest) ===
>    4D Optimized: 3.275472 s
>       Optimized: 6.301843 s
>          Normal: 25.338067 s

hello2e6 · July 2, 2026, 9:00pm

Is your library open-sourced yet? I'm kind of desperate for this solution!

hello2e6 · July 2, 2026, 9:01pm

My forward pass for my 55M param neural network is taking 40+ seconds. I think I’m also going to try vectors.

hello2e6 · July 2, 2026, 10:42pm

Also, buffers are way slower to read than tables.
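
For anyone who wants to check this on their own setup, a rough micro-benchmark along the following lines compares sequential reads from a table and from an f32 buffer. It is only a sketch: the element count and the helper names (`sumTable`, `sumBuffer`, `timeIt`) are arbitrary choices for illustration, and the outcome will likely depend on whether native code generation is enabled.

```luau
local COUNT = 1_000_000
local FLOAT_SIZE = 4 -- bytes per f32

-- Fill a table and a buffer with the same small integer values
-- (exactly representable as f32, so both sums must match).
local tbl = table.create(COUNT, 0)
local buf = buffer.create(COUNT * FLOAT_SIZE)
for i = 1, COUNT do
    local v = i % 7
    tbl[i] = v
    buffer.writef32(buf, (i - 1) * FLOAT_SIZE, v)
end

local function sumTable(t)
    local s = 0
    for i = 1, #t do
        s += t[i]
    end
    return s
end

local function sumBuffer(b, n)
    local s = 0
    for i = 0, n - 1 do
        s += buffer.readf32(b, i * FLOAT_SIZE)
    end
    return s
end

-- Returns elapsed seconds followed by the function's result.
local function timeIt(fn, ...)
    local t0 = os.clock()
    local result = fn(...)
    return os.clock() - t0, result
end

local tTable, sTable = timeIt(sumTable, tbl)
local tBuffer, sBuffer = timeIt(sumBuffer, buf, COUNT)

print(string.format("table:  %.6f s", tTable))
print(string.format("buffer: %.6f s", tBuffer))
```

A single pass like this is noisy, so repeating each measurement several times and taking the fastest run gives a fairer comparison.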