Matrix Multiplication

The following Julia programme implements the FastShift algorithm for fast multiplication of integer matrices using bit operations and recursive subdivision. Comments in the source code explain the important parts of the implementation.
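
To isolate the core idea before reading the full programme, here is a minimal scalar sketch (an illustration with freely chosen values, not part of the programme): the two products \(ab_1\) and \(ab_2\) are obtained from a single multiplication by packing \(b_1\) and \(b_2\) into one integer and afterwards splitting the result with a shift and a balanced remainder.

# Scalar sketch of the packing trick behind FastShift (illustrative values only)
a, b1, b2 = big(123), big(-456), big(789)
s  = 512                         # shift width; needs |a*b2| < 2^(s-1)
p  = a * ((b1 << s) + b2)        # one multiplication yields both products, packed
t  = big(1) << s                 # modulus 2^s
v  = big(1) << (s - 1)           # half modulus, bound of the balanced remainder
lo = mod(p, t)                   # raw lower half in [0, 2^s)
lo >= v && (lo -= t)             # map into the balanced range [-2^(s-1), 2^(s-1))
hi = (p - lo) >> s               # exact upper half after removing the lower part
@assert hi == a * b1 && lo == a * b2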

# 🔠 FastShift algorithm in Julia; authors: Boris Haase & Math AI 🧠, April 17, 2025
using LinearAlgebra, BenchmarkTools, Printf, Random, Base.Threads
const D = 256                                        # Ceiling value, less than or equal to G
const E = 8                                          # Exponent, chosen so that G = 2^E >= D
const R = 64                                         # Threshold for switching to multithreading
const G = 1  << E                                    # Desired upper bit limit
const I = BigInt(1)                                  # One  as a BigInt
const L = (I << D) - I                               # Upper/lower limit for random values
const T = AbstractMatrix{BigInt}                     # Matrix type for large integers
const S = G  << 1                                    # Initial shift width in bits
const O = BigInt(0)                                  # Zero as a BigInt

# 🔧 Helper functions for modular division and matrix multiplication
remR(x, t, u, v) =  x < O ? -adjR(-x & u, t, v) : adjR(x & u, t, v)
adjR(x, t, v)    =  x < v ? x : x - t
divR(x, v, s)    = (x + v) >> s
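# Example: with s = 4 (hence t = 16, u = 15, v = 8) the packed value x = 43 = 3*16 - 5
# splits into divR(43, 8, 4) == 3 and remR(43, 16, 15, 8) == -5: the helpers recover
# the quotient and the balanced remainder in [-v, v) of a packed value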

function simd_mul(A::T, B::T, n::Int)::T             # Conventional SIMD matrix multiplication
    C = zeros(BigInt, n, n)
    @inbounds for i in 1:n                           # No bounds checking for indices
        for j in 1:n
            z = O
            @simd ivdep for k in 1:n                 # No interdependencies in this loop
                z += A[i, k] * B[k, j]
            end
            C[i, j] = z
        end
    end
    return C                                         # Return the matrix product
end

# 🔡 FastShift recursive multiplication for matrices of large integers
function fastshift(A::T, B::T, n::Int, s::Int)::T
    n <= R && return simd_mul(A, B, n)               # Use standard multiplication for small n

    m = n >> 1                                       # Split matrices into four blocks
    M =    1:m                                       # Index range of the first  half (top/left)
    N =  m+1:n                                       # Index range of the second half (bottom/right)
    v = I << (s - 1)                                 # Reference value for nearest neighbour
    t = I << s                                       # Modulus for division with remainder
    r = s << 1                                       # Shift width for the next recursion level
    u = t -  I                                       # Bitmask to extract the lower half

    R1 = (B[M, M] .<< s) .+ B[M, N]                  # Top    row of combined matrix
    R2 = (B[N, M] .<< s) .+ B[N, N]                  # Bottom row of combined matrix

    if n > R                                         # Use threads (always true here after the base case)
        t1 =  @spawn fastshift(A[M, M], R1, m, r)
        t2 =  @spawn fastshift(A[M, N], R2, m, r)
        t3 =  @spawn fastshift(A[N, M], R1, m, r)
        S4 =         fastshift(A[N, N], R2, m, r)    # Matrix on main thread
        S1 =  fetch(t1)                              # Retrieve results from other threads
        S2 =  fetch(t2)
        S3 =  fetch(t3)
    else
        S1 =         fastshift(A[M, M], R1, m, r)    # No threading for small matrices
        S2 =         fastshift(A[M, N], R2, m, r)
        S3 =         fastshift(A[N, M], R1, m, r)
        S4 =         fastshift(A[N, N], R2, m, r)
    end

    C1 = divR.(S1, v, s   ) .+ divR.(S2, v, s   )    # High halves (divR): left  column blocks of C
    C2 = remR.(S1, t, u, v) .+ remR.(S2, t, u, v)    # Low  halves (remR): right column blocks of C
    C3 = divR.(S3, v, s   ) .+ divR.(S4, v, s   )
    C4 = remR.(S3, t, u, v) .+ remR.(S4, t, u, v)

    return [C1 C2; C3 C4]                            # Return the full matrix product
end

# 🕔 Benchmark for C = A * B with correctness check against Julia's standard implementation
function main()
    @assert D <= G "Ceiling value must not exceed bit limit G"
    println("⏱️ Benchmark for the FastShift algorithm – m from 1 to 10")
    println("═════════════════════════════════════════════════════════")
    println("🕸 Threads: $(nthreads())")             # Expected to run with four threads

    for m in 1:10
        n  = 1 << m                                  # Dimension as a power of two
        A  = rand(-L:L, n, n)                        # Left  random matrix with large integers
        B  = rand(-L:L, n, n)                        # Right random matrix with large integers

        println("\n❓ Test: 2^$m × 2^$m → $n × $n")  # Display test info
        t_fasts = @elapsed C_fasts = fastshift(A, B, n, S + m)
        @printf("🔢 Runtime for FastShift: %10.4f seconds\n", t_fasts)

        t_julia = @elapsed C_julia = A * B           # Measure Julia's standard multiplication
        @printf("❕ Runtime for Julia A*B: %10.4f seconds\n", t_julia)
        println("💯 Equality of results:")           # Check if results match
        println(C_fasts == C_julia ? "✅ FastShift ≙ Julia" : "❌ FastShift ≠ Julia")
    end
end

# ▶️ Start execution
main()
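
Since fastshift spawns tasks, Julia should be started with several threads for the benchmark to take the threaded path; assuming the programme is saved as fastshift.jl (a file name chosen here for illustration):

julia --threads=4 fastshift.jl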

Runtime comparison for \(n = 2^m\) with \(64\)-bit and \(256\)-bit entries: \(T_j\) (Julia) and \(T_f\) (FastShift) in seconds, together with the ratio \(T_j/T_f\)
\(m\) | \(T_j^{64}\) | \(T_f^{64}\) | \(T_j^{64}/T_f^{64}\) | \(T_j^{256}\) | \(T_f^{256}\) | \(T_j^{256}/T_f^{256}\)
5 | 0.0069 | 0.0066 | 1.0455 | 0.0068 | 0.0065 | 1.0462
6 | 0.0573 | 0.0557 | 1.0287 | 0.0988 | 0.0951 | 1.0389
7 | 0.7557 | 0.0805 | 9.3882 | 0.8866 | 0.1026 | 8.6414
8 | 6.8593 | 1.0416 | 6.5865 | 8.4964 | 1.3699 | 6.2029
9 | 55.2878 | 4.5610 | 12.1228 | 66.0994 | 13.0518 | 5.0654
10 | 496.9587 | 39.9252 | 12.4443 | 592.4100 | 76.8205 | 7.7094

Modelling the Runtime Behaviour

The runtime can be reduced further by suitable pointer arithmetic. Modelling the nonlinear function \(T_n \sim n^a\) with Mathematica and comparing the fits \(T_n \sim n^{\hat{a}}\) and \(T_n \sim n^{\hat{a}} + b\) yields the following results for 64 bit: \(T_j^{64} = \mathcal{O}(n^{2.32})\) and \(T_f^{64} = \mathcal{O}(n^{1.96})\). Analogously, 256 bit yields \(T_j^{256} = \mathcal{O}(n^{2.35})\) and \(T_f^{256} = \mathcal{O}(n^{2.05})\).
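
How such an exponent can be estimated is illustrated by the following Julia sketch (the fits above were done in Mathematica; this plain log-log fit omits the offset \(b\) and uses the \(T_f^{64}\) column of the table, so it will not reproduce the quoted exponents exactly):

# Least-squares fit of log T = a*log n + c on the T_f^64 timings from the table
using LinearAlgebra
ns = 2.0 .^ (5:10)                               # dimensions n = 2^m for m = 5..10
Ts = [0.0066, 0.0557, 0.0805, 1.0416, 4.5610, 39.9252]
A  = [log.(ns) ones(length(ns))]                 # design matrix of the log-log model
a, c = A \ log.(Ts)                              # slope a estimates the exponent
println("fitted exponent a ≈ ", round(a, digits = 2))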

© 2025 by Boris Haase
