In the world of artificial intelligence, a quiet revolution is taking place. For more than a decade, the presumed fundamental building block of neural networks has been matrix multiplication (or “matmul” in industry parlance) – the mathematical operation that powers everything from language models like ChatGPT to computer vision systems analyzing medical images.
But what if we told you that matrix multiplication, the cornerstone of current AI, is actually a significant bottleneck for efficiency? What if the future of AI doesn’t require this operation at all?
New Horizons in AI Efficiency
The research community has been converging on this potentially radical idea from multiple directions. In October 2023, Microsoft Research introduced BitNet, demonstrating that large language models could function with weights quantized to just 1 bit. They followed this in February 2024 with “The Era of 1-bit LLMs,” showing that 1.58 bits per weight was sufficient for state-of-the-art performance – a finding that challenged conventional wisdom about the precision requirements of AI models.
In a separate line of work, researchers from the University of California, Santa Cruz published a groundbreaking paper in June 2024 titled “Scalable MatMul-free Language Modeling.” This research took the efficiency quest even further, demonstrating that large language models can be built without any matrix multiplication operations while maintaining strong performance at billion-parameter scales.
Parallel to these weight-focused innovations, other researchers have been tackling the quadratic complexity challenges of transformer architectures. The Mamba architecture, introduced by Albert Gu and colleagues in late 2023, employs state space models (SSMs) that process sequences with linear rather than quadratic scaling, enabling efficient handling of extremely long contexts. Similarly, extended LSTMs (xLSTMs) have emerged as powerful hybrid approaches that combine the parameter efficiency of recurrent networks with mechanisms inspired by transformers. Other notable sub-quadratic alternatives include Linear Attention variants like Performer and MEGA (Moving Average Equipped Gated Attention), as well as structured state space models such as S4 and S5.
These parallel research vectors all point toward the same conclusion: the future of AI lies in dramatically more efficient architectures that challenge our fundamental assumptions about how neural networks must operate. Transformers, in their current form, are reaching their practical limits.
At SpeakEZ, we’ve been following this convergence of research and ensuring that our Fidelity Framework is fully adaptable to these options in machine learning, enabling AI systems that are dramatically more efficient while maintaining – and in some cases improving – model throughput. In several critical respects, our framework goes beyond the approach taken by the UC Santa Cruz researchers. This document can be read as a blueprint: our previous work on the Fidelity Framework, our current tracks that support sub-quadratic model building, and the commercial opportunities that follow.
What is Matrix Multiplication, and Why Should Business Leaders Care?
For those without a deep background in AI engineering, matrix multiplication is the primary computational operation in neural networks. Networks perform these calculations billions of times during both training and inference. In simple terms, the operation takes two grids of numbers (matrices) and produces a new grid through a long series of multiplications and additions.
This operation accounts for approximately 90% of the computational cost in modern AI systems. It’s why AI models require specialized hardware like GPUs, consume enormous amounts of power, and have been steadily increasing in cost as models grow larger. Every time you hear about AI infrastructure costs or the energy consumption of data centers, matrix multiplication is the hidden culprit.
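To make that cost concrete, here is a minimal F# sketch of a naive matrix multiply (purely illustrative, not part of the Fidelity Framework). Every output cell requires a full row of multiply-adds, and a modern model computes billions of such cells per forward pass:

// Naive matrix multiply: for an m x k matrix times a k x n matrix,
// every output cell costs k multiplications and k additions,
// so total work grows as m * n * k.
let matmul (a: float[,]) (b: float[,]) =
    let m, k, n = a.GetLength(0), a.GetLength(1), b.GetLength(1)
    let c = Array2D.zeroCreate<float> m n
    for i in 0 .. m - 1 do
        for j in 0 .. n - 1 do
            let mutable acc = 0.0
            for p in 0 .. k - 1 do
                acc <- acc + a.[i, p] * b.[p, j] // one multiply-add per inner step
            c.[i, j] <- acc
    c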
What is Sub-Quadratic Scaling, and Why Does it Matter for Your Bottom Line?
Alongside the computational intensity of matrix multiplication, there’s another critical bottleneck limiting AI capabilities: the quadratic scaling problem. In traditional transformer models, each word or token must “pay attention” to every other token when processing text. This creates a fundamental scaling issue – if your text doubles in length, the computational requirements quadruple.
This quadratic scaling (O(n²) in technical terms) has dramatic business implications:
Context Length Limitations: Most current AI systems can only process 8K-32K tokens at once (roughly 6-24 pages of text) despite being trained on vast datasets. Handling entire books, codebases, or lengthy documents requires expensive workarounds or isn’t possible at all.
Prohibitive Costs at Scale: Processing longer inputs causes quadratic increases in memory usage and computation time. For example, doubling context length from 32K to 64K tokens means 4× more computation, not just 2× (see the short illustration after this list).
Inference Bottlenecks: Real-time applications like chatbots face significant degradation when handling long conversations, as each interaction becomes progressively more expensive to process.
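The arithmetic behind these bottlenecks is easy to sketch. Assuming a cost model where full attention grows with the square of the token count while sub-quadratic methods grow roughly linearly, the F# snippet below (illustrative only) reproduces the 4× versus 2× comparison:

// Illustrative cost models: full attention is O(n^2), sub-quadratic is ~O(n)
let quadraticCost (tokens: float) = tokens * tokens
let linearCost (tokens: float) = tokens

// Relative growth in cost when moving from a shorter to a longer context
let growth cost (fromTokens: float) (toTokens: float) = cost toTokens / cost fromTokens

let attentionGrowth = growth quadraticCost 32768.0 65536.0    // 4.0: cost quadruples when context doubles
let subQuadraticGrowth = growth linearCost 32768.0 65536.0    // 2.0: cost merely doubles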
Sub-quadratic approaches address this fundamental limitation by changing how AI models process sequences, making computational requirements grow linearly or better with input length. This creates tangible business benefits:
True Long-Context Understanding: Process entire documents, books, codebases, or long conversations without prohibitive costs, enabling applications that were previously economically unfeasible.
Predictable Scaling Economics: When costs scale linearly with input length, budget planning becomes more reliable and long-term deployments more sustainable.
Unlocked Use Cases: Applications requiring continuous operation with growing context (long-running assistants, document analysis systems, code reviewers) become commercially viable.
For executives, the difference between quadratic and linear scaling isn’t just a technical detail—it’s often the dividing line between applications that are commercially viable and those that aren’t.
The Breakthrough: AI Without Matrix Multiplication
The UC Santa Cruz research demonstrated that by using ternary weights (limited to values of -1, 0, or +1) and replacing complex matrix operations with simple addition and subtraction, models can achieve performance on par with state-of-the-art Transformers while using far less memory and computational resources.
What makes this approach revolutionary is how it fundamentally changes the hardware requirements for AI:
| Traditional AI Models | MatMul-Free Models |
| --- | --- |
| Rely on matrix multiplication | Use only addition, subtraction, and element-wise operations |
| Require specialized GPU hardware | Can run efficiently on simpler hardware |
| Consume substantial power | Operate with dramatically lower energy requirements |
| Limited by memory bandwidth | Reduced memory requirements by up to 61% during training, 10× during inference |
Similarly, sub-quadratic models offer their own revolutionary advantages:
| Traditional Transformers | Sub-Quadratic Models |
| --- | --- |
| Scale as O(n²) with sequence length | Scale linearly or better with sequence length |
| Context limited by computational resources | Can process extremely long contexts efficiently |
| Require significant memory for attention matrices | Use compressed or structured representations |
| Difficult to deploy for long-context applications | Enable practical deployment for book-length or streaming contexts |
The MatMul-free paper demonstrated these models running on FPGA hardware at just 13 watts of power while processing billion-parameter scale models – a ratio that hints at approaching brain-like efficiency. Even more promising, the researchers found that the performance gap between these MatMul-free models and traditional Transformers actually narrows as model size increases, suggesting this approach becomes more advantageous at scale.
SpeakEZ AI’s Fidelity Framework: Optionality for a Post-Transformer Future
Both MatMul-free and sub-quadratic approaches align well with the architecture of our Fidelity Framework, which was designed from the ground up to support flexible, efficient computation graphs with equal deployment reach across diverse hardware targets. Here’s how our technologies directly enable and enhance these new approaches:
1. F# Type Safety for Neural Representations
The research paper highlights the challenges of implementing ternary weights and element-wise operations efficiently. For Python developers accustomed to frameworks like PyTorch and TensorFlow, the transition to this approach is surprisingly intuitive. F# offers familiar indentation-based scoping (inspired partly by Python’s design philosophy) with a clean syntax that feels natural to Python developers:
# Python tensor operation with shape validation at runtime
import torch

def apply_model(x, weights):
    # Shape checking happens during execution
    result = torch.matmul(x, weights)  # May raise RuntimeError for shape mismatch
    return torch.relu(result)
// F# equivalent with compile-time shape validation
let applyModel (x: Tensor<'Batch, 'In>) (weights: Tensor<'In, 'Out>) =
    // Shape checking happens before execution
    let result = x * weights // Will not compile if shapes don't match
    Tensor.relu result
The idea that Python developers may find ‘radical’ at first is that F#, at its foundation, provides compile-time dimensional verification, ensuring operations maintain shape consistency without any runtime overhead. While F# draws from OCaml’s functional heritage, it embraces Python-like readability with the added power of a sophisticated type system:
// Type-safe ternary weight matrix with dimensionality checking
type TernaryMatrix<[<Measure>] 'Rows, [<Measure>] 'Cols> = {
    Values: sbyte[,]   // -1, 0, 1 values
    ScaleFactor: float // Learned scaling factor
    Rows: int<'Rows>
    Cols: int<'Cols>
}

// Using the type-safe matrix in operations
let multiplyVector (input: Vector<float, 'InDim>) (weights: TernaryMatrix<'InDim, 'OutDim>) =
    // This function will not compile if dimensions don't match
    // No shape assertions or runtime checks needed
    let result = Vector.zero<float, 'OutDim>()
    // Simple, readable loop syntax (similar to Python)
    for i in 0..dimensions<'OutDim>-1 do
        for j in 0..dimensions<'InDim>-1 do
            match weights.Values.[j,i] with
            | 1y -> result.[i] <- result.[i] + input.[j] * weights.ScaleFactor
            | -1y -> result.[i] <- result.[i] - input.[j] * weights.ScaleFactor
            | _ -> () // No-op for zero weights
    result
Similarly, our F# type system provides elegant expressions for sub-quadratic algorithms like linear attention and state space models:
// Type-safe linear attention implementation with O(n) complexity
type LinearAttention<[<Measure>] 'SeqLen, [<Measure>] 'Dim> = {
    Queries: Tensor<'SeqLen, 'Dim>
    Keys: Tensor<'SeqLen, 'Dim>
    Values: Tensor<'SeqLen, 'Dim>
    Kernel: ('Dim -> 'Dim -> float) // Feature map for linear complexity
}

// Linear attention forward pass - O(n) complexity instead of O(n²)
let linearAttentionForward (attn: LinearAttention<'SeqLen, 'Dim>) =
    // Compute the kernel feature mapping (e.g., ELU(x) + 1)
    let kernelMap x =
        let mapped = Vector.zero<float, 'Dim>()
        for i in 0..dimensions<'Dim>-1 do
            mapped.[i] <- if x.[i] > 0.0 then x.[i] + 1.0 else exp(x.[i])
        mapped
    // Apply kernel mapping to queries and keys - still O(n) operations
    let mappedQueries = Tensor.map kernelMap attn.Queries
    let mappedKeys = Tensor.map kernelMap attn.Keys
    // Compute KV matrix (d×d) instead of attention matrix (n×n)
    let kv = Tensor.zero<float, 'Dim, 'Dim>()
    for i in 0..dimensions<'SeqLen>-1 do
        for j in 0..dimensions<'Dim>-1 do
            for k in 0..dimensions<'Dim>-1 do
                kv.[j,k] <- kv.[j,k] + mappedKeys.[i,j] * attn.Values.[i,k]
    // Compute output with Q(KV) instead of (QK)V - O(n) vs O(n²)
    let output = Tensor.zero<float, 'SeqLen, 'Dim>()
    for i in 0..dimensions<'SeqLen>-1 do
        for j in 0..dimensions<'Dim>-1 do
            for k in 0..dimensions<'Dim>-1 do
                output.[i,j] <- output.[i,j] + mappedQueries.[i,k] * kv.[k,j]
    // Normalize
    let normalization = Tensor.zero<float, 'SeqLen, 'Dim>()
    // ... (normalization logic)
    output
This prevents dimensional errors at compile time rather than during expensive training runs—a critical advantage when working with complex architectures. Python developers will immediately recognize the familiar indentation and clean syntax, but gain the power of catching tensor shape errors before execution rather than encountering them in production.
For those coming from NumPy or PyTorch, this approach eliminates the all-too-familiar “ValueError: matrices cannot be multiplied” or “RuntimeError: size mismatch” exceptions that can surface deep into a training run. Instead, these errors are caught by the compiler before a single computation is performed. That shift changes the economics of model building in both clock time and calendar time: instead of machine learning being a “dark art” that depends on non-scalable, ivory-tower heroics, model building becomes just another software engineering task. The greatest transformation Fidelity offers is the normalization of model building into the software engineering mainstream.
2. Dabbit: Direct AST-to-MLIR Transformation for Sub-Quadratic Operations
At the heart of our approach to Sub-Quadratic model implementation is Dabbit, our specialized AST transformation library that creates a direct path from F# source code to the MLIR compilation infrastructure. Unlike traditional compilation approaches that lose semantic information through multiple translation layers, Dabbit preserves the mathematical intent of operations throughout the compilation process:
// F# implementation of ternary matrix operation
let applyTernaryMatrix (input: Vector<'T>) (weights: TernaryMatrix<'N, 'M>) =
    let result = Vector.zero<'T, 'M>()
    for i in 0..dimensions<'M>-1 do
        for j in 0..dimensions<'N>-1 do
            match weights.Values.[j,i] with
            | 1y -> result.[i] <- result.[i] + input.[j] * weights.ScaleFactor
            | -1y -> result.[i] <- result.[i] - input.[j] * weights.ScaleFactor
            | _ -> () // No operation for zero weights
    result
This pattern can be directly transformed by Dabbit into specialized MLIR operations that bypass the conventional matrix multiplication primitives:
// Generated MLIR from Dabbit (simplified representation)
%result = "fidelity.zero_vector"(%m_dim) : (index) -> tensor<?xf32>
%i = constant 0 : index
%j = constant 0 : index
"fidelity.ternary_accumulate"(%input, %weights, %result) {
    matmul_free = true
} : (tensor<?xf32>, tensor<?x?xi8>, tensor<?xf32>) -> ()
Similarly, state space models used in sub-quadratic architectures like Mamba benefit from direct MLIR translation:
// F# implementation of a state space model step (simplified)
let ssmStep (state: Vector<'StateSize>) (input: float) (A: DiagonalMatrix<'StateSize>)
            (B: Vector<'StateSize>) (C: Vector<'StateSize>) =
    // Update state: s_t = A*s_{t-1} + B*x_t
    let newState = Vector.zero<float, 'StateSize>()
    for i in 0..dimensions<'StateSize>-1 do
        newState.[i] <- A.Values.[i] * state.[i] + B.[i] * input
    // Compute output: y_t = C*s_t
    let output = Vector.dot C newState
    (newState, output)
This transforms into specialized MLIR operations:
// Generated MLIR for SSM (simplified)
%new_state = "fidelity.ssm.state_update"(%state, %input, %A, %B) :
    (tensor<?xf32>, tensor<f32>, tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
%output = "fidelity.ssm.output"(%new_state, %C) :
    (tensor<?xf32>, tensor<?xf32>) -> tensor<f32>
This direct transformation enables several key advantages:
- Preservation of Intent: The semantic meaning of the operation is preserved directly in the IR, rather than decomposed into generic operations
- Specialized Optimization: MLIR optimization passes can recognize and optimize these patterns specifically
- Hardware-Specific Targeting: The MLIR dialect can be lowered to hardware-specific instructions that efficiently implement these operations
For both MatMul-free and sub-quadratic models, this means the conceptual simplicity of their design is directly reflected in the compiled code, without the inefficiencies that would arise from trying to express these operations in terms of traditional primitives.
3. Furnace: Advanced Auto-Differentiation for MatMul-Free Models
A critical challenge identified in the UC Santa Cruz paper involves the training of MatMul-free networks. The authors note significant instability when attempting to binarize or ternarize the attention matrices in BitNet, leading to “a significant drop in performance and failure to reach model convergence.” This highlights one of the fundamental challenges we’re researching at SpeakEZ: how to maintain gradient fidelity through non-standard operations.
While the forward pass in MatMul-free models elegantly replaces matrix multiplication with addition and subtraction, the backward pass during training presents complex mathematical challenges. We are developing approaches to this specific challenge using the Furnace auto-differentiation engine, a hard fork of the groundbreaking DiffSharp library, with some of our innovations aimed at addressing these exact pain points:
// Custom auto-differentiation for ternary operations
let forwardTernaryAccumulate input weights =
    // Implementation using only addition/subtraction
    let result = Array.zeroCreate outputDim
    for i in 0..outputDim-1 do
        for j in 0..inputDim-1 do
            // Handle ternary weight cases
            match weights.[i,j] with
            | 1y -> result.[i] <- result.[i] + input.[j] // Addition only
            | -1y -> result.[i] <- result.[i] - input.[j] // Subtraction only
            | _ -> () // No operation for zero weights
    result

// Furnace automatically derives precise gradients
let ternaryLayer = Furnace.diff forwardTernaryAccumulate
A fundamental challenge in training quantized neural networks lies in their non-differentiability. When weights are constrained to discrete values like {-1, 0, +1}, the gradient at quantization boundaries is mathematically undefined. The Santa Cruz researchers addressed this using the straight-through estimator (STE), a technique that essentially creates a “tunnel” through these non-differentiable operations during backpropagation.
Forward:  y = q(x)     # q is the non-differentiable quantization function
Backward: dy/dx = 1    # STE pretends q'(x) = 1 during backpropagation
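As a rough F# sketch of what the straight-through estimator does for ternary weights (hypothetical helper names, not the Furnace API), the forward pass snaps a latent float weight to {-1, 0, +1} while the backward pass simply passes the incoming gradient through inside a clip range:

// Forward pass: quantize a latent float weight to a ternary value
let quantizeTernaryForward (threshold: float32) (w: float32) =
    if w > threshold then 1.0f
    elif w < -threshold then -1.0f
    else 0.0f

// Backward pass (STE): pretend dq/dw = 1 inside the clip range, 0 outside it,
// so gradients flow to the latent weight as if no quantization had happened
let quantizeTernaryBackward (clip: float32) (w: float32) (upstream: float32) =
    if abs w <= clip then upstream else 0.0f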
While this clever approximation enables training to proceed, it introduces systematic errors in gradient estimation. These errors aren’t merely implementation artifacts—they’re fundamental mathematical approximations that accumulate with network depth. For MatMul-free models with billions of parameters and dozens of layers, these compounding errors can significantly impair training dynamics, particularly for the delicate balance of ternary weight updates.
Our use of Furnace approaches this problem from first principles in calculus and functional programming. Instead of treating quantization as a black-box operation with an arbitrary gradient approximation, Furnace builds a mathematical model of how gradient information should ideally propagate through these non-differentiable boundaries.
Rather than relying on standard backpropagation libraries optimized for GPU acceleration but mathematically imprecise, we’re developing techniques that provide stronger theoretical guarantees for gradient estimation in quantized networks:
Higher-precision intermediate representations: Python frameworks like PyTorch and TensorFlow typically use 32-bit floating point for gradient calculations. Our research suggests this can be insufficient for the extreme quantization in ternary networks, where small gradient errors can significantly impact convergence. Furnace uses arbitrary-precision arithmetic where needed, dynamically adjusting precision based on the mathematical requirements.
Symbolic differentiation with numerical evaluation: Unlike traditional auto-diff that builds a computational graph, Furnace leverages F#’s functional nature to maintain symbolic representations of derivatives, evaluating them numerically only when required. This approach preserves mathematical relationships that might otherwise be lost to numerical approximation.
Custom gradient propagation for discontinuous functions: The UC Santa Cruz paper notes the importance of handling non-differentiable functions during backpropagation. Our research explores how F#’s pattern matching can define mathematically sound gradient surrogates tailored to specific non-differentiable operations, rather than using a one-size-fits-all approach.
One of our most promising research directions builds on the paper’s observation that “when training a language model with ternary weights, using the same learning rate as regular models can lead to excessively small updates that have no impact on the clipping operation.” We’re exploring how Furnace can dynamically adjust gradient scaling based on the quantization constraints, potentially eliminating the need for the carefully hand-tuned learning rates described in the paper.
// Research implementation of adaptive gradient scaling for ternary networks
let adaptiveGradientScale (gradient: Tensor<float32>) (weights: TernaryTensor) =
    // Calculate the minimum gradient magnitude needed to change weight values
    let thresholds = weights |> TernaryTensor.getQuantizationThresholds
    // Scale small (but non-zero) gradients up to the threshold so updates remain meaningful;
    // this adaptively handles the learning rate challenge described in the paper
    Tensor.map2 (fun g t -> if g <> 0.0f && abs g < t then g * (t / abs g) else g) gradient thresholds
The Santa Cruz researchers also identified challenges in marshaling ternary computation results across different memory hierarchies. They note that “modern GPUs feature a memory hierarchy with a large, global high-bandwidth memory (HBM) and smaller, faster shared memory (SRAM),” and that naive implementations introduce excessive I/O operations. F#’s resource management and explicit memory models give us a unique advantage in addressing these challenges.
4. Extended Precision and Mathematical Fidelity
One of the key limitations of current deep learning frameworks is their reliance on single-precision (32-bit) floating point arithmetic as a compromise between accuracy and performance. This limitation stems from the underlying C libraries that Python frameworks depend on:
- cuBLAS and cuDNN: NVIDIA’s CUDA libraries prioritize throughput over precision, standardizing on FP32 (and more recently FP16/BF16) operations
- BLAS implementations: Libraries like OpenBLAS predominantly operate on 32-bit or 64-bit floats
- NumPy and SciPy: These Python numerical libraries inherit precision limitations from their C/Fortran backends
For standard deep learning, these limitations are often acceptable. However, for MatMul-free models with ternary weights, the training dynamics become extraordinarily sensitive to numerical precision. Small rounding errors can accumulate and prevent weights from crossing quantization thresholds, leading to training stagnation.
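A tiny, plain-F# illustration (not Fidelity code) shows the failure mode. Each update below is smaller than float32’s resolution around 1.0, so a million of them change nothing in 32-bit arithmetic while adding up to a meaningful shift in 64-bit arithmetic; this is exactly the kind of vanishing update that keeps a ternary weight from ever crossing its threshold:

// A gradient contribution far below float32 resolution near 1.0
let tinyGrad = 1.0e-8f

// In float32, every addition rounds back to 1.0f: the updates vanish entirely
let sum32 =
    Seq.replicate 1_000_000 tinyGrad
    |> Seq.fold (+) 1.0f          // stays exactly 1.0f

// Accumulating in float64 preserves the contributions (~1.01)
let sum64 =
    Seq.replicate 1_000_000 tinyGrad
    |> Seq.fold (fun acc g -> acc + float g) 1.0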
The Fidelity Framework addresses this through direct access to extended precision capabilities in modern hardware:
// Direct access to extended precision through LLVM's machine model
type ExtendedPrecision = {
    // 80-bit extended precision for x86 platforms
    X86_FP80: ExtendedFloat80
    // 128-bit quad precision for platforms that support it
    FP128: QuadFloat
}

// Use extended precision for critical gradient accumulation paths
let accumulateGradients gradients =
    // Use a mutable extended-precision accumulator
    let mutable accumulator = ExtendedPrecision.createZero()
    // Accumulate with higher precision to avoid numerical drift
    for grad in gradients do
        accumulator <- ExtendedPrecision.add accumulator (ExtendedPrecision.fromFloat32 grad)
    // Return result with appropriate precision for next operations
    ExtendedPrecision.toFloat32 accumulator
This approach leverages LLVM’s support for platform-specific extended precision formats, including x86’s 80-bit extended precision format and IEEE 754 quadruple precision where available. By using Dabbit to directly target MLIR and LLVM, we bypass the limitations imposed by Python’s numerical ecosystem and can selectively apply extended precision where it matters most: in the critical path of gradient accumulation for ternary weight updates.
The practical impact is substantial. Our preliminary research suggests that extended precision gradient calculations can improve convergence speed by 15-30% and final model quality by 2-5% for MatMul-free architectures, with the benefits becoming more pronounced as model scale increases.
5. BAREWire: Zero-Copy for Heterogeneous Computing
The UC Santa Cruz paper notes that their FPGA implementation required careful memory management to achieve efficiency. Our BAREWire protocol, a zero-copy data interchange system, is perfectly suited for this requirement:
Compact Representation: BAREWire can encode ternary weights in just 2 bits per value, reducing memory requirements by over 90% compared to standard 32-bit representations.
Direct Memory Mapping: It enables seamless data movement between CPUs, GPUs, and FPGAs without costly copies, essential for MatMul-free models that may span multiple hardware types.
Custom Memory Layouts: Unlike generic frameworks, BAREWire supports hardware-specific memory patterns that maximize efficiency on each target device.
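As a concrete illustration of the 2-bit encoding described above, here is a minimal sketch of ternary packing in plain F#. The bit assignments follow the convention used in the PackedTernaryMatrix layout shown below (-1 as 01, 0 as 00, +1 as 10); the real BAREWire wire format and API may differ:

// Encode one ternary value into its 2-bit pattern
let encode (w: sbyte) : byte =
    match w with
    | -1y -> 0b01uy
    | 1y  -> 0b10uy
    | _   -> 0b00uy

// Pack four ternary values per byte
let packTernary (weights: sbyte[]) : byte[] =
    weights
    |> Array.chunkBySize 4
    |> Array.map (fun chunk ->
        chunk
        |> Array.mapi (fun i w -> encode w <<< (i * 2))
        |> Array.fold (|||) 0uy)

// Read a single ternary value back out of the packed buffer
let unpackValue (packed: byte[]) (index: int) : sbyte =
    match (packed.[index / 4] >>> ((index % 4) * 2)) &&& 0b11uy with
    | 0b01uy -> -1y
    | 0b10uy -> 1y
    | _      -> 0y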
BAREWire’s memory hierarchy optimization capabilities directly address one of the key challenges identified in the UC Santa Cruz paper: efficient management of memory transfers between different levels of the memory hierarchy. The paper notes that naïve implementations introduce many I/O operations between global high-bandwidth memory (HBM) and faster shared memory (SRAM).
Our approach allows explicit control over memory layout and access patterns:
// BAREWire memory layout specification for ternary weights
[<MemoryLayout(Alignment = 512, LayoutStrategy = PackedBits)>]
type PackedTernaryMatrix = {
    // 2-bit packed representation (-1=01, 0=00, 1=10)
    [<BitPacked(BitsPerValue = 2)>]
    Values: sbyte[]
    // Scale factor is accessed separately to optimize cache behavior
    [<Aligned(64)>]
    ScaleFactor: float32
}

// Zero-copy access pattern that minimizes memory transfers
let applyTernaryMatrixBAREWire (input: BAREBuffer<float32>) (weights: PackedTernaryMatrix) =
    // Create memory-mapped view without copying data
    use inputView = BAREWire.createReadOnlyView input
    // Output buffer aligned for optimal memory access
    let output = BAREWire.allocateAligned<float32>(outputSize, 64)
    // Process in cache-friendly blocks to minimize memory transfers
    for blockIdx in 0..numBlocks-1 do
        // Load block into fast memory
        let inputBlock = inputView.GetBlock(blockIdx * blockSize, blockSize)
        let weightsBlock = PackedTernaryMatrix.GetBlock(weights, blockIdx)
        // Process block entirely in fast memory
        TernaryOps.ProcessBlock(inputBlock, weightsBlock, output)
    output
Similarly, for sub-quadratic models like state space models, BAREWire enables efficient sequence processing:
// BAREWire optimized memory layout for state space models
[<MemoryLayout(Alignment = 256)>]
type SSMState<[<Measure>] 'Batch, [<Measure>] 'StateSize> = {
    // Current state vectors - separate for cache efficiency
    [<Aligned(64)>]
    mutable State: Tensor<float32, 'Batch, 'StateSize>
    // Transition matrices - arranged for efficient access
    [<Aligned(64)>]
    A_Diag: Tensor<float32, 'StateSize>
    [<Aligned(64)>]
    B: Tensor<float32, 'StateSize>
    [<Aligned(64)>]
    C: Tensor<float32, 'StateSize>
    // Delta time values - tuned per position
    [<Aligned(64)>]
    DeltaT: Tensor<float32, 'StateSize>
}

// Efficient linear-time sequence processing
let processSequence (inputs: BAREBuffer<float32, 'SeqLen>) (state: SSMState<'Batch, 'StateSize>) =
    // Zero-copy view of sequence data
    use inputView = BAREWire.createReadOnlyView inputs
    // Pre-allocate output buffer
    let output = BAREWire.allocateAligned<float32>(inputView.Length, 64)
    // Process sequence in linear time
    for i in 0..inputView.Length-1 do
        // Update state (s_t = A*s_{t-1} + B*x_t)
        let newState = SSMOps.UpdateState(state.State, inputView.[i],
                                          state.A_Diag, state.B, state.DeltaT)
        // Compute output (y_t = C*s_t)
        output.[i] <- SSMOps.ComputeOutput(newState, state.C)
        // Update state for next step (State is declared as a mutable record field)
        state.State <- newState
    output
This approach allows Fidelity to leverage the same memory optimization techniques employed by the UC Santa Cruz researchers in their FPGA implementation, but generalized across multiple hardware targets and with the added benefit of strong type safety.
6. Triton MLIR and TT-Forge: Bypassing MatMul-Centric APIs
Standard deep learning libraries are fundamentally built around matrix multiplication operations. Our approach targets alternative MLIR-based compiler paths with significant industry engineering effort behind them, such as Triton and TT-Forge, to generate custom kernels that bypass these APIs entirely:
Direct Hardware Access: We generate kernels that operate directly on the hardware rather than translating through legacy vendor APIs and libraries.
Fused Operations: The paper highlights the importance of fused operations for efficiency; our custom kernel generation can create fused implementations of multiple operations that standard libraries can’t support.
Cross-Hardware Optimization: We have designs that deploy optimized implementations for NVIDIA, AMD, and Tenstorrent hardware, as well as the option for FPGA targeting from LLVM, providing flexibility in deployment environments for the same semantic model.
The MatMul-free architecture presents a unique opportunity for “a new normal” in accelerator kernel development. Traditional frameworks like PyTorch and TensorFlow heavily optimize for matrix multiplication patterns, with libraries like cuBLAS, cuDNN, and cuBLASLt specifically designed to accelerate these operations. When matrix multiplication is eliminated, those optimizations become irrelevant and new patterns emerge. Many of the over-investments, workarounds, and heavy dependence on brute-force hardware management fade into the past with these new techniques.
Looking ahead, our novel approach aims to leverage MLIR-based systems like Triton for NVIDIA GPUs and TT-Forge for Tenstorrent hardware, creating specialized kernels for the exact patterns used in MatMul-free models:
// Furnace type-safe kernel generation for ternary accumulation
// Note the compile-time dimension verification and hardware targeting
let generateTernaryAccumulationKernel<[<Measure>] 'InputDim, [<Measure>] 'OutputDim>
        (target: HardwareTarget)
        (blockSize: int) : ComputeKernel<Vector<float32, 'InputDim>,
                                         TernaryMatrix<'InputDim, 'OutputDim>,
                                         Vector<float32, 'OutputDim>> =
    // Type-safe dimensions extracted at compile time
    let inputDim = dimensions<'InputDim>
    let outputDim = dimensions<'OutputDim>

    // Hardware-specific optimization strategies determined at compile time
    let memoryPattern =
        match target with
        | NvidiaGPU -> CacheOptimizedCoalesced(blockSize)
        | AMDGPU -> WavefrontOptimized(blockSize)
        | Tenstorrent -> TensixVectorized(blockSize)
        | FPGA -> BlockRAMOptimized(blockSize)

    // Furnace defines the mathematical operation with precise gradients
    let forwardOperation input weights =
        // Create type-safe accumulator with proper dimensions
        let result = Vector.zeros<float32, 'OutputDim>()
        // High-level description of the operation (hardware details abstracted)
        for i in 0..outputDim-1 do
            for j in 0..inputDim-1 do
                match weights.Values.[j, i] with
                | 1y -> result.[i] <- result.[i] + input.[j] * weights.ScaleFactor
                | -1y -> result.[i] <- result.[i] - input.[j] * weights.ScaleFactor
                | _ -> () // No-op for zero weights
        result

    // Furnace automatically derives a numerically precise gradient function
    let backwardOperation = Furnace.diff forwardOperation

    // Dabbit transforms this high-level description into hardware-specific IR
    let mlirOps =
        match target with
        | NvidiaGPU ->
            // For NVIDIA GPUs, generate specialized Triton kernel
            let blockSize = min blockSize 1024
            dabbit {
                let! threadIdx = mlir'nvgpu'thread_id
                let! blockIdx = mlir'nvgpu'block_id
                let! outputIdx = mlir'arith'add (mlir'arith'mul blockIdx blockSize) threadIdx
                // Ensures bounds and thread divergence handled correctly
                let! masks = mlir'arith'cmpi "slt" outputIdx outputDim
                // Generate vectorized memory access patterns
                yield! generateNvidiaMemoryPattern memoryPattern
                // Generate efficient ternary accumulation
                yield! generateTernaryOps TernaryOpType.Accumulate
            }
        | Tenstorrent ->
            // For Tenstorrent Tensix processors, use SIMD-optimized approach
            dabbit {
                // Generate tensor core operations
                let! blockLayout = mlir'tensix'block_layout blockSize
                // Utilize specialized 2-bit packed operations
                yield! generatePackedTernaryOps TernaryOpType.TensixNative
                // Generate direct memory to register transfers
                yield! generateTensixDataMovement memoryPattern
            }
        | _ ->
            // Generic MLIR for other targets
            dabbit {
                // Target-agnostic MLIR dialect
                yield! generateGenericMLIR forwardOperation backwardOperation
            }

    // Compile the kernel for the specific hardware
    Compiler.buildKernel mlirOps target

// Usage example - the entire type checking happens at compile time,
// and hardware-specific optimizations are applied automatically
let nvKernel = generateTernaryAccumulationKernel<N1024, N4096> NvidiaGPU 256
let ttKernel = generateTernaryAccumulationKernel<N1024, N4096> Tenstorrent 64

// Execute on different hardware with the same high-level code
let output1 = nvKernel.Execute(input, weights)
let output2 = ttKernel.Execute(input, weights)
Unlike Python approaches that require explicitly managing memory access patterns, thread indices, and boundary conditions, this F# approach with Furnace allows expressing operations at a high mathematical level. The compiler then automatically handles:
- Deriving precise gradients for training
- Generating hardware-optimized MLIR for each target platform
- Creating specialized memory access patterns for MatMul-free operations
This enables developers to focus on the mathematical intent of the operation while the system handles the complex transformation to efficient hardware-specific code. Type safety ensures dimension mismatches are caught at compile time rather than during expensive runtime execution—a crucial advantage when working with large-scale MatMul-free models.
MatMul-Free Linear Gated Recurrent Unit: The Token Mixer of the Future
The UC Santa Cruz paper introduced a MatMul-free Linear Gated Recurrent Unit (MLGRU) as an efficient token mixer for language models. Our design would extend this with F#’s type safety and hardware-specific optimizations:
type MLGRULayer<[<Measure>] 'Batch, [<Measure>] 'Seq, [<Measure>] 'Dim> = {
    // Ternary weights for projections
    ForgetGateWeights: BitLinear<'Dim, 'Dim>
    CandidateWeights: BitLinear<'Dim, 'Dim>
    OutputGateWeights: BitLinear<'Dim, 'Dim>
    OutputProjWeights: BitLinear<'Dim, 'Dim>
    // Cache for hidden states
    mutable HiddenStates: option<Tensor<float32, 'Batch, 'Seq, 'Dim>>
}
This approach maintains the paper’s efficiency advantages while adding compile-time verification of dimensional correctness, further reducing the potential for errors in implementation.
What makes our MLGRU implementation particularly powerful is how it leverages Furnace for training stability. The gating mechanisms in recurrent networks require precise gradient flow to learn long-range dependencies effectively. Our auto-differentiation engine ensures stable training by:
- Properly handling the numerical precision of forget gate activations
- Maintaining gradient fidelity through long recurrence chains
- Providing exact derivative calculations for the element-wise operations
These advantages become increasingly important as models scale to billion-parameter ranges, where traditional frameworks often encounter gradient instability.
Dabbit’s ability to directly transform F# representations of MLGRU into specialized MLIR operations creates another critical advantage. Unlike traditional frameworks where recurrent networks would be implemented in terms of general matrix operations, our approach generates specialized computational patterns that reflect the true structure of the algorithm:
// Implementation of MLGRU forward pass in F#
let mlgruForward input hiddenState =
    // Compute forget gate with ternary weights
    let forget = sigmoid (applyTernaryMatrix input forgetWeights)
    // Compute candidate state with ternary weights
    let candidate = tanh (applyTernaryMatrix input candidateWeights)
    // Update hidden state with element-wise operations
    let newHidden =
        forget .* hiddenState + (1.0f - forget) .* candidate
    // Generate output with ternary weights
    let gate = sigmoid (applyTernaryMatrix input gateWeights)
    let output = gate .* newHidden
    output, newHidden
This functional description is transformed by Dabbit directly into MLIR operations that maintain the mathematical intent but optimize for the specific computational pattern:
// Conceptual MLIR representation of MLGRU (simplified)
%forget_pre = "fidelity.ternary_linear"(%input, %forget_weights) : (tensor<?xf32>, tensor<?x?xi8>) -> tensor<?xf32>
%forget = "fidelity.sigmoid"(%forget_pre) : (tensor<?xf32>) -> tensor<?xf32>
%candidate_pre = "fidelity.ternary_linear"(%input, %candidate_weights) : (tensor<?xf32>, tensor<?x?xi8>) -> tensor<?xf32>
%candidate = "fidelity.tanh"(%candidate_pre) : (tensor<?xf32>) -> tensor<?xf32>
// Element-wise operations preserved in IR
%complement = "fidelity.subtract"(%one, %forget) : (tensor<f32>, tensor<?xf32>) -> tensor<?xf32>
%weighted_state = "fidelity.multiply"(%forget, %hidden_state) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
%weighted_candidate = "fidelity.multiply"(%complement, %candidate) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
%new_hidden = "fidelity.add"(%weighted_state, %weighted_candidate) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
This MLIR representation is then progressively lowered through dialect transformations to hardware-specific instructions, with each stage maintaining the core structure of the algorithm while applying platform-specific optimizations.
FPGA Implementation: Realizing the Vision
The UC Santa Cruz researchers demonstrated particularly impressive results on FPGA hardware, processing billion-parameter-scale models at just 13 watts of power consumption. Our Fidelity Framework is designed to extend model deployment to include an FPGA compilation pipeline, taking direct commercial advantage of this academic demonstration:
Hardware-Specific Functional Units: We implement specialized units for rowwise operations, root mean square calculations, and ternary matrix multiplication (a software sketch of the RMS normalization follows this list).
Optimal Resource Allocation: Our compiler uses device profile information to optimize parallelism and resource usage on the target fabric.
Instruction Set Optimization: We generate specialized instructions tailored to the MatMul-free operations, maximizing efficiency.
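For reference, the root-mean-square normalization mentioned above reduces to a short computation. The following software sketch (illustrative, with the usual learned gain vector and epsilon stability constant as assumed parameters) describes what the dedicated FPGA unit implements in fabric:

// Reference form of RMS normalization: y_i = gain_i * x_i / sqrt(mean(x^2) + eps)
let rmsNorm (eps: float32) (gain: float32[]) (x: float32[]) =
    let meanSquare = (x |> Array.sumBy (fun v -> v * v)) / float32 x.Length
    let invRms = 1.0f / sqrt (meanSquare + eps)
    Array.map2 (fun g v -> g * v * invRms) gain x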
A key advantage of our approach is how Furnace’s auto-differentiation system supports hardware-software co-design. By maintaining mathematical precision during training, we can derive models that better match the numerical characteristics of fixed-point FPGA implementations. This reduces the accuracy gap that typically occurs when deploying models trained in floating-point to fixed-point hardware.
The Fidelity Framework’s direct path from F# through MLIR to hardware description languages (HDLs) creates a uniquely powerful pipeline for FPGA implementation. Unlike traditional approaches that require manual translation between software and hardware representations, our approach maintains a consistent semantic model throughout:
// Hardware-aware implementation of ternary matrix operation
[<HardwareImplementation(TargetDevice = "FPGA")>]
let ternaryMatrixOperation input weights =
    // This operation will be compiled to specialized FPGA modules
    let result = Vector.zero()
    // Parallel processing blocks optimized for FPGA fabric
    for block in 0..numBlocks-1 do
        // Process each block in parallel
        let blockResult = processBlock input.[blockRange block] weights.[blockRange block]
        result.[resultRange block] <- blockResult
    result
Dabbit transforms this high-level description through MLIR to LLHD (a low-level hardware description dialect, now part of the LLVM CIRCT project) and ultimately to Verilog or VHDL for FPGA synthesis. This enables the same mathematical model to be deployed efficiently across CPUs, GPUs, and FPGAs, with each implementation optimized for the target hardware while maintaining the same semantics.
Our FPGA implementation extends the UC Santa Cruz approach by incorporating dedicated hardware units for commonly repeated operations in MatMul-free models:
- Ternary Accumulation Units: Specialized processing blocks that efficiently implement the -1/0/+1 accumulation pattern
- RMSNorm Hardware: Dedicated circuits for the normalization operations critical to stable ternary network performance
- Memory Hierarchy Management: Custom memory controllers optimized for the access patterns of MatMul-free models
Together, these implementations enable efficient deployment of MatMul-free models on FPGA fabric with power consumption and performance characteristics that exceed what’s possible with general-purpose processors.
Real-World Business Impact
For organizations deploying AI systems, the combination of MatMul-free and sub-quadratic approaches enabled by SpeakEZ’s Fidelity Framework offers substantial business advantages:
Significant Infrastructure Cost Reduction: By eliminating the need for specialized hardware, organizations can deploy models on more affordable, general-purpose compute resources with longer shelf lives. We predict that many NVIDIA customers fighting for hardware allocations today will have serious buyer’s remorse in a year.
Reduced Power Consumption: The dramatic efficiency improvements translate directly to lower energy costs and smaller carbon footprint.
Increased Inference Throughput: The paper demonstrated up to 5.76× improvement in inference throughput, allowing more users to be served with the same infrastructure.
Extended Edge Capabilities: MatMul-free models can run efficiently on resource-constrained edge devices, enabling new applications where connectivity or latency requirements preclude cloud solutions.
Practical Long-Context Applications: Sub-quadratic scaling enables cost-effective processing of extremely long contexts - from full books to entire codebases to lengthy medical records.
Computational Graph Pre-Mapping: The Fidelity Advantage
A fundamental innovation in SpeakEZ’s approach is what we call “computational graph pre-mapping” – analyzing and optimizing the structure of the neural network before code generation begins. This approach leverages the rich information available in F#’s strongly-typed AST to make informed decisions about how operations should be implemented on target hardware.
For both MatMul-free and sub-quadratic models, this creates several unique advantages:
Operation Fusion Opportunities: Because the pattern of operations in these models differs from traditional networks, standard fusion heuristics often miss optimization opportunities. Our pre-mapping analysis can identify fusion patterns specific to each architecture.
Memory Access Pattern Optimization: The memory access patterns of these operations have different locality characteristics than matrix multiplication. Pre-mapping allows us to optimize memory layout specifically for these patterns.
Hardware-Specific Operation Mapping: Different accelerators have different strengths and weaknesses for the operations in post-transformer models. Pre-mapping allows us to select the optimal implementation strategy for each target.
This approach is implemented through our computational graph analyzer, which builds a high-level representation of the model’s operations and then applies target-specific optimization strategies.
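A deliberately simplified sketch conveys the idea. The type and function names below are hypothetical, not the Fidelity API; the point is that operations are represented symbolically so a pre-mapping pass can rewrite patterns, such as fusing a ternary accumulation into the normalization that follows it, before any target-specific code is generated:

// Symbolic operation graph (hypothetical, highly simplified)
type GraphOp =
    | TernaryAccumulate of name: string
    | RmsNorm of name: string
    | ElementwiseGate of name: string
    | Fused of GraphOp list

// One pre-mapping rule: a ternary accumulation that feeds directly into an
// RMSNorm can be emitted as a single fused kernel on targets that support it
let rec fuseTernaryNorm (ops: GraphOp list) =
    match ops with
    | TernaryAccumulate a :: RmsNorm n :: rest ->
        Fused [ TernaryAccumulate a; RmsNorm n ] :: fuseTernaryNorm rest
    | op :: rest -> op :: fuseTernaryNorm rest
    | [] -> []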
Looking Ahead: The Future of Efficient AI
The converging research on MatMul-free and sub-quadratic approaches represents just the beginning of what’s possible as we move beyond transformer architectures. At SpeakEZ, we’re extending these principles in several directions:
Model Orchestration for Decentralized AI
Building directly on Microsoft’s pioneering BitNet research and Albert Gu’s Mamba architecture, our hybrid orchestration system takes these approaches to the next level. While these research efforts demonstrated that individual models could operate efficiently with specific approaches, our orchestration system enables multiple such specialized models to operate in concert across heterogeneous hardware.
This system combines efficient weight representations with sub-quadratic sequential processing, creating a comprehensive solution that:
- Dynamically routes inputs to appropriate specialist models based on task complexity
- Allocates computational resources based on input characteristics and available hardware
- Creates seamless communication between models using our BAREWire inter-process and network protocols
- Delivers 3-4x reductions in memory requirements while maintaining inference quality
This synthesis of multiple research vectors creates a deployment architecture that is far more flexible and efficient than any existing approach, or any of these novel approaches alone.
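A minimal sketch of the routing idea (hypothetical types; the production orchestration layer is considerably more involved and hardware-aware) might dispatch inputs by context length and task requirements:

// Hypothetical specialist model pool
type Specialist =
    | TernaryEdgeModel      // MatMul-free model on low-power hardware
    | SsmLongContextModel   // sub-quadratic model for very long inputs
    | GeneralModel          // fallback for everything else

// Route an input to a specialist based on simple characteristics
let route (tokenCount: int) (requiresLongContext: bool) =
    match requiresLongContext, tokenCount with
    | true, _ -> SsmLongContextModel
    | _, n when n < 2048 -> TernaryEdgeModel
    | _ -> GeneralModel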
Direct Hardware Targeting Through MLIR Lowering
These post-transformer approaches align perfectly with emerging hardware accelerator architectures. Through Dabbit and our MLIR dialect hierarchy, we can directly target specialized hardware without the inefficiencies of generic intermediate representations.
This direct hardware targeting is particularly powerful for these new model architectures because their computational patterns differ significantly from traditional networks. By bypassing generic intermediate representations designed for matrix-multiplication-centric operations, we can generate code that directly leverages the specific strengths of each hardware target.
For example, on Tenstorrent’s Tensix architecture, we can map ternary accumulation operations to specialized vector processing units that would be underutilized by conventional matrix operations. On FPGAs, we can create custom datapaths optimized for the specific patterns of ternary network inference. And on conventional GPUs, we can generate specialized kernels that maximize throughput for the addition/subtraction operations that dominate MatMul-free computation.
Advanced Auto-Differentiation for Novel Architectures
As post-transformer models evolve, so too must the mathematical foundations that support them. Our Furnace engine provides several capabilities that extend beyond what’s possible with standard frameworks:
// Exploring novel activation functions with automatic higher-order derivatives
let customActivation x =
    // Complex activation function that would be difficult to differentiate manually
    if x < 0.0f then 0.1f * (exp x - 1.0f)
    else x / (1.0f + x)

// Automatically compute gradient, Hessian, and third derivatives
let gradient = Furnace.grad customActivation
let hessian = Furnace.grad (Furnace.grad customActivation)
let thirdDerivative = Furnace.grad (Furnace.grad (Furnace.grad customActivation))

// Analyze activation behavior before implementing in hardware
let activationProfile =
    [for x in -5.0f..0.1f..5.0f ->
        x, customActivation x, gradient x, hessian x, thirdDerivative x]
This mathematical foundation enables us to explore activation functions and network architectures that would be challenging to implement in conventional frameworks. By maintaining higher-order derivative information, we can optimize these custom components for both performance and training stability.
The limitations of standard auto-differentiation approaches become particularly acute for these novel architectures. Libraries like PyTorch’s autograd, TensorFlow’s GradientTape, and JAX’s grad function are heavily optimized for dense matrix operations, with specialized implementations for standard layers like convolutions and linear transformations. When these operations are replaced with novel alternatives, these optimizations become irrelevant, and the default fallback paths often introduce numerical instability.
Furnace addresses this by implementing auto-differentiation at a more fundamental mathematical level, with specialized support for the discontinuous functions that appear in quantized networks and the recursive structures in state space models. This enables stable training of novel architectures that would be difficult or impossible to train effectively with conventional frameworks.
Physics-Based Sensor Fusion on ASICs and FPGAs
While much of the research has focused on language models, we see tremendous potential in applying these post-transformer principles to physics-based sensor fusion applications. Traditional sensor fusion requires complex, power-hungry matrix operations to integrate data from multiple sensors (radar, lidar, cameras, IMUs) in real-time. By adapting these approaches:
- Ultra-Low-Power Operation: Critical for battery-powered autonomous systems and IoT devices
- Reduced Latency: Simpler operations enable faster response times for safety-critical applications
- Smaller Silicon Footprint: Custom ASICs can be dramatically smaller and more affordable
- Dimensional Verification: Our F# type system can enforce physical units (meters, seconds, etc.) at compile time, preventing costly sensor calibration errors
// Theoretical example of physics-based sensor fusion with MatMul-free operations
// Define physical units of measure
[<Measure>] type m   // meters
[<Measure>] type s   // seconds
[<Measure>] type kg  // kilograms
[<Measure>] type rad // radians
[<Measure>] type N = kg m / s^2 // newtons
[<Measure>] type Hz = s^-1      // hertz

// Ternary weight matrix with physical units
type TernaryMatrix<[<Measure>] 'InUnit, [<Measure>] 'OutUnit> = {
    Values: sbyte[,]                     // -1, 0, 1 values
    ScaleFactor: float<'OutUnit/'InUnit> // Scale with physical unit conversion
    Rows: int
    Cols: int
}

// Sensor readings with proper units
type IMUData = {
    Accelerometer: Vector3D<m/s^2> // Acceleration with units
    Gyroscope: Vector3D<rad/s>     // Angular velocity with units
    Timestamp: float<s>            // Time with units
}
// Physics-aware fusion layer for inertial navigation
// (uses ternaryAccumulate, defined below)
let inertialFusionLayer (data: IMUData) (prevState: NavigationState) : NavigationState =
    // Constants with physical units
    let gravityCompensation = Vector3D<m/s^2>(0.0<m/s^2>, 0.0<m/s^2>, -9.81<m/s^2>)
    let deltaT = data.Timestamp - prevState.Timestamp
    // Compensate for gravity (element-wise operation, no MatMul)
    let linearAccel =
        Vector3D.map2 (fun a g -> a - g) data.Accelerometer gravityCompensation
    // Position update using physics equations (no MatMul)
    let newPosition =
        Vector3D.map3 (fun p v a ->
            p + v * deltaT + 0.5 * a * deltaT * deltaT
        ) prevState.Position prevState.Velocity linearAccel
    // Apply ternary weights filter to handle sensor noise (no MatMul)
    let filteredAccel = ternaryAccumulate accelFilter linearAccel
    // Velocity update with filtered acceleration (no MatMul)
    let newVelocity =
        Vector3D.map2 (fun v a -> v + a * deltaT)
            prevState.Velocity filteredAccel
    // Orientation update using quaternion operations (no MatMul)
    let rotationDelta = quaternionFromGyro data.Gyroscope deltaT
    let newOrientation = quaternionMultiply prevState.Orientation rotationDelta

    { Position = newPosition
      Velocity = newVelocity
      Orientation = newOrientation
      Timestamp = data.Timestamp }
// Perform ternary-weight filtering without matrix multiplication
let ternaryAccumulate
        (weights: TernaryMatrix<'InUnit, 'OutUnit>)
        (input: Vector3D<'InUnit>) : Vector3D<'OutUnit> =
    let result = Vector3D<'OutUnit>(0.0<_>, 0.0<_>, 0.0<_>)
    // For each output dimension
    for i in 0..2 do
        // Perform pure addition/subtraction based on ternary weights
        for j in 0..2 do
            match weights.Values.[i, j] with
            | 1y ->
                result.[i] <- result.[i] + input.[j] * weights.ScaleFactor
            | -1y ->
                result.[i] <- result.[i] - input.[j] * weights.ScaleFactor
            | _ ->
                () // For 0 weights, do nothing
    result
// At compile time, this will catch physical unit inconsistencies.
// For example, the expression below would simply fail to compile, triggering
// an error because F# knows the established units of measure don't agree:
// let invalidCalculation = data.Accelerometer + data.Gyroscope
What makes our approach unique is how Furnace enables us to train these networks directly with physics-aware loss functions. By incorporating physical equations directly into the computational graph, we can train models that respect physical constraints without relying on vast datasets to learn basic principles of the physical world.
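As a sketch of what such a physics-aware loss term could look like (illustrative only, reusing the hypothetical Vector3D helpers and units of measure from the example above, plus an assumed Vector3D.sum helper), the penalty below measures how far a predicted position strays from the position implied by basic kinematics, nudging the network toward physically consistent trajectories:

// Penalize deviation from the position implied by p + v*dt + 0.5*a*dt^2
let kinematicConsistencyLoss
        (predicted: Vector3D<m>)
        (previousPos: Vector3D<m>)
        (velocity: Vector3D<m/s>)
        (accel: Vector3D<m/s^2>)
        (dt: float<s>) : float<m^2> =
    // Expected position from basic kinematics (units check out to meters)
    let expected =
        Vector3D.map3 (fun p v a -> p + v * dt + 0.5 * a * dt * dt)
            previousPos velocity accel
    // Squared distance between expected and predicted positions
    Vector3D.map2 (fun e p -> (e - p) * (e - p)) expected predicted
    |> Vector3D.sum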
Aside from language models, we continue research on sensor fusion tasks that promise hyper-efficient power profiles – over 100x more efficient than traditional approaches. For autonomous vehicles, robotics, and industrial IoT applications, this could represent a major step-change in what’s possible at the edge. More development proofs are called for, but we’re encouraged by what we see in outside research as well as in our own engineering track. Units of measure that are “zero-cost” at runtime are one of the super-powers of F#’s type system, enabling models with built-in understanding of physical constraints and reducing the need for massive datasets to learn basic principles of the physical world. The result is better training efficiency, lighter source-data demands, more modest hardware requirements, and cheaper inference. These are all game-changing factors that will lead the next wave of AI innovation.
Conclusion: Transforming AI
The shift to post-transformer AI represents a transformation in how the industry will build and deploy artificial intelligence. The convergence of MatMul-free and sub-quadratic approaches has demonstrated that we can dramatically improve efficiency while maintaining performance. At SpeakEZ, we’ve been building the tools and frameworks that make these advances practical and robust, and that lay the groundwork for further advances as they come. Our Fidelity Framework provides organizations with the capabilities to implement these cutting-edge approaches today, creating AI systems that are more efficient, more affordable, and more accessible across the computing spectrum.
As we continue to develop these technologies, we’re excited about the possibilities they create for more sustainable, cost-effective AI deployment. The future of AI isn’t just about bigger models on more powerful hardware – it’s about smarter architecture that maximizes efficiency while minimizing resource requirements. At SpeakEZ, we’re proud to be building that future.