A Unified Vision for Ternary Models

Exploring AMD As A Potential Seed Bed For Innovation In Heterogeneous AI Inference

The advent of sub-quadratic AI models, heterogeneous computing, and unified memory architectures represents a pivotal moment in AI system design. As we stand at this technological crossroads, AMD’s evolving unified CPU/GPU architecture, exemplified by the MI300A and its planned successors (MI325, MI350, MI400), offers a compelling case study for re-imagining how AI models can operate.

This exploration examines how the Fidelity framework, with its BAREWire zero-copy technology and F#’s type-safe bit manipulation, is uniquely positioned to leverage AMD’s unified architecture to create a new paradigm for distributed AI inference.

The Ternary Revolution: When Addition Beats Multiplication

Traditional neural networks rely heavily on matrix multiplication, an operation where GPUs excel with their massive parallelism. However, ternary quantization, reducing weights to {-1, 0, +1}, fundamentally changes this equation. By replacing multiplication with simple addition and subtraction, we shift the computational balance dramatically in favor of CPUs.

%%{init: {'theme': 'neutral'}}%%
graph TD
    A[Traditional FP16 Model] -->|Multiply-Accumulate| B[GPU: 300x faster than CPU]
    C[Ternary Model] -->|Add-Subtract Only| D[GPU: 4x faster than CPU]
    B --> E[GPU-Centric Deployment]
    D --> F[Heterogeneous Deployment]
    F --> G[CPU: Simple Patterns]
    F --> H[GPU: Complex Patterns]
    F --> I[Shared Coordination]
    style A fill:#f9d5e5,stroke:#333
    style C fill:#d0f0c0,stroke:#333
    style F fill:#a8d8ea,stroke:#333
    style I stroke:#ff6600

This shift isn’t merely about performance; it’s about fundamentally rethinking where computation happens. When a CPU can process ternary operations at 512 operations per cycle using AVX-512 while a GPU manages roughly 2,000, the GPU’s 4x advantage may not justify the complexity and power consumption of GPU-only deployment.
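
To make this concrete, consider a dot product with ternary weights. The sketch below is illustrative only, not Fidelity’s actual kernel: each weight in {-1, 0, +1} either adds, subtracts, or skips the corresponding activation, so no multiplier is involved.

// Illustrative only: a ternary dot product needs no multiplication at all.
let ternaryDot (weights: sbyte[]) (activations: int[]) =
    let mutable acc = 0
    for i in 0 .. weights.Length - 1 do
        match weights.[i] with
        | 1y  -> acc <- acc + activations.[i]   // +1: add
        | -1y -> acc <- acc - activations.[i]   // -1: subtract
        | _   -> ()                             //  0: skip (natural sparsity)
    acc

// Example: weights [+1; 0; -1; +1] against activations [3; 7; 2; 5] -> 3 - 2 + 5 = 6
let example = ternaryDot [| 1y; 0y; -1y; 1y |] [| 3; 7; 2; 5 |]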

The Art of Bit Packing: 5 Trits in 8 Bits

The mathematical elegance of ternary packing, fitting 5 ternary values into 8 bits (with padding where needed), provides the foundation for efficient storage and computation:

open FSharp.NativeInterop

[<Measure>] type trit
[<Measure>] type packed

let inline byteWithMeasure<[<Measure>] 'u> (b: byte) : byte<'u> = 
    LanguagePrimitives.ByteWithMeasure<'u> b

let inline intWithMeasure<[<Measure>] 'u> (i: int) : int<'u> = 
    LanguagePrimitives.Int32WithMeasure<'u> i

type TernaryValue = 
    | Neg
    | Zero
    | Pos
    | Pad  // Padding value for incomplete chunks
    member this.ToPackedByte = 
        match this with
        | Zero -> 0uy
        | Neg -> 1uy  
        | Pos -> 2uy
        | Pad -> 3uy  // Uses base-4 encoding when padding present
    
    static member FromPackedByte (value: byte) =
        match value with
        | 0uy -> Zero
        | 1uy -> Neg
        | 2uy -> Pos
        | 3uy -> Pad
        | _ -> failwith "Invalid ternary value"

// Pack using base-3 for pure ternary or base-4 when padding needed
let packTernary (values: TernaryValue array) : byte<packed> array * int<trit> =
    let actualTritCount = intWithMeasure<trit> values.Length
    
    // Counts that are not a multiple of 5 fall back to base-4 encoding (2 bits per value)
    let needsPadding = values.Length % 5 <> 0
    
    if needsPadding then
        let paddedValues = 
            // Pad up to a multiple of 4 so every byte holds exactly four base-4 values
            let padding = Array.create ((4 - values.Length % 4) % 4) Pad
            Array.append values padding
        
        let packedBytes = 
            paddedValues
            |> Array.chunkBySize 4
            |> Array.map (fun chunk ->
                let packed = 
                    chunk.[0].ToPackedByte + 
                    chunk.[1].ToPackedByte * 4uy +
                    chunk.[2].ToPackedByte * 16uy +
                    chunk.[3].ToPackedByte * 64uy
                byteWithMeasure<packed> packed)
        
        (packedBytes, actualTritCount)
    else
        let packedBytes = 
            values
            |> Array.chunkBySize 5
            |> Array.map (fun chunk ->
                let packed = 
                    chunk.[0].ToPackedByte + 
                    chunk.[1].ToPackedByte * 3uy +
                    chunk.[2].ToPackedByte * 9uy +
                    chunk.[3].ToPackedByte * 27uy +
                    chunk.[4].ToPackedByte * 81uy
                byteWithMeasure<packed> packed)
        
        (packedBytes, actualTritCount)

// Unpack function that handles both base-3 and base-4 encoding
let unpackTernary (packedBytes: byte<packed> array) (actualTritCount: int<trit>) : TernaryValue array =
    // Counts that are not a multiple of 5 were packed with the base-4 (padded) scheme
    let isPadded = actualTritCount % (5 * 1<trit>) <> 0<trit>
    
    let allUnpacked = 
        if isPadded then
            packedBytes
            |> Array.collect (fun packedByte ->
                let b = byte packedByte
                [|
                    TernaryValue.FromPackedByte(b % 4uy)
                    TernaryValue.FromPackedByte((b / 4uy) % 4uy)
                    TernaryValue.FromPackedByte((b / 16uy) % 4uy)
                    TernaryValue.FromPackedByte((b / 64uy) % 4uy)
                |])
        else
            packedBytes
            |> Array.collect (fun packedByte ->
                let b = byte packedByte
                [|
                    TernaryValue.FromPackedByte(b % 3uy)
                    TernaryValue.FromPackedByte((b / 3uy) % 3uy)
                    TernaryValue.FromPackedByte((b / 9uy) % 3uy)
                    TernaryValue.FromPackedByte((b / 27uy) % 3uy)
                    TernaryValue.FromPackedByte((b / 81uy) % 3uy)
                |])
    
    // Return only actual data (Pad values are always at the end)
    allUnpacked.[0 .. (int actualTritCount - 1)]

This 96.9% storage efficiency, combined with SIMD-friendly unpacking operations, enables CPU cores to process ternary operations at speeds approaching specialized hardware, all while maintaining the flexibility to run on commodity processors.
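
As a quick sanity check, here is a round-trip through the two functions above; the sample weight array is purely illustrative. Seven trits is not a multiple of five, so this input exercises the base-4 padded path, and the returned trit count lets unpackTernary discard the trailing Pad values.

// Round-trip example using packTernary and unpackTernary defined above
let weights = [| Pos; Neg; Zero; Pos; Pos; Neg; Zero |]
let packedBytes, tritCount = packTernary weights
printfn "Packed %d trits into %d bytes" (int tritCount) packedBytes.Length  // 7 trits -> 2 bytes
let restored = unpackTernary packedBytes tritCount
printfn "Round-trip preserved: %b" (restored = weights)  // true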

Memory Architecture Evolution

With the industry converging on memory unification, there are now multiple pathways to efficient heterogeneous computing, each with distinct advantages for ternary model deployment:

%%{init: {'theme': 'neutral'}}%%
graph LR
    subgraph "Future: Unified Memory"
        C1[CPU Compute] --> C3[BAREWire Zero-Copy]
        C2[GPU Compute] --> C3
        C3 --> C4[Single Address Space]
    end
    subgraph "Transitional: Coherent Memory"
        B1[CPU Cache]
        B2[GPU Cache]
        B3[BAREWire Inter-Process Comms]
        B1 --> B3
        B2 --> B3
    end
    subgraph "Legacy: Split Memory"
        A1[CPU Memory] -.->|PCIe/IF| A2[GPU Memory]
        A2 -.->|Explicit Copy| A1
    end
    style B3 fill:#ffffcc,stroke:#ff6600,stroke-width:4px
    style C4 fill:#d0f0c0,stroke:#006600,stroke-width:4px
    style C3 fill:#ffffcc,stroke:#ff6600,stroke-width:4px

MI300A: A Unified Future For AMD

The MI300A APU marks the start of AMD’s push toward true hardware-coherent shared memory between CPU and GPU:

module UnifiedMemoryInference =
    // Single allocation visible to both CPU and GPU
    let createUnifiedTensor<'T> (shape: int array) =
        let buffer = AMD.allocateUnified<'T>(shape |> Array.reduce (*))
        {
            Data = buffer
            CPUView = buffer.HostPointer
            GPUView = buffer.DevicePointer  // Same physical memory!
            Shape = shape
        }
    
    // Zero-copy model distribution
    let distributeModel (model: TernaryModel) =
        // Attention heads stay on GPU
        let attention = createUnifiedTensor model.AttentionShape
        
        // Simple FFN layers on CPU
        let ffn = createUnifiedTensor model.FFNShape
        
        // Seamless data flow without copies
        { Attention = attention; FFN = ffn }

Infinity Fabric: Coherent Interconnect

For discrete GPU systems, Infinity Fabric provides a cache-coherent interconnect with promising bandwidth:

[<Measure>] type GB
[<Measure>] type s
[<Measure>] type ns

type InfinityFabricChannel = {
    Bandwidth: float<GB/s>  // Up to 800 GB/s
    Latency: float<ns>      // ~120ns
    CoherencyProtocol: XGMI
}

let setupCoherentChannel (cpu: EPYC) (gpu: MI300X) =
    // Establish coherent link
    let fabric = AMD.InfinityFabric.connect cpu gpu
    
    // Allocate in GPU memory but CPU accessible
    let gpuMemory = gpu.allocateCoherent(size = 16<GB>)
    
    // Map to CPU address space
    let cpuMapping = fabric.mapCoherent(gpuMemory)
    
    { 
        CPUAddress = cpuMapping.VirtualAddress
        GPUAddress = gpuMemory.DeviceAddress
        Coherency = fabric.CoherencyDomain
    }

Actor-Based Model Workloads

The true power of heterogeneous ternary inference emerges when we orchestrate multiple specialized models as a group of cooperating actors:

%%{init: {'theme': 'neutral'}}%%
graph LR
    subgraph "Heterogeneous Inference Layer"
        C[Coordinator Actor]
        R[Router Actor]
        A[Prospero Scheduler/Orchestrator]
        CPU1[CPU: Language Expert]
        CPU2[CPU: Logic Expert]
        GPU1[GPU: Vision Expert]
        GPU2[GPU: Math Expert]
        H1[Hybrid: Reasoning Expert]
        M[BAREWire Zero-Copy Pool]
    end
    C --> R
    R --> CPU1 & CPU2 & GPU1 & GPU2 & H1
    CPU1 & CPU2 & GPU1 & GPU2 & H1 <---> A
    A <-.-> M
    style M fill:#ffffcc,stroke:#ff6600,stroke-width:4px
    style C fill:#a8d8ea,stroke:#333
    style R fill:#a8d8ea,stroke:#333
    style A fill:#a8d8ea,stroke:#333

This architecture leverages F#’s actor model to create a flexible, scalable inference system:

// Specialized model actors with hardware affinity
type LanguageExpertConfig = {
    Specialization: string   // "translation" | "summarization" | "qa"
    Processor: CPUActor
    TernaryModel: CompressedBERT
}

type VisionExpertConfig = {
    Specialization: string   // "detection" | "segmentation" | "ocr"
    Processor: GPUActor
    TernaryModel: CompressedYOLO
}

type ReasoningExpertConfig = {
    Specialization: string   // "math" | "logic" | "planning"
    Processor: HybridActor   // CPU + GPU
    TernaryModel: CompressedCoT
}

type ModelExpert = 
    | LanguageExpert of LanguageExpertConfig
    | VisionExpert of VisionExpertConfig
    | ReasoningExpert of ReasoningExpertConfig

// Coordinator with zero-copy message passing
let createConstellation (config: ConstellationConfig) =
    let coordinator = MailboxProcessor.Start(fun inbox -> async {
        // Pre-allocate shared memory pool
        let memoryPool = BAREWire.createPool {
            Size = 64<GB>
            AccessMode = UnifiedMemory
            Pinned = true
        }
        
        // Initialize expert actors
        let experts = [
            LanguageExpert { 
                Specialization = "qa"
                Processor = CPUActor.spawn 0
                TernaryModel = Models.compressedBERT 
            }
            VisionExpert {
                Specialization = "detection"
                Processor = GPUActor.spawn 0  
                TernaryModel = Models.compressedYOLO
            }
            ReasoningExpert {
                Specialization = "math"
                Processor = HybridActor.spawn (cpu = 1, gpu = 0)
                TernaryModel = Models.compressedCoT
            }
        ]
        
        while true do
            let! msg = inbox.Receive()
            match msg with
            | Query(input, replyChannel) ->
                // Allocate from shared pool - zero copy
                let! sharedBuffer = memoryPool.AllocateAsync(input.Size)
                input.CopyTo(sharedBuffer)
                
                // Route to appropriate expert
                let expert = selectExpert input.Type experts
                let! result = expert.ProcessAsync(sharedBuffer)
                
                replyChannel.Reply(result)
                memoryPool.Release(sharedBuffer)
    })
    
    coordinator
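
The selectExpert function referenced inside the coordinator is left abstract above. A minimal sketch of one possible routing policy follows; the InputType discriminator and the assumption that queries carry a modality tag are illustrative, not part of the framework.

// Hypothetical routing policy: match the query's modality to an expert's hardware affinity.
// A production router would also weigh load, availability, and the no-match case.
type InputType = Text | Image | Numeric

let selectExpert (inputType: InputType) (experts: ModelExpert list) =
    experts
    |> List.find (fun expert ->
        match inputType, expert with
        | Text, LanguageExpert _ -> true      // text goes to the CPU-resident language expert
        | Image, VisionExpert _ -> true       // images go to the GPU-resident vision expert
        | Numeric, ReasoningExpert _ -> true  // math/logic goes to the hybrid reasoning expert
        | _ -> false)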

RDMA and Distributed Scaling

When scaling beyond single nodes, RDMA over Converged Ethernet (RoCE) enables zero-copy operations across the network:

module DistributedConstellation =
    // Setup RDMA for inter-node communication
    let setupRDMA (nodes: NodeEndpoint array) =
        nodes |> Array.map (fun node ->
            // Register memory regions for RDMA
            let memoryRegion = RDMA.registerMemory {
                Buffer = node.ModelMemory
                Size = node.ModelSize
                Access = IBV_ACCESS_REMOTE_READ ||| IBV_ACCESS_LOCAL_WRITE
            }
            
            // Create queue pairs for each connection
            let queuePairs = nodes |> Array.map (fun remote ->
                if remote.Id <> node.Id then
                    Some(RDMA.createQueuePair node remote)
                else None)
            
            { Node = node; MemoryRegion = memoryRegion; Connections = queuePairs })
    
    // Zero-copy read from remote node
    let readRemoteState (source: NodeConnection) (offset: int<bytes>) (size: int<bytes>) =
        // One-sided RDMA read - no CPU involvement on remote side
        let request = {
            Operation = RDMA_READ
            LocalAddress = localBuffer + offset   // localBuffer: this node's registered staging buffer
            RemoteAddress = source.MemoryRegion.Address + offset  
            RemoteKey = source.MemoryRegion.Key
            Length = size
        }
        
        RDMA.postSend source.QueuePair request

Performance Projections

Looking just at the raw numbers, the convergence of these technologies could enable remarkable efficiency gains:

| Metric | Traditional GPU-Only | Heterogeneous Ternary | Improvement |
|---|---|---|---|
| Memory Usage | 10 GB (FP16) | 500 MB (1.58-bit) | 20x reduction |
| Power Consumption | 350 W | 95 W | 3.7x reduction |
| Latency (1st token) | 45 ms | 12 ms | 3.8x faster |
| Throughput | 1000 tok/s | 4000 tok/s | 4x increase |
| Cost per Million Tokens | $0.50 | $0.08 | 6.25x cheaper |

While these optimizations are always a balancing act, the improvements could compound when deployed as independent elements:

  • Parallel Expert Evaluation: Multiple models process simultaneously (see the sketch after this list)
  • Intelligent Routing: Only necessary experts activate
  • Shared Context: Zero-copy context sharing between models
  • Dynamic Scaling: Add/remove experts based on load
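
As a sketch of the first point, parallel expert evaluation can be expressed directly over the constellation above. This assumes the hypothetical ProcessAsync member and SharedBuffer type used in the coordinator example; Async.Parallel fans the same zero-copy buffer out to every candidate expert and gathers the partial results.

// Hypothetical fan-out: every candidate expert reads the same shared buffer concurrently.
let evaluateInParallel (experts: ModelExpert list) (sharedBuffer: SharedBuffer) = async {
    let! results =
        experts
        |> List.map (fun expert -> expert.ProcessAsync(sharedBuffer))
        |> Async.Parallel
    return results |> Array.toList
}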

This not only increases efficiency, but also gives unprecedented visibility into and control over how a “solution stack” operates. The “AI” is no longer a black box, but a transparent set of discrete and manageable operators that can be evaluated, adjusted and tuned to suit a specific business outcome.

Implementation Roadmap

The path to making this vision operational involves several key phases:

Phase 1: Foundation

  • Implement ternary packing/unpacking kernels for AMD hardware
  • Develop BAREWire adapters for Infinity Fabric coherency
  • Create basic actor framework for model coordination

Phase 2: Optimization

  • Optimize SIMD kernels for Zen 4/5 architectures
  • Implement GPU kernels for residual dense operations
  • Develop profiling tools for workload distribution

Phase 3: Scale

  • Add RDMA support for multi-node deployment
  • Implement dynamic expert routing algorithms
  • Create deployment tools and monitoring

While this path is ambitious, the Fidelity framework is uniquely prepared to bring these elements together into a cohesive solution that provides next-generation efficiency and reliability for intelligent systems.

A New Paradigm Requires Fresh Thinking

The combination of ternary quantization, AMD’s unified memory architecture, and actor-based orchestration represents more than incremental improvement; it’s emblematic of the innovation required to reimagine how AI models operate. By embracing the natural sparsity of ternary operations and the flexibility of heterogeneous computing, we can build systems that are not just faster and more efficient, but fundamentally more capable and more manageable.

AMD’s hardware roadmap signals this potential, particularly the unified memory architecture of the MI300 series and the coherent interconnects of Infinity Fabric. Combined with the Fidelity framework’s type-safe approach and BAREWire’s zero-copy operations, these provide the uniquely powerful components needed to build the next generation of AI inference systems today while laying the foundation for tomorrow’s hardware breakthroughs.

The future of AI isn’t about ever-larger oceans of matrix multiplication running on ever-more-power-hungry GPUs. It’s about intelligent orchestration of specialized models, each optimized for its task and hardware, working together as a unified system within a business’ security boundary. With ternary quantization breaking the tyranny of matrix multiplication and companies like AMD enabling true heterogeneous computing, that future is brighter, safer and more efficient than ever.

This exploration builds on SpeakEZ’s ongoing work with the Fidelity framework and BAREWire technology, pushing the boundaries of what’s possible when we rethink fundamental assumptions about AI computation.

Author: Houston Haynes
Date: June 19, 2025
Category: Architecture
