Building an Inference Engine for On-Device AI Coding

How to get a coding assistant that runs entirely on your machine fast enough that you forget it's running locally.

In Part 1 we generated the training data. In Part 2 we trained and compressed the model. Now we need to actually run it — on a laptop, with no cloud, fast enough that a developer doesn't cmd-tab away while waiting.

This is arguably the hardest part. A compressed model that "fits in memory" and a model that "runs well on consumer hardware" are very different things. The gap between them is filled with custom Metal kernels, cache management strategies, memory layout optimization, and a healthy amount of Rust.

Why We Built Our Own Inference Engine

We evaluated existing options. llama.cpp is excellent for standard transformer architectures but doesn't support our model's hybrid attention design. Other frameworks (vLLM, TensorRT-LLM) are designed for server deployment with assumptions about hardware that don't hold on a MacBook. We needed something purpose-built.

Rig's inference engine is written in Rust with platform-specific backends:

  • macOS/Apple Silicon: MLX (Apple's machine learning framework) with custom Metal compute kernels
  • Other platforms: Candle (Hugging Face's Rust-native inference framework) with Metal or CPU fallback

The Rust layer handles session management, caching, sampling, and orchestration. The compute backends handle the actual matrix math. This separation means we can optimize for each platform without duplicating the higher-level logic.
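
To make that seam concrete, here is a minimal sketch of what the backend abstraction might look like. The trait and type names are illustrative, not Rig's actual API:

```rust
/// Minimal sketch of the seam between the shared Rust orchestration layer
/// and a compute backend. Names are illustrative, not Rig's actual API.
#[derive(Debug)]
pub struct BackendError(pub String);

pub trait Backend {
    /// Opaque device-resident tensor (an MLX array, a Candle tensor, ...).
    type Tensor;
    /// Backend-specific cache state (KV caches, recurrent states).
    type Cache;

    /// One forward pass: consume `input_ids`, update `cache`, return logits.
    fn forward(
        &mut self,
        input_ids: &[u32],
        cache: &mut Self::Cache,
    ) -> Result<Self::Tensor, BackendError>;

    /// Copy last-position logits back to host memory so sampling, session
    /// management, and orchestration run unchanged on every platform.
    fn read_logits(&self, logits: &Self::Tensor) -> Result<Vec<f32>, BackendError>;
}
```

Everything above this trait is shared; everything below it is free to use platform-specific tricks.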

The Model Architecture Challenge

Our model isn't a standard transformer. It uses a hybrid attention architecture with two types of layers:

  • Full attention layers (a minority): Standard grouped-query attention with rotary position embeddings and KV caching. These handle the "global" reasoning — attending over the full context window.
  • Linear attention layers (the majority): A recurrent architecture that processes tokens with a fixed-size state, similar in spirit to state-space models. These handle local pattern matching and sequential reasoning with constant memory per token regardless of context length.

On top of this, select layers use Mixture-of-Experts routing, where each token activates only a subset of expert networks.

This hybrid design is why the model can handle long contexts efficiently — the linear attention layers don't need KV caches that grow with context length. But it also means off-the-shelf inference engines can't run it. Every operation in the forward pass needed custom implementation.
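
As a rough mental model, the layer stack and its cache requirements might be represented like this. This is a hedged sketch; the names and fields are hypothetical, not the model's real configuration:

```rust
/// Illustrative layer taxonomy; variant and field names are hypothetical.
enum LayerKind {
    /// Grouped-query attention with rotary embeddings: needs a KV cache
    /// that grows linearly with context length.
    FullAttention { n_kv_heads: usize },
    /// Recurrent linear attention: a fixed-size state per layer,
    /// regardless of how many tokens have been processed.
    LinearAttention { state_dim: usize },
}

/// The cache for each layer mirrors its kind.
enum LayerCache {
    Kv { keys: Vec<f32>, values: Vec<f32> }, // grows with context: O(n)
    Recurrent { state: Box<[f32]> },         // constant size: O(1)
}
```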

Memory: The Real Bottleneck

When people think about running large models locally, they think about model size — "will the weights fit in RAM?" But weights are often not the bottleneck. For our model, the weights are about 12GB after quantization. That fits comfortably.

The real problem is intermediate buffers. During inference, the model needs temporary storage for attention scores, expert routing logits, intermediate activations, and output logits. A naive implementation allocates these buffers for the full context window — meaning a 32K context model allocates buffers as if every forward pass processes 32,000 tokens, even during token-by-token generation where you're processing exactly one token at a time.

We use 32K here purely as an illustrative example, since expectations for local models tend toward smaller contexts. Rig actually supports context windows up to 256K, which makes the buffer problem even more acute.

We decoupled buffer allocation from context length. During generation, forward pass buffers are sized for the actual batch being processed (typically a small number of tokens), not the maximum context window. This single change reduces intermediate memory from tens of gigabytes to hundreds of megabytes — the difference between "needs a workstation" and "runs on a laptop."
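
The mechanics are simple once stated. Here is a minimal sketch with hypothetical names, where buffer sizes follow the step's live token count rather than the window:

```rust
/// Hypothetical scratch-buffer pool sized for the tokens processed in
/// THIS step, not for the maximum context window.
struct ScratchBuffers {
    hidden: Vec<f32>,        // [n_tokens, d_model]
    router_logits: Vec<f32>, // [n_tokens, n_experts]
    logits: Vec<f32>,        // [n_tokens, vocab]
}

impl ScratchBuffers {
    /// During prefill `n_tokens` may be thousands; during decode it is
    /// one token (plus a handful of draft tokens). A naive engine would
    /// pass the full context limit here on every single step.
    fn for_step(n_tokens: usize, d_model: usize, n_experts: usize, vocab: usize) -> Self {
        Self {
            hidden: vec![0.0f32; n_tokens * d_model],
            router_logits: vec![0.0f32; n_tokens * n_experts],
            logits: vec![0.0f32; n_tokens * vocab],
        }
    }
}
```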

KV cache management is the other memory lever. The full attention layers need to cache key-value pairs for previously processed tokens. At 32K context, this can consume several gigabytes. We offer optional per-row INT8 quantization of the KV cache, reducing its footprint by roughly 75% with negligible quality impact. The linear attention layers, by design, use fixed-size recurrent states regardless of context length — they're essentially free.
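
Per-row INT8 quantization is a standard technique: store one scale per cache row plus the quantized values. A minimal sketch of the general approach (symmetric quantization; Rig's exact layout may differ):

```rust
/// Per-row symmetric INT8 quantization: one f32 scale per cache row plus
/// i8 values. A sketch of the general technique, not Rig's exact layout.
fn quantize_row(row: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = row.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    // `as` saturates on overflow, so values clamp safely into i8 range.
    let q = row.iter().map(|&x| (x / scale).round() as i8).collect();
    (scale, q)
}

fn dequantize_row(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Scaling per row rather than per tensor keeps outlier values in one attention head from destroying precision everywhere else.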

Custom Metal Kernels

Apple Silicon's unified memory architecture is ideal for ML inference — the GPU can directly access the same memory as the CPU without expensive copies. But getting peak performance requires going beyond framework-level APIs and writing custom Metal compute kernels.

We wrote specialized kernels for every core operation:

  • Attention: Flash-attention-inspired implementations that tile the computation to fit in the GPU's threadgroup memory, avoiding materializing the full attention matrix
  • MoE gathering and routing: Efficient top-K expert selection and sparse matrix operations that avoid computing through inactive experts
  • Fused activations: SwiGLU and gated operations fused into single kernel launches, eliminating intermediate buffer allocation and memory round-trips
  • Linear attention steps: The entire recurrent update (sigmoid gating, delta computation, state update) fused into a single kernel call, reducing Metal command buffer overhead
  • RMS normalization, rotary embeddings, quantized matmul: Each with format-specific fast paths

The fused kernels are particularly important. Our linear attention layers execute 36 times per token. Reducing each invocation from 5+ kernel launches to 2 has a measurable impact on tokens-per-second. At this level, the bottleneck shifts from compute to kernel dispatch overhead — Metal command buffer setup, argument encoding, and synchronization. Fewer, larger kernels win.
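
The kernels themselves are Metal Shading Language and beyond the scope of this post, but the work a single fused linear-attention launch performs looks roughly like the following scalar Rust. The gating and update formulation here is simplified and illustrative, not the model's exact recurrence:

```rust
/// Scalar sketch of what one fused linear-attention kernel does per token
/// per head. The real kernel runs in Metal, vectorized across heads, in a
/// single launch instead of five-plus.
fn linear_attention_step(
    state: &mut [f32], // [d_k * d_v] recurrent state, row-major
    q: &[f32],         // [d_k] query
    k: &[f32],         // [d_k] key
    v: &[f32],         // [d_v] value
    gate: f32,         // raw gating logit for this token
    out: &mut [f32],   // [d_v] output
) {
    let (d_k, d_v) = (k.len(), v.len());
    let g = 1.0 / (1.0 + (-gate).exp()); // sigmoid gating

    // Decay the state and add the new key-value outer product.
    for i in 0..d_k {
        for j in 0..d_v {
            state[i * d_v + j] = g * state[i * d_v + j] + k[i] * v[j];
        }
    }
    // Read the updated state through the query.
    for j in 0..d_v {
        out[j] = (0..d_k).map(|i| q[i] * state[i * d_v + j]).sum();
    }
}
```

Unfused, the gate, decay, update, and readout would each be a separate launch with intermediate buffers in between; fused, it is one launch and zero round-trips.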

Speculative Decoding in Practice

In Part 2, we described training a recurrent drafter for speculative decoding. Here's how it works at inference time:

  1. The drafter (embedded in the model, sharing its representations) predicts the next several tokens based on the current state
  2. All draft tokens are fed through the main model in a single forward pass — this is where the parallelism comes from
  3. The main model's output probabilities are compared against the draft tokens. Tokens are accepted greedily from left to right until a rejection
  4. On rejection, the main model's distribution at the rejection point gives us the correct next token for free
  5. The KV cache is "rewound" to the last accepted position, and generation continues

The acceptance rate varies by content type. Boilerplate code, common patterns, and predictable syntax see high acceptance rates — multiple tokens per main-model evaluation. Novel logic and complex reasoning see lower rates, gracefully falling back to standard autoregressive generation.

The key insight is that the drafter is essentially free in terms of latency: it runs as part of the embedding computation and adds negligible overhead. The only cost is the occasional wasted verification when drafts are rejected. In practice, speculative decoding gives us a significant speedup on coding tasks, where repetitive structure is common.
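
For the greedy-sampling case, the verification in step 3 reduces to a left-to-right comparison plus a rewind index. A sketch, with Rig-specific details omitted:

```rust
/// Greedy left-to-right verification (the probabilistic rejection-sampling
/// variant is omitted). `draft` holds the drafter's proposals;
/// `main_argmax[i]` is the main model's top prediction at draft position i,
/// all produced by one batched forward pass. Returns the tokens to emit
/// plus the number of accepted draft positions, which tells the engine
/// where to rewind the KV cache.
fn verify_drafts(draft: &[u32], main_argmax: &[u32]) -> (Vec<u32>, usize) {
    let mut emitted = Vec::with_capacity(draft.len());
    for (i, &d) in draft.iter().enumerate() {
        if main_argmax[i] == d {
            emitted.push(d); // draft accepted, keep going
        } else {
            // First rejection: the main model already computed the correct
            // token at this position, so we emit it for free and stop.
            emitted.push(main_argmax[i]);
            return (emitted, i);
        }
    }
    // Every draft accepted; no rewind needed.
    let n = draft.len();
    (emitted, n)
}
```

Either way the batched verification pass pays for itself: it always emits at least one token, and on a full acceptance it emits one token per draft.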

Session Management and Prefix Caching

A coding assistant isn't a one-shot interaction. Developers have ongoing sessions — they ask a question, get an answer, make some changes, ask another question. Each interaction shares context: the system prompt, the codebase summary, the conversation history.

We use a radix tree-based prefix cache that shares KV cache state across sessions with common prefixes. If two conversations start with the same system prompt and codebase context (which they almost always do), the KV cache for that shared prefix is computed once and reused. This means the second question in a session starts generating almost immediately, without re-processing thousands of tokens of context.
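
Here is a toy version of the lookup side, simplified to one token per edge. A real radix tree operates on token spans and handles chunking, eviction, and reference counting, all omitted here:

```rust
use std::collections::HashMap;

/// Placeholder for the cached keys/values covering one token.
struct KvSegment;

/// Toy prefix tree: one token per edge, each node owning its KV segment.
struct PrefixNode {
    children: HashMap<u32, PrefixNode>,
    kv: Option<KvSegment>,
}

impl PrefixNode {
    /// Walk the tree along `tokens`, returning how many leading tokens
    /// already have cached KV state. Prefill then starts at that offset
    /// instead of re-processing the shared system prompt and context.
    fn longest_cached_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = self;
        let mut matched = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(child) if child.kv.is_some() => {
                    node = child;
                    matched += 1;
                }
                _ => break,
            }
        }
        matched
    }
}
```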

The inference engine runs as a persistent daemon process communicating over JSON-RPC. This means:

  • The model stays loaded in memory between interactions — no cold start penalty
  • Multiple sessions (editor tabs, terminal windows) share a single model instance
  • Each session maintains isolated cache state while sharing the common prefix
  • The daemon manages memory pressure, evicting old sessions when needed
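
For a flavor of the wire format, here is a hypothetical request shape, assuming the serde crate. The method name and parameters are illustrative, not Rig's actual RPC surface:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical daemon request. JSON-RPC 2.0 framing lets any client
/// (editor plugin, terminal UI) talk to the same long-lived process.
#[derive(Serialize, Deserialize)]
struct RpcRequest {
    jsonrpc: String, // always "2.0"
    id: u64,
    method: String, // e.g. a hypothetical "session/generate"
    params: GenerateParams,
}

#[derive(Serialize, Deserialize)]
struct GenerateParams {
    /// Routes the request to an isolated session: its own sampling state
    /// and cache tail, sharing the common prefix with other sessions.
    session_id: String,
    prompt: String,
    max_tokens: u32,
}
```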

Putting It All Together

The full inference stack, from user input to generated token:

[Figure: Rig inference flow]

Each token generation involves: one drafter pass, one main model verification pass (processing N draft tokens in parallel), cache management, and streaming. The whole thing runs on the GPU integrated into the developer's laptop.

Performance Characteristics

The numbers that matter for a local coding assistant:

  • Memory footprint: Model weights (~12GB) + KV cache (~400MB at 8K context with INT8) + forward buffers (~300MB) = fits in 16GB unified memory with room for the OS and IDE
  • Time to first token: Near-instant for follow-up messages in the same session (prefix cache hit). A few seconds for new sessions with long context (prefill)
  • Generation speed: Competitive with cloud-hosted models for typical coding interactions, thanks to speculative decoding and the MoE architecture only activating a fraction of parameters per token
  • Concurrent sessions: Multiple editor windows / terminal sessions share one model instance with isolated state

The Local Advantage

Running locally isn't just about privacy (though that matters). It's about latency and availability:

  • No network round-trip: Every token is generated on-device. No waiting for API responses, no dropped connections, no rate limiting
  • No usage limits or outages: The model is always available, no matter the time of day or how heavy the task
  • No cold starts: The model is always loaded, always ready. The daemon process persists across interactions
  • Context stays local: Your codebase, your conversation history, your retrieval index — all local. No upload, no data retention policies, no wondering what happens to your code
  • Works offline: On a plane, in a cafe with bad WiFi, in a secure environment with no external network access — it just works

What Made This Possible

Three things converged to make on-device AI coding assistants viable:

  1. Hybrid attention architectures that reduce memory requirements from O(n) to O(1) for most layers, enabling long-context inference without proportionally scaling memory
  2. Apple Silicon's unified memory that eliminates the CPU-GPU transfer bottleneck, letting models access all available system memory at GPU speeds
  3. Mixture-of-Experts that decouples model capacity from per-token compute cost, giving you the knowledge of a large model at the inference cost of a small one

We built an inference engine optimized for this specific intersection. It's not a general-purpose ML framework — it's a purpose-built system for running a specific class of models on consumer hardware, as fast as possible.

The result is a coding assistant that runs entirely on your machine, handles long-context problems across hours-long sessions, and responds fast enough that you forget it's running locally.


This is Part 3 of a 3-part series on building Rig, the world's first local AI coding assistant. Part 1: Teaching a Model to Code covers data generation. Part 2: How We Compressed a Frontier Model to Run Locally covers training and compression.