
Llama.cpp Integration Guide - Claude Code

This guide explores how to implement a custom API provider for Claude Code using llama.cpp's llama-server. This setup is ideal for local-first development or when using high-end hardware like AMD Strix Halo or Apple Silicon M2 Max.


1. Architecture Overview

llama-server natively exposes an OpenAI-compatible REST API under /v1, and with a thin translation layer it can also serve Anthropic-style message requests. To integrate it into Claude Code, you will need to modify the client initialization.

Provider Hook Location

The primary location for adding new providers is services/api/client.ts.

  1. Add Provider Type: Update APIProvider in utils/model/providers.ts to include 'llama-cpp'.
  2. Environment Variable: Use a toggle like CLAUDE_CODE_USE_LLAMA_CPP=true.
  3. Client Configuration:
    if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
      return new Anthropic({
        // Must match the --api-key passed to llama-server (see the launch
        // example below); llama-server ignores it if no --api-key was set.
        apiKey: 'local-secret-token',
        baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
        ...ARGS,
      })
    }
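The `isEnvTruthy` helper referenced above is not shown in this guide; a minimal sketch (the exact set of accepted values is an assumption, so adjust it to match the real helper) could look like:

```typescript
// Sketch of an env-flag parser: treats common "on" spellings as true
// and everything else (including undefined) as false.
function isEnvTruthy(value: string | undefined): boolean {
  if (!value) return false
  return ['1', 'true', 'yes', 'on'].includes(value.trim().toLowerCase())
}
```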
    

Remote / Proxy Authentication

If you are proxying llama-server through an authenticating gateway (e.g., LiteLLM), you can reuse the AWS_BEARER_TOKEN_BEDROCK environment variable to carry the bearer token.
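When the gateway expects a standard bearer token, one way to wire this up (a sketch; the helper name and the env-variable plumbing are assumptions, not Claude Code APIs) is to derive a headers object from the environment:

```typescript
// Sketch: build auth headers for a proxied llama-server endpoint.
// Reusing AWS_BEARER_TOKEN_BEDROCK mirrors the tip above; the header
// name your gateway expects may differ.
function proxyAuthHeaders(env: Record<string, string | undefined>): Record<string, string> {
  const token = env.AWS_BEARER_TOKEN_BEDROCK
  return token ? { Authorization: `Bearer ${token}` } : {}
}
```

The resulting object can then be passed through the Anthropic client's defaultHeaders option when constructing the client.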



2. Hardware Optimization

To achieve smooth inference on high-end consumer hardware, use the following backend-specific settings.

Apple Silicon (M2 Max)

llama.cpp has first-class Metal support.

  • Flags: Ensure -ngl (number of GPU layers) is set to the maximum (e.g., -ngl 99) to offload the entire model to the GPU.
  • Threads: Match the number of performance cores (e.g., -t 8).
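Putting those flags together, a launch command for an M2 Max might look like the following (model path reused from the example later in this guide; treat it as illustrative):

```shell
# Offload all layers to the Metal backend and pin threads
# to the performance cores.
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -ngl 99 \
  -t 8 \
  -c 32768
```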

AMD Strix Halo

Strix Halo pairs a large RDNA 3.5 iGPU with an XDNA 2 NPU and a unified pool of shared LPDDR5X memory.

  • Vulkan Backend: Use the Vulkan backend for the iGPU (build llama.cpp with the Vulkan CMake option enabled).
  • ROCm Backend: For Linux users, ROCm provides near-native performance for AMD hardware.
  • NPU Integration: If using Windows/Linux with experimental NPU drivers, ensure llama-server is compiled with the relevant plugin (e.g., OpenVINO).
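The backend is selected when llama.cpp is compiled. As a sketch (CMake option names have changed across llama.cpp releases, so verify them against your checkout):

```shell
# Vulkan build (iGPU)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# ROCm build (Linux; older trees use -DLLAMA_HIPBLAS=ON instead)
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
```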

3. Overcoming "Slow PP" (Prompt Processing)

Prompt Processing (PP) is often the bottleneck in agentic workflows where the context grows rapidly.

Persistent KV Caching (Slots)

llama-server supports slots, which allow multiple sessions to share or persist their KV cache.

  • Persistent Slot: Use --slot-save-path /path/to/cache to save the context state between CLI restarts.
  • Continuous Batching: Use --cont-batching to allow the server to process new prompts while tokens are still being generated for other requests.
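With the server running, a slot's KV cache can also be saved and restored explicitly over the HTTP API. The endpoint shape below follows recent llama-server builds; verify it against your version's server documentation:

```shell
# Save slot 0's context to <slot-save-path>/session1.bin
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session1.bin"}'

# Restore it later, e.g. after a CLI restart
curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session1.bin"}'
```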

Configuration Tips

  • Large Context: Set a generous context size with -c 32768 (or higher) to avoid frequent context shifting.
  • Flash Attention: Always enable Flash Attention (--flash-attn) to reduce memory bandwidth requirements during PP.

4. Supporting OSS Models

Claude Code is tuned for Sonnet/Opus, but can be adapted for state-of-the-art open-source models:

| Model | Mapping Suggestion | Strength |
| --- | --- | --- |
| Qwen3-72B-Instruct | Map to claude-3-opus-latest | Excellent reasoning and tool use. |
| GPT-20-OSS | Map to claude-3-5-sonnet-latest | High-speed, high-intelligence balance. |
| GPT-120-OSS | Map to claude-3-opus-latest | Deep complex problem solving. |
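One lightweight way to apply such a mapping is to rewrite the model name before the request leaves the client. A sketch (the helper and the local model names are illustrative assumptions, not part of Claude Code; it also maps each Claude id to a single local model, so pick one target per id):

```typescript
// Sketch: translate Claude model ids to whatever model llama-server
// has loaded, falling back to the requested id for unknown names.
const MODEL_MAP: Record<string, string> = {
  'claude-3-opus-latest': 'qwen3-72b-instruct',
  'claude-3-5-sonnet-latest': 'gpt-20-oss',
}

function mapModel(requested: string): string {
  return MODEL_MAP[requested] ?? requested
}
```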

For a dedicated local Claude Code backend:

./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots

Caution

Using local models requires significant VRAM. A 70B model in 4-bit quantization requires ~40GB of VRAM. Ensure your hardware (like Strix Halo with 64GB+ shared RAM) can accommodate the model and KV cache.
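The ~40GB figure can be sanity-checked with a rough rule of thumb: at 4-bit quantization each weight costs about half a byte, plus per-format metadata. A sketch (the effective bits-per-weight value is an assumption, and the KV cache is excluded):

```typescript
// Rough memory estimate for a quantized model's weights, in GiB.
// bitsPerWeight ~4.5 approximates Q4_K_M including quantization
// metadata; KV cache and runtime overhead come on top of this.
function estimateModelGiB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8)
  return bytes / 1024 ** 3
}
```

For a 72B model at ~4.5 bits per weight this gives roughly 38 GiB, consistent with the caution above.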


See Also