Llama.cpp Integration Guide - Claude Code
This guide explores how to implement a custom API provider for Claude Code using llama.cpp's llama-server. This setup is ideal for local-first development or when using high-end hardware like AMD Strix Halo or Apple Silicon M2 Max.
1. Architecture Overview
llama-server provides a REST API that can be configured to mimic the OpenAI or Anthropic message formats. To integrate it into Claude Code, you will need to modify the client initialization.
Provider Hook Location
The primary location for adding new providers is `services/api/client.ts`.
- Add Provider Type: Update `APIProvider` in `utils/model/providers.ts` to include `'llama-cpp'`.
- Environment Variable: Use a toggle like `CLAUDE_CODE_USE_LLAMA_CPP=true`.
- Client Configuration:

```ts
if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
  return new Anthropic({
    apiKey: 'local-key', // llama-server often ignores this
    baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
    ...ARGS,
  })
}
```
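With that hook in place, switching Claude Code over to the local server is just a matter of environment variables. A minimal sketch, assuming the variable names introduced above and that the CLI is launched as `claude`:

```bash
# Route Claude Code to a local llama-server instance
export CLAUDE_CODE_USE_LLAMA_CPP=true
export LLAMA_CPP_BASE_URL="http://localhost:8080/v1"
claude
```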
Remote / Proxy Authentication
If you are proxying llama-server through a Bedrock-compatible gateway (e.g., LiteLLM), you can authenticate with the `AWS_BEARER_TOKEN_BEDROCK` environment variable.
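For illustration, a proxied setup might look like the following. The gateway URL and token are placeholders, and the base-URL variable name follows the convention introduced above:

```bash
# Point the custom provider at the gateway instead of a local port
export CLAUDE_CODE_USE_LLAMA_CPP=true
export LLAMA_CPP_BASE_URL="https://litellm.example.internal/v1"

# Bearer token used by Claude Code to authenticate against the gateway
export AWS_BEARER_TOKEN_BEDROCK="example-gateway-token"
```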
2. Hardware Optimization
To achieve smooth inference on high-end consumer hardware, utilize the following specialized backends.
Apple Silicon (M2 Max)
llama.cpp has first-class Metal support.
- Flags: Set `-ngl` (number of GPU layers) high enough to offload the entire model to the GPU (e.g., `-ngl 99`).
- Threads: Match `-t` to the number of performance cores (e.g., `-t 8`).
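Combining the flags above, a typical invocation on an M2 Max might look like this sketch (the model path and context size are illustrative):

```bash
# -ngl 99: offload every layer to the Metal GPU
# -t 8: match the eight performance cores of an M2 Max
./llama-server -m models/qwen3-72b-q4_k_m.gguf -ngl 99 -t 8 -c 32768 --port 8080
```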
AMD Strix Halo
Strix Halo features a massive iGPU and a powerful NPU.
- Vulkan Backend: Use the Vulkan backend for the iGPU (`LLAMA_VULKAN=1`).
- ROCm Backend: For Linux users, ROCm provides near-native performance on AMD hardware.
- NPU Integration: If you are running Windows or Linux with experimental NPU drivers, ensure `llama-server` is compiled with the relevant plugin (e.g., OpenVINO).
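As a rough sketch, a Vulkan-enabled build on Linux looks roughly like this (CMake option names have changed between llama.cpp releases, so check the build docs for your version; the ROCm build follows a similar pattern with its HIP option):

```bash
# Configure and build with the Vulkan backend (requires the Vulkan SDK)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```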
3. Overcoming "Slow PP" (Prompt Processing)
Prompt processing (PP), i.e. ingesting the accumulated context before any new tokens are generated, is often the bottleneck in agentic workflows where the context grows rapidly.
Persistent KV Caching (Slots)
llama-server supports slots, which allow multiple sessions to share or persist their KV cache.
- Persistent Slot: Use `--slot-save-path /path/to/cache` to save the context state between CLI restarts.
- Continuous Batching: Use `--cont-batching` to allow the server to process new prompts while tokens are still being generated for other requests.
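These slot operations are driven through llama-server's HTTP API. A sketch of saving and later restoring slot 0 (the filename is illustrative, and `--slot-save-path` must be set for the endpoints to be available):

```bash
# Save the KV cache of slot 0 under the configured --slot-save-path
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "claude-session.bin"}'

# Restore it after a restart to skip re-processing the prompt
curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "claude-session.bin"}'
```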
Configuration Tips
- Large Context: Set a generous context size with `-c 32768` (or higher) to avoid frequent context shifting.
- Flash Attention: Always enable Flash Attention (`--flash-attn`) to reduce memory bandwidth requirements during PP.
4. Supporting OSS Models
Claude Code is tuned for Sonnet/Opus, but can be adapted for state-of-the-art open-source models:
| Model | Mapping Suggestion | Strength |
|---|---|---|
| Qwen3-72B-Instruct | Map to `claude-3-opus-latest` | Excellent reasoning and tool use. |
| GPT-20-OSS | Map to `claude-3-5-sonnet-latest` | High-speed, high-intelligence balance. |
| GPT-120-OSS | Map to `claude-3-opus-latest` | Deep complex problem solving. |
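One lightweight way to apply such a mapping without patching the client is to alias the served model to the Anthropic name and point Claude Code's model override at it. A sketch, assuming `--alias` (the name llama-server reports over the API) and the `ANTHROPIC_MODEL` override:

```bash
# Serve the local model under an Anthropic-style alias
./llama-server -m models/qwen3-72b-q4_k_m.gguf --alias claude-3-opus-latest --port 8080

# Ask Claude Code to request that model name
export ANTHROPIC_MODEL="claude-3-opus-latest"
```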
5. Recommended llama-server Command
For a dedicated local Claude Code backend:
```bash
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots
```
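Before wiring Claude Code to it, the server can be smoke-tested with a plain OpenAI-style request (the bearer token must match `--api-key` above; the model field value is illustrative):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer local-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-72b", "messages": [{"role": "user", "content": "Say hello in one word."}]}'
```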
Caution
Using local models requires significant VRAM. A 70B model in 4-bit quantization needs ~40 GB of VRAM (roughly 70B parameters × 0.5 bytes per weight ≈ 35 GB, plus quantization overhead). Ensure your hardware (such as Strix Halo with 64 GB+ of shared RAM) can accommodate both the model and the KV cache.
See Also
- Authentication Guide: Details on general environment variables and credential management.