# Llama.cpp Integration Guide - Claude Code

This guide explores how to implement a custom API provider for Claude Code using `llama.cpp`'s `llama-server`. This setup is ideal for local-first development or when using high-end hardware like **AMD Strix Halo** or **Apple Silicon M2 Max**.

---

## 1. Architecture Overview

`llama-server` provides a REST API that can be configured to mimic the OpenAI or Anthropic message formats. To integrate it into Claude Code, you will need to modify the client initialization.

### Provider Hook Location

The primary location for adding new providers is [`services/api/client.ts`](file:///Users/vlad/Developer/vlad/claude-code/services/api/client.ts).

1. **Add Provider Type**: Update `APIProvider` in `utils/model/providers.ts` to include `'llama-cpp'`.
2. **Environment Variable**: Use a toggle like `CLAUDE_CODE_USE_LLAMA_CPP=true`.
3. **Client Configuration**:

   ```typescript
   if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
     return new Anthropic({
       apiKey: 'local-key', // llama-server often ignores this
       baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
       ...ARGS,
     })
   }
   ```

### Remote / Proxy Authentication

If you are proxying `llama-server` through an AWS-compatible gateway (e.g., LiteLLM), you can use the `AWS_BEARER_TOKEN_BEDROCK` environment variable to authenticate.

---

## 2. Hardware Optimization

To achieve smooth inference on high-end consumer hardware, utilize the following specialized backends.

### Apple Silicon (M2 Max)

`llama.cpp` has first-class **Metal** support.

- **Flags**: Ensure `-ngl` (number of GPU layers) is set to the maximum (e.g., `-ngl 99`) to offload the entire model to the GPU.
- **Threads**: Match the number of performance cores (e.g., `-t 8`).

### AMD Strix Halo

Strix Halo features a massive iGPU and a powerful NPU.

- **Vulkan Backend**: Use the Vulkan backend for the iGPU (`LLAMA_VULKAN=1`).
- **ROCm Backend**: For Linux users, ROCm provides near-native performance for AMD hardware.
- **NPU Integration**: If using Windows/Linux with experimental NPU drivers, ensure `llama-server` is compiled with the relevant plugin (e.g., OpenVINO).

---

## 3. Overcoming "Slow PP" (Prompt Processing)

Prompt Processing (PP) is often the bottleneck in agentic workflows where the context grows rapidly.

### Persistent KV Caching (Slots)

`llama-server` supports **slots**, which allow multiple sessions to share or persist their KV cache.

- **Persistent Slot**: Use `--slot-save-path /path/to/cache` to save the context state between CLI restarts.
- **Continuous Batching**: Use `--cont-batching` to allow the server to process new prompts while tokens are still being generated for other requests.

### Configuration Tips

- **Large Context**: Set a generous context size with `-c 32768` (or higher) to avoid frequent context shifting.
- **Flash Attention**: Always enable Flash Attention (`--flash-attn`) to reduce memory bandwidth requirements during PP.

---

## 4. Supporting OSS Models

Claude Code is tuned for Sonnet/Opus, but can be adapted for state-of-the-art open-source models:

| Model | Mapping Suggestion | Strength |
| :--- | :--- | :--- |
| **Qwen3-72B-Instruct** | Map to `claude-3-opus-latest` | Excellent reasoning and tool use. |
| **GPT-20-OSS** | Map to `claude-3-5-sonnet-latest` | High-speed, high-intelligence balance. |
| **GPT-120-OSS** | Map to `claude-3-opus-latest` | Deep complex problem solving. |
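The mapping in this table has to be applied somewhere in the request path, for example by rewriting the `model` field before a request leaves the client described in section 1. The following is a minimal sketch under assumptions: the `mapClaudeModelToLocal` helper and the local model names are illustrative and do not exist in Claude Code; the values must match whatever your `llama-server` instance actually reports via `GET /v1/models`.

```typescript
// Hypothetical mapping from Claude model IDs requested by Claude Code to the
// models actually served by llama-server. Keys mirror the table above; the
// local names are placeholders for your own GGUF model identifiers.
const LOCAL_MODEL_MAP: Record<string, string> = {
  'claude-3-opus-latest': 'qwen3-72b-instruct',
  'claude-3-5-sonnet-latest': 'gpt-20-oss',
}

// Resolve a requested Claude model ID to its local equivalent, falling back
// to the original ID when no mapping is defined.
export function mapClaudeModelToLocal(model: string): string {
  return LOCAL_MODEL_MAP[model] ?? model
}
```

Where exactly this hook is called depends on how the client from section 1 is wired; the fallback keeps unmapped model IDs untouched so Anthropic-hosted models continue to work when the toggle is off.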
---

## 5. Recommended `llama-server` Command

For a dedicated local Claude Code backend:

```bash
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots
```

---

> [!CAUTION]
> Using local models requires significant VRAM. A 70B model in 4-bit quantization requires ~40GB of VRAM. Ensure your hardware (like Strix Halo with 64GB+ shared RAM) can accommodate the model and KV cache.

---

## See Also

- **[Authentication Guide](file:///Users/vlad/Developer/vlad/claude-code/docs/AUTH_GUIDE.md)**: Details on general environment variables and credential management.
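Before pointing Claude Code at the server, it can help to confirm that `llama-server` is reachable. Below is a minimal sketch, assuming the host, port, and `--api-key` value from the recommended command in section 5 and a Node 18+ runtime with global `fetch`; it hits the server's `/health` endpoint, which in current `llama-server` builds does not require the API key, so the header is optional.

```typescript
// Minimal connectivity check against a local llama-server instance.
// Assumes the defaults from the command above: port 8080, key "local-secret-token".
async function checkLlamaServer(baseURL = 'http://localhost:8080'): Promise<void> {
  const res = await fetch(`${baseURL}/health`, {
    headers: { Authorization: 'Bearer local-secret-token' },
  })
  if (!res.ok) {
    throw new Error(`llama-server not ready: HTTP ${res.status}`)
  }
  console.log('llama-server is up:', await res.text())
}

checkLlamaServer().catch((err) => {
  console.error(err)
  process.exit(1)
})
```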