# Llama.cpp Integration Guide - Claude Code
This guide explores how to implement a custom API provider for Claude Code using `llama.cpp`'s `llama-server`. This setup is ideal for local-first development or when using high-end hardware like **AMD Strix Halo** or **Apple Silicon M2 Max**.

---
## 1. Architecture Overview
`llama-server` exposes an OpenAI-compatible REST API, and can also sit behind a proxy that translates the Anthropic messages format. To integrate it into Claude Code, you will need to modify the client initialization.
### Provider Hook Location
The primary location for adding new providers is [`services/api/client.ts`](file:///Users/vlad/Developer/vlad/claude-code/services/api/client.ts).
1. **Add Provider Type**: Update `APIProvider` in `utils/model/providers.ts` to include `'llama-cpp'`.
2. **Environment Variable**: Use a toggle like `CLAUDE_CODE_USE_LLAMA_CPP=true`.
3. **Client Configuration**:
```typescript
if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
  return new Anthropic({
    apiKey: 'local-key', // llama-server ignores this unless started with --api-key
    baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
    ...ARGS,
  })
}
```
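For reference, the pieces the snippet above relies on might look like the sketch below. The union members and the `isEnvTruthy` implementation are assumptions based on this guide, not the actual Claude Code source:

```typescript
// Step 1: extend the provider union (existing members are illustrative).
type APIProvider = 'anthropic' | 'bedrock' | 'vertex' | 'llama-cpp'
const provider: APIProvider = 'llama-cpp'

// A plausible env-toggle check: treat common "truthy" spellings as true.
function isEnvTruthy(value: string | undefined): boolean {
  return ['1', 'true', 'yes', 'on'].includes((value ?? '').trim().toLowerCase())
}

console.log(provider)             // llama-cpp
console.log(isEnvTruthy('true'))  // true
console.log(isEnvTruthy(undefined)) // false
```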
### Remote / Proxy Authentication
If you are proxying `llama-server` through an AWS-compatible gateway (e.g., LiteLLM), you can use the `AWS_BEARER_TOKEN_BEDROCK` environment variable to authenticate.
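If you take the proxy route, the environment variables from this guide might be wired up as follows. The URL and token values are placeholders, not real endpoints:

```bash
# Point Claude Code at llama-server (or a LiteLLM proxy in front of it).
# Substitute your own gateway URL and token.
export CLAUDE_CODE_USE_LLAMA_CPP=true
export LLAMA_CPP_BASE_URL="http://localhost:8080/v1"
export AWS_BEARER_TOKEN_BEDROCK="replace-with-gateway-token"
```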

---
## 2. Hardware Optimization
To achieve smooth inference on high-end consumer hardware, utilize the following specialized backends.
### Apple Silicon (M2 Max)
`llama.cpp` has first-class **Metal** support.
- **Flags**: Ensure `-ngl` (number of GPU layers) is set to the maximum (e.g., `-ngl 99`) to offload the entire model to the GPU.
- **Threads**: Match the number of performance cores (e.g., `-t 8`).
### AMD Strix Halo
Strix Halo features a massive iGPU and a powerful NPU.
- **Vulkan Backend**: Use the Vulkan backend for the iGPU (`LLAMA_VULKAN=1`).
- **ROCm Backend**: For Linux users, ROCm provides near-native performance for AMD hardware.
- **NPU Integration**: If using Windows/Linux with experimental NPU drivers, ensure `llama-server` is compiled with the relevant plugin (e.g., OpenVINO).
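As a sketch, the corresponding builds might use the CMake toggles below. The option names follow recent llama.cpp CMake conventions, but verify them against the version you are building:

```bash
# Vulkan build (iGPU).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# ROCm/HIP build (Linux); requires a ROCm toolchain installed.
cmake -B build-rocm -DGGML_HIP=ON
cmake --build build-rocm --config Release -j
```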
---
## 3. Overcoming "Slow PP" (Prompt Processing)
Prompt Processing (PP) is often the bottleneck in agentic workflows where the context grows rapidly.
### Persistent KV Caching (Slots)
`llama-server` supports **slots**, which allow multiple sessions to share or persist their KV cache.
- **Persistent Slot**: Use `--slot-save-path /path/to/cache` to save the context state between CLI restarts.
- **Continuous Batching**: Use `--cont-batching` to allow the server to process new prompts while tokens are still being generated for other requests.
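Assuming the server was started with `--slot-save-path`, a slot's KV cache can be saved and restored over the HTTP API. The commands below are a sketch based on recent `llama-server` builds; check the server README for your version:

```bash
# Save slot 0's KV cache to a file under --slot-save-path.
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session-a.bin"}'

# Restore it later, skipping prompt processing for the saved prefix.
curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session-a.bin"}'
```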
### Configuration Tips
- **Large Context**: Set a generous context size with `-c 32768` (or higher) to avoid frequent context shifting.
- **Flash Attention**: Always enable Flash Attention (`--flash-attn`) to reduce memory bandwidth requirements during PP.
---
## 4. Supporting OSS Models
Claude Code is tuned for Sonnet/Opus, but can be adapted for state-of-the-art open-source models:
| Model | Mapping Suggestion | Strength |
| :--- | :--- | :--- |
| **Qwen3-72B-Instruct** | Map to `claude-3-opus-latest` | Excellent reasoning and tool use. |
| **gpt-oss-20b** | Map to `claude-3-5-sonnet-latest` | High-speed, high-intelligence balance. |
| **gpt-oss-120b** | Map to `claude-3-opus-latest` | Deep complex problem solving. |
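One lightweight way to apply these mappings is to rewrite the model ID before a request leaves the client. The helper below is an illustrative sketch; the function name and local model IDs are assumptions, not part of Claude Code:

```typescript
// Map Anthropic model IDs requested by Claude Code onto local models.
// Use whatever model names your llama-server actually reports.
const MODEL_REMAP: Record<string, string> = {
  'claude-3-opus-latest': 'qwen3-72b-instruct',
  'claude-3-5-sonnet-latest': 'gpt-oss-20b',
}

// Fall back to the requested ID so unmapped models pass through untouched.
function remapModel(requested: string): string {
  return MODEL_REMAP[requested] ?? requested
}

console.log(remapModel('claude-3-opus-latest'))    // qwen3-72b-instruct
console.log(remapModel('claude-3-haiku-20240307')) // claude-3-haiku-20240307
```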
---
## 5. Recommended `llama-server` Command
For a dedicated local Claude Code backend:
```bash
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots
```
---
> [!CAUTION]
> Using local models requires significant VRAM. A 70B model in 4-bit quantization requires ~40GB of VRAM. Ensure your hardware (like Strix Halo with 64GB+ shared RAM) can accommodate the model and KV cache.
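The ~40 GB figure can be sanity-checked with back-of-envelope arithmetic; the effective bits-per-weight used for Q4_K_M is an approximation:

```typescript
// Rough weight-memory estimate for a 70B model at ~4.5 effective bits/weight
// (Q4_K_M averages somewhat above 4 bits due to scales and mixed quant types).
const paramCount = 70e9
const bitsPerWeight = 4.5
const weightsGiB = (paramCount * bitsPerWeight) / 8 / 1024 ** 3
console.log(weightsGiB.toFixed(1)) // 36.7 GiB for weights alone
// The KV cache for a 32k context adds several more GiB on top of this.
```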
---
## See Also
- **[Authentication Guide](file:///Users/vlad/Developer/vlad/claude-code/docs/AUTH_GUIDE.md)**: Details on general environment variables and credential management.