# Llama.cpp Integration Guide - Claude Code
This guide explores how to implement a custom API provider for Claude Code using `llama.cpp`'s `llama-server`. This setup is ideal for local-first development or when using high-end hardware like **AMD Strix Halo** or **Apple Silicon M2 Max**.

---
## 1. Architecture Overview
`llama-server` exposes an OpenAI-compatible REST API, and can also sit behind a proxy that translates the Anthropic messages format. To integrate it into Claude Code, you will need to modify the client initialization.
### Provider Hook Location
The primary location for adding new providers is [`services/api/client.ts`](file:///Users/vlad/Developer/vlad/claude-code/services/api/client.ts).
1. **Add Provider Type**: Update `APIProvider` in `utils/model/providers.ts` to include `'llama-cpp'`.
2. **Environment Variable**: Use a toggle like `CLAUDE_CODE_USE_LLAMA_CPP=true`.
3. **Client Configuration**:
```typescript
if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
  return new Anthropic({
    apiKey: 'local-key', // llama-server ignores this unless started with --api-key
    baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
    ...ARGS,
  })
}
```
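For reference, the pieces the snippet above relies on might look like the sketch below. The union members and the `isEnvTruthy` implementation are assumptions based on this guide, not the actual Claude Code source:

```typescript
// Step 1: extend the provider union (existing members are illustrative).
type APIProvider = 'anthropic' | 'bedrock' | 'vertex' | 'llama-cpp'
const provider: APIProvider = 'llama-cpp'

// A plausible env-toggle check: treat common "truthy" spellings as true.
function isEnvTruthy(value: string | undefined): boolean {
  return ['1', 'true', 'yes', 'on'].includes((value ?? '').trim().toLowerCase())
}

console.log(provider)             // llama-cpp
console.log(isEnvTruthy('true'))  // true
console.log(isEnvTruthy(undefined)) // false
```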
### Remote / Proxy Authentication
If you are proxying `llama-server` through an AWS-compatible gateway (e.g., LiteLLM), you can use the `AWS_BEARER_TOKEN_BEDROCK` environment variable to authenticate.
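If you take the proxy route, the environment variables from this guide might be wired up as follows. The URL and token values are placeholders, not real endpoints:

```bash
# Point Claude Code at llama-server (or a LiteLLM proxy in front of it).
# Substitute your own gateway URL and token.
export CLAUDE_CODE_USE_LLAMA_CPP=true
export LLAMA_CPP_BASE_URL="http://localhost:8080/v1"
export AWS_BEARER_TOKEN_BEDROCK="replace-with-gateway-token"
```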

---
## 2. Hardware Optimization
To achieve smooth inference on high-end consumer hardware, utilize the following specialized backends.
### Apple Silicon (M2 Max)
`llama.cpp` has first-class **Metal** support.
- **Flags**: Ensure `-ngl` (number of GPU layers) is set to the maximum (e.g., `-ngl 99`) to offload the entire model to the GPU.
- **Threads**: Match the number of performance cores (e.g., `-t 8`).
### AMD Strix Halo
Strix Halo features a massive iGPU and a powerful NPU.
- **Vulkan Backend**: Use the Vulkan backend for the iGPU (`LLAMA_VULKAN=1`).
- **ROCm Backend**: For Linux users, ROCm provides near-native performance for AMD hardware.
- **NPU Integration**: If using Windows/Linux with experimental NPU drivers, ensure `llama-server` is compiled with the relevant plugin (e.g., OpenVINO).
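As a sketch, the corresponding builds might use the CMake toggles below. The option names follow recent llama.cpp CMake conventions, but verify them against the version you are building:

```bash
# Vulkan build (iGPU).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# ROCm/HIP build (Linux); requires a ROCm toolchain installed.
cmake -B build-rocm -DGGML_HIP=ON
cmake --build build-rocm --config Release -j
```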
---
## 3. Overcoming "Slow PP" (Prompt Processing)
Prompt Processing (PP) is often the bottleneck in agentic workflows where the context grows rapidly.
### Persistent KV Caching (Slots)
`llama-server` supports **slots**, which allow multiple sessions to share or persist their KV cache.
- **Persistent Slot**: Use `--slot-save-path /path/to/cache` to save the context state between CLI restarts.
- **Continuous Batching**: Use `--cont-batching` to allow the server to process new prompts while tokens are still being generated for other requests.
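Assuming the server was started with `--slot-save-path`, a slot's KV cache can be saved and restored over the HTTP API. The commands below are a sketch based on recent `llama-server` builds; check the server README for your version:

```bash
# Save slot 0's KV cache to a file under --slot-save-path.
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session-a.bin"}'

# Restore it later, skipping prompt processing for the saved prefix.
curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "session-a.bin"}'
```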
### Configuration Tips
- **Large Context**: Set a generous context size with `-c 32768` (or higher) to avoid frequent context shifting.
- **Flash Attention**: Always enable Flash Attention (`--flash-attn`) to reduce memory bandwidth requirements during PP.
---
## 4. Supporting OSS Models
Claude Code is tuned for Sonnet/Opus, but can be adapted for state-of-the-art open-source models:
| Model | Mapping Suggestion | Strength |
| :--- | :--- | :--- |
| **Qwen3-72B-Instruct** | Map to `claude-3-opus-latest` | Excellent reasoning and tool use. |
| **gpt-oss-20b** | Map to `claude-3-5-sonnet-latest` | High-speed, high-intelligence balance. |
| **gpt-oss-120b** | Map to `claude-3-opus-latest` | Deep complex problem solving. |
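One lightweight way to apply these mappings is to rewrite the model ID before a request leaves the client. The helper below is an illustrative sketch; the function name and local model IDs are assumptions, not part of Claude Code:

```typescript
// Map Anthropic model IDs requested by Claude Code onto local models.
// Use whatever model names your llama-server actually reports.
const MODEL_REMAP: Record<string, string> = {
  'claude-3-opus-latest': 'qwen3-72b-instruct',
  'claude-3-5-sonnet-latest': 'gpt-oss-20b',
}

// Fall back to the requested ID so unmapped models pass through untouched.
function remapModel(requested: string): string {
  return MODEL_REMAP[requested] ?? requested
}

console.log(remapModel('claude-3-opus-latest'))    // qwen3-72b-instruct
console.log(remapModel('claude-3-haiku-20240307')) // claude-3-haiku-20240307
```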
---
## 5. Recommended `llama-server` Command
For a dedicated local Claude Code backend:
```bash
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots
```
---
> [!CAUTION]
> Using local models requires significant VRAM. A 70B model in 4-bit quantization requires ~40GB of VRAM. Ensure your hardware (like Strix Halo with 64GB+ shared RAM) can accommodate the model and KV cache.
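The ~40 GB figure can be sanity-checked with back-of-envelope arithmetic; the effective bits-per-weight used for Q4_K_M is an approximation:

```typescript
// Rough weight-memory estimate for a 70B model at ~4.5 effective bits/weight
// (Q4_K_M averages somewhat above 4 bits due to scales and mixed quant types).
const paramCount = 70e9
const bitsPerWeight = 4.5
const weightsGiB = (paramCount * bitsPerWeight) / 8 / 1024 ** 3
console.log(weightsGiB.toFixed(1)) // 36.7 GiB for weights alone
// The KV cache for a 32k context adds several more GiB on top of this.
```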
---
## See Also
- **[Authentication Guide](file:///Users/vlad/Developer/vlad/claude-code/docs/AUTH_GUIDE.md)**: Details on general environment variables and credential management.