# Llama.cpp Integration Guide - Claude Code

This guide explains how to implement a custom API provider for Claude Code using `llama.cpp`'s `llama-server`. This setup is ideal for local-first development or when using high-end hardware like **AMD Strix Halo** or **Apple Silicon M2 Max**.

---
## 1. Architecture Overview

`llama-server` provides a REST API that can be configured to mimic the OpenAI or Anthropic message formats. To integrate it into Claude Code, you will need to modify the client initialization.

### Provider Hook Location

The primary location for adding new providers is [`services/api/client.ts`](file:///Users/vlad/Developer/vlad/claude-code/services/api/client.ts).
1. **Add Provider Type**: Update `APIProvider` in `utils/model/providers.ts` to include `'llama-cpp'` (see the sketch below).
2. **Environment Variable**: Use a toggle like `CLAUDE_CODE_USE_LLAMA_CPP=true`.
3. **Client Configuration**:

```typescript
if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
  return new Anthropic({
    apiKey: 'local-key', // llama-server often ignores this
    baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
    ...ARGS,
  })
}
```
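
For steps 1 and 2, here is a minimal sketch of what the provider union and environment toggle could look like. Only `APIProvider`, `utils/model/providers.ts`, and the `isEnvTruthy` call are taken from the steps above; the other union members and the helper's exact behavior are assumptions for illustration.

```typescript
// utils/model/providers.ts -- sketch; the existing union members are assumed, not verified
export type APIProvider = 'anthropic' | 'bedrock' | 'vertex' | 'llama-cpp'

// Hypothetical helper backing the isEnvTruthy() call in the client snippet:
// treats "1", "true", and "yes" (any casing) as enabled.
export function isEnvTruthy(value: string | undefined): boolean {
  return ['1', 'true', 'yes'].includes((value ?? '').trim().toLowerCase())
}
```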
### Remote / Proxy Authentication

If you are proxying `llama-server` through an AWS-compatible gateway (e.g., LiteLLM), you can use the `AWS_BEARER_TOKEN_BEDROCK` environment variable to authenticate.
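
As a rough illustration, the environment variables from this guide could be combined like this (the gateway URL and token are placeholders, not real values):

```bash
export CLAUDE_CODE_USE_LLAMA_CPP=true
export LLAMA_CPP_BASE_URL="https://llm-gateway.example.com/v1"   # gateway in front of llama-server
export AWS_BEARER_TOKEN_BEDROCK="sk-gateway-token"               # bearer token the gateway expects
```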
---

## 2. Hardware Optimization
To achieve smooth inference on high-end consumer hardware, utilize the following specialized backends.

### Apple Silicon (M2 Max)

`llama.cpp` has first-class **Metal** support.

- **Flags**: Ensure `-ngl` (number of GPU layers) is set to the maximum (e.g., `-ngl 99`) to offload the entire model to the GPU.
- **Threads**: Match the number of performance cores (e.g., `-t 8`).
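
Putting those two flags together, a minimal launch command for an M2 Max might look like this (the model path reuses the example from Section 5):

```bash
# Fully offload to the GPU via Metal and match the 8 performance cores.
./llama-server -m models/qwen3-72b-q4_k_m.gguf -ngl 99 -t 8 -c 32768 --port 8080
```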
### AMD Strix Halo

Strix Halo features a massive iGPU and a powerful NPU.

- **Vulkan Backend**: Use the Vulkan backend for the iGPU (`LLAMA_VULKAN=1`).
- **ROCm Backend**: For Linux users, ROCm provides near-native performance for AMD hardware.
- **NPU Integration**: If using Windows/Linux with experimental NPU drivers, ensure `llama-server` is compiled with the relevant plugin (e.g., OpenVINO).
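
For the Vulkan path, a typical build might look like the following. This assumes a recent `llama.cpp` checkout where the CMake option is spelled `GGML_VULKAN` (older Makefile builds used the `LLAMA_VULKAN=1` form referenced above); check your checkout's build docs if the option name differs.

```bash
# Build llama-server with the Vulkan backend enabled.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```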
---

## 3. Overcoming "Slow PP" (Prompt Processing)

Prompt Processing (PP) is often the bottleneck in agentic workflows where the context grows rapidly.

### Persistent KV Caching (Slots)

`llama-server` supports **slots**, which allow multiple sessions to share or persist their KV cache.

- **Persistent Slot**: Use `--slot-save-path /path/to/cache` to save the context state between CLI restarts.
- **Continuous Batching**: Use `--cont-batching` to allow the server to process new prompts while tokens are still being generated for other requests.
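
With `--slot-save-path` set, slot state can also be saved and restored over HTTP. The endpoint shape below follows llama-server's `/slots` API but should be verified against your build; the API key matches the Section 5 command.

```bash
# Persist slot 0's KV cache to a file under --slot-save-path, then restore it later.
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Authorization: Bearer local-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"filename": "claude_session.bin"}'

curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Authorization: Bearer local-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"filename": "claude_session.bin"}'
```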
### Configuration Tips

- **Large Context**: Set a generous context size with `-c 32768` (or higher) to avoid frequent context shifting.
- **Flash Attention**: Always enable Flash Attention (`--flash-attn`) to reduce memory bandwidth requirements during PP.

---

## 4. Supporting OSS Models

Claude Code is tuned for Sonnet/Opus, but it can be adapted for state-of-the-art open-source models:

| Model | Mapping Suggestion | Strength |
| :--- | :--- | :--- |
| **Qwen3-72B-Instruct** | Map to `claude-3-opus-latest` | Excellent reasoning and tool use. |
| **GPT-20-OSS** | Map to `claude-3-5-sonnet-latest` | High-speed, high-intelligence balance. |
| **GPT-120-OSS** | Map to `claude-3-opus-latest` | Deep complex problem solving. |
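
One way to apply these mappings inside the provider hook is a simple lookup from the requested Claude model ID to the locally served model. This is only a sketch: the keys and values mirror the table above, and `resolveLocalModel` is a hypothetical helper, not part of Claude Code.

```typescript
// Hypothetical mapping from Claude model IDs to locally served models.
const MODEL_MAP: Record<string, string> = {
  'claude-3-opus-latest': 'qwen3-72b-instruct',
  'claude-3-5-sonnet-latest': 'gpt-20-oss',
}

export function resolveLocalModel(requested: string): string {
  // Fall back to the strongest local model when no mapping exists.
  return MODEL_MAP[requested] ?? 'qwen3-72b-instruct'
}
```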
---

## 5. Recommended `llama-server` Command

For a dedicated local Claude Code backend:

```bash
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots
```
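
Once the server is up, a quick smoke test against its `/health` endpoint confirms that the model has finished loading:

```bash
curl http://localhost:8080/health
```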
---

> [!CAUTION]
> Using local models requires significant VRAM. A 70B model in 4-bit quantization requires ~40GB of VRAM. Ensure your hardware (like Strix Halo with 64GB+ of shared RAM) can accommodate both the model and the KV cache.

---

## See Also

- **[Authentication Guide](file:///Users/vlad/Developer/vlad/claude-code/docs/AUTH_GUIDE.md)**: Details on general environment variables and credential management.