# Llama.cpp Integration Guide - Claude Code

This guide explains how to implement a custom API provider for Claude Code using `llama.cpp`'s `llama-server`. This setup is ideal for local-first development or when using high-end hardware like **AMD Strix Halo** or **Apple Silicon M2 Max**.

---
## 1. Architecture Overview

`llama-server` provides a REST API that can be configured to mimic the OpenAI or Anthropic message formats. To integrate it into Claude Code, you will need to modify the client initialization.

### Provider Hook Location

The primary location for adding new providers is [`services/api/client.ts`](file:///Users/vlad/Developer/vlad/claude-code/services/api/client.ts).
1. **Add Provider Type**: Update `APIProvider` in `utils/model/providers.ts` to include `'llama-cpp'` (see the sketch below).
2. **Environment Variable**: Use a toggle like `CLAUDE_CODE_USE_LLAMA_CPP=true`.
3. **Client Configuration**:

```typescript
if (isEnvTruthy(process.env.CLAUDE_CODE_USE_LLAMA_CPP)) {
  return new Anthropic({
    apiKey: 'local-key', // llama-server often ignores this
    baseURL: process.env.LLAMA_CPP_BASE_URL || 'http://localhost:8080/v1',
    ...ARGS,
  })
}
```
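
For steps 1 and 2, here is a minimal sketch of what the provider union and environment toggle could look like. Only `APIProvider`, `utils/model/providers.ts`, and the `isEnvTruthy` call are taken from the steps above; the other union members and the helper's exact behavior are assumptions for illustration.

```typescript
// utils/model/providers.ts -- sketch; the existing union members are assumed, not verified
export type APIProvider = 'anthropic' | 'bedrock' | 'vertex' | 'llama-cpp'

// Hypothetical helper backing the isEnvTruthy() call in the client snippet:
// treats "1", "true", and "yes" (any casing) as enabled.
export function isEnvTruthy(value: string | undefined): boolean {
  return ['1', 'true', 'yes'].includes((value ?? '').trim().toLowerCase())
}
```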
### Remote / Proxy Authentication

If you are proxying `llama-server` through an AWS-compatible gateway (e.g., LiteLLM), you can use the `AWS_BEARER_TOKEN_BEDROCK` environment variable to authenticate.
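
As a rough illustration, the environment variables from this guide could be combined like this (the gateway URL and token are placeholders, not real values):

```bash
export CLAUDE_CODE_USE_LLAMA_CPP=true
export LLAMA_CPP_BASE_URL="https://llm-gateway.example.com/v1"   # gateway in front of llama-server
export AWS_BEARER_TOKEN_BEDROCK="sk-gateway-token"               # bearer token the gateway expects
```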
---

## 2. Hardware Optimization
To achieve smooth inference on high-end consumer hardware, utilize the following specialized backends.

### Apple Silicon (M2 Max)

`llama.cpp` has first-class **Metal** support.

- **Flags**: Ensure `-ngl` (number of GPU layers) is set to the maximum (e.g., `-ngl 99`) to offload the entire model to the GPU.
- **Threads**: Match the number of performance cores (e.g., `-t 8`).
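
Putting those two flags together, a minimal launch command for an M2 Max might look like this (the model path reuses the example from Section 5):

```bash
# Fully offload to the GPU via Metal and match the 8 performance cores.
./llama-server -m models/qwen3-72b-q4_k_m.gguf -ngl 99 -t 8 -c 32768 --port 8080
```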
### AMD Strix Halo

Strix Halo features a massive iGPU and a powerful NPU.

- **Vulkan Backend**: Use the Vulkan backend for the iGPU (`LLAMA_VULKAN=1`).
- **ROCm Backend**: For Linux users, ROCm provides near-native performance for AMD hardware.
- **NPU Integration**: If using Windows/Linux with experimental NPU drivers, ensure `llama-server` is compiled with the relevant plugin (e.g., OpenVINO).
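
For the Vulkan path, a typical build might look like the following. This assumes a recent `llama.cpp` checkout where the CMake option is spelled `GGML_VULKAN` (older Makefile builds used the `LLAMA_VULKAN=1` form referenced above); check your checkout's build docs if the option name differs.

```bash
# Build llama-server with the Vulkan backend enabled.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```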
---

## 3. Overcoming "Slow PP" (Prompt Processing)

Prompt Processing (PP) is often the bottleneck in agentic workflows where the context grows rapidly.

### Persistent KV Caching (Slots)

`llama-server` supports **slots**, which allow multiple sessions to share or persist their KV cache.

- **Persistent Slot**: Use `--slot-save-path /path/to/cache` to save the context state between CLI restarts.
- **Continuous Batching**: Use `--cont-batching` to allow the server to process new prompts while tokens are still being generated for other requests.
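
With `--slot-save-path` set, slot state can also be saved and restored over HTTP. The endpoint shape below follows llama-server's `/slots` API but should be verified against your build; the API key matches the Section 5 command.

```bash
# Persist slot 0's KV cache to a file under --slot-save-path, then restore it later.
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Authorization: Bearer local-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"filename": "claude_session.bin"}'

curl -X POST "http://localhost:8080/slots/0?action=restore" \
  -H "Authorization: Bearer local-secret-token" \
  -H "Content-Type: application/json" \
  -d '{"filename": "claude_session.bin"}'
```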
### Configuration Tips

- **Large Context**: Set a generous context size with `-c 32768` (or higher) to avoid frequent context shifting.
- **Flash Attention**: Always enable Flash Attention (`--flash-attn`) to reduce memory bandwidth requirements during PP.

---

## 4. Supporting OSS Models

Claude Code is tuned for Sonnet/Opus, but it can be adapted for state-of-the-art open-source models:

| Model | Mapping Suggestion | Strength |
| :--- | :--- | :--- |
| **Qwen3-72B-Instruct** | Map to `claude-3-opus-latest` | Excellent reasoning and tool use. |
| **GPT-20-OSS** | Map to `claude-3-5-sonnet-latest` | High-speed, high-intelligence balance. |
| **GPT-120-OSS** | Map to `claude-3-opus-latest` | Deep complex problem solving. |
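
One way to apply these mappings inside the provider hook is a simple lookup from the requested Claude model ID to the locally served model. This is only a sketch: the keys and values mirror the table above, and `resolveLocalModel` is a hypothetical helper, not part of Claude Code.

```typescript
// Hypothetical mapping from Claude model IDs to locally served models.
const MODEL_MAP: Record<string, string> = {
  'claude-3-opus-latest': 'qwen3-72b-instruct',
  'claude-3-5-sonnet-latest': 'gpt-20-oss',
}

export function resolveLocalModel(requested: string): string {
  // Fall back to the strongest local model when no mapping exists.
  return MODEL_MAP[requested] ?? 'qwen3-72b-instruct'
}
```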
---

## 5. Recommended `llama-server` Command

For a dedicated local Claude Code backend:

```bash
./llama-server \
  -m models/qwen3-72b-q4_k_m.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key local-secret-token \
  --slot-save-path ./llama_slots
```
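
Once the server is up, a quick smoke test against its `/health` endpoint confirms that the model has finished loading:

```bash
curl http://localhost:8080/health
```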
---

> [!CAUTION]
> Using local models requires significant VRAM. A 70B model in 4-bit quantization requires ~40GB of VRAM. Ensure your hardware (like Strix Halo with 64GB+ of shared RAM) can accommodate both the model and the KV cache.

---

## See Also

- **[Authentication Guide](file:///Users/vlad/Developer/vlad/claude-code/docs/AUTH_GUIDE.md)**: Details on general environment variables and credential management.