wip:milestone 0 fixes
Some checks failed
CI/CD Pipeline / unit-tests (push) Failing after 1m16s
CI/CD Pipeline / integration-tests (push) Failing after 2m32s
CI/CD Pipeline / lint (push) Successful in 5m22s
CI/CD Pipeline / e2e-tests (push) Has been skipped
CI/CD Pipeline / build (push) Has been skipped

This commit is contained in:
2026-03-15 12:35:42 +02:00
parent 6708cf28a7
commit cffdf8af86
61266 changed files with 4511646 additions and 1938 deletions

docs/AUTOBASE.md Normal file

@@ -0,0 +1,105 @@
# MadBase State Pillar (Autobase + Redis)
## Architecture
The **State Pillar** (Pillar 3) is the centralized data layer of MadBase, hosting both durable and ephemeral state:
- **PostgreSQL**: Persistent relational data (users, projects, storage metadata)
- **Autobase**: HA and quorum management for PostgreSQL
- **Redis**: High-performance caching and distributed state
- **HAProxy**: Unified entry point for both databases
## Components
### PostgreSQL (Persistent State)
- **Port**: 5432 (direct), 5433 (via HAProxy)
- **Purpose**: ACID-compliant data storage
- **Features**:
- Automatic failover via Patroni
- etcd for leader election
- Replication for high availability
### Redis (Ephemeral State)
- **Port**: 6379 (via HAProxy)
- **Purpose**: Shared caching and distributed coordination
- **Features**:
- In-memory data structures
- TTL-based auto-expiration
- Pub/Sub messaging
- Atomic operations
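As a minimal sketch of the last two features, using `redis-cli` against the State Pillar (key names are illustrative):
```bash
# TTL-based auto-expiration: the session key vanishes after one hour
redis-cli -h db-node SET session:abc123 user-42 EX 3600
# Atomic operation: INCR is safe under concurrent writers, which is
# what makes Redis suitable for cluster-wide rate-limit counters
redis-cli -h db-node INCR ratelimit:ip:203.0.113.7
```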
### Autobase Integration
MadBase uses **Autobase** (PostgreSQL + Patroni + etcd) to provide a high-availability, self-healing database layer.
## High Availability
A minimum of **3 nodes** is required for quorum:
- If the primary PostgreSQL fails, Patroni promotes a standby in under 30 seconds
- HAProxy automatically redirects traffic to the new leader
- Redis uses Sentinel or Cluster for automatic failover
## Scaling
### Initial Setup
- 1 node (non-HA, development)
### Production
- 3 or 5 nodes (HA with quorum)
### Scaling Command
```bash
curl -X POST http://localhost:8001/api/v1/cluster/scale \
  -H "Content-Type: application/json" \
  -d '{ "target_db_count": 3, "min_ha_nodes": true }'
```
## Use Cases
### PostgreSQL (Persistent Data)
- User accounts and authentication
- Project configurations
- Storage metadata
- Function deployments
- Audit logs
### Redis (Ephemeral Data)
- User sessions (shared across proxies)
- Realtime presence tracking
- Rate limiting counters
- Distributed locks
- API response caching
## Monitoring
Database health is monitored via the System Node:
- Check Patroni status: `curl http://db-node:8008/health`
- Check Redis: `redis-cli -h db-node ping`
- HAProxy Stats: http://db-node:7000
- Metrics available in "State Pillar Performance" Grafana dashboard
## Backup Strategy
- **PostgreSQL**: Daily automated backups to S3
- **Redis**: Periodic RDB snapshots (configured via Redis config)
- **HAProxy**: Configuration managed via Infrastructure as Code
## Configuration
### Environment Variables
```bash
DATABASE_URL="postgres://user:pass@db:5432/madbase"
REDIS_URL="redis://db:6379/0"
PATRONI_SCOPE=madbase-cluster
```
### Resource Requirements
| Plan | RAM | CPU | Max Concurrent Connections |
|------|-----|-----|---------------------------|
| CX21 | 4GB | 2 | 100 |
| CX31 | 8GB | 2 | 200 |
| CX41 | 16GB | 4 | 500 |
See [CACHING_STRATEGY.md](CACHING_STRATEGY.md) for detailed caching information.

docs/CACHING_STRATEGY.md Normal file

@@ -0,0 +1,249 @@
# MadBase Caching Strategy
## Overview
MadBase implements a **two-tier caching architecture** that maintains the simplicity of the 4-pillar system while providing enterprise-grade caching capabilities.
## Architecture
### Tier 1: L1 Cache (In-Memory)
- **Technology**: moka (Rust)
- **Location**: Proxy / Worker nodes
- **Purpose**: Ultra-low latency for frequently accessed data
- **Typical Use Cases**:
- Project configurations
- JWT validation cache
- Hot database query results
- API response caching
### Tier 2: L2 Cache (Redis)
- **Technology**: Redis 7
- **Location**: State Pillar (Pillar 3)
- **Purpose**: Shared state across the entire cluster
- **Typical Use Cases**:
- Distributed session storage
- Realtime presence tracking
- Rate limiting counters
- Distributed locking
- Pub/Sub messaging
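As a minimal get-through sketch of the two tiers, assuming recent `moka` and `redis` crates (helper names are illustrative, not MadBase's actual API):
```rust
use std::time::Duration;
use moka::future::Cache;
use redis::AsyncCommands;

// Check L1 first, fall back to L2, and backfill L1 on an L2 hit.
async fn get_through(
    l1: &Cache<String, String>,
    redis: &redis::Client,
    key: &str,
) -> redis::RedisResult<Option<String>> {
    // L1: process-local, sub-microsecond
    if let Some(v) = l1.get(key).await {
        return Ok(Some(v));
    }
    // L2: shared across the cluster via the State Pillar
    let mut con = redis.get_multiplexed_async_connection().await?;
    let v: Option<String> = con.get(key).await?;
    if let Some(ref hit) = v {
        l1.insert(key.to_string(), hit.clone()).await;
    }
    Ok(v)
}

fn build_l1() -> Cache<String, String> {
    Cache::builder()
        .max_capacity(10_000)
        // Keep L1 entries short-lived so L2 stays authoritative
        .time_to_live(Duration::from_secs(60))
        .build()
}
```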
## State Pillar Integration
The **State Pillar** (formerly "Database Pillar") now hosts both PostgreSQL and Redis:
```
┌─────────────────────────────────────────┐
│ State Pillar Node │
├─────────────────────────────────────────┤
│ ┌──────────┐ ┌─────────────┐ │
│ │PostgreSQL│ │ Redis │ │
│ │ :5432 │ │ :6379 │ │
│ └──────────┘ └─────────────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ HAProxy │ │
│ │ :5433/:6379 │ │
│ └─────────────┘ │
└─────────────────────────────────────────┘
```
### Why This Approach?
1. **Resource Symmetry**: Both PostgreSQL and Redis are memory-intensive and share the same VPS requirements
2. **HA Piggybacking**: Pillar 3 already manages HA via Patroni and etcd. Redis benefits from the same infrastructure
3. **Centralized State**: Maintains clean separation of Compute (Worker/Proxy) vs. State (DB/Redis)
4. **Zero Complexity**: No new pillar is needed; the existing one is simply enhanced
## Features
### 1. Shared Auth Sessions
Users can now stay logged in even if the Proxy node handling their request changes:
```rust
use auth::SessionManager;
// Create a session
let session_token = session_manager
.create_session(user_id, email, "authenticated".to_string())
.await?;
// Validate on any proxy node
let session = session_manager
.validate_session(&session_token)
.await?;
```
### 2. Realtime Presence
Track "Who is online" across multiple Worker nodes:
```rust
use realtime::PresenceManager;
// User joins a channel
presence_manager
.join_channel(user_id, "public-chat".to_string(), None)
.await?;
// Get online count
let count = presence_manager
.get_channel_online_count("public-chat".to_string())
.await?;
```
### 3. Distributed Locking
Prevent race conditions during background operations:
```rust
use common::DistributedLock;
let lock = DistributedLock::new(
redis_client,
"migration:lock".to_string(),
30, // 30 seconds TTL
);
if lock.acquire().await? {
// Perform critical section
lock.release().await?;
}
```
### 4. Rate Limiting
Distributed rate limiting across all instances:
```rust
use gateway::rate_limit::RateLimitMiddleware;
// Check IP-based rate limit
if !middleware.check_ip(&user_ip).await? {
    return Err("Rate limit exceeded".into());
}
```
## Configuration
### Environment Variables
```bash
# PostgreSQL
DATABASE_URL="postgres://user:pass@db:5432/madbase"
# Redis (optional; falls back to L1-only caching when unset)
REDIS_URL="redis://db:6379/0"
# Cache TTL
CACHE_TTL_SECONDS=3600
```
### Cache Keyspaces
| Pattern | Purpose | TTL |
|---------|---------|-----|
| `session:{token}` | User sessions | 3600s |
| `presence:channel:{name}:user:{id}` | User presence | 60s |
| `ratelimit:ip:{addr}` | IP rate limiting | 60s |
| `ratelimit:user:{id}` | User rate limiting | 60s |
| `lock:{name}` | Distributed locks | Configurable |
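Key construction is best funneled through one helper per pattern so the keyspace stays consistent; a sketch (helper names are illustrative):
```rust
// Helpers mirroring the keyspace table above (names are illustrative).
fn session_key(token: &str) -> String {
    format!("session:{token}")
}

fn presence_key(channel: &str, user_id: &str) -> String {
    format!("presence:channel:{channel}:user:{user_id}")
}

fn ip_rate_limit_key(addr: &str) -> String {
    format!("ratelimit:ip:{addr}")
}
```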
## HAProxy Configuration
The State Pillar's HAProxy routes both PostgreSQL and Redis traffic:
```haproxy
listen primary
bind *:5433
mode tcp
server patroni1 patroni:5432 check
listen redis
bind *:6379
mode tcp
server redis1 redis:6379 check
```
## Scaling Strategy
### Horizontal Scaling
- **Proxy Nodes**: Add more proxies, all share the same Redis cache
- **Worker Nodes**: Add more workers, presence tracking works seamlessly
- **State Nodes**: Scale to 3 or 5 nodes for HA, Redis is replicated via Sentinel/Cluster
### Vertical Scaling
- Upgrade State Node plan for more RAM (benefits both PostgreSQL and Redis)
- Typical: CX21 (4GB) → CX31 (8GB) → CX41 (16GB)
## Monitoring
Redis is monitored alongside PostgreSQL:
- **HAProxy Stats**: http://db-node:7000
- **Grafana Dashboard**: "State Pillar Performance"
- **Metrics**:
- Redis memory usage
- Cache hit/miss ratios
- Connection pool utilization
- Rate limit enforcement
## Best Practices
1. **Session Management**: Use appropriate TTLs (shorter for sensitive data)
2. **Presence Tracking**: Implement heartbeats to keep users "online"
3. **Rate Limiting**: Use different limits for different user tiers
4. **Distributed Locks**: Always set reasonable TTLs to prevent deadlocks
5. **Cache Invalidation**: Use versioned keys or explicit deletion
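A sketch of practice 5 using versioned keys, assuming the `redis` crate (the `cachever:` namespace is illustrative): bumping a version counter invalidates every derived key at once, and stale entries simply age out via TTL.
```rust
use redis::AsyncCommands;

// Derive the current cache key for a project-scoped entry.
async fn project_cache_key(
    con: &mut redis::aio::MultiplexedConnection,
    project_id: &str,
    suffix: &str,
) -> redis::RedisResult<String> {
    let version: Option<i64> = con.get(format!("cachever:project:{project_id}")).await?;
    Ok(format!("project:{project_id}:v{}:{suffix}", version.unwrap_or(0)))
}

// Invalidate everything cached for a project with one atomic INCR:
// readers immediately derive new keys, old ones expire via TTL.
async fn invalidate_project(
    con: &mut redis::aio::MultiplexedConnection,
    project_id: &str,
) -> redis::RedisResult<()> {
    let _: i64 = con.incr(format!("cachever:project:{project_id}"), 1).await?;
    Ok(())
}
```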
## Migration Guide
### From Single-Node to Cluster
1. Update State Pillar image to include Redis
2. Set `REDIS_URL` in all Proxy/Worker configurations
3. Deploy SessionManager in Auth handlers
4. Enable presence tracking in Realtime module
5. Update rate limiting to use distributed counters
### Testing
```bash
# Test Redis connection
redis-cli -h db-node ping
# Test session creation
curl -X POST http://localhost:8000/auth/v1/token \
-d '{"email":"test@example.com","password":"password"}'
# Check presence
redis-cli -h db-node SMEMBERS "presence:channel:public:users"
```
## Performance
### Expected Latency
| Operation | L1 Cache (moka) | L2 Cache (Redis) | Database |
|-----------|-----------------|------------------|----------|
| Get | <1μs | 1-2ms | 10-50ms |
| Set | <1μs | 1-2ms | 10-50ms |
| Delete | <1μs | 1-2ms | 10-50ms |
### Cache Hit Ratios
- **L1 Hit**: 95%+ for frequently accessed data
- **L2 Hit**: 80%+ for shared state
- **Miss**: Falls through to database
## Future Enhancements
- [ ] Redis Cluster for horizontal scaling
- [ ] Pub/Sub for real-time events
- [ ] Bloom filters for existence checks
- [ ] HyperLogLog for cardinality estimation
- [ ] Geospatial indexing for location features

docs/DEPLOYMENT_GUIDE.md Normal file

@@ -0,0 +1,70 @@
# MadBase Deployment Guide
This guide covers everything from initial setup to high-availability scaling on Hetzner Cloud and other providers.
## 1. Prerequisites
1. **Hetzner Cloud Account** with API token (or other supported provider).
2. **SSH Key** added to your provider.
3. **PostgreSQL database** for the Control Plane state.
4. **Docker** installed for local development and service deployment.
## 2. Setting Up the Control Plane
### Step 1: Environment Configuration
```bash
export HETZNER_API_KEY="your_token"
export DATABASE_URL="postgresql://user:pass@localhost/madbase_control_plane"
export HETZNER_SSH_KEY_PATH="$HOME/.ssh/id_rsa"
```
### Step 2: Run the API
```bash
docker run -p 8001:8001 \
-e DATABASE_URL=$DATABASE_URL \
-e HETZNER_API_KEY=$HETZNER_API_KEY \
madbase/control-plane
```
## 3. Provisioning a Cluster
### Adding a Node
To add a node, send a POST request to the Control Plane API:
```bash
curl -X POST http://localhost:8001/api/v1/servers \
  -H "Content-Type: application/json" \
  -d '{
"name": "worker-1",
"template": "worker-node",
"hetzner_plan": "CX11",
"region": "fsn1"
}'
```
Refer to [NODE_TEMPLATES.md](NODE_TEMPLATES.md) for available templates.
## 4. Scaling Strategies
### Horizontal Scaling
The **Proxy/API** and **Worker** pillars are designed for horizontal expansion.
- Use `POST /api/v1/cluster/scale` to target a specific node count.
- The system handles drain-and-remove logic for safe scale-down.
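For example (the payload mirrors the scaling example in [AUTOBASE.md](AUTOBASE.md)):
```bash
curl -X POST http://localhost:8001/api/v1/cluster/scale \
  -H "Content-Type: application/json" \
  -d '{ "target_db_count": 3, "min_ha_nodes": true }'
```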
### Vertical Scaling (System Node)
The **System Node** does not scale horizontally. To scale it:
1. Upgrade the VPS plan in the Hetzner console.
2. The Control Plane will detect the resource change on restart.
## 5. Security Hardening
Use the `/fortify` endpoint to secure your nodes:
- Configures Hetzner Cloud Firewalls.
- Disables root/password SSH access.
- Installs `fail2ban`.
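For example (the exact route and parameters are an assumption; check your Control Plane version):
```bash
curl -X POST http://localhost:8001/fortify
```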
## 6. High Availability (HA)
For production deployments, always aim for:
- 3+ Database nodes (for Quorum).
- 2+ Proxy nodes (for Ingress HA).
- Distributed regions (e.g., `fsn1`, `nbg1`).
---
For more details on multiple providers, see the specialized `MULTI_PROVIDER_VPS.md` implementation notes.

docs/KERNEL_ARCHITECTURE.md Normal file

@@ -0,0 +1,128 @@
# MadBase Kernel Architecture
This document defines the core organization and security model of the MadBase infrastructure.
## Documentation Map
- [**Deployment Guide**](DEPLOYMENT_GUIDE.md): Setup, Scaling, and Provider configuration.
- [**Storage & Persistence**](STORAGE.md): DB, S3, and Backups.
- [**State Pillar (Autobase + Redis)**](AUTOBASE.md): High-Availability State Node details.
- [**Caching Strategy**](CACHING_STRATEGY.md): Two-tier caching architecture.
- [**Node Templates**](NODE_TEMPLATES.md): Reference for server plans and services.
---
The "Kernel" architecture is the simplified, core organizational model for MadBase deployments. It collapses complex node roles into four manageable pillars, each with specific scaling characteristics and duties.
## 0. System Pillar (The Foundation)
A horizontally static but **vertically scalable** "seed" node that provides the cluster's base services.
- **Components**:
- **Control Plane API**: Cluster management and orchestration.
- **Observability**: VictoriaMetrics, Loki, Grafana.
- **Scaling**: Static horizontally (1 node). Supports **Vertical Scaling** via VPS plan upgrades (e.g., CX21 to CX41).
## 1. Proxy / Public API (The Face)
This pillar handles external communication and the public-facing API layer.
- **Components**:
- **Gateway Proxy**: Ingress, SSL, and request routing.
- **Public API**: The core platform API (Auth, Storage metadata, etc.).
- **L1 Cache**: In-memory caching (moka) for ultra-low latency.
- **Scaling**:
- **Range**: 1 to 100 nodes.
- **Constraints**: Horizontally scalable via Anycast or Floating IP.
## 2. Worker (The Muscle)
This pillar executes business logic and Edge Functions.
- **Components**:
- **Compute**: Deno/Wasm runners.
- **Realtime**: WebSocket managers with presence tracking.
- **L1 Cache**: In-memory caching for function results.
- **Scaling**: 1+ nodes.
- **Constraints**: Unlimited horizontal scaling.
## 3. State Pillar (The Memory)
Ensures data persistence, consistency, and distributed coordination.
- **Components**:
- **PostgreSQL**: Primary data store (via Autobase).
- **Redis**: High-performance distributed cache.
- **HAProxy**: Unified entry point for both databases.
- **Scaling**: 1, 3, or 5 nodes (Must be odd for quorum).
- **Features**:
- Shared auth sessions across proxies
- Realtime presence tracking across workers
- Distributed locking for migrations
- Cluster-wide rate limiting
---
## Observability Strategy (Metrics & Logs)
To maintain the performance of the four pillars, we implement a dedicated **System Pillar** for observability, either co-located with the Control nodes or split out to its own node depending on scale.
- **VictoriaMetrics (VM)**: Fast, cost-effective time-series database for metrics.
- **Loki**: Distributed log aggregation.
- **Placement**:
- Small Clusters: Embedded in the **Control** nodes.
- High Throughput: Dedicated `system-node` to prevent observability overhead from impacting the application pillars.
---
## Network Isolation & Security Zones
To ensure "Defense in Depth," the kernel is divided into two distinct network zones:
### 1. Public Zone (The DMZ)
- **Deployment**: Nodes have a Public IP and are attached to the Cluster VPC.
- **Pillars**:
- **System Node**: For cluster administration and dashboard access.
- **Proxy / Public API**: For handling all incoming internet traffic.
- **Access**: Restricted to HTTPS (443) and SSH (via allow-list).
### 2. Private Zone (The Core)
- **Deployment**: Nodes have **No Public IP**. They are accessible ONLY via the Cluster VPC (Private Network).
- **Pillars**:
- **Worker Pillar**: Executes application code.
- **State Pillar**: Stores sensitive project data (PostgreSQL + Redis).
- **Access**: No direct internet access. All ingress must pass through the Proxy/API pillar. Egress is managed via a NAT Gateway (optional) or limited to OS updates.
---
## State Pillar: The "Memory" of the Cluster
The State Pillar combines **persistent storage** (PostgreSQL) and **ephemeral state** (Redis) into a single, highly available unit:
### Why Combine Them?
1. **Resource Symmetry**: Both PostgreSQL and Redis are memory-intensive and benefit from the same VPS plans (High-RAM nodes).
2. **HA Piggybacking**: Pillar 3 already manages HA via Patroni and etcd. Redis leverages the same infrastructure.
3. **Centralized Coordination**: Having all state (durable and ephemeral) in one place simplifies the architecture.
4. **Zero Complexity**: No new pillar needed—we just enhanced the existing "Database" pillar.
### Cache Distribution
- **L1 Cache** (moka): Runs on each Proxy/Worker node for ultra-low latency
- **L2 Cache** (Redis): Runs on State Pillar nodes for shared state
```
┌─────────────┐ ┌─────────────┐
│ Proxy 1 │ │ Worker 1 │
│ (L1 Cache) │ │ (L1 Cache) │
└──────┬──────┘ └──────┬──────┘
│ │
└──────────┬───────────┘
┌────────▼────────┐
│ State Pillar │
│ ┌──────────┐ │
│ │PostgreSQL│ │
│ └──────────┘ │
│ ┌──────────┐ │
│ │ Redis │ │
│ │(L2 Cache)│ │
│ └──────────┘ │
│ ┌──────────┐ │
│ │ HAProxy │ │
│ └──────────┘ │
└─────────────────┘
```
For detailed caching architecture, see [CACHING_STRATEGY.md](CACHING_STRATEGY.md).

docs/NODE_TEMPLATES.md Normal file

@@ -0,0 +1,567 @@
# Node Templates - Quick Reference
Complete guide to MadBase node templates for Hetzner Cloud deployment.
## Template Overview
| Template | Pillar | Zone | Min Plan | Cost/Mo | Use Case | Services |
|----------|--------|------|----------|---------|----------|----------|
| **system-node** | System | Public | CX21 | €6.94 | Cluster Root | Control API + Grafana + VM + Loki |
| **proxy-api-node** | Proxy / API | Public | CX11 | €3.69 | Scalable Ingress | Gateway + Platform API |
| **worker-node** | Worker | Private | CX11 | €3.69 | Horizontal scaling | Worker + vmagent |
| **db-node** | DB / State | Private | CX21 | €6.94 | Production database HA | PostgreSQL + Patroni + etcd + HAProxy |
| **worker-db-combo** ⭐ | Mixed | Private | CX31 | €14.21 | Smaller deployments | Worker + PostgreSQL + etcd + HAProxy |
| **worker-monitor-combo** ⭐ | Mixed | Private | CX21 | €6.94 | Cost-optimized | Worker + VictoriaMetrics + Loki |
| **all-in-one** ⭐ | Unified | Public | CX41 | €25.60 | Development/MVP | All services on one node |
⭐ = Composite template (mixes multiple service types)
---
## Pure Templates (Single Service Type)
### 1. Database Node (db-node.yaml)
**Best for**: Production deployments requiring database HA
**Server**: CX21 (4GB RAM, 2 vCPU)
**Services**:
- PostgreSQL 15 with Patroni (auto-failover)
- etcd (distributed consensus)
- HAProxy (connection pooling + read/write splitting)
**Scaling**: 3-7 nodes (odd number for quorum)
**When to use**:
- Production traffic >1000 req/min
- Need database auto-failover
- Want separate database cluster
### 2. Worker Node (worker-node.yaml)
**Best for**: Horizontal scaling of API workers
**Server**: CX11 (4GB RAM, 2 vCPU)
**Services**:
- MadBase Worker (API processing)
- vmagent (metrics collection)
**Scaling**: 1-20 nodes
**Auto-scaling rules**:
- Scale up: CPU > 70%
- Scale down: CPU < 20%
**When to use**:
- Need to scale workers independently
- Separate database cluster already exists
- Production deployments
### 3. Control Plane Node (control-plane-node.yaml)
**Best for**: Management UI and APIs
**Server**: CX11 (4GB RAM, 2 vCPU)
**Services**:
- Gateway Proxy (port 8080)
- Control Plane API (port 8001)
- Grafana (port 3030)
- Keepalived (HA with floating IP)
**Scaling**: 1-2 nodes (HA mode)
**When to use**:
- Need web UI for server management
- Want to provision servers via API
- Production deployments
### 4. Monitoring Node (monitoring-node.yaml)
**Best for**: Centralized metrics and logging
**Server**: CX11 (4GB RAM, 2 vCPU)
**Services**:
- VictoriaMetrics (metrics database)
- Loki (log aggregation)
- Alertmanager (optional)
**Scaling**: 1-2 nodes (can be HA)
**When to use**:
- Production deployments
- Want centralized monitoring
- Need log aggregation
---
## Composite Templates (Mix Multiple Service Types)
### 5. Worker + Database Combo (worker-db-combo.yaml) ⭐
**Best for**: 2-3 server deployments with database and worker on same node
**Server**: CX31 (8GB RAM, 2 vCPU)
**Services**:
- PostgreSQL 15 with Patroni
- etcd
- HAProxy
- MadBase Worker
- vmagent
**Why use this**:
- Cost savings on smaller plans (e.g. €6.94 on a CX21, vs €10.63 for separate db-node + worker-node)
- Simpler architecture for smaller deployments
- Easy to scale later
**Scaling**: 1-2 nodes
**Upgrade path**: When CPU > 60% or RAM > 70%, migrate to dedicated db-node + worker-node
**Deployment example**:
```yaml
Server 1 (worker-db-combo): PostgreSQL + Worker
Server 2 (control-plane): Proxy + Control + Grafana
Server 3 (monitoring): VictoriaMetrics + Loki
```
### 6. Worker + Monitoring Combo (worker-monitor-combo.yaml) ⭐
**Best for**: Cost-optimized deployments with monitoring on worker node
**Server**: CX21 (4GB RAM, 2 vCPU)
**Services**:
- MadBase Worker
- VictoriaMetrics
- Loki
- vmagent
- Promtail
**Why use this**:
- Save €3.69/mo (no dedicated monitoring node)
- Monitoring co-located with worker
- Good for 2-3 server deployments
**Scaling**: 1-3 nodes
**When to upgrade**:
- Worker CPU > 60% (monitoring competes for resources)
- Need to scale workers horizontally
**Deployment example**:
```yaml
Server 1 (worker-monitor-combo): Worker + VictoriaMetrics + Loki
Server 2 (db-node): PostgreSQL + etcd + HAProxy
Server 3 (control-plane): Proxy + Control + Grafana
```
### 7. All-in-One (all-in-one.yaml) ⭐
**Best for**: Development, testing, or MVP deployments
**Server**: CX41 (16GB RAM, 4 vCPU)
**Services**: ALL (PostgreSQL, etcd, HAProxy, Redis, MinIO, Workers, Proxy, Control, VictoriaMetrics, Loki, Grafana)
**Why use this**:
- Simplest deployment
- Single server for everything
- Great for development/testing
**When to upgrade**:
- Production traffic > 100 req/min
- CPU usage > 70% sustained
- Need HA for database
---
## Monitoring Stack: VictoriaMetrics + Loki
### How It Works
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ vmagent │─┼─────────┼─│ vmagent │─┼─────────┼─│ vmagent │─┼──┐
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ Scans: │ │ Scans: │ │ Scans: │ │
│ - worker │ │ - worker │ │ - db │ │
│ - system │ │ - system │ │ - system │ │
└──────────────┘ └──────────────┘ └──────────────┘ │
┌───────────────────────┐
│ VictoriaMetrics │
│ Port: 8428 │
│ Type: Metrics DB │
└───────────┬───────────┘
┌───────────────────────┐
│ Grafana │
│ Port: 3030 │
│ Queries VM + Loki │
└───────────────────────┘
┌──────────────┐ ┌──────────────┐
│ Node 1 │ │ Node 2 │
│ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Promtail │─┼─────────┼─│ Promtail │─┼───┐
│ └──────────┘ │ │ └──────────┘ │ │
│ Reads: │ │ Reads: │ │
│ - logs/* │ │ - logs/* │ │
└──────────────┘ └──────────────┘ │
┌───────────────────────┐
│ Loki │
│ Port: 3100 │
│ Type: Log Aggregation│
└───────────┬───────────┘
┌───────────────────────┐
│ Grafana │
│ LogQL Queries │
└───────────────────────┘
```
### Components
#### VictoriaMetrics (Metrics Database)
**Purpose**: Store and query time-series metrics
**Location**:
- Dedicated monitoring-node (recommended)
- worker-monitor-combo (cost-optimized)
- all-in-one (development)
**Data Flow**:
1. vmagent on each node scrapes metrics every 15s
2. Metrics sent to VictoriaMetrics via remote write
3. VictoriaMetrics stores metrics with 10x compression
4. Grafana queries VictoriaMetrics for dashboards
**Metrics Collected**:
- **Worker**: Request rate, error rate, latency, queue depth
- **PostgreSQL**: Connections, transactions, replication lag
- **System**: CPU, memory, disk, network
- **HAProxy**: Connection count, response time
**Storage Requirements**:
- ~1GB per million time series per day (compressed)
- Default retention: 30 days
- RAM: Minimal, scales with active queries
#### Loki (Log Aggregation)
**Purpose**: Store and query logs
**Location**:
- Dedicated monitoring-node (recommended)
- worker-monitor-combo (cost-optimized)
- all-in-one (development)
**Data Flow**:
1. Promtail on each node tails log files
2. Logs sent to Loki via HTTP API
3. Loki indexes logs by labels (service, level, host)
4. Grafana queries Loki using LogQL
**Logs Collected**:
- **Worker**: `/var/log/madbase/worker.log`
- **PostgreSQL**: `/var/log/postgresql/*.log`
- **System**: `/var/log/syslog`
**Storage Requirements**:
- ~10% of raw log size (with compression)
- Default retention: 30 days
- RAM: Minimal, scales with active queries
#### vmagent (Metrics Collector)
**Purpose**: Scrape metrics and send to VictoriaMetrics
**Location**: Runs on EVERY node
**Port**: 8429 (local debug endpoint)
**Configuration**: `config/vmagent.yml`
**Scrape Targets**:
- Worker: `localhost:8002/metrics`
- Patroni: `localhost:8008/metrics`
- Node Exporter: `localhost:9100/metrics`
- HAProxy: `localhost:7000/metrics`
**Resource Usage**:
- CPU: <5% of 1 core
- Memory: ~50MB
#### Promtail (Log Collector)
**Purpose**: Tail log files and send to Loki
**Location**: Runs on EVERY node
**Configuration**: `config/promtail.yml`
**Log Sources**:
- `/var/log/madbase/worker.log` (worker logs)
- `/var/log/postgresql/*.log` (database logs)
- `/var/log/syslog` (system logs)
**Resource Usage**:
- CPU: <2% of 1 core
- Memory: ~30MB
### Grafana Integration
Grafana connects to both VictoriaMetrics and Loki:
**Example Dashboard Query**:
```yaml
Panel 1: Request Rate (Metrics)
Query: rate(http_requests_total[5m])
Panel 2: Error Rate (Metrics)
Query: rate(http_requests_total{status=~"5.."}[5m])
Panel 3: Recent Errors (Logs)
Query: {level="error"} | line_format "{{.message}}"
Panel 4: Trace Request by ID (Logs)
Query: {trace_id="abc123"} |= "timeout"
```
### Deployment Scenarios
#### Scenario 1: Dedicated Monitoring Node (Production)
```yaml
servers:
- name: server1
template: control-plane-node
plan: CX11
- name: server2
template: db-node
plan: CX21
- name: server3
template: worker-node
plan: CX11
- name: server4
template: monitoring-node  # Dedicated monitoring
plan: CX11
```
**Cost**: €18.01/mo (4 servers)
**Best for**: Production with >1000 req/min
#### Scenario 2: Worker + Monitoring Combo (Cost-Optimized)
```yaml
servers:
- name: server1
template: control-plane-node
plan: CX11
- name: server2
template: db-node
plan: CX21
- name: server3
template: worker-monitor-combo  # Combined worker + monitoring
plan: CX21
```
**Cost**: €17.57/mo (3 servers)
**Best for**: Cost-optimized production with <1000 req/min
#### Scenario 3: All-in-One (Development)
```yaml
servers:
- name: dev-server
template: all-in-one
plan: CX41
```
**Cost**: €25.60/mo (1 server)
**Best for**: Development, testing, MVP
---
## Deployment Examples
### Example 1: Small Production (3 servers)
```yaml
Server 1 (CX21 - €6.94):
Template: worker-db-combo
Services: PostgreSQL + Worker
Server 2 (CX11 - €3.69):
Template: control-plane-node
Services: Proxy + Control + Grafana
Server 3 (CX11 - €3.69):
Template: worker-monitor-combo
Services: Worker + VictoriaMetrics + Loki
Total: €14.32/mo
```
### Example 2: Medium Production (4 servers)
```yaml
Server 1 (CX21 - €6.94):
Template: db-node
Services: PostgreSQL + etcd + HAProxy
Server 2 (CX11 - €3.69):
Template: worker-node
Services: Worker + vmagent
Server 3 (CX11 - €3.69):
Template: control-plane-node
Services: Proxy + Control + Grafana
Server 4 (CX11 - €3.69):
Template: monitoring-node
Services: VictoriaMetrics + Loki
Total: €18.01/mo
```
### Example 3: Large Production (6 servers)
```yaml
Server 1-3 (CX21 - €6.94 each):
Template: db-node
Services: PostgreSQL cluster (3 nodes)
Server 4-5 (CX11 - €3.69 each):
Template: worker-node
Services: Workers (2 nodes)
Server 6 (CX11 - €3.69):
Template: control-plane-node
Services: Proxy + Control + Grafana + VictoriaMetrics + Loki
Total: €31.89/mo
```
---
## Template Selection Guide
**Start with these questions**:
1. **What's your budget?**
- ~€15/mo → use composite templates
- ~€25/mo → use pure templates
2. **What's your traffic?**
- <100 req/min → all-in-one
- <1000 req/min → worker-db-combo
- >1000 req/min → pure templates
3. **Do you need database HA?**
- Yes → db-node (3 nodes minimum)
- No → worker-db-combo
4. **Do you need centralized monitoring?**
- Yes → monitoring-node or worker-monitor-combo
- No → Skip (use worker vmagent only)
---
## Control Plane API Integration
Templates are used by the Control Plane API to provision servers:
```http
POST /api/v1/servers
Content-Type: application/json
{
"name": "worker-1",
"template": "worker-node",
"hetzner_plan": "CX11",
"region": "fsn1",
"features": ["worker", "monitoring"],
"environment": "production"
}
```
**Response**:
```json
{
"server_id": "abc123",
"status": "provisioning",
"ip_address": "167.235.123.45",
"services": [
{"name": "worker", "port": 8002},
{"name": "vmagent", "port": 8429}
]
}
```
---
## Resource Profiles
Each service can be tuned with resource profiles:
```yaml
minimal:
cpu_limit: "0.5"
memory_limit: "512Mi"
balanced:
cpu_limit: "2"
memory_limit: "2Gi"
cpu_intensive:
cpu_limit: "4"
memory_limit: "4Gi"
```
Default profiles are assigned in templates but can be overridden:
```http
POST /api/v1/servers
{
"template": "worker-node",
"overrides": {
"worker": {
"resource_profile": "cpu_intensive"
}
}
}
```
---
## Next Steps
1. **Choose template** based on budget and traffic
2. **Provision servers** via Control Plane API or Hetzner CLI
3. **Configure monitoring** (vmagent + promtail)
4. **Verify health** with Grafana dashboards
5. **Scale up/down** as needed
For more details, see:
- `STORAGE_CONFIGURATION.md` - Storage backend setup
- `QUICKSTART_HETZNER_STORAGE.md` - Hetzner Bucket Storage guide
- `4SERVER_DEPLOYMENT_GUIDE.md` - Multi-server deployment


@@ -0,0 +1,97 @@
# MadBase Redis Integration Implementation Plan
## Current Status Analysis
### ✅ Already Implemented
1. **Redis in State Pillar** - Redis is already added to docker-compose.pillar-database.yml
2. **HAProxy Configuration** - Port 6379 routing is configured in autobase-haproxy.cfg
3. **Config Support** - redis_url field exists in common/src/config.rs
4. **L1 Cache Ready** - moka is already in gateway/Cargo.toml
### 🔨 Needs Implementation
## Phase 1: Core Redis Client (Common Crate)
### File: common/src/cache.rs
- Create Redis client wrapper with connection pooling
- Implement L1/L2 cache abstraction layer
- Add distributed locking primitives
- Add session management utilities
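One possible shape for the L1/L2 abstraction, assuming the `async-trait` crate (names and signatures are illustrative, not the final API):
```rust
use async_trait::async_trait;

// A trait both tiers implement, so callers stay agnostic of the backing store.
#[async_trait]
pub trait CacheLayer: Send + Sync {
    async fn get(&self, key: &str) -> Option<Vec<u8>>;
    async fn set(&self, key: &str, value: Vec<u8>, ttl_secs: u64);
    async fn delete(&self, key: &str);
}

// Tiered cache: check L1 (moka) first, fall back to L2 (Redis),
// and backfill L1 on an L2 hit.
pub struct TieredCache {
    pub l1: Box<dyn CacheLayer>,
    pub l2: Box<dyn CacheLayer>,
}

impl TieredCache {
    pub async fn get(&self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.l1.get(key).await {
            return Some(v);
        }
        let v = self.l2.get(key).await?;
        // Short L1 TTL keeps Redis authoritative
        self.l1.set(key, v.clone(), 60).await;
        Some(v)
    }
}
```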
## Phase 2: Application Layer Integration
### Gateway (Proxy/Worker)
- File: gateway/src/cache_layer.rs
- L1 cache for project configs (moka)
- L2 cache for shared state (Redis)
- Cache warming strategies
- Cache invalidation logic
### Auth Module
- File: auth/src/session.rs
- Shared auth sessions across proxies
- Session tokens in Redis
- Multi-proxy logout support
## Phase 3: Features
### Realtime Presence
- File: realtime/src/presence.rs
- Track online users across workers
- Channel presence management
- Heartbeat mechanism with Redis pub/sub
### Distributed Locking
- File: common/src/locking.rs
- Redlock implementation
- Migration coordination
- Background job synchronization
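A minimal building block for this, assuming the `redis` and `uuid` crates: a single-node lock via `SET NX PX` plus an atomic compare-and-delete release. (Full Redlock acquires this lock on a quorum of independent Redis nodes.)
```rust
use uuid::Uuid;

// Try to take the lock; returns the owner token on success.
pub async fn acquire(
    con: &mut redis::aio::MultiplexedConnection,
    name: &str,
    ttl_ms: u64,
) -> redis::RedisResult<Option<String>> {
    let token = Uuid::new_v4().to_string();
    // SET key token NX PX ttl: succeeds only if the key does not exist yet
    let taken: bool = redis::cmd("SET")
        .arg(format!("lock:{name}"))
        .arg(&token)
        .arg("NX")
        .arg("PX")
        .arg(ttl_ms)
        .query_async(con)
        .await?;
    Ok(taken.then_some(token))
}

// Release only if we still own the lock; the compare-and-delete must be
// atomic (hence Lua), since a plain GET+DEL would race with TTL expiry.
pub async fn release(
    con: &mut redis::aio::MultiplexedConnection,
    name: &str,
    token: &str,
) -> redis::RedisResult<bool> {
    let script = redis::Script::new(
        r"if redis.call('GET', KEYS[1]) == ARGV[1] then
              return redis.call('DEL', KEYS[1])
          else
              return 0
          end",
    );
    let deleted: i32 = script
        .key(format!("lock:{name}"))
        .arg(token)
        .invoke_async(con)
        .await?;
    Ok(deleted == 1)
}
```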
### Rate Limiting
- File: gateway/src/rate_limit.rs
- Sliding window rate limiting
- Distributed counters
- IP-based and user-based limits
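A sliding-window sketch over a Redis sorted set, assuming the `redis` crate's pipeline API (integer parameter types vary slightly across crate versions): each request becomes a member scored by its timestamp, so trimming then counting the set yields the exact count inside the window.
```rust
use std::time::{SystemTime, UNIX_EPOCH};

pub async fn allow_request(
    con: &mut redis::aio::MultiplexedConnection,
    ip: &str,
    limit: usize,
    window_secs: u64,
) -> redis::RedisResult<bool> {
    // Key layout follows the ratelimit:ip:{addr} pattern from CACHING_STRATEGY.md
    let key = format!("ratelimit:ip:{ip}");
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_millis() as u64;
    let window_start = now_ms - window_secs * 1000;

    // Trim, record, count, and refresh the TTL in one atomic round trip.
    // (Sub-millisecond member collisions are acceptable for a sketch.)
    let (_removed, _added, count, _ttl): (usize, usize, usize, usize) = redis::pipe()
        .atomic()
        .zrembyscore(&key, 0, window_start)
        .zadd(&key, now_ms, now_ms)
        .zcard(&key)
        .expire(&key, window_secs as i64)
        .query_async(con)
        .await?;
    Ok(count <= limit)
}
```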
## Phase 4: Updates to Existing Files
### Templates
- Update db-node.yaml to include Redis service definition
- Update all templates that include PostgreSQL to also include Redis
### Documentation
- Update AUTOBASE.md to reflect "State Node" concept
- Create CACHING_STRATEGY.md with architecture details
- Update NODE_TEMPLATES.md with Redis information
### Tests
- Integration tests for cache layers
- Failover tests for Redis HA
- Performance benchmarks
## Implementation Order
1. **Common Cache Layer** (Priority 1)
- Redis client with pooling
- Cache abstraction (L1/L2)
- Basic operations (get/set/delete)
2. **Auth Sessions** (Priority 1)
- Shared session store
- Multi-proxy support
3. **Presence Tracking** (Priority 2)
- User online status
- Channel presence
4. **Distributed Locking** (Priority 2)
- Migration coordination
- Background job locks
5. **Rate Limiting** (Priority 3)
- Distributed rate limiting
- Sliding windows
6. **Documentation & Tests** (Priority 4)
- Update all docs
- Add comprehensive tests

docs/STORAGE.md Normal file

@@ -0,0 +1,37 @@
# MadBase Storage & Persistence
MadBase uses a tiered storage approach combining local persistence, distributed databases, and S3-compatible object storage.
## 1. Local Block Storage
Nodes use local NVMe/SSD storage for high-performance service data:
- **Database**: PostgreSQL data stored in `/var/lib/postgresql/data`.
- **Metrics**: VictoriaMetrics data stored in `/victoria-metrics-data`.
- **Logs**: Loki chunks stored in `/loki`.
## 2. Object Storage (S3)
Used for backups, static assets, and long-term state retention.
- **Backups**: Database backups are automatically piped to S3-compatible buckets.
- **Provider**: Works with Hetzner Bucket Storage, AWS S3, or MinIO.
### Configuration
```bash
S3_ENDPOINT="https://fsn1.your-storage-endpoint.com"
S3_ACCESS_KEY="your_key"
S3_SECRET_KEY="your_secret"
S3_BUCKET="madbase-backups"
```
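To verify the credentials, any S3-compatible client works; for example, with the AWS CLI:
```bash
# List the backup bucket through the configured endpoint
AWS_ACCESS_KEY_ID=$S3_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$S3_SECRET_KEY \
  aws s3 ls "s3://$S3_BUCKET" --endpoint-url "$S3_ENDPOINT"
```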
## 3. Backup & Restore
### Manual Backup
```bash
curl -X POST http://localhost:8001/api/v1/cluster/database/backup
```
### Manual Restore
```bash
curl -X POST http://localhost:8001/api/v1/cluster/database/restore \
-d '{"backup_url": "s3://..."}'
```
## 4. Planning & Capacity
- **DB Nodes**: Scale vertically (RAM) for active datasets; scale horizontally (nodes) for read throughput and HA.
- **Retention**: Configure VictoriaMetrics and Loki retention periods in the System Node config to manage disk usage.

docs/WASI_DENO.md Normal file

@@ -0,0 +1,52 @@
# Plan: Deno Compatibility for MadBase Edge Functions
## Problem Statement
Currently, MadBase executes Edge Functions as WASM modules via `wasmtime`. Supabase-compatible Edge Functions (like those in `accountaflow`) are written in TypeScript and target a Deno environment. Migrating these requires 1:1 compatibility for the `Deno` namespace, ES modules, and standard web APIs (Fetch, Request, Response).
## Proposed Architecture
### 1. Dual-Runtime Strategy
Extend the `functions` crate to support two runtimes:
- **WasmRuntime**: Existing `wasmtime` based executor for compiled modules.
- **DenoRuntime**: A new V8-based executor utilizing `deno_core` and `deno_runtime`.
### 2. Runtime Detection
The gateway should detect the function type:
- **DenoRuntime (V8)**: Files ending in `.ts` or `.js`. Recommended for standard Edge Functions due to JIT-optimized performance.
- **WasmRuntime (Wasmtime)**: Native WASM binaries (Rust, Go, C++). Best for specialized, high-performance logic or pre-compiled modules.
## Implementation Steps
### Phase 1: Core Integration
- Add `deno_core` and `deno_runtime` dependencies to `madbase/functions/Cargo.toml`.
- Create `functions/src/deno_runtime.rs`.
- Implement `execute_script(code: String, payload: Value)` using `JsRuntime`.
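A minimal sketch of that entry point (deno_core's API changes between releases, so treat names and signatures as approximate; the `MADBASE_PAYLOAD` global is illustrative):
```rust
use deno_core::{JsRuntime, RuntimeOptions};
use serde_json::Value;

// Run user code in a fresh V8 isolate with the request payload exposed
// as a global. Deno.serve shimming and module resolution (Phases 2 and 3)
// layer on top of this entry point.
pub fn execute_script(code: String, payload: Value) -> Result<(), Box<dyn std::error::Error>> {
    let mut runtime = JsRuntime::new(RuntimeOptions::default());

    // serde_json::Value renders as JSON, which is valid JS expression syntax
    let prelude = format!("globalThis.MADBASE_PAYLOAD = {payload};");
    runtime.execute_script("<prelude>", prelude)?;
    runtime.execute_script("<function>", code)?;
    Ok(())
}
```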
### Phase 2: Supabase Environment Compatibility
- **Process Environment**: Inject `SUPABASE_URL`, `SUPABASE_ANON_KEY`, and `SUPABASE_SERVICE_ROLE_KEY`.
- **Global Objects**: Implement a shim for `Deno.serve` to capture the incoming request and route it to the script's handler.
- **Header Parsing**: Ensure standard headers (`apikey`, `Authorization`) are passed through.
### Phase 3: Module Resolution
- Implement a `ModuleLoader` that handles imports from `https://esm.sh/`.
- Support local imports from a shared functions directory (like `_shared`).
## API Changes
### Gateway
Modify `POST /functions/v1` to accept `type: "typescript" | "wasm"`. Default to "typescript" for source code.
### Deployment Table
Update the `functions` table schema in the control plane to store the runtime type.
## Verification Plan
### Automated Tests
1. **Hello World Test**: Deploy a simple `.ts` function and verify the output.
2. **Supabase Client Test**: Deploy a function that imports `@supabase/supabase-js` from `esm.sh` and queries the MadBase Data API.
3. **Environment Variable Test**: Verify `Deno.env.get` returns expected MadBase configuration.
### Manual Verification
1. Attempt to deploy the `invite-staff` function from `accountaflow` directly to MadBase.
2. Verify cross-organization invitation logic works.