wip:milestone 0 fixes
Some checks failed
CI/CD Pipeline / unit-tests (push) Failing after 1m16s
CI/CD Pipeline / integration-tests (push) Failing after 2m32s
CI/CD Pipeline / lint (push) Successful in 5m22s
CI/CD Pipeline / e2e-tests (push) Has been skipped
CI/CD Pipeline / build (push) Has been skipped

This commit is contained in:
2026-03-15 12:35:42 +02:00
parent 6708cf28a7
commit cffdf8af86
61266 changed files with 4511646 additions and 1938 deletions

docs/AUTOBASE.md Normal file

@@ -0,0 +1,105 @@
# MadBase State Pillar (Autobase + Redis)
## Architecture
The **State Pillar** (Pillar 3) is the centralized data layer of MadBase, hosting both durable and ephemeral state:
- **PostgreSQL**: Persistent relational data (users, projects, storage metadata)
- **Autobase**: HA and quorum management for PostgreSQL
- **Redis**: High-performance caching and distributed state
- **HAProxy**: Unified entry point for both databases
## Components
### PostgreSQL (Persistent State)
- **Port**: 5432 (direct), 5433 (via HAProxy)
- **Purpose**: ACID-compliant data storage
- **Features**:
- Automatic failover via Patroni
- etcd for leader election
- Replication for high availability
### Redis (Ephemeral State)
- **Port**: 6379 (via HAProxy)
- **Purpose**: Shared caching and distributed coordination
- **Features**:
- In-memory data structures
- TTL-based auto-expiration
- Pub/Sub messaging
- Atomic operations
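As a minimal sketch of the last two features, using `redis-cli` against the State Pillar (key names are illustrative):
```bash
# TTL-based auto-expiration: the session key vanishes after one hour
redis-cli -h db-node SET session:abc123 user-42 EX 3600
# Atomic operation: INCR is safe under concurrent writers, which is
# what makes Redis suitable for cluster-wide rate-limit counters
redis-cli -h db-node INCR ratelimit:ip:203.0.113.7
```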
### Autobase Integration
MadBase uses **Autobase** (PostgreSQL + Patroni + etcd) to provide a high-availability, self-healing database layer.
## High Availability
A minimum of **3 nodes** is required for quorum:
- If the primary PostgreSQL fails, Patroni promotes a standby in under 30 seconds
- HAProxy automatically redirects traffic to the new leader
- Redis uses Sentinel or Cluster for automatic failover
## Scaling
### Initial Setup
- 1 node (non-HA, development)
### Production
- 3 or 5 nodes (HA with quorum)
### Scaling Command
```bash
curl -X POST http://localhost:8001/api/v1/cluster/scale \
  -H "Content-Type: application/json" \
  -d '{ "target_db_count": 3, "min_ha_nodes": true }'
```
## Use Cases
### PostgreSQL (Persistent Data)
- User accounts and authentication
- Project configurations
- Storage metadata
- Function deployments
- Audit logs
### Redis (Ephemeral Data)
- User sessions (shared across proxies)
- Realtime presence tracking
- Rate limiting counters
- Distributed locks
- API response caching
## Monitoring
Database health is monitored via the System Node:
- Check Patroni status: `curl http://db-node:8008/health`
- Check Redis: `redis-cli -h db-node ping`
- HAProxy Stats: http://db-node:7000
- Metrics available in "State Pillar Performance" Grafana dashboard
## Backup Strategy
- **PostgreSQL**: Daily automated backups to S3
- **Redis**: Periodic RDB snapshots (configured via Redis config)
- **HAProxy**: Configuration managed via Infrastructure as Code
## Configuration
### Environment Variables
```bash
DATABASE_URL="postgres://user:pass@db:5432/madbase"
REDIS_URL="redis://db:6379/0"
PATRONI_SCOPE=madbase-cluster
```
### Resource Requirements
| Plan | RAM | CPU | Max Concurrent Connections |
|------|-----|-----|---------------------------|
| CX21 | 4GB | 2 | 100 |
| CX31 | 8GB | 2 | 200 |
| CX41 | 16GB | 4 | 500 |
See [CACHING_STRATEGY.md](CACHING_STRATEGY.md) for detailed caching information.

docs/CACHING_STRATEGY.md Normal file

@@ -0,0 +1,249 @@
# MadBase Caching Strategy
## Overview
MadBase implements a **two-tier caching architecture** that maintains the simplicity of the 4-pillar system while providing enterprise-grade caching capabilities.
## Architecture
### Tier 1: L1 Cache (In-Memory)
- **Technology**: moka (Rust)
- **Location**: Proxy / Worker nodes
- **Purpose**: Ultra-low latency for frequently accessed data
- **Typical Use Cases**:
- Project configurations
- JWT validation cache
- Hot database query results
- API response caching
### Tier 2: L2 Cache (Redis)
- **Technology**: Redis 7
- **Location**: State Pillar (Pillar 3)
- **Purpose**: Shared state across the entire cluster
- **Typical Use Cases**:
- Distributed session storage
- Realtime presence tracking
- Rate limiting counters
- Distributed locking
- Pub/Sub messaging
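As a minimal get-through sketch of the two tiers, assuming recent `moka` and `redis` crates (helper names are illustrative, not MadBase's actual API):
```rust
use std::time::Duration;
use moka::future::Cache;
use redis::AsyncCommands;

// Check L1 first, fall back to L2, and backfill L1 on an L2 hit.
async fn get_through(
    l1: &Cache<String, String>,
    redis: &redis::Client,
    key: &str,
) -> redis::RedisResult<Option<String>> {
    // L1: process-local, sub-microsecond
    if let Some(v) = l1.get(key).await {
        return Ok(Some(v));
    }
    // L2: shared across the cluster via the State Pillar
    let mut con = redis.get_multiplexed_async_connection().await?;
    let v: Option<String> = con.get(key).await?;
    if let Some(ref hit) = v {
        l1.insert(key.to_string(), hit.clone()).await;
    }
    Ok(v)
}

fn build_l1() -> Cache<String, String> {
    Cache::builder()
        .max_capacity(10_000)
        // Keep L1 entries short-lived so L2 stays authoritative
        .time_to_live(Duration::from_secs(60))
        .build()
}
```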
## State Pillar Integration
The **State Pillar** (formerly "Database Pillar") now hosts both PostgreSQL and Redis:
```
┌─────────────────────────────────────────┐
│ State Pillar Node │
├─────────────────────────────────────────┤
│ ┌──────────┐ ┌─────────────┐ │
│ │PostgreSQL│ │ Redis │ │
│ │ :5432 │ │ :6379 │ │
│ └──────────┘ └─────────────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ HAProxy │ │
│ │ :5433/:6379 │ │
│ └─────────────┘ │
└─────────────────────────────────────────┘
```
### Why This Approach?
1. **Resource Symmetry**: Both PostgreSQL and Redis are memory-intensive and share the same VPS requirements
2. **HA Piggybacking**: Pillar 3 already manages HA via Patroni and etcd. Redis benefits from the same infrastructure
3. **Centralized State**: Maintains clean separation of Compute (Worker/Proxy) vs. State (DB/Redis)
4. **Zero Complexity**: No new pillar is needed; the existing one is simply enhanced
## Features
### 1. Shared Auth Sessions
Users can now stay logged in even if the Proxy node handling their request changes:
```rust
use auth::SessionManager;
// Create a session
let session_token = session_manager
.create_session(user_id, email, "authenticated".to_string())
.await?;
// Validate on any proxy node
let session = session_manager
.validate_session(&session_token)
.await?;
```
### 2. Realtime Presence
Track "Who is online" across multiple Worker nodes:
```rust
use realtime::PresenceManager;
// User joins a channel
presence_manager
.join_channel(user_id, "public-chat".to_string(), None)
.await?;
// Get online count
let count = presence_manager
.get_channel_online_count("public-chat".to_string())
.await?;
```
### 3. Distributed Locking
Prevent race conditions during background operations:
```rust
use common::DistributedLock;
let lock = DistributedLock::new(
redis_client,
"migration:lock".to_string(),
30, // 30 seconds TTL
);
if lock.acquire().await? {
// Perform critical section
lock.release().await?;
}
```
### 4. Rate Limiting
Distributed rate limiting across all instances:
```rust
use gateway::rate_limit::RateLimitMiddleware;
// Check IP-based rate limit
if !middleware.check_ip(&user_ip).await? {
    return Err("Rate limit exceeded".into());
}
```
## Configuration
### Environment Variables
```bash
# PostgreSQL
DATABASE_URL="postgres://user:pass@db:5432/madbase"
# Redis (optional; falls back to L1-only caching when unset)
REDIS_URL="redis://db:6379/0"
# Cache TTL
CACHE_TTL_SECONDS=3600
```
### Cache Keyspaces
| Pattern | Purpose | TTL |
|---------|---------|-----|
| `session:{token}` | User sessions | 3600s |
| `presence:channel:{name}:user:{id}` | User presence | 60s |
| `ratelimit:ip:{addr}` | IP rate limiting | 60s |
| `ratelimit:user:{id}` | User rate limiting | 60s |
| `lock:{name}` | Distributed locks | Configurable |
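Key construction is best funneled through one helper per pattern so the keyspace stays consistent; a sketch (helper names are illustrative):
```rust
// Helpers mirroring the keyspace table above (names are illustrative).
fn session_key(token: &str) -> String {
    format!("session:{token}")
}

fn presence_key(channel: &str, user_id: &str) -> String {
    format!("presence:channel:{channel}:user:{user_id}")
}

fn ip_rate_limit_key(addr: &str) -> String {
    format!("ratelimit:ip:{addr}")
}
```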
## HAProxy Configuration
The State Pillar's HAProxy routes both PostgreSQL and Redis traffic:
```haproxy
listen primary
bind *:5433
mode tcp
server patroni1 patroni:5432 check
listen redis
bind *:6379
mode tcp
server redis1 redis:6379 check
```
## Scaling Strategy
### Horizontal Scaling
- **Proxy Nodes**: Add more proxies, all share the same Redis cache
- **Worker Nodes**: Add more workers, presence tracking works seamlessly
- **State Nodes**: Scale to 3 or 5 nodes for HA, Redis is replicated via Sentinel/Cluster
### Vertical Scaling
- Upgrade State Node plan for more RAM (benefits both PostgreSQL and Redis)
- Typical: CX21 (4GB) → CX31 (8GB) → CX41 (16GB)
## Monitoring
Redis is monitored alongside PostgreSQL:
- **HAProxy Stats**: http://db-node:7000
- **Grafana Dashboard**: "State Pillar Performance"
- **Metrics**:
- Redis memory usage
- Cache hit/miss ratios
- Connection pool utilization
- Rate limit enforcement
## Best Practices
1. **Session Management**: Use appropriate TTLs (shorter for sensitive data)
2. **Presence Tracking**: Implement heartbeats to keep users "online"
3. **Rate Limiting**: Use different limits for different user tiers
4. **Distributed Locks**: Always set reasonable TTLs to prevent deadlocks
5. **Cache Invalidation**: Use versioned keys or explicit deletion
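A sketch of practice 5 using versioned keys, assuming the `redis` crate (the `cachever:` namespace is illustrative): bumping a version counter invalidates every derived key at once, and stale entries simply age out via TTL.
```rust
use redis::AsyncCommands;

// Derive the current cache key for a project-scoped entry.
async fn project_cache_key(
    con: &mut redis::aio::MultiplexedConnection,
    project_id: &str,
    suffix: &str,
) -> redis::RedisResult<String> {
    let version: Option<i64> = con.get(format!("cachever:project:{project_id}")).await?;
    Ok(format!("project:{project_id}:v{}:{suffix}", version.unwrap_or(0)))
}

// Invalidate everything cached for a project with one atomic INCR:
// readers immediately derive new keys, old ones expire via TTL.
async fn invalidate_project(
    con: &mut redis::aio::MultiplexedConnection,
    project_id: &str,
) -> redis::RedisResult<()> {
    let _: i64 = con.incr(format!("cachever:project:{project_id}"), 1).await?;
    Ok(())
}
```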
## Migration Guide
### From Single-Node to Cluster
1. Update State Pillar image to include Redis
2. Set `REDIS_URL` in all Proxy/Worker configurations
3. Deploy SessionManager in Auth handlers
4. Enable presence tracking in Realtime module
5. Update rate limiting to use distributed counters
### Testing
```bash
# Test Redis connection
redis-cli -h db-node ping
# Test session creation
curl -X POST http://localhost:8000/auth/v1/token \
-d '{"email":"test@example.com","password":"password"}'
# Check presence
redis-cli -h db-node SMEMBERS "presence:channel:public:users"
```
## Performance
### Expected Latency
| Operation | L1 Cache (moka) | L2 Cache (Redis) | Database |
|-----------|-----------------|------------------|----------|
| Get | <1μs | 1-2ms | 10-50ms |
| Set | <1μs | 1-2ms | 10-50ms |
| Delete | <1μs | 1-2ms | 10-50ms |
### Cache Hit Ratios
- **L1 Hit**: 95%+ for frequently accessed data
- **L2 Hit**: 80%+ for shared state
- **Miss**: Falls through to database
## Future Enhancements
- [ ] Redis Cluster for horizontal scaling
- [ ] Pub/Sub for real-time events
- [ ] Bloom filters for existence checks
- [ ] HyperLogLog for cardinality estimation
- [ ] Geospatial indexing for location features

docs/DEPLOYMENT_GUIDE.md Normal file

@@ -0,0 +1,70 @@
# MadBase Deployment Guide
This guide covers everything from initial setup to high-availability scaling on Hetzner Cloud and other providers.
## 1. Prerequisites
1. **Hetzner Cloud Account** with API token (or other supported provider).
2. **SSH Key** added to your provider.
3. **PostgreSQL database** for the Control Plane state.
4. **Docker** installed for local development and service deployment.
## 2. Setting Up the Control Plane
### Step 1: Environment Configuration
```bash
export HETZNER_API_KEY="your_token"
export DATABASE_URL="postgresql://user:pass@localhost/madbase_control_plane"
export HETZNER_SSH_KEY_PATH="$HOME/.ssh/id_rsa"
```
### Step 2: Run the API
```bash
docker run -p 8001:8001 \
-e DATABASE_URL=$DATABASE_URL \
-e HETZNER_API_KEY=$HETZNER_API_KEY \
madbase/control-plane
```
## 3. Provisioning a Cluster
### Adding a Node
To add a node, send a POST request to the Control Plane API:
```bash
curl -X POST http://localhost:8001/api/v1/servers \
  -H "Content-Type: application/json" \
  -d '{
"name": "worker-1",
"template": "worker-node",
"hetzner_plan": "CX11",
"region": "fsn1"
}'
```
Refer to [NODE_TEMPLATES.md](NODE_TEMPLATES.md) for available templates.
## 4. Scaling Strategies
### Horizontal Scaling
The **Proxy/API** and **Worker** pillars are designed for horizontal expansion.
- Use `POST /api/v1/cluster/scale` to target a specific node count.
- The system handles drain-and-remove logic for safe scale-down.
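For example (the payload mirrors the scaling example in [AUTOBASE.md](AUTOBASE.md)):
```bash
curl -X POST http://localhost:8001/api/v1/cluster/scale \
  -H "Content-Type: application/json" \
  -d '{ "target_db_count": 3, "min_ha_nodes": true }'
```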
### Vertical Scaling (System Node)
The **System Node** does not scale horizontally. To scale it:
1. Upgrade the VPS plan in the Hetzner console.
2. The Control Plane will detect the resource change on restart.
## 5. Security Hardening
Use the `/fortify` endpoint to secure your nodes:
- Configures Hetzner Cloud Firewalls.
- Disables root/password SSH access.
- Installs `fail2ban`.
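For example (the exact route and parameters are an assumption; check your Control Plane version):
```bash
curl -X POST http://localhost:8001/fortify
```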
## 6. High Availability (HA)
For production deployments, always aim for:
- 3+ Database nodes (for Quorum).
- 2+ Proxy nodes (for Ingress HA).
- Distributed regions (e.g., `fsn1`, `nbg1`).
---
For more details on multiple providers, see the specialized `MULTI_PROVIDER_VPS.md` implementation notes.

docs/KERNEL_ARCHITECTURE.md Normal file

@@ -0,0 +1,128 @@
# MadBase Kernel Architecture
This document defines the core organization and security model of the MadBase infrastructure.
## Documentation Map
- [**Deployment Guide**](DEPLOYMENT_GUIDE.md): Setup, Scaling, and Provider configuration.
- [**Storage & Persistence**](STORAGE.md): DB, S3, and Backups.
- [**State Pillar (Autobase + Redis)**](AUTOBASE.md): High-Availability State Node details.
- [**Caching Strategy**](CACHING_STRATEGY.md): Two-tier caching architecture.
- [**Node Templates**](NODE_TEMPLATES.md): Reference for server plans and services.
---
The "Kernel" architecture is the simplified, core organizational model for MadBase deployments. It collapses complex node roles into four manageable pillars, each with specific scaling characteristics and duties.
## 0. System Pillar (The Foundation)
A horizontally static but **vertically scalable** "seed" node that provides the cluster's base services.
- **Components**:
- **Control Plane API**: Cluster management and orchestration.
- **Observability**: VictoriaMetrics, Loki, Grafana.
- **Scaling**: Static horizontally (1 node). Supports **Vertical Scaling** via VPS plan upgrades (e.g., CX21 to CX41).
## 1. Proxy / Public API (The Face)
This pillar handles external communication and the public-facing API layer.
- **Components**:
- **Gateway Proxy**: Ingress, SSL, and request routing.
- **Public API**: The core platform API (Auth, Storage metadata, etc.).
- **L1 Cache**: In-memory caching (moka) for ultra-low latency.
- **Scaling**:
- **Range**: 1 to 100 nodes.
- **Constraints**: Horizontally scalable via Anycast or Floating IP.
## 2. Worker (The Muscle)
This pillar executes business logic and Edge Functions.
- **Components**:
- **Compute**: Deno/Wasm runners.
- **Realtime**: WebSocket managers with presence tracking.
- **L1 Cache**: In-memory caching for function results.
- **Scaling**: 1+ nodes.
- **Constraints**: Unlimited horizontal scaling.
## 3. State Pillar (The Memory)
Ensures data persistence, consistency, and distributed coordination.
- **Components**:
- **PostgreSQL**: Primary data store (via Autobase).
- **Redis**: High-performance distributed cache.
- **HAProxy**: Unified entry point for both databases.
- **Scaling**: 1, 3, or 5 nodes (Must be odd for quorum).
- **Features**:
- Shared auth sessions across proxies
- Realtime presence tracking across workers
- Distributed locking for migrations
- Cluster-wide rate limiting
---
## Observability Strategy (Metrics & Logs)
To maintain the performance of the four pillars, we implement a dedicated **System Pillar** for observability, either co-located with the Control nodes or split out to its own node depending on scale.
- **VictoriaMetrics (VM)**: Fast, cost-effective time-series database for metrics.
- **Loki**: Distributed log aggregation.
- **Placement**:
- Small Clusters: Embedded in the **Control** nodes.
- High Throughput: Dedicated `system-node` to prevent observability overhead from impacting the application pillars.
---
## Network Isolation & Security Zones
To ensure "Defense in Depth," the kernel is divided into two distinct network zones:
### 1. Public Zone (The DMZ)
- **Deployment**: Nodes have a Public IP and are attached to the Cluster VPC.
- **Pillars**:
- **System Node**: For cluster administration and dashboard access.
- **Proxy / Public API**: For handling all incoming internet traffic.
- **Access**: Restricted to HTTPS (443) and SSH (via allow-list).
### 2. Private Zone (The Core)
- **Deployment**: Nodes have **No Public IP**. They are accessible ONLY via the Cluster VPC (Private Network).
- **Pillars**:
- **Worker Pillar**: Executes application code.
- **State Pillar**: Stores sensitive project data (PostgreSQL + Redis).
- **Access**: No direct internet access. All ingress must pass through the Proxy/API pillar. Egress is managed via a NAT Gateway (optional) or limited to OS updates.
---
## State Pillar: The "Memory" of the Cluster
The State Pillar combines **persistent storage** (PostgreSQL) and **ephemeral state** (Redis) into a single, highly available unit:
### Why Combine Them?
1. **Resource Symmetry**: Both PostgreSQL and Redis are memory-intensive and benefit from the same VPS plans (High-RAM nodes).
2. **HA Piggybacking**: Pillar 3 already manages HA via Patroni and etcd. Redis leverages the same infrastructure.
3. **Centralized Coordination**: Having all state (durable and ephemeral) in one place simplifies the architecture.
4. **Zero Complexity**: No new pillar needed—we just enhanced the existing "Database" pillar.
### Cache Distribution
- **L1 Cache** (moka): Runs on each Proxy/Worker node for ultra-low latency
- **L2 Cache** (Redis): Runs on State Pillar nodes for shared state
```
┌─────────────┐ ┌─────────────┐
│ Proxy 1 │ │ Worker 1 │
│ (L1 Cache) │ │ (L1 Cache) │
└──────┬──────┘ └──────┬──────┘
│ │
└──────────┬───────────┘
┌────────▼────────┐
│ State Pillar │
│ ┌──────────┐ │
│ │PostgreSQL│ │
│ └──────────┘ │
│ ┌──────────┐ │
│ │ Redis │ │
│ │(L2 Cache)│ │
│ └──────────┘ │
│ ┌──────────┐ │
│ │ HAProxy │ │
│ └──────────┘ │
└─────────────────┘
```
For detailed caching architecture, see [CACHING_STRATEGY.md](CACHING_STRATEGY.md).

docs/NODE_TEMPLATES.md Normal file

@@ -0,0 +1,567 @@
# Node Templates - Quick Reference
Complete guide to MadBase node templates for Hetzner Cloud deployment.
## Template Overview
| Template | Pillar | Zone | Min Plan | Cost/Mo | Use Case | Services |
|----------|--------|------|----------|---------|----------|----------|
| **system-node** | System | Public | CX21 | €6.94 | Cluster Root | Control API + Grafana + VM + Loki |
| **proxy-api-node** | Proxy / API | Public | CX11 | €3.69 | Scalable Ingress | Gateway + Platform API |
| **worker-node** | Worker | Private | CX11 | €3.69 | Horizontal scaling | Worker + vmagent |
| **db-node** | DB / State | Private | CX21 | €6.94 | Production database HA | PostgreSQL + Patroni + etcd + HAProxy |
| **worker-db-combo** ⭐ | Mixed | Private | CX31 | €14.21 | Smaller deployments | Worker + PostgreSQL + etcd + HAProxy |
| **worker-monitor-combo** ⭐ | Mixed | Private | CX21 | €6.94 | Cost-optimized | Worker + VictoriaMetrics + Loki |
| **all-in-one** ⭐ | Unified | Public | CX41 | €25.60 | Development/MVP | All services on one node |
⭐ = Composite template (mixes multiple service types)
---
## Pure Templates (Single Service Type)
### 1. Database Node (db-node.yaml)
**Best for**: Production deployments requiring database HA
**Server**: CX21 (4GB RAM, 2 vCPU)
**Services**:
- PostgreSQL 15 with Patroni (auto-failover)
- etcd (distributed consensus)
- HAProxy (connection pooling + read/write splitting)
**Scaling**: 3-7 nodes (odd number for quorum)
**When to use**:
- Production traffic >1000 req/min
- Need database auto-failover
- Want separate database cluster
### 2. Worker Node (worker-node.yaml)
**Best for**: Horizontal scaling of API workers
**Server**: CX11 (4GB RAM, 2 vCPU)
**Services**:
- MadBase Worker (API processing)
- vmagent (metrics collection)
**Scaling**: 1-20 nodes
**Auto-scaling rules**:
- Scale up: CPU > 70%
- Scale down: CPU < 20%
**When to use**:
- Need to scale workers independently
- Separate database cluster already exists
- Production deployments
### 3. Control Plane Node (control-plane-node.yaml)
**Best for**: Management UI and APIs
**Server**: CX11 (4GB RAM, 2 vCPU)
**Services**:
- Gateway Proxy (port 8080)
- Control Plane API (port 8001)
- Grafana (port 3030)
- Keepalived (HA with floating IP)
**Scaling**: 1-2 nodes (HA mode)
**When to use**:
- Need web UI for server management
- Want to provision servers via API
- Production deployments
### 4. Monitoring Node (monitoring-node.yaml)
**Best for**: Centralized metrics and logging
**Server**: CX11 (4GB RAM, 2 vCPU)
**Services**:
- VictoriaMetrics (metrics database)
- Loki (log aggregation)
- Alertmanager (optional)
**Scaling**: 1-2 nodes (can be HA)
**When to use**:
- Production deployments
- Want centralized monitoring
- Need log aggregation
---
## Composite Templates (Mix Multiple Service Types)
### 5. Worker + Database Combo (worker-db-combo.yaml) ⭐
**Best for**: 2-3 server deployments with database and worker on same node
**Server**: CX31 (8GB RAM, 2 vCPU)
**Services**:
- PostgreSQL 15 with Patroni
- etcd
- HAProxy
- MadBase Worker
- vmagent
**Why use this**:
- Cost savings on smaller plans (e.g. €6.94 on a CX21, vs €10.63 for separate db-node + worker-node)
- Simpler architecture for smaller deployments
- Easy to scale later
**Scaling**: 1-2 nodes
**Upgrade path**: When CPU > 60% or RAM > 70%, migrate to dedicated db-node + worker-node
**Deployment example**:
```yaml
Server 1 (worker-db-combo): PostgreSQL + Worker
Server 2 (control-plane): Proxy + Control + Grafana
Server 3 (monitoring): VictoriaMetrics + Loki
```
### 6. Worker + Monitoring Combo (worker-monitor-combo.yaml) ⭐
**Best for**: Cost-optimized deployments with monitoring on worker node
**Server**: CX21 (4GB RAM, 2 vCPU)
**Services**:
- MadBase Worker
- VictoriaMetrics
- Loki
- vmagent
- Promtail
**Why use this**:
- Save €3.69/mo (no dedicated monitoring node)
- Monitoring co-located with worker
- Good for 2-3 server deployments
**Scaling**: 1-3 nodes
**When to upgrade**:
- Worker CPU > 60% (monitoring competes for resources)
- Need to scale workers horizontally
**Deployment example**:
```yaml
Server 1 (worker-monitor-combo): Worker + VictoriaMetrics + Loki
Server 2 (db-node): PostgreSQL + etcd + HAProxy
Server 3 (control-plane): Proxy + Control + Grafana
```
### 7. All-in-One (all-in-one.yaml) ⭐
**Best for**: Development, testing, or MVP deployments
**Server**: CX41 (16GB RAM, 4 vCPU)
**Services**: ALL (PostgreSQL, etcd, HAProxy, Redis, MinIO, Workers, Proxy, Control, VictoriaMetrics, Loki, Grafana)
**Why use this**:
- Simplest deployment
- Single server for everything
- Great for development/testing
**When to upgrade**:
- Production traffic > 100 req/min
- CPU usage > 70% sustained
- Need HA for database
---
## Monitoring Stack: VictoriaMetrics + Loki
### How It Works
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ vmagent │─┼─────────┼─│ vmagent │─┼─────────┼─│ vmagent │─┼──┐
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ Scans: │ │ Scans: │ │ Scans: │ │
│ - worker │ │ - worker │ │ - db │ │
│ - system │ │ - system │ │ - system │ │
└──────────────┘ └──────────────┘ └──────────────┘ │
┌───────────────────────┐
│ VictoriaMetrics │
│ Port: 8428 │
│ Type: Metrics DB │
└───────────┬───────────┘
┌───────────────────────┐
│ Grafana │
│ Port: 3030 │
│ Queries VM + Loki │
└───────────────────────┘
┌──────────────┐ ┌──────────────┐
│ Node 1 │ │ Node 2 │
│ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Promtail │─┼─────────┼─│ Promtail │─┼───┐
│ └──────────┘ │ │ └──────────┘ │ │
│ Reads: │ │ Reads: │ │
│ - logs/* │ │ - logs/* │ │
└──────────────┘ └──────────────┘ │
┌───────────────────────┐
│ Loki │
│ Port: 3100 │
│ Type: Log Aggregation│
└───────────┬───────────┘
┌───────────────────────┐
│ Grafana │
│ LogQL Queries │
└───────────────────────┘
```
### Components
#### VictoriaMetrics (Metrics Database)
**Purpose**: Store and query time-series metrics
**Location**:
- Dedicated monitoring-node (recommended)
- worker-monitor-combo (cost-optimized)
- all-in-one (development)
**Data Flow**:
1. vmagent on each node scrapes metrics every 15s
2. Metrics sent to VictoriaMetrics via remote write
3. VictoriaMetrics stores metrics with 10x compression
4. Grafana queries VictoriaMetrics for dashboards
**Metrics Collected**:
- **Worker**: Request rate, error rate, latency, queue depth
- **PostgreSQL**: Connections, transactions, replication lag
- **System**: CPU, memory, disk, network
- **HAProxy**: Connection count, response time
**Storage Requirements**:
- ~1GB per million time series per day (compressed)
- Default retention: 30 days
- RAM: Minimal, scales with active queries
#### Loki (Log Aggregation)
**Purpose**: Store and query logs
**Location**:
- Dedicated monitoring-node (recommended)
- worker-monitor-combo (cost-optimized)
- all-in-one (development)
**Data Flow**:
1. Promtail on each node tails log files
2. Logs sent to Loki via HTTP API
3. Loki indexes logs by labels (service, level, host)
4. Grafana queries Loki using LogQL
**Logs Collected**:
- **Worker**: `/var/log/madbase/worker.log`
- **PostgreSQL**: `/var/log/postgresql/*.log`
- **System**: `/var/log/syslog`
**Storage Requirements**:
- ~10% of raw log size (with compression)
- Default retention: 30 days
- RAM: Minimal, scales with active queries
#### vmagent (Metrics Collector)
**Purpose**: Scrape metrics and send to VictoriaMetrics
**Location**: Runs on EVERY node
**Port**: 8429 (local debug endpoint)
**Configuration**: `config/vmagent.yml`
**Scrape Targets**:
- Worker: `localhost:8002/metrics`
- Patroni: `localhost:8008/metrics`
- Node Exporter: `localhost:9100/metrics`
- HAProxy: `localhost:7000/metrics`
**Resource Usage**:
- CPU: <5% of 1 core
- Memory: ~50MB
#### Promtail (Log Collector)
**Purpose**: Tail log files and send to Loki
**Location**: Runs on EVERY node
**Configuration**: `config/promtail.yml`
**Log Sources**:
- `/var/log/madbase/worker.log` (worker logs)
- `/var/log/postgresql/*.log` (database logs)
- `/var/log/syslog` (system logs)
**Resource Usage**:
- CPU: <2% of 1 core
- Memory: ~30MB
### Grafana Integration
Grafana connects to both VictoriaMetrics and Loki:
**Example Dashboard Query**:
```yaml
Panel 1: Request Rate (Metrics)
Query: rate(http_requests_total[5m])
Panel 2: Error Rate (Metrics)
Query: rate(http_requests_total{status=~"5.."}[5m])
Panel 3: Recent Errors (Logs)
Query: {level="error"} | line_format "{{.message}}"
Panel 4: Trace Request by ID (Logs)
Query: {trace_id="abc123"} |= "timeout"
```
### Deployment Scenarios
#### Scenario 1: Dedicated Monitoring Node (Production)
```yaml
servers:
- name: server1
template: control-plane-node
plan: CX11
- name: server2
template: db-node
plan: CX21
- name: server3
template: worker-node
plan: CX11
- name: server4
template: monitoring-node  # Dedicated monitoring
plan: CX11
```
**Cost**: €18.01/mo (4 servers)
**Best for**: Production with >1000 req/min
#### Scenario 2: Worker + Monitoring Combo (Cost-Optimized)
```yaml
servers:
- name: server1
template: control-plane-node
plan: CX11
- name: server2
template: db-node
plan: CX21
- name: server3
template: worker-monitor-combo  # Combined worker + monitoring
plan: CX21
```
**Cost**: €17.57/mo (3 servers)
**Best for**: Cost-optimized production with <1000 req/min
#### Scenario 3: All-in-One (Development)
```yaml
servers:
- name: dev-server
template: all-in-one
plan: CX41
```
**Cost**: €25.60/mo (1 server)
**Best for**: Development, testing, MVP
---
## Deployment Examples
### Example 1: Small Production (3 servers)
```yaml
Server 1 (CX21 - €6.94):
Template: worker-db-combo
Services: PostgreSQL + Worker
Server 2 (CX11 - €3.69):
Template: control-plane-node
Services: Proxy + Control + Grafana
Server 3 (CX11 - €3.69):
Template: worker-monitor-combo
Services: Worker + VictoriaMetrics + Loki
Total: €14.32/mo
```
### Example 2: Medium Production (4 servers)
```yaml
Server 1 (CX21 - €6.94):
Template: db-node
Services: PostgreSQL + etcd + HAProxy
Server 2 (CX11 - €3.69):
Template: worker-node
Services: Worker + vmagent
Server 3 (CX11 - €3.69):
Template: control-plane-node
Services: Proxy + Control + Grafana
Server 4 (CX11 - €3.69):
Template: monitoring-node
Services: VictoriaMetrics + Loki
Total: €18.01/mo
```
### Example 3: Large Production (6 servers)
```yaml
Server 1-3 (CX21 - €6.94 each):
Template: db-node
Services: PostgreSQL cluster (3 nodes)
Server 4-5 (CX11 - €3.69 each):
Template: worker-node
Services: Workers (2 nodes)
Server 6 (CX11 - €3.69):
Template: control-plane-node
Services: Proxy + Control + Grafana + VictoriaMetrics + Loki
Total: €31.89/mo
```
---
## Template Selection Guide
**Start with these questions**:
1. **What's your budget?**
- ~€15/mo → use composite templates
- ~€25/mo → use pure templates
2. **What's your traffic?**
- <100 req/min → all-in-one
- <1000 req/min → worker-db-combo
- >1000 req/min → pure templates
3. **Do you need database HA?**
- Yes → db-node (3 nodes minimum)
- No → worker-db-combo
4. **Do you need centralized monitoring?**
- Yes → monitoring-node or worker-monitor-combo
- No → Skip (use worker vmagent only)
---
## Control Plane API Integration
Templates are used by the Control Plane API to provision servers:
```http
POST /api/v1/servers
Content-Type: application/json
{
"name": "worker-1",
"template": "worker-node",
"hetzner_plan": "CX11",
"region": "fsn1",
"features": ["worker", "monitoring"],
"environment": "production"
}
```
**Response**:
```json
{
"server_id": "abc123",
"status": "provisioning",
"ip_address": "167.235.123.45",
"services": [
{"name": "worker", "port": 8002},
{"name": "vmagent", "port": 8429}
]
}
```
---
## Resource Profiles
Each service can be tuned with resource profiles:
```yaml
minimal:
cpu_limit: "0.5"
memory_limit: "512Mi"
balanced:
cpu_limit: "2"
memory_limit: "2Gi"
cpu_intensive:
cpu_limit: "4"
memory_limit: "4Gi"
```
Default profiles are assigned in templates but can be overridden:
```http
POST /api/v1/servers
{
"template": "worker-node",
"overrides": {
"worker": {
"resource_profile": "cpu_intensive"
}
}
}
```
---
## Next Steps
1. **Choose template** based on budget and traffic
2. **Provision servers** via Control Plane API or Hetzner CLI
3. **Configure monitoring** (vmagent + promtail)
4. **Verify health** with Grafana dashboards
5. **Scale up/down** as needed
For more details, see:
- `STORAGE_CONFIGURATION.md` - Storage backend setup
- `QUICKSTART_HETZNER_STORAGE.md` - Hetzner Bucket Storage guide
- `4SERVER_DEPLOYMENT_GUIDE.md` - Multi-server deployment


@@ -0,0 +1,97 @@
# MadBase Redis Integration Implementation Plan
## Current Status Analysis
### ✅ Already Implemented
1. **Redis in State Pillar** - Redis is already added to docker-compose.pillar-database.yml
2. **HAProxy Configuration** - Port 6379 routing is configured in autobase-haproxy.cfg
3. **Config Support** - redis_url field exists in common/src/config.rs
4. **L1 Cache Ready** - moka is already in gateway/Cargo.toml
### 🔨 Needs Implementation
## Phase 1: Core Redis Client (Common Crate)
### File: common/src/cache.rs
- Create Redis client wrapper with connection pooling
- Implement L1/L2 cache abstraction layer
- Add distributed locking primitives
- Add session management utilities
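One possible shape for the L1/L2 abstraction, assuming the `async-trait` crate (names and signatures are illustrative, not the final API):
```rust
use async_trait::async_trait;

// A trait both tiers implement, so callers stay agnostic of the backing store.
#[async_trait]
pub trait CacheLayer: Send + Sync {
    async fn get(&self, key: &str) -> Option<Vec<u8>>;
    async fn set(&self, key: &str, value: Vec<u8>, ttl_secs: u64);
    async fn delete(&self, key: &str);
}

// Tiered cache: check L1 (moka) first, fall back to L2 (Redis),
// and backfill L1 on an L2 hit.
pub struct TieredCache {
    pub l1: Box<dyn CacheLayer>,
    pub l2: Box<dyn CacheLayer>,
}

impl TieredCache {
    pub async fn get(&self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.l1.get(key).await {
            return Some(v);
        }
        let v = self.l2.get(key).await?;
        // Short L1 TTL keeps Redis authoritative
        self.l1.set(key, v.clone(), 60).await;
        Some(v)
    }
}
```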
## Phase 2: Application Layer Integration
### Gateway (Proxy/Worker)
- File: gateway/src/cache_layer.rs
- L1 cache for project configs (moka)
- L2 cache for shared state (Redis)
- Cache warming strategies
- Cache invalidation logic
### Auth Module
- File: auth/src/session.rs
- Shared auth sessions across proxies
- Session tokens in Redis
- Multi-proxy logout support
## Phase 3: Features
### Realtime Presence
- File: realtime/src/presence.rs
- Track online users across workers
- Channel presence management
- Heartbeat mechanism with Redis pub/sub
### Distributed Locking
- File: common/src/locking.rs
- Redlock implementation
- Migration coordination
- Background job synchronization
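A minimal building block for this, assuming the `redis` and `uuid` crates: a single-node lock via `SET NX PX` plus an atomic compare-and-delete release. (Full Redlock acquires this lock on a quorum of independent Redis nodes.)
```rust
use uuid::Uuid;

// Try to take the lock; returns the owner token on success.
pub async fn acquire(
    con: &mut redis::aio::MultiplexedConnection,
    name: &str,
    ttl_ms: u64,
) -> redis::RedisResult<Option<String>> {
    let token = Uuid::new_v4().to_string();
    // SET key token NX PX ttl: succeeds only if the key does not exist yet
    let taken: bool = redis::cmd("SET")
        .arg(format!("lock:{name}"))
        .arg(&token)
        .arg("NX")
        .arg("PX")
        .arg(ttl_ms)
        .query_async(con)
        .await?;
    Ok(taken.then_some(token))
}

// Release only if we still own the lock; the compare-and-delete must be
// atomic (hence Lua), since a plain GET+DEL would race with TTL expiry.
pub async fn release(
    con: &mut redis::aio::MultiplexedConnection,
    name: &str,
    token: &str,
) -> redis::RedisResult<bool> {
    let script = redis::Script::new(
        r"if redis.call('GET', KEYS[1]) == ARGV[1] then
              return redis.call('DEL', KEYS[1])
          else
              return 0
          end",
    );
    let deleted: i32 = script
        .key(format!("lock:{name}"))
        .arg(token)
        .invoke_async(con)
        .await?;
    Ok(deleted == 1)
}
```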
### Rate Limiting
- File: gateway/src/rate_limit.rs
- Sliding window rate limiting
- Distributed counters
- IP-based and user-based limits
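A sliding-window sketch over a Redis sorted set, assuming the `redis` crate's pipeline API (integer parameter types vary slightly across crate versions): each request becomes a member scored by its timestamp, so trimming then counting the set yields the exact count inside the window.
```rust
use std::time::{SystemTime, UNIX_EPOCH};

pub async fn allow_request(
    con: &mut redis::aio::MultiplexedConnection,
    ip: &str,
    limit: usize,
    window_secs: u64,
) -> redis::RedisResult<bool> {
    // Key layout follows the ratelimit:ip:{addr} pattern from CACHING_STRATEGY.md
    let key = format!("ratelimit:ip:{ip}");
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_millis() as u64;
    let window_start = now_ms - window_secs * 1000;

    // Trim, record, count, and refresh the TTL in one atomic round trip.
    // (Sub-millisecond member collisions are acceptable for a sketch.)
    let (_removed, _added, count, _ttl): (usize, usize, usize, usize) = redis::pipe()
        .atomic()
        .zrembyscore(&key, 0, window_start)
        .zadd(&key, now_ms, now_ms)
        .zcard(&key)
        .expire(&key, window_secs as i64)
        .query_async(con)
        .await?;
    Ok(count <= limit)
}
```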
## Phase 4: Updates to Existing Files
### Templates
- Update db-node.yaml to include Redis service definition
- Update all templates that include PostgreSQL to also include Redis
### Documentation
- Update AUTOBASE.md to reflect "State Node" concept
- Create CACHING_STRATEGY.md with architecture details
- Update NODE_TEMPLATES.md with Redis information
### Tests
- Integration tests for cache layers
- Failover tests for Redis HA
- Performance benchmarks
## Implementation Order
1. **Common Cache Layer** (Priority 1)
- Redis client with pooling
- Cache abstraction (L1/L2)
- Basic operations (get/set/delete)
2. **Auth Sessions** (Priority 1)
- Shared session store
- Multi-proxy support
3. **Presence Tracking** (Priority 2)
- User online status
- Channel presence
4. **Distributed Locking** (Priority 2)
- Migration coordination
- Background job locks
5. **Rate Limiting** (Priority 3)
- Distributed rate limiting
- Sliding windows
6. **Documentation & Tests** (Priority 4)
- Update all docs
- Add comprehensive tests

docs/STORAGE.md Normal file

@@ -0,0 +1,37 @@
# MadBase Storage & Persistence
MadBase uses a tiered storage approach combining local persistence, distributed databases, and S3-compatible object storage.
## 1. Local Block Storage
Nodes use local NVMe/SSD storage for high-performance service data:
- **Database**: PostgreSQL data stored in `/var/lib/postgresql/data`.
- **Metrics**: VictoriaMetrics data stored in `/victoria-metrics-data`.
- **Logs**: Loki chunks stored in `/loki`.
## 2. Object Storage (S3)
Used for backups, static assets, and long-term state retention.
- **Backups**: Database backups are automatically piped to S3-compatible buckets.
- **Provider**: Works with Hetzner Bucket Storage, AWS S3, or MinIO.
### Configuration
```bash
S3_ENDPOINT="https://fsn1.your-storage-endpoint.com"
S3_ACCESS_KEY="your_key"
S3_SECRET_KEY="your_secret"
S3_BUCKET="madbase-backups"
```
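To verify the credentials, any S3-compatible client works; for example, with the AWS CLI:
```bash
# List the backup bucket through the configured endpoint
AWS_ACCESS_KEY_ID=$S3_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$S3_SECRET_KEY \
  aws s3 ls "s3://$S3_BUCKET" --endpoint-url "$S3_ENDPOINT"
```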
## 3. Backup & Restore
### Manual Backup
```bash
curl -X POST http://localhost:8001/api/v1/cluster/database/backup
```
### Manual Restore
```bash
curl -X POST http://localhost:8001/api/v1/cluster/database/restore \
-d '{"backup_url": "s3://..."}'
```
## 4. Planning & Capacity
- **DB Nodes**: Scale vertically (RAM) for active datasets; scale horizontally (nodes) for read throughput and HA.
- **Retention**: Configure VictoriaMetrics and Loki retention periods in the System Node config to manage disk usage.

docs/WASI_DENO.md Normal file

@@ -0,0 +1,52 @@
# Plan: Deno Compatibility for MadBase Edge Functions
## Problem Statement
Currently, MadBase executes Edge Functions as WASM modules via `wasmtime`. Supabase-compatible Edge Functions (like those in `accountaflow`) are written in TypeScript and target a Deno environment. Migrating these requires 1:1 compatibility for the `Deno` namespace, ES modules, and standard web APIs (Fetch, Request, Response).
## Proposed Architecture
### 1. Dual-Runtime Strategy
Extend the `functions` crate to support two runtimes:
- **WasmRuntime**: Existing `wasmtime` based executor for compiled modules.
- **DenoRuntime**: A new V8-based executor utilizing `deno_core` and `deno_runtime`.
### 2. Runtime Detection
The gateway should detect the function type:
- **DenoRuntime (V8)**: Files ending in `.ts` or `.js`. Recommended for standard Edge Functions due to JIT-optimized performance.
- **WasmRuntime (Wasmtime)**: Native WASM binaries (Rust, Go, C++). Best for specialized, high-performance logic or pre-compiled modules.
## Implementation Steps
### Phase 1: Core Integration
- Add `deno_core` and `deno_runtime` dependencies to `madbase/functions/Cargo.toml`.
- Create `functions/src/deno_runtime.rs`.
- Implement `execute_script(code: String, payload: Value)` using `JsRuntime`.
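A minimal sketch of that entry point (deno_core's API changes between releases, so treat names and signatures as approximate; the `MADBASE_PAYLOAD` global is illustrative):
```rust
use deno_core::{JsRuntime, RuntimeOptions};
use serde_json::Value;

// Run user code in a fresh V8 isolate with the request payload exposed
// as a global. Deno.serve shimming and module resolution (Phases 2 and 3)
// layer on top of this entry point.
pub fn execute_script(code: String, payload: Value) -> Result<(), Box<dyn std::error::Error>> {
    let mut runtime = JsRuntime::new(RuntimeOptions::default());

    // serde_json::Value renders as JSON, which is valid JS expression syntax
    let prelude = format!("globalThis.MADBASE_PAYLOAD = {payload};");
    runtime.execute_script("<prelude>", prelude)?;
    runtime.execute_script("<function>", code)?;
    Ok(())
}
```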
### Phase 2: Supabase Environment Compatibility
- **Process Environment**: Inject `SUPABASE_URL`, `SUPABASE_ANON_KEY`, and `SUPABASE_SERVICE_ROLE_KEY`.
- **Global Objects**: Implement a shim for `Deno.serve` to capture the incoming request and route it to the script's handler.
- **Header Parsing**: Ensure standard headers (`apikey`, `Authorization`) are passed through.
### Phase 3: Module Resolution
- Implement a `ModuleLoader` that handles imports from `https://esm.sh/`.
- Support local imports from a shared functions directory (like `_shared`).
## API Changes
### Gateway
Modify `POST /functions/v1` to accept `type: "typescript" | "wasm"`. Default to "typescript" for source code.
### Deployment Table
Update the `functions` table schema in the control plane to store the runtime type.
## Verification Plan
### Automated Tests
1. **Hello World Test**: Deploy a simple `.ts` function and verify the output.
2. **Supabase Client Test**: Deploy a function that imports `@supabase/supabase-js` from `esm.sh` and queries the MadBase Data API.
3. **Environment Variable Test**: Verify `Deno.env.get` returns expected MadBase configuration.
### Manual Verification
1. Attempt to deploy the `invite-staff` function from `accountaflow` directly to MadBase.
2. Verify cross-organization invitation logic works.