madbase/docs/NODE_TEMPLATES.md
Vlad Durnea cffdf8af86
2026-03-15 12:35:42 +02:00

# Node Templates - Quick Reference
Complete guide to MadBase node templates for Hetzner Cloud deployment.
## Template Overview
| Template | Pillar | Zone | Min Plan | Cost/Mo | Use Case | Services |
|----------|--------|------|----------|---------|----------|----------|
| **system-node** | System | Public | CX21 | €6.94 | Cluster Root | Control API + Grafana + VM + Loki |
| **proxy-api-node** | Proxy / API | Public | CX11 | €3.69 | Scalable Ingress | Gateway + Platform API |
| **worker-node** | Worker | Private | CX11 | €3.69 | Horizontal scaling | Worker + vmagent |
| **db-node** | DB / State | Private | CX21 | €6.94 | Production database HA | PostgreSQL + Patroni + etcd + HAProxy |
| **worker-db-combo** ⭐ | Mixed | — | CX31 | €14.21 | Smaller deployments | Worker + PostgreSQL + etcd + HAProxy |
| **worker-monitor-combo** ⭐ | Mixed | — | CX21 | €6.94 | Cost-optimized | Worker + VictoriaMetrics + Loki |
| **all-in-one** ⭐ | Unified | — | CX41 | €25.60 | Development/MVP | All services on one node |
⭐ = Composite template (mixes multiple service types)
---
## Pure Templates (Single Service Type)
### 1. Database Node (db-node.yaml)
**Best for**: Production deployments requiring database HA
**Server**: CX21 (4GB RAM, 2 vCPU)
**Services**:
- PostgreSQL 15 with Patroni (auto-failover)
- etcd (distributed consensus)
- HAProxy (connection pooling + read/write splitting)
**Scaling**: 3-7 nodes (odd number for quorum)
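The odd-node rule follows from how Raft-style majority quorum works in etcd: adding a fourth node raises cost without raising failure tolerance. A quick sketch (helper names are illustrative, not a MadBase API):

```python
def quorum_size(nodes: int) -> int:
    """Majority needed for etcd/Patroni consensus decisions."""
    return nodes // 2 + 1

def tolerable_failures(nodes: int) -> int:
    """Nodes that can fail while the cluster still holds quorum."""
    return (nodes - 1) // 2
```

Note that `tolerable_failures(3) == tolerable_failures(4) == 1`: a 4-node cluster survives no more failures than a 3-node one, which is why odd counts are recommended.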
**When to use**:
- Production traffic >1000 req/min
- Need database auto-failover
- Want separate database cluster
### 2. Worker Node (worker-node.yaml)
**Best for**: Horizontal scaling of API workers
**Server**: CX11 (2GB RAM, 1 vCPU)
**Services**:
- MadBase Worker (API processing)
- vmagent (metrics collection)
**Scaling**: 1-20 nodes
**Auto-scaling rules**:
- Scale up: CPU > 70%
- Scale down: CPU < 20%
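The two thresholds above amount to a simple hysteresis rule: the wide gap between 20% and 70% prevents replica-count flapping. A minimal sketch (function name and signature are hypothetical, not the actual autoscaler):

```python
def autoscale_decision(cpu_percent: float, replicas: int,
                       min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Return the desired worker replica count for one scaling tick.

    Scale up above 70% CPU, scale down below 20%; hold steady in
    between so the pool does not oscillate around a single threshold.
    """
    if cpu_percent > 70 and replicas < max_replicas:
        return replicas + 1
    if cpu_percent < 20 and replicas > min_replicas:
        return replicas - 1
    return replicas
```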
**When to use**:
- Need to scale workers independently
- Separate database cluster already exists
- Production deployments
### 3. Control Plane Node (control-plane-node.yaml)
**Best for**: Management UI and APIs
**Server**: CX11 (2GB RAM, 1 vCPU)
**Services**:
- Gateway Proxy (port 8080)
- Control Plane API (port 8001)
- Grafana (port 3030)
- Keepalived (HA with floating IP)
**Scaling**: 1-2 nodes (HA mode)
**When to use**:
- Need web UI for server management
- Want to provision servers via API
- Production deployments
### 4. Monitoring Node (monitoring-node.yaml)
**Best for**: Centralized metrics and logging
**Server**: CX11 (2GB RAM, 1 vCPU)
**Services**:
- VictoriaMetrics (metrics database)
- Loki (log aggregation)
- Alertmanager (optional)
**Scaling**: 1-2 nodes (can be HA)
**When to use**:
- Production deployments
- Want centralized monitoring
- Need log aggregation
---
## Composite Templates (Mix Multiple Service Types)
### 5. Worker + Database Combo (worker-db-combo.yaml) ⭐
**Best for**: 2-3 server deployments with database and worker on same node
**Server**: CX31 (8GB RAM, 2 vCPU)
**Services**:
- PostgreSQL 15 with Patroni
- etcd
- HAProxy
- MadBase Worker
- vmagent
**Why use this**:
- Cost savings on small deployments (one combo server instead of a separate db-node + worker-node pair at €10.63/mo)
- Simpler architecture for smaller deployments
- Easy to scale later
**Scaling**: 1-2 nodes
**Upgrade path**: When CPU > 60% or RAM > 70%, migrate to dedicated db-node + worker-node
**Deployment example**:
```yaml
Server 1 (worker-db-combo): PostgreSQL + Worker
Server 2 (control-plane): Proxy + Control + Grafana
Server 3 (monitoring): VictoriaMetrics + Loki
```
### 6. Worker + Monitoring Combo (worker-monitor-combo.yaml) ⭐
**Best for**: Cost-optimized deployments with monitoring on worker node
**Server**: CX21 (4GB RAM, 2 vCPU)
**Services**:
- MadBase Worker
- VictoriaMetrics
- Loki
- vmagent
- Promtail
**Why use this**:
- Save €3.69/mo (no dedicated monitoring node)
- Monitoring co-located with worker
- Good for 2-3 server deployments
**Scaling**: 1-3 nodes
**When to upgrade**:
- Worker CPU > 60% (monitoring competes for resources)
- Need to scale workers horizontally
**Deployment example**:
```yaml
Server 1 (worker-monitor-combo): Worker + VictoriaMetrics + Loki
Server 2 (db-node): PostgreSQL + etcd + HAProxy
Server 3 (control-plane): Proxy + Control + Grafana
```
### 7. All-in-One (all-in-one.yaml) ⭐
**Best for**: Development, testing, or MVP deployments
**Server**: CX41 (16GB RAM, 4 vCPU)
**Services**: ALL (PostgreSQL, etcd, HAProxy, Redis, MinIO, Workers, Proxy, Control, VictoriaMetrics, Loki, Grafana)
**Why use this**:
- Simplest deployment
- Single server for everything
- Great for development/testing
**When to upgrade**:
- Production traffic > 100 req/min
- CPU usage > 70% sustained
- Need HA for database
---
## Monitoring Stack: VictoriaMetrics + Loki
### How It Works
```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│    Node 1    │  │    Node 2    │  │    Node 3    │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │ vmagent  │ │  │ │ vmagent  │ │  │ │ vmagent  │ │
│ └────┬─────┘ │  │ └────┬─────┘ │  │ └────┬─────┘ │
│ Scrapes:     │  │ Scrapes:     │  │ Scrapes:     │
│  - worker    │  │  - worker    │  │  - db        │
│  - system    │  │  - system    │  │  - system    │
└──────┼───────┘  └──────┼───────┘  └──────┼───────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
             ┌───────────────────────┐
             │    VictoriaMetrics    │
             │    Port: 8428         │
             │    Type: Metrics DB   │
             └───────────┬───────────┘
                         ▼
             ┌───────────────────────┐
             │       Grafana         │
             │       Port: 3030      │
             │   Queries VM + Loki   │
             └───────────────────────┘

┌──────────────┐  ┌──────────────┐
│    Node 1    │  │    Node 2    │
│ ┌──────────┐ │  │ ┌──────────┐ │
│ │ Promtail │ │  │ │ Promtail │ │
│ └────┬─────┘ │  │ └────┬─────┘ │
│ Reads:       │  │ Reads:       │
│  - logs/*    │  │  - logs/*    │
└──────┼───────┘  └──────┼───────┘
       │                 │
       └────────┬────────┘
                ▼
    ┌───────────────────────┐
    │         Loki          │
    │      Port: 3100       │
    │ Type: Log Aggregation │
    └───────────┬───────────┘
                ▼
    ┌───────────────────────┐
    │       Grafana         │
    │    LogQL Queries      │
    └───────────────────────┘
```
### Components
#### VictoriaMetrics (Metrics Database)
**Purpose**: Store and query time-series metrics
**Location**:
- Dedicated monitoring-node (recommended)
- worker-monitor-combo (cost-optimized)
- all-in-one (development)
**Data Flow**:
1. vmagent on each node scrapes metrics every 15s
2. Metrics sent to VictoriaMetrics via remote write
3. VictoriaMetrics stores metrics with 10x compression
4. Grafana queries VictoriaMetrics for dashboards
**Metrics Collected**:
- **Worker**: Request rate, error rate, latency, queue depth
- **PostgreSQL**: Connections, transactions, replication lag
- **System**: CPU, memory, disk, network
- **HAProxy**: Connection count, response time
**Storage Requirements**:
- ~1GB per million time series per day (compressed)
- Default retention: 30 days
- RAM: Minimal, scales with active queries
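As a back-of-envelope capacity check, the ~1GB per million series per day figure translates directly into a disk estimate (illustrative helper, not part of MadBase):

```python
def vm_storage_gb(active_series_millions: float,
                  retention_days: int = 30) -> float:
    """Rough VictoriaMetrics disk estimate (compressed), using the
    ~1 GB per million active time series per day rule of thumb."""
    return active_series_millions * retention_days
```

For example, half a million active series at the default 30-day retention needs roughly 15GB of disk.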
#### Loki (Log Aggregation)
**Purpose**: Store and query logs
**Location**:
- Dedicated monitoring-node (recommended)
- worker-monitor-combo (cost-optimized)
- all-in-one (development)
**Data Flow**:
1. Promtail on each node tails log files
2. Logs sent to Loki via HTTP API
3. Loki indexes logs by labels (service, level, host)
4. Grafana queries Loki using LogQL
**Logs Collected**:
- **Worker**: `/var/log/madbase/worker.log`
- **PostgreSQL**: `/var/log/postgresql/*.log`
- **System**: `/var/log/syslog`
**Storage Requirements**:
- ~10% of raw log size (with compression)
- Default retention: 30 days
- RAM: Minimal, scales with active queries
#### vmagent (Metrics Collector)
**Purpose**: Scrape metrics and send to VictoriaMetrics
**Location**: Runs on EVERY node
**Port**: 8429 (local debug endpoint)
**Configuration**: `config/vmagent.yml`
**Scrape Targets**:
- Worker: `localhost:8002/metrics`
- Patroni: `localhost:8008/metrics`
- Node Exporter: `localhost:9100/metrics`
- HAProxy: `localhost:7000/metrics`
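A minimal `config/vmagent.yml` covering the scrape targets above might look like this. This is a sketch under the assumption that vmagent is given a Prometheus-compatible scrape config; the remote-write destination (VictoriaMetrics at port 8428) is typically passed separately via the `-remoteWrite.url` flag rather than in this file:

```yaml
global:
  scrape_interval: 15s   # matches the 15s interval described above

scrape_configs:
  - job_name: worker
    static_configs:
      - targets: ["localhost:8002"]
  - job_name: patroni
    static_configs:
      - targets: ["localhost:8008"]
  - job_name: node_exporter
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: haproxy
    static_configs:
      - targets: ["localhost:7000"]
```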
**Resource Usage**:
- CPU: <5% of 1 core
- Memory: ~50MB
#### Promtail (Log Collector)
**Purpose**: Tail log files and send to Loki
**Location**: Runs on EVERY node
**Configuration**: `config/promtail.yml`
**Log Sources**:
- `/var/log/madbase/worker.log` (worker logs)
- `/var/log/postgresql/*.log` (database logs)
- `/var/log/syslog` (system logs)
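A minimal `config/promtail.yml` for the log sources above might look like the sketch below. The `<monitoring-node>` placeholder stands for wherever Loki runs (dedicated monitoring-node, combo, or all-in-one); labels are assumptions chosen to match the `service` label mentioned in the Loki data flow:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://<monitoring-node>:3100/loki/api/v1/push

scrape_configs:
  - job_name: madbase
    static_configs:
      - targets: [localhost]
        labels:
          service: worker
          __path__: /var/log/madbase/worker.log
  - job_name: postgresql
    static_configs:
      - targets: [localhost]
        labels:
          service: postgresql
          __path__: /var/log/postgresql/*.log
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          service: system
          __path__: /var/log/syslog
```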
**Resource Usage**:
- CPU: <2% of 1 core
- Memory: ~30MB
### Grafana Integration
Grafana connects to both VictoriaMetrics and Loki:
**Example Dashboard Query**:
```yaml
Panel 1: Request Rate (Metrics)
  Query: rate(http_requests_total[5m])
Panel 2: Error Rate (Metrics)
  Query: rate(http_requests_total{status=~"5.."}[5m])
Panel 3: Recent Errors (Logs)
  Query: {level="error"} | line_format "{{.message}}"
Panel 4: Trace Request by ID (Logs)
  Query: {trace_id="abc123"} |= "timeout"
```
### Deployment Scenarios
#### Scenario 1: Dedicated Monitoring Node (Production)
```yaml
servers:
  - name: server1
    template: control-plane-node
    plan: CX11
  - name: server2
    template: db-node
    plan: CX21
  - name: server3
    template: worker-node
    plan: CX11
  - name: server4
    template: monitoring-node   # dedicated monitoring
    plan: CX11
```
**Cost**: €18.01/mo (4 servers)
**Best for**: Production with >1000 req/min
#### Scenario 2: Worker + Monitoring Combo (Cost-Optimized)
```yaml
servers:
  - name: server1
    template: control-plane-node
    plan: CX11
  - name: server2
    template: db-node
    plan: CX21
  - name: server3
    template: worker-monitor-combo   # combined worker + monitoring
    plan: CX21
```
**Cost**: €17.57/mo (3 servers)
**Best for**: Cost-optimized production with <1000 req/min
#### Scenario 3: All-in-One (Development)
```yaml
servers:
  - name: dev-server
    template: all-in-one
    plan: CX41
```
**Cost**: €25.60/mo (1 server)
**Best for**: Development, testing, MVP
---
## Deployment Examples
### Example 1: Small Production (3 servers)
```yaml
Server 1 (CX21 - €6.94):
  Template: worker-db-combo
  Services: PostgreSQL + Worker
Server 2 (CX11 - €3.69):
  Template: control-plane-node
  Services: Proxy + Control + Grafana
Server 3 (CX11 - €3.69):
  Template: worker-monitor-combo
  Services: Worker + VictoriaMetrics + Loki

Total: €14.32/mo
```
### Example 2: Medium Production (4 servers)
```yaml
Server 1 (CX21 - €6.94):
  Template: db-node
  Services: PostgreSQL + etcd + HAProxy
Server 2 (CX11 - €3.69):
  Template: worker-node
  Services: Worker + vmagent
Server 3 (CX11 - €3.69):
  Template: control-plane-node
  Services: Proxy + Control + Grafana
Server 4 (CX11 - €3.69):
  Template: monitoring-node
  Services: VictoriaMetrics + Loki

Total: €18.01/mo
```
### Example 3: Large Production (6 servers)
```yaml
Server 1-3 (CX21 - €6.94 each):
  Template: db-node
  Services: PostgreSQL cluster (3 nodes)
Server 4-5 (CX11 - €3.69 each):
  Template: worker-node
  Services: Workers (2 nodes)
Server 6 (CX11 - €3.69):
  Template: control-plane-node
  Services: Proxy + Control + Grafana + VictoriaMetrics + Loki

Total: €31.89/mo
```
---
## Template Selection Guide
**Start with these questions**:
1. **What's your budget?**
- Around €15/mo → Use composite templates
- €25/mo or more → Use pure templates
2. **What's your traffic?**
- <100 req/min → all-in-one
- <1000 req/min → worker-db-combo
- >1000 req/min → pure templates
3. **Do you need database HA?**
- Yes → db-node (3 nodes minimum)
- No → worker-db-combo
4. **Do you need centralized monitoring?**
- Yes → monitoring-node or worker-monitor-combo
- No → Skip (use worker vmagent only)
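The decision tree above can be condensed into a small helper. This is a hypothetical sketch (the function is not a MadBase API; template names come from the overview table):

```python
def choose_template(req_per_min: int, need_db_ha: bool) -> str:
    """Suggest a starting template from the selection guide's
    rules of thumb: HA forces pure templates; otherwise traffic
    decides between all-in-one, combo, and pure layouts."""
    if need_db_ha:
        return "db-node + worker-node"   # pure templates, 3+ db nodes
    if req_per_min < 100:
        return "all-in-one"
    if req_per_min < 1000:
        return "worker-db-combo"
    return "db-node + worker-node"
```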
---
## Control Plane API Integration
Templates are used by the Control Plane API to provision servers:
```http
POST /api/v1/servers
Content-Type: application/json

{
  "name": "worker-1",
  "template": "worker-node",
  "hetzner_plan": "CX11",
  "region": "fsn1",
  "features": ["worker", "monitoring"],
  "environment": "production"
}
```
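Client-side, the request body can be assembled with a small helper. This is a hypothetical sketch (the validation rules and helper name are assumptions; template names come from the overview table):

```python
import json

# Templates listed in the overview table above.
KNOWN_TEMPLATES = {
    "system-node", "proxy-api-node", "worker-node", "db-node",
    "worker-db-combo", "worker-monitor-combo", "all-in-one",
}

def build_server_request(name: str, template: str, plan: str = "CX11",
                         region: str = "fsn1", features: tuple = (),
                         environment: str = "production") -> str:
    """Build the JSON body for POST /api/v1/servers, rejecting
    template names that are not in the documented set."""
    if template not in KNOWN_TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    return json.dumps({
        "name": name,
        "template": template,
        "hetzner_plan": plan,
        "region": region,
        "features": list(features),
        "environment": environment,
    })
```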
**Response**:
```json
{
  "server_id": "abc123",
  "status": "provisioning",
  "ip_address": "167.235.123.45",
  "services": [
    {"name": "worker", "port": 8002},
    {"name": "vmagent", "port": 8429}
  ]
}
```
---
## Resource Profiles
Each service can be tuned with resource profiles:
```yaml
minimal:
  cpu_limit: "0.5"
  memory_limit: "512Mi"
balanced:
  cpu_limit: "2"
  memory_limit: "2Gi"
cpu_intensive:
  cpu_limit: "4"
  memory_limit: "4Gi"
```
Default profiles are assigned in templates but can be overridden:
```http
POST /api/v1/servers

{
  "template": "worker-node",
  "overrides": {
    "worker": {
      "resource_profile": "cpu_intensive"
    }
  }
}
```
---
## Next Steps
1. **Choose template** based on budget and traffic
2. **Provision servers** via Control Plane API or Hetzner CLI
3. **Configure monitoring** (vmagent + promtail)
4. **Verify health** with Grafana dashboards
5. **Scale up/down** as needed
For more details, see:
- `STORAGE_CONFIGURATION.md` - Storage backend setup
- `QUICKSTART_HETZNER_STORAGE.md` - Hetzner Bucket Storage guide
- `4SERVER_DEPLOYMENT_GUIDE.md` - Multi-server deployment