Milestone 8: High Availability & Scaling

Goal: the system survives node failures and scales horizontally.

Depends on: M1 (Foundation), M7 (CI/CD)


8.1 — Database HA

8.1.1 Multi-node Patroni

File: autobase-haproxy.cfg, listen primary block

Add replica backends:

listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

Raise the global maxconn to 1000.

8.1.2 3-node etcd

Update docker-compose.pillar-database.yml to run three etcd nodes bootstrapped as a single cluster (each node listed in the initial-cluster configuration), and point Patroni at all three etcd endpoints.

8.1.3 Read replica routing

Add READ_REPLICA_URL env var. In data_api/src/handlers.rs, route SELECT queries to the replica pool:

let pool = if is_read_only_query {
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
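
How is_read_only_query is derived is not specified here; a minimal sketch, assuming classification by the leading SQL keyword (a conservative check that treats anything other than a plain SELECT as a write):

// Hypothetical helper: classify a statement as read-only by its first keyword.
// CTEs starting with WITH ... SELECT would need extra handling; when in doubt,
// fall back to the primary.
fn is_read_only(sql: &str) -> bool {
    sql.trim_start()
        .split_whitespace()
        .next()
        .map(|kw| kw.eq_ignore_ascii_case("select"))
        .unwrap_or(false)
}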

8.1.4 Redis Sentinel

Replace the single Redis instance with a 3-node Sentinel setup. Update common/src/cache.rs to use redis::sentinel::SentinelClient.
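
A minimal connection sketch, assuming redis-rs with the sentinel and aio features enabled; the sentinel addresses and the mymaster group name are illustrative, and the exact constructor signature may vary across crate versions:

use redis::sentinel::{SentinelClient, SentinelServerType};

// Illustrative sentinel addresses and service name; adjust to the compose setup.
let mut client = SentinelClient::build(
    vec![
        "redis://sentinel1:26379",
        "redis://sentinel2:26379",
        "redis://sentinel3:26379",
    ],
    String::from("mymaster"),
    None,
    SentinelServerType::Master,
)?;

// The client re-resolves the current master through the sentinels on connect,
// so failover is transparent to callers.
let mut conn = client.get_async_connection().await?;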


8.2 — Proxy & Worker Scaling

8.2.1 Graceful shutdown

Files: gateway/src/main.rs and all bin/*.rs

let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Wait for SIGINT (Ctrl-C) or SIGTERM so orchestrators can trigger a clean drain
let shutdown = async {
    let mut sigterm = tokio::signal::unix::signal(tokio::signal::unix::SignalKind::terminate())
        .expect("failed to install SIGTERM handler");
    tokio::select! {
        _ = tokio::signal::ctrl_c() => {},
        _ = sigterm.recv() => {},
    }
    tracing::info!("Shutdown signal received, draining connections...");
};

server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");

8.2.2 Dynamic worker discovery

Instead of static WORKER_UPSTREAM_URLS, poll the control plane or use Redis pub/sub:

// Background task in proxy
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
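
discover_workers is left undefined above; a minimal sketch, assuming the control plane exposes a JSON endpoint listing worker URLs (the /internal/workers path and response shape are illustrative, not an existing API, and reqwest/serde/anyhow are used here only for brevity):

// Hypothetical discovery call against the control plane.
async fn discover_workers(control_url: &str) -> anyhow::Result<Vec<String>> {
    #[derive(serde::Deserialize)]
    struct WorkerList {
        urls: Vec<String>,
    }

    // Fetch the current worker list and fail loudly on non-2xx responses.
    let resp: WorkerList = reqwest::get(format!("{control_url}/internal/workers"))
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(resp.urls)
}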

8.2.3 Tenant pool eviction

File: gateway/src/middleware.rs

Replace HashMap<String, PgPool> with a moka::future::Cache that has TTL and max size:

use moka::future::Cache;

pub tenant_pools: Cache<String, PgPool>,

// Initialize with TTL and max entries
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
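
Lookups then become a fetch-or-create through the cache; a usage sketch, where the tenant_database_url lookup is illustrative rather than an existing helper:

// Fetch the tenant's pool, building it on a cache miss; moka handles
// idle-TTL eviction and capacity-based eviction automatically.
let pool = state
    .tenant_pools
    .get_with(tenant_id.clone(), async {
        PgPool::connect(&tenant_database_url)
            .await
            .expect("failed to create tenant pool")
    })
    .await;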

8.2.4 Project config cache TTL

File: gateway/src/worker.rs line 97 and middleware.rs

// BEFORE
project_cache: moka::future::Cache::new(100),

// AFTER
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),

8.3 — TLS

8.3.1 TLS termination

Two options:

Option A: External reverse proxy (recommended for simplicity)

Use Caddy or nginx in front of the proxy pillar. Caddy auto-provisions Let's Encrypt certificates:

# Caddyfile
api.example.com {
    reverse_proxy proxy:8000
}

Option B: Built-in rustls

Add axum-server with the rustls feature:

use axum_server::tls_rustls::RustlsConfig;

let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;

Document both options. Recommend Option A for most deployments.


Completion Requirements

This milestone is not complete until every item below is satisfied.

1. Full Test Suite — All Green

  • cargo test --workspace passes with zero failures
  • All pre-existing tests still pass (no regressions)
  • New tests are written for HA features:
| Test | Location | What it validates |
| --- | --- | --- |
| test_graceful_shutdown_completes_inflight | gateway/src/main.rs | After SIGTERM, in-flight request completes before exit |
| test_graceful_shutdown_rejects_new | gateway/src/main.rs | After SIGTERM, new connections are refused |
| test_dynamic_worker_discovery | gateway/src/proxy.rs | Adding a worker URL to the discovery source → proxy routes to it |
| test_connection_pool_ttl_eviction | gateway/src/proxy.rs or common/ | Idle tenant pool is evicted after configured TTL |
| test_connection_pool_lru_eviction | gateway/src/proxy.rs or common/ | When max pools exceeded, least-recently-used is evicted |
| test_project_config_cache_ttl | gateway/src/worker.rs | Stale project config refreshes after TTL (not served forever) |
| test_read_replica_routing | data_api/src/handlers.rs | SELECT queries route to READ_REPLICA_URL when set |
| test_read_replica_fallback | data_api/src/handlers.rs | When READ_REPLICA_URL unset, SELECT uses primary |
| test_tls_rustls_config | gateway/src/main.rs | RustlsConfig::from_pem_file loads certs without error (unit) |

2. HA / Chaos Verification

  • Kill one Patroni node → automatic failover within 30s, no request failures
  • Add a new worker node → proxy discovers it within 30s
  • SIGTERM to worker → in-flight requests complete, then process exits cleanly
  • SIGTERM to proxy → drains connections, then exits
  • Tenant pool cache evicts stale entries after configured TTL
  • Project config changes are reflected within 60 seconds without restart
  • Read queries route to replicas when READ_REPLICA_URL is set
  • HTTPS works via Caddy or built-in TLS
  • Redis Sentinel failover does not break sessions or cache

3. Load Testing

  • Proxy handles 1000 concurrent connections without OOM or thread exhaustion
  • Worker handles 500 req/s with p99 < 500ms for simple queries
  • Connection pool does not leak connections under sustained load (monitor via pg_stat_activity)

4. CI Gate

  • All unit tests run in cargo test --workspace
  • HA integration tests (Patroni failover, Redis Sentinel) are gated behind #[ignore] with documentation
  • Load tests are documented as runnable scripts (not in CI, but in scripts/ or tests/load/)