Milestone 8: High Availability & Scaling

Goal: the system survives node failures and scales horizontally.

Depends on: M1 (Foundation), M7 (CI/CD)


8.1 — Database HA

8.1.1 Multi-node Patroni

File: autobase-haproxy.cfg, listen primary block

Add replica backends:

listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

Raise the global maxconn to 1000.

8.1.2 3-node etcd

Update docker-compose.pillar-database.yml to run three etcd nodes bootstrapped as a single cluster (each node listed in the initial-cluster configuration), and point Patroni at all three etcd endpoints.

8.1.3 Read replica routing

Add READ_REPLICA_URL env var. In data_api/src/handlers.rs, route SELECT queries to the replica pool:

let pool = if is_read_only_query {
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
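
How is_read_only_query is derived is not specified here; a minimal sketch, assuming classification by the leading SQL keyword (a conservative check that treats anything other than a plain SELECT as a write):

// Hypothetical helper: classify a statement as read-only by its first keyword.
// CTEs starting with WITH ... SELECT would need extra handling; when in doubt,
// fall back to the primary.
fn is_read_only(sql: &str) -> bool {
    sql.trim_start()
        .split_whitespace()
        .next()
        .map(|kw| kw.eq_ignore_ascii_case("select"))
        .unwrap_or(false)
}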

8.1.4 Redis Sentinel

Replace the single Redis instance with a 3-node Sentinel setup. Update common/src/cache.rs to use redis::sentinel::SentinelClient.
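
A minimal connection sketch, assuming redis-rs with the sentinel and aio features enabled; the sentinel addresses and the mymaster group name are illustrative, and the exact constructor signature may vary across crate versions:

use redis::sentinel::{SentinelClient, SentinelServerType};

// Illustrative sentinel addresses and service name; adjust to the compose setup.
let mut client = SentinelClient::build(
    vec![
        "redis://sentinel1:26379",
        "redis://sentinel2:26379",
        "redis://sentinel3:26379",
    ],
    String::from("mymaster"),
    None,
    SentinelServerType::Master,
)?;

// The client re-resolves the current master through the sentinels on connect,
// so failover is transparent to callers.
let mut conn = client.get_async_connection().await?;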


8.2 — Proxy & Worker Scaling

8.2.1 Graceful shutdown

Files: gateway/src/main.rs and all bin/*.rs

let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Wait for SIGINT (Ctrl-C) or SIGTERM so orchestrators can trigger a clean drain
let shutdown = async {
    let mut sigterm = tokio::signal::unix::signal(tokio::signal::unix::SignalKind::terminate())
        .expect("failed to install SIGTERM handler");
    tokio::select! {
        _ = tokio::signal::ctrl_c() => {},
        _ = sigterm.recv() => {},
    }
    tracing::info!("Shutdown signal received, draining connections...");
};

server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");

8.2.2 Dynamic worker discovery

Instead of static WORKER_UPSTREAM_URLS, poll the control plane or use Redis pub/sub:

// Background task in proxy
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
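
discover_workers is left undefined above; a minimal sketch, assuming the control plane exposes a JSON endpoint listing worker URLs (the /internal/workers path and response shape are illustrative, not an existing API, and reqwest/serde/anyhow are used here only for brevity):

// Hypothetical discovery call against the control plane.
async fn discover_workers(control_url: &str) -> anyhow::Result<Vec<String>> {
    #[derive(serde::Deserialize)]
    struct WorkerList {
        urls: Vec<String>,
    }

    // Fetch the current worker list and fail loudly on non-2xx responses.
    let resp: WorkerList = reqwest::get(format!("{control_url}/internal/workers"))
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(resp.urls)
}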

8.2.3 Tenant pool eviction

File: gateway/src/middleware.rs

Replace HashMap<String, PgPool> with a moka::future::Cache that has TTL and max size:

use moka::future::Cache;

pub tenant_pools: Cache<String, PgPool>,

// Initialize with TTL and max entries
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
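
Lookups then become a fetch-or-create through the cache; a usage sketch, where the tenant_database_url lookup is illustrative rather than an existing helper:

// Fetch the tenant's pool, building it on a cache miss; moka handles
// idle-TTL eviction and capacity-based eviction automatically.
let pool = state
    .tenant_pools
    .get_with(tenant_id.clone(), async {
        PgPool::connect(&tenant_database_url)
            .await
            .expect("failed to create tenant pool")
    })
    .await;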

8.2.4 Project config cache TTL

File: gateway/src/worker.rs line 97 and middleware.rs

// BEFORE
project_cache: moka::future::Cache::new(100),

// AFTER
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),

8.3 — TLS

8.3.1 TLS termination

Two options:

Option A: External reverse proxy (recommended for simplicity)

Use Caddy or nginx in front of the proxy pillar. Caddy auto-provisions Let's Encrypt certificates:

# Caddyfile
api.example.com {
    reverse_proxy proxy:8000
}

Option B: Built-in rustls

Add axum-server with the rustls feature:

use axum_server::tls_rustls::RustlsConfig;

let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;

Document both options. Recommend Option A for most deployments.


Completion Requirements

This milestone is not complete until every item below is satisfied.

1. Full Test Suite — All Green

  • cargo test --workspace passes with zero failures
  • All pre-existing tests still pass (no regressions)
  • New tests are written for HA features:
| Test | Location | What it validates |
| --- | --- | --- |
| test_graceful_shutdown_completes_inflight | gateway/src/main.rs | After SIGTERM, in-flight request completes before exit |
| test_graceful_shutdown_rejects_new | gateway/src/main.rs | After SIGTERM, new connections are refused |
| test_dynamic_worker_discovery | gateway/src/proxy.rs | Adding a worker URL to the discovery source → proxy routes to it |
| test_connection_pool_ttl_eviction | gateway/src/proxy.rs or common/ | Idle tenant pool is evicted after configured TTL |
| test_connection_pool_lru_eviction | gateway/src/proxy.rs or common/ | When max pools exceeded, least-recently-used is evicted |
| test_project_config_cache_ttl | gateway/src/worker.rs | Stale project config refreshes after TTL (not served forever) |
| test_read_replica_routing | data_api/src/handlers.rs | SELECT queries route to READ_REPLICA_URL when set |
| test_read_replica_fallback | data_api/src/handlers.rs | When READ_REPLICA_URL unset, SELECT uses primary |
| test_tls_rustls_config | gateway/src/main.rs | RustlsConfig::from_pem_file loads certs without error (unit) |

2. HA / Chaos Verification

  • Kill one Patroni node → automatic failover within 30s, no request failures
  • Add a new worker node → proxy discovers it within 30s
  • SIGTERM to worker → in-flight requests complete, then process exits cleanly
  • SIGTERM to proxy → drains connections, then exits
  • Tenant pool cache evicts stale entries after configured TTL
  • Project config changes are reflected within 60 seconds without restart
  • Read queries route to replicas when READ_REPLICA_URL is set
  • HTTPS works via Caddy or built-in TLS
  • Redis Sentinel failover does not break sessions or cache

3. Load Testing

  • Proxy handles 1000 concurrent connections without OOM or thread exhaustion
  • Worker handles 500 req/s with p99 < 500ms for simple queries
  • Connection pool does not leak connections under sustained load (monitor via pg_stat_activity)

4. CI Gate

  • All unit tests run in cargo test --workspace
  • HA integration tests (Patroni failover, Redis Sentinel) are gated behind #[ignore] with documentation
  • Load tests are documented as runnable scripts (not in CI, but in scripts/ or tests/load/)