Milestone 8: High Availability & Scaling
Goal: The system survives node failures and scales horizontally.
Depends on: M1 (Foundation), M7 (CI/CD)
8.1 — Database HA
8.1.1 Multi-node Patroni
File: autobase-haproxy.cfg — listen primary block
Add replica backends:
listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008
In the global section, raise maxconn to 1000.
8.1.2 3-node etcd
Update docker-compose.pillar-database.yml to include 3 etcd nodes with proper cluster configuration.
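As an illustration of what that cluster configuration involves, a fragment for one node is sketched below; the image, network names, and volumes are assumptions, and etcd2/etcd3 differ only in ETCD_NAME and the advertised URLs:
  etcd1:
    image: quay.io/coreos/etcd   # pin to the etcd version already in use
    environment:
      ETCD_NAME: etcd1
      ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
      ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
      ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd1:2380
      ETCD_ADVERTISE_CLIENT_URLS: http://etcd1:2379
      ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
      ETCD_INITIAL_CLUSTER_STATE: new
      ETCD_INITIAL_CLUSTER_TOKEN: pillar-etcd
Patroni (and anything else that talks to etcd) should then be pointed at all three client endpoints rather than a single node.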
8.1.3 Read replica routing
Add a READ_REPLICA_URL env var. In data_api/src/handlers.rs, route read-only SELECT queries to the replica pool:
let pool = if is_read_only_query {
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
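A minimal sketch of the supporting pieces, assuming sqlx and an AppState that holds both pools; build_replica_pool and the is_read_only_query heuristic are illustrative, not existing code:
use sqlx::postgres::{PgPool, PgPoolOptions};

pub struct AppState {
    pub db: PgPool,                   // primary, via HAProxy *:5433
    pub replica_pool: Option<PgPool>, // replicas, via HAProxy *:5434; None when READ_REPLICA_URL is unset
}

// Build the optional replica pool from the new env var at startup.
pub async fn build_replica_pool() -> Option<PgPool> {
    let url = std::env::var("READ_REPLICA_URL").ok()?;
    PgPoolOptions::new()
        .max_connections(10)
        .connect(&url)
        .await
        .ok()
}

// Naive read-only detection used to set the flag in the snippet above;
// a real implementation may need to handle CTEs, comments, and explicit transactions.
fn is_read_only_query(sql: &str) -> bool {
    sql.trim_start().to_ascii_uppercase().starts_with("SELECT")
}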
8.1.4 Redis Sentinel
Replace the single Redis instance with a 3-node Sentinel setup. Update common/src/cache.rs to use redis::sentinel::SentinelClient.
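A sketch of what the client construction might look like using the redis crate's sentinel module (requires the crate's sentinel feature). The sentinel addresses and the "mymaster" service name are deployment-specific assumptions, and the exact builder signature should be verified against the crate version in use:
use redis::sentinel::{SentinelClient, SentinelServerType};

// Resolve the current master through Sentinel instead of a fixed redis:// URL.
pub fn build_sentinel_client() -> redis::RedisResult<SentinelClient> {
    SentinelClient::build(
        vec![
            "redis://sentinel1:26379",
            "redis://sentinel2:26379",
            "redis://sentinel3:26379",
        ],
        "mymaster".to_string(),
        None, // default connection info for the resolved master
        SentinelServerType::Master,
    )
}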
8.2 — Proxy & Worker Scaling
8.2.1 Graceful shutdown
Files: gateway/src/main.rs and all bin/*.rs
let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Containers stop with SIGTERM, which ctrl_c() alone does not catch,
// so wait for either SIGINT (Ctrl+C) or SIGTERM before draining.
let shutdown = async {
    let mut sigterm =
        tokio::signal::unix::signal(tokio::signal::unix::SignalKind::terminate())
            .expect("failed to install SIGTERM handler");
    tokio::select! {
        _ = tokio::signal::ctrl_c() => {},
        _ = sigterm.recv() => {},
    }
    tracing::info!("Shutdown signal received, draining connections...");
};

server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");
8.2.2 Dynamic worker discovery
Instead of a static WORKER_UPSTREAM_URLS list, poll the control plane or use Redis pub/sub:
// Background task in proxy
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
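For reference, a hedged sketch of discover_workers, assuming the control plane exposes a JSON array of worker URLs and that reqwest (with its json feature) is available; the /internal/workers path and the response shape are assumptions:
use std::time::Duration;

// Assumed endpoint returning e.g. ["http://worker1:9000", "http://worker2:9000"].
async fn discover_workers(control_url: &str) -> Result<Vec<String>, reqwest::Error> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(5))
        .build()?
        .get(format!("{control_url}/internal/workers"))
        .send()
        .await?
        .error_for_status()?
        .json::<Vec<String>>()
        .await
}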
8.2.3 Tenant pool eviction
File: gateway/src/middleware.rs
Replace HashMap<String, PgPool> with a moka::future::Cache that has TTL and max size:
use moka::future::Cache;

pub tenant_pools: Cache<String, PgPool>,

// Initialize with TTL and max entries
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
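For context, a usage sketch: moka's get_with runs the initialization future only on a cache miss, so concurrent requests for the same tenant share a single pool creation. pool_for_tenant and tenant_database_url are hypothetical names, not existing code:
use moka::future::Cache;
use sqlx::postgres::{PgPool, PgPoolOptions};

// Look up (or lazily create) the pool for a tenant; idle pools age out via
// time_to_idle, and entries are evicted once max_capacity is reached.
async fn pool_for_tenant(pools: &Cache<String, PgPool>, tenant_id: &str) -> PgPool {
    pools
        .get_with(tenant_id.to_string(), async {
            let url = tenant_database_url(tenant_id); // hypothetical helper
            PgPoolOptions::new()
                .max_connections(5)
                .connect(&url)
                .await
                .expect("failed to create tenant pool") // try_get_with allows propagating errors instead
        })
        .await
}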
8.2.4 Project config cache TTL
File: gateway/src/worker.rs line 97 and middleware.rs
// BEFORE
project_cache: moka::future::Cache::new(100),

// AFTER
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),
8.3 — TLS
8.3.1 TLS termination
Two options:
Option A: External reverse proxy (recommended for simplicity)
Use Caddy or nginx in front of the proxy pillar. Caddy auto-provisions Let's Encrypt certificates:
# Caddyfile
api.example.com {
    reverse_proxy proxy:8000
}
Option B: Built-in rustls
Add axum-server with the rustls feature:
use axum_server::tls_rustls::RustlsConfig;

let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;
Document both options. Recommend Option A for most deployments.
Completion Requirements
This milestone is not complete until every item below is satisfied.
1. Full Test Suite — All Green
- cargo test --workspace passes with zero failures
- All pre-existing tests still pass (no regressions)
- New tests are written for HA features:
| Test | Location | What it validates |
|---|---|---|
| test_graceful_shutdown_completes_inflight | gateway/src/main.rs | After SIGTERM, in-flight request completes before exit |
| test_graceful_shutdown_rejects_new | gateway/src/main.rs | After SIGTERM, new connections are refused |
| test_dynamic_worker_discovery | gateway/src/proxy.rs | Adding a worker URL to the discovery source → proxy routes to it |
| test_connection_pool_ttl_eviction | gateway/src/proxy.rs or common/ | Idle tenant pool is evicted after configured TTL |
| test_connection_pool_lru_eviction | gateway/src/proxy.rs or common/ | When max pools exceeded, least-recently-used is evicted |
| test_project_config_cache_ttl | gateway/src/worker.rs | Stale project config refreshes after TTL (not served forever) |
| test_read_replica_routing | data_api/src/handlers.rs | SELECT queries route to READ_REPLICA_URL when set |
| test_read_replica_fallback | data_api/src/handlers.rs | When READ_REPLICA_URL unset, SELECT uses primary |
| test_tls_rustls_config | gateway/src/main.rs | RustlsConfig::from_pem_file loads certs without error (unit) |
2. HA / Chaos Verification
- Kill one Patroni node → automatic failover within 30s, no request failures
- Add a new worker node → proxy discovers it within 30s
- SIGTERM to worker → in-flight requests complete, then process exits cleanly
- SIGTERM to proxy → drains connections, then exits
- Tenant pool cache evicts stale entries after configured TTL
- Project config changes are reflected within 60 seconds without restart
- Read queries route to replicas when READ_REPLICA_URL is set
- HTTPS works via Caddy or built-in TLS
- Redis Sentinel failover does not break sessions or cache
3. Load Testing
- Proxy handles 1000 concurrent connections without OOM or thread exhaustion
- Worker handles 500 req/s with p99 < 500ms for simple queries
- Connection pool does not leak connections under sustained load (monitor via pg_stat_activity)
4. CI Gate
- All unit tests run in cargo test --workspace
- HA integration tests (Patroni failover, Redis Sentinel) are gated behind #[ignore] with documentation
- Load tests are documented as runnable scripts (not in CI, but in scripts/ or tests/load/)