# Milestone 8: High Availability & Scaling

**Goal:** The system survives node failures and handles horizontal scaling.

**Depends on:** M1 (Foundation), M7 (CI/CD)

---

## 8.1 — Database HA

### 8.1.1 Multi-node Patroni

**File:** `autobase-haproxy.cfg` — `listen primary` block

Add replica backends:

```
listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008
```

Update the global `maxconn` to `1000`.

### 8.1.2 3-node etcd

Update `docker-compose.pillar-database.yml` to run a 3-node etcd cluster with proper cluster configuration, so the Patroni DCS keeps quorum when a single node fails.

### 8.1.3 Read replica routing

Add a `READ_REPLICA_URL` env var. In `data_api/src/handlers.rs`, route read-only queries to the replica pool, falling back to the primary when no replica is configured:

```rust
// Read-only statements go to the replica pool when one is configured;
// writes (and the no-replica fallback) go to the primary.
let pool = if is_read_only_query {
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
```

### 8.1.4 Redis Sentinel

Replace the single Redis instance with a 3-node Sentinel setup. Update `common/src/cache.rs` to use `redis::sentinel::SentinelClient`.
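The routing branch in 8.1.3 assumes an `is_read_only_query` flag. One conservative way to compute it, sketched in standard-library Rust only (the helper name and classification rules here are assumptions, not the codebase's actual logic; misclassifying toward the primary is always safe):

```rust
/// Conservative classifier for replica-eligible statements: only plain
/// SELECT/SHOW qualifies, and locking reads stay on the primary.
/// (Hypothetical helper; a real implementation may track read-only
/// status per prepared statement instead of inspecting SQL text.)
fn is_read_only_query(sql: &str) -> bool {
    let normalized = sql.trim_start().to_ascii_uppercase();
    // SELECT ... FOR UPDATE / FOR SHARE takes row locks, so it must
    // run on the primary even though it starts with SELECT.
    if normalized.contains("FOR UPDATE") || normalized.contains("FOR SHARE") {
        return false;
    }
    normalized.starts_with("SELECT") || normalized.starts_with("SHOW")
}

fn main() {
    assert!(is_read_only_query("select id, name from users"));
    assert!(is_read_only_query("SHOW server_version"));
    assert!(!is_read_only_query("SELECT * FROM jobs FOR UPDATE"));
    // Data-modifying CTEs start with WITH, so they fall through to the primary.
    assert!(!is_read_only_query("WITH d AS (DELETE FROM t RETURNING *) SELECT * FROM d"));
    println!("classifier ok");
}
```

Note the deliberate bias: anything not positively identified as a read goes to the primary, which keeps correctness even when classification is incomplete.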
---

## 8.2 — Proxy & Worker Scaling

### 8.2.1 Graceful shutdown

**File:** `gateway/src/main.rs` and all `bin/*.rs`

```rust
let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Wait for a shutdown signal: Ctrl-C (SIGINT) for local runs, plus the
// SIGTERM that container orchestrators send on shutdown.
let shutdown = async {
    #[cfg(unix)]
    {
        let mut sigterm =
            tokio::signal::unix::signal(tokio::signal::unix::SignalKind::terminate())
                .expect("failed to install SIGTERM handler");
        tokio::select! {
            _ = tokio::signal::ctrl_c() => {}
            _ = sigterm.recv() => {}
        }
    }
    #[cfg(not(unix))]
    tokio::signal::ctrl_c().await.ok();
    tracing::info!("Shutdown signal received, draining connections...");
};
server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");
```

### 8.2.2 Dynamic worker discovery

Instead of a static `WORKER_UPSTREAM_URLS` list, poll the control plane or subscribe via Redis pub/sub:

```rust
// Background task in the proxy: refresh the upstream set every 30s.
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
```

### 8.2.3 Tenant pool eviction

**File:** `gateway/src/middleware.rs`

Replace the unbounded `HashMap` with a `moka::future::Cache` that enforces an idle TTL and a maximum size:

```rust
use moka::future::Cache;

// Key/value types shown here are assumptions; keep whatever the
// HashMap currently stores.
pub tenant_pools: Cache<String, PgPool>,

// Initialize with an idle TTL and a max entry count:
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
```

### 8.2.4 Project config cache TTL

**File:** `gateway/src/worker.rs` line 97 and `middleware.rs`

```rust
// BEFORE: size-bounded only, so stale configs can be served forever
project_cache: moka::future::Cache::new(100),

// AFTER: entries expire 60s after insertion
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),
```

---

## 8.3 — TLS

### 8.3.1 TLS termination

Two options:

**Option A: External reverse proxy (recommended for simplicity)**

Use Caddy or nginx in front of the proxy pillar.
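If nginx is chosen for Option A, a minimal termination block might look like the following (the domain and certificate paths are placeholders; unlike Caddy, nginx does not auto-provision certificates, so pair it with certbot or similar):

```
server {
    listen 443 ssl;
    server_name api.example.com;

    # Paths assume certbot's default layout
    ssl_certificate     /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    location / {
        proxy_pass http://proxy:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```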
Caddy auto-provisions Let's Encrypt certificates:

```
# Caddyfile
api.example.com {
    reverse_proxy proxy:8000
}
```

**Option B: Built-in rustls**

Add `axum-server` with the `rustls` feature:

```rust
use axum_server::tls_rustls::RustlsConfig;

let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;
```

Document both options. Recommend Option A for most deployments.

---

## Completion Requirements

This milestone is **not complete** until every item below is satisfied.

### 1. Full Test Suite — All Green

- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for HA features:

| Test | Location | What it validates |
|------|----------|-------------------|
| `test_graceful_shutdown_completes_inflight` | `gateway/src/main.rs` | After SIGTERM, an in-flight request completes before exit |
| `test_graceful_shutdown_rejects_new` | `gateway/src/main.rs` | After SIGTERM, new connections are refused |
| `test_dynamic_worker_discovery` | `gateway/src/proxy.rs` | Adding a worker URL to the discovery source → proxy routes to it |
| `test_connection_pool_ttl_eviction` | `gateway/src/proxy.rs` or `common/` | Idle tenant pool is evicted after the configured TTL |
| `test_connection_pool_lru_eviction` | `gateway/src/proxy.rs` or `common/` | When max pools is exceeded, the least-recently-used pool is evicted |
| `test_project_config_cache_ttl` | `gateway/src/worker.rs` | Stale project config refreshes after TTL (not served forever) |
| `test_read_replica_routing` | `data_api/src/handlers.rs` | SELECT queries route to `READ_REPLICA_URL` when set |
| `test_read_replica_fallback` | `data_api/src/handlers.rs` | When `READ_REPLICA_URL` is unset, SELECTs use the primary |
| `test_tls_rustls_config` | `gateway/src/main.rs` | `RustlsConfig::from_pem_file` loads certs without error (unit) |

### 2. HA / Chaos Verification

- [ ] Kill one Patroni node → automatic failover within 30s, no request failures
- [ ] Add a new worker node → proxy discovers it within 30s
- [ ] SIGTERM to worker → in-flight requests complete, then the process exits cleanly
- [ ] SIGTERM to proxy → drains connections, then exits
- [ ] Tenant pool cache evicts stale entries after the configured TTL
- [ ] Project config changes are reflected within 60 seconds without a restart
- [ ] Read queries route to replicas when `READ_REPLICA_URL` is set
- [ ] HTTPS works via Caddy or built-in TLS
- [ ] Redis Sentinel failover does not break sessions or cache

### 3. Load Testing

- [ ] Proxy handles 1000 concurrent connections without OOM or thread exhaustion
- [ ] Worker handles 500 req/s with p99 < 500ms for simple queries
- [ ] Connection pool does not leak connections under sustained load (monitor via `pg_stat_activity`)

### 4. CI Gate

- [ ] All unit tests run in `cargo test --workspace`
- [ ] HA integration tests (Patroni failover, Redis Sentinel) are gated behind `#[ignore]` with documentation
- [ ] Load tests are documented as runnable scripts (not in CI, but in `scripts/` or `tests/load/`)
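The `#[ignore]` gating called for in the CI gate can follow this shape (test and helper names here are illustrative; the health endpoint and port come from the HAProxy config in 8.1.1):

```rust
use std::time::Duration;

/// Patroni health URL as probed by HAProxy in 8.1.1
/// (`option httpchk GET /primary` against `check port 8008`).
/// Hypothetical helper, shown only to anchor the test skeleton.
fn primary_health_url(host: &str) -> String {
    format!("http://{}:8008/primary", host)
}

/// Gated HA integration test: it needs a live 3-node Patroni cluster,
/// so CI skips it and operators run it explicitly with
/// `cargo test --workspace -- --ignored`.
#[test]
#[ignore = "requires a running Patroni cluster; see the HA / Chaos Verification checklist"]
fn test_patroni_failover_within_30s() {
    let _deadline = Duration::from_secs(30);
    let _primary = primary_health_url("patroni1");
    // Kill the current leader via the Patroni REST API, then poll
    // `_primary` until a new leader answers 200 within `_deadline`.
}

fn main() {
    println!("{}", primary_health_url("patroni1"));
}
```

The string in the `#[ignore = "..."]` attribute is printed by the test harness, so the skip reason documents itself in `cargo test` output.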