madbase/_milestones/M8_high_availability.md
Vlad Durnea cffdf8af86, 2026-03-15 12:35:42 +02:00

# Milestone 8: High Availability & Scaling
**Goal:** The system survives node failures and handles horizontal scaling.
**Depends on:** M1 (Foundation), M7 (CI/CD)
---
## 8.1 — Database HA
### 8.1.1 Multi-node Patroni
**File:** `autobase-haproxy.cfg`, `listen primary` block
Add replica backends:
```
listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008
```
Update the global `maxconn` to `1000`.
### 8.1.2 3-node etcd
Update `docker-compose.pillar-database.yml` to run a 3-node etcd cluster with static bootstrap configuration: each node lists all three peers in `ETCD_INITIAL_CLUSTER` and shares the same cluster token.
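A minimal sketch of one node of such a cluster using etcd's static bootstrap; service names, the image tag, and the cluster token are assumptions, and the other two nodes mirror this block with their own names and advertise URLs:

```yaml
etcd1:
  image: quay.io/coreos/etcd:v3.5.17   # example tag; pin to the version in use
  environment:
    ETCD_NAME: etcd1
    ETCD_INITIAL_CLUSTER_TOKEN: pillar-database
    ETCD_INITIAL_CLUSTER_STATE: new
    ETCD_INITIAL_CLUSTER: etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
    ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd1:2380
    ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
    ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
    ETCD_ADVERTISE_CLIENT_URLS: http://etcd1:2379
  # etcd2 and etcd3 differ only in ETCD_NAME and the two advertise URLs;
  # Patroni's etcd3 host list should name all three client endpoints.
```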
### 8.1.3 Read replica routing
Add `READ_REPLICA_URL` env var. In `data_api/src/handlers.rs`, route SELECT queries to the replica pool:
```rust
let pool = if is_read_only_query {
    // Fall back to the primary when no replica pool is configured.
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
```
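How `is_read_only_query` is computed is not specified here; one conservative sketch (helper name is an assumption) routes only plain `SELECT`s to the replica and sends anything ambiguous to the primary:

```rust
/// Conservative read-only check: route to a replica only when the statement
/// starts with SELECT and contains no data-modifying keyword anywhere.
/// This also keeps `SELECT ... FOR UPDATE` (contains UPDATE) and statements
/// merely *mentioning* a write keyword on the primary, which is safe.
fn is_read_only(sql: &str) -> bool {
    let upper = sql.trim_start().to_ascii_uppercase();
    upper.starts_with("SELECT")
        && !["INSERT", "UPDATE", "DELETE", "TRUNCATE"]
            .iter()
            .any(|kw| upper.contains(kw))
}
```

False negatives (a harmless `SELECT` sent to the primary) only cost replica offload; false positives would serve writes from a replica, so the check errs toward the primary.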
### 8.1.4 Redis Sentinel
Replace single Redis with 3-node Sentinel setup. Update `common/src/cache.rs` to use `redis::sentinel::SentinelClient`.
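A connection sketch, assuming the `redis` crate with its `sentinel` and async features enabled; the sentinel hostnames and the `mymaster` service name are placeholders, and the exact builder signature should be verified against the crate version in use:

```rust
use redis::sentinel::{SentinelClient, SentinelServerType};

// Point the client at all three sentinels; it resolves the current master
// of the named service and re-resolves after a failover.
let mut client = SentinelClient::build(
    vec![
        "redis://sentinel1:26379",
        "redis://sentinel2:26379",
        "redis://sentinel3:26379",
    ],
    String::from("mymaster"),    // monitored service name from sentinel.conf
    None,                        // default connection info for the data node
    SentinelServerType::Master,  // writes must go to the current master
)?;
let mut conn = client.get_async_connection().await?;
```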
---
## 8.2 — Proxy & Worker Scaling
### 8.2.1 Graceful shutdown
**File:** `gateway/src/main.rs` and all `bin/*.rs`
```rust
let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Wait for shutdown signal
let shutdown = async {
    tokio::signal::ctrl_c().await.ok();
    tracing::info!("Shutdown signal received, draining connections...");
};
server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");
```
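Note that `ctrl_c()` only catches SIGINT, while the chaos checks in this milestone send SIGTERM (what Docker and Kubernetes send on stop). A sketch of a combined signal future, following the common tokio idiom:

```rust
use tokio::signal;

/// Resolves on SIGINT (Ctrl-C) or, on Unix, SIGTERM, so that
/// `with_graceful_shutdown` also drains on orchestrator-initiated stops.
async fn shutdown_signal() {
    let ctrl_c = async {
        signal::ctrl_c().await.expect("failed to install Ctrl-C handler");
    };

    #[cfg(unix)]
    let terminate = async {
        signal::unix::signal(signal::unix::SignalKind::terminate())
            .expect("failed to install SIGTERM handler")
            .recv()
            .await;
    };

    #[cfg(not(unix))]
    let terminate = std::future::pending::<()>();

    tokio::select! {
        _ = ctrl_c => {},
        _ = terminate => {},
    }
}
```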
### 8.2.2 Dynamic worker discovery
Instead of static `WORKER_UPSTREAM_URLS`, poll the control plane or use Redis pub/sub:
```rust
// Background task in the proxy: refresh the worker list every 30 seconds.
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
```
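The `discover_workers` call above is left abstract; one possible shape, assuming the control plane exposes a JSON array of worker base URLs at a hypothetical `/internal/workers` endpoint and the `reqwest` crate (with its `json` feature) is available:

```rust
// Hypothetical: fetch the current worker list from the control plane.
// The endpoint path and response shape are assumptions, not an existing API.
async fn discover_workers(control_url: &str) -> Result<Vec<String>, reqwest::Error> {
    reqwest::get(format!("{control_url}/internal/workers"))
        .await?
        .json::<Vec<String>>()
        .await
}
```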
### 8.2.3 Tenant pool eviction
**File:** `gateway/src/middleware.rs`
Replace `HashMap<String, PgPool>` with a `moka::future::Cache` that has TTL and max size:
```rust
use moka::future::Cache;

// Struct field replacing the unbounded HashMap:
pub tenant_pools: Cache<String, PgPool>,

// Initialize with a max size and an idle TTL:
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
```
### 8.2.4 Project config cache TTL
**File:** `gateway/src/worker.rs` line 97 and `middleware.rs`
```rust
// BEFORE: capacity-bounded only; stale configs are served until evicted
project_cache: moka::future::Cache::new(100),

// AFTER: entries also expire 60s after insertion
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),
```
---
## 8.3 — TLS
### 8.3.1 TLS termination
Two options:
**Option A: External reverse proxy (recommended for simplicity)**
Use Caddy or nginx in front of the proxy pillar. Caddy auto-provisions Let's Encrypt certificates:
```
# Caddyfile
api.example.com {
    reverse_proxy proxy:8000
}
```
**Option B: Built-in rustls**
Add `axum-server` with `rustls` feature:
```rust
use axum_server::tls_rustls::RustlsConfig;

let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;
```
Document both options. Recommend Option A for most deployments.
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for HA features:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_graceful_shutdown_completes_inflight` | `gateway/src/main.rs` | After SIGTERM, in-flight request completes before exit |
| `test_graceful_shutdown_rejects_new` | `gateway/src/main.rs` | After SIGTERM, new connections are refused |
| `test_dynamic_worker_discovery` | `gateway/src/proxy.rs` | Adding a worker URL to the discovery source → proxy routes to it |
| `test_connection_pool_ttl_eviction` | `gateway/src/proxy.rs` or `common/` | Idle tenant pool is evicted after configured TTL |
| `test_connection_pool_lru_eviction` | `gateway/src/proxy.rs` or `common/` | When max pools exceeded, least-recently-used is evicted |
| `test_project_config_cache_ttl` | `gateway/src/worker.rs` | Stale project config refreshes after TTL (not served forever) |
| `test_read_replica_routing` | `data_api/src/handlers.rs` | SELECT queries route to `READ_REPLICA_URL` when set |
| `test_read_replica_fallback` | `data_api/src/handlers.rs` | When `READ_REPLICA_URL` unset, SELECT uses primary |
| `test_tls_rustls_config` | `gateway/src/main.rs` | `RustlsConfig::from_pem_file` loads certs without error (unit) |
### 2. HA / Chaos Verification
- [ ] Kill one Patroni node → automatic failover within 30s, no request failures
- [ ] Add a new worker node → proxy discovers it within 30s
- [ ] SIGTERM to worker → in-flight requests complete, then process exits cleanly
- [ ] SIGTERM to proxy → drains connections, then exits
- [ ] Tenant pool cache evicts stale entries after configured TTL
- [ ] Project config changes are reflected within 60 seconds without restart
- [ ] Read queries route to replicas when `READ_REPLICA_URL` is set
- [ ] HTTPS works via Caddy or built-in TLS
- [ ] Redis Sentinel failover does not break sessions or cache
### 3. Load Testing
- [ ] Proxy handles 1000 concurrent connections without OOM or thread exhaustion
- [ ] Worker handles 500 req/s with p99 < 500ms for simple queries
- [ ] Connection pool does not leak connections under sustained load (monitor via `pg_stat_activity`)
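A runnable starting point for the checks above, as a sketch for a `scripts/load_test.sh`, assuming `wrk` is installed and the proxy listens locally on port 8000 (the `/health` path is a placeholder):

```shell
#!/usr/bin/env sh
# Drive 1000 concurrent connections for 60s and print the latency
# distribution, so the p99 target can be read off directly.
wrk -t8 -c1000 -d60s --latency "http://localhost:8000/health"

# While the load runs, watch for leaked connections in another terminal:
#   psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
```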
### 4. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] HA integration tests (Patroni failover, Redis Sentinel) are gated behind `#[ignore]` with documentation
- [ ] Load tests are documented as runnable scripts (not in CI, but in `scripts/` or `tests/load/`)