# Milestone 8: High Availability & Scaling

**Goal:** The system survives node failures and scales horizontally.

**Depends on:** M1 (Foundation), M7 (CI/CD)

---
## 8.1 — Database HA

### 8.1.1 Multi-node Patroni

**File:** `autobase-haproxy.cfg` — `listen primary` block

Add replica backends:

```
listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008
```
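Port `5433` always resolves to the current primary (Patroni's `/primary` health endpoint returns 200 only on the leader), while `5434` round-robins across healthy replicas; the `READ_REPLICA_URL` introduced in 8.1.3 should point at the `5434` listener.
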
Update the global `maxconn` to `1000`.

### 8.1.2 3-node etcd

Update `docker-compose.pillar-database.yml` to run three etcd nodes with static bootstrap configuration: unique member names, `ETCD_INITIAL_CLUSTER` listing all three peers, and `ETCD_INITIAL_CLUSTER_STATE=new`, then point each Patroni node at all three etcd endpoints.
### 8.1.3 Read replica routing

Add a `READ_REPLICA_URL` env var. In `data_api/src/handlers.rs`, route read-only SELECT queries to the replica pool, falling back to the primary when no replica is configured:

```rust
// Use the replica pool for reads when one is configured; writes (and
// reads, when no replica pool exists) go to the primary.
let pool = if is_read_only_query {
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
```
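A minimal sketch of building that optional pool at startup, assuming the service uses sqlx and keeps both pools on the `AppState` seen above (`db` plus `replica_pool: Option<PgPool>`); the pool size is a placeholder:

```rust
use sqlx::postgres::{PgPool, PgPoolOptions};

// READ_REPLICA_URL should point at the HAProxy replicas listener
// (*:5434 from 8.1.1). When unset, replica_pool stays None and all
// queries fall through to the primary pool.
let replica_pool: Option<PgPool> = match std::env::var("READ_REPLICA_URL") {
    Ok(url) => Some(
        PgPoolOptions::new()
            .max_connections(20) // assumption: tune per deployment
            .connect(&url)
            .await?,
    ),
    Err(_) => None,
};
```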
### 8.1.4 Redis Sentinel

Replace the single Redis instance with a 3-node Sentinel setup. Update `common/src/cache.rs` to use `redis::sentinel::SentinelClient` (gated behind the redis crate's `sentinel` feature).
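
A sketch of the client setup, hedged because the `SentinelClient` builder signature varies across redis crate versions; the sentinel hostnames and the `mymaster` service name are assumptions to be replaced with real deployment values:

```rust
use redis::sentinel::{SentinelClient, SentinelServerType};

// Ask the sentinels where the current master is; after a failover, the
// next connection request resolves to the newly promoted master.
let mut client = SentinelClient::build(
    vec![
        "redis://sentinel1:26379",
        "redis://sentinel2:26379",
        "redis://sentinel3:26379",
    ],
    String::from("mymaster"), // sentinel service name (assumption)
    None,                     // default connection info for the data nodes
    SentinelServerType::Master,
)?;
let mut conn = client.get_async_connection().await?;
```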
---

## 8.2 — Proxy & Worker Scaling

### 8.2.1 Graceful shutdown

**File:** `gateway/src/main.rs` and all `bin/*.rs`

```rust
let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Wait for a shutdown signal. Ctrl-C (SIGINT) covers local runs;
// SIGTERM is what Docker and Kubernetes send on stop, so listen for both.
let shutdown = async {
    let mut sigterm = tokio::signal::unix::signal(
        tokio::signal::unix::SignalKind::terminate(),
    )
    .expect("failed to install SIGTERM handler");
    tokio::select! {
        _ = tokio::signal::ctrl_c() => {},
        _ = sigterm.recv() => {},
    }
    tracing::info!("Shutdown signal received, draining connections...");
};

server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");
```
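Once the `shutdown` future resolves, axum stops accepting new connections and waits for in-flight requests to finish before `serve` returns; this is the behavior `test_graceful_shutdown_completes_inflight` and `test_graceful_shutdown_rejects_new` assert below.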
### 8.2.2 Dynamic worker discovery

Instead of a static `WORKER_UPSTREAM_URLS` list, poll the control plane (or subscribe via Redis pub/sub) so new workers are picked up without a proxy restart:

```rust
// Background task in the proxy: refresh the upstream set every 30s.
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                // Swap in the new set under the RwLock guarding upstreams.
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
```
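`discover_workers` is left undefined above; here is a minimal sketch assuming a hypothetical control-plane endpoint `GET /internal/workers` that returns a JSON array of worker base URLs, fetched with `reqwest`:

```rust
// Hypothetical endpoint and response shape; adjust both to whatever
// the control plane actually exposes.
async fn discover_workers(control_url: &str) -> anyhow::Result<Vec<String>> {
    let workers: Vec<String> = reqwest::get(format!("{control_url}/internal/workers"))
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(workers)
}
```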
### 8.2.3 Tenant pool eviction

**File:** `gateway/src/middleware.rs`

Replace `HashMap<String, PgPool>` with a `moka::future::Cache` that has a TTL and a max size:

```rust
use moka::future::Cache;

// State field: a bounded, TTL-evicting pool cache instead of an unbounded map.
pub tenant_pools: Cache<String, PgPool>,

// Initialize with an idle TTL and a max entry count
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
```
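Lookups then go through moka's fallible loader, which builds a pool at most once per tenant even under concurrent misses; `connect_tenant_pool` is a hypothetical helper returning `Result<PgPool, sqlx::Error>`:

```rust
// On a miss, moka runs the init future once (concurrent requests for
// the same tenant await the same future) and caches the resulting pool.
// Entries idle for 300s are evicted, releasing that tenant's pool.
let pool = state
    .tenant_pools
    .try_get_with(tenant_id.clone(), connect_tenant_pool(&tenant_id))
    .await
    .map_err(|e| anyhow::anyhow!("tenant pool init failed: {e}"))?;
```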
### 8.2.4 Project config cache TTL

**File:** `gateway/src/worker.rs` line 97 and `middleware.rs`

```rust
// BEFORE: size-bounded only; entries never expire
project_cache: moka::future::Cache::new(100),

// AFTER: same capacity, plus a 60-second time-to-live
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),
```

`time_to_live` rather than `time_to_idle` is deliberate: a constantly-read project config is exactly the one that must not be served stale forever, and a hard 60s expiry satisfies the "config changes reflected within 60 seconds" check under Completion Requirements.

---
## 8.3 — TLS

### 8.3.1 TLS termination

Two options:

**Option A: External reverse proxy (recommended for simplicity)**
Use Caddy or nginx in front of the proxy pillar. Caddy auto-provisions Let's Encrypt certificates:

```
# Caddyfile
api.example.com {
    reverse_proxy proxy:8000
}
```
**Option B: Built-in rustls**
Add `axum-server` with the `rustls` feature:

```rust
use axum_server::tls_rustls::RustlsConfig;

// Load the certificate and private key from PEM files at startup.
let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;
```

Document both options. Recommend Option A for most deployments.

---
## Completion Requirements

This milestone is **not complete** until every item below is satisfied.

### 1. Full Test Suite — All Green

- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for HA features:

| Test | Location | What it validates |
|------|----------|-------------------|
| `test_graceful_shutdown_completes_inflight` | `gateway/src/main.rs` | After SIGTERM, in-flight request completes before exit |
| `test_graceful_shutdown_rejects_new` | `gateway/src/main.rs` | After SIGTERM, new connections are refused |
| `test_dynamic_worker_discovery` | `gateway/src/proxy.rs` | Adding a worker URL to the discovery source → proxy routes to it |
| `test_connection_pool_ttl_eviction` | `gateway/src/proxy.rs` or `common/` | Idle tenant pool is evicted after configured TTL |
| `test_connection_pool_lru_eviction` | `gateway/src/proxy.rs` or `common/` | When max pools exceeded, least-recently-used is evicted |
| `test_project_config_cache_ttl` | `gateway/src/worker.rs` | Stale project config refreshes after TTL (not served forever) |
| `test_read_replica_routing` | `data_api/src/handlers.rs` | SELECT queries route to `READ_REPLICA_URL` when set |
| `test_read_replica_fallback` | `data_api/src/handlers.rs` | When `READ_REPLICA_URL` unset, SELECT uses primary |
| `test_tls_rustls_config` | `gateway/src/main.rs` | `RustlsConfig::from_pem_file` loads certs without error (unit) |
### 2. HA / Chaos Verification

- [ ] Kill one Patroni node → automatic failover within 30s, no request failures
- [ ] Add a new worker node → proxy discovers it within 30s
- [ ] SIGTERM to worker → in-flight requests complete, then process exits cleanly
- [ ] SIGTERM to proxy → drains connections, then exits
- [ ] Tenant pool cache evicts stale entries after configured TTL
- [ ] Project config changes are reflected within 60 seconds without restart
- [ ] Read queries route to replicas when `READ_REPLICA_URL` is set
- [ ] HTTPS works via Caddy or built-in TLS
- [ ] Redis Sentinel failover does not break sessions or cache

### 3. Load Testing

- [ ] Proxy handles 1000 concurrent connections without OOM or thread exhaustion
- [ ] Worker handles 500 req/s with p99 < 500ms for simple queries
- [ ] Connection pool does not leak connections under sustained load (monitor via `pg_stat_activity`)

### 4. CI Gate

- [ ] All unit tests run in `cargo test --workspace`
- [ ] HA integration tests (Patroni failover, Redis Sentinel) are gated behind `#[ignore]` with documentation (see the sketch below)
- [ ] Load tests are documented as runnable scripts (not in CI, but in `scripts/` or `tests/load/`)
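
A sketch of the `#[ignore]` gating convention for those HA integration tests; the test name and body here are placeholders:

```rust
/// Requires a live 3-node Patroni cluster
/// (`docker-compose.pillar-database.yml`). Skipped in normal CI runs;
/// execute locally with `cargo test -- --ignored`.
#[tokio::test]
#[ignore = "needs a running Patroni cluster"]
async fn test_patroni_failover_preserves_writes() {
    // Placeholder: kill the current leader, then assert writes succeed
    // again within the 30s failover budget.
}
```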