wip:milestone 0 fixes
Some checks failed
CI/CD Pipeline / unit-tests (push) Failing after 1m16s
CI/CD Pipeline / integration-tests (push) Failing after 2m32s
CI/CD Pipeline / lint (push) Successful in 5m22s
CI/CD Pipeline / e2e-tests (push) Has been skipped
CI/CD Pipeline / build (push) Has been skipped

This commit is contained in:
2026-03-15 12:35:42 +02:00
parent 6708cf28a7
commit cffdf8af86
61266 changed files with 4511646 additions and 1938 deletions


@@ -0,0 +1,540 @@
# Milestone 0: Security Hardening (CRITICAL)
**Goal:** Eliminate every exploitable vulnerability before any deployment or beta.
**Depends on:** Nothing — this is the first milestone.
**Blocks:** Everything else. Do not proceed to M1+ until every task here is complete.
---
## 0.1 — Secrets & Credential Hygiene
### 0.1.1 Remove all secret logging
**Files to audit:**
| File | Line | What's logged | Severity |
|------|------|---------------|----------|
| `auth/src/middleware.rs` | 46 | `tracing::info!("Using project-specific JWT secret: '{}'", ctx.jwt_secret)` | CRITICAL |
| `auth/src/middleware.rs` | 49 | `tracing::warn!("...Using global JWT secret: '{}'", state.config.jwt_secret)` | CRITICAL |
| `gateway/src/middleware.rs` | 139 | `tracing::info!("Injecting tenant pool for project={} db_url={}", ...)` — logs full DB URL with password | CRITICAL |
| `auth/src/handlers.rs` | 81-84 | `tracing::info!("Sending confirmation email to {}: token={}", ...)` — logs confirmation token | HIGH |
| `auth/src/handlers.rs` | 297-300 | `tracing::info!("Sending recovery email to {}: token={}", ...)` — logs recovery token | HIGH |
**How to fix:** Replace each with a sanitized log that omits the secret value:
```rust
// BEFORE (auth/src/middleware.rs:46)
tracing::info!("Using project-specific JWT secret: '{}'", ctx.jwt_secret);
// AFTER
tracing::debug!("Using project-specific JWT secret for project");
```
For DB URLs, log only the host/database, never the password:
```rust
// BEFORE (gateway/src/middleware.rs:139)
tracing::info!("Injecting tenant pool for project={} db_url={}", project_ctx.project_ref, db_url);
// AFTER
tracing::info!(project = %project_ctx.project_ref, "Injecting tenant pool");
```
**Audit procedure:** Run `rg 'jwt_secret|db_url|password|token=' --type rust -n` and review every hit. Replace INFO/WARN-level secret logs with DEBUG-level sanitized versions or remove entirely.
### 0.1.2 Make JWT_SECRET required
**File:** `common/src/config.rs` line 31
```rust
// BEFORE
let jwt_secret = env::var("JWT_SECRET")
.unwrap_or_else(|_| "super-secret-key-please-change".to_string());
// AFTER
let jwt_secret = env::var("JWT_SECRET")
.expect("JWT_SECRET must be set. Generate one with: openssl rand -hex 32");
```
Also enforce minimum length:
```rust
if jwt_secret.len() < 32 {
panic!("JWT_SECRET must be at least 32 characters");
}
```
### 0.1.3 Make ADMIN_PASSWORD required and enforce strength
**File:** `control_plane/src/lib.rs` line 335
```rust
// BEFORE
let admin_password = std::env::var("ADMIN_PASSWORD").unwrap_or_else(|_| "admin".to_string());
// AFTER
let admin_password = std::env::var("ADMIN_PASSWORD")
.expect("ADMIN_PASSWORD must be set");
```
### 0.1.4 Remove hardcoded fallback S3 credentials
**File:** `storage/src/backend.rs` lines 29-34
```rust
// BEFORE
let access_key = env::var("S3_ACCESS_KEY")
.or_else(|_| env::var("MINIO_ROOT_USER"))
.unwrap_or_else(|_| "minioadmin".to_string());
// AFTER
let access_key = env::var("S3_ACCESS_KEY")
.or_else(|_| env::var("MINIO_ROOT_USER"))
.expect("S3_ACCESS_KEY or MINIO_ROOT_USER must be set");
```
Apply the same to `S3_SECRET_KEY` / `MINIO_ROOT_PASSWORD`.
### 0.1.5 Remove Serialize derive from Config
**File:** `common/src/config.rs` line 4
```rust
// BEFORE
#[derive(Clone, Debug, Deserialize, Serialize)]
pub struct Config {
// AFTER
#[derive(Clone, Debug, Deserialize)]
pub struct Config {
```
Remove `use serde::Serialize` if no longer needed elsewhere in the module.
---
## 0.2 — Authentication & Authorization Fixes
### 0.2.1 Fix admin auth middleware
**File:** `gateway/src/admin_auth.rs` lines 27-33
**Current broken logic:**
```rust
let has_session = req.headers()
.get(axum::http::header::COOKIE)
.and_then(|h| h.to_str().ok())
.map(|s| s.contains("madbase_admin_session")) // Only checks name exists!
.unwrap_or(false)
|| req.headers().contains_key("x-admin-token"); // Any value!
```
**Replacement approach:** Use HMAC-signed session tokens. The admin login endpoint should:
1. Verify the password (hashed with Argon2)
2. Generate a random session ID
3. Store it in Redis with a TTL (e.g., 24h)
4. Set an `HttpOnly`, `SameSite=Strict`, `Secure` cookie with the session ID
5. The middleware reads the cookie, looks up the session in Redis, rejects if missing/expired
**Implementation sketch:**
```rust
pub async fn admin_auth_middleware(
State(state): State<AdminAuthState>,
req: Request,
next: Next,
) -> Result<Response, StatusCode> {
let path = req.uri().path();
if path == "/dashboard" || path == "/platform/v1/login" {
return Ok(next.run(req).await);
}
if !path.starts_with("/platform/v1") {
return Ok(next.run(req).await);
}
// Extract session token from cookie
let session_token = req.headers()
.get(axum::http::header::COOKIE)
.and_then(|h| h.to_str().ok())
.and_then(|cookies| {
cookies.split(';')
.find_map(|c| {
let c = c.trim();
c.strip_prefix("madbase_admin_session=")
})
});
// Also check X-Admin-Token header
let token = session_token
.or_else(|| req.headers()
.get("x-admin-token")
.and_then(|v| v.to_str().ok()));
let token = token.ok_or(StatusCode::UNAUTHORIZED)?;
// Validate against session store
let valid = state.session_store.validate(token).await
.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
if !valid {
return Err(StatusCode::UNAUTHORIZED);
}
Ok(next.run(req).await)
}
```
**New struct needed:** `AdminAuthState` with a Redis-backed session store, or a shared `CacheLayer`.
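A minimal sketch of that state is below. It assumes the session store wraps the existing Redis-backed `CacheLayer`; the method names `set_with_ttl` and `exists` are illustrative, not existing API.
```rust
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical Redis-backed session store built on the shared CacheLayer.
#[derive(Clone)]
pub struct SessionStore {
    cache: Arc<CacheLayer>,
}

impl SessionStore {
    const ADMIN_SESSION_TTL: Duration = Duration::from_secs(60 * 60 * 24); // 24h

    pub async fn create(&self) -> anyhow::Result<String> {
        // Random, unguessable session ID stored server-side
        let session_id = uuid::Uuid::new_v4().to_string();
        let key = format!("admin_session:{}", session_id);
        self.cache.set_with_ttl(&key, "valid", Self::ADMIN_SESSION_TTL).await?;
        Ok(session_id)
    }

    pub async fn validate(&self, session_id: &str) -> anyhow::Result<bool> {
        let key = format!("admin_session:{}", session_id);
        self.cache.exists(&key).await
    }
}

#[derive(Clone)]
pub struct AdminAuthState {
    pub session_store: SessionStore,
}
```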
### 0.2.2 Hash admin password
**File:** `control_plane/src/lib.rs`, `login` function (line 330+)
Use Argon2 to hash on first startup and verify on login:
```rust
pub async fn login(
State(state): State<ControlPlaneState>,
Json(payload): Json<LoginRequest>,
) -> Result<(CookieJar, StatusCode), (StatusCode, String)> {
let admin_password_hash = std::env::var("ADMIN_PASSWORD_HASH")
.expect("ADMIN_PASSWORD_HASH must be set");
let valid = auth::utils::verify_password(&payload.password, &admin_password_hash)
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;
if !valid {
return Err((StatusCode::UNAUTHORIZED, "Invalid password".to_string()));
}
// Generate session...
}
```
Provide a CLI helper or startup script to generate the hash: `cargo run --bin hash_password -- "my-password"`.
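A sketch of that helper, assuming the RustCrypto `argon2` crate is the hashing dependency (the same scheme `auth::utils::verify_password` is expected to verify); place it at e.g. `control_plane/src/bin/hash_password.rs`:
```rust
// Hedged sketch: assumes the `argon2` crate with default features.
use argon2::password_hash::{rand_core::OsRng, PasswordHasher, SaltString};
use argon2::Argon2;

fn main() {
    let password = std::env::args()
        .nth(1)
        .expect("usage: hash_password <password>");
    let salt = SaltString::generate(&mut OsRng);
    let hash = Argon2::default()
        .hash_password(password.as_bytes(), &salt)
        .expect("failed to hash password")
        .to_string();
    // Prints a PHC-format string suitable for ADMIN_PASSWORD_HASH
    println!("{hash}");
}
```
Operators set the output as `ADMIN_PASSWORD_HASH`, so the plaintext password never has to live in the environment.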
### 0.2.3 Add auth to control-plane-api
**File:** `control-plane-api/src/lib.rs`
Add an API key middleware that reads `X-Api-Key` header and validates against `CONTROL_PLANE_API_KEY` env var. Apply to all routes except `/health`.
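A hedged sketch of that middleware, written in the same axum style as the admin middleware above; a constant-time comparison (e.g. the `subtle` crate) would be preferable in the real implementation:
```rust
use axum::{extract::Request, http::StatusCode, middleware::Next, response::Response};

pub async fn api_key_middleware(req: Request, next: Next) -> Result<Response, StatusCode> {
    // Health checks stay unauthenticated
    if req.uri().path() == "/health" {
        return Ok(next.run(req).await);
    }
    let expected = std::env::var("CONTROL_PLANE_API_KEY")
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    let provided = req
        .headers()
        .get("x-api-key")
        .and_then(|v| v.to_str().ok());
    match provided {
        Some(key) if key == expected => Ok(next.run(req).await),
        _ => Err(StatusCode::UNAUTHORIZED),
    }
}
```
Wire it with `.layer(axum::middleware::from_fn(api_key_middleware))` on the router, leaving `/health` outside the protected scope.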
### 0.2.4 Add auth to function deploy/invoke
**File:** `functions/src/handlers.rs`
The function routes are nested under the auth middleware in `gateway/src/worker.rs`, so JWT auth should already apply. Verify this is actually enforced — currently the `/functions/v1/:name` POST route may be accessible with just an anon key. Deploy should require `service_role` or admin; invoke should require at least `authenticated` (see the sketch below).
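If the check turns out to be missing, a small guard at the top of the deploy and invoke handlers is enough. This is a sketch against the `AuthContext.role` field used elsewhere in this plan:
```rust
use axum::http::StatusCode;

/// Hypothetical guard: deploy passes &["service_role"],
/// invoke passes &["authenticated", "service_role"].
fn require_role(auth_ctx: &AuthContext, allowed: &[&str]) -> Result<(), (StatusCode, String)> {
    if allowed.contains(&auth_ctx.role.as_str()) {
        Ok(())
    } else {
        Err((StatusCode::FORBIDDEN, format!("Role '{}' not permitted", auth_ctx.role)))
    }
}
```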
### 0.2.5 Add authorization to WebSocket subscriptions
**File:** `realtime/src/ws.rs` (or wherever WS join is handled)
On channel join, validate that the user's JWT role has SELECT permission on the requested table. Query `information_schema.role_table_grants` or attempt a `SELECT 1 FROM <table> LIMIT 0` within an RLS-scoped transaction.
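A sketch of the probe approach, reusing the `RlsTransaction` helper introduced in M1; the schema and table names must already be validated against `information_schema` (as in 0.3.2) before being interpolated:
```rust
/// Returns true if the subscriber's role may SELECT from the table under RLS.
/// `schema` and `table` are assumed to be pre-validated identifiers.
async fn can_subscribe(
    pool: &PgPool,
    auth_ctx: &AuthContext,
    schema: &str,
    table: &str,
) -> Result<bool, ApiError> {
    let mut rls = RlsTransaction::begin(pool, auth_ctx).await?;
    // LIMIT 0 exercises the privilege check without reading any rows
    let probe = format!("SELECT 1 FROM \"{}\".\"{}\" LIMIT 0", schema, table);
    let allowed = sqlx::query(&probe).execute(&mut *rls.tx).await.is_ok();
    // Read-only probe: let the transaction roll back on drop
    Ok(allowed)
}
```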
---
## 0.3 — Injection & Input Sanitization
### 0.3.1 Fix SQL injection in SET LOCAL role
**Files:** `data_api/src/handlers.rs` and `storage/src/handlers.rs` (appears ~15 times)
```rust
// BEFORE (appears in every handler)
let role_query = format!("SET LOCAL role = '{}'", auth_ctx.role);
sqlx::query(&role_query).execute(&mut *tx).await?;
// AFTER — validate against allowlist
const ALLOWED_ROLES: &[&str] = &["anon", "authenticated", "service_role"];
if !ALLOWED_ROLES.contains(&auth_ctx.role.as_str()) {
return Err((StatusCode::FORBIDDEN, "Invalid role".to_string()));
}
let role_query = format!("SET LOCAL role = '{}'", auth_ctx.role);
sqlx::query(&role_query).execute(&mut *tx).await?;
```
> **Note:** PostgreSQL doesn't support `$1` parameter binding for `SET LOCAL role`. The allowlist approach is the correct fix. This will be further cleaned up in M1 when we extract the RLS middleware.
### 0.3.2 Fix SQL injection in table browser
**File:** `control_plane/src/lib.rs`, `get_table_data` function (line 278)
```rust
// BEFORE
let query = format!("SELECT * FROM \"{}\".\"{}\" LIMIT 100", schema, table);
// AFTER — validate against information_schema
let exists: Option<(String,)> = sqlx::query_as(
"SELECT table_name FROM information_schema.tables WHERE table_schema = $1 AND table_name = $2"
)
.bind(&schema)
.bind(&table)
.fetch_optional(&state.tenant_db)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;
if exists.is_none() {
return Err((StatusCode::NOT_FOUND, "Table not found".to_string()));
}
let query = format!("SELECT * FROM \"{}\".\"{}\" LIMIT 100", schema, table);
```
### 0.3.3 Fix JavaScript injection in Deno runtime
**File:** `functions/src/deno_runtime.rs` lines 122-156
The current code does:
```rust
let module_code = format!(r#"
const req = new Request("http://localhost", {{
method: "POST",
body: {payload_json}, // <-- RAW interpolation!
headers: {headers_json} // <-- RAW interpolation!
}});
"#);
```
If `payload_json` contains backticks, template literal syntax, or escape sequences, it breaks out of the string context.
**Fix:** Pass data through a global variable set via the V8 API, not string interpolation:
```rust
// Before user code execution, set globalThis.__PAYLOAD__ and __HEADERS__
let setup_code = format!(
"globalThis.__PAYLOAD__ = JSON.parse({});globalThis.__HEADERS__ = JSON.parse({});",
serde_json::to_string(&payload_json)?,
serde_json::to_string(&headers_json)?
);
runtime.execute_script("<setup>", setup_code)?;
```
This double-serializes: the inner JSON becomes a string literal in JS, then `JSON.parse` deserializes it safely.
### 0.3.4 Fix path traversal in TUS uploads
**File:** `storage/src/tus.rs`, `get_upload_path` function (line 30)
```rust
// BEFORE
fn get_upload_path(id: &str) -> PathBuf {
let mut path = std::env::temp_dir();
path.push("madbase_tus");
path.push(id); // id could be "../../etc/passwd"
path
}
// AFTER
fn get_upload_path(id: &str) -> Result<PathBuf, &'static str> {
// Validate UUID format
Uuid::parse_str(id).map_err(|_| "Invalid upload ID")?;
let mut path = std::env::temp_dir();
path.push("madbase_tus");
path.push(id);
Ok(path)
}
```
Apply the same to `get_info_path`. Update all callers to propagate the error.
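For example, a caller can map the validation failure to a 400 before touching the filesystem:
```rust
let upload_path = get_upload_path(&id)
    .map_err(|msg| (StatusCode::BAD_REQUEST, msg.to_string()))?;
```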
---
## 0.4 — Token & Session Security
### 0.4.1 Gate token issuance on email confirmation
**File:** `auth/src/handlers.rs`, `signup` function (lines 88-103)
```rust
// AFTER email confirmation check
// If auto-confirm is disabled (the default), return user without tokens
let auto_confirm = std::env::var("AUTH_AUTO_CONFIRM")
.map(|v| v == "true")
.unwrap_or(false);
if auto_confirm {
// Set confirmed_at immediately
sqlx::query("UPDATE users SET confirmed_at = now(), email_confirmed_at = now() WHERE id = $1")
.bind(user.id)
.execute(&db)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;
// Issue tokens (existing logic)
let (token, expires_in, _) = generate_token(...)?;
let refresh_token = issue_refresh_token(...).await?;
return Ok(Json(AuthResponse { access_token: token, ... }));
}
// Not auto-confirmed: return user without tokens
Ok(Json(serde_json::json!({
"id": user.id,
"email": user.email,
"confirmation_sent_at": chrono::Utc::now(),
})))
```
### 0.4.2 Check confirmation status on login
**File:** `auth/src/handlers.rs`, `login` function, after password verification (line ~130)
```rust
// After verify_password succeeds:
if user.confirmed_at.is_none() && user.email_confirmed_at.is_none() {
return Err((StatusCode::FORBIDDEN, "Email not confirmed".to_string()));
}
```
### 0.4.3 Validate OAuth CSRF state
**File:** `auth/src/oauth.rs`, `authorize` and `callback`
In `authorize`, store the CSRF token in Redis:
```rust
let (auth_url, csrf_token) = auth_request.url();
// Store CSRF token with 10-minute TTL
if cache.redis.is_some() {
let key = format!("oauth_csrf:{}", csrf_token.secret());
cache.set(&key, &"valid").await.ok();
}
```
In `callback`, validate:
```rust
let csrf_key = format!("oauth_csrf:{}", query.state);
let valid = cache.exists(&csrf_key).await.unwrap_or(false);
if !valid {
return Err((StatusCode::BAD_REQUEST, "Invalid or expired state parameter".to_string()));
}
cache.delete(&csrf_key).await.ok();
```
### 0.4.4 Fix OAuth account takeover
**File:** `auth/src/oauth.rs` — callback, around line 234
Currently: if an existing user has the same email, the OAuth login returns that user's tokens.
**Fix:** Instead of implicit linking, create a new `identities` table. When an OAuth login matches an existing email:
- If the identity (provider + provider_id) is already linked, allow login
- If not linked, return an error: "An account with this email already exists. Log in with your password and link this provider from settings."
This is a larger change that can be partially deferred to M3 (identity linking), but the immediate fix is to **reject** the implicit match:
```rust
if existing_user.is_some() && !identity_linked {
return Err((StatusCode::CONFLICT,
"An account with this email already exists. Link this provider from your account settings.".to_string()));
}
```
---
## 0.5 — CORS & Transport Security
### 0.5.1 Restrict CORS origins
**Files:** `gateway/src/control.rs` line 104, `gateway/src/worker.rs` line 127
```rust
// BEFORE
CorsLayer::new()
.allow_origin(Any)
.allow_methods(Any)
.allow_headers(Any)
// AFTER
use tower_http::cors::AllowOrigin;
let allowed_origins = std::env::var("ALLOWED_ORIGINS")
.unwrap_or_else(|_| "http://localhost:3000,http://localhost:8000".to_string());
let origins: Vec<HeaderValue> = allowed_origins
.split(',')
.filter_map(|s| s.trim().parse().ok())
.collect();
CorsLayer::new()
.allow_origin(origins)
.allow_methods([Method::GET, Method::POST, Method::PUT, Method::DELETE, Method::PATCH, Method::OPTIONS])
.allow_headers([header::AUTHORIZATION, header::CONTENT_TYPE, HeaderName::from_static("apikey"), HeaderName::from_static("x-project-ref")])
.allow_credentials(true)
```
### 0.5.2 Stop exposing secrets in API responses
**File:** `control_plane/src/lib.rs`
In `list_projects` (line 61), create a `ProjectSummary` struct that omits `db_url`, `jwt_secret`, `anon_key`, `service_role_key`:
```rust
#[derive(Serialize, sqlx::FromRow)]
pub struct ProjectSummary {
pub id: Uuid,
pub name: String,
pub status: String,
pub created_at: Option<chrono::DateTime<chrono::Utc>>,
}
pub async fn list_projects(...) -> Result<Json<Vec<ProjectSummary>>, ...> {
let projects = sqlx::query_as::<_, ProjectSummary>(
"SELECT id, name, status, created_at FROM projects"
)...
}
```
Create a separate `GET /projects/:id/keys` endpoint that returns secrets only for the specifically requested project, requiring admin auth.
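A minimal sketch of that endpoint, assuming the admin-session middleware from 0.2.1 guards `/platform/v1/*` and that the `projects` table uses these column names:
```rust
#[derive(Serialize, sqlx::FromRow)]
pub struct ProjectKeys {
    pub anon_key: String,
    pub service_role_key: String,
    pub jwt_secret: String,
    pub db_url: String,
}

pub async fn get_project_keys(
    State(state): State<ControlPlaneState>,
    Path(id): Path<Uuid>,
) -> Result<Json<ProjectKeys>, (StatusCode, String)> {
    let keys = sqlx::query_as::<_, ProjectKeys>(
        "SELECT anon_key, service_role_key, jwt_secret, db_url FROM projects WHERE id = $1",
    )
    .bind(id)
    .fetch_optional(&state.db)
    .await
    .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?
    .ok_or((StatusCode::NOT_FOUND, "Project not found".to_string()))?;
    Ok(Json(keys))
}
```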
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every fix in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_jwt_secret_required` | `common/src/config.rs` | `Config::new()` panics when `JWT_SECRET` is unset |
| `test_jwt_secret_min_length` | `common/src/config.rs` | `Config::new()` panics when `JWT_SECRET` < 32 chars |
| `test_admin_password_required` | `control_plane/src/lib.rs` | Login panics when `ADMIN_PASSWORD` is unset |
| `test_s3_credentials_required` | `storage/src/backend.rs` | `AwsS3Backend::new()` panics when `S3_ACCESS_KEY` is unset |
| `test_admin_auth_rejects_forged_cookie` | `gateway/src/admin_auth.rs` | Middleware rejects `madbase_admin_session=anything` |
| `test_admin_auth_rejects_empty_token` | `gateway/src/admin_auth.rs` | Middleware rejects empty `X-Admin-Token` |
| `test_admin_auth_requires_valid_session` | `gateway/src/admin_auth.rs` | Middleware requires session ID in Redis |
| `test_role_allowlist` | `common/src/rls.rs` or `data_api/src/handlers.rs` | `SET LOCAL role` rejects roles not in `[anon, authenticated, service_role]` |
| `test_signup_no_tokens_without_confirm` | `auth/src/handlers.rs` | Signup with `AUTH_AUTO_CONFIRM=false` returns user without tokens |
| `test_login_rejects_unconfirmed` | `auth/src/handlers.rs` | Login returns 403 for `confirmed_at = NULL` |
| `test_oauth_csrf_validated` | `auth/src/oauth.rs` | Callback rejects mismatched/missing CSRF state |
| `test_tus_path_traversal_blocked` | `storage/src/tus.rs` | `get_upload_path("../../etc/passwd")` returns an error |
| `test_config_not_serializable` | `common/src/config.rs` | `Config` does not implement `Serialize` (compile-time; remove `Serialize` derive) |
| `test_cors_rejects_unknown_origin` | `gateway/src/worker.rs` or integration | Request from unlisted origin gets no `Access-Control-Allow-Origin` |
| `test_list_projects_hides_secrets` | `control_plane/src/lib.rs` | `list_projects` response does not contain `jwt_secret` or `db_url` |
### 2. Manual / Integration Verification
- [ ] `rg 'jwt_secret|db_url|password' --type rust -n` — audit every hit; no INFO/WARN-level secret logging remains
- [ ] Starting without `JWT_SECRET` env var panics with a clear message
- [ ] Starting without `ADMIN_PASSWORD` env var panics with a clear message
- [ ] Starting without `S3_ACCESS_KEY` panics with a clear message
- [ ] `curl -H "Cookie: madbase_admin_session=anything" http://localhost:8001/platform/v1/projects` returns 401
- [ ] `curl -H "X-Admin-Token: anything" http://localhost:8001/platform/v1/projects` returns 401
- [ ] `curl http://localhost:8001/platform/v1/projects` returns 401 (no credentials at all)
- [ ] SQL injection in `SET LOCAL role` is blocked by allowlist
- [ ] OAuth flow stores and validates CSRF state
- [ ] Signup without `AUTH_AUTO_CONFIRM=true` does not return access tokens
- [ ] Login with unconfirmed email returns 403
### 3. CI Gate
- [ ] All of the above tests are included in `cargo test --workspace`
- [ ] CI pipeline runs these tests (once M7 CI is in place, retroactively verify M0 tests are green in CI)

_milestones/M10_admin_ui.md Normal file

@@ -0,0 +1,226 @@
# Milestone 10: Admin UI
**Goal:** MadBase Studio is a functional admin dashboard for core operations.
**Depends on:** M0 (Security), M1 (Foundation), M3 (Auth), M9 (Control Plane)
---
## 10.1 — Authentication
### 10.1.1 Real login form
**File:** `web/js/admin.js`
Replace the current auth check (hitting `/platform/v1/projects` and checking for 401) with a proper login flow:
```javascript
async login() {
const resp = await fetch('/platform/v1/login', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ password: this.loginPassword }),
});
if (resp.ok) {
this.isAuthenticated = true;
this.loginError = '';
await this.loadDashboard();
} else {
this.loginError = 'Invalid password';
}
}
```
The server sets an `HttpOnly` session cookie on success (implemented in M0).
### 10.1.2 Add logout
```javascript
async logout() {
await fetch('/platform/v1/logout', { method: 'POST' });
this.isAuthenticated = false;
// Clear all reactive state
}
```
### 10.1.3 CSRF protection
Generate a CSRF token on page load, include in all mutation requests:
```javascript
// On page load
const csrfResp = await fetch('/platform/v1/csrf-token');
this.csrfToken = (await csrfResp.json()).token;
// On mutations
headers: { 'X-CSRF-Token': this.csrfToken }
```
---
## 10.2 — Security Fixes
### 10.2.1 Stop sending service role key to browser
**File:** `web/js/admin.js` line ~121
Remove `fetchAdminConfig()` and the `serviceRoleKey` reactive variable. All admin API calls should use session auth (the `HttpOnly` cookie), not the service role key.
Replace storage/data API calls that use the service role key with admin-proxied endpoints:
```javascript
// BEFORE
headers: { 'Authorization': `Bearer ${this.serviceRoleKey}` }
// AFTER — use session cookie (automatic with same-origin requests)
// No Authorization header needed for /platform/v1/* routes
```
### 10.2.2 Bundle CDN dependencies
Replace CDN script tags with locally bundled files. Options:
1. **Simple:** Download Vue, Chart.js, Tailwind to `web/vendor/` and serve statically
2. **Better:** Add a minimal build step with Vite that bundles everything into `web/dist/`
For air-gapped deployments, option 1 is essential.
### 10.2.3 Fix Tailwind @apply
**File:** `web/css/admin.css`
`@apply` directives don't work with CDN Tailwind JIT. Either:
1. Remove `@apply` and use inline Tailwind classes in the HTML
2. Or add a build step that processes the CSS with Tailwind CLI
---
## 10.3 — Missing Views
### 10.3.1 Auth management tab
Add a view showing:
- User list with search/filter
- User detail: email, created_at, confirmed_at, last_sign_in_at, providers
- Actions: ban/unban, confirm email, delete user, reset password
### 10.3.2 Realtime console
Add a view showing:
- Active WebSocket connections count
- Active channel subscriptions
- Live event stream (filterable by table/event type)
- Presence information per channel
### 10.3.3 Object deletion in Storage view
Add a delete button next to each object in the storage file browser:
```javascript
async deleteObject(bucketId, objectName) {
if (!confirm(`Delete ${objectName}?`)) return;
await fetch(`/platform/v1/storage/${bucketId}/${objectName}`, { method: 'DELETE' });
await this.fetchObjects(bucketId);
}
```
---
## 10.4 — Usability
### 10.4.1 Configurable Grafana URL
**File:** `web/admin.html` — Grafana iframe (line ~414)
```html
<!-- BEFORE -->
<iframe src="http://localhost:3000" ...></iframe>
<!-- AFTER -->
<iframe :src="grafanaUrl" ...></iframe>
```
```javascript
// In admin.js data
grafanaUrl: window.MADBASE_GRAFANA_URL || '/grafana',
```
Set via env var or server-rendered config.
### 10.4.2 Confirmation dialogs
Add `confirm()` before all destructive operations:
- Delete project
- Delete user
- Delete storage object
- Remove server
### 10.4.3 Error handling
Add global error display:
```javascript
methods: {
async apiCall(url, options) {
try {
const resp = await fetch(url, options);
if (!resp.ok) {
const err = await resp.json();
this.showError(err.error || 'Request failed');
return null;
}
return resp;
} catch (e) {
this.showError(e.message);
return null;
}
},
showError(msg) {
this.errorMessage = msg;
setTimeout(() => this.errorMessage = '', 5000);
}
}
```
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures** (backend unchanged, but verify no regressions)
- [ ] All **pre-existing tests** still pass
- [ ] **New end-to-end / browser tests** cover the admin UI:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_login_success` | `tests/e2e/admin_ui.rs` or Playwright | Correct password → dashboard loads, session cookie set |
| `test_login_failure` | `tests/e2e/admin_ui.rs` or Playwright | Wrong password → error message shown, no cookie |
| `test_logout` | `tests/e2e/admin_ui.rs` or Playwright | Logout → redirected to login, session cookie cleared |
| `test_no_service_key_in_client` | `tests/e2e/admin_ui.rs` or Playwright | Service role key absent from page source and network requests |
| `test_auth_user_list` | `tests/e2e/admin_ui.rs` or Playwright | Auth tab renders user list from API |
| `test_auth_user_search` | `tests/e2e/admin_ui.rs` or Playwright | Typing in search filters the user list |
| `test_storage_delete_object` | `tests/e2e/admin_ui.rs` or Playwright | Delete button removes object; confirm dialog appears first |
| `test_grafana_iframe_configurable` | `tests/e2e/admin_ui.rs` or Playwright | Iframe `src` matches configured `MADBASE_GRAFANA_URL` |
| `test_delete_project_confirmation` | `tests/e2e/admin_ui.rs` or Playwright | Delete project requires confirmation dialog |
| `test_no_cdn_dependencies` | `web/admin.html` (static analysis) | No `<script>` or `<link>` tags referencing external CDN URLs |
| `test_csrf_token_present` | `tests/e2e/admin_ui.rs` or Playwright | Mutating requests include a CSRF token |
### 2. Manual / Visual Verification
- [ ] Login with correct password → dashboard loads
- [ ] Login with wrong password → error message shown
- [ ] Logout → redirected to login, session cookie cleared
- [ ] Service role key never appears in browser DevTools (Network, Application tabs)
- [ ] Auth tab shows user list with working search
- [ ] Storage tab allows deleting objects
- [ ] Grafana iframe loads from configured URL
- [ ] Delete project shows confirmation dialog
- [ ] Works in air-gapped environment (no CDN dependencies)
- [ ] Responsive layout works on 1024px and 1440px viewports
- [ ] Error toast appears on API failures (e.g., network down)
### 3. CI Gate
- [ ] `cargo test --workspace` green (backend)
- [ ] E2E tests (Playwright or equivalent) run in CI against a `docker compose up` stack
- [ ] Static analysis confirms no external CDN references in `web/` HTML/JS files
- [ ] All destructive API calls in the UI are confirmed via a dialog (code review checklist)


@@ -0,0 +1,493 @@
# Milestone 1: Foundation — Make It Compile and Run Correctly
**Goal:** A developer can `docker compose up`, hit the API with supabase-js, and get correct behavior for basic flows.
**Depends on:** M0 (Security Hardening)
---
## 1.1 — Fix Critical Bugs
### 1.1.1 Fix proxy body forwarding
**File:** `gateway/src/proxy.rs`, `forward_request` function (line ~172)
The proxy builds a `reqwest` request with `.headers()` but never reads or forwards the request body. Every POST/PUT/PATCH through the proxy silently drops the body.
**Current code (broken):**
```rust
let request_builder = client
.request(req.method().clone(), &target_url)
.headers(req.headers().clone());
// Body is never set!
```
**Fix:** Read the body from the incoming axum `Request` and attach it to the outgoing `reqwest` request:
```rust
// Extract body before consuming the request
let (parts, body) = req.into_parts();
let body_bytes = axum::body::to_bytes(body, 1024 * 1024 * 100) // 100MB limit
.await
.map_err(|_| StatusCode::BAD_REQUEST)?;
let request_builder = client
.request(parts.method.clone(), &target_url)
.headers(parts.headers.clone())
.body(body_bytes);
```
For streaming (large uploads), use `reqwest::Body::wrap_stream()` instead of buffering.
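A hedged sketch of the streaming variant, assuming axum 0.7 (`Body::into_data_stream`) and reqwest built with its `stream` feature:
```rust
// Streaming variant: forward the body without buffering it in the gateway
let (parts, body) = req.into_parts();
let request_builder = client
    .request(parts.method.clone(), &target_url)
    .headers(parts.headers.clone())
    .body(reqwest::Body::wrap_stream(body.into_data_stream()));
```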
### 1.1.2 Fix proxy round-robin
**File:** `gateway/src/proxy.rs`, `proxy_request` function (line ~147)
**Current broken logic:** `get_healthy_worker()` always returns the FIRST healthy worker. Round-robin (`get_next_worker()`) is only used as a fallback when NO workers are healthy.
**Fix:** Merge the two methods — round-robin among healthy workers:
```rust
async fn get_next_healthy_worker(&self) -> Option<Upstream> {
let upstreams = self.worker_upstreams.read().await;
let len = upstreams.len();
if len == 0 { return None; }
let mut index = self.current_worker_index.write().await;
for _ in 0..len {
let candidate = &upstreams[*index % len];
*index = (*index + 1) % len;
if *candidate.healthy.read().await {
return Some(candidate.clone());
}
}
// All unhealthy — return next in rotation anyway
let fallback = upstreams[*index % len].clone();
*index = (*index + 1) % len;
Some(fallback)
}
```
### 1.1.3 Fix proxy response streaming
**File:** `gateway/src/proxy.rs`, `forward_request` function (line ~200)
```rust
// BEFORE — loads entire response into memory
let body_bytes = response.bytes().await.map_err(|e| { ... })?;
response_builder.body(Body::from(body_bytes.to_vec()))
// AFTER — stream the response
let stream = response.bytes_stream();
let body = Body::from_stream(stream);
response_builder.body(body)
```
This prevents OOM on large file downloads through the proxy.
### 1.1.4 Pool HTTP clients
**Files:** `gateway/src/proxy.rs`, `gateway/src/control.rs`
Create `reqwest::Client` once at startup and store it in state:
```rust
// In ProxyState::new()
let http_client = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(30))
.pool_max_idle_per_host(20)
.build()
.unwrap();
```
Store in `ProxyState { http_client, ... }`. Pass to `forward_request`. Same for health check loop — use the shared client instead of creating one per iteration.
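For reference, the shared-state shape could look like this; field names beyond `http_client` follow the sketches above and are otherwise assumptions:
```rust
use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Clone)]
pub struct ProxyState {
    /// One pooled client shared by all requests and the health-check loop
    pub http_client: reqwest::Client,
    pub worker_upstreams: Arc<RwLock<Vec<Upstream>>>,
    pub current_worker_index: Arc<RwLock<usize>>,
    pub control_upstream_url: String,
}
```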
In `gateway/src/control.rs`, `logs_proxy_handler` (line 23): create the client in `ControlState` and pass it via `State` instead of calling `reqwest::Client::new()` per request.
### 1.1.5 Fix tracing in standalone binaries
**Files:** `gateway/src/bin/proxy.rs`, `bin/control.rs`, `bin/worker.rs`
All three have the same bug — `_rust_log` is unused:
```rust
// BEFORE
let _rust_log = std::env::var("RUST_LOG").unwrap_or_else(|_| "info".into());
tracing_subscriber::fmt::init();
// AFTER
tracing_subscriber::fmt()
.with_env_filter(
tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info"))
)
.init();
```
Also note `bin/worker.rs` has a typo: `RUST_log` instead of `RUST_LOG`.
---
## 1.2 — Dev Stack That Actually Works
### 1.2.1 Updated docker-compose.yml
Add Redis, MinIO, health checks, and proper startup ordering:
```yaml
services:
db:
image: postgres:15-alpine
container_name: madbase_dev_db
environment:
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
ports:
- "5432:5432"
volumes:
- dev_db_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 3s
retries: 10
redis:
image: redis:7-alpine
container_name: madbase_dev_redis
command: redis-server --appendonly yes
ports:
- "6379:6379"
volumes:
- dev_redis_data:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
minio:
image: quay.io/minio/minio:RELEASE.2024-06-13T22-53-53Z
container_name: madbase_dev_minio
command: server /data --console-address ":9001"
ports:
- "9000:9000"
- "9001:9001"
environment:
MINIO_ROOT_USER: ${S3_ACCESS_KEY:-minioadmin}
MINIO_ROOT_PASSWORD: ${S3_SECRET_KEY:-minioadmin}
volumes:
- dev_minio_data:/data
healthcheck:
test: ["CMD", "mc", "ready", "local"]
interval: 5s
timeout: 3s
retries: 5
worker:
build:
context: .
target: worker-runtime
container_name: madbase_dev_worker
ports:
- "8002:8002"
environment:
DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD:-postgres}@db:5432/postgres
DEFAULT_TENANT_DB_URL: postgres://postgres:${POSTGRES_PASSWORD:-postgres}@db:5432/postgres
JWT_SECRET: ${JWT_SECRET}
REDIS_URL: redis://redis:6379
S3_ENDPOINT: http://minio:9000
S3_ACCESS_KEY: ${S3_ACCESS_KEY:-minioadmin}
S3_SECRET_KEY: ${S3_SECRET_KEY:-minioadmin}
S3_BUCKET: madbase
S3_REGION: us-east-1
RUST_LOG: info
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
minio:
condition: service_healthy
system:
build:
context: .
target: control-runtime
container_name: madbase_dev_system
ports:
- "8001:8001"
environment:
DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD:-postgres}@db:5432/postgres
DEFAULT_TENANT_DB_URL: postgres://postgres:${POSTGRES_PASSWORD:-postgres}@db:5432/postgres
JWT_SECRET: ${JWT_SECRET}
ADMIN_PASSWORD: ${ADMIN_PASSWORD}
RUST_LOG: info
depends_on:
db:
condition: service_healthy
proxy:
build:
context: .
target: proxy-runtime
container_name: madbase_dev_proxy
ports:
- "8000:8000"
environment:
CONTROL_UPSTREAM_URL: http://system:8001
WORKER_UPSTREAM_URLS: http://worker:8002
RUST_LOG: info
depends_on:
- system
- worker
volumes:
dev_db_data:
dev_redis_data:
dev_minio_data:
```
### 1.2.2 Create .env.example
```env
# Required
JWT_SECRET=generate-with-openssl-rand-hex-32
ADMIN_PASSWORD=change-me-in-production
DATABASE_URL=postgres://postgres:postgres@localhost:5432/postgres
DEFAULT_TENANT_DB_URL=postgres://postgres:postgres@localhost:5432/postgres
# Storage (MinIO for dev, Hetzner/AWS for production)
S3_ENDPOINT=http://localhost:9000
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_BUCKET=madbase
S3_REGION=us-east-1
# Optional
REDIS_URL=redis://localhost:6379
RUST_LOG=info
ALLOWED_ORIGINS=http://localhost:3000,http://localhost:8000
```
### 1.2.3 Create missing config files
Create `config/prometheus.yml`:
```yaml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'madbase-worker'
static_configs:
- targets: ['worker:8002']
metrics_path: /metrics
- job_name: 'madbase-control'
static_configs:
- targets: ['control:8001']
metrics_path: /metrics
- job_name: 'madbase-proxy'
static_configs:
- targets: ['proxy:8000']
metrics_path: /metrics
```
Create `config/vmagent.yml` with the same content.
### 1.2.4 Fix Grafana port
**File:** `docker-compose.pillar-system.yml` line 33
```yaml
# BEFORE
ports:
- "3030:3030"
# AFTER — Grafana listens on 3000 by default
ports:
- "3030:3000"
```
Or add `GF_SERVER_HTTP_PORT=3030` to the environment.
---
## 1.3 — Unified Error Handling
### 1.3.1 Create ApiError type
**File:** Create `common/src/error.rs`
```rust
use axum::http::StatusCode;
use axum::response::{IntoResponse, Response, Json};
use serde::Serialize;
#[derive(Debug)]
pub enum ApiError {
BadRequest(String),
Unauthorized(String),
Forbidden(String),
NotFound(String),
Conflict(String),
Internal(String),
Database(sqlx::Error),
}
#[derive(Serialize)]
struct ErrorResponse {
error: String,
code: u16,
#[serde(skip_serializing_if = "Option::is_none")]
detail: Option<String>,
}
impl IntoResponse for ApiError {
fn into_response(self) -> Response {
let (status, message, detail) = match &self {
ApiError::BadRequest(msg) => (StatusCode::BAD_REQUEST, msg.clone(), None),
ApiError::Unauthorized(msg) => (StatusCode::UNAUTHORIZED, msg.clone(), None),
ApiError::Forbidden(msg) => (StatusCode::FORBIDDEN, msg.clone(), None),
ApiError::NotFound(msg) => (StatusCode::NOT_FOUND, msg.clone(), None),
ApiError::Conflict(msg) => (StatusCode::CONFLICT, msg.clone(), None),
ApiError::Internal(msg) => {
tracing::error!("Internal error: {}", msg);
(StatusCode::INTERNAL_SERVER_ERROR, "Internal server error".to_string(), None)
}
ApiError::Database(e) => {
tracing::error!("Database error: {}", e);
(StatusCode::INTERNAL_SERVER_ERROR, "Database error".to_string(), None)
}
};
let body = ErrorResponse {
error: message,
code: status.as_u16(),
detail,
};
(status, Json(body)).into_response()
}
}
impl From<sqlx::Error> for ApiError {
fn from(e: sqlx::Error) -> Self {
ApiError::Database(e)
}
}
```
Gradually replace `(StatusCode, String)` return types with `Result<T, ApiError>` across all handlers.
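For example, a migrated handler collapses its `map_err` chains into `?`; the handler, state, and table names here are illustrative:
```rust
pub async fn get_user(
    State(state): State<AuthState>,
    Path(id): Path<Uuid>,
) -> Result<Json<User>, ApiError> {
    let user = sqlx::query_as::<_, User>("SELECT * FROM users WHERE id = $1")
        .bind(id)
        .fetch_optional(&state.db)
        .await? // sqlx::Error converts into ApiError::Database via From
        .ok_or_else(|| ApiError::NotFound("User not found".into()))?;
    Ok(Json(user))
}
```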
---
## 1.4 — Extract RLS Middleware
### 1.4.1 Create RLS transaction extractor
The `BEGIN tx → SET LOCAL role → set_config` block is repeated ~15 times. Create an extractor:
**File:** Create `common/src/rls.rs`
```rust
use crate::error::ApiError;
use auth::AuthContext;
use sqlx::{PgPool, Postgres, Transaction};
pub struct RlsTransaction {
pub tx: Transaction<'static, Postgres>,
}
impl RlsTransaction {
pub async fn begin(
pool: &PgPool,
auth_ctx: &AuthContext,
) -> Result<Self, ApiError> {
let mut tx = pool.begin().await?;
// Validate and set role
const ALLOWED_ROLES: &[&str] = &["anon", "authenticated", "service_role"];
if !ALLOWED_ROLES.contains(&auth_ctx.role.as_str()) {
return Err(ApiError::Forbidden("Invalid role".into()));
}
let role_query = format!("SET LOCAL role = '{}'", auth_ctx.role);
sqlx::query(&role_query).execute(&mut *tx).await?;
// Set JWT claims for RLS policies
if let Some(claims) = &auth_ctx.claims {
sqlx::query("SELECT set_config('request.jwt.claim.sub', $1, true)")
.bind(&claims.sub)
.execute(&mut *tx)
.await?;
}
Ok(Self { tx })
}
pub async fn commit(self) -> Result<(), ApiError> {
self.tx.commit().await.map_err(ApiError::from)
}
}
```
**Usage in handlers:**
```rust
pub async fn list_buckets(
State(state): State<StorageState>,
Extension(auth_ctx): Extension<AuthContext>,
db: Option<Extension<PgPool>>,
) -> Result<Json<Vec<Bucket>>, ApiError> {
let pool = db.map(|Extension(p)| p).unwrap_or_else(|| state.db.clone());
let mut rls = RlsTransaction::begin(&pool, &auth_ctx).await?;
let buckets = sqlx::query_as::<_, Bucket>("SELECT * FROM storage.buckets")
.fetch_all(&mut *rls.tx)
.await?;
Ok(Json(buckets))
// tx auto-rolls back on drop (read-only is fine)
}
```
This eliminates ~150 lines of duplicated error-mapping boilerplate.
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every fix in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_proxy_forwards_body` | `gateway/src/proxy.rs` | POST with 1MB body reaches the upstream intact |
| `test_proxy_streams_response` | `gateway/src/proxy.rs` | Large response is streamed, not buffered entirely |
| `test_proxy_round_robin` | `gateway/src/proxy.rs` | 4 requests to 2 workers distribute 2+2 |
| `test_proxy_single_http_client` | `gateway/src/proxy.rs` | `reqwest::Client` is reused (shared state, not per-request) |
| `test_worker_tracing_init` | `gateway/src/bin/worker.rs` | `RUST_LOG=debug` produces debug-level spans |
| `test_api_error_json_format` | `common/src/error.rs` | `ApiError::BadRequest("x")` serializes to `{"error":"x","code":400}` |
| `test_api_error_hides_db_detail` | `common/src/error.rs` | `ApiError::Database(e)` does not leak SQL in the response body |
| `test_rls_transaction_sets_role` | `common/src/rls.rs` | `RlsTransaction::begin()` issues `SET LOCAL role` with the auth context role |
| `test_rls_transaction_rejects_bad_role` | `common/src/rls.rs` | Role outside `[anon, authenticated, service_role]` returns `Forbidden` |
| `test_rls_transaction_sets_claims` | `common/src/rls.rs` | JWT `sub` claim is available via `current_setting('request.jwt.claim.sub')` |
### 2. Integration Verification
- [ ] `docker compose up` starts all services (db, redis, minio, worker, system, proxy) without crash-loops
- [ ] `curl -X POST http://localhost:8000/auth/v1/signup -H "apikey: <anon_key>" -d '{"email":"test@test.com","password":"password123"}'` returns a user (through the proxy)
- [ ] Large file upload (>5MB) through the proxy succeeds (body forwarding works)
- [ ] Proxy distributes requests across multiple workers (if configured)
- [ ] `RUST_LOG=debug` works in all three standalone binaries
- [ ] API errors return structured JSON, never raw SQL error messages
- [ ] `docker compose down && docker compose up` — idempotent restart with no data loss
### 3. CI Gate
- [ ] All of the above unit tests are included in `cargo test --workspace`
- [ ] No `#[ignore]` on any test added in this milestone unless it requires external services (and those must be documented)


@@ -0,0 +1,517 @@
# Milestone 2: Storage Pillar
**Goal:** Storage becomes a first-class pillar supporting self-hosted MinIO or cloud S3 (Hetzner Object Storage, AWS S3, Backblaze B2). Complete the supabase-js `storage` API surface.
**Depends on:** M1 (Foundation)
---
## 2.1 — Storage Pillar Compose & Configuration
### 2.1.1 Create docker-compose.pillar-storage.yml
This compose file is used only for **self-hosted mode**. In cloud mode, workers connect directly to the external S3 endpoint and this compose file is not needed.
```yaml
# MadBase - Pillar: Storage (Self-Hosted)
# S3-compatible object storage via MinIO
services:
minio:
image: quay.io/minio/minio:RELEASE.2024-06-13T22-53-53Z
container_name: madbase_minio
command: server /data --console-address ":9001"
ports:
- "9000:9000"
- "9001:9001"
environment:
MINIO_ROOT_USER: ${S3_ACCESS_KEY}
MINIO_ROOT_PASSWORD: ${S3_SECRET_KEY}
MINIO_BROWSER_REDIRECT_URL: http://localhost:9001
volumes:
- minio_data:/data
healthcheck:
test: ["CMD", "mc", "ready", "local"]
interval: 10s
timeout: 5s
retries: 5
restart: unless-stopped
volumes:
minio_data:
networks:
default:
name: madbase
external: true
```
### 2.1.2 Add STORAGE_MODE env var
**File:** `common/src/config.rs`
Add to `Config`:
```rust
pub storage_mode: StorageMode,
pub s3_endpoint: String,
pub s3_access_key: String,
pub s3_secret_key: String,
pub s3_bucket: String,
pub s3_region: String,
```
```rust
#[derive(Clone, Debug)]
pub enum StorageMode {
Cloud, // External S3 (Hetzner, AWS, B2)
SelfHosted, // MinIO
}
```
Load from env:
```rust
let storage_mode = match env::var("STORAGE_MODE").unwrap_or_else(|_| "self-hosted".into()).as_str() {
"cloud" | "s3" => StorageMode::Cloud,
_ => StorageMode::SelfHosted,
};
```
### 2.1.3 Create storage-node.yaml template
**File:** `templates/storage-node.yaml`
```yaml
id: storage-node
name: Dedicated Storage Node
description: MinIO object storage for self-hosted deployments
version: 1.0
min_hetzner_plan: CX21
estimated_monthly_cost: 6.94
services:
- id: minio
name: MinIO
image: quay.io/minio/minio:RELEASE.2024-06-13T22-53-53Z
ports: ["9000:9000", "9001:9001"]
command: ["server", "/data", "--console-address", ":9001"]
volumes:
- minio_data:/data
resource_profile: storage_intensive
requirements:
min_nodes: 1
max_nodes: 4
supports_ha: true
recommended_deployment: "Dedicated node with attached block storage"
notes: |
For HA, use distributed MinIO with 4+ nodes and erasure coding.
For cloud deployments, skip this node — use Hetzner Object Storage.
Estimated storage: 1TB on CX21 block storage = ~€6/mo additional.
```
### 2.1.4 Add shared Docker network
Add to each `docker-compose.pillar-*.yml`:
```yaml
networks:
default:
name: madbase
external: true
```
Create the network before first use: `docker network create madbase`
---
## 2.2 — Storage Backend Improvements
### 2.2.1 Route handlers through StorageBackend trait
**Current problem:** `StorageState` holds a raw `aws_sdk_s3::Client` and handlers call `state.s3_client.put_object()` directly, bypassing the `StorageBackend` trait entirely. The trait exists but is unused.
**Fix:**
1. Expand the `StorageBackend` trait:
```rust
#[async_trait]
pub trait StorageBackend: Send + Sync {
async fn put_object(&self, bucket: &str, key: &str, data: Bytes, content_type: Option<&str>) -> Result<()>;
async fn get_object(&self, bucket: &str, key: &str) -> Result<GetObjectResponse>;
async fn delete_object(&self, bucket: &str, key: &str) -> Result<()>;
async fn copy_object(&self, bucket: &str, src_key: &str, dst_key: &str) -> Result<()>;
async fn create_bucket(&self, bucket: &str) -> Result<()>;
async fn delete_bucket(&self, bucket: &str) -> Result<()>;
async fn head_object(&self, bucket: &str, key: &str) -> Result<ObjectMetadata>;
async fn head_bucket(&self, bucket: &str) -> Result<()>;
async fn list_objects(&self, bucket: &str, prefix: &str) -> Result<Vec<ObjectMetadata>>;
}
pub struct GetObjectResponse {
pub body: Pin<Box<dyn Stream<Item = Result<Bytes>> + Send>>,
pub content_type: Option<String>,
pub content_length: Option<i64>,
}
```
2. Change `StorageState`:
```rust
#[derive(Clone)]
pub struct StorageState {
pub db: PgPool,
pub backend: Arc<dyn StorageBackend>,
pub config: Config,
pub bucket_name: String,
}
```
3. Update `storage/src/lib.rs` init:
```rust
pub async fn init(db: PgPool, config: Config) -> Router {
let backend: Arc<dyn StorageBackend> = Arc::new(
AwsS3Backend::new(&config).await.expect("Failed to init storage backend")
);
let bucket_name = config.s3_bucket.clone();
backend.create_bucket(&bucket_name).await.ok();
let state = StorageState { db, backend, config, bucket_name };
// ...routes...
}
```
### 2.2.2 Add streaming to StorageBackend
Replace `get_object() -> Bytes` with a streaming response. The AWS SDK already supports this:
```rust
async fn get_object(&self, _bucket: &str, key: &str) -> Result<GetObjectResponse> {
let resp = self.client.get_object()
.bucket(&self.bucket_name)
.key(key)
.send()
.await?;
let stream = resp.body.into_async_read();
let byte_stream = tokio_util::io::ReaderStream::new(stream);
let mapped = byte_stream.map(|r| r.map_err(|e| anyhow::anyhow!(e)));
Ok(GetObjectResponse {
body: Box::pin(mapped),
content_type: resp.content_type.map(|s| s.to_string()),
content_length: resp.content_length,
})
}
```
In the handler, convert to axum Body:
```rust
let resp = state.backend.get_object(&state.bucket_name, &key).await?;
let body = Body::from_stream(resp.body);
Ok((headers, body))
```
### 2.2.3 Add missing HTTP endpoints
**Delete object:** `DELETE /storage/v1/object/:bucket_id/*filename`
```rust
pub async fn delete_object(
State(state): State<StorageState>,
Extension(auth_ctx): Extension<AuthContext>,
Extension(project_ctx): Extension<ProjectContext>,
Path((bucket_id, filename)): Path<(String, String)>,
db: Option<Extension<PgPool>>,
) -> Result<StatusCode, ApiError> {
let pool = db.map(|Extension(p)| p).unwrap_or(state.db.clone());
let mut rls = RlsTransaction::begin(&pool, &auth_ctx).await?;
// Verify object exists under RLS
let exists = sqlx::query_scalar::<_, Uuid>(
"SELECT id FROM storage.objects WHERE bucket_id = $1 AND name = $2"
)
.bind(&bucket_id).bind(&filename)
.fetch_optional(&mut *rls.tx).await?;
if exists.is_none() {
return Err(ApiError::NotFound("Object not found".into()));
}
// Delete from S3
let key = format!("{}/{}/{}", project_ctx.project_ref, bucket_id, filename);
state.backend.delete_object(&state.bucket_name, &key).await
.map_err(|e| ApiError::Internal(e.to_string()))?;
// Delete from DB
sqlx::query("DELETE FROM storage.objects WHERE bucket_id = $1 AND name = $2")
.bind(&bucket_id).bind(&filename)
.execute(&mut *rls.tx).await?;
rls.commit().await?;
Ok(StatusCode::NO_CONTENT)
}
```
**Delete bucket:** `DELETE /storage/v1/bucket/:bucket_id`
**Copy object:** `POST /storage/v1/object/copy` with `{ "sourceKey": "bucket/path", "destinationKey": "bucket/path" }` (sketch below)
**Move object:** `POST /storage/v1/object/move` (copy + delete source)
**Public URL:** `GET /storage/v1/object/public/:bucket_id/*filename` — check `storage.buckets.public = true`, return redirect to S3 presigned URL or stream directly.
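A minimal sketch of the copy route described above (move is the same plus a delete of the source); the key layout mirrors `delete_object`, and the `storage.objects` bookkeeping is elided:
```rust
#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct CopyRequest {
    pub source_key: String,      // "bucket/path/in/bucket"
    pub destination_key: String,
}

pub async fn copy_object(
    State(state): State<StorageState>,
    Extension(project_ctx): Extension<ProjectContext>,
    Json(payload): Json<CopyRequest>,
) -> Result<StatusCode, ApiError> {
    // Prefix with the project ref, matching the key layout used by delete_object
    let src = format!("{}/{}", project_ctx.project_ref, payload.source_key);
    let dst = format!("{}/{}", project_ctx.project_ref, payload.destination_key);
    state
        .backend
        .copy_object(&state.bucket_name, &src, &dst)
        .await
        .map_err(|e| ApiError::Internal(e.to_string()))?;
    // Also duplicate the corresponding row in storage.objects under RLS
    Ok(StatusCode::OK)
}
```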
### 2.2.4 Add bucket constraints
**Migration:** Add columns to `storage.buckets`:
```sql
ALTER TABLE storage.buckets
ADD COLUMN IF NOT EXISTS file_size_limit BIGINT,
ADD COLUMN IF NOT EXISTS allowed_mime_types TEXT[];
```
**Validation in upload handler:**
```rust
// After fetching bucket info
if let Some(limit) = bucket.file_size_limit {
if data.len() as i64 > limit {
return Err(ApiError::BadRequest(format!(
"File size {} exceeds bucket limit {}", data.len(), limit
)));
}
}
if let Some(allowed) = &bucket.allowed_mime_types {
if !allowed.is_empty() && !allowed.contains(&content_type.to_string()) {
return Err(ApiError::BadRequest(format!(
"MIME type {} not allowed in this bucket", content_type
)));
}
}
```
### 2.2.5 Fix TUS completion — use S3 multipart upload
**File:** `storage/src/tus.rs`, `tus_patch_upload` completion block (line ~252)
**Current problem:** `fs::read(&upload_path)` loads the entire completed file into memory.
**Fix:** Use S3 multipart upload. On TUS create, start a multipart upload. On each PATCH, upload that chunk as a part. On completion, finalize the multipart upload.
Store the multipart upload ID in the `.info` file:
```json
{
"upload_length": 104857600,
"bucket_id": "avatars",
"filename": "photo.jpg",
"s3_upload_id": "abc123...",
"parts": [
{ "part_number": 1, "etag": "\"abc\"", "size": 5242880 },
{ "part_number": 2, "etag": "\"def\"", "size": 5242880 }
]
}
```
On PATCH:
```rust
let part_number = (current_offset / PART_SIZE) as i32 + 1;
let upload_part = state.backend.client()
.upload_part()
.bucket(&state.bucket_name)
.key(&key)
.upload_id(&s3_upload_id)
.part_number(part_number)
.body(ByteStream::from(data))
.send()
.await?;
// Store etag in info file
```
On completion:
```rust
state.backend.client()
.complete_multipart_upload()
.bucket(&state.bucket_name)
.key(&key)
.upload_id(&s3_upload_id)
.multipart_upload(completed_parts)
.send()
.await?;
// Clean up local temp files
```
> **Note:** S3 multipart parts must be at least 5MB (except the last part). Buffer PATCH data until 5MB before uploading a part.
---
## 2.3 — Storage Health & Observability
### 2.3.1 Health check endpoint
Add to `storage/src/lib.rs` router:
```rust
.route("/health", get(health_check))
```
```rust
async fn health_check(State(state): State<StorageState>) -> Result<&'static str, StatusCode> {
state.backend.head_bucket(&state.bucket_name)
.await
.map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;
Ok("OK")
}
```
### 2.3.2 Structured logging
Replace all `tracing::info!("File size: {} bytes", size)` with structured fields:
```rust
tracing::info!(
bucket = %bucket_id,
filename = %filename,
size_bytes = size,
"Upload completed"
);
```
### 2.3.3 Image transforms — run async
**File:** `storage/src/handlers.rs`, `transform_image` function (line 328)
Currently runs synchronously, blocking the async runtime. Use `tokio::task::spawn_blocking`:
```rust
if width.is_some() || height.is_some() || format.is_some() {
let body_clone = body_bytes.clone();
match tokio::task::spawn_blocking(move || {
transform_image(body_clone, width, height, quality, format)
}).await {
Ok(Ok((new_bytes, new_ct))) => { ... },
Ok(Err(e)) => { tracing::warn!(error = %e, "Image transform failed"); },
Err(e) => { tracing::warn!(error = %e, "Image transform panicked"); },
}
}
```
---
## 2.4 — MinIO HA (Optional)
### 2.4.1 Distributed MinIO documentation
For self-hosted production with HA, document the distributed mode setup:
```yaml
# docker-compose.pillar-storage-ha.yml
services:
minio1:
image: quay.io/minio/minio:RELEASE.2024-06-13T22-53-53Z
command: server http://minio{1...4}/data --console-address ":9001"
# ... same for minio2, minio3, minio4
```
Requires 4 nodes minimum for erasure coding. Each node needs its own block storage volume.
### 2.4.2 Lifecycle rules
Configure via MinIO client:
```bash
mc ilm rule add madbase/madbase \
--expire-delete-marker \
--noncurrent-expire-days 30 \
--prefix "tus-temp/"
```
This auto-cleans incomplete TUS uploads after 30 days.
---
## Route Summary (after M2)
| Method | Path | Handler | supabase-js method |
|--------|------|---------|-------------------|
| GET | `/storage/v1/bucket` | `list_buckets` | `listBuckets()` |
| POST | `/storage/v1/bucket` | `create_bucket` | `createBucket()` |
| DELETE | `/storage/v1/bucket/:id` | `delete_bucket` | `deleteBucket()` |
| POST | `/storage/v1/object/list/:bucket_id` | `list_objects` | `list()` |
| POST | `/storage/v1/object/:bucket_id/*filename` | `upload_object` | `upload()` |
| GET | `/storage/v1/object/:bucket_id/*filename` | `download_object` | `download()` |
| DELETE | `/storage/v1/object/:bucket_id/*filename` | `delete_object` | `remove()` |
| POST | `/storage/v1/object/copy` | `copy_object` | `copy()` |
| POST | `/storage/v1/object/move` | `move_object` | `move()` |
| POST | `/storage/v1/object/sign/:bucket_id/*filename` | `sign_object` | `createSignedUrl()` |
| GET | `/storage/v1/object/sign/:bucket_id/*filename` | `get_signed_object` | (signed URL access) |
| GET | `/storage/v1/object/public/:bucket_id/*filename` | `get_public_url` | `getPublicUrl()` |
| POST | `/storage/v1/upload/resumable` | `tus_create_upload` | (TUS) |
| PATCH | `/storage/v1/upload/resumable/:id` | `tus_patch_upload` | (TUS) |
| HEAD | `/storage/v1/upload/resumable/:id` | `tus_head_upload` | (TUS) |
| GET | `/storage/v1/health` | `health_check` | — |
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every feature in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_s3_put_object` | `storage/src/backend.rs` | `put_object` stores bytes and returns Ok |
| `test_s3_get_object_streaming` | `storage/src/backend.rs` | `get_object` returns a streaming body, not buffered |
| `test_s3_delete_object` | `storage/src/backend.rs` | `delete_object` removes the key; subsequent `head_object` returns NotFound |
| `test_s3_copy_object` | `storage/src/backend.rs` | `copy_object` duplicates object; both keys exist |
| `test_s3_move_object` | `storage/src/backend.rs` | After `move_object`, old key is gone, new key exists |
| `test_s3_list_objects` | `storage/src/backend.rs` | `list_objects` returns correct prefix-filtered results |
| `test_s3_head_object_metadata` | `storage/src/backend.rs` | `head_object` returns correct size and content_type |
| `test_s3_create_and_delete_bucket` | `storage/src/backend.rs` | `create_bucket` + `delete_bucket` round-trip succeeds |
| `test_bucket_file_size_limit` | `storage/src/handlers.rs` | Upload exceeding `file_size_limit` returns 413 |
| `test_bucket_allowed_mime_types` | `storage/src/handlers.rs` | Upload with disallowed MIME type returns 415 |
| `test_tus_multipart_completion` | `storage/src/tus.rs` | TUS completion assembles parts via S3 multipart, not in-memory buffer |
| `test_health_check_minio_up` | `storage/src/handlers.rs` | `/health` returns 200 when S3 is reachable |
| `test_health_check_minio_down` | `storage/src/handlers.rs` | `/health` returns 503 when S3 is unreachable |
| `test_storage_mode_self_hosted` | `storage/src/backend.rs` | `STORAGE_MODE=self-hosted` initializes with MinIO endpoint |
| `test_storage_mode_cloud` | `storage/src/backend.rs` | `STORAGE_MODE=cloud` initializes with custom S3 endpoint |
### 2. Integration Verification
- [ ] `STORAGE_MODE=self-hosted docker compose -f docker-compose.pillar-storage.yml up` starts MinIO and passes health checks
- [ ] Upload a 10MB file via `POST /storage/v1/object/test-bucket/big-file.bin` — verify it doesn't OOM
- [ ] Download a 10MB file — verify streaming (no OOM)
- [ ] Delete an object via `DELETE /storage/v1/object/test-bucket/file.txt` — verify removed from S3 and DB
- [ ] Copy an object — verify new key exists in S3
- [ ] Move an object — verify old key removed, new key exists
- [ ] Upload to a bucket with `file_size_limit = 1000` — verify rejection for files over 1KB
- [ ] TUS upload of a 50MB file completes without loading into memory
- [ ] `GET /storage/v1/health` returns 200 when MinIO is up, 503 when down
- [ ] `STORAGE_MODE=cloud S3_ENDPOINT=https://fsn1.your-objectstorage.com ...` works with Hetzner Object Storage
- [ ] Every route in the Route Summary table above returns the correct response for both success and error cases
### 3. supabase-js Client Compatibility
- [ ] `supabase.storage.listBuckets()` works
- [ ] `supabase.storage.from('bucket').upload('file.txt', blob)` works
- [ ] `supabase.storage.from('bucket').download('file.txt')` works
- [ ] `supabase.storage.from('bucket').remove(['file.txt'])` works
- [ ] `supabase.storage.from('bucket').copy('a.txt', 'b.txt')` works
- [ ] `supabase.storage.from('bucket').move('a.txt', 'c.txt')` works
- [ ] `supabase.storage.from('bucket').createSignedUrl('file.txt', 3600)` works
- [ ] `supabase.storage.from('public-bucket').getPublicUrl('file.txt')` works
### 4. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] Integration tests that require MinIO are gated behind `#[cfg(feature = "integration")]` or `#[ignore]` with clear documentation
- [ ] CI runs unit tests on every PR; integration tests run on merge to main (or nightly)


@@ -0,0 +1,398 @@
# Milestone 3: Auth Completeness (supabase-js Compatibility)
**Goal:** `supabase.auth.*` works correctly for the core flows real apps need.
**Depends on:** M0 (Security), M1 (Foundation)
---
## 3.1 — Missing Core Endpoints
### 3.1.1 POST /auth/v1/logout
**File:** `auth/src/handlers.rs` (new function), `auth/src/lib.rs` (add route)
```rust
pub async fn logout(
State(state): State<AuthState>,
db: Option<Extension<PgPool>>,
Extension(auth_ctx): Extension<AuthContext>,
) -> Result<StatusCode, ApiError> {
let claims = auth_ctx.claims.ok_or(ApiError::Unauthorized("Not authenticated".into()))?;
let user_id = Uuid::parse_str(&claims.sub).map_err(|_| ApiError::Unauthorized("Invalid user ID".into()))?;
let db = db.map(|Extension(p)| p).unwrap_or(state.db.clone());
// Revoke all active refresh tokens for this user's current session
sqlx::query("UPDATE refresh_tokens SET revoked = true WHERE user_id = $1 AND revoked = false")
.bind(user_id)
.execute(&db)
.await?;
// If Redis sessions are active, destroy them
// if let Some(session_manager) = &state.session_manager {
// session_manager.delete_all_user_sessions(user_id).await.ok();
// }
Ok(StatusCode::NO_CONTENT)
}
```
Add route in `auth/src/lib.rs`:
```rust
.route("/logout", post(handlers::logout))
```
**supabase-js behavior:** `signOut()` calls `POST /auth/v1/logout` with the access token in the Authorization header. Expects 204 No Content.
### 3.1.2 GET /auth/v1/settings
Returns auth configuration that supabase-js reads during initialization:
```rust
pub async fn settings(
    State(state): State<AuthState>,
) -> Json<serde_json::Value> {
    Json(serde_json::json!({
        "external": {
            "google": state.config.google_client_id.is_some(),
            "github": state.config.github_client_id.is_some(),
            "azure": state.config.azure_client_id.is_some(),
            "gitlab": state.config.gitlab_client_id.is_some(),
            "bitbucket": state.config.bitbucket_client_id.is_some(),
            "discord": state.config.discord_client_id.is_some(),
        },
        "disable_signup": false,
        "mailer_autoconfirm": std::env::var("AUTH_AUTO_CONFIRM").map(|v| v == "true").unwrap_or(false),
        "sms_provider": "",
        "mfa_enabled": true,
    }))
}
```
### 3.1.3 POST /auth/v1/magiclink
Generates a one-time login token, sends it via email. When the user clicks the link, they hit `/auth/v1/verify?type=magiclink&token=...` which issues tokens.
```rust
pub async fn magiclink(
    State(state): State<AuthState>,
    db: Option<Extension<PgPool>>,
    Json(payload): Json<RecoverRequest>, // Reuses the email-only request body
) -> Result<Json<serde_json::Value>, ApiError> {
    let db = db.map(|Extension(p)| p).unwrap_or(state.db.clone());
    let token = generate_confirmation_token();
    sqlx::query("UPDATE users SET confirmation_token = $1 WHERE email = $2")
        .bind(&token).bind(&payload.email)
        .execute(&db).await?;
    tracing::info!(email = %payload.email, "Magic link requested (token suppressed)");
    // TODO: Send email with link: {SITE_URL}/auth/confirm?token={token}&type=magiclink
    Ok(Json(serde_json::json!({ "message": "Magic link sent if email exists" })))
}
```
### 3.1.4 DELETE /auth/v1/user
Self-deletion for authenticated users:
```rust
pub async fn delete_user(
    State(state): State<AuthState>,
    db: Option<Extension<PgPool>>,
    Extension(auth_ctx): Extension<AuthContext>,
) -> Result<StatusCode, ApiError> {
    let claims = auth_ctx.claims.ok_or(ApiError::Unauthorized("Not authenticated".into()))?;
    let user_id = Uuid::parse_str(&claims.sub)
        .map_err(|_| ApiError::Unauthorized("Invalid user ID".into()))?;
    let db = db.map(|Extension(p)| p).unwrap_or(state.db.clone());
    // Soft delete: set a deleted_at timestamp
    sqlx::query("UPDATE users SET deleted_at = now() WHERE id = $1")
        .bind(user_id).execute(&db).await?;
    // Revoke all tokens
    sqlx::query("UPDATE refresh_tokens SET revoked = true WHERE user_id = $1")
        .bind(user_id).execute(&db).await?;
    Ok(StatusCode::NO_CONTENT)
}
```
**Migration needed:** `ALTER TABLE users ADD COLUMN IF NOT EXISTS deleted_at TIMESTAMPTZ;`
---
## 3.2 — Fix Existing Flows
### 3.2.1 Recovery flow must accept new password
**File:** `auth/src/handlers.rs``verify` function, recovery branch (line ~335)
```rust
"recovery" => {
let user = sqlx::query_as::<_, User>(
"UPDATE users SET recovery_token = NULL WHERE recovery_token = $1 RETURNING *"
)
.bind(&payload.token)
.fetch_optional(&db).await?
.ok_or(ApiError::BadRequest("Invalid token".into()))?;
// Apply new password if provided
if let Some(new_password) = &payload.password {
let hashed = hash_password(new_password)
.map_err(|e| ApiError::Internal(e.to_string()))?;
sqlx::query("UPDATE users SET encrypted_password = $1 WHERE id = $2")
.bind(&hashed).bind(user.id)
.execute(&db).await?;
}
user
}
```
### 3.2.2 Email change must require re-verification
**File:** `auth/src/handlers.rs``update_user` function (line ~392)
Instead of immediately updating the email:
```rust
if let Some(new_email) = &payload.email {
    let token = generate_confirmation_token();
    sqlx::query(
        "UPDATE users SET email_change = $1, email_change_token_new = $2 WHERE id = $3"
    )
    .bind(new_email).bind(&token).bind(user_id)
    .execute(&mut *tx).await?;
    // TODO: Send confirmation email to new_email with token
    tracing::info!(user_id = %user_id, new_email = %new_email, "Email change requested");
}
```
The actual email update happens when the user verifies via `/auth/v1/verify?type=email_change&token=...`.
### 3.2.3 OAuth callback must redirect
**File:** `auth/src/oauth.rs``callback` function (end)
```rust
// BEFORE — returns JSON
Ok(Json(AuthResponse { access_token, ... }))

// AFTER — redirect with tokens in the URL fragment
let site_url = std::env::var("SITE_URL").unwrap_or_else(|_| "http://localhost:3000".into());
let redirect_url = format!(
    "{}#access_token={}&token_type=bearer&expires_in={}&refresh_token={}",
    site_url, access_token, expires_in, refresh_token
);
Ok(Redirect::to(&redirect_url))
```
### 3.2.4 MFA verify must issue aal2 token
**File:** `auth/src/mfa.rs``verify` function (line ~179)
After successful TOTP verification:
```rust
// Issue an upgraded JWT carrying aal2
let jwt_secret = project_ctx.jwt_secret.as_str();
let (token, expires_in, _) = generate_token_with_aal(
    user_id, &email, "authenticated", jwt_secret, "aal2"
)?;
let refresh_token = issue_refresh_token(&db, user_id, Uuid::new_v4(), None).await
    .map_err(|e| error_response(StatusCode::INTERNAL_SERVER_ERROR, e.1))?;
Ok(Json(serde_json::json!({
    "access_token": token,
    "token_type": "bearer",
    "expires_in": expires_in,
    "refresh_token": refresh_token,
})))
```
**New function needed in `auth/src/utils.rs`:** `generate_token_with_aal()` that adds `aal` and `amr` claims to the JWT.
**Update Claims model** in `auth/src/models.rs`:
```rust
pub struct Claims {
    pub sub: String,
    pub email: Option<String>,
    pub role: String,
    pub exp: usize,
    pub iss: String,
    pub aud: Option<String>,
    pub iat: usize,
    pub session_id: Option<String>,  // NEW
    pub aal: Option<String>,         // NEW: "aal1" or "aal2"
    pub amr: Option<Vec<AmrEntry>>,  // NEW
}

#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct AmrEntry {
    pub method: String, // "password", "totp", "oauth"
    pub timestamp: usize,
}
```
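A minimal sketch of what `generate_token_with_aal()` could look like, assuming the crate already uses `jsonwebtoken` and `chrono` (as `generate_token` presumably does); the issuer, audience, and fixed lifetime shown here are assumptions and should be copied from the existing `generate_token`:
```rust
use chrono::{Duration, Utc};
use jsonwebtoken::{encode, EncodingKey, Header};

// Sketch only: issuer/audience/lifetime are placeholders, not the real values.
pub fn generate_token_with_aal(
    user_id: Uuid,
    email: &str,
    role: &str,
    jwt_secret: &str,
    aal: &str,
) -> Result<(String, i64, usize), jsonwebtoken::errors::Error> {
    let lifetime: i64 = 3600;
    let now = Utc::now();
    let exp = (now + Duration::seconds(lifetime)).timestamp() as usize;
    let claims = Claims {
        sub: user_id.to_string(),
        email: Some(email.to_string()),
        role: role.to_string(),
        exp,
        iss: "madbase".to_string(),
        aud: Some("authenticated".to_string()),
        iat: now.timestamp() as usize,
        session_id: None,
        aal: Some(aal.to_string()),
        amr: Some(vec![AmrEntry { method: "totp".into(), timestamp: now.timestamp() as usize }]),
    };
    let token = encode(&Header::default(), &claims, &EncodingKey::from_secret(jwt_secret.as_bytes()))?;
    Ok((token, lifetime, exp))
}
```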
### 3.2.5 MFA challenge must validate ownership
**File:** `auth/src/mfa.rs``challenge` function (line ~186)
```rust
// BEFORE — no user check
let row = sqlx::query("SELECT factor_type FROM auth.mfa_factors WHERE id = $1 AND status = 'verified'")
    .bind(factor_id)...

// AFTER — verify the user owns the factor
let user_id = auth_ctx.claims.as_ref()
    .and_then(|c| Uuid::parse_str(&c.sub).ok())
    .ok_or(error_response(StatusCode::UNAUTHORIZED, "Invalid user".into()))?;
let row = sqlx::query("SELECT factor_type FROM auth.mfa_factors WHERE id = $1 AND user_id = $2 AND status = 'verified'")
    .bind(factor_id)
    .bind(user_id)...
```
### 3.2.6 Store and validate MFA challenges
**Migration:**
```sql
CREATE TABLE IF NOT EXISTS auth.mfa_challenges (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    factor_id   UUID NOT NULL REFERENCES auth.mfa_factors(id) ON DELETE CASCADE,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    verified_at TIMESTAMPTZ,
    ip_address  TEXT
);
CREATE INDEX idx_mfa_challenges_factor ON auth.mfa_challenges(factor_id);
```
In `challenge` handler, insert a challenge row and return its ID. In `verify` handler, validate the challenge_id exists, is recent (< 5 minutes), and belongs to the correct factor.
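A sketch of the verify-side check under those rules, reusing the `error_response` helper already used in `mfa.rs`; `challenge_id` and `factor_id` are assumed to come from the request body, and the column names follow the migration above:
```rust
// Look up an unconsumed, recent challenge for this factor.
let challenge_ok: Option<(Uuid,)> = sqlx::query_as(
    "SELECT id FROM auth.mfa_challenges
     WHERE id = $1 AND factor_id = $2
       AND verified_at IS NULL
       AND created_at > now() - interval '5 minutes'",
)
.bind(challenge_id)
.bind(factor_id)
.fetch_optional(&db)
.await
.map_err(|e| error_response(StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;

if challenge_ok.is_none() {
    return Err(error_response(StatusCode::BAD_REQUEST, "Invalid or expired challenge".into()));
}

// Mark the challenge as consumed so it cannot be replayed.
sqlx::query("UPDATE auth.mfa_challenges SET verified_at = now() WHERE id = $1")
    .bind(challenge_id)
    .execute(&db)
    .await
    .map_err(|e| error_response(StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;
```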
---
## 3.3 — Session Management
### 3.3.1 Wire in SessionManager
**File:** `auth/src/handlers.rs``login` and `signup` functions
After generating tokens, create a session:
```rust
if let Some(session_manager) = &state.session_manager {
    let session_token = session_manager.create_session(
        user.id, user.email.clone(), "authenticated".into()
    ).await.ok();
    // Include session_id in JWT claims (via generate_token)
}
```
**Add to AuthState:**
```rust
pub struct AuthState {
    pub db: PgPool,
    pub config: Config,
    pub session_manager: Option<SessionManager>,
}
```
Initialize in `gateway/src/worker.rs` when Redis is available.
### 3.3.2 Add GET /auth/v1/sessions
List active sessions for "sign out other devices" UI.
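A sketch of what the handler could return, deriving sessions from non-revoked refresh tokens rather than a dedicated sessions table; the `id` and `created_at` columns on `refresh_tokens` are assumptions:
```rust
use sqlx::Row;

pub async fn list_sessions(
    State(state): State<AuthState>,
    db: Option<Extension<PgPool>>,
    Extension(auth_ctx): Extension<AuthContext>,
) -> Result<Json<serde_json::Value>, ApiError> {
    let claims = auth_ctx.claims.ok_or(ApiError::Unauthorized("Not authenticated".into()))?;
    let user_id = Uuid::parse_str(&claims.sub)
        .map_err(|_| ApiError::Unauthorized("Invalid user ID".into()))?;
    let db = db.map(|Extension(p)| p).unwrap_or(state.db.clone());

    let rows = sqlx::query(
        "SELECT id, created_at FROM refresh_tokens WHERE user_id = $1 AND revoked = false ORDER BY created_at DESC",
    )
    .bind(user_id)
    .fetch_all(&db)
    .await?;

    // One "session" per live refresh token.
    let sessions: Vec<serde_json::Value> = rows
        .iter()
        .map(|r| {
            serde_json::json!({
                "id": r.get::<Uuid, _>("id"),
                "created_at": r.get::<chrono::DateTime<chrono::Utc>, _>("created_at"),
            })
        })
        .collect();
    Ok(Json(serde_json::json!({ "sessions": sessions })))
}
```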
---
## 3.4 — Token Quality
### 3.4.1 Configurable token expiry
**File:** `auth/src/utils.rs``generate_token` (line 65)
```rust
// BEFORE
Duration::seconds(3600)
// AFTER
let lifetime = std::env::var("ACCESS_TOKEN_LIFETIME")
.ok()
.and_then(|v| v.parse::<i64>().ok())
.unwrap_or(3600);
Duration::seconds(lifetime)
```
### 3.4.2 Hash confirmation/recovery tokens
**File:** `auth/src/handlers.rs``signup` function
```rust
let raw_token = generate_confirmation_token();
let hashed_token = hash_refresh_token(&raw_token); // Reuse SHA-256 hasher
// Store hashed version in DB
.bind(&hashed_token)
// Log/email the raw version
tracing::info!("Confirmation token generated for {}", user.email);
```
On verify, hash the incoming token before comparison:
```rust
let hashed_input = hash_refresh_token(&payload.token);
sqlx::query("... WHERE confirmation_token = $1").bind(&hashed_input)
```
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every feature/fix in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_logout_revokes_refresh_tokens` | `auth/src/handlers.rs` | `POST /logout` sets `revoked = true` on all user refresh tokens |
| `test_logout_returns_204` | `auth/src/handlers.rs` | `POST /logout` returns 204 No Content |
| `test_logout_unauthenticated_401` | `auth/src/handlers.rs` | `POST /logout` without bearer token returns 401 |
| `test_settings_endpoint` | `auth/src/handlers.rs` | `GET /settings` returns provider availability and `autoconfirm` flag |
| `test_magiclink_creates_token` | `auth/src/handlers.rs` | `POST /magiclink` generates a one-time token (DB row exists) |
| `test_recovery_requires_new_password` | `auth/src/handlers.rs` | Verification without `password` in body returns 422 |
| `test_recovery_updates_password` | `auth/src/handlers.rs` | After verification with new password, `password_hash` is updated |
| `test_email_change_requires_verification` | `auth/src/handlers.rs` | `PUT /user` with email change sets `new_email` but doesn't update `email` directly |
| `test_oauth_callback_redirects` | `auth/src/oauth.rs` | OAuth callback returns 302 to `SITE_URL` with tokens in fragment |
| `test_mfa_verify_returns_aal2` | `auth/src/mfa.rs` | Successful TOTP verification returns JWT with `aal: aal2` claim |
| `test_mfa_rejects_wrong_factor_owner` | `auth/src/mfa.rs` | Verifying a factor not owned by the user returns 403 |
| `test_token_expiry_configurable` | `auth/src/utils.rs` | JWT `exp` respects `ACCESS_TOKEN_LIFETIME` env var |
| `test_session_list` | `auth/src/session.rs` | `GET /sessions` returns active sessions for the current user |
| `test_confirmation_tokens_hashed` | `auth/src/handlers.rs` | Confirmation token stored in DB is hashed, not plaintext |
### 2. Integration / supabase-js Compatibility Verification
- [ ] `supabase.auth.signOut()` → 204, refresh tokens revoked
- [ ] `supabase.auth.getSession()` returns null after signOut
- [ ] Password recovery flow: request → verify with new password → login with new password works
- [ ] Email change: request → confirmation email sent → verify → email updated
- [ ] OAuth callback redirects to `SITE_URL` with tokens in fragment
- [ ] MFA enroll → challenge → verify returns aal2 token
- [ ] MFA challenge rejects if user doesn't own the factor
- [ ] Token expiry respects `ACCESS_TOKEN_LIFETIME` env var
- [ ] `GET /auth/v1/settings` returns correct provider availability
- [ ] `supabase.auth.signUp()` / `signInWithPassword()` / `signInWithOAuth()` / `resetPasswordForEmail()` / `updateUser()` — all return shapes match supabase-js expectations
### 3. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] Tests that require a running Postgres are either:
- (a) using an in-memory SQLx test database, or
- (b) gated behind `#[ignore]` with a comment explaining the external dependency
- [ ] No `#[allow(unused)]` on any new code in this milestone

# Milestone 4: Data API Completeness
**Goal:** `supabase.from(table).select().eq().order()` and the full PostgREST query surface works.
**Depends on:** M0 (Security), M1 (Foundation)
---
## 4.1 — Missing Operators & Features
### Implementation approach
All operators are parsed in `data_api/src/parser.rs` and applied in `data_api/src/handlers.rs`. The parser already handles `eq`, `neq`, `gt`, `gte`, `lt`, `lte`, `like`, `ilike`, `in`, `is`. It also has partial `or`/`and` support in `FilterNode::parse`.
### 4.1.1 or / not filters
The parser already parses `or(col1.eq.a,col2.eq.b)` into `FilterNode::Or(...)`. Verify the SQL generation in `build_filter_clause` correctly emits `(col1 = 'a' OR col2 = 'b')`.
Add `not` operator:
```rust
// In parser.rs Operator enum
Not, // Wraps another condition with NOT
// In parser.rs
"not" => Some(Operator::Not),
// In to_sql
Operator::Not => "NOT",
```
Usage: `?name=not.eq.null`. Note that emitting `NOT (name = NULL)` never matches any row in SQL; the generator should translate this case to `name IS NOT NULL`.
### 4.1.2 contains / containedBy
For JSONB and array columns:
```rust
Operator::Contains => "@>",
Operator::ContainedBy => "<@",
```
Parse: `?tags=cs.{a,b}``tags @> ARRAY['a','b']`
### 4.1.3 textSearch
```rust
Operator::TextSearch => "@@",
```
Parse: `?content=fts.hello+world``to_tsvector(content) @@ plainto_tsquery('hello world')`
### 4.1.4 Range pagination
Read `Range` header in handler:
```rust
let range = headers.get("Range")
.and_then(|v| v.to_str().ok())
.and_then(|s| {
let parts: Vec<&str> = s.split('-').collect();
Some((parts[0].parse::<usize>().ok()?, parts[1].parse::<usize>().ok()?))
});
if let Some((start, end)) = range {
// Add OFFSET start LIMIT (end - start + 1)
// Set Content-Range header in response: "0-9/100"
}
```
### 4.1.5 Prefer: count=exact
Read `Prefer` header:
```rust
let want_count = headers.get("Prefer")
.and_then(|v| v.to_str().ok())
.map(|s| s.contains("count=exact"))
.unwrap_or(false);
if want_count {
// Run a parallel COUNT(*) query
// Set Content-Range: "0-9/42" or "*/42"
}
```
### 4.1.6 single / maybeSingle
Read `Accept` header:
```rust
let want_single = headers.get("Accept")
.and_then(|v| v.to_str().ok())
.map(|s| s.contains("vnd.pgrst.object+json"))
.unwrap_or(false);
if want_single {
// LIMIT 1, return object instead of array
// If no rows: 406 Not Acceptable (for single), null (for maybeSingle)
}
```
### 4.1.7 Upsert
Read `Prefer` header for `resolution=merge-duplicates`:
```rust
let prefer_upsert = headers.get("Prefer")
.and_then(|v| v.to_str().ok())
.map(|s| s.contains("resolution=merge-duplicates"))
.unwrap_or(false);
if prefer_upsert {
// INSERT ... ON CONFLICT DO UPDATE SET col1 = EXCLUDED.col1, ...
}
```
### 4.1.8 RPC support
**File:** `data_api/src/handlers.rs` (new), `data_api/src/lib.rs` (add route)
```rust
.route("/rpc/:function_name", post(handlers::call_rpc))
```
```rust
pub async fn call_rpc(
    State(state): State<DataState>,
    Extension(auth_ctx): Extension<AuthContext>,
    Path(function_name): Path<String>,
    db: Option<Extension<PgPool>>,
    Json(params): Json<serde_json::Value>, // body extractor must come last
) -> Result<Json<serde_json::Value>, ApiError> {
    // Validate function_name is a valid identifier
    if !is_valid_identifier(&function_name) {
        return Err(ApiError::BadRequest("Invalid function name".into()));
    }
    let pool = db.map(|Extension(p)| p).unwrap_or(state.db.clone());
    let mut rls = RlsTransaction::begin(&pool, &auth_ctx).await?;
    // Build: SELECT * FROM function_name($1)
    let query = format!("SELECT * FROM {}($1::jsonb)", function_name);
    let rows = sqlx::query(&query)
        .bind(&params)
        .fetch_all(&mut *rls.tx)
        .await?;
    let result = rows_to_json(rows);
    Ok(Json(result))
}
```
### 4.1.9 Schema selection
Read `Accept-Profile` / `Content-Profile` headers:
```rust
let schema = headers.get("Accept-Profile")
.or(headers.get("Content-Profile"))
.and_then(|v| v.to_str().ok())
.unwrap_or("public");
// Validate schema exists
// Add SET LOCAL search_path = schema in the RLS transaction
```
---
## 4.2 — Nested Resource Embedding
This is the most complex feature. PostgREST's `select=*,author:users(*)` generates JOINs based on FK relationships.
### Phase 1: Single-level explicit FK
The parser already handles `SelectNode::Relation("author:users", inner_columns)` via `SelectNode::parse`. The handler needs to:
1. Detect `Relation` nodes in the select list
2. Look up the FK between the main table and the related table
3. Generate a LEFT JOIN or subquery
4. Nest the results in the JSON response
**Schema introspection query:**
```sql
SELECT
    tc.constraint_name,
    kcu.column_name  AS fk_column,
    ccu.table_schema AS referenced_schema,
    ccu.table_name   AS referenced_table,
    ccu.column_name  AS referenced_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY' AND tc.table_name = $1
```
**Cache this** per table (see 4.3).
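For the many-to-one case, one workable approach is to emit a correlated `row_to_json` subquery per `Relation` node instead of a JOIN, so the existing row-mapping code stays untouched. A sketch, with the FK columns passed explicitly (one-to-many embeds would use `json_agg` the same way); all identifiers are assumed to be validated against the schema cache before being interpolated:
```rust
// Sketch: "author:users(name)" on table "posts" with FK posts.author_id -> users.id
// becomes: (SELECT row_to_json(t) FROM (SELECT name FROM "users"
//           WHERE "users"."id" = "posts"."author_id") t) AS "author"
fn build_embed_select(
    alias: &str,
    main_table: &str,
    related_table: &str,
    inner_cols: &[String],
    fk_column: &str,
    referenced_column: &str,
) -> String {
    let cols = if inner_cols.is_empty() { "*".to_string() } else { inner_cols.join(", ") };
    format!(
        "(SELECT row_to_json(t) FROM (SELECT {} FROM \"{}\" WHERE \"{}\".\"{}\" = \"{}\".\"{}\") t) AS \"{}\"",
        cols, related_table, related_table, referenced_column, main_table, fk_column, alias
    )
}
```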
### Phase 2: Multi-level nesting
Recursive: for each `Relation` node, apply the same embedding logic to its inner `Relation` nodes.
### Phase 3: Computed/virtual relationships
Allow `!inner` (INNER JOIN) and `!left` (LEFT JOIN) hints in the select parameter.
---
## 4.3 — Performance
### 4.3.1 Cache schema introspection
Create a `SchemaCache` that loads FK and column metadata on first request per table, caches with 5-minute TTL:
```rust
use std::time::Duration;
use moka::future::Cache;

pub struct SchemaCache {
    fk_cache: Cache<String, Vec<ForeignKey>>,
    column_cache: Cache<String, Vec<ColumnInfo>>,
}

impl SchemaCache {
    pub fn new() -> Self {
        Self {
            fk_cache: Cache::builder().time_to_live(Duration::from_secs(300)).build(),
            column_cache: Cache::builder().time_to_live(Duration::from_secs(300)).build(),
        }
    }
}
```
Invalidate on DDL changes by listening to `pg_notify('schema_change', ...)` via a background task.
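The listening side can be a small background task built on `sqlx::postgres::PgListener`; the `schema_change` channel name and the DDL event trigger that calls `pg_notify` are assumptions that would need a migration of their own:
```rust
use sqlx::postgres::PgListener;

// Sketch: clear cached metadata whenever Postgres notifies us of a DDL change.
async fn watch_schema_changes(pool: sqlx::PgPool, cache: SchemaCache) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(&pool).await?;
    listener.listen("schema_change").await?;
    loop {
        let notification = listener.recv().await?;
        let table = notification.payload();
        if table.is_empty() {
            // No table name in the payload: drop everything.
            cache.fk_cache.invalidate_all();
            cache.column_cache.invalidate_all();
        } else {
            cache.fk_cache.invalidate(table).await;
            cache.column_cache.invalidate(table).await;
        }
    }
}
```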
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every feature in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_parse_or_filter` | `data_api/src/parser.rs` | `or(title.eq.A,title.eq.B)` generates correct SQL with `OR` |
| `test_parse_not_filter` | `data_api/src/parser.rs` | `not.status.eq.draft` generates `NOT (status = 'draft')` |
| `test_parse_contains_jsonb` | `data_api/src/parser.rs` | `tags.cs.{a,b}` generates `tags @> $1` |
| `test_parse_contained_by` | `data_api/src/parser.rs` | `tags.cd.{a,b,c}` generates `tags <@ $1` |
| `test_parse_text_search` | `data_api/src/parser.rs` | `fts.english.hello` generates `to_tsvector('english', col) @@ to_tsquery($1)` |
| `test_range_header_pagination` | `data_api/src/handlers.rs` | `Range: 0-9` returns 10 rows with `Content-Range: 0-9/*` |
| `test_count_exact_header` | `data_api/src/handlers.rs` | `Prefer: count=exact` returns `Content-Range: 0-N/TOTAL` |
| `test_single_object_response` | `data_api/src/handlers.rs` | `Accept: application/vnd.pgrst.object+json` returns a single JSON object, not array |
| `test_single_object_406_on_multiple` | `data_api/src/handlers.rs` | Single-object mode with 2+ rows returns 406 |
| `test_upsert_merge_duplicates` | `data_api/src/handlers.rs` | `Prefer: resolution=merge-duplicates` upserts correctly |
| `test_rpc_call` | `data_api/src/handlers.rs` | `POST /rest/v1/rpc/my_func` with JSON params calls the function and returns results |
| `test_rpc_invalid_name_rejected` | `data_api/src/handlers.rs` | `POST /rest/v1/rpc/drop table` returns 400 |
| `test_schema_selection` | `data_api/src/handlers.rs` | `Accept-Profile: custom_schema` queries the correct schema |
| `test_nested_select_fk_join` | `data_api/src/handlers.rs` | `select=*,author:users(name)` returns nested objects |
| `test_schema_cache_invalidation` | `data_api/src/handlers.rs` | Schema cache refreshes after DDL changes (or after TTL) |
### 2. Integration / supabase-js Compatibility Verification
- [ ] `supabase.from('posts').select('*').or('title.eq.Hello,title.eq.World')` returns matching rows
- [ ] `supabase.from('posts').select('*, author:users(name)')` returns nested author objects
- [ ] `supabase.from('posts').select('*', { count: 'exact' })` returns count in `Content-Range` header
- [ ] `supabase.from('posts').upsert({ id: 1, title: 'Updated' })` creates or updates
- [ ] `supabase.rpc('my_function', { param: 'value' })` calls the Postgres function
- [ ] `supabase.from('posts').select('*').range(0, 9)` returns first 10 rows with `Content-Range`
- [ ] Schema selection via `Accept-Profile` header works
- [ ] `.single()` returns one object (not array) and 406 on 0 or 2+ results
- [ ] `.maybeSingle()` returns one object or null
### 3. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] Parser tests are pure (no DB needed) and run on every PR
- [ ] Handler integration tests that require Postgres are documented and gated appropriately
- [ ] `cargo clippy --workspace -- -D warnings` passes with no new warnings

# Milestone 5: Realtime
**Goal:** `supabase.channel('room').on('postgres_changes', ...).subscribe()` delivers filtered change events to authorized clients.
**Depends on:** M0 (Security), M1 (Foundation)
---
## 5.1 — Fix Core Functionality
### 5.1.1 Make the realtime crate compile
**File:** `realtime/src/lib.rs`
Current issue: `pub mod lib;` is self-referential and will fail. The crate also references `PostgresPayload` and `PresenceMessage` types that don't exist.
**Fix:**
1. Remove `pub mod lib;` — it creates a circular module reference
2. Define the missing types in a `types.rs` module:
```rust
// realtime/src/types.rs
use serde::{Serialize, Deserialize};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PostgresPayload {
    pub schema: String,
    pub table: String,
    #[serde(rename = "type")]
    pub change_type: String, // INSERT, UPDATE, DELETE
    pub record: Option<serde_json::Value>,
    pub old_record: Option<serde_json::Value>,
    pub columns: Option<Vec<ColumnInfo>>,
    #[serde(default)]
    pub truncated: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ColumnInfo {
    pub name: String,
    #[serde(rename = "type")] // wire format uses "type", so rename the reserved word
    pub type_: String,
}
```
Update `lib.rs`:
```rust
pub mod types;
pub mod replication;
pub mod ws;
pub mod presence;
pub use types::*;
pub use presence::{PresenceManager, PresenceInfo, PresenceStatus};
```
### 5.1.2 Per-table broadcast channels
**Current problem:** A single `tokio::sync::broadcast::Sender` is used for all table changes. Every connected client receives every change from every table, then filters client-side.
**Fix:** Use a `DashMap<String, broadcast::Sender<PostgresPayload>>` keyed by `"schema.table"`:
```rust
use std::sync::Arc;
use dashmap::DashMap;
use tokio::sync::broadcast;

pub struct RealtimeState {
    pub channels: Arc<DashMap<String, broadcast::Sender<PostgresPayload>>>,
}

impl RealtimeState {
    pub fn get_or_create_channel(&self, key: &str) -> broadcast::Sender<PostgresPayload> {
        self.channels
            .entry(key.to_string())
            .or_insert_with(|| broadcast::channel(1024).0)
            .clone()
    }
}
```
The replication listener dispatches to the correct channel:
```rust
let key = format!("{}.{}", payload.schema, payload.table);
if let Some(sender) = state.channels.get(&key) {
let _ = sender.send(payload);
}
```
Clients subscribe to specific channels on join.
### 5.1.3 Authorization
On WebSocket join for a postgres_changes subscription:
```rust
async fn authorize_subscription(
    pool: &PgPool,
    auth_ctx: &AuthContext,
    schema: &str,
    table: &str,
) -> Result<bool, ApiError> {
    let mut tx = pool.begin().await?;
    // Set the user's role (must come from the server-side allowlist of known roles, never raw client input)
    let role_query = format!("SET LOCAL role = '{}'", auth_ctx.role);
    sqlx::query(&role_query).execute(&mut *tx).await?;
    if let Some(claims) = &auth_ctx.claims {
        sqlx::query("SELECT set_config('request.jwt.claim.sub', $1, true)")
            .bind(&claims.sub).execute(&mut *tx).await?;
    }
    // Attempt a SELECT — if RLS denies it, the user can't subscribe
    let check = format!("SELECT 1 FROM \"{}\".\"{}\" LIMIT 0", schema, table);
    match sqlx::query(&check).execute(&mut *tx).await {
        Ok(_) => Ok(true),
        Err(_) => Ok(false),
    }
}
```
### 5.1.4 Event type filtering
Client sends a join message specifying which events to receive:
```json
["1", "1", "realtime:public:posts", "phx_join", {
"config": {
"postgres_changes": [{
"event": "INSERT",
"schema": "public",
"table": "posts",
"filter": "user_id=eq.123"
}]
}
}]
```
Server-side, filter before sending:
```rust
if let Some(event_filter) = &subscription.event_filter {
    if !event_filter.contains(&payload.change_type) {
        continue; // Skip this event
    }
}
```
### 5.1.5 Row-level filtering
Apply the filter expression from the subscription config:
```rust
if let Some(filter) = &subscription.filter {
    // Parse "user_id=eq.123" into a condition
    // Check whether payload.record matches the condition
    if !matches_filter(&payload.record, filter) {
        continue;
    }
}
```
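A minimal `matches_filter` sketch that handles only the `column=eq.value` form supabase-js sends today; the other operators would extend the `match`:
```rust
// Sketch: check a change record against a subscription filter like "user_id=eq.123".
fn matches_filter(record: &Option<serde_json::Value>, filter: &str) -> bool {
    let Some(record) = record else { return false };
    let Some((column, rest)) = filter.split_once('=') else { return false };
    let Some((op, expected)) = rest.split_once('.') else { return false };
    match op {
        "eq" => record
            .get(column)
            .map(|v| match v {
                serde_json::Value::String(s) => s == expected,
                other => other.to_string() == expected,
            })
            .unwrap_or(false),
        // "neq", "gt", "lt", "in" would follow the same pattern
        _ => false,
    }
}
```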
### 5.1.6 Replication listener retry
**File:** `gateway/src/worker.rs` — replication spawn (line ~66)
```rust
tokio::spawn(async move {
    loop {
        match realtime::replication::start_replication_listener(repl_config.clone(), repl_tx.clone()).await {
            Ok(_) => {
                tracing::warn!("Replication listener exited normally, restarting...");
            }
            Err(e) => {
                tracing::error!("Replication listener failed: {}, retrying in 5s", e);
                tokio::time::sleep(std::time::Duration::from_secs(5)).await;
            }
        }
    }
});
```
---
## 5.2 — Broadcast & Presence
### 5.2.1 Broadcast channels
Broadcast channels are server-side fan-out without touching the database. Clients send messages to a topic, and all subscribers on that topic receive them.
```rust
// On receiving a broadcast message from a client:
let key = format!("broadcast:{}", topic);
let sender = state.get_or_create_channel(&key);
sender.send(BroadcastPayload { event, payload }).ok();
```
### 5.2.2 Wire in presence
Connect `realtime/src/presence.rs` to the WebSocket handler:
- On `phx_join` with presence config: call `presence_manager.join_channel(user_id, channel, metadata)`
- On `phx_leave` or disconnect: call `presence_manager.leave_channel(user_id, channel)`
- Periodic heartbeat: call `presence_manager.heartbeat(user_id, channel)`
- On `presence_state` request: return `presence_manager.get_channel_users(channel)`
- On presence change: broadcast `presence_diff` to all channel subscribers
---
## 5.3 — Phoenix Protocol
### 5.3.1 Message format
supabase-js sends and expects JSON arrays: `[join_ref, ref, topic, event, payload]`
Verify the server parses this correctly. The current WS handler may expect JSON objects. Test with:
```javascript
const channel = supabase.channel('test')
  .on('postgres_changes', { event: 'INSERT', schema: 'public', table: 'posts' }, (payload) => {
    console.log(payload);
  })
  .subscribe();
```
Server responses must also be arrays:
```json
["1", "1", "realtime:public:posts", "phx_reply", {"status": "ok", "response": {}}]
```
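Because the frames are positional arrays, a serde tuple struct keeps both parsing and serialization honest about the shape (a sketch; heartbeat frames carry `null` refs, hence the `Option`s):
```rust
use serde::{Deserialize, Serialize};

// [join_ref, ref, topic, event, payload]
#[derive(Debug, Serialize, Deserialize)]
pub struct PhoenixFrame(
    pub Option<String>,    // join_ref
    pub Option<String>,    // ref
    pub String,            // topic, e.g. "realtime:public:posts"
    pub String,            // event, e.g. "phx_join" / "phx_reply"
    pub serde_json::Value, // payload
);

fn parse_frame(text: &str) -> Result<PhoenixFrame, serde_json::Error> {
    serde_json::from_str(text)
}
```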
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] `cargo build -p realtime` compiles without errors
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every feature in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_postgres_payload_deserialize` | `realtime/src/lib.rs` | `PostgresPayload` correctly deserializes a pgoutput message |
| `test_column_info_mapping` | `realtime/src/lib.rs` | `ColumnInfo` maps OIDs to column names and types |
| `test_per_table_channel_isolation` | `realtime/src/lib.rs` | Messages for `public.posts` don't reach subscribers of `public.users` |
| `test_authorize_subscription_allowed` | `realtime/src/lib.rs` | User with SELECT on table → `authorize_subscription` returns true |
| `test_authorize_subscription_denied` | `realtime/src/lib.rs` | User without SELECT on table → `authorize_subscription` returns false |
| `test_event_type_filter` | `realtime/src/lib.rs` | Subscribing to INSERT only → UPDATE events are filtered out |
| `test_row_level_filter` | `realtime/src/lib.rs` | `filter: 'user_id=eq.123'` → only matching rows are delivered |
| `test_broadcast_delivery` | `realtime/src/lib.rs` | Broadcast message to a topic reaches all subscribers |
| `test_broadcast_no_cross_topic_leak` | `realtime/src/lib.rs` | Broadcast on topic A doesn't reach topic B subscribers |
| `test_presence_join` | `realtime/src/presence.rs` | Joining a channel broadcasts a `presence_state` event |
| `test_presence_leave` | `realtime/src/presence.rs` | Leaving a channel broadcasts an updated `presence_diff` |
| `test_presence_key_format_consistency` | `realtime/src/presence.rs` | All Redis keys use `presence:channel:{ch}:user:{id}` format (regression for the existing bug) |
| `test_replication_listener_retry` | `realtime/src/lib.rs` | After simulated disconnect, listener reconnects within 5s |
| `test_phoenix_message_format` | `realtime/src/lib.rs` | Outbound messages match `[join_ref, ref, topic, event, payload]` Phoenix format |
### 2. Integration / supabase-js Compatibility Verification
- [ ] WebSocket connection to `/realtime/v1/websocket` succeeds
- [ ] Subscribing to `postgres_changes` for a table the user has access to works
- [ ] Subscribing to a table the user does NOT have access to is rejected
- [ ] INSERT into a subscribed table delivers a change event to the client
- [ ] Event type filter: subscribing to INSERT only → UPDATE events are not received
- [ ] Row-level filter: `filter: 'user_id=eq.123'` → only matching changes are received
- [ ] Broadcast: sending a message to a topic → all subscribers receive it
- [ ] Presence: joining a channel → other members see the join event
- [ ] Replication listener auto-restarts after failure
- [ ] `supabase.channel('room').on('postgres_changes', { event: 'INSERT', schema: 'public', table: 'messages' }, callback).subscribe()` — full round-trip works
### 3. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] Tests requiring a Postgres replication slot are gated behind `#[ignore]` or `#[cfg(feature = "integration")]`
- [ ] `cargo build --workspace` succeeds (no compilation errors in `realtime`)

# Milestone 6: Edge Functions
**Goal:** `supabase.functions.invoke('my-function')` executes user code safely with proper isolation.
**Depends on:** M0 (Security), M1 (Foundation)
---
## 6.1 — Security Fixes
### 6.1.1 Sandbox the Deno runtime
**File:** `functions/src/deno_runtime.rs` line 46
**Current problem:** `FsModuleLoader` allows user functions to `import` any file from the server filesystem, including `/etc/passwd`, source code, `.env` files, etc.
**Fix:** Create a restricted module loader:
```rust
use std::path::PathBuf;
use deno_core::{ModuleLoader, ModuleSource, ModuleSourceCode, ModuleType, ModuleLoadResponse, RequestedModuleType};

struct SandboxedModuleLoader {
    allowed_dir: PathBuf,
}

impl ModuleLoader for SandboxedModuleLoader {
    fn resolve(&self, specifier: &str, referrer: &str, _kind: deno_core::ResolutionKind) -> Result<deno_core::ModuleSpecifier, anyhow::Error> {
        let resolved = deno_core::resolve_import(specifier, referrer)?;
        // Only allow file:// URLs within the allowed directory
        if resolved.scheme() == "file" {
            let path = resolved.to_file_path()
                .map_err(|_| anyhow::anyhow!("Invalid file path"))?;
            let canonical = path.canonicalize()
                .map_err(|_| anyhow::anyhow!("Path not found: {}", path.display()))?;
            if !canonical.starts_with(&self.allowed_dir) {
                return Err(anyhow::anyhow!(
                    "Import blocked: {} is outside the allowed directory", specifier
                ));
            }
        }
        // Allow https:// imports (for deno.land, esm.sh, etc.); block every other scheme
        if resolved.scheme() != "file" && resolved.scheme() != "https" {
            return Err(anyhow::anyhow!("Blocked import scheme: {}", resolved.scheme()));
        }
        Ok(resolved)
    }

    fn load(&self, specifier: &deno_core::ModuleSpecifier, _maybe_referrer: Option<&deno_core::ModuleSpecifier>, _is_dynamic: bool) -> ModuleLoadResponse {
        // For file:// — read from disk (already validated in resolve)
        // For https:// — fetch (or block if network disabled)
        // Default implementation delegates to FsModuleLoader for files
        todo!("implement based on scheme")
    }
}
```
Use in runtime creation:
```rust
use std::rc::Rc;

let temp_dir = PathBuf::from(format!("/tmp/madbase_functions/{}", function_name));
let runtime = JsRuntime::new(deno_core::RuntimeOptions {
    module_loader: Some(Rc::new(SandboxedModuleLoader { allowed_dir: temp_dir })),
    ..Default::default()
});
```
### 6.1.2 Pass data safely (fix JS injection)
**File:** `functions/src/deno_runtime.rs` lines 122-156
**Current (vulnerable):**
```rust
let module_code = format!(r#"
const req = new Request("http://localhost", {{
method: "POST",
body: {payload_json}, // INJECTION POINT
headers: {headers_json} // INJECTION POINT
}});
"#);
```
**Fix — double-serialize to create safe JS string literals:**
```rust
// Serialize payload/headers to JSON strings, then JSON-encode THOSE strings
// so they become valid JS string literals
let payload_str = serde_json::to_string(&payload_json)?; // JSON string
let headers_str = serde_json::to_string(&headers_json)?;
let safe_payload = serde_json::to_string(&payload_str)?; // "\"escaped JSON\""
let safe_headers = serde_json::to_string(&headers_str)?;
let module_code = format!(r#"
    const req = new Request("http://localhost", {{
        method: "POST",
        body: JSON.parse({safe_payload}),
        headers: JSON.parse({safe_headers})
    }});
"#);
```
This guarantees the interpolated values are valid JSON string literals that cannot break out of the JS context.
### 6.1.3 Resource limits
Add execution limits:
```rust
// Timeout (already exists at 30s, keep it)
tokio::time::timeout(Duration::from_secs(30), rx).await

// Memory limit — use V8's heap limit
let mut runtime = JsRuntime::new(deno_core::RuntimeOptions {
    create_params: Some(
        v8::CreateParams::default()
            .heap_limits(0, 128 * 1024 * 1024) // 128MB max heap
    ),
    ..Default::default()
});

// Register a near-heap-limit callback so the isolate can be terminated
// instead of letting V8 grow the heap past the limit
let isolate = runtime.v8_isolate();
isolate.add_near_heap_limit_callback(|current, initial| {
    // Terminate the isolate here, then return `current` unchanged so the limit is not raised
    current
});
```
---
## 6.2 — Developer Experience
### 6.2.1 TypeScript support
Deno natively compiles TypeScript. The current setup writes `.js` files — change to `.ts`:
```rust
let temp_path = format!("/tmp/deno_main_{}.ts", uuid::Uuid::new_v4());
```
Deno will transparently compile TypeScript to JavaScript.
### 6.2.2 Support fetch()
The current preamble defines custom `Request`/`Response` classes but doesn't provide `fetch()`. Deno's built-in `fetch` requires network permissions. Since we're using `deno_core` (not the full Deno CLI), we need to either:
1. Add `deno_fetch` extension to the runtime
2. Or implement a minimal `fetch` via `Deno.core.ops`
Option 1 (recommended):
```rust
// Add deno_fetch to the runtime extensions
let mut runtime = JsRuntime::new(deno_core::RuntimeOptions {
    extensions: vec![deno_fetch::deno_fetch::init_ops::<Permissions>(Default::default())],
    ..Default::default()
});
```
This requires adding `deno_fetch` as a dependency and implementing a `Permissions` struct that controls which URLs can be accessed.
### 6.2.3 Environment variables
Pass project-level env vars to the function context:
```rust
// Before executing user code, inject project env vars.
// Double-serialize so the value is a proper JS string literal (same trick as 6.1.2)
// and cannot break out of the script context.
let env_vars = get_project_env_vars(&db, &project_ref).await?;
let env_json = serde_json::to_string(&env_vars)?;
let safe_env = serde_json::to_string(&env_json)?;
runtime.execute_script("<env>", format!("globalThis._env = JSON.parse({});", safe_env))?;
```
This makes `Deno.env.get("MY_VAR")` work (already polyfilled in the preamble).
### 6.2.4 Worker pooling
**Current problem:** Each invocation spawns a new OS thread + tokio runtime. This has ~10ms overhead per invocation and wastes memory.
**Fix:** Pre-warm a pool of worker threads:
```rust
use std::collections::HashMap;
use std::sync::Arc;
use anyhow::Result;
use serde_json::Value;
use tokio::sync::{mpsc, oneshot, Mutex};

pub struct DenoPool {
    sender: mpsc::Sender<DenoTask>,
}

struct DenoTask {
    code: String,
    payload: Option<Value>,
    headers: HashMap<String, String>,
    response: oneshot::Sender<Result<(String, String, u16, HashMap<String, String>)>>,
}

impl DenoPool {
    pub fn new(pool_size: usize) -> Self {
        let (tx, rx) = mpsc::channel(pool_size * 2);
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..pool_size {
            let rx = rx.clone();
            std::thread::spawn(move || {
                let rt = tokio::runtime::Builder::new_current_thread()
                    .enable_all().build().unwrap();
                let local = tokio::task::LocalSet::new();
                local.block_on(&rt, async {
                    loop {
                        let task = rx.lock().await.recv().await;
                        if let Some(task) = task {
                            let result = DenoRuntime::execute_inner(
                                task.code, task.payload, task.headers
                            ).await;
                            let _ = task.response.send(result);
                        } else {
                            break;
                        }
                    }
                });
            });
        }
        Self { sender: tx }
    }

    pub async fn execute(&self, code: String, payload: Option<Value>, headers: HashMap<String, String>)
        -> Result<(String, String, u16, HashMap<String, String>)>
    {
        let (tx, rx) = oneshot::channel();
        self.sender.send(DenoTask { code, payload, headers, response: tx }).await
            .map_err(|_| anyhow::anyhow!("Worker pool exhausted"))?;
        rx.await.map_err(|_| anyhow::anyhow!("Worker panicked"))?
    }
}
```
Initialize in `gateway/src/worker.rs` with `DENO_POOL_SIZE` env var (default: 4).
### 6.2.5 Function deletion
Add route in `functions/src/lib.rs`:
```rust
.route("/:name", get(handlers::get_function)
.post(handlers::invoke_function)
.delete(handlers::delete_function))
```
Handler deletes from DB. The function code is not stored on the filesystem in production.
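A sketch of the handler, assuming a `functions` table keyed by name in the tenant database and the same state/error types the other function handlers use (those names are assumptions):
```rust
pub async fn delete_function(
    State(state): State<FunctionsState>,
    db: Option<Extension<PgPool>>,
    Path(name): Path<String>,
) -> Result<StatusCode, ApiError> {
    // Table name, FunctionsState, and ApiError::NotFound are assumptions; adjust to the functions crate.
    let db = db.map(|Extension(p)| p).unwrap_or(state.db.clone());
    let result = sqlx::query("DELETE FROM functions WHERE name = $1")
        .bind(&name)
        .execute(&db)
        .await?;
    if result.rows_affected() == 0 {
        return Err(ApiError::NotFound(format!("Function '{}' not found", name)));
    }
    Ok(StatusCode::NO_CONTENT)
}
```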
### 6.2.6 Function logs
Capture `console.log` output by intercepting the `Deno.core.print` calls:
```javascript
// In the preamble, collect logs into an array
globalThis.__logs__ = [];
globalThis.console = {
  log: (...args) => {
    const msg = args.map(a => String(a)).join(" ");
    globalThis.__logs__.push({ level: "info", msg, ts: Date.now() });
    Deno.core.print(msg + "\n");
  },
  // ... same for error, warn, debug
};
```
After execution, extract logs:
```rust
let logs_val = runtime.execute_script("<logs>", "JSON.stringify(globalThis.__logs__)")?;
// Deserialize and include in InvokeResponse
```
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New unit tests** are written for every feature in this milestone:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_sandboxed_loader_blocks_etc_passwd` | `functions/src/deno_runtime.rs` | `resolve("/etc/passwd", ...)` returns an error |
| `test_sandboxed_loader_blocks_parent_traversal` | `functions/src/deno_runtime.rs` | `resolve("../../etc/passwd", ...)` returns an error |
| `test_sandboxed_loader_allows_local_import` | `functions/src/deno_runtime.rs` | `resolve("./helper.ts", ...)` within allowed dir succeeds |
| `test_sandboxed_loader_allows_https_import` | `functions/src/deno_runtime.rs` | `resolve("https://deno.land/std/...", ...)` succeeds |
| `test_sandboxed_loader_blocks_ftp` | `functions/src/deno_runtime.rs` | `resolve("ftp://...", ...)` returns an error |
| `test_js_injection_safe_payload` | `functions/src/deno_runtime.rs` | Payload containing `'; process.exit(); '` does not crash the runtime |
| `test_js_injection_safe_headers` | `functions/src/deno_runtime.rs` | Headers containing JS-breaking characters are safely passed |
| `test_memory_limit_enforcement` | `functions/src/deno_runtime.rs` | Function allocating >128MB is terminated with an error |
| `test_timeout_enforcement` | `functions/src/deno_runtime.rs` | Function with `while(true){}` is killed after configured timeout |
| `test_typescript_execution` | `functions/src/deno_runtime.rs` | `.ts` function with type annotations compiles and executes |
| `test_env_vars_accessible` | `functions/src/deno_runtime.rs` | `Deno.env.get('MY_VAR')` returns the configured value |
| `test_fetch_api_available` | `functions/src/deno_runtime.rs` | `fetch('https://...')` resolves inside a function |
| `test_worker_pool_concurrent` | `functions/src/deno_runtime.rs` | 10 concurrent invocations complete without thread exhaustion |
| `test_function_deletion` | `functions/src/handlers.rs` | `DELETE /functions/v1/:name` removes the function and returns 204 |
| `test_console_log_capture` | `functions/src/deno_runtime.rs` | `console.log("hello")` output appears in the invoke response |
### 2. Integration Verification
- [ ] A function cannot `import '/etc/passwd'` — blocked by sandboxed loader
- [ ] A function with `Deno.serve((req) => new Response("hello"))` works end-to-end
- [ ] TypeScript functions compile and execute via the API
- [ ] `fetch('https://httpbin.org/get')` works inside functions
- [ ] Environment variables are accessible via `Deno.env.get()`
- [ ] Function deletion via DELETE endpoint works
- [ ] `console.log` output appears in the invoke response
- [ ] Pool handles 10 concurrent invocations without thread exhaustion
- [ ] Memory limit: a function allocating >128MB is terminated
- [ ] Timeout: a function running >30s is terminated
- [ ] `supabase.functions.invoke('my-func', { body: { key: 'value' } })` — round-trip works
### 3. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] Deno binary is available in the CI environment (or tests that require it are gated)
- [ ] No `unsafe` code in the functions crate unless explicitly justified with a `// SAFETY:` comment

# Milestone 7: CI/CD & Operability
**Goal:** Every commit is validated. Deployments are reproducible and observable.
**Depends on:** M0 (Security), M1 (Foundation)
---
## 7.1 — Rust CI Pipeline
### 7.1.1 Add Rust jobs to CI
**File:** `.github/workflows/ci.yml`
Add a new job before the existing frontend jobs:
```yaml
rust:
  runs-on: ubuntu-latest
  services:
    postgres:
      image: postgres:15
      env:
        POSTGRES_PASSWORD: postgres
      ports:
        - 5432:5432
      options: >-
        --health-cmd pg_isready
        --health-interval 10s
        --health-timeout 5s
        --health-retries 5
  steps:
    - uses: actions/checkout@v4
    - name: Install Rust toolchain
      uses: dtolnay/rust-toolchain@stable
      with:
        components: rustfmt, clippy
    - name: Cache cargo registry and build
      uses: actions/cache@v4
      with:
        path: |
          ~/.cargo/registry
          ~/.cargo/git
          target
        key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
    - name: Check formatting
      run: cargo fmt --all --check
    - name: Run clippy
      run: cargo clippy --workspace -- -D warnings
    - name: Build workspace
      run: cargo build --workspace
    - name: Run tests
      run: cargo test --workspace
      env:
        DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
        JWT_SECRET: test-secret-for-ci-only-not-production
        DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres
    - name: Verify sqlx offline data
      run: cargo sqlx prepare --check --workspace
      env:
        DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```
### 7.1.2 Enable sqlx offline mode
Run locally:
```bash
cargo sqlx prepare --workspace
```
This creates `.sqlx/` directory with query metadata. Check it into git. Add the CI step above to verify it stays in sync.
### 7.1.3 Fix the lint job
**File:** `.github/workflows/ci.yml` line 29
```yaml
# BEFORE
run: npm run lint || true
# AFTER
run: npm run lint
```
### 7.1.4 Pin GitHub Actions
Update all `@v3` to `@v4` throughout the file:
- `actions/checkout@v3``@v4`
- `actions/setup-node@v3``@v4`
- `actions/upload-artifact@v3``@v4`
- `codecov/codecov-action@v3``@v4`
### 7.1.5 Add Docker build job
```yaml
docker:
  runs-on: ubuntu-latest
  needs: rust
  steps:
    - uses: actions/checkout@v4
    - name: Build gateway-runtime
      run: docker build --target gateway-runtime -t madbase/gateway:ci .
    - name: Build worker-runtime
      run: docker build --target worker-runtime -t madbase/worker:ci .
    - name: Build control-runtime
      run: docker build --target control-runtime -t madbase/control:ci .
    - name: Build proxy-runtime
      run: docker build --target proxy-runtime -t madbase/proxy:ci .
```
---
## 7.2 — Docker Improvements
### 7.2.1 Slim runtime images
**File:** `Dockerfile` — all runtime stages
```dockerfile
# BEFORE
FROM rust:latest AS worker-runtime
# AFTER — shared base
FROM debian:bookworm-slim AS runtime-base
RUN apt-get update && apt-get install -y \
ca-certificates libssl3 \
&& rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase
FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
```
### 7.2.2 Create .dockerignore
```
.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist
```
### 7.2.3 Pin image tags
Replace all `:latest` tags:
- `cargo-chef:latest-rust-latest``cargo-chef:0.1.68-rust-1.77`
- `victoriametrics/victoria-metrics:latest``:v1.101.0`
- `grafana/loki:latest``:2.9.6`
- `grafana/grafana:latest``:10.4.2`
- `victoriametrics/vmagent:latest``:v1.101.0`
---
## 7.3 — Observability
### 7.3.1 Create config files
See M1 for `config/prometheus.yml` and `config/vmagent.yml` content.
### 7.3.2 Request correlation IDs
**File:** `gateway/src/proxy.rs``proxy_request` function
```rust
use uuid::Uuid;

// Generate or propagate the request ID
let request_id = req.headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());
// Add to the proxied request
request_builder = request_builder.header("x-request-id", &request_id);
// Add to the response
response_builder = response_builder.header("x-request-id", &request_id);
```
Use `tracing::Span` with the request ID for log correlation:
```rust
let span = tracing::info_span!("request", id = %request_id);
```
### 7.3.3 OpenTelemetry tracing
Add dependencies:
```toml
opentelemetry = "0.22"
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"
```
Initialize in `gateway/src/main.rs`:
```rust
if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic().with_endpoint(otlp_endpoint))
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add to the subscriber registry
}
```
### 7.3.4 Alerting rules
Create `config/alerts.yml` for Grafana alerting or VictoriaMetrics vmalert:
```yaml
groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
```
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] `cargo fmt --all -- --check` passes (no formatting issues)
- [ ] `cargo clippy --workspace -- -D warnings` passes (no warnings)
- [ ] `cargo sqlx prepare --check` passes (offline query data is up to date)
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for CI/operability features:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_request_id_middleware` | `gateway/src/middleware.rs` | Request without `X-Request-Id` gets one generated; request with one keeps it |
| `test_request_id_propagated` | `gateway/src/proxy.rs` | `X-Request-Id` from proxy request appears in upstream headers |
| `test_health_endpoint_worker` | `gateway/src/bin/worker.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_system` | `gateway/src/bin/system.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_proxy` | `gateway/src/bin/proxy.rs` | `GET /health` returns 200 with JSON status |
| `test_docker_build_proxy` | `.github/workflows/ci.yml` | Docker build target `proxy-runtime` succeeds (CI job) |
| `test_docker_build_worker` | `.github/workflows/ci.yml` | Docker build target `worker-runtime` succeeds (CI job) |
| `test_docker_build_control` | `.github/workflows/ci.yml` | Docker build target `control-runtime` succeeds (CI job) |
### 2. CI Pipeline Verification
- [ ] CI passes on a clean PR: `cargo fmt`, `cargo clippy`, `cargo build`, `cargo test` all green
- [ ] `cargo sqlx prepare --check` passes in CI
- [ ] Docker build succeeds for all 4 targets (proxy, worker, control, functions)
- [ ] CI caches Rust build artifacts (via `actions-rust-lang/setup-rust-toolchain` or `Swatinem/rust-cache`)
- [ ] CI runs in under 15 minutes for a clean build
### 3. Docker / Operability Verification
- [ ] Runtime images are under 200MB each (down from ~1.5GB)
- [ ] Containers run as non-root user (`USER madbase`)
- [ ] `docker inspect <image>` shows a `HEALTHCHECK` for each runtime image
- [ ] `.dockerignore` exists and excludes `target/`, `.git/`, `env/`, `_milestones/`, `docs/`
- [ ] All Docker image tags are pinned (no `:latest`)
### 4. Observability Verification
- [ ] `X-Request-Id` header appears in proxy responses
- [ ] Logs contain structured JSON with request IDs (verify via `docker compose logs proxy | jq .`)
- [ ] Prometheus/VictoriaMetrics scrapes metrics from all services
- [ ] Grafana dashboards show request rate, latency p50/p95/p99, error rate
- [ ] Alerting rules fire for: service down >1min, error rate >5%, p99 latency >2s
### 5. CI Gate
- [ ] The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
- [ ] All M0–M6 milestone tests pass in the CI pipeline retroactively

# Milestone 8: High Availability & Scaling
**Goal:** The system survives node failures and handles horizontal scaling.
**Depends on:** M1 (Foundation), M7 (CI/CD)
---
## 8.1 — Database HA
### 8.1.1 Multi-node Patroni
**File:** `autobase-haproxy.cfg``listen primary` block
Add replica backends:
```
listen primary
    bind *:5433
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008

listen replicas
    bind *:5434
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server patroni1 patroni1:5432 maxconn 300 check port 8008
    server patroni2 patroni2:5432 maxconn 300 check port 8008
    server patroni3 patroni3:5432 maxconn 300 check port 8008
```
Update `maxconn` global to `1000`.
### 8.1.2 3-node etcd
Update `docker-compose.pillar-database.yml` to include 3 etcd nodes with proper cluster configuration.
### 8.1.3 Read replica routing
Add `READ_REPLICA_URL` env var. In `data_api/src/handlers.rs`, route SELECT queries to the replica pool:
```rust
let pool = if is_read_only_query {
    state.replica_pool.as_ref().unwrap_or(&state.db)
} else {
    &state.db
};
```
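A sketch of the two supporting pieces: building the optional replica pool at startup, and classifying requests so that only plain `GET`/`HEAD` traffic is treated as replica-safe (function and pool names here are illustrative):
```rust
use sqlx::postgres::PgPoolOptions;

// Sketch: build an optional replica pool from READ_REPLICA_URL at startup.
async fn build_replica_pool() -> Option<sqlx::PgPool> {
    let url = std::env::var("READ_REPLICA_URL").ok()?;
    PgPoolOptions::new()
        .max_connections(20)
        .connect(&url)
        .await
        .ok()
}

// Mutations, upserts, and RPC calls always go to the primary.
fn is_read_only_query(method: &axum::http::Method) -> bool {
    *method == axum::http::Method::GET || *method == axum::http::Method::HEAD
}
```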
### 8.1.4 Redis Sentinel
Replace single Redis with 3-node Sentinel setup. Update `common/src/cache.rs` to use `redis::sentinel::SentinelClient`.
---
## 8.2 — Proxy & Worker Scaling
### 8.2.1 Graceful shutdown
**File:** `gateway/src/main.rs` and all `bin/*.rs`
```rust
let listener = tokio::net::TcpListener::bind(addr).await?;
let server = axum::serve(listener, app.into_make_service());

// Wait for the shutdown signal, then drain in-flight connections
let shutdown = async {
    tokio::signal::ctrl_c().await.ok();
    tracing::info!("Shutdown signal received, draining connections...");
};
server.with_graceful_shutdown(shutdown).await?;
tracing::info!("Server shut down cleanly");
```
### 8.2.2 Dynamic worker discovery
Instead of static `WORKER_UPSTREAM_URLS`, poll the control plane or use Redis pub/sub:
```rust
// Background task in the proxy
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    loop {
        interval.tick().await;
        match discover_workers(&control_url).await {
            Ok(new_workers) => {
                let mut upstreams = state.worker_upstreams.write().await;
                *upstreams = new_workers;
            }
            Err(e) => tracing::warn!("Worker discovery failed: {}", e),
        }
    }
});
```
### 8.2.3 Tenant pool eviction
**File:** `gateway/src/middleware.rs`
Replace `HashMap<String, PgPool>` with a `moka::future::Cache` that has TTL and max size:
```rust
use moka::future::Cache;

pub tenant_pools: Cache<String, PgPool>,

// Initialize with TTL and max entries
Cache::builder()
    .max_capacity(100)
    .time_to_idle(Duration::from_secs(300))
    .build()
```
### 8.2.4 Project config cache TTL
**File:** `gateway/src/worker.rs` line 97 and `middleware.rs`
```rust
// BEFORE
project_cache: moka::future::Cache::new(100),

// AFTER
project_cache: moka::future::Cache::builder()
    .max_capacity(100)
    .time_to_live(Duration::from_secs(60))
    .build(),
```
---
## 8.3 — TLS
### 8.3.1 TLS termination
Two options:
**Option A: External reverse proxy (recommended for simplicity)**
Use Caddy or nginx in front of the proxy pillar. Caddy auto-provisions Let's Encrypt certificates:
```
# Caddyfile
api.example.com {
reverse_proxy proxy:8000
}
```
**Option B: Built-in rustls**
Add `axum-server` with `rustls` feature:
```rust
use axum_server::tls_rustls::RustlsConfig;

let tls_config = RustlsConfig::from_pem_file("cert.pem", "key.pem").await?;
axum_server::bind_rustls(addr, tls_config)
    .serve(app.into_make_service())
    .await?;
```
Document both options. Recommend Option A for most deployments.
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for HA features:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_graceful_shutdown_completes_inflight` | `gateway/src/main.rs` | After SIGTERM, in-flight request completes before exit |
| `test_graceful_shutdown_rejects_new` | `gateway/src/main.rs` | After SIGTERM, new connections are refused |
| `test_dynamic_worker_discovery` | `gateway/src/proxy.rs` | Adding a worker URL to the discovery source → proxy routes to it |
| `test_connection_pool_ttl_eviction` | `gateway/src/proxy.rs` or `common/` | Idle tenant pool is evicted after configured TTL |
| `test_connection_pool_lru_eviction` | `gateway/src/proxy.rs` or `common/` | When max pools exceeded, least-recently-used is evicted |
| `test_project_config_cache_ttl` | `gateway/src/worker.rs` | Stale project config refreshes after TTL (not served forever) |
| `test_read_replica_routing` | `data_api/src/handlers.rs` | SELECT queries route to `READ_REPLICA_URL` when set |
| `test_read_replica_fallback` | `data_api/src/handlers.rs` | When `READ_REPLICA_URL` unset, SELECT uses primary |
| `test_tls_rustls_config` | `gateway/src/main.rs` | `RustlsConfig::from_pem_file` loads certs without error (unit) |
### 2. HA / Chaos Verification
- [ ] Kill one Patroni node → automatic failover within 30s, no request failures
- [ ] Add a new worker node → proxy discovers it within 30s
- [ ] SIGTERM to worker → in-flight requests complete, then process exits cleanly
- [ ] SIGTERM to proxy → drains connections, then exits
- [ ] Tenant pool cache evicts stale entries after configured TTL
- [ ] Project config changes are reflected within 60 seconds without restart
- [ ] Read queries route to replicas when `READ_REPLICA_URL` is set
- [ ] HTTPS works via Caddy or built-in TLS
- [ ] Redis Sentinel failover does not break sessions or cache
### 3. Load Testing
- [ ] Proxy handles 1000 concurrent connections without OOM or thread exhaustion
- [ ] Worker handles 500 req/s with p99 < 500ms for simple queries
- [ ] Connection pool does not leak connections under sustained load (monitor via `pg_stat_activity`)
### 4. CI Gate
- [ ] All unit tests run in `cargo test --workspace`
- [ ] HA integration tests (Patroni failover, Redis Sentinel) are gated behind `#[ignore]` with documentation
- [ ] Load tests are documented as runnable scripts (not in CI, but in `scripts/` or `tests/load/`)

# Milestone 9: Control Plane Consolidation
**Goal:** One control plane, one API, one source of truth for project and infrastructure management.
**Depends on:** M0 (Security), M1 (Foundation), M7 (CI/CD)
---
## 9.1 — Merge the Two Control Planes
### Current state
There are two parallel control plane implementations:
| | In-gateway `control_plane/` | Standalone `control-plane-api/` |
|---|---|---|
| **Binary** | Part of `control` binary | Separate `control-plane-api` binary |
| **Auth** | Admin cookie (broken, fixed in M0) | None |
| **API prefix** | `/platform/v1/*` | `/api/v1/*` |
| **Features** | Project CRUD, user mgmt, key rotation, DB browser | Server provisioning, scaling, health, templates |
| **Database** | Control DB (projects table) | Separate DB (servers, scaling_operations tables) |
| **UI** | `web/admin.html` (Vue) | `control-plane-ui/` (React/MUI) |
### Recommended approach
Merge `control-plane-api` server management into the gateway's control mode:
1. **Move server management routes** from `control-plane-api/src/lib.rs` to `control_plane/src/lib.rs` under `/platform/v1/servers`, `/platform/v1/scaling`, etc.
2. **Move the `ServerManager`** from `control-plane-api/src/server_manager.rs` into a new `control_plane/src/server_manager.rs`.
3. **Move provider code** from `control-plane-api/src/providers/` into `control_plane/src/providers/`.
4. **Consolidate the database schema.** Merge the `control-plane-api/migrations/001_initial.sql` tables (`servers`, `scaling_operations`, `cluster_events`, `server_metrics`) into the main migrations directory.
5. **Deprecate the standalone binary.** Remove `control-plane-api` from `Cargo.toml` workspace members. Keep the React UI if desired, but point it at the consolidated API.
6. **Use the admin auth** (fixed in M0) for all server management routes.
### Migration steps
```bash
# 1. Copy server management code
cp control-plane-api/src/server_manager.rs control_plane/src/
cp -r control-plane-api/src/providers/ control_plane/src/
cp control-plane-api/src/templates.rs control_plane/src/
cp control-plane-api/src/docker.rs control_plane/src/
# 2. Copy and merge migrations
cp control-plane-api/migrations/001_initial.sql migrations/20260320000000_server_management.sql
# 3. Update control_plane/src/lib.rs to add new routes
# 4. Update control_plane/Cargo.toml for new dependencies (reqwest, ssh2, etc.)
# 5. Remove control-plane-api from workspace
```
---
## 9.2 — Fix Server Provisioning
### 9.2.1 Implement provision_server
The current `provision_server` in `server_manager.rs` is a no-op. Wire it up (a hedged sketch follows the list):
1. Call `provider.create_server()` to create the VM
2. Wait for the VM to be reachable via SSH
3. Run bootstrap script (install Docker, pull images, configure services)
4. Register the server with the cluster
5. Update server status to "active"
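A minimal sketch of that sequence, assuming an async provider handle, `wait_for_ssh`/`run_bootstrap_script` helpers, and a `sqlx` pool on the manager; every name and type here is illustrative, not the existing API:
```rust
async fn provision_server(&self, req: CreateServerRequest) -> Result<Server> {
    // 1. Ask the cloud provider for a new VM
    let server = self.provider.create_server(&req).await?;
    // 2. Poll until the VM answers on SSH (assumed helper; give up after ~5 minutes)
    self.wait_for_ssh(&server.ip_address, std::time::Duration::from_secs(300)).await?;
    // 3. Bootstrap: install Docker, pull images, configure services (assumed helper)
    self.run_bootstrap_script(&server).await?;
    // 4 + 5. Register with the cluster and mark the row active
    sqlx::query("UPDATE servers SET status = 'active' WHERE id = $1")
        .bind(&server.id)
        .execute(&self.pool)
        .await?;
    Ok(server)
}
```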
### 9.2.2 Implement remove_server
1. Drain the server (remove from load balancer, wait for in-flight requests)
2. Stop services
3. Call `provider.delete_server()` to destroy the VM
4. Remove the server record from the database (a matching sketch follows this list)
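A drain-and-destroy sketch under the same illustrative names as the `provision_server` sketch above:
```rust
async fn remove_server(&self, server_id: &str) -> Result<()> {
    // 1. Drain: take the node out of rotation, then wait for in-flight work
    self.remove_from_load_balancer(server_id).await?;
    self.wait_for_drain(server_id, std::time::Duration::from_secs(120)).await?;
    // 2 + 3. Stop services, then destroy the VM at the provider
    self.stop_services(server_id).await?;
    self.provider.delete_server(server_id).await?;
    // 4. Delete the row only after the VM is actually gone
    sqlx::query("DELETE FROM servers WHERE id = $1")
        .bind(server_id)
        .execute(&self.pool)
        .await?;
    Ok(())
}
```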
### 9.2.3 Fix SQL parameter binding
**File:** `server_manager.rs` — search for `$2` and verify each query has matching `.bind()` calls. The known bugs (a corrected example follows the list):
- Line ~595: `WHERE id = $2` with only one `.bind(operation_id)` → should be `$1`
- Line ~610: Same issue
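The fix is to make the placeholder index match the single bound parameter. The query text below is illustrative; only the `$2` → `$1` change matters:
```rust
// BEFORE: the one .bind() fills $1, so $2 is never supplied and the query fails
sqlx::query("UPDATE scaling_operations SET status = 'completed' WHERE id = $2")
    .bind(operation_id)
    .execute(&self.pool)
    .await?;

// AFTER: placeholder index matches the single .bind() call
sqlx::query("UPDATE scaling_operations SET status = 'completed' WHERE id = $1")
    .bind(operation_id)
    .execute(&self.pool)
    .await?;
```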
### 9.2.4 Real health data
Replace hardcoded `cluster_health()` and `get_pillar_stats()` with queries to VictoriaMetrics:
```rust
async fn get_pillar_stats(&self) -> Result<PillarStats> {
    let vm_url = std::env::var("VICTORIA_METRICS_URL")?;
    let client = reqwest::Client::new();
    let cpu_query = format!("{}/api/v1/query?query=avg(rate(process_cpu_seconds_total[5m]))", vm_url);
    // Prometheus instant-query JSON: data.result[0].value = [timestamp, "<value>"]
    let resp: serde_json::Value = client.get(&cpu_query).send().await?.json().await?;
    let cpu = resp["data"]["result"][0]["value"][1]
        .as_str()
        .and_then(|v| v.parse::<f64>().ok())
        .unwrap_or(0.0);
    // Repeat for memory/disk; the PillarStats field names here are illustrative
    Ok(PillarStats { cpu_usage: cpu, ..Default::default() })
}
```
---
## 9.3 — Multi-Provider
### 9.3.1 DigitalOcean provider
**File:** `control_plane/src/providers/digitalocean.rs`
Implement it against the DigitalOcean API v2 (a hedged `create_server` sketch follows the list):
- `create_server`: POST /v2/droplets
- `delete_server`: DELETE /v2/droplets/{id}
- `get_server`: GET /v2/droplets/{id}
- `list_servers`: GET /v2/droplets
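A hedged sketch of `create_server` against that API; the `DIGITALOCEAN_TOKEN` variable, the `DropletInfo` return type, and the default image are assumptions, not existing code:
```rust
async fn create_server(&self, name: &str, region: &str, size: &str) -> Result<DropletInfo> {
    let token = std::env::var("DIGITALOCEAN_TOKEN")?;
    let body = serde_json::json!({
        "name": name,        // e.g. "worker-3"
        "region": region,    // e.g. "fra1"
        "size": size,        // e.g. "s-2vcpu-4gb"
        "image": "ubuntu-22-04-x64"
    });
    let resp = reqwest::Client::new()
        .post("https://api.digitalocean.com/v2/droplets")
        .bearer_auth(token)
        .json(&body)
        .send()
        .await?
        .error_for_status()?;
    // The new droplet comes back under the "droplet" key
    let droplet: serde_json::Value = resp.json().await?;
    Ok(DropletInfo { id: droplet["droplet"]["id"].to_string(), name: name.to_string() })
}
```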
### 9.3.2 Fix Hetzner plan validation
**File:** `control_plane/src/providers/mod.rs``validate_plan` (line ~134)
Correct the RAM mapping (a sketch of the corrected lookup follows the list):
- CX11: 2GB (not 4GB)
- CX21: 4GB (not 8GB)
- CX31: 8GB
- CX41: 16GB
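A minimal sketch of the corrected lookup; the function name and `Option` return shape are illustrative:
```rust
/// RAM in GB for the supported Hetzner CX plans; None rejects unknown plans.
fn plan_ram_gb(plan: &str) -> Option<u32> {
    match plan {
        "cx11" => Some(2),
        "cx21" => Some(4),
        "cx31" => Some(8),
        "cx41" => Some(16),
        _ => None,
    }
}
```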
### 9.3.3 Add pagination to Hetzner list_servers
The Hetzner API paginates list responses (25 results per page by default, up to 50 via `per_page`). Implement pagination:
```rust
let mut all_servers = Vec::new();
let mut page = 1;
loop {
    // `api_token` is the Hetzner API token already held by the provider
    let resp = client
        .get(&format!("{}/servers?page={}&per_page=50", api_url, page))
        .bearer_auth(&api_token)
        .send()
        .await?;
    let page_data: HetznerListResponse = resp.json().await?;
    all_servers.extend(page_data.servers);
    if page_data.meta.pagination.next_page.is_none() { break; }
    page += 1;
}
```
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for the consolidated control plane:
| Test | Location | What it validates |
|------|----------|-------------------|
| `test_list_servers` | `control_plane/src/server_manager.rs` | `GET /platform/v1/servers` returns server list |
| `test_create_server_hetzner` | `control_plane/src/providers/hetzner.rs` | `provision_server` sends correct API payload (mock HTTP) |
| `test_delete_server_hetzner` | `control_plane/src/providers/hetzner.rs` | `remove_server` sends DELETE to correct API endpoint (mock HTTP) |
| `test_create_server_digitalocean` | `control_plane/src/providers/digitalocean.rs` | `provision_server` sends correct Droplet payload (mock HTTP) |
| `test_hetzner_plan_validation` | `control_plane/src/providers/hetzner.rs` | CX11=2GB, CX21=4GB, CX31=8GB — correct RAM mapping |
| `test_hetzner_pagination` | `control_plane/src/providers/hetzner.rs` | `list_servers` paginates through multiple pages |
| `test_cluster_health_real_metrics` | `control_plane/src/lib.rs` | Health endpoint queries VictoriaMetrics (mock) and returns real CPU/mem |
| `test_sql_parameter_binding` | `control_plane/src/lib.rs` | All queries use `$1` binding, not string interpolation |
| `test_admin_auth_on_server_routes` | `control_plane/src/lib.rs` | `GET /platform/v1/servers` without admin auth returns 401 |
| `test_old_control_plane_api_removed` | workspace | `control-plane-api` is not in `Cargo.toml` workspace members |
### 2. Integration Verification
- [ ] All `/platform/v1/*` routes work through the consolidated control plane
- [ ] Server provisioning creates a real Hetzner VM (integration test with API key)
- [ ] Server removal destroys the VM
- [ ] Cluster health returns real CPU/memory metrics (not hardcoded)
- [ ] The old `control-plane-api` binary is no longer needed and has been removed from the workspace
- [ ] Admin auth protects all server management routes
- [ ] Scaling operations are recorded in the `scaling_operations` table
### 3. CI Gate
- [ ] All unit tests (with mocked HTTP) run in `cargo test --workspace`
- [ ] Integration tests against real cloud providers are gated behind `#[ignore]` and require `HETZNER_API_TOKEN` / `DO_API_TOKEN` env vars (see the gating pattern sketched below)
- [ ] `cargo build --workspace` succeeds without the old `control-plane-api` crate
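For reference, one way to gate a live-provider test; the provider type, method signatures, and plan/image names are illustrative:
```rust
/// Live test against the real Hetzner API. Skipped by default; run with:
///   HETZNER_API_TOKEN=... cargo test -- --ignored test_create_server_hetzner_live
#[tokio::test]
#[ignore = "requires HETZNER_API_TOKEN and creates real cloud resources"]
async fn test_create_server_hetzner_live() {
    let token = std::env::var("HETZNER_API_TOKEN")
        .expect("set HETZNER_API_TOKEN to run this test");
    let provider = HetznerProvider::new(token);
    let server = provider.create_server("ci-smoke-test", "cx21", "ubuntu-22.04").await.unwrap();
    provider.delete_server(&server.id).await.unwrap();
}
```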