# Development Plan: Gateway

## Overview

This plan breaks down the Gateway implementation into milestones ordered by dependency. Each milestone includes:

- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones

**Development Approach:**

1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. Do not start the next milestone until the current milestone’s tests are passing (green)
4. Mark tasks complete with `[x]` as you progress

---

## Milestone 1: Project Foundation

**Goal:** Create the Gateway service as a Rust project aligned with existing node conventions (Axum + Tokio + tracing + Prometheus metrics).

### Tasks

- [x] **1.1** Initialize Cargo project
  - Create `src/lib.rs` and `src/main.rs`
  - Establish module layout for: http, grpc, authn, authz, routing, upstream, observability, config, storage
- [x] **1.2** Choose and wire core dependencies (aligned with existing services)
  - HTTP: `axum`
  - gRPC: `tonic`
  - Runtime: `tokio`
  - Serialization: `serde`, `serde_json`
  - Errors: `thiserror`, `anyhow`
  - Telemetry: `tracing`, `metrics-exporter-prometheus` or existing metrics pattern in the codebase
- [x] **1.3** Add baseline runtime endpoints
  - `GET /health`, `GET /ready`, `GET /metrics`
  - Structured logs with request id propagation

### Tests

- [x] **T1.1** Project compiles
- [x] **T1.2** `GET /health` returns 200
- [x] **T1.3** Tautological test: core state types are Send + Sync

---

## Milestone 2: Persistent State (Auth + RBAC + Sessions) for HA

**Goal:** Define where Gateway state lives so the service can run as **HA (max 2 replicas)** without sticky sessions and without losing auth/admin consistency.
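
Running without sticky sessions means either replica may receive a given refresh request, so the "was this token already used?" decision has to be made by the shared store, not by process-local state. The following is a minimal sketch of that compare-and-set idea. It assumes a hypothetical in-memory `SessionStore` standing in for the real backing store: the type names, method names, and the use of a mutex to play the role of the store's compare-and-set are all illustrative assumptions, not the Gateway's actual storage client.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// A stored refresh-token record. It is keyed by the token's hash,
/// so the raw token never appears in the store.
#[derive(Clone, Debug, PartialEq)]
pub struct SessionRecord {
    pub user_id: String,
}

#[derive(Debug, PartialEq)]
pub enum RotateError {
    /// The presented token hash is unknown or was already rotated.
    InvalidToken,
}

/// Hypothetical in-memory stand-in for a shared KV bucket.
#[derive(Default)]
pub struct SessionStore {
    records: Mutex<HashMap<String, SessionRecord>>,
}

impl SessionStore {
    /// Record a freshly issued refresh token (by hash) for a user.
    pub fn insert(&self, token_hash: &str, user_id: &str) {
        self.records.lock().unwrap().insert(
            token_hash.to_string(),
            SessionRecord { user_id: user_id.to_string() },
        );
    }

    /// Consume `old_hash` and store `new_hash` for the same user.
    /// The removal and the insert happen under one lock, which stands in
    /// for a compare-and-set on a real store: only one caller can win.
    pub fn rotate(&self, old_hash: &str, new_hash: &str) -> Result<String, RotateError> {
        let mut records = self.records.lock().unwrap();
        let old = records.remove(old_hash).ok_or(RotateError::InvalidToken)?;
        let user_id = old.user_id.clone();
        records.insert(new_hash.to_string(), old);
        Ok(user_id)
    }
}
```

With a real shared store the lock would be replaced by whatever conditional-write primitive the store offers; the property to preserve is the one T2.2 checks, that a second use of the same refresh token fails even under concurrency.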
### Dependencies

- Milestone 1 (project foundation)

### Tasks

- [x] **2.1** Choose and implement the backing store for identity + authorization state
  - Recommended default for platform alignment: NATS JetStream KV buckets for:
    - users
    - identities (OIDC links)
    - password credential records (hash only)
    - refresh token/session records (hash only, revocable, rotating)
    - MFA enrollments + recovery codes (hash only)
    - rights/roles/assignments
    - audit log index (append-only model)
- [x] **2.2** Define storage schema + versioning
  - Key naming conventions
  - JSON shapes (forward compatible)
  - Migration strategy for schema changes
- [x] **2.3** Implement storage client abstraction
  - CRUD primitives with compare-and-set semantics where needed (e.g., refresh rotation)
  - Pagination/scan strategy for admin listing endpoints
  - Consistent error mapping for storage failures

### Tests

- [x] **T2.1** Sensitive values are stored only as hashes (reset tokens, refresh tokens, recovery codes)
- [x] **T2.2** Refresh token rotation is atomic (cannot be used twice under concurrency)
- [x] **T2.3** Tautological test: storage client is Send + Sync

---

## Milestone 3: Routing Config + Service Discovery

**Goal:** Implement the routing layer described in [prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md), supporting independent placement per service kind and hot reload.
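
One way to get hot reload without partial reads is to treat each routing config as an immutable snapshot and swap a shared pointer to it. The sketch below illustrates that approach under stated assumptions: `Router`, `RoutingTable`, `ServiceKind`, and `RoutingError` are hypothetical names, the placement and shard maps are simplified, and rejecting a reload whose revision is not newer is only one possible reading of the last-known-good semantics in task 3.1.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

/// Typed routing failures, so callers can map them consistently.
#[derive(Debug, PartialEq)]
pub enum RoutingError {
    UnknownTenant { tenant_id: String, kind: ServiceKind },
    UnknownShard { shard_id: u32, kind: ServiceKind },
}

/// One immutable snapshot of the routing config.
#[derive(Debug, Default)]
pub struct RoutingTable {
    pub revision: u64,
    /// (tenant_id, kind) -> shard_id
    pub placement: HashMap<(String, ServiceKind), u32>,
    /// (kind, shard_id) -> endpoint
    pub shards: HashMap<(ServiceKind, u32), String>,
}

pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(table: RoutingTable) -> Self {
        Self { current: RwLock::new(Arc::new(table)) }
    }

    /// Readers clone the Arc and keep a consistent view for the whole request.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        self.current.read().unwrap().clone()
    }

    /// Swap in a new table as a whole. Returns false (keeping the current
    /// table) when the candidate revision is not newer.
    pub fn reload(&self, next: RoutingTable) -> bool {
        let mut current = self.current.write().unwrap();
        if next.revision <= current.revision {
            return false;
        }
        *current = Arc::new(next);
        true
    }

    /// Resolve `(tenant_id, service_kind)` to an endpoint from one snapshot.
    pub fn resolve(&self, tenant_id: &str, kind: ServiceKind) -> Result<String, RoutingError> {
        let table = self.snapshot();
        let shard_id = *table
            .placement
            .get(&(tenant_id.to_string(), kind))
            .ok_or_else(|| RoutingError::UnknownTenant { tenant_id: tenant_id.to_string(), kind })?;
        table
            .shards
            .get(&(kind, shard_id))
            .cloned()
            .ok_or(RoutingError::UnknownShard { shard_id, kind })
    }
}
```

Because `resolve` reads placement and shard data from the same snapshot, a reload that lands mid-request cannot mix an old placement with a new shard directory.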
### Dependencies

- Milestone 1 (project foundation)

### Exit Criteria

- All Milestone 3 tests pass

### Tasks

- [x] **3.1** Define routing config model (in-memory)
  - Placement maps per service kind: `aggregate_placement`, `projection_placement`, `runner_placement`
  - Shard directory per service kind: `*_shards[shard_id] -> endpoint(s)`
  - Revision tracking and last-known-good semantics
- [x] **3.2** Implement config sources
  - Static file config for local development
  - NATS JetStream KV watcher for production
- [x] **3.3** Implement routing decision API
  - `(tenant_id, service_kind) -> selected endpoint`
  - Admin introspection: `GET /admin/routing`
- [x] **3.4** Implement config reload semantics
  - `POST /admin/routing/reload` to force refresh
  - Watcher-based reload that updates atomically

### Tests

- [x] **T3.1** Routing resolves endpoints for `(tenant_id, service_kind)` correctly
- [x] **T3.2** Hot reload swaps routing tables atomically (no partial reads)
- [x] **T3.3** Unknown tenant returns a consistent, typed routing error

---

## Milestone 4: AuthN Core (Tokens, Passwords, OIDC, MFA)

**Goal:** Implement the authentication layer and the public AuthN HTTP APIs described in the PRD: signup/signin/signout/refresh/forgot/reset and MFA primitives.
### Dependencies

- Milestone 1 (project foundation)
- Milestone 2 (persistent state)

### Exit Criteria

- All Milestone 4 tests pass

### Tasks

- [x] **4.1** Implement token model
  - Access token (short-lived)
  - Refresh token (rotating, revocable)
  - Key rotation for signing keys
- [x] **4.2** Implement password flows
  - `POST /v1/auth/signup`, `POST /v1/auth/signin`, `POST /v1/auth/signout`, `POST /v1/auth/refresh`
  - Forgot/reset: `POST /v1/auth/forgot`, `POST /v1/auth/reset`
- [x] **4.3** Implement Google OIDC integration points
  - `POST /v1/auth/oidc/google/start`
  - `GET /v1/auth/oidc/google/callback`
  - Account linking rules
- [x] **4.4** Implement MFA (TOTP) primitives
  - Enrollment start/confirm
  - Challenge and verification
  - Recovery codes
- [x] **4.5** Abuse protections
  - Rate limits for signin/forgot/reset
  - Generic “account not found” responses where appropriate

### Tests

- [x] **T4.1** Password hashing/verification works (Argon2id)
- [x] **T4.2** Refresh token rotation: old refresh token is invalid after use
- [x] **T4.3** Forgot/reset tokens are one-time and expire
- [x] **T4.4** MFA TOTP enrollment and challenge succeed for valid codes and fail for invalid

---

## Milestone 5: AuthZ (RBAC) + Tenant Enforcement

**Goal:** Enforce authorization decisions at the Gateway boundary, including tenant selection rules for `x-tenant-id`.
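
To make the boundary check concrete, here is a minimal sketch of a decision function that combines the `x-tenant-id` rules with a role-derived permission lookup. It rests on several assumptions: `Principal`, `Decision`, and `authorize` are hypothetical names, role assignments are flattened to a per-tenant set of allowed actions, and the order of checks (authentication, then header, then membership, then action) is an illustrative choice that this plan does not specify.

```rust
use std::collections::{HashMap, HashSet};

/// Outcome of an authorization check: allow, or deny with a reason and
/// the HTTP status the error envelope would carry.
#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny { status: u16, reason: &'static str },
}

/// Hypothetical flattened view of a principal's role assignments:
/// tenant_id -> set of actions granted in that tenant.
pub struct Principal {
    pub id: String,
    pub grants: HashMap<String, HashSet<String>>,
}

pub fn authorize(
    principal: Option<&Principal>,
    tenant_header: Option<&str>,
    action: &str,
) -> Decision {
    // No authenticated principal: 401.
    let principal = match principal {
        Some(p) => p,
        None => return Decision::Deny { status: 401, reason: "unauthenticated" },
    };
    // Tenant-scoped route without a usable x-tenant-id: 400.
    let tenant_id = match tenant_header.map(str::trim) {
        Some(t) if !t.is_empty() => t,
        _ => return Decision::Deny { status: 400, reason: "missing x-tenant-id" },
    };
    match principal.grants.get(tenant_id) {
        // Header names a tenant the principal does not belong to: 403.
        None => Decision::Deny { status: 403, reason: "not a member of tenant" },
        Some(actions) if actions.contains(action) => Decision::Allow,
        Some(_) => Decision::Deny { status: 403, reason: "action not granted" },
    }
}
```

The membership branch is what rejects tenant spoofing (T5.1): the header value is never trusted on its own, only in combination with the principal's assignments.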
### Dependencies

- Milestone 4 (authn)

### Exit Criteria

- All Milestone 5 tests pass

### Tasks

- [x] **5.1** Define RBAC model
  - Rights (permissions), roles, assignments (principal ↔ tenant ↔ role)
  - Platform admin vs tenant admin vs tenant member scoping rules
- [x] **5.2** Implement authorization engine
  - Inputs: principal, tenant_id, action, resource attributes (aggregate_type, view_type)
  - Outputs: allow/deny with reason
- [x] **5.3** Enforce `x-tenant-id` rules
  - Required on tenant-scoped endpoints
  - Validated format and tenant membership checks
- [x] **5.4** Add consistent error envelope mapping (401/403/400)

### Tests

- [x] **T5.1** Tenant spoofing is rejected (principal lacks membership)
- [x] **T5.2** Role assignment enables expected actions and denies others
- [x] **T5.3** Missing `x-tenant-id` on tenant routes returns 400

---

## Milestone 6: Upstream Proxying (Aggregate / Projection / Runner)

**Goal:** Route authenticated and authorized requests to the node services.

### Dependencies

- Milestone 3 (routing)
- Milestone 5 (authz)

### Exit Criteria

- All Milestone 6 tests pass

### Tasks

- [x] **6.1** Aggregate submit command proxy
  - gRPC server implementing `aggregate.gateway.v1.CommandService/SubmitCommand`
  - HTTP wrapper `POST /v1/commands/{aggregate_type}/{aggregate_id}`
  - Propagate `x-tenant-id` and correlation metadata
  - Ensure safe retry semantics using `command_id` idempotency
- [x] **6.2** Projection query proxy
  - `POST /v1/query/{view_type}` forwarding to Projection query endpoint once available
- [x] **6.3** Runner admin passthrough (admin-only)
  - `/admin/runner/*` forwarding with strict authorization

### Tests

- [x] **T6.1** gRPC SubmitCommand forwards tenant metadata and returns upstream events
- [x] **T6.2** HTTP command endpoint returns the same shape as gRPC response
- [x] **T6.3** Query endpoint enforces tenant scoping and denies unauthorized callers

---

## Milestone 7: Admin IAM APIs (Users, Roles, Rights)

**Goal:** Expose the admin IAM endpoints for the Admin UI node to manage authn/authz data.

### Dependencies

- Milestone 4 (authn)
- Milestone 5 (authz)

### Exit Criteria

- All Milestone 7 tests pass

### Tasks

- [x] **7.1** Implement admin IAM endpoints
  - Users CRUD and disable/delete
  - Identities link/unlink (OIDC), manage password credentials
  - Rights CRUD, roles CRUD, role↔rights management
  - Assignments CRUD (principal ↔ tenant ↔ role)
  - Service accounts credential create/rotate and tenant role assignment
  - MFA admin actions (reset MFA, revoke recovery codes)
  - Session revocation for user (global signout)
- [x] **7.2** Implement audit trail for admin IAM actions
  - Immutable record of actor, action, target, tenant scope, timestamp, request metadata

### Tests

- [x] **T7.1** Only platform/tenant admins can access relevant endpoints
- [x] **T7.2** All admin mutations emit an audit record
- [x] **T7.3** Assignment changes immediately affect authorization decisions

---

## Milestone 8: Rebalancing Operations (Control Plane Hooks)

**Goal:** Provide the pieces needed to support tenant rebalancing as described in the PRD: visibility, readiness gates, and safe cutover support.
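
A readiness gate can be read as a pure function from the target shard's reported signals to a proceed-or-block decision. The sketch below shows one possible shape; the struct fields, the `max_projection_lag` threshold, and the blocker messages are hypothetical, since the actual signals are defined by the upstream services and the PRD.

```rust
/// Signals reported for the target shard of a tenant move (hypothetical shape).
#[derive(Debug, Clone)]
pub struct TargetSignals {
    /// How far the projection is behind, in events.
    pub projection_lag: u64,
    /// Runner has drained the tenant and its checkpoint is stable.
    pub runner_drained: bool,
    /// Aggregate has drained the tenant and its state is available.
    pub aggregate_state_available: bool,
}

#[derive(Debug, PartialEq)]
pub enum CutoverDecision {
    Proceed,
    /// Cutover is refused; each entry names one gate that is not ready.
    Blocked(Vec<&'static str>),
}

/// Evaluate every gate and collect all blockers, so an operator sees the
/// full picture instead of only the first failing gate.
pub fn evaluate_cutover(signals: &TargetSignals, max_projection_lag: u64) -> CutoverDecision {
    let mut blockers = Vec::new();
    if signals.projection_lag > max_projection_lag {
        blockers.push("projection still catching up");
    }
    if !signals.runner_drained {
        blockers.push("runner not drained");
    }
    if !signals.aggregate_state_available {
        blockers.push("aggregate state not available");
    }
    if blockers.is_empty() {
        CutoverDecision::Proceed
    } else {
        CutoverDecision::Blocked(blockers)
    }
}
```

Keeping the evaluation free of I/O makes the guardrail in T8.2 straightforward to unit test: feed in not-ready signals and assert that the decision is `Blocked`.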
### Dependencies

- Milestone 3 (routing config)
- Milestone 6 (upstream proxying)

### Exit Criteria

- All Milestone 8 tests pass

### Tasks

- [x] **8.1** Expose placement introspection and status
  - Current placement revision per service kind
  - Effective routing decisions for a given tenant (admin-only)
- [x] **8.2** Define and implement readiness gates used by rebalancer
  - Projection: warmup/catchup signal (lag)
  - Runner: tenant drained / checkpoint stable signal
  - Aggregate: tenant drain and state availability signal (as defined by upstream changes)
- [x] **8.3** Add operator-facing rebalancing endpoints (optional if a separate rebalancer service exists)
  - Plan/apply/rollback APIs with strong authorization

### Tests

- [x] **T8.1** Placement revision changes are visible immediately and atomically
- [x] **T8.2** Rebalancing guardrails prevent cutover when target shard is not ready

---

## Milestone 9: Docker Swarm Deployment + HA (Max 2 Replicas)

**Goal:** Define and validate the Docker Swarm architecture for Gateway, including HA behavior with at most **2 Gateway replicas**.
### Dependencies

- Milestone 1 (health/ready/metrics)
- Milestone 2 (persistent state suitable for HA)
- Milestone 6 (proxying) for end-to-end smoke tests

### Exit Criteria

- All Milestone 9 tests pass
- The platform stack (`swarm/stacks/platform.yml`) can deploy the Gateway with `replicas: 2` and serve traffic during rolling updates

### Tasks

- [x] **9.1** Build container image
  - Dockerfile, multi-stage build, minimal runtime image
  - Embed build metadata (version, git sha)
- [x] **9.2** Define Swarm service topology (2 nodes max)
  - `gateway` service with `deploy.replicas: 2`
  - Healthcheck based on `/ready`
  - Rolling update strategy (start-first), rollback policy on failure
  - Network: overlay network for internal traffic to NATS and nodes
- [x] **9.3** Define ingress and TLS termination strategy
  - Swarm routing mesh or an ingress proxy (document choice in stack)
  - Ensure HTTP and gRPC can be routed correctly
- [x] **9.4** Secrets and config distribution
  - OIDC client secrets, JWT signing keys (rotation-ready), NATS credentials
  - Use Swarm secrets/configs instead of environment variables for secrets where possible
- [x] **9.5** HA behavior validation
  - Run two replicas and ensure:
    - refresh token rotation works across replicas (no stickiness)
    - admin IAM updates are visible from both replicas
    - in-flight requests survive a single replica restart

### Tests

- [x] **T9.1** `swarm/stacks/platform.yml` parses as valid YAML
- [x] **T9.2** Smoke: deploy 2 replicas and confirm `/ready` is healthy on both
- [x] **T9.3** Rolling update does not drop readiness below 1 available replica
- [x] **T9.4** Auth session/refresh works across replicas (no sticky sessions required)

---

## Milestone 10: Observability + Hardening

**Goal:** Make the Gateway production-ready with robust telemetry and safety defaults.
### Dependencies

- Milestone 6 (proxying)

### Exit Criteria

- All Milestone 10 tests pass

### Tasks

- [x] **10.1** Structured logs with correlation
  - `request_id`, `trace_id`, principal id, tenant id (when present), upstream target
- [x] **10.2** Metrics
  - Request counts/latency, auth failures, upstream errors, routing misses, rate limit blocks
- [x] **10.3** Security hardening
  - CSRF protections for cookie-based flows
  - JWT key rotation strategy and config
  - mTLS/service auth boundary for internal upstreams
- [x] **10.4** Load and failure testing strategy
  - Soak tests for routing reload + auth endpoints
  - Backpressure/timeouts/circuit breaker verification
- [ ] **10.5** Correlation and trace context propagation (Gateway as source of truth)
  - Accept inbound `x-correlation-id` and `traceparent` on HTTP and gRPC requests
  - If missing, generate `x-correlation-id` at the start of request handling and start a new trace
  - Echo `x-correlation-id` (and `traceparent` when applicable) on responses
  - Propagate `x-correlation-id` and `traceparent` to upstream nodes (Aggregate/Projection/Runner) and record them in request spans/log fields

### Tests

- [x] **T10.1** Metrics include expected labels and counters increment correctly
- [x] **T10.2** Secrets never appear in logs in representative error cases
- [x] **T10.3** Rate limits trigger under abusive patterns
- [ ] **T10.4** Gateway generates `x-correlation-id` when missing and echoes it on responses
- [ ] **T10.5** Gateway propagates `x-correlation-id` and `traceparent` to upstream calls and includes them in logs/spans
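
For the still-open task 10.5, the accept-or-generate rule can be sketched independently of any HTTP framework. The example below assumes inbound headers are available as a map with lowercased keys. `RequestContext` and its methods are hypothetical names, and the id generator (timestamp plus a process-local counter) is a placeholder, where a real service would more likely use a UUID library. Starting a new trace when `traceparent` is missing is deliberately left out of the sketch.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

pub const CORRELATION_HEADER: &str = "x-correlation-id";
pub const TRACEPARENT_HEADER: &str = "traceparent";

static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Placeholder id generator: hex timestamp plus a process-local sequence.
fn generate_correlation_id() -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = COUNTER.fetch_add(1, Ordering::Relaxed);
    format!("{:x}-{:x}", nanos, seq)
}

/// Correlation data resolved once at the start of request handling.
pub struct RequestContext {
    pub correlation_id: String,
    pub traceparent: Option<String>,
}

impl RequestContext {
    /// Accept inbound values when present; otherwise generate a correlation id.
    /// Header keys are assumed to be lowercased already.
    pub fn from_inbound(headers: &HashMap<String, String>) -> Self {
        let correlation_id = headers
            .get(CORRELATION_HEADER)
            .map(|v| v.trim())
            .filter(|v| !v.is_empty())
            .map(str::to_string)
            .unwrap_or_else(generate_correlation_id);
        let traceparent = headers.get(TRACEPARENT_HEADER).cloned();
        Self { correlation_id, traceparent }
    }

    /// Headers to attach both to the upstream call and to the response.
    pub fn outbound_headers(&self) -> Vec<(&'static str, String)> {
        let mut out = vec![(CORRELATION_HEADER, self.correlation_id.clone())];
        if let Some(tp) = &self.traceparent {
            out.push((TRACEPARENT_HEADER, tp.clone()));
        }
        out
    }
}
```

Resolving the context once and reusing `outbound_headers` for both directions keeps the echoed value and the propagated value identical, which is what T10.4 and T10.5 check.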