# Development Plan: Gateway

## Overview

This plan breaks down the Gateway implementation into milestones ordered by dependency. Each milestone includes:
- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones

**Development Approach:**
1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. Do not start the next milestone until the current milestone’s tests are passing (green)
4. Mark tasks complete with `[x]` as you progress

---

## Milestone 1: Project Foundation

**Goal:** Create the Gateway service as a Rust project aligned with existing node conventions (Axum + Tokio + tracing + Prometheus metrics).

### Tasks
- [x] **1.1** Initialize Cargo project
  - Create `src/lib.rs` and `src/main.rs`
  - Establish module layout for: http, grpc, authn, authz, routing, upstream, observability, config, storage
- [x] **1.2** Choose and wire core dependencies (aligned with existing services)
  - HTTP: `axum`
  - gRPC: `tonic`
  - Runtime: `tokio`
  - Serialization: `serde`, `serde_json`
  - Errors: `thiserror`, `anyhow`
  - Telemetry: `tracing`, `metrics-exporter-prometheus` or existing metrics pattern in the codebase
- [x] **1.3** Add baseline runtime endpoints
  - `GET /health`, `GET /ready`, `GET /metrics`
  - Structured logs with request id propagation

### Tests
- [x] **T1.1** Project compiles
- [x] **T1.2** `GET /health` returns 200
- [x] **T1.3** Tautological test: core state types are Send + Sync
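
The Send + Sync check in T1.3 can be expressed as a compile-time assertion. The sketch below is a minimal illustration, assuming a hypothetical `AppState` type; the real core state types are not shown in this plan.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared state handed to request handlers (illustrative only).
#[derive(Clone, Default)]
pub struct AppState {
    /// tenant_id -> upstream endpoint; a placeholder for the real routing table.
    pub routes: Arc<RwLock<HashMap<String, String>>>,
}

/// Compiles only when `T` is both `Send` and `Sync`; does nothing at runtime.
pub fn assert_send_sync<T: Send + Sync>() {}

/// The "test" is the fact that this function compiles at all.
pub fn core_state_is_send_sync() -> bool {
    assert_send_sync::<AppState>();
    true
}
```

If a field that is not thread-safe (for example an `Rc`) were added to `AppState`, the build would fail instead of a test failing at runtime.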

---

## Milestone 2: Persistent State (Auth + RBAC + Sessions) for HA

**Goal:** Define where Gateway state lives so the service can run as **HA (max 2 replicas)** without sticky sessions and without losing auth/admin consistency.

### Dependencies
- Milestone 1 (project foundation)

### Tasks
- [x] **2.1** Choose and implement the backing store for identity + authorization state
  - Recommended default for platform alignment: NATS JetStream KV buckets for:
    - users
    - identities (OIDC links)
    - password credential records (hash only)
    - refresh token/session records (hash only, revocable, rotating)
    - MFA enrollments + recovery codes (hash only)
    - rights/roles/assignments
    - audit log index (append-only model)
- [x] **2.2** Define storage schema + versioning
  - Key naming conventions
  - JSON shapes (forward compatible)
  - Migration strategy for schema changes
- [x] **2.3** Implement storage client abstraction
  - CRUD primitives with compare-and-set semantics where needed (e.g., refresh rotation)
  - Pagination/scan strategy for admin listing endpoints
  - Consistent error mapping for storage failures
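
The compare-and-set primitive in 2.3 can be sketched with an in-memory stand-in for a KV bucket. Everything here (`MemKv`, `StoreError`, the revision numbering, the key `session:abc`) is an assumption for illustration, not the actual storage client or the JetStream API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum StoreError {
    NotFound,
    RevisionMismatch { expected: u64, actual: u64 },
}

/// In-memory stand-in for a KV bucket; every key carries a revision counter.
#[derive(Default)]
pub struct MemKv {
    entries: Mutex<HashMap<String, (u64, String)>>,
}

impl MemKv {
    /// Unconditional write; returns the new revision (starting at 1).
    pub fn put(&self, key: &str, value: &str) -> u64 {
        let mut map = self.entries.lock().unwrap();
        let rev = map.get(key).map_or(1, |(r, _)| r + 1);
        map.insert(key.to_string(), (rev, value.to_string()));
        rev
    }

    /// Returns the current (revision, value) pair, if any.
    pub fn get(&self, key: &str) -> Option<(u64, String)> {
        self.entries.lock().unwrap().get(key).cloned()
    }

    /// Writes only if the stored revision still equals `expected`.
    pub fn compare_and_set(&self, key: &str, expected: u64, value: &str) -> Result<u64, StoreError> {
        let mut map = self.entries.lock().unwrap();
        let entry = map.get_mut(key).ok_or(StoreError::NotFound)?;
        if entry.0 != expected {
            return Err(StoreError::RevisionMismatch { expected, actual: entry.0 });
        }
        entry.0 += 1;
        entry.1 = value.to_string();
        Ok(entry.0)
    }
}
```

For refresh rotation, two requests that both read revision 1 would race on `compare_and_set(key, 1, ...)`; only the first write succeeds, and the second sees `RevisionMismatch` and can reject the reused token.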

### Tests
- [x] **T2.1** Sensitive values are stored only as hashes (reset tokens, refresh tokens, recovery codes)
- [x] **T2.2** Refresh token rotation is atomic (cannot be used twice under concurrency)
- [x] **T2.3** Tautological test: storage client is Send + Sync

---

## Milestone 3: Routing Config + Service Discovery

**Goal:** Implement the routing layer described in [prd.md](prd.md), supporting independent placement per service kind and hot reload.

### Dependencies
- Milestone 1 (project foundation)

### Exit Criteria
- All Milestone 3 tests pass

### Tasks
- [x] **3.1** Define routing config model (in-memory)
  - Placement maps per service kind: `aggregate_placement`, `projection_placement`, `runner_placement`
  - Shard directory per service kind: `*_shards[shard_id] -> endpoint(s)`
  - Revision tracking and last-known-good semantics
- [x] **3.2** Implement config sources
  - Static file config for local development
  - NATS JetStream KV watcher for production
- [x] **3.3** Implement routing decision API
  - `(tenant_id, service_kind) -> selected endpoint`
  - Admin introspection: `GET /admin/routing`
- [x] **3.4** Implement config reload semantics
  - `POST /admin/routing/reload` to force refresh
  - Watcher-based reload that updates atomically
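
The routing decision in 3.3 can be sketched as a two-step lookup: tenant to shard, then shard to endpoint. This is a simplified illustration: it folds the three per-kind placement maps into one map keyed by `ServiceKind`, always picks the first endpoint, and all type names and URLs are assumptions.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

/// Typed routing errors so callers can map them consistently.
#[derive(Debug, PartialEq, Eq)]
pub enum RoutingError {
    UnknownTenant { tenant_id: String, kind: ServiceKind },
    UnknownShard { shard_id: String, kind: ServiceKind },
    NoEndpoints { shard_id: String, kind: ServiceKind },
}

#[derive(Debug, Clone, Default)]
pub struct RoutingTable {
    pub revision: u64,
    /// (service kind, tenant_id) -> shard_id
    pub placement: HashMap<(ServiceKind, String), String>,
    /// (service kind, shard_id) -> endpoint(s)
    pub shards: HashMap<(ServiceKind, String), Vec<String>>,
}

impl RoutingTable {
    /// Resolves `(tenant_id, service_kind)` to the first endpoint of the tenant's shard.
    pub fn resolve(&self, tenant_id: &str, kind: ServiceKind) -> Result<&str, RoutingError> {
        let shard_id = self
            .placement
            .get(&(kind, tenant_id.to_string()))
            .ok_or_else(|| RoutingError::UnknownTenant { tenant_id: tenant_id.to_string(), kind })?;
        let endpoints = self
            .shards
            .get(&(kind, shard_id.clone()))
            .ok_or_else(|| RoutingError::UnknownShard { shard_id: shard_id.clone(), kind })?;
        endpoints
            .first()
            .map(String::as_str)
            .ok_or_else(|| RoutingError::NoEndpoints { shard_id: shard_id.clone(), kind })
    }
}
```

Because placement is keyed by service kind, the same tenant can live on different shards for Aggregate, Projection, and Runner, which is the independent placement the goal describes.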

### Tests
- [x] **T3.1** Routing resolves endpoints for `(tenant_id, service_kind)` correctly
- [x] **T3.2** Hot reload swaps routing tables atomically (no partial reads)
- [x] **T3.3** Unknown tenant returns a consistent, typed routing error
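
One way to get the atomic swap that T3.2 checks is to publish each routing table as an immutable snapshot behind an `Arc`. The sketch below uses only the standard library; `Snapshot`, `RoutingHandle`, and the validation rules (non-empty, strictly newer revision) are assumptions chosen to illustrate last-known-good behavior.

```rust
use std::sync::{Arc, RwLock};

/// Immutable routing snapshot; entries are (tenant_id, endpoint) placeholders.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub revision: u64,
    pub entries: Vec<(String, String)>,
}

pub struct RoutingHandle {
    current: RwLock<Arc<Snapshot>>,
}

impl RoutingHandle {
    pub fn new(initial: Snapshot) -> Self {
        RoutingHandle { current: RwLock::new(Arc::new(initial)) }
    }

    /// Readers take a cheap `Arc` clone, so they see either the whole old
    /// table or the whole new one, never a mix.
    pub fn load(&self) -> Arc<Snapshot> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Installs `candidate` only if it is non-empty and newer; otherwise the
    /// last-known-good snapshot stays in place.
    pub fn try_swap(&self, candidate: Snapshot) -> Result<u64, String> {
        if candidate.entries.is_empty() {
            return Err("candidate has no entries".to_string());
        }
        let mut guard = self.current.write().unwrap();
        if candidate.revision <= guard.revision {
            return Err(format!(
                "stale revision {} (current {})",
                candidate.revision, guard.revision
            ));
        }
        *guard = Arc::new(candidate);
        Ok(guard.revision)
    }
}
```

A request that called `load()` before a reload keeps using its old snapshot until it finishes, while new requests pick up the new revision.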

---

## Milestone 4: AuthN Core (Tokens, Passwords, OIDC, MFA)

**Goal:** Implement the authentication layer and the public AuthN HTTP APIs described in the PRD: signup/signin/signout/refresh/forgot/reset and MFA primitives.

### Dependencies
- Milestone 1 (project foundation)
- Milestone 2 (persistent state)

### Exit Criteria
- All Milestone 4 tests pass

### Tasks
- [x] **4.1** Implement token model
  - Access token (short-lived)
  - Refresh token (rotating, revocable)
  - Key rotation for signing keys
- [x] **4.2** Implement password flows
  - `POST /v1/auth/signup`, `POST /v1/auth/signin`, `POST /v1/auth/signout`, `POST /v1/auth/refresh`
  - Forgot/reset: `POST /v1/auth/forgot`, `POST /v1/auth/reset`
- [x] **4.3** Implement Google OIDC integration points
  - `POST /v1/auth/oidc/google/start`
  - `GET /v1/auth/oidc/google/callback`
  - Account linking rules
- [x] **4.4** Implement MFA (TOTP) primitives
  - Enrollment start/confirm
  - Challenge and verification
  - Recovery codes
- [x] **4.5** Abuse protections
  - Rate limits for signin/forgot/reset
  - Generic responses that do not reveal whether an account exists, where appropriate
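
The rate limits in 4.5 could take several forms; the plan does not name an algorithm. As one possibility, here is a fixed-window counter sketch with the clock passed in explicitly so it is easy to test. `FixedWindowLimiter`, the key format, and the limits are all hypothetical.

```rust
use std::collections::HashMap;

/// Fixed-window limiter: at most `max_attempts` per key per `window_secs`.
pub struct FixedWindowLimiter {
    max_attempts: u32,
    window_secs: u64,
    /// key -> (window start in seconds, attempts counted in that window)
    windows: HashMap<String, (u64, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(max_attempts: u32, window_secs: u64) -> Self {
        FixedWindowLimiter { max_attempts, window_secs, windows: HashMap::new() }
    }

    /// Returns `true` if the attempt is allowed and records it.
    /// `now_secs` is supplied by the caller (for example, Unix time).
    pub fn check(&mut self, key: &str, now_secs: u64) -> bool {
        let entry = self.windows.entry(key.to_string()).or_insert((now_secs, 0));
        // Start a fresh window once the previous one has fully elapsed.
        if now_secs.saturating_sub(entry.0) >= self.window_secs {
            *entry = (now_secs, 0);
        }
        if entry.1 >= self.max_attempts {
            return false;
        }
        entry.1 += 1;
        true
    }
}
```

With two Gateway replicas, a per-process limiter like this only bounds each replica separately; a shared limit would need the counters in the backing store.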

### Tests
- [x] **T4.1** Password hashing/verification works (Argon2id)
- [x] **T4.2** Refresh token rotation: old refresh token is invalid after use
- [x] **T4.3** Forgot/reset tokens are one-time and expire
- [x] **T4.4** MFA TOTP enrollment and challenge succeed for valid codes and fail for invalid

---

## Milestone 5: AuthZ (RBAC) + Tenant Enforcement

**Goal:** Enforce authorization decisions at the Gateway boundary, including tenant selection rules for `x-tenant-id`.

### Dependencies
- Milestone 4 (authn)

### Exit Criteria
- All Milestone 5 tests pass

### Tasks
- [x] **5.1** Define RBAC model
  - Rights (permissions), roles, assignments (principal ↔ tenant ↔ role)
  - Platform admin vs tenant admin vs tenant member scoping rules
- [x] **5.2** Implement authorization engine
  - Inputs: principal, tenant_id, action, resource attributes (aggregate_type, view_type)
  - Outputs: allow/deny with reason
- [x] **5.3** Enforce `x-tenant-id` rules
  - Required on tenant-scoped endpoints
  - Validated format and tenant membership checks
- [x] **5.4** Add consistent error envelope mapping (401/403/400)
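
The authorization engine in 5.2 can be sketched as a pure function over role assignments. This minimal version omits resource attributes and platform-admin scoping; `Rbac`, `Decision`, `DenyReason`, and right names such as `query:read` are assumptions for illustration.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum DenyReason {
    NotATenantMember,
    MissingRight,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny(DenyReason),
}

#[derive(Default)]
pub struct Rbac {
    /// role -> rights granted by that role
    role_rights: HashMap<String, HashSet<String>>,
    /// (principal, tenant_id) -> roles assigned in that tenant
    assignments: HashMap<(String, String), HashSet<String>>,
}

impl Rbac {
    pub fn grant(&mut self, role: &str, right: &str) {
        self.role_rights.entry(role.to_string()).or_default().insert(right.to_string());
    }

    pub fn assign(&mut self, principal: &str, tenant_id: &str, role: &str) {
        self.assignments
            .entry((principal.to_string(), tenant_id.to_string()))
            .or_default()
            .insert(role.to_string());
    }

    /// Allow only if one of the principal's roles in this tenant grants `action`.
    pub fn authorize(&self, principal: &str, tenant_id: &str, action: &str) -> Decision {
        let roles = match self.assignments.get(&(principal.to_string(), tenant_id.to_string())) {
            Some(roles) => roles,
            None => return Decision::Deny(DenyReason::NotATenantMember),
        };
        let allowed = roles.iter().any(|role| {
            self.role_rights.get(role).map_or(false, |rights| rights.contains(action))
        });
        if allowed {
            Decision::Allow
        } else {
            Decision::Deny(DenyReason::MissingRight)
        }
    }
}
```

Returning a reason alongside the deny makes it straightforward to log why a request was rejected without changing the response the caller sees.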

### Tests
- [x] **T5.1** Tenant spoofing is rejected (principal lacks membership)
- [x] **T5.2** Role assignment enables expected actions and denies others
- [x] **T5.3** Missing `x-tenant-id` on tenant routes returns 400
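
The `x-tenant-id` rules and the 401/403/400 mapping exercised by these tests can be sketched together. The accepted tenant id format below (1 to 64 lowercase letters, digits, or hyphens) is an assumption, as are the `GatewayError` variants and function names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum GatewayError {
    Unauthenticated,
    MissingTenantHeader,
    InvalidTenantId,
    Forbidden,
}

impl GatewayError {
    /// HTTP status used in the error envelope.
    pub fn status(&self) -> u16 {
        match self {
            GatewayError::Unauthenticated => 401,
            GatewayError::MissingTenantHeader | GatewayError::InvalidTenantId => 400,
            GatewayError::Forbidden => 403,
        }
    }
}

/// Validates the raw header value. Assumed format: 1-64 characters drawn
/// from ASCII lowercase letters, digits, and '-'.
pub fn parse_tenant_header(value: Option<&str>) -> Result<&str, GatewayError> {
    let raw = value.ok_or(GatewayError::MissingTenantHeader)?.trim();
    let valid = !raw.is_empty()
        && raw.len() <= 64
        && raw.bytes().all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'-');
    if valid {
        Ok(raw)
    } else {
        Err(GatewayError::InvalidTenantId)
    }
}

/// Rejects a well-formed tenant id that the principal is not a member of.
pub fn enforce_tenant<'a>(
    header: Option<&'a str>,
    memberships: &[&str],
) -> Result<&'a str, GatewayError> {
    let tenant = parse_tenant_header(header)?;
    if memberships.iter().any(|m| *m == tenant) {
        Ok(tenant)
    } else {
        Err(GatewayError::Forbidden)
    }
}
```

Under these assumptions a missing or malformed header is a client error (400), while a valid header naming a tenant the caller does not belong to is an authorization failure (403).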

---

## Milestone 6: Upstream Proxying (Aggregate / Projection / Runner)

**Goal:** Route authenticated and authorized requests to the node services.

### Dependencies
- Milestone 3 (routing)
- Milestone 5 (authz)

### Exit Criteria
- All Milestone 6 tests pass

### Tasks
- [x] **6.1** Aggregate submit command proxy
  - gRPC server implementing `aggregate.gateway.v1.CommandService/SubmitCommand`
  - HTTP wrapper `POST /v1/commands/{aggregate_type}/{aggregate_id}`
  - Propagate `x-tenant-id` and correlation metadata
  - Ensure safe retry semantics using `command_id` idempotency
- [x] **6.2** Projection query proxy
  - `POST /v1/query/{view_type}` forwarding to the Projection query endpoint once it is available
- [x] **6.3** Runner admin passthrough (admin-only)
  - `/admin/runner/*` forwarding with strict authorization
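
The retry semantics in 6.1 depend on every attempt carrying the same `command_id`, so the upstream can recognize duplicates. The sketch below illustrates that idea under assumptions: `Command`, `UpstreamError`, and `submit_with_retry` are hypothetical, the transport is a closure, and deduplication itself is assumed to happen upstream.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Command {
    /// Stable across retries; the upstream is assumed to deduplicate on it.
    pub command_id: String,
    pub payload: String,
}

#[derive(Debug, PartialEq)]
pub enum UpstreamError {
    /// Worth retrying (timeouts, connection resets).
    Transient(String),
    /// Not worth retrying (validation failures, rejections).
    Permanent(String),
}

/// Retries transient failures up to `max_attempts`, always resending the
/// same `Command` (and therefore the same `command_id`).
pub fn submit_with_retry<F>(
    cmd: &Command,
    max_attempts: u32,
    mut send: F,
) -> Result<Vec<String>, UpstreamError>
where
    F: FnMut(&Command) -> Result<Vec<String>, UpstreamError>,
{
    let mut last = UpstreamError::Transient("no attempts made".to_string());
    for _ in 0..max_attempts {
        match send(cmd) {
            Ok(events) => return Ok(events),
            Err(UpstreamError::Transient(msg)) => last = UpstreamError::Transient(msg),
            Err(permanent) => return Err(permanent),
        }
    }
    Err(last)
}
```

Generating a fresh id per attempt would defeat the purpose: a retry after a timeout could then apply the same command twice.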

### Tests
- [x] **T6.1** gRPC SubmitCommand forwards tenant metadata and returns upstream events
- [x] **T6.2** HTTP command endpoint returns the same shape as gRPC response
- [x] **T6.3** Query endpoint enforces tenant scoping and denies unauthorized callers

---

## Milestone 7: Admin IAM APIs (Users, Roles, Rights)

**Goal:** Expose the admin IAM endpoints for the Admin UI node to manage authn/authz data.

### Dependencies
- Milestone 4 (authn)
- Milestone 5 (authz)

### Exit Criteria
- All Milestone 7 tests pass

### Tasks
- [x] **7.1** Implement admin IAM endpoints
  - Users CRUD and disable/delete
  - Identities link/unlink (OIDC), manage password credentials
  - Rights CRUD, roles CRUD, role↔rights management
  - Assignments CRUD (principal ↔ tenant ↔ role)
  - Service accounts credential create/rotate and tenant role assignment
  - MFA admin actions (reset MFA, revoke recovery codes)
  - Session revocation for user (global signout)
- [x] **7.2** Implement audit trail for admin IAM actions
  - Immutable record of actor, action, target, tenant scope, timestamp, request metadata
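
The audit record in 7.2 can be sketched as a plain struct plus an append-only log. The field names mirror the list above, but the exact shape, the `audited` helper, and the in-memory log are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct AuditRecord {
    pub actor: String,
    pub action: String,
    pub target: String,
    /// `None` stands for platform-wide scope in this sketch.
    pub tenant_id: Option<String>,
    pub timestamp_ms: u64,
    pub request_id: String,
}

/// Append-only log: records can be added and read, never edited or removed.
#[derive(Default)]
pub struct AuditLog {
    records: Vec<AuditRecord>,
}

impl AuditLog {
    /// Appends a record and returns its 1-based sequence number.
    pub fn append(&mut self, record: AuditRecord) -> u64 {
        self.records.push(record);
        self.records.len() as u64
    }

    pub fn records(&self) -> &[AuditRecord] {
        &self.records
    }
}

/// Runs an admin mutation and records it, so the two cannot drift apart.
pub fn audited<T>(log: &mut AuditLog, record: AuditRecord, mutation: impl FnOnce() -> T) -> T {
    let out = mutation();
    log.append(record);
    out
}
```

Routing every admin mutation through a single helper like `audited` is one way to make T7.2 hold by construction.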

### Tests
- [x] **T7.1** Only platform/tenant admins can access relevant endpoints
- [x] **T7.2** All admin mutations emit an audit record
- [x] **T7.3** Assignment changes immediately affect authorization decisions

---

## Milestone 8: Rebalancing Operations (Control Plane Hooks)

**Goal:** Provide the pieces needed to support tenant rebalancing as described in the PRD: visibility, readiness gates, and safe cutover support.

### Dependencies
- Milestone 3 (routing config)
- Milestone 6 (upstream proxying)

### Exit Criteria
- All Milestone 8 tests pass

### Tasks
- [x] **8.1** Expose placement introspection and status
  - Current placement revision per service kind
  - Effective routing decisions for a given tenant (admin-only)
- [x] **8.2** Define and implement readiness gates used by rebalancer
  - Projection: warmup/catchup signal (lag)
  - Runner: tenant drained / checkpoint stable signal
  - Aggregate: tenant drain and state availability signal (as defined by upstream changes)
- [x] **8.3** Add operator-facing rebalancing endpoints (optional if a separate rebalancer service exists)
  - Plan/apply/rollback APIs with strong authorization
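
The readiness gates in 8.2 can be modeled as per-service signals that must all pass before cutover. The sketch below assumes a simple event-count lag threshold and boolean drain flags; `Signal`, `blocking_reason`, and `can_cut_over` are hypothetical names, and the real signals depend on upstream changes.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Signal {
    Projection { lag_events: u64 },
    Runner { drained: bool, checkpoint_stable: bool },
    Aggregate { drained: bool, state_available: bool },
}

/// Returns why a signal blocks cutover, or `None` if its gate is open.
pub fn blocking_reason(signal: &Signal, max_lag_events: u64) -> Option<String> {
    match signal {
        Signal::Projection { lag_events } if *lag_events > max_lag_events => Some(format!(
            "projection lag {} exceeds {}",
            lag_events, max_lag_events
        )),
        Signal::Runner { drained: false, .. } => Some("runner tenant not drained".to_string()),
        Signal::Runner { checkpoint_stable: false, .. } => {
            Some("runner checkpoint not stable".to_string())
        }
        Signal::Aggregate { drained: false, .. } => Some("aggregate tenant not drained".to_string()),
        Signal::Aggregate { state_available: false, .. } => {
            Some("aggregate state not available on target".to_string())
        }
        _ => None,
    }
}

/// Cutover is allowed only when no signal reports a blocking reason.
pub fn can_cut_over(signals: &[Signal], max_lag_events: u64) -> Result<(), Vec<String>> {
    let reasons: Vec<String> = signals
        .iter()
        .filter_map(|s| blocking_reason(s, max_lag_events))
        .collect();
    if reasons.is_empty() {
        Ok(())
    } else {
        Err(reasons)
    }
}
```

Collecting every blocking reason, rather than stopping at the first, gives operators the full picture when a plan is refused.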

### Tests
- [x] **T8.1** Placement revision changes are visible immediately and atomically
- [x] **T8.2** Rebalancing guardrails prevent cutover when target shard is not ready

---

## Milestone 9: Docker Swarm Deployment + HA (Max 2 Replicas)

**Goal:** Define and validate the Docker Swarm architecture for Gateway, including HA behavior with at most **2 Gateway replicas**.

### Dependencies
- Milestone 1 (health/ready/metrics)
- Milestone 2 (persistent state suitable for HA)
- Milestone 6 (proxying) for end-to-end smoke tests

### Exit Criteria
- All Milestone 9 tests pass
- The platform stack (`swarm/stacks/platform.yml`) can deploy the Gateway with `replicas: 2` and serve traffic during rolling updates

### Tasks
- [x] **9.1** Build container image
  - Dockerfile, multi-stage build, minimal runtime image
  - Embed build metadata (version, git sha)
- [x] **9.2** Define Swarm service topology (2 replicas max)
  - `gateway` service with `deploy.replicas: 2`
  - Healthcheck based on `/ready`
  - Rolling update strategy (start-first), rollback policy on failure
  - Network: overlay network for internal traffic to NATS and nodes
- [x] **9.3** Define ingress and TLS termination strategy
  - Swarm routing mesh or an ingress proxy (document choice in stack)
  - Ensure HTTP and gRPC can be routed correctly
- [x] **9.4** Secrets and config distribution
  - OIDC client secrets, JWT signing keys (rotation-ready), NATS credentials
  - Use Swarm secrets/configs instead of environment variables for secrets where possible
- [x] **9.5** HA behavior validation
  - Run two replicas and ensure:
    - refresh token rotation works across replicas (no stickiness)
    - admin IAM updates are visible from both replicas
    - in-flight requests survive a single replica restart

### Tests
- [x] **T9.1** `swarm/stacks/platform.yml` parses as valid YAML
- [x] **T9.2** Smoke: deploy 2 replicas and confirm `/ready` is healthy on both
- [x] **T9.3** Rolling update never leaves fewer than 1 ready replica
- [x] **T9.4** Auth session/refresh works across replicas (no sticky sessions required)

---

## Milestone 10: Observability + Hardening

**Goal:** Make the Gateway production-ready with robust telemetry and safety defaults.

### Dependencies
- Milestone 6 (proxying)

### Exit Criteria
- All Milestone 10 tests pass

### Tasks
- [x] **10.1** Structured logs with correlation
  - `request_id`, `trace_id`, principal id, tenant id (when present), upstream target
- [x] **10.2** Metrics
  - Request counts/latency, auth failures, upstream errors, routing misses, rate limit blocks
- [x] **10.3** Security hardening
  - CSRF protections for cookie-based flows
  - JWT key rotation strategy and config
  - mTLS/service auth boundary for internal upstreams
- [x] **10.4** Load and failure testing strategy
  - Soak tests for routing reload + auth endpoints
  - Backpressure/timeouts/circuit breaker verification
- [ ] **10.5** Correlation and trace context propagation (Gateway as source of truth)
  - Accept inbound `x-correlation-id` and `traceparent` on HTTP and gRPC requests
  - If missing, generate `x-correlation-id` at the start of request handling and start a new trace
  - Echo `x-correlation-id` (and `traceparent` when applicable) on responses
  - Propagate `x-correlation-id` and `traceparent` to upstream nodes (Aggregate/Projection/Runner) and record them in request spans/log fields
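
The header handling in 10.5 can be sketched as a function from inbound headers to the set that is echoed and forwarded. This sketch assumes header names are already lowercased, takes the id generator as a closure, and validates `traceparent` against the W3C Trace Context version `00` layout. Starting a new trace when `traceparent` is absent is left to the tracing layer and not shown.

```rust
use std::collections::HashMap;

pub const CORRELATION_HEADER: &str = "x-correlation-id";
pub const TRACEPARENT_HEADER: &str = "traceparent";

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn not_all_zero(s: &str) -> bool {
    s.bytes().any(|b| b != b'0')
}

/// Checks the version-00 layout: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>".
pub fn is_valid_traceparent(value: &str) -> bool {
    let parts: Vec<&str> = value.split('-').collect();
    parts.len() == 4
        && parts[0] == "00"
        && is_lower_hex(parts[1], 32)
        && not_all_zero(parts[1])
        && is_lower_hex(parts[2], 16)
        && not_all_zero(parts[2])
        && is_lower_hex(parts[3], 2)
}

/// Builds the headers to echo on the response and forward upstream.
/// `new_id` is called only when no usable correlation id came in.
pub fn outbound_headers<F>(inbound: &HashMap<String, String>, new_id: F) -> HashMap<String, String>
where
    F: FnOnce() -> String,
{
    let mut out = HashMap::new();
    let correlation = inbound
        .get(CORRELATION_HEADER)
        .filter(|v| !v.trim().is_empty())
        .cloned()
        .unwrap_or_else(new_id);
    out.insert(CORRELATION_HEADER.to_string(), correlation);
    // Forward traceparent only when it is well-formed.
    if let Some(tp) = inbound.get(TRACEPARENT_HEADER).filter(|v| is_valid_traceparent(v)) {
        out.insert(TRACEPARENT_HEADER.to_string(), tp.clone());
    }
    out
}
```

Computing this map once at the start of request handling, and reusing it for logs, the response, and upstream calls, keeps the Gateway the single source of truth for the correlation id.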

### Tests
- [x] **T10.1** Metrics include expected labels and counters increment correctly
- [x] **T10.2** Secrets never appear in logs in representative error cases
- [x] **T10.3** Rate limits trigger under abusive patterns
- [ ] **T10.4** Gateway generates `x-correlation-id` when missing and echoes it on responses
- [ ] **T10.5** Gateway propagates `x-correlation-id` and `traceparent` to upstream calls and includes them in logs/spans
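
For T10.2, one common way to keep secrets out of logs is a wrapper type whose `Debug` and `Display` output is redacted. The sketch below is a hand-rolled illustration with hypothetical `Secret` and `OidcConfig` types; it is not taken from the Gateway codebase.

```rust
use std::fmt;

/// Wrapper that never prints its contents through `Debug` or `Display`.
pub struct Secret(String);

impl Secret {
    pub fn new(value: impl Into<String>) -> Self {
        Secret(value.into())
    }

    /// Explicit, greppable access to the raw value.
    pub fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("Secret(<redacted>)")
    }
}

impl fmt::Display for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("<redacted>")
    }
}

/// Hypothetical config struct; deriving `Debug` is safe because the
/// secret field redacts itself.
#[derive(Debug)]
pub struct OidcConfig {
    pub client_id: String,
    pub client_secret: Secret,
}
```

A test in the spirit of T10.2 can then format a config or error value and assert that the raw secret string does not appear in the output.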