cloudlysis/gateway/DEVELOPMENT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00

# Development Plan: Gateway
## Overview
This plan breaks down the Gateway implementation into milestones ordered by dependency. Each milestone includes:
- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones
**Development Approach:**
1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. Do not start the next milestone until the current milestone's tests are passing (green)
4. Mark tasks complete with `[x]` as you progress
---
## Milestone 1: Project Foundation
**Goal:** Create the Gateway service as a Rust project aligned with existing node conventions (Axum + Tokio + tracing + Prometheus metrics).
### Tasks
- [x] **1.1** Initialize Cargo project
  - Create `src/lib.rs` and `src/main.rs`
  - Establish module layout for: http, grpc, authn, authz, routing, upstream, observability, config, storage
- [x] **1.2** Choose and wire core dependencies (aligned with existing services)
  - HTTP: `axum`
  - gRPC: `tonic`
  - Runtime: `tokio`
  - Serialization: `serde`, `serde_json`
  - Errors: `thiserror`, `anyhow`
  - Telemetry: `tracing`, `metrics-exporter-prometheus` or existing metrics pattern in the codebase
- [x] **1.3** Add baseline runtime endpoints
  - `GET /health`, `GET /ready`, `GET /metrics`
  - Structured logs with request id propagation
### Tests
- [x] **T1.1** Project compiles
- [x] **T1.2** `GET /health` returns 200
- [x] **T1.3** Tautological test: core state types are Send + Sync
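
A tautological `Send + Sync` test such as T1.3 can be expressed as a compile-time check. The sketch below is illustrative only: `AppState` and its `routes` field are hypothetical stand-ins, not the Gateway's actual state type.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the Gateway's shared application state.
#[derive(Clone, Default)]
pub struct AppState {
    /// Illustrative field: tenant id -> upstream endpoint.
    pub routes: Arc<RwLock<HashMap<String, String>>>,
}

/// Compiles only when `T` is both `Send` and `Sync`; does nothing at runtime.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Instantiating the helper is the whole check: if `AppState` ever gains a
/// field that is not thread-safe (an `Rc`, a `RefCell`), this stops compiling.
pub fn core_state_is_send_sync() {
    assert_send_sync::<AppState>();
}
```

Because the failure mode is a compile error, the test body has nothing to assert at runtime, which is what makes it "tautological".
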
---
## Milestone 2: Persistent State (Auth + RBAC + Sessions) for HA
**Goal:** Define where Gateway state lives so the service can run as **HA (max 2 replicas)** without sticky sessions and without losing auth/admin consistency.
### Dependencies
- Milestone 1 (project foundation)
### Tasks
- [x] **2.1** Choose and implement the backing store for identity + authorization state
  - Recommended default for platform alignment: NATS JetStream KV buckets for:
    - users
    - identities (OIDC links)
    - password credential records (hash only)
    - refresh token/session records (hash only, revocable, rotating)
    - MFA enrollments + recovery codes (hash only)
    - rights/roles/assignments
    - audit log index (append-only model)
- [x] **2.2** Define storage schema + versioning
  - Key naming conventions
  - JSON shapes (forward compatible)
  - Migration strategy for schema changes
- [x] **2.3** Implement storage client abstraction
  - CRUD primitives with compare-and-set semantics where needed (e.g., refresh rotation)
  - Pagination/scan strategy for admin listing endpoints
  - Consistent error mapping for storage failures
### Tests
- [x] **T2.1** Sensitive values are stored only as hashes (reset tokens, refresh tokens, recovery codes)
- [x] **T2.2** Refresh token rotation is atomic (cannot be used twice under concurrency)
- [x] **T2.3** Tautological test: storage client is Send + Sync
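
The compare-and-set requirement in 2.3 and the atomicity test T2.2 can be illustrated with an in-memory store that tracks a revision per key. This is a minimal sketch under assumptions: `MemoryKv`, `rotate_refresh`, and the error names are hypothetical, the store is a `Mutex<HashMap>` rather than a real KV client, and token hashing is assumed to happen before these functions are called.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// A stored value plus the revision at which it was last written.
#[derive(Clone, Debug, PartialEq)]
pub struct Versioned {
    pub value: String,
    pub revision: u64,
}

#[derive(Debug, PartialEq)]
pub enum StoreError {
    NotFound,
    /// The presented token hash is not the current one (already rotated).
    TokenMismatch,
    /// The key changed between the caller's read and its write.
    RevisionMismatch { current: u64 },
}

/// In-memory stand-in for a KV bucket with revision-checked updates.
#[derive(Default)]
pub struct MemoryKv {
    entries: Mutex<HashMap<String, Versioned>>,
}

impl MemoryKv {
    /// Unconditional write; returns the new revision.
    pub fn put(&self, key: &str, value: &str) -> u64 {
        let mut entries = self.entries.lock().unwrap();
        let revision = entries.get(key).map_or(1, |v| v.revision + 1);
        entries.insert(key.to_string(), Versioned { value: value.to_string(), revision });
        revision
    }

    pub fn get(&self, key: &str) -> Option<Versioned> {
        self.entries.lock().unwrap().get(key).cloned()
    }

    /// Writes `value` only if the key is still at revision `expected`.
    pub fn update(&self, key: &str, value: &str, expected: u64) -> Result<u64, StoreError> {
        let mut entries = self.entries.lock().unwrap();
        let entry = entries.get_mut(key).ok_or(StoreError::NotFound)?;
        if entry.revision != expected {
            return Err(StoreError::RevisionMismatch { current: entry.revision });
        }
        entry.value = value.to_string();
        entry.revision += 1;
        Ok(entry.revision)
    }
}

/// Swaps the stored refresh-token hash for `next_hash`, succeeding at most
/// once per presented hash even when callers race.
pub fn rotate_refresh(
    kv: &MemoryKv,
    session_key: &str,
    presented_hash: &str,
    next_hash: &str,
) -> Result<u64, StoreError> {
    let current = kv.get(session_key).ok_or(StoreError::NotFound)?;
    if current.value != presented_hash {
        return Err(StoreError::TokenMismatch);
    }
    // The read and the write are separate steps; the revision check is what
    // makes the pair behave atomically.
    kv.update(session_key, next_hash, current.revision)
}
```

Two callers presenting the same old hash either lose at the hash comparison or at the revision check, so only one rotation can win regardless of which replica handles each request.
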
---
## Milestone 3: Routing Config + Service Discovery
**Goal:** Implement the routing layer described in [prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md), supporting independent placement per service kind and hot reload.
### Dependencies
- Milestone 1 (project foundation)
### Exit Criteria
- All Milestone 3 tests pass
### Tasks
- [x] **3.1** Define routing config model (in-memory)
  - Placement maps per service kind: `aggregate_placement`, `projection_placement`, `runner_placement`
  - Shard directory per service kind: `*_shards[shard_id] -> endpoint(s)`
  - Revision tracking and last-known-good semantics
- [x] **3.2** Implement config sources
  - Static file config for local development
  - NATS JetStream KV watcher for production
- [x] **3.3** Implement routing decision API
  - `(tenant_id, service_kind) -> selected endpoint`
  - Admin introspection: `GET /admin/routing`
- [x] **3.4** Implement config reload semantics
  - `POST /admin/routing/reload` to force refresh
  - Watcher-based reload that updates atomically
### Tests
- [x] **T3.1** Routing resolves endpoints for `(tenant_id, service_kind)` correctly
- [x] **T3.2** Hot reload swaps routing tables atomically (no partial reads)
- [x] **T3.3** Unknown tenant returns a consistent, typed routing error
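
One way to get the atomic swap in T3.2 and the typed error in T3.3 is to treat each routing table as an immutable snapshot behind an `Arc` and replace the whole snapshot on reload. The sketch below assumes that design. All type and field names are hypothetical. Collapsing the per-kind placement maps into one map keyed by `(kind, tenant)` is a simplification. Rejecting non-newer revisions is one possible reading of "last-known-good".

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

#[derive(Debug, PartialEq)]
pub enum RoutingError {
    UnknownTenant { tenant_id: String, kind: ServiceKind },
    NoEndpoint { shard_id: String, kind: ServiceKind },
}

/// One immutable snapshot of the routing config.
#[derive(Debug, Default)]
pub struct RoutingTable {
    pub revision: u64,
    /// (service kind, tenant id) -> shard id
    pub placement: HashMap<(ServiceKind, String), String>,
    /// (service kind, shard id) -> endpoints
    pub shards: HashMap<(ServiceKind, String), Vec<String>>,
}

impl RoutingTable {
    /// Resolves `(tenant_id, service_kind)` to the first endpoint of its shard.
    pub fn resolve(&self, tenant_id: &str, kind: ServiceKind) -> Result<&str, RoutingError> {
        let shard_id = self
            .placement
            .get(&(kind, tenant_id.to_string()))
            .ok_or_else(|| RoutingError::UnknownTenant { tenant_id: tenant_id.to_string(), kind })?;
        self.shards
            .get(&(kind, shard_id.clone()))
            .and_then(|endpoints| endpoints.first())
            .map(String::as_str)
            .ok_or_else(|| RoutingError::NoEndpoint { shard_id: shard_id.clone(), kind })
    }
}

/// Holds the current snapshot; readers clone the `Arc`, reloads replace it.
pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(initial: RoutingTable) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Returns the snapshot in effect now; it never changes after being handed out.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Installs `next` only if its revision is newer; otherwise keeps the current table.
    pub fn reload(&self, next: RoutingTable) -> bool {
        let mut guard = self.current.write().unwrap();
        if next.revision <= guard.revision {
            return false;
        }
        *guard = Arc::new(next);
        true
    }
}
```

A request that took a snapshot before a reload keeps resolving against that snapshot, so it never observes a half-updated table.
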
---
## Milestone 4: AuthN Core (Tokens, Passwords, OIDC, MFA)
**Goal:** Implement the authentication layer and the public AuthN HTTP APIs described in the PRD: signup/signin/signout/refresh/forgot/reset and MFA primitives.
### Dependencies
- Milestone 1 (project foundation)
- Milestone 2 (persistent state)
### Exit Criteria
- All Milestone 4 tests pass
### Tasks
- [x] **4.1** Implement token model
  - Access token (short-lived)
  - Refresh token (rotating, revocable)
  - Key rotation for signing keys
- [x] **4.2** Implement password flows
  - `POST /v1/auth/signup`, `POST /v1/auth/signin`, `POST /v1/auth/signout`, `POST /v1/auth/refresh`
  - Forgot/reset: `POST /v1/auth/forgot`, `POST /v1/auth/reset`
- [x] **4.3** Implement Google OIDC integration points
  - `POST /v1/auth/oidc/google/start`
  - `GET /v1/auth/oidc/google/callback`
  - Account linking rules
- [x] **4.4** Implement MFA (TOTP) primitives
  - Enrollment start/confirm
  - Challenge and verification
  - Recovery codes
- [x] **4.5** Abuse protections
  - Rate limits for signin/forgot/reset
  - Generic “account not found” responses where appropriate
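
The rate limits in 4.5 could take several forms; the plan does not name an algorithm. The sketch below assumes a fixed-window counter keyed by an arbitrary string (for example endpoint plus client address). `FixedWindowLimiter` and the key format are hypothetical, and the state is per-process, so with two replicas the counts would need to be shared or the per-replica limit accepted.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window limiter: at most `max_attempts` per `window` per key.
pub struct FixedWindowLimiter {
    max_attempts: u32,
    window: Duration,
    /// key -> (window start, attempts counted in that window)
    buckets: HashMap<String, (Instant, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(max_attempts: u32, window: Duration) -> Self {
        Self { max_attempts, window, buckets: HashMap::new() }
    }

    /// Returns `true` if the attempt is allowed and counts it.
    /// `now` is passed in so the behavior can be tested without sleeping.
    pub fn check(&mut self, key: &str, now: Instant) -> bool {
        let bucket = self.buckets.entry(key.to_string()).or_insert((now, 0));
        // Start a fresh window once the previous one has fully elapsed.
        if now.duration_since(bucket.0) >= self.window {
            *bucket = (now, 0);
        }
        if bucket.1 >= self.max_attempts {
            return false;
        }
        bucket.1 += 1;
        true
    }
}
```

Blocked attempts are not counted, so a client that keeps retrying is released when the window started by its first attempt ends.
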
### Tests
- [x] **T4.1** Password hashing/verification works (Argon2id)
- [x] **T4.2** Refresh token rotation: old refresh token is invalid after use
- [x] **T4.3** Forgot/reset tokens are one-time and expire
- [x] **T4.4** MFA TOTP enrollment and challenge succeed for valid codes and fail for invalid
---
## Milestone 5: AuthZ (RBAC) + Tenant Enforcement
**Goal:** Enforce authorization decisions at the Gateway boundary, including tenant selection rules for `x-tenant-id`.
### Dependencies
- Milestone 4 (authn)
### Exit Criteria
- All Milestone 5 tests pass
### Tasks
- [x] **5.1** Define RBAC model
  - Rights (permissions), roles, assignments (principal ↔ tenant ↔ role)
  - Platform admin vs tenant admin vs tenant member scoping rules
- [x] **5.2** Implement authorization engine
  - Inputs: principal, tenant_id, action, resource attributes (aggregate_type, view_type)
  - Outputs: allow/deny with reason
- [x] **5.3** Enforce `x-tenant-id` rules
  - Required on tenant-scoped endpoints
  - Validated format and tenant membership checks
- [x] **5.4** Add consistent error envelope mapping (401/403/400)
### Tests
- [x] **T5.1** Tenant spoofing is rejected (principal lacks membership)
- [x] **T5.2** Role assignment enables expected actions and denies others
- [x] **T5.3** Missing `x-tenant-id` on tenant routes returns 400
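
The authorization engine in 5.2 and the status mapping in 5.4 can be sketched as a pure function over the RBAC data. This is a simplified illustration: `Rbac`, `Decision`, `DenyReason`, and `status_for` are hypothetical names, resource attributes and platform-admin scoping are omitted, and the order of the 401/400/403 checks is an assumption.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum DenyReason {
    NotATenantMember,
    MissingRight,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny(DenyReason),
}

#[derive(Default)]
pub struct Rbac {
    /// role -> rights granted by that role
    role_rights: HashMap<String, HashSet<String>>,
    /// (principal, tenant) -> roles held in that tenant
    assignments: HashMap<(String, String), HashSet<String>>,
}

impl Rbac {
    pub fn grant(&mut self, role: &str, right: &str) {
        self.role_rights.entry(role.to_string()).or_default().insert(right.to_string());
    }

    pub fn assign(&mut self, principal: &str, tenant_id: &str, role: &str) {
        self.assignments
            .entry((principal.to_string(), tenant_id.to_string()))
            .or_default()
            .insert(role.to_string());
    }

    /// Allows the action only if some role the principal holds in this tenant grants it.
    pub fn authorize(&self, principal: &str, tenant_id: &str, action: &str) -> Decision {
        let key = (principal.to_string(), tenant_id.to_string());
        let Some(roles) = self.assignments.get(&key) else {
            return Decision::Deny(DenyReason::NotATenantMember);
        };
        let allowed = roles
            .iter()
            .filter_map(|role| self.role_rights.get(role))
            .any(|rights| rights.contains(action));
        if allowed { Decision::Allow } else { Decision::Deny(DenyReason::MissingRight) }
    }
}

/// Maps a tenant-scoped request to an HTTP status (assumed ordering:
/// unauthenticated -> 401, missing `x-tenant-id` -> 400, denied -> 403).
pub fn status_for(rbac: &Rbac, principal: Option<&str>, tenant_header: Option<&str>, action: &str) -> u16 {
    let Some(principal) = principal else { return 401 };
    let Some(tenant_id) = tenant_header else { return 400 };
    match rbac.authorize(principal, tenant_id, action) {
        Decision::Allow => 200,
        Decision::Deny(_) => 403,
    }
}
```

Keeping the deny reason in the decision lets logs and audit records explain a 403 without exposing the reason to the caller.
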
---
## Milestone 6: Upstream Proxying (Aggregate / Projection / Runner)
**Goal:** Route authenticated and authorized requests to the node services.
### Dependencies
- Milestone 3 (routing)
- Milestone 5 (authz)
### Exit Criteria
- All Milestone 6 tests pass
### Tasks
- [x] **6.1** Aggregate submit command proxy
  - gRPC server implementing `aggregate.gateway.v1.CommandService/SubmitCommand`
  - HTTP wrapper `POST /v1/commands/{aggregate_type}/{aggregate_id}`
  - Propagate `x-tenant-id` and correlation metadata
  - Ensure safe retry semantics using `command_id` idempotency
- [x] **6.2** Projection query proxy
  - `POST /v1/query/{view_type}` forwarding to Projection query endpoint once available
- [x] **6.3** Runner admin passthrough (admin-only)
  - `/admin/runner/*` forwarding with strict authorization
### Tests
- [x] **T6.1** gRPC SubmitCommand forwards tenant metadata and returns upstream events
- [x] **T6.2** HTTP command endpoint returns the same shape as gRPC response
- [x] **T6.3** Query endpoint enforces tenant scoping and denies unauthorized callers
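
The "safe retry semantics using `command_id` idempotency" item in 6.1 amounts to resending the same command, with the same `command_id`, when the upstream is transiently unavailable. The sketch below assumes the upstream deduplicates by `command_id`; `Command`, `UpstreamError`, and `submit_with_retry` are hypothetical, and the transport is abstracted as a closure rather than a real gRPC client.

```rust
#[derive(Debug, PartialEq)]
pub enum UpstreamError {
    /// Transient failure; the command may or may not have been applied.
    Unavailable,
    /// The upstream rejected the command; retrying will not help.
    Rejected(String),
}

pub struct Command {
    pub command_id: String,
    pub tenant_id: String,
    pub payload: String,
}

/// Sends `command` up to `max_attempts` times, always with the same
/// `command_id`, so an upstream that deduplicates by id applies it at most once.
pub fn submit_with_retry<F>(
    command: &Command,
    max_attempts: u32,
    mut send: F,
) -> Result<String, UpstreamError>
where
    F: FnMut(&Command) -> Result<String, UpstreamError>,
{
    for _ in 0..max_attempts {
        match send(command) {
            Ok(events) => return Ok(events),
            // Only transient failures are retried.
            Err(UpstreamError::Unavailable) => continue,
            Err(other) => return Err(other),
        }
    }
    Err(UpstreamError::Unavailable)
}
```

Without a stable `command_id`, the first branch of a lost-response scenario (applied upstream, response dropped) would turn a retry into a duplicate command.
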
---
## Milestone 7: Admin IAM APIs (Users, Roles, Rights)
**Goal:** Expose the admin IAM endpoints for the Admin UI node to manage authn/authz data.
### Dependencies
- Milestone 4 (authn)
- Milestone 5 (authz)
### Exit Criteria
- All Milestone 7 tests pass
### Tasks
- [x] **7.1** Implement admin IAM endpoints
  - Users CRUD and disable/delete
  - Identities link/unlink (OIDC), manage password credentials
  - Rights CRUD, roles CRUD, role↔rights management
  - Assignments CRUD (principal ↔ tenant ↔ role)
  - Service accounts credential create/rotate and tenant role assignment
  - MFA admin actions (reset MFA, revoke recovery codes)
  - Session revocation for user (global signout)
- [x] **7.2** Implement audit trail for admin IAM actions
  - Immutable record of actor, action, target, tenant scope, timestamp, request metadata
### Tests
- [x] **T7.1** Only platform/tenant admins can access relevant endpoints
- [x] **T7.2** All admin mutations emit an audit record
- [x] **T7.3** Assignment changes immediately affect authorization decisions
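
The audit trail in 7.2 and test T7.2 suggest wrapping every admin mutation so that it cannot complete without writing a record. The sketch below is one possible shape: `AuditRecord`, `AuditLog`, and `audited` are hypothetical, request metadata is omitted, the log is an in-memory `Vec`, and recording only successful mutations is an assumption (failed attempts may also need auditing).

```rust
/// Immutable description of one admin action.
#[derive(Clone, Debug, PartialEq)]
pub struct AuditRecord {
    pub actor: String,
    pub action: String,
    pub target: String,
    /// `None` for platform-scoped actions.
    pub tenant_id: Option<String>,
    pub timestamp_unix_ms: u64,
}

/// Append-only log: records can be added and read, never changed or removed.
#[derive(Default)]
pub struct AuditLog {
    records: Vec<AuditRecord>,
}

impl AuditLog {
    pub fn append(&mut self, record: AuditRecord) {
        self.records.push(record);
    }

    pub fn records(&self) -> &[AuditRecord] {
        &self.records
    }
}

/// Runs `mutation` and appends `record` if it succeeds, so handlers that go
/// through this wrapper cannot mutate state without leaving a trace.
pub fn audited<T>(
    log: &mut AuditLog,
    record: AuditRecord,
    mutation: impl FnOnce() -> Result<T, String>,
) -> Result<T, String> {
    let outcome = mutation()?;
    log.append(record);
    Ok(outcome)
}
```
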
---
## Milestone 8: Rebalancing Operations (Control Plane Hooks)
**Goal:** Provide the pieces needed to support tenant rebalancing as described in the PRD: visibility, readiness gates, and safe cutover support.
### Dependencies
- Milestone 3 (routing config)
- Milestone 6 (upstream proxying)
### Exit Criteria
- All Milestone 8 tests pass
### Tasks
- [x] **8.1** Expose placement introspection and status
  - Current placement revision per service kind
  - Effective routing decisions for a given tenant (admin-only)
- [x] **8.2** Define and implement readiness gates used by rebalancer
  - Projection: warmup/catchup signal (lag)
  - Runner: tenant drained / checkpoint stable signal
  - Aggregate: tenant drain and state availability signal (as defined by upstream changes)
- [x] **8.3** Add operator-facing rebalancing endpoints (optional if a separate rebalancer service exists)
  - Plan/apply/rollback APIs with strong authorization
### Tests
- [x] **T8.1** Placement revision changes are visible immediately and atomically
- [x] **T8.2** Rebalancing guardrails prevent cutover when target shard is not ready
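
The readiness gates in 8.2 and the guardrail in T8.2 can be modeled as a single check that must pass before cutover. The sketch below is an assumption about how the three signals might be represented; `Readiness`, `CutoverBlocked`, `check_cutover`, and the lag threshold parameter are all hypothetical.

```rust
/// Readiness signal reported for the target shard, one variant per service kind.
#[derive(Debug, PartialEq)]
pub enum Readiness {
    Projection { lag_events: u64 },
    Runner { drained: bool, checkpoint_stable: bool },
    Aggregate { drained: bool, state_available: bool },
}

/// Why a cutover was refused.
#[derive(Debug, PartialEq)]
pub enum CutoverBlocked {
    ProjectionLagging { lag_events: u64, max_lag_events: u64 },
    RunnerNotDrained,
    RunnerCheckpointUnstable,
    AggregateNotDrained,
    AggregateStateUnavailable,
}

/// Returns `Ok(())` only when the signal satisfies its gate.
pub fn check_cutover(signal: &Readiness, max_lag_events: u64) -> Result<(), CutoverBlocked> {
    match *signal {
        Readiness::Projection { lag_events } if lag_events > max_lag_events => {
            Err(CutoverBlocked::ProjectionLagging { lag_events, max_lag_events })
        }
        Readiness::Runner { drained: false, .. } => Err(CutoverBlocked::RunnerNotDrained),
        Readiness::Runner { checkpoint_stable: false, .. } => {
            Err(CutoverBlocked::RunnerCheckpointUnstable)
        }
        Readiness::Aggregate { drained: false, .. } => Err(CutoverBlocked::AggregateNotDrained),
        Readiness::Aggregate { state_available: false, .. } => {
            Err(CutoverBlocked::AggregateStateUnavailable)
        }
        _ => Ok(()),
    }
}
```

Returning a typed reason rather than a boolean gives the plan/apply/rollback APIs something concrete to report to the operator.
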
---
## Milestone 9: Docker Swarm Deployment + HA (Max 2 Replicas)
**Goal:** Define and validate the Docker Swarm architecture for Gateway, including HA behavior with at most **2 Gateway replicas**.
### Dependencies
- Milestone 1 (health/ready/metrics)
- Milestone 2 (persistent state suitable for HA)
- Milestone 6 (proxying) for end-to-end smoke tests
### Exit Criteria
- All Milestone 9 tests pass
- The platform stack (`swarm/stacks/platform.yml`) can deploy the Gateway with `replicas: 2` and serve traffic during rolling updates
### Tasks
- [x] **9.1** Build container image
  - Dockerfile, multi-stage build, minimal runtime image
  - Embed build metadata (version, git sha)
- [x] **9.2** Define Swarm service topology (2 nodes max)
  - `gateway` service with `deploy.replicas: 2`
  - Healthcheck based on `/ready`
  - Rolling update strategy (start-first), rollback policy on failure
  - Network: overlay network for internal traffic to NATS and nodes
- [x] **9.3** Define ingress and TLS termination strategy
  - Swarm routing mesh or an ingress proxy (document choice in stack)
  - Ensure HTTP and gRPC can be routed correctly
- [x] **9.4** Secrets and config distribution
  - OIDC client secrets, JWT signing keys (rotation-ready), NATS credentials
  - Use Swarm secrets/configs instead of environment variables for secrets where possible
- [x] **9.5** HA behavior validation
  - Run two replicas and ensure:
    - refresh token rotation works across replicas (no stickiness)
    - admin IAM updates are visible from both replicas
    - in-flight requests survive a single replica restart
### Tests
- [x] **T9.1** `swarm/stacks/platform.yml` parses as valid YAML
- [x] **T9.2** Smoke: deploy 2 replicas and confirm `/ready` is healthy on both
- [x] **T9.3** Rolling update does not drop readiness below 1 available replica
- [x] **T9.4** Auth session/refresh works across replicas (no sticky sessions required)
---
## Milestone 10: Observability + Hardening
**Goal:** Make the Gateway production-ready with robust telemetry and safety defaults.
### Dependencies
- Milestone 6 (proxying)
### Exit Criteria
- All Milestone 10 tests pass
### Tasks
- [x] **10.1** Structured logs with correlation
  - `request_id`, `trace_id`, principal id, tenant id (when present), upstream target
- [x] **10.2** Metrics
  - Request counts/latency, auth failures, upstream errors, routing misses, rate limit blocks
- [x] **10.3** Security hardening
  - CSRF protections for cookie-based flows
  - JWT key rotation strategy and config
  - mTLS/service auth boundary for internal upstreams
- [x] **10.4** Load and failure testing strategy
  - Soak tests for routing reload + auth endpoints
  - Backpressure/timeouts/circuit breaker verification
- [ ] **10.5** Correlation and trace context propagation (Gateway as source of truth)
  - Accept inbound `x-correlation-id` and `traceparent` on HTTP and gRPC requests
  - If missing, generate `x-correlation-id` at the start of request handling and start a new trace
  - Echo `x-correlation-id` (and `traceparent` when applicable) on responses
  - Propagate `x-correlation-id` and `traceparent` to upstream nodes (Aggregate/Projection/Runner) and record them in request spans/log fields
### Tests
- [x] **T10.1** Metrics include expected labels and counters increment correctly
- [x] **T10.2** Secrets never appear in logs in representative error cases
- [x] **T10.3** Rate limits trigger under abusive patterns
- [ ] **T10.4** Gateway generates `x-correlation-id` when missing and echoes it on responses
- [ ] **T10.5** Gateway propagates `x-correlation-id` and `traceparent` to upstream calls and includes them in logs/spans
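
The accept-or-generate rule in 10.5 and tests T10.4/T10.5 can be sketched independently of any HTTP or gRPC framework. In the sketch below, headers are a plain map with lowercase names, the id generator is injected by the caller, and starting a new trace when `traceparent` is absent is left to the tracing layer. `RequestContext`, `resolve_context`, and `outbound_headers` are hypothetical names.

```rust
use std::collections::HashMap;

pub const CORRELATION_HEADER: &str = "x-correlation-id";
pub const TRACEPARENT_HEADER: &str = "traceparent";

/// Correlation data attached to one request for its whole lifetime.
#[derive(Debug, PartialEq)]
pub struct RequestContext {
    pub correlation_id: String,
    pub traceparent: Option<String>,
    /// True when the Gateway generated the id because none was supplied.
    pub generated: bool,
}

/// Accepts an inbound correlation id, or generates one if it is missing or blank.
/// `inbound` is assumed to use lowercase header names.
pub fn resolve_context(
    inbound: &HashMap<String, String>,
    generate_id: impl FnOnce() -> String,
) -> RequestContext {
    let inbound_id = inbound
        .get(CORRELATION_HEADER)
        .map(|value| value.trim())
        .filter(|value| !value.is_empty());
    let (correlation_id, generated) = match inbound_id {
        Some(id) => (id.to_string(), false),
        None => (generate_id(), true),
    };
    RequestContext {
        correlation_id,
        traceparent: inbound.get(TRACEPARENT_HEADER).cloned(),
        generated,
    }
}

/// Headers to set on the response echo and on every upstream call.
pub fn outbound_headers(ctx: &RequestContext) -> Vec<(&'static str, String)> {
    let mut headers = vec![(CORRELATION_HEADER, ctx.correlation_id.clone())];
    if let Some(traceparent) = &ctx.traceparent {
        headers.push((TRACEPARENT_HEADER, traceparent.clone()));
    }
    headers
}
```

Using the same `outbound_headers` output for both the response and the upstream request keeps the echoed id and the propagated id from drifting apart.
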