# Development Plan: Gateway

## Overview

This plan breaks down the Gateway implementation into milestones ordered by dependency. Each milestone includes:

- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones

**Development Approach:**

1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. Do not start the next milestone until the current milestone’s tests are passing (green)
4. Mark tasks complete with `[x]` as you progress

---

## Milestone 1: Project Foundation

**Goal:** Create the Gateway service as a Rust project aligned with existing node conventions (Axum + Tokio + tracing + Prometheus metrics).

### Tasks

- [x] **1.1** Initialize Cargo project
  - Create `src/lib.rs` and `src/main.rs`
  - Establish module layout for: http, grpc, authn, authz, routing, upstream, observability, config, storage
- [x] **1.2** Choose and wire core dependencies (aligned with existing services)
  - HTTP: `axum`
  - gRPC: `tonic`
  - Runtime: `tokio`
  - Serialization: `serde`, `serde_json`
  - Errors: `thiserror`, `anyhow`
  - Telemetry: `tracing`, `metrics-exporter-prometheus` or existing metrics pattern in the codebase
- [x] **1.3** Add baseline runtime endpoints
  - `GET /health`, `GET /ready`, `GET /metrics`
  - Structured logs with request id propagation

### Tests

- [x] **T1.1** Project compiles
- [x] **T1.2** `GET /health` returns 200
- [x] **T1.3** Tautological test: core state types are Send + Sync
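
A minimal sketch of the tautological check in T1.3, using only the standard library. `AppState` and its `routes` field are hypothetical stand-ins for the real core state types; the point is that the check happens at compile time.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared state; the real Gateway state type will differ.
#[derive(Clone, Default)]
pub struct AppState {
    pub routes: Arc<RwLock<HashMap<String, String>>>,
}

/// Compiles only when `T` is both `Send` and `Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Nothing is checked at runtime: if `AppState` ever stops being
/// `Send + Sync`, this function fails to compile.
pub fn core_state_is_send_sync() -> bool {
    assert_send_sync::<AppState>();
    true
}
```

Adding a non-thread-safe field such as `Rc<String>` to `AppState` would turn this into a build error, which is the regression the test is meant to catch.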

---

## Milestone 2: Persistent State (Auth + RBAC + Sessions) for HA

**Goal:** Define where Gateway state lives so the service can run as **HA (max 2 replicas)** without sticky sessions and without losing auth/admin consistency.

### Dependencies

- Milestone 1 (project foundation)

### Tasks

- [x] **2.1** Choose and implement the backing store for identity + authorization state
  - Recommended default for platform alignment: NATS JetStream KV buckets for:
    - users
    - identities (OIDC links)
    - password credential records (hash only)
    - refresh token/session records (hash only, revocable, rotating)
    - MFA enrollments + recovery codes (hash only)
    - rights/roles/assignments
    - audit log index (append-only model)
- [x] **2.2** Define storage schema + versioning
  - Key naming conventions
  - JSON shapes (forward compatible)
  - Migration strategy for schema changes
- [x] **2.3** Implement storage client abstraction
  - CRUD primitives with compare-and-set semantics where needed (e.g., refresh rotation)
  - Pagination/scan strategy for admin listing endpoints
  - Consistent error mapping for storage failures

### Tests

- [x] **T2.1** Sensitive values are stored only as hashes (reset tokens, refresh tokens, recovery codes)
- [x] **T2.2** Refresh token rotation is atomic (cannot be used twice under concurrency)
- [x] **T2.3** Tautological test: storage client is Send + Sync
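
The compare-and-set requirement in 2.3 and the atomic rotation in T2.2 can be sketched with an in-memory store. `MemoryKv`, `CasError`, and `rotate` are hypothetical names, and the revision counter only imitates the kind of per-key revision a KV store might expose; this is not the JetStream client API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum CasError {
    /// The key's current revision did not match the expected one.
    RevisionMismatch,
}

/// Hypothetical in-memory stand-in for a revisioned KV bucket.
#[derive(Default)]
pub struct MemoryKv {
    entries: Mutex<HashMap<String, (u64, String)>>, // key -> (revision, value)
}

impl MemoryKv {
    /// Returns the current revision and value for `key`, if present.
    pub fn get(&self, key: &str) -> Option<(u64, String)> {
        self.entries.lock().unwrap().get(key).cloned()
    }

    /// Writes `value` only if the stored revision equals `expected`
    /// (`0` means the key must not exist yet). Returns the new revision.
    pub fn compare_and_set(&self, key: &str, expected: u64, value: &str) -> Result<u64, CasError> {
        let mut entries = self.entries.lock().unwrap();
        let current = entries.get(key).map(|(rev, _)| *rev).unwrap_or(0);
        if current != expected {
            return Err(CasError::RevisionMismatch);
        }
        let next = current + 1;
        entries.insert(key.to_string(), (next, value.to_string()));
        Ok(next)
    }
}

/// Marks a session record as rotated. Of all callers that observed the
/// same revision, exactly one can succeed.
pub fn rotate(kv: &MemoryKv, session_key: &str, seen_revision: u64) -> Result<u64, CasError> {
    kv.compare_and_set(session_key, seen_revision, "rotated")
}
```

When two replicas race to redeem the same refresh token, the loser gets `RevisionMismatch` and can reject the request instead of issuing a second token pair.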

---

## Milestone 3: Routing Config + Service Discovery

**Goal:** Implement the routing layer described in [prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md), supporting independent placement per service kind and hot reload.

### Dependencies

- Milestone 1 (project foundation)

### Exit Criteria

- All Milestone 3 tests pass

### Tasks

- [x] **3.1** Define routing config model (in-memory)
  - Placement maps per service kind: `aggregate_placement`, `projection_placement`, `runner_placement`
  - Shard directory per service kind: `*_shards[shard_id] -> endpoint(s)`
  - Revision tracking and last-known-good semantics
- [x] **3.2** Implement config sources
  - Static file config for local development
  - NATS JetStream KV watcher for production
- [x] **3.3** Implement routing decision API
  - `(tenant_id, service_kind) -> selected endpoint`
  - Admin introspection: `GET /admin/routing`
- [x] **3.4** Implement config reload semantics
  - `POST /admin/routing/reload` to force refresh
  - Watcher-based reload that updates atomically

### Tests

- [x] **T3.1** Routing resolves endpoints for `(tenant_id, service_kind)` correctly
- [x] **T3.2** Hot reload swaps routing tables atomically (no partial reads)
- [x] **T3.3** Unknown tenant returns a consistent, typed routing error
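
One way to get the atomic swap, last-known-good behavior, and typed routing error described above is to publish immutable snapshots behind an `Arc`. The sketch below is an assumption about shape, not the actual model: it folds the per-kind placement maps and shard directories into two maps keyed by `ServiceKind`, uses a `u32` shard id, and maps each shard to a single endpoint.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

#[derive(Debug, PartialEq)]
pub enum RoutingError {
    UnknownTenant { tenant_id: String, kind: ServiceKind },
    UnknownShard { shard_id: u32, kind: ServiceKind },
}

/// One immutable snapshot of the routing config (hypothetical layout).
#[derive(Debug, Default)]
pub struct RoutingTable {
    pub revision: u64,
    pub placement: HashMap<(String, ServiceKind), u32>, // (tenant, kind) -> shard id
    pub shards: HashMap<(ServiceKind, u32), String>,    // (kind, shard id) -> endpoint
}

impl RoutingTable {
    pub fn resolve(&self, tenant_id: &str, kind: ServiceKind) -> Result<&str, RoutingError> {
        let shard_id = *self
            .placement
            .get(&(tenant_id.to_string(), kind))
            .ok_or_else(|| RoutingError::UnknownTenant { tenant_id: tenant_id.to_string(), kind })?;
        self.shards
            .get(&(kind, shard_id))
            .map(String::as_str)
            .ok_or(RoutingError::UnknownShard { shard_id, kind })
    }
}

pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(table: RoutingTable) -> Self {
        Router { current: RwLock::new(Arc::new(table)) }
    }

    /// Readers take a whole snapshot, so they never see a half-updated table.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Swaps in `next` only if its revision is newer; otherwise the
    /// last-known-good table stays in place and `false` is returned.
    pub fn reload(&self, next: RoutingTable) -> bool {
        let mut current = self.current.write().unwrap();
        if next.revision <= current.revision {
            return false;
        }
        *current = Arc::new(next);
        true
    }
}
```

A request that took a snapshot before a reload keeps resolving against that snapshot until it finishes, which is what "no partial reads" in T3.2 amounts to in this design.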

---

## Milestone 4: AuthN Core (Tokens, Passwords, OIDC, MFA)

**Goal:** Implement the authentication layer and the public AuthN HTTP APIs described in the PRD: signup/signin/signout/refresh/forgot/reset and MFA primitives.

### Dependencies

- Milestone 1 (project foundation)
- Milestone 2 (persistent state)

### Exit Criteria

- All Milestone 4 tests pass

### Tasks

- [x] **4.1** Implement token model
  - Access token (short-lived)
  - Refresh token (rotating, revocable)
  - Key rotation for signing keys
- [x] **4.2** Implement password flows
  - `POST /v1/auth/signup`, `POST /v1/auth/signin`, `POST /v1/auth/signout`, `POST /v1/auth/refresh`
  - Forgot/reset: `POST /v1/auth/forgot`, `POST /v1/auth/reset`
- [x] **4.3** Implement Google OIDC integration points
  - `POST /v1/auth/oidc/google/start`
  - `GET /v1/auth/oidc/google/callback`
  - Account linking rules
- [x] **4.4** Implement MFA (TOTP) primitives
  - Enrollment start/confirm
  - Challenge and verification
  - Recovery codes
- [x] **4.5** Abuse protections
  - Rate limits for signin/forgot/reset
  - Generic responses that do not reveal whether an account exists (no distinct “account not found” error) where appropriate

### Tests

- [x] **T4.1** Password hashing/verification works (Argon2id)
- [x] **T4.2** Refresh token rotation: old refresh token is invalid after use
- [x] **T4.3** Forgot/reset tokens are one-time and expire
- [x] **T4.4** MFA TOTP enrollment and challenge succeed for valid codes and fail for invalid
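
For the rate limits in 4.5, a fixed-window counter is one simple option. The sketch below is an illustration only: the limiter type, the key format, and the limit/window values are assumptions, and the clock is passed in so the behavior is deterministic. With two replicas the counters would need to live in shared storage rather than process memory.

```rust
use std::collections::HashMap;

/// Fixed-window rate limiter (hypothetical; limits are illustrative).
pub struct RateLimiter {
    limit: u32,
    window_secs: u64,
    windows: HashMap<String, (u64, u32)>, // key -> (window start, hits in window)
}

impl RateLimiter {
    pub fn new(limit: u32, window_secs: u64) -> Self {
        RateLimiter { limit, window_secs, windows: HashMap::new() }
    }

    /// Records one attempt for `key` at `now_secs` and reports whether
    /// the attempt is allowed.
    pub fn check(&mut self, key: &str, now_secs: u64) -> bool {
        let entry = self.windows.entry(key.to_string()).or_insert((now_secs, 0));
        if now_secs.saturating_sub(entry.0) >= self.window_secs {
            // The previous window has elapsed: start a new one.
            *entry = (now_secs, 0);
        }
        if entry.1 >= self.limit {
            return false;
        }
        entry.1 += 1;
        true
    }
}
```

Keying by route plus client address (for example `signin:203.0.113.7`) keeps one abusive client from exhausting the budget of others; keying by account identifier as well would also slow down targeted guessing.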

---

## Milestone 5: AuthZ (RBAC) + Tenant Enforcement

**Goal:** Enforce authorization decisions at the Gateway boundary, including tenant selection rules for `x-tenant-id`.

### Dependencies

- Milestone 4 (authn)

### Exit Criteria

- All Milestone 5 tests pass

### Tasks

- [x] **5.1** Define RBAC model
  - Rights (permissions), roles, assignments (principal ↔ tenant ↔ role)
  - Platform admin vs tenant admin vs tenant member scoping rules
- [x] **5.2** Implement authorization engine
  - Inputs: principal, tenant_id, action, resource attributes (aggregate_type, view_type)
  - Outputs: allow/deny with reason
- [x] **5.3** Enforce `x-tenant-id` rules
  - Required on tenant-scoped endpoints
  - Validated format and tenant membership checks
- [x] **5.4** Add consistent error envelope mapping (401/403/400)

### Tests

- [x] **T5.1** Tenant spoofing is rejected (principal lacks membership)
- [x] **T5.2** Role assignment enables expected actions and denies others
- [x] **T5.3** Missing `x-tenant-id` on tenant routes returns 400
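
The decision flow from 5.2 to 5.4 can be sketched as a single function that returns allow, or deny with an HTTP status and a reason. Everything here is an assumption for illustration: the `Rbac` layout, one role per principal and tenant, the tenant id format, the order of the checks, and the right names. Resource attributes are left out to keep the sketch short.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny { status: u16, reason: &'static str },
}

/// Hypothetical RBAC snapshot.
#[derive(Default)]
pub struct Rbac {
    pub role_rights: HashMap<String, HashSet<String>>, // role -> rights
    pub assignments: HashMap<(String, String), String>, // (principal, tenant) -> role
}

impl Rbac {
    pub fn authorize(&self, principal: Option<&str>, tenant_header: Option<&str>, action: &str) -> Decision {
        let Some(principal) = principal else {
            return Decision::Deny { status: 401, reason: "unauthenticated" };
        };
        let Some(tenant_id) = tenant_header.filter(|t| is_valid_tenant_id(t)) else {
            return Decision::Deny { status: 400, reason: "missing or malformed x-tenant-id" };
        };
        let key = (principal.to_string(), tenant_id.to_string());
        let Some(role) = self.assignments.get(&key) else {
            return Decision::Deny { status: 403, reason: "principal is not a member of tenant" };
        };
        let allowed = self
            .role_rights
            .get(role)
            .map_or(false, |rights| rights.contains(action));
        if allowed {
            Decision::Allow
        } else {
            Decision::Deny { status: 403, reason: "role lacks required right" }
        }
    }
}

/// Assumed format: 1 to 64 ASCII letters, digits, `-` or `_`.
fn is_valid_tenant_id(value: &str) -> bool {
    !value.is_empty()
        && value.len() <= 64
        && value.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}
```

Keeping the status code inside the decision makes the 401/403/400 envelope mapping a pure function of the result, so handlers cannot drift apart in how they report the same denial.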

---

## Milestone 6: Upstream Proxying (Aggregate / Projection / Runner)

**Goal:** Route authenticated and authorized requests to the node services.

### Dependencies

- Milestone 3 (routing)
- Milestone 5 (authz)

### Exit Criteria

- All Milestone 6 tests pass

### Tasks

- [x] **6.1** Aggregate submit command proxy
  - gRPC server implementing `aggregate.gateway.v1.CommandService/SubmitCommand`
  - HTTP wrapper `POST /v1/commands/{aggregate_type}/{aggregate_id}`
  - Propagate `x-tenant-id` and correlation metadata
  - Ensure safe retry semantics using `command_id` idempotency
- [x] **6.2** Projection query proxy
  - `POST /v1/query/{view_type}` forwarding to Projection query endpoint once available
- [x] **6.3** Runner admin passthrough (admin-only)
  - `/admin/runner/*` forwarding with strict authorization

### Tests

- [x] **T6.1** gRPC SubmitCommand forwards tenant metadata and returns upstream events
- [x] **T6.2** HTTP command endpoint returns the same shape as gRPC response
- [x] **T6.3** Query endpoint enforces tenant scoping and denies unauthorized callers
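
The retry semantics in 6.1 depend on every retry carrying the same `command_id`. The sketch below assumes the upstream deduplicates on that id; `Command`, `UpstreamError`, `submit_with_retry`, and `FakeAggregate` are hypothetical names, and the fake upstream exists only to show a retry after a lost response being applied once.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Command {
    pub tenant_id: String,
    pub command_id: String,
    pub payload: String,
}

#[derive(Debug, PartialEq)]
pub enum UpstreamError {
    /// Timeout, connection reset, and similar: safe to retry.
    Transient,
    /// The upstream processed the command and refused it: do not retry.
    Rejected(String),
}

/// Retries only transient failures and always resends the same command,
/// so the `command_id` is identical on every attempt.
pub fn submit_with_retry<F>(cmd: &Command, max_attempts: u32, mut send: F) -> Result<Vec<String>, UpstreamError>
where
    F: FnMut(&Command) -> Result<Vec<String>, UpstreamError>,
{
    for _ in 0..max_attempts {
        match send(cmd) {
            Ok(events) => return Ok(events),
            Err(UpstreamError::Transient) => continue,
            Err(other) => return Err(other),
        }
    }
    Err(UpstreamError::Transient)
}

/// Hypothetical upstream stand-in that applies each `command_id` at most once
/// and replays the original events for duplicates.
#[derive(Default)]
pub struct FakeAggregate {
    applied: HashMap<String, Vec<String>>,
    pub applications: u32,
}

impl FakeAggregate {
    pub fn handle(&mut self, cmd: &Command) -> Vec<String> {
        if let Some(events) = self.applied.get(&cmd.command_id) {
            return events.clone();
        }
        self.applications += 1;
        let events = vec![format!("applied:{}", cmd.payload)];
        self.applied.insert(cmd.command_id.clone(), events.clone());
        events
    }
}
```

Generating a fresh `command_id` per attempt would defeat the scheme, so the id should be fixed before the first send, ideally by the client.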

---

## Milestone 7: Admin IAM APIs (Users, Roles, Rights)

**Goal:** Expose the admin IAM endpoints for the Admin UI node to manage authn/authz data.

### Dependencies

- Milestone 4 (authn)
- Milestone 5 (authz)

### Exit Criteria

- All Milestone 7 tests pass

### Tasks

- [x] **7.1** Implement admin IAM endpoints
  - Users CRUD and disable/delete
  - Identities link/unlink (OIDC), manage password credentials
  - Rights CRUD, roles CRUD, role↔rights management
  - Assignments CRUD (principal ↔ tenant ↔ role)
  - Service accounts credential create/rotate and tenant role assignment
  - MFA admin actions (reset MFA, revoke recovery codes)
  - Session revocation for user (global signout)
- [x] **7.2** Implement audit trail for admin IAM actions
  - Immutable record of actor, action, target, tenant scope, timestamp, request metadata

### Tests

- [x] **T7.1** Only platform/tenant admins can access relevant endpoints
- [x] **T7.2** All admin mutations emit an audit record
- [x] **T7.3** Assignment changes immediately affect authorization decisions
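
A sketch of the audit record from 7.2 and one way to make T7.2 hard to violate: route every admin mutation through a wrapper that appends the record. The field names, the `audited` helper, and the choice to record only successful mutations are assumptions; a real implementation may also want to record denied or failed attempts.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct AuditRecord {
    pub actor: String,
    pub action: String,
    pub target: String,
    pub tenant_scope: Option<String>, // None = platform-wide action
    pub timestamp_ms: u64,
    pub request_id: String,
}

/// Append-only log: records can be added and read, never edited or removed.
#[derive(Default)]
pub struct AuditLog {
    records: Vec<AuditRecord>,
}

impl AuditLog {
    pub fn append(&mut self, record: AuditRecord) {
        self.records.push(record);
    }

    pub fn records(&self) -> &[AuditRecord] {
        &self.records
    }
}

/// Runs an admin mutation and appends `record` only if it succeeded.
pub fn audited<T, E>(
    log: &mut AuditLog,
    record: AuditRecord,
    mutation: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let value = mutation()?;
    log.append(record);
    Ok(value)
}
```

With a remote store the append itself can fail after the mutation has been committed, so the ordering and failure handling would need an explicit decision that this in-memory sketch does not show.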

---

## Milestone 8: Rebalancing Operations (Control Plane Hooks)

**Goal:** Provide the pieces needed to support tenant rebalancing as described in the PRD: visibility, readiness gates, and safe cutover support.

### Dependencies

- Milestone 3 (routing config)
- Milestone 6 (upstream proxying)

### Exit Criteria

- All Milestone 8 tests pass

### Tasks

- [x] **8.1** Expose placement introspection and status
  - Current placement revision per service kind
  - Effective routing decisions for a given tenant (admin-only)
- [x] **8.2** Define and implement readiness gates used by rebalancer
  - Projection: warmup/catchup signal (lag)
  - Runner: tenant drained / checkpoint stable signal
  - Aggregate: tenant drain and state availability signal (as defined by upstream changes)
- [x] **8.3** Add operator-facing rebalancing endpoints (optional if a separate rebalancer service exists)
  - Plan/apply/rollback APIs with strong authorization

### Tests

- [x] **T8.1** Placement revision changes are visible immediately and atomically
- [x] **T8.2** Rebalancing guardrails prevent cutover when target shard is not ready
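
The readiness gates in 8.2 and the guardrail in T8.2 can be sketched as a pure check over the signals reported for the target shard. The struct fields, the lag unit (events), and the threshold parameter are assumptions; the actual signals are defined by the upstream services.

```rust
/// Hypothetical readiness signals for the target of a tenant move.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct TargetReadiness {
    pub projection_lag_events: u64,
    pub runner_drained: bool,
    pub aggregate_state_available: bool,
}

#[derive(Debug, PartialEq)]
pub enum CutoverBlocker {
    ProjectionLagging { lag: u64, max: u64 },
    RunnerNotDrained,
    AggregateStateUnavailable,
}

/// Returns every reason the cutover must not proceed, so an operator
/// sees all blockers at once instead of one per attempt.
pub fn check_cutover(readiness: &TargetReadiness, max_lag: u64) -> Result<(), Vec<CutoverBlocker>> {
    let mut blockers = Vec::new();
    if readiness.projection_lag_events > max_lag {
        blockers.push(CutoverBlocker::ProjectionLagging {
            lag: readiness.projection_lag_events,
            max: max_lag,
        });
    }
    if !readiness.runner_drained {
        blockers.push(CutoverBlocker::RunnerNotDrained);
    }
    if !readiness.aggregate_state_available {
        blockers.push(CutoverBlocker::AggregateStateUnavailable);
    }
    if blockers.is_empty() {
        Ok(())
    } else {
        Err(blockers)
    }
}
```

An apply endpoint would call this immediately before publishing the new placement revision and refuse to publish on `Err`.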

---

## Milestone 9: Docker Swarm Deployment + HA (Max 2 Replicas)

**Goal:** Define and validate the Docker Swarm architecture for Gateway, including HA behavior with at most **2 Gateway replicas**.

### Dependencies

- Milestone 1 (health/ready/metrics)
- Milestone 2 (persistent state suitable for HA)
- Milestone 6 (proxying) for end-to-end smoke tests

### Exit Criteria

- All Milestone 9 tests pass
- The platform stack (`swarm/stacks/platform.yml`) can deploy the Gateway with `replicas: 2` and serve traffic during rolling updates

### Tasks

- [x] **9.1** Build container image
  - Dockerfile, multi-stage build, minimal runtime image
  - Embed build metadata (version, git sha)
- [x] **9.2** Define Swarm service topology (2 nodes max)
  - `gateway` service with `deploy.replicas: 2`
  - Healthcheck based on `/ready`
  - Rolling update strategy (start-first), rollback policy on failure
  - Network: overlay network for internal traffic to NATS and nodes
- [x] **9.3** Define ingress and TLS termination strategy
  - Swarm routing mesh or an ingress proxy (document choice in stack)
  - Ensure HTTP and gRPC can be routed correctly
- [x] **9.4** Secrets and config distribution
  - OIDC client secrets, JWT signing keys (rotation-ready), NATS credentials
  - Use Swarm secrets/configs instead of environment variables for secrets where possible
- [x] **9.5** HA behavior validation
  - Run two replicas and ensure:
    - refresh token rotation works across replicas (no stickiness)
    - admin IAM updates are visible from both replicas
    - in-flight requests survive a single replica restart

### Tests

- [x] **T9.1** `swarm/stacks/platform.yml` parses as valid YAML
- [x] **T9.2** Smoke: deploy 2 replicas and confirm `/ready` is healthy on both
- [x] **T9.3** Rolling update does not drop readiness below 1 available replica
- [x] **T9.4** Auth session/refresh works across replicas (no sticky sessions required)

---

## Milestone 10: Observability + Hardening

**Goal:** Make the Gateway production-ready with robust telemetry and safety defaults.

### Dependencies

- Milestone 6 (proxying)

### Exit Criteria

- All Milestone 10 tests pass

### Tasks

- [x] **10.1** Structured logs with correlation
  - `request_id`, `trace_id`, principal id, tenant id (when present), upstream target
- [x] **10.2** Metrics
  - Request counts/latency, auth failures, upstream errors, routing misses, rate limit blocks
- [x] **10.3** Security hardening
  - CSRF protections for cookie-based flows
  - JWT key rotation strategy and config
  - mTLS/service auth boundary for internal upstreams
- [x] **10.4** Load and failure testing strategy
  - Soak tests for routing reload + auth endpoints
  - Backpressure/timeouts/circuit breaker verification
- [ ] **10.5** Correlation and trace context propagation (Gateway as source of truth)
  - Accept inbound `x-correlation-id` and `traceparent` on HTTP and gRPC requests
  - If missing, generate `x-correlation-id` at the start of request handling and start a new trace
  - Echo `x-correlation-id` (and `traceparent` when applicable) on responses
  - Propagate `x-correlation-id` and `traceparent` to upstream nodes (Aggregate/Projection/Runner) and record them in request spans/log fields

### Tests

- [x] **T10.1** Metrics include expected labels and counters increment correctly
- [x] **T10.2** Secrets never appear in logs in representative error cases
- [x] **T10.3** Rate limits trigger under abusive patterns
- [ ] **T10.4** Gateway generates `x-correlation-id` when missing and echoes it on responses
- [ ] **T10.5** Gateway propagates `x-correlation-id` and `traceparent` to upstream calls and includes them in logs/spans
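
The accept-or-generate rule in 10.5 can be sketched independently of the HTTP and gRPC layers. The function names, the lowercase header map, and the injected id generator are assumptions; starting a new trace when `traceparent` is absent is left to the tracing layer and is not shown. The `traceparent` check only validates the four-field hex shape of a version `00` header.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub struct RequestContext {
    pub correlation_id: String,
    pub traceparent: Option<String>,
    /// True when the Gateway had to generate the correlation id itself.
    pub generated: bool,
}

/// Loose shape check: `version-traceid-parentid-flags` as
/// 2, 32, 16 and 2 hex characters.
pub fn looks_like_traceparent(value: &str) -> bool {
    let parts: Vec<&str> = value.split('-').collect();
    parts.len() == 4
        && parts[0].len() == 2
        && parts[1].len() == 32
        && parts[2].len() == 16
        && parts[3].len() == 2
        && parts.iter().all(|p| p.chars().all(|c| c.is_ascii_hexdigit()))
}

/// Header names are expected in lowercase (assumed to be normalized by the caller).
pub fn resolve_context(
    headers: &HashMap<String, String>,
    new_id: impl FnOnce() -> String,
) -> RequestContext {
    let traceparent = headers
        .get("traceparent")
        .filter(|v| looks_like_traceparent(v))
        .cloned();
    match headers.get("x-correlation-id").filter(|v| !v.trim().is_empty()) {
        Some(id) => RequestContext { correlation_id: id.clone(), traceparent, generated: false },
        None => RequestContext { correlation_id: new_id(), traceparent, generated: true },
    }
}

/// Headers to echo on the response and to attach to upstream calls.
pub fn outbound_headers(ctx: &RequestContext) -> Vec<(String, String)> {
    let mut out = vec![("x-correlation-id".to_string(), ctx.correlation_id.clone())];
    if let Some(tp) = &ctx.traceparent {
        out.push(("traceparent".to_string(), tp.clone()));
    }
    out
}
```

Using one `outbound_headers` helper for both the response echo and the upstream call keeps T10.4 and T10.5 checking the same values.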