Development Plan: Gateway
Overview
This plan breaks down the Gateway implementation into milestones ordered by dependency. Each milestone includes:
- Tasks with clear deliverables
- Test Requirements (unit tests + tautological tests + integration tests where applicable)
- Dependencies on previous milestones
Development Approach:
- Complete one milestone at a time
- Write tests before implementation (TDD where applicable)
- Do not start the next milestone until the current milestone’s tests are passing (green)
- Mark tasks complete with [x] as you progress
Milestone 1: Project Foundation
Goal: Create the Gateway service as a Rust project aligned with existing node conventions (Axum + Tokio + tracing + Prometheus metrics).
Tasks
- 1.1 Initialize Cargo project
- Create src/lib.rs and src/main.rs
- Establish module layout for: http, grpc, authn, authz, routing, upstream, observability, config, storage
- 1.2 Choose and wire core dependencies (aligned with existing services)
- HTTP: axum
- gRPC: tonic
- Runtime: tokio
- Serialization: serde, serde_json
- Errors: thiserror, anyhow
- Telemetry: tracing, metrics-exporter-prometheus or existing metrics pattern in the codebase
- 1.3 Add baseline runtime endpoints
- GET /health, GET /ready, GET /metrics
- Structured logs with request id propagation
Tests
- T1.1 Project compiles
- T1.2 GET /health returns 200
- T1.3 Tautological test: core state types are Send + Sync
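A minimal sketch of what T1.3 could look like, using only the standard library. The AppState struct and its single field are hypothetical placeholders; the real Gateway state will differ. The check is enforced at compile time, so the test body is trivially true once it builds.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared state (assumption); the real Gateway state will hold more.
#[derive(Clone, Default)]
pub struct AppState {
    /// tenant_id -> upstream endpoint, standing in for the routing table.
    pub routes: Arc<RwLock<HashMap<String, String>>>,
}

/// Compiles only when `T` is both `Send` and `Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}

/// The type system does the work: if this function compiles, the property holds.
pub fn core_state_is_send_sync() -> bool {
    assert_send_sync::<AppState>();
    true
}
```

If a non-thread-safe field such as Rc is added to AppState later, this stops compiling, which is the point of the test.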
Milestone 2: Persistent State (Auth + RBAC + Sessions) for HA
Goal: Define where Gateway state lives so the service can run as HA (max 2 replicas) without sticky sessions and without losing auth/admin consistency.
Dependencies
- Milestone 1 (project foundation)
Tasks
- 2.1 Choose and implement the backing store for identity + authorization state
- Recommended default for platform alignment: NATS JetStream KV buckets for:
- users
- identities (OIDC links)
- password credential records (hash only)
- refresh token/session records (hash only, revocable, rotating)
- MFA enrollments + recovery codes (hash only)
- rights/roles/assignments
- audit log index (append-only model)
- 2.2 Define storage schema + versioning
- Key naming conventions
- JSON shapes (forward compatible)
- Migration strategy for schema changes
- 2.3 Implement storage client abstraction
- CRUD primitives with compare-and-set semantics where needed (e.g., refresh rotation)
- Pagination/scan strategy for admin listing endpoints
- Consistent error mapping for storage failures
Tests
- T2.1 Sensitive values are stored only as hashes (reset tokens, refresh tokens, recovery codes)
- T2.2 Refresh token rotation is atomic (cannot be used twice under concurrency)
- T2.3 Tautological test: storage client is Send + Sync
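The compare-and-set requirement in 2.3 and the concurrency check in T2.2 can be sketched with an in-memory stand-in for the backing store. MemoryKv, StoreError, and the key and value strings below are illustrative assumptions, not the real storage client; the idea is only that a rotation succeeds when the stored revision still matches the one the caller read.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, PartialEq)]
pub enum StoreError {
    NotFound,
    RevisionMismatch,
}

/// In-memory stand-in for a KV bucket (assumption): key -> (value, revision).
#[derive(Default)]
pub struct MemoryKv {
    entries: Mutex<HashMap<String, (String, u64)>>,
}

impl MemoryKv {
    /// Unconditional write; returns the new revision.
    pub fn put(&self, key: &str, value: &str) -> u64 {
        let mut entries = self.entries.lock().unwrap();
        let rev = entries.get(key).map_or(1, |(_, r)| r + 1);
        entries.insert(key.to_string(), (value.to_string(), rev));
        rev
    }

    /// Writes `value` only if the stored revision still equals `expected`.
    pub fn compare_and_set(&self, key: &str, expected: u64, value: &str) -> Result<u64, StoreError> {
        let mut entries = self.entries.lock().unwrap();
        let entry = entries.get_mut(key).ok_or(StoreError::NotFound)?;
        if entry.1 != expected {
            return Err(StoreError::RevisionMismatch);
        }
        entry.0 = value.to_string();
        entry.1 += 1;
        Ok(entry.1)
    }
}

/// Spawns `attempts` concurrent rotations that all read the same revision
/// and returns how many of them succeeded.
pub fn concurrent_rotations(attempts: usize) -> usize {
    let kv = Arc::new(MemoryKv::default());
    let rev = kv.put("session/s1", "token-hash-0");
    let handles: Vec<_> = (0..attempts)
        .map(|i| {
            let kv = Arc::clone(&kv);
            thread::spawn(move || {
                kv.compare_and_set("session/s1", rev, &format!("token-hash-{}", i + 1))
                    .is_ok()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|ok| *ok)
        .count()
}
```

Exactly one of the concurrent callers wins; the rest see RevisionMismatch and should treat the refresh token as already used.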
Milestone 3: Routing Config + Service Discovery
Goal: Implement the routing layer described in prd.md, supporting independent placement per service kind and hot reload.
Dependencies
- Milestone 1 (project foundation)
Exit Criteria
- All Milestone 3 tests pass
Tasks
- 3.1 Define routing config model (in-memory)
- Placement maps per service kind: aggregate_placement, projection_placement, runner_placement
- Shard directory per service kind: *_shards[shard_id] -> endpoint(s)
- Revision tracking and last-known-good semantics
- 3.2 Implement config sources
- Static file config for local development
- NATS JetStream KV watcher for production
- 3.3 Implement routing decision API
- (tenant_id, service_kind) -> selected endpoint
- Admin introspection: GET /admin/routing
- 3.4 Implement config reload semantics
- POST /admin/routing/reload to force refresh
- Watcher-based reload that updates atomically
Tests
- T3.1 Routing resolves endpoints for (tenant_id, service_kind) correctly
- T3.2 Hot reload swaps routing tables atomically (no partial reads)
- T3.3 Unknown tenant returns a consistent, typed routing error
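One way to get the atomic swap in T3.2 and the typed error in T3.3 is to treat each routing revision as an immutable snapshot behind an Arc and replace the whole pointer on reload. This is a sketch under assumptions: the type names, the single-endpoint-per-shard simplification, and the rule that stale revisions are ignored are illustrative, not taken from the PRD.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

#[derive(Debug, PartialEq)]
pub enum RoutingError {
    UnknownTenant { tenant_id: String, kind: ServiceKind },
    UnknownShard { shard_id: u32, kind: ServiceKind },
}

/// Immutable snapshot of one routing revision.
#[derive(Debug, Default)]
pub struct RoutingTable {
    pub revision: u64,
    /// (service kind, tenant_id) -> shard_id
    pub placement: HashMap<(ServiceKind, String), u32>,
    /// (service kind, shard_id) -> endpoint
    pub shards: HashMap<(ServiceKind, u32), String>,
}

impl RoutingTable {
    pub fn resolve(&self, tenant_id: &str, kind: ServiceKind) -> Result<&str, RoutingError> {
        let shard_id = *self
            .placement
            .get(&(kind, tenant_id.to_string()))
            .ok_or_else(|| RoutingError::UnknownTenant { tenant_id: tenant_id.to_string(), kind })?;
        self.shards
            .get(&(kind, shard_id))
            .map(String::as_str)
            .ok_or(RoutingError::UnknownShard { shard_id, kind })
    }
}

pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(table: RoutingTable) -> Self {
        Self { current: RwLock::new(Arc::new(table)) }
    }

    /// Readers clone the Arc and keep one consistent table for the whole request.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        Arc::clone(&*self.current.read().unwrap())
    }

    /// Swaps in a newer table; older or equal revisions are ignored so the
    /// last-known-good table stays in place.
    pub fn reload(&self, next: RoutingTable) -> bool {
        let mut current = self.current.write().unwrap();
        if next.revision <= current.revision {
            return false;
        }
        *current = Arc::new(next);
        true
    }
}

/// Test helper: one Aggregate placement for "tenant-a" on shard 0.
pub fn sample_table(revision: u64, endpoint: &str) -> RoutingTable {
    let mut table = RoutingTable { revision, ..Default::default() };
    table.placement.insert((ServiceKind::Aggregate, "tenant-a".to_string()), 0);
    table.shards.insert((ServiceKind::Aggregate, 0), endpoint.to_string());
    table
}
```

A request that took its snapshot before a reload keeps resolving against the old table, so no reader ever observes a half-updated placement map.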
Milestone 4: AuthN Core (Tokens, Passwords, OIDC, MFA)
Goal: Implement the authentication layer and the public AuthN HTTP APIs described in the PRD: signup/signin/signout/refresh/forgot/reset and MFA primitives.
Dependencies
- Milestone 1 (project foundation)
- Milestone 2 (persistent state)
Exit Criteria
- All Milestone 4 tests pass
Tasks
- 4.1 Implement token model
- Access token (short-lived)
- Refresh token (rotating, revocable)
- Key rotation for signing keys
- 4.2 Implement password flows
- POST /v1/auth/signup, POST /v1/auth/signin, POST /v1/auth/signout, POST /v1/auth/refresh
- Forgot/reset: POST /v1/auth/forgot, POST /v1/auth/reset
- 4.3 Implement Google OIDC integration points
- POST /v1/auth/oidc/google/start
- GET /v1/auth/oidc/google/callback
- Account linking rules
- 4.4 Implement MFA (TOTP) primitives
- Enrollment start/confirm
- Challenge and verification
- Recovery codes
- 4.5 Abuse protections
- Rate limits for signin/forgot/reset
- Generic “account not found” responses where appropriate
Tests
- T4.1 Password hashing/verification works (Argon2id)
- T4.2 Refresh token rotation: old refresh token is invalid after use
- T4.3 Forgot/reset tokens are one-time and expire
- T4.4 MFA TOTP enrollment and challenge succeed for valid codes and fail for invalid
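The one-time and expiry properties in T4.3 can be sketched independently of the storage backend. ResetTokens and ResetError are hypothetical names, and the sketch assumes the caller has already hashed the raw token (the hashing step itself is out of scope here). Time is passed in explicitly so the behavior is testable without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum ResetError {
    Unknown,
    Expired,
}

/// token hash -> (user_id, expiry). Only hashes are kept, never raw tokens.
#[derive(Default)]
pub struct ResetTokens {
    by_hash: HashMap<String, (String, Instant)>,
}

impl ResetTokens {
    pub fn issue(&mut self, token_hash: &str, user_id: &str, now: Instant, ttl: Duration) {
        self.by_hash
            .insert(token_hash.to_string(), (user_id.to_string(), now + ttl));
    }

    /// Removes the record on every lookup, so a token can be redeemed at most
    /// once, even when the attempt fails because the token has expired.
    pub fn consume(&mut self, token_hash: &str, now: Instant) -> Result<String, ResetError> {
        let (user_id, expires_at) = self
            .by_hash
            .remove(token_hash)
            .ok_or(ResetError::Unknown)?;
        if now >= expires_at {
            return Err(ResetError::Expired);
        }
        Ok(user_id)
    }
}
```

With a shared store and two replicas, the remove step would need the same compare-and-set treatment as refresh rotation; the in-memory map only illustrates the state machine.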
Milestone 5: AuthZ (RBAC) + Tenant Enforcement
Goal: Enforce authorization decisions at the Gateway boundary, including tenant selection rules for x-tenant-id.
Dependencies
- Milestone 4 (authn)
Exit Criteria
- All Milestone 5 tests pass
Tasks
- 5.1 Define RBAC model
- Rights (permissions), roles, assignments (principal ↔ tenant ↔ role)
- Platform admin vs tenant admin vs tenant member scoping rules
- 5.2 Implement authorization engine
- Inputs: principal, tenant_id, action, resource attributes (aggregate_type, view_type)
- Outputs: allow/deny with reason
- 5.3 Enforce x-tenant-id rules
- Required on tenant-scoped endpoints
- Validated format and tenant membership checks
- 5.4 Add consistent error envelope mapping (401/403/400)
Tests
- T5.1 Tenant spoofing is rejected (principal lacks membership)
- T5.2 Role assignment enables expected actions and denies others
- T5.3 Missing x-tenant-id on tenant routes returns 400
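A rough sketch of the decision flow in 5.2 through 5.4, assuming one role per (principal, tenant) pair and string action names such as "command.submit". Both are simplifications for illustration; the real model also has resource attributes and platform-level scopes. The status codes follow the 401/403/400 mapping named in 5.4.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny { status: u16, reason: &'static str },
}

#[derive(Default)]
pub struct Rbac {
    /// role -> allowed actions
    role_rights: HashMap<String, HashSet<String>>,
    /// (principal, tenant_id) -> role
    assignments: HashMap<(String, String), String>,
}

impl Rbac {
    pub fn grant(&mut self, role: &str, action: &str) {
        self.role_rights
            .entry(role.to_string())
            .or_default()
            .insert(action.to_string());
    }

    pub fn assign(&mut self, principal: &str, tenant_id: &str, role: &str) {
        self.assignments
            .insert((principal.to_string(), tenant_id.to_string()), role.to_string());
    }

    /// `principal` is None when authentication failed; `tenant_header` is the
    /// raw x-tenant-id value, if the request carried one.
    pub fn authorize(
        &self,
        principal: Option<&str>,
        tenant_header: Option<&str>,
        action: &str,
    ) -> Decision {
        let Some(principal) = principal else {
            return Decision::Deny { status: 401, reason: "unauthenticated" };
        };
        let Some(tenant_id) = tenant_header.filter(|t| !t.is_empty()) else {
            return Decision::Deny { status: 400, reason: "missing x-tenant-id" };
        };
        let key = (principal.to_string(), tenant_id.to_string());
        let Some(role) = self.assignments.get(&key) else {
            return Decision::Deny { status: 403, reason: "principal is not a member of tenant" };
        };
        match self.role_rights.get(role) {
            Some(actions) if actions.contains(action) => Decision::Allow,
            _ => Decision::Deny { status: 403, reason: "role does not grant action" },
        }
    }
}
```

Checking membership before rights means a spoofed x-tenant-id is rejected with the same 403 regardless of which action was requested.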
Milestone 6: Upstream Proxying (Aggregate / Projection / Runner)
Goal: Route authenticated and authorized requests to the node services.
Dependencies
- Milestone 3 (routing)
- Milestone 5 (authz)
Exit Criteria
- All Milestone 6 tests pass
Tasks
- 6.1 Aggregate submit command proxy
- gRPC server implementing aggregate.gateway.v1.CommandService/SubmitCommand
- HTTP wrapper POST /v1/commands/{aggregate_type}/{aggregate_id}
- Propagate x-tenant-id and correlation metadata
- Ensure safe retry semantics using command_id idempotency
- 6.2 Projection query proxy
- POST /v1/query/{view_type} forwarding to Projection query endpoint once available
- 6.3 Runner admin passthrough (admin-only)
- /admin/runner/* forwarding with strict authorization
Tests
- T6.1 gRPC SubmitCommand forwards tenant metadata and returns upstream events
- T6.2 HTTP command endpoint returns the same shape as gRPC response
- T6.3 Query endpoint enforces tenant scoping and denies unauthorized callers
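The retry semantics in 6.1 depend on the Gateway resending the same command_id on every attempt. The sketch below uses a fake upstream that deduplicates by command_id and drops a configurable number of responses; FakeAggregate, its behavior, and the event strings are assumptions made for illustration, not the Aggregate node's actual contract.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct CommandResult {
    pub events: Vec<String>,
}

/// Fake upstream (assumption): applies each command_id once, then replays the
/// stored result. The first `drop_responses` replies are lost after applying.
pub struct FakeAggregate {
    applied: HashMap<String, CommandResult>,
    drop_responses: u32,
    pub apply_count: u32,
}

impl FakeAggregate {
    pub fn new(drop_responses: u32) -> Self {
        Self { applied: HashMap::new(), drop_responses, apply_count: 0 }
    }

    pub fn submit(&mut self, command_id: &str, payload: &str) -> Result<CommandResult, &'static str> {
        if !self.applied.contains_key(command_id) {
            self.apply_count += 1;
            let result = CommandResult { events: vec![format!("applied:{payload}")] };
            self.applied.insert(command_id.to_string(), result);
        }
        let result = self.applied[command_id].clone();
        if self.drop_responses > 0 {
            // Simulates a timeout that happens after the command took effect.
            self.drop_responses -= 1;
            return Err("timeout");
        }
        Ok(result)
    }
}

/// Retries with the SAME command_id so the upstream can deduplicate.
pub fn submit_with_retry(
    upstream: &mut FakeAggregate,
    command_id: &str,
    payload: &str,
    max_attempts: u32,
) -> Result<CommandResult, &'static str> {
    let mut last_err = "no attempts made";
    for _ in 0..max_attempts {
        match upstream.submit(command_id, payload) {
            Ok(result) => return Ok(result),
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```

Even though two responses are lost in the test scenario, the command is applied once and the caller still receives the original events.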
Milestone 7: Admin IAM APIs (Users, Roles, Rights)
Goal: Expose the admin IAM endpoints for the Admin UI node to manage authn/authz data.
Dependencies
- Milestone 4 (authn)
- Milestone 5 (authz)
Exit Criteria
- All Milestone 7 tests pass
Tasks
- 7.1 Implement admin IAM endpoints
- Users CRUD and disable/delete
- Identities link/unlink (OIDC), manage password credentials
- Rights CRUD, roles CRUD, role↔rights management
- Assignments CRUD (principal ↔ tenant ↔ role)
- Service accounts credential create/rotate and tenant role assignment
- MFA admin actions (reset MFA, revoke recovery codes)
- Session revocation for user (global signout)
- 7.2 Implement audit trail for admin IAM actions
- Immutable record of actor, action, target, tenant scope, timestamp, request metadata
Tests
- T7.1 Only platform/tenant admins can access relevant endpoints
- T7.2 All admin mutations emit an audit record
- T7.3 Assignment changes immediately affect authorization decisions
Milestone 8: Rebalancing Operations (Control Plane Hooks)
Goal: Provide the pieces needed to support tenant rebalancing as described in the PRD: visibility, readiness gates, and safe cutover support.
Dependencies
- Milestone 3 (routing config)
- Milestone 6 (upstream proxying)
Exit Criteria
- All Milestone 8 tests pass
Tasks
- 8.1 Expose placement introspection and status
- Current placement revision per service kind
- Effective routing decisions for a given tenant (admin-only)
- 8.2 Define and implement readiness gates used by rebalancer
- Projection: warmup/catchup signal (lag)
- Runner: tenant drained / checkpoint stable signal
- Aggregate: tenant drain and state availability signal (as defined by upstream changes)
- 8.3 Add operator-facing rebalancing endpoints (optional if a separate rebalancer service exists)
- Plan/apply/rollback APIs with strong authorization
Tests
- T8.1 Placement revision changes are visible immediately and atomically
- T8.2 Rebalancing guardrails prevent cutover when target shard is not ready
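The readiness gates in 8.2 and the guardrail in T8.2 could be modeled as one signal per service kind plus a check that must pass before cutover. The field names and the lag threshold parameter are assumptions; the actual signals depend on what the upstream nodes expose.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Readiness {
    Projection { lag_events: u64 },
    Runner { drained: bool, checkpoint_stable: bool },
    Aggregate { drained: bool, state_available: bool },
}

#[derive(Debug, PartialEq)]
pub enum CutoverBlocked {
    ProjectionLagging { lag_events: u64, max: u64 },
    RunnerNotDrained,
    RunnerCheckpointUnstable,
    AggregateNotDrained,
    AggregateStateUnavailable,
}

/// Checks one readiness signal against its gate.
pub fn check_cutover(signal: Readiness, max_projection_lag: u64) -> Result<(), CutoverBlocked> {
    match signal {
        Readiness::Projection { lag_events } if lag_events > max_projection_lag => {
            Err(CutoverBlocked::ProjectionLagging { lag_events, max: max_projection_lag })
        }
        Readiness::Runner { drained: false, .. } => Err(CutoverBlocked::RunnerNotDrained),
        Readiness::Runner { checkpoint_stable: false, .. } => {
            Err(CutoverBlocked::RunnerCheckpointUnstable)
        }
        Readiness::Aggregate { drained: false, .. } => Err(CutoverBlocked::AggregateNotDrained),
        Readiness::Aggregate { state_available: false, .. } => {
            Err(CutoverBlocked::AggregateStateUnavailable)
        }
        _ => Ok(()),
    }
}

/// Cutover is allowed only when every signal passes; the first failure is returned.
pub fn can_cut_over(signals: &[Readiness], max_projection_lag: u64) -> Result<(), CutoverBlocked> {
    signals
        .iter()
        .try_for_each(|signal| check_cutover(*signal, max_projection_lag))
}
```

Returning the specific blocking reason, rather than a bare boolean, gives operators something actionable when an apply is refused.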
Milestone 9: Docker Swarm Deployment + HA (Max 2 Replicas)
Goal: Define and validate the Docker Swarm architecture for Gateway, including HA behavior with at most 2 Gateway replicas.
Dependencies
- Milestone 1 (health/ready/metrics)
- Milestone 2 (persistent state suitable for HA)
- Milestone 6 (proxying) for end-to-end smoke tests
Exit Criteria
- All Milestone 9 tests pass
- The platform stack (swarm/stacks/platform.yml) can deploy the Gateway with replicas: 2 and serve traffic during rolling updates
Tasks
- 9.1 Build container image
- Dockerfile, multi-stage build, minimal runtime image
- Embed build metadata (version, git sha)
- 9.2 Define Swarm service topology (2 nodes max)
- gateway service with deploy.replicas: 2
- Healthcheck based on /ready
- Rolling update strategy (start-first), rollback policy on failure
- Network: overlay network for internal traffic to NATS and nodes
- 9.3 Define ingress and TLS termination strategy
- Swarm routing mesh or an ingress proxy (document choice in stack)
- Ensure HTTP and gRPC can be routed correctly
- 9.4 Secrets and config distribution
- OIDC client secrets, JWT signing keys (rotation-ready), NATS credentials
- Use Swarm secrets/configs instead of environment variables for secrets where possible
- 9.5 HA behavior validation
- Run two replicas and ensure:
- refresh token rotation works across replicas (no stickiness)
- admin IAM updates are visible from both replicas
- in-flight requests survive a single replica restart
Tests
- T9.1 swarm/stacks/platform.yml parses as valid YAML
- T9.2 Smoke: deploy 2 replicas and confirm /ready is healthy on both
- T9.3 Rolling update does not drop readiness below 1 available replica
- T9.4 Auth session/refresh works across replicas (no sticky sessions required)
Milestone 10: Observability + Hardening
Goal: Make the Gateway production-ready with robust telemetry and safety defaults.
Dependencies
- Milestone 6 (proxying)
Exit Criteria
- All Milestone 10 tests pass
Tasks
- 10.1 Structured logs with correlation
- request_id, trace_id, principal id, tenant id (when present), upstream target
- 10.2 Metrics
- Request counts/latency, auth failures, upstream errors, routing misses, rate limit blocks
- 10.3 Security hardening
- CSRF protections for cookie-based flows
- JWT key rotation strategy and config
- mTLS/service auth boundary for internal upstreams
- 10.4 Load and failure testing strategy
- Soak tests for routing reload + auth endpoints
- Backpressure/timeouts/circuit breaker verification
- 10.5 Correlation and trace context propagation (Gateway as source of truth)
- Accept inbound x-correlation-id and traceparent on HTTP and gRPC requests
- If missing, generate x-correlation-id at the start of request handling and start a new trace
- Echo x-correlation-id (and traceparent when applicable) on responses
- Propagate x-correlation-id and traceparent to upstream nodes (Aggregate/Projection/Runner) and record them in request spans/log fields
Tests
- T10.1 Metrics include expected labels and counters increment correctly
- T10.2 Secrets never appear in logs in representative error cases
- T10.3 Rate limits trigger under abusive patterns
- T10.4 Gateway generates x-correlation-id when missing and echoes it on responses
- T10.5 Gateway propagates x-correlation-id and traceparent to upstream calls and includes them in logs/spans
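The accept-or-generate rule in 10.5 can be sketched with plain header maps, independent of axum or tonic. RequestContext is a hypothetical type, header names are assumed to arrive lowercased, and the counter-based ID generator is a placeholder for whatever scheme the platform adopts (for example a UUID). Starting a new trace when traceparent is absent is left to the tracing layer and is not shown.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

pub const CORRELATION_HEADER: &str = "x-correlation-id";
pub const TRACEPARENT_HEADER: &str = "traceparent";

static NEXT_ID: AtomicU64 = AtomicU64::new(1);

/// Placeholder generator (assumption); unique per process only.
fn generate_correlation_id() -> String {
    format!("gw-{:016x}", NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

pub struct RequestContext {
    pub correlation_id: String,
    pub traceparent: Option<String>,
    /// True when the Gateway had to mint the correlation id itself.
    pub generated: bool,
}

impl RequestContext {
    /// Accepts the inbound id when present and non-blank, otherwise generates one.
    pub fn from_headers(headers: &HashMap<String, String>) -> Self {
        let inbound = headers
            .get(CORRELATION_HEADER)
            .filter(|value| !value.trim().is_empty())
            .cloned();
        let generated = inbound.is_none();
        Self {
            correlation_id: inbound.unwrap_or_else(generate_correlation_id),
            traceparent: headers.get(TRACEPARENT_HEADER).cloned(),
            generated,
        }
    }

    /// Headers to attach to both the upstream call and the response.
    pub fn outbound_headers(&self) -> HashMap<String, String> {
        let mut out = HashMap::new();
        out.insert(CORRELATION_HEADER.to_string(), self.correlation_id.clone());
        if let Some(traceparent) = &self.traceparent {
            out.insert(TRACEPARENT_HEADER.to_string(), traceparent.clone());
        }
        out
    }
}
```

Building the context once at the start of request handling and reusing outbound_headers for both the upstream call and the response keeps the echoed and propagated values identical, which is what T10.4 and T10.5 check.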