cloudlysis/gateway/DEVELOPMENT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00


Development Plan: Gateway

Overview

This plan breaks down the Gateway implementation into milestones ordered by dependency. Each milestone includes:

  • Tasks with clear deliverables
  • Test Requirements (unit tests + tautological tests + integration tests where applicable)
  • Dependencies on previous milestones

Development Approach:

  1. Complete one milestone at a time
  2. Write tests before implementation (TDD where applicable)
  3. Do not start the next milestone until the current milestone's tests are passing (green)
  4. Mark tasks complete with [x] as you progress

Milestone 1: Project Foundation

Goal: Create the Gateway service as a Rust project aligned with existing node conventions (Axum + Tokio + tracing + Prometheus metrics).

Tasks

  • 1.1 Initialize Cargo project
    • Create src/lib.rs and src/main.rs
    • Establish module layout for: http, grpc, authn, authz, routing, upstream, observability, config, storage
  • 1.2 Choose and wire core dependencies (aligned with existing services)
    • HTTP: axum
    • gRPC: tonic
    • Runtime: tokio
    • Serialization: serde, serde_json
    • Errors: thiserror, anyhow
    • Telemetry: tracing, metrics-exporter-prometheus or existing metrics pattern in the codebase
  • 1.3 Add baseline runtime endpoints
    • GET /health, GET /ready, GET /metrics
    • Structured logs with request id propagation

Tests

  • T1.1 Project compiles
  • T1.2 GET /health returns 200
  • T1.3 Tautological test: core state types are Send + Sync
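
A tautological test like T1.3 can be expressed as a compile-time check: the test body does nothing at runtime, but it stops compiling if the type loses `Send` or `Sync`. The following is a minimal sketch using only the standard library. `AppState` and its `routes` field are hypothetical placeholders, not the Gateway's actual state type.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared state (assumption); the real Gateway state will differ.
#[derive(Clone, Default)]
pub struct AppState {
    /// Placeholder for data shared across request handlers.
    pub routes: Arc<RwLock<HashMap<String, String>>>,
}

/// Compiles only when `T` is both `Send` and `Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}

#[test]
fn core_state_is_send_and_sync() {
    // Fails at compile time, not at runtime, if a non-thread-safe
    // field (for example an `Rc`) is ever added to `AppState`.
    assert_send_sync::<AppState>();
}
```

The same helper can be reused for T2.3 by instantiating it with the storage client type.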

Milestone 2: Persistent State (Auth + RBAC + Sessions) for HA

Goal: Define where Gateway state lives so the service can run as HA (max 2 replicas) without sticky sessions and without losing auth/admin consistency.

Dependencies

  • Milestone 1 (project foundation)

Tasks

  • 2.1 Choose and implement the backing store for identity + authorization state
    • Recommended default for platform alignment: NATS JetStream KV buckets for:
      • users
      • identities (OIDC links)
      • password credential records (hash only)
      • refresh token/session records (hash only, revocable, rotating)
      • MFA enrollments + recovery codes (hash only)
      • rights/roles/assignments
      • audit log index (append-only model)
  • 2.2 Define storage schema + versioning
    • Key naming conventions
    • JSON shapes (forward compatible)
    • Migration strategy for schema changes
  • 2.3 Implement storage client abstraction
    • CRUD primitives with compare-and-set semantics where needed (e.g., refresh rotation)
    • Pagination/scan strategy for admin listing endpoints
    • Consistent error mapping for storage failures

Tests

  • T2.1 Sensitive values are stored only as hashes (reset tokens, refresh tokens, recovery codes)
  • T2.2 Refresh token rotation is atomic (cannot be used twice under concurrency)
  • T2.3 Tautological test: storage client is Send + Sync
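
The compare-and-set requirement in 2.3 and the atomicity test T2.2 can be illustrated with a revision-checked update: a writer must present the revision it last read, and the write is rejected if another writer got there first. The sketch below is an in-memory stand-in, not a NATS client; `MemKv`, `StoreError`, and `rotate` are hypothetical names, and it assumes the production store offers an equivalent update-if-revision-matches primitive.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum StoreError {
    NotFound,
    RevisionMismatch,
}

/// In-memory stand-in (assumption) for a KV bucket with per-key revisions.
#[derive(Default)]
pub struct MemKv {
    inner: Mutex<HashMap<String, (u64, String)>>,
}

impl MemKv {
    /// Inserts or overwrites a value and returns its new revision.
    pub fn put(&self, key: &str, value: &str) -> u64 {
        let mut map = self.inner.lock().unwrap();
        let rev = map.get(key).map_or(1, |(r, _)| r + 1);
        map.insert(key.to_string(), (rev, value.to_string()));
        rev
    }

    /// Returns the current (revision, value) pair for a key, if present.
    pub fn get(&self, key: &str) -> Option<(u64, String)> {
        self.inner.lock().unwrap().get(key).cloned()
    }

    /// Writes only if the stored revision still equals `expected`.
    pub fn update(&self, key: &str, value: &str, expected: u64) -> Result<u64, StoreError> {
        let mut map = self.inner.lock().unwrap();
        let entry = map.get_mut(key).ok_or(StoreError::NotFound)?;
        if entry.0 != expected {
            return Err(StoreError::RevisionMismatch);
        }
        entry.0 += 1;
        entry.1 = value.to_string();
        Ok(entry.0)
    }
}

/// Rotates a session's refresh-token hash. Of several callers that all
/// read revision `seen_rev`, at most one can succeed.
pub fn rotate(kv: &MemKv, session_key: &str, seen_rev: u64, new_hash: &str) -> Result<u64, StoreError> {
    kv.update(session_key, new_hash, seen_rev)
}
```

With this shape, a second replica that replays the same refresh token observes `RevisionMismatch` and can treat it as token reuse, which is what removes the need for sticky sessions.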

Milestone 3: Routing Config + Service Discovery

Goal: Implement the routing layer described in prd.md, supporting independent placement per service kind and hot reload.

Dependencies

  • Milestone 1 (project foundation)

Exit Criteria

  • All Milestone 3 tests pass

Tasks

  • 3.1 Define routing config model (in-memory)
    • Placement maps per service kind: aggregate_placement, projection_placement, runner_placement
    • Shard directory per service kind: *_shards[shard_id] -> endpoint(s)
    • Revision tracking and last-known-good semantics
  • 3.2 Implement config sources
    • Static file config for local development
    • NATS JetStream KV watcher for production
  • 3.3 Implement routing decision API
    • (tenant_id, service_kind) -> selected endpoint
    • Admin introspection: GET /admin/routing
  • 3.4 Implement config reload semantics
    • POST /admin/routing/reload to force refresh
    • Watcher-based reload that updates atomically

Tests

  • T3.1 Routing resolves endpoints for (tenant_id, service_kind) correctly
  • T3.2 Hot reload swaps routing tables atomically (no partial reads)
  • T3.3 Unknown tenant returns a consistent, typed routing error
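
One way to get the atomic swap in 3.4 and T3.2 is to keep the whole routing table behind a single `Arc` and replace the pointer on reload, so a reader always sees either the old table or the new one. The sketch below assumes that approach; the type names, the `u32` shard ids, and the rule that a reload must carry a strictly higher revision are illustrative assumptions, not taken from prd.md.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

/// Typed routing errors, as called for by T3.3.
#[derive(Debug, PartialEq)]
pub enum RoutingError {
    UnknownTenant { tenant_id: String },
    UnknownShard { shard_id: u32 },
}

#[derive(Default)]
pub struct RoutingTable {
    pub revision: u64,
    /// (tenant_id, service_kind) -> shard_id
    pub placement: HashMap<(String, ServiceKind), u32>,
    /// (service_kind, shard_id) -> endpoint
    pub shards: HashMap<(ServiceKind, u32), String>,
}

pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(table: RoutingTable) -> Self {
        Router { current: RwLock::new(Arc::new(table)) }
    }

    /// Cheap snapshot; the lock is held only while the `Arc` is cloned.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        self.current.read().unwrap().clone()
    }

    /// Swaps in `next` as a whole. Stale revisions are rejected so the
    /// last-known-good table stays in place (assumed policy).
    pub fn reload(&self, next: RoutingTable) -> bool {
        let mut guard = self.current.write().unwrap();
        if next.revision <= guard.revision {
            return false;
        }
        *guard = Arc::new(next);
        true
    }

    /// Resolves (tenant_id, service_kind) against one consistent snapshot.
    pub fn resolve(&self, tenant_id: &str, kind: ServiceKind) -> Result<String, RoutingError> {
        let table = self.snapshot();
        let shard = *table
            .placement
            .get(&(tenant_id.to_string(), kind))
            .ok_or_else(|| RoutingError::UnknownTenant { tenant_id: tenant_id.to_string() })?;
        table
            .shards
            .get(&(kind, shard))
            .cloned()
            .ok_or(RoutingError::UnknownShard { shard_id: shard })
    }
}
```

Because `resolve` takes one snapshot and reads both maps from it, a reload that lands mid-request cannot produce a placement from one revision and a shard directory from another.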

Milestone 4: AuthN Core (Tokens, Passwords, OIDC, MFA)

Goal: Implement the authentication layer and the public AuthN HTTP APIs described in the PRD: signup/signin/signout/refresh/forgot/reset and MFA primitives.

Dependencies

  • Milestone 1 (project foundation)
  • Milestone 2 (persistent state)

Exit Criteria

  • All Milestone 4 tests pass

Tasks

  • 4.1 Implement token model
    • Access token (short-lived)
    • Refresh token (rotating, revocable)
    • Key rotation for signing keys
  • 4.2 Implement password flows
    • POST /v1/auth/signup, POST /v1/auth/signin, POST /v1/auth/signout, POST /v1/auth/refresh
    • Forgot/reset: POST /v1/auth/forgot, POST /v1/auth/reset
  • 4.3 Implement Google OIDC integration points
    • POST /v1/auth/oidc/google/start
    • GET /v1/auth/oidc/google/callback
    • Account linking rules
  • 4.4 Implement MFA (TOTP) primitives
    • Enrollment start/confirm
    • Challenge and verification
    • Recovery codes
  • 4.5 Abuse protections
    • Rate limits for signin/forgot/reset
    • Generic responses that do not reveal whether an account exists (no distinct “account not found” error) where appropriate

Tests

  • T4.1 Password hashing/verification works (Argon2id)
  • T4.2 Refresh token rotation: old refresh token is invalid after use
  • T4.3 Forgot/reset tokens are one-time and expire
  • T4.4 MFA TOTP enrollment and challenge succeed for valid codes and fail for invalid
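
The one-time, expiring behavior in T4.3 can be sketched as "remove first, then check expiry": consuming a token deletes its record regardless of the outcome, so a second attempt always fails. This is a single-process illustration with hypothetical names (`ResetTokens`, `TokenError`). It assumes the caller passes an already-hashed token, and a multi-replica deployment would need the same remove step to be atomic in the shared store.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum TokenError {
    Unknown,
    Expired,
}

/// Reset-token records keyed by token hash (the raw token is never stored).
#[derive(Default)]
pub struct ResetTokens {
    by_hash: HashMap<String, (String, Instant)>,
}

impl ResetTokens {
    /// Records a token hash for `user_id` that expires at `now + ttl`.
    pub fn issue(&mut self, token_hash: &str, user_id: &str, now: Instant, ttl: Duration) {
        self.by_hash
            .insert(token_hash.to_string(), (user_id.to_string(), now + ttl));
    }

    /// Consumes a token. The record is removed before the expiry check,
    /// so every token can be presented at most once.
    pub fn consume(&mut self, token_hash: &str, now: Instant) -> Result<String, TokenError> {
        let (user_id, expires_at) = self
            .by_hash
            .remove(token_hash)
            .ok_or(TokenError::Unknown)?;
        if now >= expires_at {
            return Err(TokenError::Expired);
        }
        Ok(user_id)
    }
}
```

Passing `now` as a parameter instead of reading the clock inside `consume` keeps the expiry path testable without sleeping.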

Milestone 5: AuthZ (RBAC) + Tenant Enforcement

Goal: Enforce authorization decisions at the Gateway boundary, including tenant selection rules for x-tenant-id.

Dependencies

  • Milestone 4 (authn)

Exit Criteria

  • All Milestone 5 tests pass

Tasks

  • 5.1 Define RBAC model
    • Rights (permissions), roles, assignments (principal ↔ tenant ↔ role)
    • Platform admin vs tenant admin vs tenant member scoping rules
  • 5.2 Implement authorization engine
    • Inputs: principal, tenant_id, action, resource attributes (aggregate_type, view_type)
    • Outputs: allow/deny with reason
  • 5.3 Enforce x-tenant-id rules
    • Required on tenant-scoped endpoints
    • Validated format and tenant membership checks
  • 5.4 Add consistent error envelope mapping (401/403/400)

Tests

  • T5.1 Tenant spoofing is rejected (principal lacks membership)
  • T5.2 Role assignment enables expected actions and denies others
  • T5.3 Missing x-tenant-id on tenant routes returns 400
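
The authorization engine in 5.2 and the x-tenant-id rules in 5.3 can be sketched as a single function that returns allow or deny with a reason, with each reason mapping to a status code. The model below is a simplified assumption: membership is taken to mean "has at least one role in the tenant", resource attributes are omitted, and all type and method names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum DenyReason {
    MissingTenantHeader,
    NotATenantMember,
    ActionNotGranted,
}

impl DenyReason {
    /// HTTP status for the error envelope (assumed mapping).
    pub fn status(&self) -> u16 {
        match self {
            DenyReason::MissingTenantHeader => 400,
            DenyReason::NotATenantMember | DenyReason::ActionNotGranted => 403,
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny(DenyReason),
}

#[derive(Default)]
pub struct Rbac {
    /// (principal, tenant_id) -> role names
    assignments: HashMap<(String, String), HashSet<String>>,
    /// role name -> rights
    role_rights: HashMap<String, HashSet<String>>,
}

impl Rbac {
    /// Adds a right to a role.
    pub fn grant(&mut self, role: &str, right: &str) {
        self.role_rights.entry(role.to_string()).or_default().insert(right.to_string());
    }

    /// Assigns a role to a principal within one tenant.
    pub fn assign(&mut self, principal: &str, tenant_id: &str, role: &str) {
        self.assignments
            .entry((principal.to_string(), tenant_id.to_string()))
            .or_default()
            .insert(role.to_string());
    }

    /// Decides whether `principal` may perform `action` in the tenant
    /// named by the x-tenant-id header value.
    pub fn authorize(&self, principal: &str, tenant_header: Option<&str>, action: &str) -> Decision {
        let tenant_id = match tenant_header {
            Some(t) => t,
            None => return Decision::Deny(DenyReason::MissingTenantHeader),
        };
        let roles = match self.assignments.get(&(principal.to_string(), tenant_id.to_string())) {
            Some(r) if !r.is_empty() => r,
            _ => return Decision::Deny(DenyReason::NotATenantMember),
        };
        let granted = roles
            .iter()
            .any(|role| self.role_rights.get(role).map_or(false, |rights| rights.contains(action)));
        if granted {
            Decision::Allow
        } else {
            Decision::Deny(DenyReason::ActionNotGranted)
        }
    }
}
```

Keeping the deny reason as a typed value, rather than a string, lets the envelope mapping in 5.4 live in one place.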

Milestone 6: Upstream Proxying (Aggregate / Projection / Runner)

Goal: Route authenticated and authorized requests to the node services.

Dependencies

  • Milestone 3 (routing)
  • Milestone 5 (authz)

Exit Criteria

  • All Milestone 6 tests pass

Tasks

  • 6.1 Aggregate submit command proxy
    • gRPC server implementing aggregate.gateway.v1.CommandService/SubmitCommand
    • HTTP wrapper POST /v1/commands/{aggregate_type}/{aggregate_id}
    • Propagate x-tenant-id and correlation metadata
    • Ensure safe retry semantics using command_id idempotency
  • 6.2 Projection query proxy
    • POST /v1/query/{view_type} forwarding to Projection query endpoint once available
  • 6.3 Runner admin passthrough (admin-only)
    • /admin/runner/* forwarding with strict authorization

Tests

  • T6.1 gRPC SubmitCommand forwards tenant metadata and returns upstream events
  • T6.2 HTTP command endpoint returns the same shape as gRPC response
  • T6.3 Query endpoint enforces tenant scoping and denies unauthorized callers
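
The safe-retry requirement in 6.1 depends on every attempt carrying the same `command_id`, so the upstream can recognize a replay. The sketch below shows only that Gateway-side shape and assumes deduplication itself happens upstream. `UpstreamError`, `submit_with_retry`, and the event representation (`Vec<String>`) are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum UpstreamError {
    /// Transient failure; the command may or may not have been applied.
    Unavailable,
    /// Definitive rejection; retrying would not help.
    Rejected(String),
}

/// Calls `send` up to `max_attempts` times, always with the same
/// `command_id`. Only transient failures are retried.
pub fn submit_with_retry<F>(
    command_id: &str,
    max_attempts: u32,
    mut send: F,
) -> Result<Vec<String>, UpstreamError>
where
    F: FnMut(&str) -> Result<Vec<String>, UpstreamError>,
{
    for _ in 0..max_attempts {
        match send(command_id) {
            Ok(events) => return Ok(events),
            Err(UpstreamError::Unavailable) => continue,
            Err(other) => return Err(other),
        }
    }
    Err(UpstreamError::Unavailable)
}
```

A production version would likely add backoff and a deadline between attempts; they are left out here to keep the idempotency point visible.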

Milestone 7: Admin IAM APIs (Users, Roles, Rights)

Goal: Expose the admin IAM endpoints for the Admin UI node to manage authn/authz data.

Dependencies

  • Milestone 4 (authn)
  • Milestone 5 (authz)

Exit Criteria

  • All Milestone 7 tests pass

Tasks

  • 7.1 Implement admin IAM endpoints
    • Users CRUD and disable/delete
    • Identities link/unlink (OIDC), manage password credentials
    • Rights CRUD, roles CRUD, role↔rights management
    • Assignments CRUD (principal ↔ tenant ↔ role)
    • Service accounts credential create/rotate and tenant role assignment
    • MFA admin actions (reset MFA, revoke recovery codes)
    • Session revocation for user (global signout)
  • 7.2 Implement audit trail for admin IAM actions
    • Immutable record of actor, action, target, tenant scope, timestamp, request metadata

Tests

  • T7.1 Only platform/tenant admins can access relevant endpoints
  • T7.2 All admin mutations emit an audit record
  • T7.3 Assignment changes immediately affect authorization decisions

Milestone 8: Rebalancing Operations (Control Plane Hooks)

Goal: Provide the pieces needed to support tenant rebalancing as described in the PRD: visibility, readiness gates, and safe cutover support.

Dependencies

  • Milestone 3 (routing config)
  • Milestone 6 (upstream proxying)

Exit Criteria

  • All Milestone 8 tests pass

Tasks

  • 8.1 Expose placement introspection and status
    • Current placement revision per service kind
    • Effective routing decisions for a given tenant (admin-only)
  • 8.2 Define and implement readiness gates used by rebalancer
    • Projection: warmup/catchup signal (lag)
    • Runner: tenant drained / checkpoint stable signal
    • Aggregate: tenant drain and state availability signal (as defined by upstream changes)
  • 8.3 Add operator-facing rebalancing endpoints (optional if a separate rebalancer service exists)
    • Plan/apply/rollback APIs with strong authorization

Tests

  • T8.1 Placement revision changes are visible immediately and atomically
  • T8.2 Rebalancing guardrails prevent cutover when target shard is not ready
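
The readiness gates in 8.2 and the guardrail in T8.2 can be modeled as a list of signals that must all pass before cutover is allowed. The sketch below is one possible shape. The signal variants mirror the three gates listed above, but the field names, the lag threshold parameter, and the reason strings are assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum GateSignal {
    /// Projection warmup/catchup: how far the target is behind.
    ProjectionLag { events_behind: u64 },
    /// Runner: whether the tenant is drained and its checkpoint is stable.
    RunnerDrained(bool),
    /// Aggregate: whether tenant state is available on the target.
    AggregateStateAvailable(bool),
}

/// Returns why a signal blocks cutover, or `None` if it passes.
pub fn blocking_reason(signal: &GateSignal, max_lag: u64) -> Option<String> {
    match signal {
        GateSignal::ProjectionLag { events_behind } if *events_behind > max_lag => Some(format!(
            "projection lag {} exceeds {}",
            events_behind, max_lag
        )),
        GateSignal::RunnerDrained(false) => Some("runner tenant not drained".to_string()),
        GateSignal::AggregateStateAvailable(false) => {
            Some("aggregate state not available".to_string())
        }
        _ => None,
    }
}

/// Cutover is allowed only when no signal reports a blocking reason.
pub fn check_cutover(signals: &[GateSignal], max_lag: u64) -> Result<(), Vec<String>> {
    let reasons: Vec<String> = signals
        .iter()
        .filter_map(|s| blocking_reason(s, max_lag))
        .collect();
    if reasons.is_empty() {
        Ok(())
    } else {
        Err(reasons)
    }
}
```

Returning every blocking reason, not just the first, gives an operator the full picture in a single call.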

Milestone 9: Docker Swarm Deployment + HA (Max 2 Replicas)

Goal: Define and validate the Docker Swarm architecture for Gateway, including HA behavior with at most 2 Gateway replicas.

Dependencies

  • Milestone 1 (health/ready/metrics)
  • Milestone 2 (persistent state suitable for HA)
  • Milestone 6 (proxying) for end-to-end smoke tests

Exit Criteria

  • All Milestone 9 tests pass
  • The platform stack (swarm/stacks/platform.yml) can deploy the Gateway with replicas: 2 and serve traffic during rolling updates

Tasks

  • 9.1 Build container image
    • Dockerfile, multi-stage build, minimal runtime image
    • Embed build metadata (version, git sha)
  • 9.2 Define Swarm service topology (2 nodes max)
    • gateway service with deploy.replicas: 2
    • Healthcheck based on /ready
    • Rolling update strategy (start-first), rollback policy on failure
    • Network: overlay network for internal traffic to NATS and nodes
  • 9.3 Define ingress and TLS termination strategy
    • Swarm routing mesh or an ingress proxy (document choice in stack)
    • Ensure HTTP and gRPC can be routed correctly
  • 9.4 Secrets and config distribution
    • OIDC client secrets, JWT signing keys (rotation-ready), NATS credentials
    • Use Swarm secrets/configs instead of environment variables for secrets where possible
  • 9.5 HA behavior validation
    • Run two replicas and ensure:
      • refresh token rotation works across replicas (no stickiness)
      • admin IAM updates are visible from both replicas
      • in-flight requests survive a single replica restart

Tests

  • T9.1 swarm/stacks/platform.yml parses as valid YAML
  • T9.2 Smoke: deploy 2 replicas and confirm /ready is healthy on both
  • T9.3 Rolling update does not drop readiness below 1 available replica
  • T9.4 Auth session/refresh works across replicas (no sticky sessions required)

Milestone 10: Observability + Hardening

Goal: Make the Gateway production-ready with robust telemetry and safety defaults.

Dependencies

  • Milestone 6 (proxying)

Exit Criteria

  • All Milestone 10 tests pass

Tasks

  • 10.1 Structured logs with correlation
    • request_id, trace_id, principal id, tenant id (when present), upstream target
  • 10.2 Metrics
    • Request counts/latency, auth failures, upstream errors, routing misses, rate limit blocks
  • 10.3 Security hardening
    • CSRF protections for cookie-based flows
    • JWT key rotation strategy and config
    • mTLS/service auth boundary for internal upstreams
  • 10.4 Load and failure testing strategy
    • Soak tests for routing reload + auth endpoints
    • Backpressure/timeouts/circuit breaker verification
  • 10.5 Correlation and trace context propagation (Gateway as source of truth)
    • Accept inbound x-correlation-id and traceparent on HTTP and gRPC requests
    • If missing, generate x-correlation-id at the start of request handling and start a new trace
    • Echo x-correlation-id (and traceparent when applicable) on responses
    • Propagate x-correlation-id and traceparent to upstream nodes (Aggregate/Projection/Runner) and record them in request spans/log fields

Tests

  • T10.1 Metrics include expected labels and counters increment correctly
  • T10.2 Secrets never appear in logs in representative error cases
  • T10.3 Rate limits trigger under abusive patterns
  • T10.4 Gateway generates x-correlation-id when missing and echoes it on responses
  • T10.5 Gateway propagates x-correlation-id and traceparent to upstream calls and includes them in logs/spans
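
The accept-or-generate rule in 10.5 (and T10.4/T10.5) can be sketched over a plain header map. This is a framework-free illustration: the `HashMap` stands in for real HTTP/gRPC metadata and assumes lowercased header names. The id generator is a placeholder built from a timestamp and a counter (a real service would more likely use a UUID), and the function names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

pub const CORRELATION_HEADER: &str = "x-correlation-id";
pub const TRACEPARENT_HEADER: &str = "traceparent";

static SEQUENCE: AtomicU64 = AtomicU64::new(0);

/// Placeholder id generator (assumption): timestamp plus process-local counter.
fn generate_id() -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let seq = SEQUENCE.fetch_add(1, Ordering::Relaxed);
    format!("{:x}-{:x}", nanos, seq)
}

/// Returns the inbound correlation id, or generates one and stores it
/// in `headers` so later stages and the response see the same value.
pub fn ensure_correlation_id(headers: &mut HashMap<String, String>) -> String {
    let existing = headers
        .get(CORRELATION_HEADER)
        .filter(|id| !id.trim().is_empty())
        .cloned();
    match existing {
        Some(id) => id,
        None => {
            let id = generate_id();
            headers.insert(CORRELATION_HEADER.to_string(), id.clone());
            id
        }
    }
}

/// Copies only the correlation and trace headers for an upstream call.
pub fn upstream_headers(inbound: &HashMap<String, String>) -> HashMap<String, String> {
    [CORRELATION_HEADER, TRACEPARENT_HEADER]
        .iter()
        .filter_map(|name| inbound.get(*name).map(|v| (name.to_string(), v.clone())))
        .collect()
}
```

Calling `ensure_correlation_id` once at the start of request handling, before any logging, means every log line and span for the request can carry the same id.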