cloudlysis/gateway/prd.md at main

madapes/cloudlysis

Vlad Durnea 1298d9a3df

Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets

2026-03-30 11:40:42 +03:00

🧱 Component: Gateway

Definition:
The Gateway is the single ingress for the platform. It provides:

Tenant-aware routing to the node services: Aggregate (write/commands), Projection (read/queries), and Runner (workflow/saga + effects admin).
Centralized authn (password via Argon2 + Google OIDC; extensible to more providers) and authz (tenant-scoped RBAC).
Cross-cutting concerns: request validation, rate limiting, observability, and consistent error semantics.

The Gateway is responsible for enforcing multi-tenancy at the edge: it treats x-tenant-id as the tenant selection signal, validates it against the caller identity, and routes requests to the correct tenant shard/node.

Context: Existing Nodes

This PRD is based on the currently implemented node repositories:

Aggregate: defines gRPC Command API aggregate.gateway.v1.CommandService/SubmitCommand in aggregate.proto. Aggregate’s PRD explicitly expects the Gateway to route by x-tenant-id (aggregate/prd.md).
Projection: provides health/admin HTTP endpoints and implements an in-process UQF query engine as QueryService but does not currently expose it over HTTP/gRPC (uqf.rs).
Runner: uses a gRPC client to submit aggregate commands “through the gateway” (config key aggregate_gateway_url), propagating x-tenant-id as gRPC metadata (GatewayClient, OutboxRelay).
Tenant placement: there is precedent for NATS JetStream KV as a control plane for tenant placement/sharding (Runner tenant filter watcher: tenant_placement.rs; Aggregate KV client helper: swarm.rs). There is also a simple static mapping example in gateway-routing.yaml.

Problem Statement

Clients (and internal workers like Runner) need a stable, secure entrypoint that:

Authenticates identities (humans and services)
Authorizes actions per tenant
Routes requests to the correct node(s) for the selected tenant
Provides consistent APIs independent of the underlying shard topology and service discovery

Without a Gateway, each node would need to re-implement auth, tenant enforcement, rate limiting, and topology discovery, increasing security risk and operational complexity.

Goals

Provide one entrypoint for command submission (Aggregate) and query execution (Projection), and an authenticated entrypoint for workflow/admin actions (Runner).
Enforce tenant isolation using x-tenant-id:
- Validate tenant selection is allowed for the caller
- Prevent tenant spoofing
Prioritize independent scalability of Aggregate, Projection, and Runner:
- Scale each service horizontally without requiring the others to scale
- Allow tenant assignments for each service to be rebalanced independently
Support authn:
- Username/password with Argon2 password hashing
- Google OIDC login (future providers supported)
Support authz:
- Tenant-scoped RBAC with explicit permissions
- Service identities for internal traffic (Runner → Gateway)
Provide operational endpoints: /health, /ready, /metrics, config/routing introspection (admin-only).

Non-Goals

Implement the Aggregate/Projection/Runner business logic.
Replace NATS JetStream as the event bus or the storage responsibilities of nodes.
Provide a general-purpose API gateway for arbitrary upstreams; this Gateway is purpose-built for platform nodes.
Provide UI/console; the Gateway only exposes APIs.

Primary Users

External clients: applications submitting commands and running queries.
Internal services: Runner submitting commands on behalf of sagas.
Operators: managing tenant placement and observing health/metrics.

Key Concepts

Tenant Selection and Enforcement

x-tenant-id is the canonical tenant selector for all tenant-scoped requests.
The Gateway MUST reject requests when:
- The endpoint is tenant-scoped and x-tenant-id is missing (unless explicitly configured as single-tenant default).
- The caller is not authorized for that tenant.
The Gateway SHOULD normalize and validate tenant IDs using the same constraints the nodes already use (alphanumeric characters plus `-` and `_`).
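
The rejection rules above can be sketched as a small validation function. This is a minimal illustration, not the nodes' actual validator: the `TenantIdError` type, the `validate_tenant_id` name, the `MAX_TENANT_ID_LEN` limit, and the choice of whitespace trimming as the only normalization step are all assumptions made for the example.

```rust
/// Reasons a tenant ID can be rejected at the edge (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum TenantIdError {
    Missing,
    TooLong,
    InvalidChar(char),
}

/// Hypothetical upper bound; the PRD does not specify a length limit.
pub const MAX_TENANT_ID_LEN: usize = 64;

/// Validates the raw `x-tenant-id` header value.
///
/// `single_tenant_default` models the "single-tenant default" configuration:
/// when set, a missing or empty header resolves to the empty tenant ID.
pub fn validate_tenant_id(
    raw: Option<&str>,
    single_tenant_default: bool,
) -> Result<String, TenantIdError> {
    // Normalization here is limited to trimming surrounding whitespace.
    let value = raw.map(str::trim).unwrap_or("");
    if value.is_empty() {
        return if single_tenant_default {
            Ok(String::new())
        } else {
            Err(TenantIdError::Missing)
        };
    }
    if value.len() > MAX_TENANT_ID_LEN {
        return Err(TenantIdError::TooLong);
    }
    // Allowed alphabet: ASCII alphanumerics plus '-' and '_'.
    if let Some(bad) = value
        .chars()
        .find(|c| !(c.is_ascii_alphanumeric() || *c == '-' || *c == '_'))
    {
        return Err(TenantIdError::InvalidChar(bad));
    }
    Ok(value.to_string())
}
```

Format validation is only the first gate. The authorization check (is this caller allowed to select this tenant?) still has to run afterwards, since a well-formed ID can still be spoofed.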

Node Types and Traffic Classes

Aggregate (write path): synchronous command submission; returns events.
Projection (read path): query execution; returns query results; eventual consistency is expected.
Runner (workflow/admin path): operational endpoints for runner configuration, drain, reload, and diagnostics; access is admin-only.

Tenant-Aware Routing

Routing decision is primarily based on tenant_id, and secondarily on request kind (aggregate vs projection vs runner).
The Gateway abstracts the topology: clients do not need to know which node hosts their tenant.

Independent Scalability and Rebalancing

Each service (Aggregate, Projection, Runner) can have its own tenant-to-shard placement. The Gateway resolves routing per (tenant_id, service_kind).
Rebalancing is defined as moving a tenant’s assignment for a specific service from one shard to another with bounded disruption.

Functional Requirements

1) Authentication (AuthN)

AuthN surface area:
- Signup, signin, signout
- Forgot password, reset password
- MFA enrollment and MFA challenge (step-up)
- Google OIDC login (and future providers)
- Service identities (internal callers)
Password-based accounts:
- Store passwords hashed with Argon2id using per-user random salts and parameters suitable for production.
- Signup MUST support email verification before the account becomes active (configurable per environment).
- Signin MUST support MFA when required by policy.
- Signout MUST revoke refresh tokens (and MAY maintain a short-lived access-token denylist if needed).
Sessions and tokens:
- Issue a short-lived access token and a refresh token with rotation.
- Refresh tokens MUST be stored server-side (hashed at rest) to support revocation and rotation.
- Support both browser and API clients:
  - Browser: refresh token in an HttpOnly cookie with CSRF protections.
  - API clients: refresh token presented in an Authorization header and held in secure client storage (this PRD gives no storage guidance, e.g. on localStorage; the implementation chooses).
OIDC (Google):
- Support Authorization Code flow with PKCE.
- Map OIDC identities to internal users; allow linking multiple providers per user.
- Future providers (e.g., GitHub, Azure AD) should fit the same model.
Service auth (internal):
- Support service identities for Runner → Gateway and other future internal callers.
- Recommended approach: mTLS and/or signed JWTs with a sub of service:<name> plus explicit RBAC grants.
Forgot / reset password:
- Forgot password MUST create a one-time reset token with an expiry and store only a hash of it.
- Reset password MUST verify the token, enforce password policy, rotate credentials, and revoke all refresh tokens for the user.
- Sending reset links/codes is a side effect; the Gateway SHOULD trigger it via the platform’s effect execution path (Runner effect providers) rather than embedding SMTP credentials in the Gateway.
MFA:
- Support TOTP (authenticator apps) as the default MFA method.
- Support recovery codes (one-time use) for account recovery.
- MFA enrollment MUST require a recent primary authentication (step-up).
- MFA challenges MUST be bound to an auth session and have short expiration.

2) Authorization (AuthZ / RBAC)

RBAC entities:
- User (human identity)
- Service (machine identity)
- Tenant
- Role (set of permissions)
- Assignment (principal ↔ tenant ↔ role)
Authorization checks:
- Command submission permissions: per tenant, optionally scoped by aggregate_type.
- Query permissions: per tenant, optionally scoped by view_type.
- Admin permissions: routing/config endpoints, runner admin passthrough, tenant placement changes.
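
To make the entity model concrete, the sketch below evaluates a tenant-scoped permission check over roles and assignments. It is an in-memory illustration only: the `Rbac`, `Permission`, and `Assignment` type names, the action strings such as `command.submit`, and the principal naming (`user:42`, `service:runner`) are assumptions for the example, not a prescribed schema.

```rust
use std::collections::{HashMap, HashSet};

/// A permission, optionally narrowed to one resource type
/// (an aggregate_type for commands, a view_type for queries).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Permission {
    pub action: String,           // hypothetical names, e.g. "command.submit"
    pub resource: Option<String>, // None = any resource type
}

/// Assignment: principal <-> tenant <-> role.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Assignment {
    pub principal: String, // a user or a service identity
    pub tenant_id: String,
    pub role: String,
}

#[derive(Default)]
pub struct Rbac {
    roles: HashMap<String, HashSet<Permission>>,
    assignments: HashSet<Assignment>,
}

impl Rbac {
    /// Adds permissions to a role, creating the role if needed.
    pub fn define_role(&mut self, role: &str, perms: Vec<Permission>) {
        self.roles.entry(role.to_string()).or_default().extend(perms);
    }

    /// Grants `role` to `principal` within `tenant_id` only.
    pub fn assign(&mut self, principal: &str, tenant_id: &str, role: &str) {
        self.assignments.insert(Assignment {
            principal: principal.to_string(),
            tenant_id: tenant_id.to_string(),
            role: role.to_string(),
        });
    }

    /// True if any role held by `principal` in `tenant_id` grants `action`,
    /// either unscoped or scoped to exactly `resource`.
    pub fn is_allowed(
        &self,
        principal: &str,
        tenant_id: &str,
        action: &str,
        resource: &str,
    ) -> bool {
        self.assignments
            .iter()
            .filter(|a| a.principal == principal && a.tenant_id == tenant_id)
            .filter_map(|a| self.roles.get(&a.role))
            .flatten()
            .any(|p| {
                p.action == action
                    && p.resource.as_deref().map_or(true, |r| r == resource)
            })
    }
}
```

Because every assignment carries a tenant, a role granted in one tenant never leaks into another: the same principal asking for the same action under a different `x-tenant-id` is denied unless a separate assignment exists.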

3) Routing to Nodes

The Gateway MUST route to:

Aggregate nodes for command submission.
Projection nodes for query execution.
Runner nodes for admin/ops passthrough.

Routing inputs:

tenant_id (from x-tenant-id or request body for internal gRPC; header is authoritative for external HTTP).
A routing table defining tenant → shard/node → service endpoint(s), where placement MAY differ per service kind.

Routing behavior:

The Gateway MUST be able to hot-reload routing configuration without restart.
The Gateway SHOULD support both:
- Static config (file-based mapping for development)
- Dynamic config (NATS KV-based control plane for production)
The Gateway MUST support routing when placements are independent:
- aggregate_placement[tenant_id] -> aggregate_shard_id
- projection_placement[tenant_id] -> projection_shard_id
- runner_placement[tenant_id] -> runner_shard_id
The Gateway SHOULD expose placement revisions and effective routing decisions for debugging (admin-only).
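
The independent-placement requirement can be illustrated with a two-step lookup: first `(service_kind, tenant_id)` to a shard, then `(service_kind, shard_id)` to an endpoint. The sketch below is one possible in-memory shape; `RoutingTable`, `RouteError`, and the example endpoint URLs are hypothetical, and a real table would carry multiple endpoints per shard.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RouteError {
    /// No placement exists for this (service kind, tenant) pair.
    UnknownTenant,
    /// The placement points at a shard missing from the shard directory.
    UnknownShard(String),
}

#[derive(Default)]
pub struct RoutingTable {
    /// Exposed for debugging, mirroring the "placement revisions" requirement.
    pub revision: u64,
    placement: HashMap<(ServiceKind, String), String>, // (kind, tenant) -> shard id
    shards: HashMap<(ServiceKind, String), String>,    // (kind, shard id) -> endpoint
}

impl RoutingTable {
    pub fn place(&mut self, kind: ServiceKind, tenant_id: &str, shard_id: &str) {
        self.placement
            .insert((kind, tenant_id.to_string()), shard_id.to_string());
    }

    pub fn add_shard(&mut self, kind: ServiceKind, shard_id: &str, endpoint: &str) {
        self.shards
            .insert((kind, shard_id.to_string()), endpoint.to_string());
    }

    /// Resolves the endpoint for one tenant and one service kind.
    /// Placements of the other service kinds are never consulted.
    pub fn resolve(&self, kind: ServiceKind, tenant_id: &str) -> Result<&str, RouteError> {
        let shard = self
            .placement
            .get(&(kind, tenant_id.to_string()))
            .ok_or(RouteError::UnknownTenant)?;
        self.shards
            .get(&(kind, shard.clone()))
            .map(String::as_str)
            .ok_or_else(|| RouteError::UnknownShard(shard.clone()))
    }
}
```

Keying both maps by service kind is what lets a tenant sit on one Aggregate shard and a differently numbered Projection shard, and lets either assignment move without touching the other.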

4) Public APIs (Initial)

The Gateway exposes two public surface areas:

Command Submission (Write)

gRPC: implement aggregate.gateway.v1.CommandService/SubmitCommand for internal callers (Runner) and optional external clients.
HTTP: provide a simple REST wrapper to allow browser and non-gRPC clients.

HTTP sketch:

POST /v1/commands/{aggregate_type}/{aggregate_id}
- Headers: Authorization, x-tenant-id
- Body: JSON command payload
- Response: JSON containing events (mirrors the gRPC response shape)

Query Execution (Read)

Because Projection currently implements UQF query logic but does not expose it, the Gateway defines a stable API and routes to a Projection query endpoint once it exists.

HTTP sketch:

POST /v1/query/{view_type}
- Headers: Authorization, x-tenant-id
- Body: { "uqf": "<json-string>" }
- Response: { "mode": "find" | "count", ... } compatible with Projection’s QueryResponse shape.

5) Operational APIs

GET /health and GET /ready for load balancers.
GET /metrics for Prometheus/VictoriaMetrics.
Admin-only:
- GET /admin/routing (current effective routing table and revision)
- POST /admin/routing/reload (force reload; should still be safe if watcher exists)
- Runner passthrough under /admin/runner/* (authenticated + authorized)

6) AuthN Endpoints (HTTP)

The Gateway SHOULD expose a stable HTTP AuthN API (exact payloads may evolve; semantics should not):

POST /v1/auth/signup
POST /v1/auth/signin
POST /v1/auth/signout
POST /v1/auth/refresh
POST /v1/auth/forgot
POST /v1/auth/reset
POST /v1/auth/mfa/enroll/start
POST /v1/auth/mfa/enroll/confirm
POST /v1/auth/mfa/challenge
POST /v1/auth/oidc/google/start
GET /v1/auth/oidc/google/callback

The Gateway MUST enforce rate limits on signin/forgot/reset and MUST apply abuse protections (generic error responses for account existence, IP/device throttling).
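
As an illustration of the throttling requirement, the sketch below implements a fixed-window attempt counter keyed by an arbitrary string (an IP, a device fingerprint, or an account identifier). The algorithm choice is an assumption made for brevity; the PRD does not mandate fixed windows, and a token bucket or sliding window would serve equally well. `FixedWindowLimiter` and its parameters are hypothetical names.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window limiter: at most `max_attempts` per `window` per key.
pub struct FixedWindowLimiter {
    max_attempts: u32,
    window: Duration,
    state: HashMap<String, (Instant, u32)>, // key -> (window start, attempts)
}

impl FixedWindowLimiter {
    pub fn new(max_attempts: u32, window: Duration) -> Self {
        Self {
            max_attempts,
            window,
            state: HashMap::new(),
        }
    }

    /// Records an attempt for `key` at time `now`.
    /// Returns `false` when the caller should be throttled.
    /// `now` is passed in explicitly so the logic is testable without sleeping.
    pub fn allow(&mut self, key: &str, now: Instant) -> bool {
        let entry = self.state.entry(key.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            // The previous window has elapsed: start a fresh one.
            *entry = (now, 0);
        }
        if entry.1 >= self.max_attempts {
            return false;
        }
        entry.1 += 1;
        true
    }
}
```

A throttled signin should still return the same generic error body as a failed one, so the limiter does not itself become a signal about account existence.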

7) Admin IAM APIs (HTTP)

The Gateway MUST expose an admin-facing API surface for the Admin UI node to manage authentication + authorization:

Users: create, read, update, disable, delete
Identities: link/unlink OIDC identities, manage password credentials, enforce email verification status
Roles and Rights: define permissions (rights), create/update roles, assign rights to roles
Assignments: assign roles to principals (users/services) scoped to a tenant
Service Accounts: create/rotate credentials for internal callers, assign tenant roles
MFA Admin Actions: reset MFA for a user, revoke recovery codes, force re-enrollment
Sessions: revoke refresh tokens for a user (global signout)

Endpoint sketch (admin-only, audited, paginated):

GET /v1/admin/iam/users
POST /v1/admin/iam/users
GET /v1/admin/iam/users/{user_id}
PATCH /v1/admin/iam/users/{user_id}
POST /v1/admin/iam/users/{user_id}/disable
POST /v1/admin/iam/users/{user_id}/sessions/revoke
POST /v1/admin/iam/users/{user_id}/mfa/reset
GET /v1/admin/iam/rights
POST /v1/admin/iam/rights
GET /v1/admin/iam/roles
POST /v1/admin/iam/roles
GET /v1/admin/iam/roles/{role_id}
PATCH /v1/admin/iam/roles/{role_id}
POST /v1/admin/iam/roles/{role_id}/rights
GET /v1/admin/iam/assignments
POST /v1/admin/iam/assignments
DELETE /v1/admin/iam/assignments/{assignment_id}

Tenant scoping rules:

Tenant-scoped operations MUST require x-tenant-id and apply within that tenant (role assignments, tenant membership, tenant admin).
Platform-scoped operations MUST NOT depend on x-tenant-id (right/permission catalog, platform admins, global user search).

All admin IAM endpoints MUST require strong authorization (platform admin or tenant admin depending on the resource) and MUST produce an immutable audit trail (who changed what, from where, and when).

Non-Functional Requirements

Security
- Reject requests missing tenant context when required.
- Do not trust x-tenant-id unless it is authorized by the caller identity.
- Rate limit authentication endpoints and command submission endpoints.
- Ensure secrets never appear in logs (tokens, OIDC codes, passwords).
- Enforce secure defaults for sessions:
  - HttpOnly + Secure cookies where applicable, explicit CSRF protections for browser flows.
  - Access token TTLs and refresh token rotation with revocation.
  - Account lockout / progressive throttling for credential stuffing.
- Require key management and rotation:
  - JWT signing keys MUST support rotation; old keys remain valid only for bounded overlap.
  - Password reset tokens, email verification tokens, and refresh tokens MUST be stored as hashes.
- Require transport security:
  - mTLS between Gateway and internal nodes (or an equivalent, explicit service-to-service auth boundary).
- Produce auditable, immutable logs for admin IAM actions and tenant placement changes.
Reliability
- Timeouts for upstream calls; bounded retries only when safe (idempotency key present).
- Circuit breaking per upstream endpoint.
- Graceful degradation when routing config control plane is temporarily unavailable (serve last known good config).
Observability
- Correlate requests with request_id and trace_id.
- Emit structured logs and Prometheus metrics (request counts, latency histograms, auth failures, upstream errors).
- Emit security signals (failed signins, MFA failures, suspicious IP/device patterns) suitable for alerting.
Performance
- Minimize per-request allocations; use connection pools for upstreams.
- Cache routing decisions keyed by (tenant_id, service_kind) with small TTL and invalidation on routing config change.
Compatibility
- Support single-tenant mode (empty tenant id) for development and early environments, without changing client code.
- Define API versioning rules and a consistent error envelope for HTTP APIs.
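
The routing-decision cache described under Performance could look roughly like the sketch below: entries keyed by `(tenant_id, service_kind)`, a short TTL, and a full flush whenever the routing config revision changes. `RouteCache`, `on_config_revision`, and the explicit `now` parameter are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

struct Entry {
    endpoint: String,
    expires_at: Instant,
}

pub struct RouteCache {
    ttl: Duration,
    revision: u64, // routing config revision the entries were computed from
    entries: HashMap<(String, ServiceKind), Entry>,
}

impl RouteCache {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            revision: 0,
            entries: HashMap::new(),
        }
    }

    /// Drops every cached decision when the routing config revision changes.
    pub fn on_config_revision(&mut self, revision: u64) {
        if revision != self.revision {
            self.entries.clear();
            self.revision = revision;
        }
    }

    /// Returns the cached endpoint if present and not yet expired at `now`.
    pub fn get(&self, tenant_id: &str, kind: ServiceKind, now: Instant) -> Option<&str> {
        self.entries
            .get(&(tenant_id.to_string(), kind))
            .filter(|e| now < e.expires_at)
            .map(|e| e.endpoint.as_str())
    }

    /// Stores a routing decision that expires `ttl` after `now`.
    pub fn put(&mut self, tenant_id: &str, kind: ServiceKind, endpoint: &str, now: Instant) {
        self.entries.insert(
            (tenant_id.to_string(), kind),
            Entry {
                endpoint: endpoint.to_string(),
                expires_at: now + self.ttl,
            },
        );
    }
}
```

The TTL bounds staleness if a config-change notification is missed, while the revision-based flush keeps the common case (a watcher delivering updates promptly) correct without waiting for expiry.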

Proposed Architecture

High-Level Flow

Client / Runner
  |
  |  (Authorization, x-tenant-id)
  v
Gateway
  | 1) AuthN (password/OIDC/service)
  | 2) AuthZ (RBAC per tenant + permission)
  | 3) Tenant routing (tenant_id -> node -> endpoint)
  v
Aggregate / Projection / Runner nodes

Components Inside the Gateway

API Layer
- HTTP server for REST endpoints
- gRPC server implementing aggregate.gateway.v1.CommandService for Runner compatibility
Identity Layer
- Credential verification (Argon2)
- OIDC provider integration (Google)
- Token issuance and verification (JWT access + refresh token rotation)
Authorization Layer
- RBAC policy evaluation for each request
- Tenant membership validation for x-tenant-id
Routing Layer
- Routing config loader: file + NATS KV watcher
- Routing decision: (tenant_id, service_kind) -> endpoint with independent placement per service kind
- Health-aware endpoint selection (optional phase): avoid unhealthy endpoints when multiple replicas exist
Upstream Clients
- Aggregate upstream: gRPC client (forward SubmitCommand)
- Projection upstream: HTTP or gRPC client (forward Query)
- Runner upstream: HTTP client for admin passthrough (restricted)

Routing Config Model (Recommended)

Represent routing as two layers:

Placement maps (tenant → shard), per service kind:
- aggregate_placement[tenant_id] -> aggregate_shard_id
- projection_placement[tenant_id] -> projection_shard_id
- runner_placement[tenant_id] -> runner_shard_id
Shard directory (shard → endpoints), per service kind:
- aggregate_shards[aggregate_shard_id] -> { grpc_endpoint, http_endpoint, admin_endpoint? }
- projection_shards[projection_shard_id] -> { http_endpoint, admin_endpoint? }
- runner_shards[runner_shard_id] -> { http_endpoint, admin_endpoint }

This supports both:

Static YAML/JSON config files for local runs.
Dynamic updates via NATS KV:
- Keys like aggregate/tenants/<tenant_id>, projection/tenants/<tenant_id>, runner/tenants/<tenant_id>
- Keys like aggregate/shards/<shard_id>, projection/shards/<shard_id>, runner/shards/<shard_id>

The Gateway keeps:

Last known good routing config
A revision number (KV revision or monotonic local revision) for observability/debugging
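
A minimal sketch of the "last known good" behavior follows. It assumes the loader (file reader or KV watcher) hands the Gateway a `Result`: a failed load leaves the current config in place, and a config whose revision is not newer than the current one is ignored. `ConfigHolder`, `ApplyOutcome`, and the simplified `RoutingConfig` shape are hypothetical.

```rust
/// Simplified routing config; a real one would hold the placement maps
/// and shard directory per service kind.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RoutingConfig {
    pub revision: u64,
    pub placements: Vec<(String, String)>, // (tenant_id, shard_id)
}

#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    /// The candidate became the active config.
    Applied,
    /// The candidate's revision was not newer than the active one.
    IgnoredStale,
    /// Loading failed; the previous config keeps serving.
    KeptLastKnownGood,
}

pub struct ConfigHolder {
    current: Option<RoutingConfig>,
}

impl ConfigHolder {
    pub fn new() -> Self {
        Self { current: None }
    }

    /// Applies the result of a load attempt.
    /// `Err` models an unreachable control plane or an unparsable config.
    pub fn apply(&mut self, loaded: Result<RoutingConfig, String>) -> ApplyOutcome {
        let next = match loaded {
            Ok(cfg) => cfg,
            Err(_) => return ApplyOutcome::KeptLastKnownGood,
        };
        if let Some(cur) = &self.current {
            if next.revision <= cur.revision {
                return ApplyOutcome::IgnoredStale;
            }
        }
        self.current = Some(next);
        ApplyOutcome::Applied
    }

    /// The config currently used for routing, if any has loaded yet.
    pub fn current(&self) -> Option<&RoutingConfig> {
        self.current.as_ref()
    }
}
```

Until the first successful load, `current()` is `None`; one reasonable policy is for `/ready` to report not-ready in that state, though the PRD does not specify this.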

Rebalancing Mechanism (Control Plane)

Rebalancing is driven by a small control plane that updates placement and coordinates safe handoff:

Placement Store: NATS JetStream KV buckets holding placement maps and shard directory entries.
Rebalancer (operator-driven initially, automated later):
- Reads load signals (Gateway/Node metrics) and proposes moves: (service_kind, tenant_id, from_shard, to_shard)
- Applies moves by writing to KV and orchestrating drain/warmup as needed
- Provides audit trail: who moved what, when, and why

Rebalance flow (per service kind):

Update placement (KV) to include the target shard assignment with a revision.
Ensure the target shard is ready for the tenant (service-specific warmup).
Drain the tenant on the old shard (stop accepting new work for that tenant, finish in-flight).
Finalize by removing/overwriting the old assignment and triggering config reload/watchers.
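
The four steps above can be modeled as a small state machine in which each step may only follow the previous one. This is a sketch under assumed names (`MoveState`, `MoveEvent`, `advance`); it captures ordering only and leaves out timeouts, rollback, and the service-specific warmup and drain work.

```rust
/// Progress of one tenant move for one service kind.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MoveState {
    Proposed,
    PlacementUpdated, // step 1: target assignment written to KV
    TargetReady,      // step 2: target shard warmed up for the tenant
    SourceDrained,    // step 3: old shard finished in-flight work
    Finalized,        // step 4: old assignment removed, watchers notified
}

/// Signals that drive the move forward.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MoveEvent {
    PlacementWritten,
    WarmupComplete,
    DrainComplete,
    OldAssignmentRemoved,
}

/// Returns the next state, or an error if the event arrives out of order.
pub fn advance(state: MoveState, event: MoveEvent) -> Result<MoveState, String> {
    use MoveEvent as E;
    use MoveState as S;
    match (state, event) {
        (S::Proposed, E::PlacementWritten) => Ok(S::PlacementUpdated),
        (S::PlacementUpdated, E::WarmupComplete) => Ok(S::TargetReady),
        (S::TargetReady, E::DrainComplete) => Ok(S::SourceDrained),
        (S::SourceDrained, E::OldAssignmentRemoved) => Ok(S::Finalized),
        (s, e) => Err(format!("event {:?} is not valid in state {:?}", e, s)),
    }
}
```

Rejecting out-of-order events is the point of the model: a drain signal that arrives before the target reports ready is an error, not a shortcut, so the old shard is never drained while the new one cannot yet serve the tenant.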

Service-specific notes:

Projection: can rebuild from JetStream; rebalancing can be “cold” (new shard catches up) with minimal coordination beyond tenant filtering.
Runner: must stop acquiring new work for a tenant, flush outbox dispatch, and persist checkpoints before handing off.
Aggregate: must ensure single-writer semantics per aggregate instance; tenant drain should block new commands during handoff, and the target shard must have state (snapshot transfer) or accept a cold rehydrate from JetStream.

Error Semantics

Auth failures: 401 (unauthenticated) or 403 (forbidden)
Tenant header issues:
- Missing x-tenant-id on tenant-scoped routes: 400
- Invalid tenant format: 400
- Tenant not permitted for principal: 403
Routing failures:
- Unknown tenant assignment: 503 with a retriable hint
- No healthy upstream endpoints: 503
Upstream errors:
- Preserve upstream error category when safe; normalize into a consistent error envelope.
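
The status mapping above can be captured in one place so that every handler produces the same result. The sketch below encodes the listed categories; the `GatewayError` and `ErrorEnvelope` names, the machine-readable `code` strings, and the envelope fields are assumptions, since the PRD leaves the envelope shape open. Only the unknown-tenant-assignment case is marked retriable, because that is the only case this section explicitly gives a retriable hint.

```rust
/// Error categories named in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatewayError {
    Unauthenticated,
    Forbidden,
    MissingTenant,
    InvalidTenantFormat,
    TenantNotPermitted,
    UnknownTenantAssignment,
    NoHealthyUpstream,
}

/// Hypothetical envelope shape; the exact fields are not fixed by the PRD.
#[derive(Debug, PartialEq, Eq)]
pub struct ErrorEnvelope {
    pub status: u16,
    pub code: &'static str,
    pub retriable: bool,
}

impl GatewayError {
    /// HTTP status for each category, as listed in this section.
    pub fn status(self) -> u16 {
        match self {
            GatewayError::Unauthenticated => 401,
            GatewayError::Forbidden | GatewayError::TenantNotPermitted => 403,
            GatewayError::MissingTenant | GatewayError::InvalidTenantFormat => 400,
            GatewayError::UnknownTenantAssignment | GatewayError::NoHealthyUpstream => 503,
        }
    }

    /// Stable machine-readable code (illustrative strings).
    pub fn code(self) -> &'static str {
        match self {
            GatewayError::Unauthenticated => "unauthenticated",
            GatewayError::Forbidden => "forbidden",
            GatewayError::MissingTenant => "missing_tenant",
            GatewayError::InvalidTenantFormat => "invalid_tenant_format",
            GatewayError::TenantNotPermitted => "tenant_not_permitted",
            GatewayError::UnknownTenantAssignment => "unknown_tenant_assignment",
            GatewayError::NoHealthyUpstream => "no_healthy_upstream",
        }
    }

    /// Builds the envelope returned to the client.
    pub fn envelope(self) -> ErrorEnvelope {
        ErrorEnvelope {
            status: self.status(),
            code: self.code(),
            retriable: matches!(self, GatewayError::UnknownTenantAssignment),
        }
    }
}
```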

Rollout Plan

Phase 1 (Minimum viable ingress)

Implement tenant-aware routing for Aggregate command submission.
Implement gRPC SubmitCommand compatible with Runner.
Add HTTP wrapper for command submission.
Introduce basic authn/authz (service identity + a minimal RBAC model).

Phase 2 (Read path + OIDC)

Add query API and route to Projection query endpoint (Projection may need an exposed endpoint).
Add Google OIDC login and account linking.
Harden RBAC and permissions by resource type (aggregate_type, view_type).

Phase 3 (Operations + topology)

NATS KV routing config watcher (hot reload).
Admin APIs for routing inspection and controlled updates.
Health-aware routing and per-tenant rate limits.
Introduce placement maps per service kind (independent scaling).
Introduce a rebalancer workflow (manual first) to move tenant placements safely.

Gaps / Opportunities

Tenant lifecycle APIs: tenant creation, tenant metadata, domain verification, invite flows, default roles, and bootstrap of the first tenant admin.
API conventions: standard error envelope, pagination/cursors, request IDs, idempotency semantics for command submission retries.
Identity hardening: password policy, breached-password checks, device/session management, step-up authentication rules, and admin break-glass procedures.
SSO / enterprise: SCIM provisioning and additional OIDC/SAML providers as a future track.
Audit & compliance: immutable audit log schema, export/retention policies, and per-tenant data access trails.
Rebalancer safety: explicit two-phase cutover semantics (warmup readiness gates + drain completion signals) with operator-visible status.
