Gateway Transport Plan

Purpose

Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:

  • Simplicity (few patterns, minimal bespoke conventions)
  • Ease of operation (consistent health/ready/metrics, consistent failure modes)
  • Frugality (bounded connections, bounded fanout, low overhead)
  • High performance (low tail latency, backpressure-aware, predictable routing)
  • Safety (tenant isolation, deny-by-default authz, consistent context propagation)

Non-Negotiable Rules (Global)

  • Every cross-service request MUST carry tenant + trace context.
  • Every transport path MUST have explicit timeouts/deadlines and bounded retries.
  • Every milestone below is “stop-the-line” gated:
    • All tasks completed
    • All tests passing
    • Workspace lint/format/type checks passing
    • Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)

Current State (Baseline)

  • Gateway → Aggregate: gRPC command submission
  • Gateway → Projection: HTTP query proxy (/v1/query/*)
  • Gateway → Runner: HTTP proxy for admin endpoints (/admin/runner/*)
  • Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)

Target Architecture (End State)

  • Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
  • Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
  • Async/event backbone: NATS JetStream remains for event/work distribution
  • The shared crate is the single source of truth for:
    • Header names and propagation rules
    • Trace parsing/validation rules (traceparent, trace-id)
    • Request context representation (tenant/correlation/trace)

Definitions

Request Context

Fields that must be consistently propagated:

  • tenant_id (HTTP: x-tenant-id, NATS: tenant-id)
  • correlation_id (HTTP: x-correlation-id, NATS: x-correlation-id and correlation-id)
  • traceparent (HTTP: traceparent, NATS: traceparent)
  • trace_id (derived from traceparent or provided explicitly; NATS: trace-id)
  • request_id (HTTP: x-request-id, optional for NATS)
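
For concreteness, the shared contract could expose these names roughly as follows (a sketch assuming a shared::context module; the constant and type names are illustrative, not the existing API):

```rust
/// Canonical header names, assuming a `shared::context` module.
pub mod headers {
    pub const HTTP_TENANT_ID: &str = "x-tenant-id";
    pub const HTTP_CORRELATION_ID: &str = "x-correlation-id";
    pub const HTTP_REQUEST_ID: &str = "x-request-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const NATS_TENANT_ID: &str = "tenant-id";
    pub const NATS_CORRELATION_ID: &str = "correlation-id";
    pub const NATS_TRACE_ID: &str = "trace-id";
}

/// The propagated request context; fields mirror the list above.
#[derive(Debug, Clone)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
    pub request_id: Option<String>,
}

impl RequestContext {
    /// Derive the 32-hex-char trace id from a W3C traceparent
    /// ("00-<trace-id>-<parent-id>-<flags>"), if one is present.
    pub fn derive_trace_id(&mut self) {
        if self.trace_id.is_none() {
            if let Some(tp) = &self.traceparent {
                if let Some(id) = tp.split('-').nth(1) {
                    if id.len() == 32 && id.chars().all(|c| c.is_ascii_hexdigit()) {
                        self.trace_id = Some(id.to_string());
                    }
                }
            }
        }
    }
}
```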

Standard Health Endpoints (per service)

  • GET /health: liveness
  • GET /ready: readiness (includes tenant gating if applicable)
  • GET /metrics: Prometheus metrics
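
A minimal sketch of this surface, assuming an axum-based HTTP server (handler bodies are placeholders; a real /ready handler would check tenant gating and /metrics would render Prometheus text):

```rust
use axum::{routing::get, Router};

/// Standard health surface shared by every service; the handlers
/// here only return stub bodies.
fn health_routes() -> Router {
    Router::new()
        .route("/health", get(|| async { "ok" }))
        .route("/ready", get(|| async { "ready" }))
        .route("/metrics", get(|| async { String::new() }))
}
```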

Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)

Goal

Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).

Exit Criteria

  • A single shared contract exists for header names and trace parsing.
  • Gateway injects context into all upstream calls (including rebalance/health probes).
  • Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
  • Unit tests prove propagation behavior for each transport.
  • cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, and cargo test --workspace all pass.

Tasks

  • Standardize header constants in shared and remove string literals from Gateway and nodes where feasible.
  • Add shared helpers (see the sketch after this task list) for:
    • HTTP extract/inject
    • gRPC metadata extract/inject
    • NATS header extract/inject
  • Gateway: ensure context is injected into:
    • gRPC upstream requests to Aggregate
    • HTTP upstream requests to Projection
    • Runner admin proxy requests
    • Any “probe” calls (rebalance gates, fleet snapshots, health checks)
  • Projection/Runner/Aggregate: ensure NATS published messages include:
    • tenant-id
    • x-correlation-id + correlation-id
    • traceparent
    • trace-id (derived when possible)
  • Add transport-level tests:
    • Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
    • Gateway HTTP proxy path: incoming context → upstream headers preserved
    • NATS publish path: produced headers contain expected keys/values
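
The NATS leg of the shared inject helpers referenced above might reduce to something like this sketch (assuming the async_nats HeaderMap and the RequestContext sketched under Definitions; the function name is illustrative):

```rust
use async_nats::HeaderMap;
use shared::context::RequestContext; // the struct sketched under Definitions

/// Copy the request context into NATS headers, writing both the legacy
/// x-correlation-id key and the canonical correlation-id key, as the
/// task list above requires. A sketch, not the existing shared API.
fn inject_nats_headers(ctx: &RequestContext, headers: &mut HeaderMap) {
    headers.insert("tenant-id", ctx.tenant_id.as_str());
    headers.insert("x-correlation-id", ctx.correlation_id.as_str());
    headers.insert("correlation-id", ctx.correlation_id.as_str());
    if let Some(tp) = &ctx.traceparent {
        headers.insert("traceparent", tp.as_str());
    }
    if let Some(id) = &ctx.trace_id {
        headers.insert("trace-id", id.as_str());
    }
}
```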

Required Tests

  • Unit tests for shared parsing/derivation utilities (see the example after this list)
  • Existing per-crate test suites
  • At least one per-service “transport contract” test verifying headers are present and correct
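
One shape such a unit test could take, exercising the derive_trace_id sketch from Definitions against the W3C example traceparent:

```rust
#[cfg(test)]
mod tests {
    use super::RequestContext; // as sketched under Definitions

    #[test]
    fn trace_id_is_derived_from_traceparent() {
        let mut ctx = RequestContext {
            tenant_id: "acme".into(),
            correlation_id: "c-1".into(),
            traceparent: Some(
                "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".into(),
            ),
            trace_id: None,
            request_id: None,
        };
        ctx.derive_trace_id();
        assert_eq!(
            ctx.trace_id.as_deref(),
            Some("4bf92f3577b34da6a3ce929d0e0e4736")
        );
    }
}
```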

Milestone 1: Internal RPC Standardization (Projection via gRPC)

Goal

Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.

Exit Criteria

  • A Projection gRPC service exists for query execution.
  • Gateway routes queries to Projection via gRPC by default.
  • Authorization semantics remain enforced in Gateway (deny-by-default).
  • Response shapes are stable and match the existing UI expectations.
  • All tests pass, including new gRPC query integration tests.

Tasks

  • Define protobuf API: projection.gateway.v1.QueryService
    • Request includes tenant + view + query payload and metadata
    • Response includes result payload and standard context propagation
  • Implement Projection gRPC server:
    • Parse tenant/view/query
    • Execute query against current projection storage/query engine
    • Enforce tenant scope
  • Implement Gateway gRPC client path for queries (see the sketch after this task list):
    • Routing by tenant to Projection endpoint
    • Deadlines, bounded retries (idempotent only)
    • Context propagation (tenant/correlation/trace)
  • Keep HTTP /v1/query/*:
    • Either route to internal gRPC implementation or keep as legacy/debug endpoint
  • Add tests:
    • Gateway query authz + forwarding via gRPC
    • Projection gRPC query contract tests for tenant isolation
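
The deadline-plus-context decoration on the Gateway client path could look like this sketch (assuming tonic; the helper name and error mapping are illustrative, not existing code):

```rust
use std::time::Duration;
use tonic::metadata::MetadataValue;
use tonic::{Request, Status};

use shared::context::RequestContext; // the struct sketched under Definitions

/// Attach the standard context and a per-call deadline to an outgoing
/// tonic request before it is sent to Projection.
fn decorate<T>(
    mut req: Request<T>,
    ctx: &RequestContext,
    deadline: Duration,
) -> Result<Request<T>, Status> {
    req.set_timeout(deadline);
    let md = req.metadata_mut();
    md.insert(
        "x-tenant-id",
        MetadataValue::try_from(ctx.tenant_id.as_str())
            .map_err(|_| Status::invalid_argument("bad tenant id"))?,
    );
    md.insert(
        "x-correlation-id",
        MetadataValue::try_from(ctx.correlation_id.as_str())
            .map_err(|_| Status::invalid_argument("bad correlation id"))?,
    );
    if let Some(tp) = &ctx.traceparent {
        if let Ok(v) = MetadataValue::try_from(tp.as_str()) {
            md.insert("traceparent", v);
        }
    }
    Ok(req)
}
```

Retries would wrap calls decorated this way only for the read-only profile defined in Milestone 3.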

Required Tests

  • New gRPC QueryService tests (unit + integration)
  • Existing query/authz tests in Gateway
  • Workspace fmt/clippy/test

Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)

Goal

Replace /admin/runner/* HTTP proxying with a first-class gRPC admin service for Runner operations.

Exit Criteria

  • Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
  • Gateway uses gRPC to call Runner admin APIs.
  • Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
  • Admin operations are idempotent where appropriate and include audit hooks where required.
  • All tests pass and include negative/tenant-spoof cases.

Tasks

  • Define protobuf API: runner.admin.v1.RunnerAdmin
    • Drain/resume/status/reload/tenant-scoped controls
    • Standard error mapping
  • Implement Runner gRPC admin server (see the sketch after this task list):
    • Tenant gating enforced for tenant-scoped operations
    • Readiness/drain semantics aligned with platform contracts
  • Implement Gateway gRPC client integration:
    • Route to Runner endpoint via routing table
    • Enforce authz rights (e.g. runner.admin)
    • Context propagation
  • Keep HTTP /admin/* in Runner optional:
    • Either remove Gateway proxy usage or keep for direct debugging behind secure network
  • Tests:
    • Gateway: admin calls rejected without rights
    • Gateway: tenant spoof attempts rejected
    • Runner: idempotency and drain semantics validated
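
The tenant gating for tenant-scoped admin RPCs could reduce to a small deny-by-default guard like this sketch (assuming tonic; is_tenant_enabled stands in for the real gating logic):

```rust
use tonic::{Request, Status};

/// Deny-by-default tenant gate for tenant-scoped admin RPCs: the call
/// is rejected unless the tenant header is present, valid, and enabled.
fn require_enabled_tenant<T>(
    req: &Request<T>,
    is_tenant_enabled: impl Fn(&str) -> bool,
) -> Result<String, Status> {
    let tenant = req
        .metadata()
        .get("x-tenant-id")
        .and_then(|v| v.to_str().ok())
        .ok_or_else(|| Status::unauthenticated("missing x-tenant-id"))?;
    if !is_tenant_enabled(tenant) {
        return Err(Status::permission_denied("tenant not enabled"));
    }
    Ok(tenant.to_string())
}
```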

Required Tests

  • gRPC RunnerAdmin unit/integration tests
  • Gateway proxy-to-gRPC tests
  • Workspace fmt/clippy/test

Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)

Goal

Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.

Exit Criteria

  • Gateway maintains bounded upstream connection pools for gRPC endpoints.
  • All gRPC calls have deadlines; retries are only for idempotent operations.
  • All probe/fanout calls are bounded and do not cause thundering herds.
  • Load/soak tests show stable behavior under partial failure.

Tasks

  • Implement a Gateway upstream channel pool:
    • LRU bounded by max endpoints
    • TTL/eviction strategy
    • Fast path reuse under load
  • Standardize retry profiles (see the sketch after this task list):
    • Read-only: short retry with jitter
    • Mutations: no automatic retry unless idempotency key present
  • Standardize timeouts:
    • Edge timeout limits
    • Internal per-service deadlines
  • Fanout controls:
    • Concurrency limiters for fleet snapshot/probes
    • Cache results where safe (short TTL)
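
The read-only retry profile could look like this sketch (assuming tokio and the rand crate; attempt counts and delays are illustrative, not tuned values):

```rust
use rand::Rng;
use std::time::Duration;

/// Bounded retry with jittered backoff for read-only (idempotent) calls;
/// mutations would skip this entirely unless an idempotency key is set.
async fn retry_read<T, E, F, Fut>(mut call: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    const MAX_ATTEMPTS: u32 = 3;
    const BASE_DELAY_MS: u64 = 25;
    let mut attempt = 0;
    loop {
        match call().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= MAX_ATTEMPTS => return Err(e),
            Err(_) => {
                attempt += 1;
                // Full jitter: sleep for a random duration in [0, base * 2^attempt).
                let cap = BASE_DELAY_MS << attempt;
                let delay = rand::thread_rng().gen_range(0..cap);
                tokio::time::sleep(Duration::from_millis(delay)).await;
            }
        }
    }
}
```

Full jitter spreads concurrent retries apart, which also serves the fanout/thundering-herd goals above.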

Required Tests

  • Unit tests for pool eviction/TTL
  • Gateway integration tests for deadline propagation
  • Gated load tests (document env + how to run)

Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)

Goal

Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.

Exit Criteria

  • Gateway no longer depends on HTTP for Projection queries or Runner admin.
  • Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
  • All operational playbooks rely on standardized endpoints.

Tasks

  • Remove the Gateway's HTTP query proxy usage (or keep it only as a compatibility shim).
  • Remove the Gateway's runner admin HTTP proxy usage (or keep it only as a compatibility shim).
  • Ensure Control UI + Control API use the standardized Gateway surfaces.
  • Harden metrics and health probes to always carry context.

Required Tests

  • End-to-end smoke tests (gated)
  • Workspace fmt/clippy/test

Verification Commands (Required at Each Milestone)

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace
  • npm ci && npm run lint && npm run typecheck && npm run test && npm run build (in control/ui)

Notes / Constraints

  • Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
  • Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.