# Transport Development Plan

## Purpose

Unify and optimize the platform transport layer end-to-end:

- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate

This plan merges and supersedes:

- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`

## Current Status (Codebase Reality)

- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
  - Most call sites now use `shared` constants/helpers; remaining gaps should be treated as milestone-gated
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
  - Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
  - Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

## Principles

- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.

## Non-Negotiable Rules (Global)

- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented

## Baseline (Today)

- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`

## End State (Target Architecture)

- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers

## Standard Contracts

### Context Fields

- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)

### Standard Service Endpoints (every service)

- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus

## Milestone 0: Shared Transport Contract (Headers + Context + Trace)

### Goal

Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.

### Exit Criteria

- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.

### Tasks

- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [x] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [x] HTTP: `x-tenant-id`, `x-request-id`
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [x] NATS: `tenant-id`, `Nats-Msg-Id`
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
  - [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [x] additional validation cases (invalid traceparents, all-zero ids)

### Required Tests

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`

## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)

### Dependencies

- Milestone 0

### Goal

Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
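
To make the header side of this wire protocol concrete, here is a minimal sketch of how a producer might assemble the required NATS headers from an already-normalized context. The header names come from the Context Fields contract above; the constant names, the `required_nats_headers` function, and the exact signature of `trace_id_from_traceparent` are assumptions for illustration, and the real `shared` helpers may differ. The sketch uses only the standard library and assumes the correlation id and `traceparent` were generated upstream when missing.

```rust
use std::collections::BTreeMap;

// Header names from the "Context Fields" contract; constant names are hypothetical.
pub const NATS_TENANT_ID: &str = "tenant-id";
pub const NATS_MSG_ID: &str = "Nats-Msg-Id";
pub const NATS_X_CORRELATION_ID: &str = "x-correlation-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const NATS_TRACEPARENT: &str = "traceparent";
pub const NATS_TRACE_ID: &str = "trace-id";

/// True if `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Extracts the trace id from a W3C `traceparent` value
/// (`version-traceid-parentid-flags`). Returns `None` for malformed
/// values and for all-zero trace or parent ids.
/// Assumed signature; the helper in `shared` may look different.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<&str> {
    let parts: Vec<&str> = traceparent.split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    let well_formed = is_lower_hex(version, 2)
        && version != "ff"
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2);
    let non_zero = trace_id.bytes().any(|b| b != b'0') && parent_id.bytes().any(|b| b != b'0');
    if well_formed && non_zero {
        Some(trace_id)
    } else {
        None
    }
}

/// Builds the header set a publisher would attach to a message.
/// `trace-id` is only emitted when it can be derived from `traceparent`.
pub fn required_nats_headers(
    tenant_id: &str,
    msg_id: &str,
    correlation_id: &str,
    traceparent: &str,
) -> BTreeMap<&'static str, String> {
    let mut headers = BTreeMap::new();
    headers.insert(NATS_TENANT_ID, tenant_id.to_string());
    headers.insert(NATS_MSG_ID, msg_id.to_string());
    // Both correlation header spellings are emitted, per the contract.
    headers.insert(NATS_X_CORRELATION_ID, correlation_id.to_string());
    headers.insert(NATS_CORRELATION_ID, correlation_id.to_string());
    headers.insert(NATS_TRACEPARENT, traceparent.to_string());
    if let Some(trace_id) = trace_id_from_traceparent(traceparent) {
        headers.insert(NATS_TRACE_ID, trace_id.to_string());
    }
    headers
}
```

A contract test per publisher could then assert that every key in this map is present on the outgoing message, with `Nats-Msg-Id` set to `event_id` for Aggregate events and `command_id` for Runner commands/results.
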
### Exit Criteria

- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.

### Tasks

- [x] Create/standardize subject builder helpers (prefer `shared`):
  - [x] Aggregate event subject builder (`tenant..aggregate..`)
  - [x] Runner effect/effect_result subject builders
  - [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- [x] Aggregate publishes:
  - [x] `tenant-id` header always present
  - [x] correlation + trace headers always present; generated when missing/invalid
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- [x] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
  - [x] trace headers always present/derived when possible; generated when missing/invalid
  - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
  - [x] outbox metadata → NATS headers mapping standardized via shared helpers
- [x] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- [x] Add unit tests:
  - [x] subject formatting tests (shared builders)
  - [x] required header presence tests per publisher (Aggregate + Runner)

### Required Tests

- Workspace verification commands

## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)

### Dependencies

- Milestone 1

### Goal

Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.

### Exit Criteria

- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.

### Tasks

- [x] Define stream policies:
  - [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
  - [x] `WORKFLOW_COMMANDS` is defined and validated on startup
  - [x] `WORKFLOW_EVENTS` is defined and validated on startup
- [x] Centralize stream policy builders/validators in `shared`
- [x] Implement compatibility validation rules:
  - [x] required subjects are present (superset allowed)
  - [x] limits/max_age/duplicate window validated against minimums
  - [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- [x] Add unit tests for stream config builders + validators.

### Required Tests

- Workspace verification commands

## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)

### Dependencies

- Milestone 2

### Goal

Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.

### Exit Criteria

- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable

### Tasks

- [x] Standardize consumer defaults:
  - [x] `AckPolicy::Explicit`
  - [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
  - [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
  - [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- [x] Projection:
  - [x] durable naming collision-free for Single/PerView modes
  - [x] checkpoint gate semantics: “skip still acks”
  - [x] poison handling persists durable records and terminates reliably (poison record + term)
- [x] Runner:
  - [x] durable naming collision-free and stable across replicas
  - [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
  - [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- [x] Aggregate:
  - [x] ad-hoc fetch consumer always unique and bounded
  - [x] best-effort deletion never targets unrelated consumers
- [x] Add gated NATS integration tests and document env flags:
  - [x] Runner ignored tests
  - [x] Projection ignored tests

### Required Tests

- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`

## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)

### Dependencies

- Milestone 0 (context contract)

### Goal

Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.

### Exit Criteria

- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
### Tasks

- [x] Define protobuf API: `projection.gateway.v1.QueryService`
- [x] Implement Projection gRPC server for query execution
- [x] Implement Gateway gRPC client routing to Projection
  - [x] deadlines
  - [x] bounded retries (idempotent only)
  - [x] context propagation
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [x] route internally to gRPC
- [x] Add tests:
  - [x] authz + forwarding via gRPC
  - [x] tenant isolation enforcement in Projection QueryService

### Required Tests

- Workspace verification commands

## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)

### Dependencies

- Milestone 0 (context contract)

### Goal

Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.

### Exit Criteria

- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.

### Tasks

- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [x] Implement Runner gRPC admin server
- [x] Implement Gateway gRPC client integration for admin operations
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [x] Add tests:
  - [x] Gateway: rejects without rights
  - [x] Gateway: rejects tenant spoof attempts
  - [x] Runner: idempotency and drain semantics

### Required Tests

- Workspace verification commands

## Milestone 6: Gateway Upstream Performance + Operational Guardrails

### Dependencies

- Milestones 4–5 (gRPC internal RPC surfaces available)

### Goal

Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.

### Exit Criteria

- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.

### Tasks

- [x] Implement upstream channel pool
  - [x] bounded LRU
  - [x] TTL/eviction
  - [x] fast-path reuse under load (cached gRPC channels)
- [x] Standardize retry profiles
  - [x] read-only: limited retry with jitter (Gateway gRPC calls)
  - [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- [x] Standardize timeouts/deadlines:
  - [x] edge timeout limits
  - [x] internal per-service deadlines
- [x] Fanout controls:
  - [x] concurrency limiters for probes/snapshots
  - [x] short TTL caching where safe
- [x] Ensure probes carry context (correlation/trace) for observability.

### Required Tests

- Workspace verification commands
- Gated load/soak tests (document env + how to run)

## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)

### Dependencies

- Milestone 6

### Goal

Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.

### Exit Criteria

- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
### Tasks

- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- [x] Ensure Control UI + Control API rely only on standardized surfaces
- [x] Harden metrics and readiness probes to match the standard contract everywhere

### Required Tests

- Workspace verification commands
- End-to-end smoke tests (gated)

## Workspace Verification Commands (Run for Every Milestone)

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)