# Transport Development Plan

## Purpose

Unify and optimize the platform transport layer end-to-end:

- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate

This plan merges and supersedes:

- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`

## Current Status (Codebase Reality)

- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are partially standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - Some header names are centralized in `shared` but not all call sites use constants yet.
- Gateway → Aggregate is already HTTP (edge) → gRPC (internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes `x-correlation-id` and `correlation-id`, and ensures `traceparent`/`trace-id` are present/derived when possible.
  - Aggregate publishes `trace-id` when `traceparent` is present.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

## Principles

- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.

## Non-Negotiable Rules (Global)

- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented

## Baseline (Today)

- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`

## End State (Target Architecture)

- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers

## Standard Contracts

### Context Fields

- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)

### Standard Service Endpoints (every service)

- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus

## Milestone 0: Shared Transport Contract (Headers + Context + Trace)

### Goal

Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later
milestone builds on one contract.

### Exit Criteria

- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.

### Tasks

- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [~] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [ ] HTTP: `x-tenant-id`, `x-request-id` (missing constants)
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [ ] NATS: `tenant-id` constant, `Nats-Msg-Id` constant (missing constants)
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [ ] normalize/generate correlation id when missing across all transports (helper exists for `CorrelationId::generate()`; adoption incomplete)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [ ] additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement

### Required Tests

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`

## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)

### Dependencies

- Milestone 0

### Goal

Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.

### Exit Criteria

- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.

### Tasks

- [ ] Create/standardize subject builder helpers (prefer `shared`):
  - [ ] Aggregate event subject builder (`tenant..aggregate..`)
  - [ ] Runner effect/effect_result/workflow subject builders
- [~] Aggregate publishes:
  - [ ] `tenant-id` header always present (still needs enforcement everywhere)
  - [ ] correlation + trace headers always present when available, generated when required
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [ ] `Nats-Msg-Id` strategy explicitly defined and tested
- [~] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`)
  - [x] trace headers derived consistently when possible (`traceparent` from `trace-id`, `trace-id` from `traceparent`)
  - [ ] outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
- [~] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
- [ ] Add unit tests:
  - [ ] subject formatting tests per service (once builders exist)
  - [ ] required header presence tests per publisher (enforce required keys)

### Required Tests

- Workspace verification commands

## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)

### Dependencies

- Milestone 1

### Goal

Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.

### Exit Criteria

- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.

### Tasks

- [ ] Define stream policies:
  - [ ] `AGGREGATE_EVENTS` (subjects, retention, duplicate window)
  - [ ] `WORKFLOW_COMMANDS`
  - [ ] `WORKFLOW_EVENTS`
- [ ] Implement compatibility validation rules:
  - [ ] required subjects are present (superset allowed)
  - [ ] retention/limits are within allowed ranges
  - [ ] dedupe assumptions align with producer `Nats-Msg-Id` usage
- [ ] Add unit tests for stream config builders + validators.

### Required Tests

- Workspace verification commands

## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)

### Dependencies

- Milestone 2

### Goal

Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.

### Exit Criteria

- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable

### Tasks

- [ ] Standardize consumer defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to worker concurrency
- [ ] Projection:
  - [ ] durable naming collision-free for Single/PerView modes
  - [ ] checkpoint gate semantics: “skip still acks”
  - [ ] poison handling persists durable records and terminates reliably
- [ ] Runner:
  - [ ] durable naming collision-free and stable across replicas
  - [ ] deliver group rules defined and tested
  - [ ] outbox relay exactly-once behavior verified under redelivery
- [ ] Aggregate:
  - [ ] ad-hoc fetch consumer always unique and bounded
  - [ ] best-effort deletion never targets unrelated consumers
- [ ] Add gated NATS integration tests and document env flags:
  - [ ] Runner ignored tests
  - [ ] Projection ignored tests

### Required Tests

- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`

## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)

### Dependencies

- Milestone 0 (context contract)

### Goal

Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.

### Exit Criteria

- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
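
The Gateway client for this milestone needs deadlines and bounded retries for idempotent calls only. A minimal sketch of that policy, using only the standard library, might look like the following. `RetryProfile`, `call_with_deadline`, and all numeric values are hypothetical names and placeholders, not part of the existing codebase; a real client would apply the remaining budget as the gRPC per-call timeout and would add jitter to the backoff.

```rust
use std::time::{Duration, Instant};

/// Hypothetical retry profile; field names and values are illustrative only.
#[derive(Clone, Copy)]
struct RetryProfile {
    max_attempts: u32,
    base_backoff: Duration,
    deadline: Duration,
}

/// Calls `op` until it succeeds, attempts run out, or the next backoff would
/// not fit in the overall deadline. Non-idempotent calls are attempted once.
/// `op` receives the remaining time budget so it can set a per-call timeout.
fn call_with_deadline<T, E>(
    profile: RetryProfile,
    idempotent: bool,
    mut op: impl FnMut(Duration) -> Result<T, E>,
) -> Result<T, E> {
    let started = Instant::now();
    let max_attempts = if idempotent { profile.max_attempts.max(1) } else { 1 };
    let mut attempt: u32 = 1;
    loop {
        let remaining = profile.deadline.saturating_sub(started.elapsed());
        match op(remaining) {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Exponential backoff: base, 2x base, 4x base, ... (no jitter here).
                let factor = 2u32.saturating_pow(attempt - 1);
                let backoff = profile.base_backoff.saturating_mul(factor);
                let remaining = profile.deadline.saturating_sub(started.elapsed());
                if attempt >= max_attempts || backoff >= remaining {
                    return Err(err);
                }
                std::thread::sleep(backoff);
                attempt += 1;
            }
        }
    }
}

fn main() {
    let profile = RetryProfile {
        max_attempts: 3,
        base_backoff: Duration::from_millis(5),
        deadline: Duration::from_millis(200),
    };
    let mut calls = 0;
    // Simulated read-only query that fails twice before succeeding.
    let result: Result<&str, &str> = call_with_deadline(profile, true, |_remaining| {
        calls += 1;
        if calls < 3 { Err("unavailable") } else { Ok("rows") }
    });
    println!("{result:?} after {calls} call(s)");
}
```

Passing `idempotent = false` limits a mutation to a single attempt, which matches the rule that retries apply to idempotent operations only.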
### Tasks

- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Implement Projection gRPC server for query execution
- [ ] Implement Gateway gRPC client routing to Projection
  - [ ] deadlines
  - [ ] bounded retries (idempotent only)
  - [ ] context propagation
- [ ] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [ ] route internally to gRPC or keep as legacy endpoint
- [ ] Add tests:
  - [ ] authz + forwarding via gRPC
  - [ ] tenant isolation enforcement in Projection QueryService

### Required Tests

- Workspace verification commands

## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)

### Dependencies

- Milestone 0 (context contract)

### Goal

Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.

### Exit Criteria

- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.

### Tasks

- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Implement Runner gRPC admin server
- [ ] Implement Gateway gRPC client integration for admin operations
- [ ] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [ ] Add tests:
  - [ ] Gateway: rejects without rights
  - [ ] Gateway: rejects tenant spoof attempts
  - [ ] Runner: idempotency and drain semantics

### Required Tests

- Workspace verification commands

## Milestone 6: Gateway Upstream Performance + Operational Guardrails

### Dependencies

- Milestones 4–5 (gRPC internal RPC surfaces available)

### Goal

Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.

### Exit Criteria

- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.

### Tasks

- [ ] Implement upstream channel pool
  - [ ] bounded LRU
  - [ ] TTL/eviction
  - [ ] fast-path reuse under load
- [ ] Standardize retry profiles
  - [ ] read-only: limited retry with jitter
  - [ ] mutations: no retry unless idempotency key is present and semantics are safe
- [ ] Standardize timeouts/deadlines:
  - [ ] edge timeout limits
  - [ ] internal per-service deadlines
- [ ] Fanout controls:
  - [ ] concurrency limiters for probes/snapshots
  - [ ] short TTL caching where safe
- [ ] Ensure probes carry context (correlation/trace) for observability.

### Required Tests

- Workspace verification commands
- Gated load/soak tests (document env + how to run)

## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)

### Dependencies

- Milestone 6

### Goal

Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.

### Exit Criteria

- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).

### Tasks

- [ ] Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
- [ ] Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
- [ ] Ensure Control UI + Control API rely only on standardized surfaces
- [ ] Harden metrics and readiness probes to match the standard contract everywhere

### Required Tests

- Workspace verification commands
- End-to-end smoke tests (gated)

## Workspace Verification Commands (Run for Every Milestone)

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
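
As an illustration of the Context Fields contract and the Milestone 0 trace helpers, the sketch below maps a request context to the NATS headers listed in this plan and derives `trace-id` from a W3C `traceparent` value. The type and function names (`RequestContext`, `derive_trace_id`, `nats_headers`) are assumptions made for illustration and are not the `shared` crate's actual API; only the header strings come from this plan. The parser accepts only the four-field version `00` layout, which is stricter than a production helper may want to be.

```rust
// Header names as listed under "Context Fields".
const NATS_TENANT_ID: &str = "tenant-id";
const NATS_X_CORRELATION_ID: &str = "x-correlation-id";
const NATS_CORRELATION_ID: &str = "correlation-id";
const TRACEPARENT: &str = "traceparent";
const NATS_TRACE_ID: &str = "trace-id";

/// Illustrative context shape (hypothetical, not the `shared` crate's types).
struct RequestContext {
    tenant_id: String,
    correlation_id: String,
    traceparent: Option<String>,
}

/// True if `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Extracts the trace id from a `traceparent` of the form
/// `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`.
/// Returns `None` for malformed values or all-zero ids.
fn derive_trace_id(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    let well_formed = version == "00"
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2)
        && trace_id.bytes().any(|b| b != b'0')
        && parent_id.bytes().any(|b| b != b'0');
    well_formed.then(|| trace_id.to_string())
}

/// Builds the NATS headers for a publish. Tenant and correlation are always
/// set; `traceparent` is forwarded as received, and `trace-id` is added only
/// when it can be derived from a valid `traceparent`.
fn nats_headers(ctx: &RequestContext) -> Vec<(&'static str, String)> {
    let mut headers = vec![
        (NATS_TENANT_ID, ctx.tenant_id.clone()),
        (NATS_X_CORRELATION_ID, ctx.correlation_id.clone()),
        (NATS_CORRELATION_ID, ctx.correlation_id.clone()),
    ];
    if let Some(tp) = &ctx.traceparent {
        headers.push((TRACEPARENT, tp.clone()));
        if let Some(trace_id) = derive_trace_id(tp) {
            headers.push((NATS_TRACE_ID, trace_id));
        }
    }
    headers
}

fn main() {
    let ctx = RequestContext {
        tenant_id: "tenant-a".to_string(),
        correlation_id: "corr-123".to_string(),
        traceparent: Some(
            "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".to_string(),
        ),
    };
    for (name, value) in nats_headers(&ctx) {
        println!("{name}: {value}");
    }
}
```

A per-publisher contract test could assert on the returned header list directly, without needing a NATS server.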