Transport Development Plan
Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- GATEWAY_TRANSPORT_PLAN.md
- NATS_TRANSPORT_PLAN.md
Current Status (Codebase Reality)
- Monorepo workspace exists; shared crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are partially standardized:
  - shared provides TenantId, CorrelationId, TraceId
  - shared provides trace_id_from_traceparent(...) and traceparent_from_trace_id(...)
  - Some header names are centralized in shared but not all call sites use constants yet.
- Gateway → Aggregate is already HTTP (edge) → gRPC (internal) and propagates x-tenant-id, x-correlation-id, and traceparent.
- Gateway → Projection remains HTTP proxy (/v1/query/...) and Gateway → Runner remains HTTP admin proxy (/admin/runner/...).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes x-correlation-id and correlation-id, and ensures traceparent/trace-id are present/derived when possible.
  - Aggregate publishes trace-id when traceparent is present.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented
Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: AGGREGATE_EVENTS, WORKFLOW_COMMANDS, WORKFLOW_EVENTS
End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- shared is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (traceparent, trace-id)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers
Standard Contracts
Context Fields
- Tenant: HTTP x-tenant-id, NATS tenant-id
- Correlation: HTTP x-correlation-id, NATS x-correlation-id and correlation-id
- Trace: HTTP traceparent, NATS traceparent and trace-id (derived when possible)
- Request id: HTTP x-request-id (optional for NATS)
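The field-to-header mapping above could be captured once in shared as constants plus a single mapping function. The following is a minimal sketch under that assumption: the module layout, the RequestContext struct, and to_nats_headers are illustrative names, not the crate's actual API. Only the header strings are taken from the list above.

```rust
// Hypothetical sketch; module, type, and function names are illustrative.

/// HTTP header names from the Context Fields list.
pub mod http {
    pub const TENANT_ID: &str = "x-tenant-id";
    pub const CORRELATION_ID: &str = "x-correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const REQUEST_ID: &str = "x-request-id";
}

/// NATS header names from the Context Fields list.
pub mod nats {
    pub const TENANT_ID: &str = "tenant-id";
    pub const X_CORRELATION_ID: &str = "x-correlation-id";
    pub const CORRELATION_ID: &str = "correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const TRACE_ID: &str = "trace-id";
}

/// Assumed shape of the per-request context carried across hops.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
}

/// Map a context to NATS headers. Tenant and both correlation keys are
/// always emitted; trace headers are emitted only when a value is known.
pub fn to_nats_headers(ctx: &RequestContext) -> Vec<(&'static str, String)> {
    let mut headers = vec![
        (nats::TENANT_ID, ctx.tenant_id.clone()),
        (nats::X_CORRELATION_ID, ctx.correlation_id.clone()),
        (nats::CORRELATION_ID, ctx.correlation_id.clone()),
    ];
    if let Some(tp) = &ctx.traceparent {
        headers.push((nats::TRACEPARENT, tp.clone()));
    }
    if let Some(id) = &ctx.trace_id {
        headers.push((nats::TRACE_ID, id.clone()));
    }
    headers
}
```

Returning plain key/value pairs keeps the sketch independent of any particular NATS client type; a publisher would copy the pairs into its own header map.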
Standard Service Endpoints (every service)
- GET /health: liveness
- GET /ready: readiness (includes tenant gating if relevant)
- GET /metrics: Prometheus
Milestone 0: Shared Transport Contract (Headers + Context + Trace)
Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
Exit Criteria
- shared contains canonical constants for header names and NATS header names.
- shared contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
Tasks
- Add shared ID types in shared:
  - TenantId
  - CorrelationId
  - TraceId
- [~] Consolidate header constants in shared:
  - HTTP: x-correlation-id, traceparent, trace-id (for NATS/interop)
  - HTTP: x-tenant-id, x-request-id (missing constants)
  - NATS: correlation-id (used in Runner), trace-id (now emitted where possible)
  - NATS: tenant-id constant, Nats-Msg-Id constant (missing constants)
- Add shared helpers:
  - derive trace-id from traceparent
  - derive traceparent from trace-id when valid
  - normalize/generate correlation id when missing across all transports (helper exists for CorrelationId::generate(); adoption incomplete)
- Add unit tests in shared for:
  - traceparent parsing validity
  - serialization shape for correlation/trace id newtypes
  - additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
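The two derivation helpers can be illustrated against the W3C Trace Context format (version-traceid-parentid-flags, with a 32-hex trace id and a 16-hex parent id). This is a sketch only: the function names match the ones mentioned for shared, but the signatures, the explicit span_id parameter, the version-00-only restriction, and the fixed sampled flag (01) are assumptions, not the crate's confirmed behavior.

```rust
/// True for a lowercase-hex string of exactly `len` characters that is not all zeros.
fn is_valid_hex_id(s: &str, len: usize) -> bool {
    s.len() == len
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        && s.bytes().any(|b| b != b'0')
}

/// Extract the trace id from a version-00 traceparent value.
/// Returns None for any malformed input (wrong field count, bad hex,
/// all-zero ids, unsupported version).
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let flags_ok = parts[3].len() == 2
        && parts[3].bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if is_valid_hex_id(parts[1], 32) && is_valid_hex_id(parts[2], 16) && flags_ok {
        Some(parts[1].to_string())
    } else {
        None
    }
}

/// Build a traceparent from a trace id and a caller-supplied span id.
/// Assumption: the caller provides the span id and the sampled flag is set.
pub fn traceparent_from_trace_id(trace_id: &str, span_id: &str) -> Option<String> {
    if is_valid_hex_id(trace_id, 32) && is_valid_hex_id(span_id, 16) {
        Some(format!("00-{trace_id}-{span_id}-01"))
    } else {
        None
    }
}
```

The invalid inputs rejected here (wrong lengths, non-hex characters, all-zero ids) correspond to the "additional validation cases" item in the task list.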
Required Tests
- cargo fmt --check
- cargo clippy --workspace --all-targets -- -D warnings
- cargo test --workspace
Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
Dependencies
- Milestone 0
Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
Tasks
- Create/standardize subject builder helpers (prefer shared):
  - Aggregate event subject builder (tenant.<tenant>.aggregate.<type>.<id>)
  - Runner effect/effect_result/workflow subject builders
- [~] Aggregate publishes:
  - tenant-id header always present (still needs enforcement everywhere)
  - correlation + trace headers always present when available, generated when required
  - trace-id is derived when traceparent is present (now emitted in publish path)
  - Nats-Msg-Id strategy explicitly defined and tested
- [~] Runner publishes (commands/results):
  - correlation headers emitted consistently (x-correlation-id + correlation-id)
  - trace headers derived consistently when possible (traceparent from trace-id, trace-id from traceparent)
  - outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
- [~] Projection consumption:
  - envelope decoding remains tolerant (unknown fields ignored)
  - [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
- Add unit tests:
  - subject formatting tests per service (once builders exist)
  - required header presence tests per publisher (enforce required keys)
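A subject builder for the Aggregate event pattern tenant.<tenant>.aggregate.<type>.<id> might look like the sketch below. The function name, the error type, and the token rules are assumptions for illustration; the rules reject the characters that have special meaning in NATS subjects ('.', '*', '>') plus whitespace, so producers can only publish concrete subjects.

```rust
/// Hypothetical error type for subject construction.
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken(&'static str),
    InvalidToken(&'static str),
}

/// Reject tokens that would change the subject's shape or act as wildcards.
fn check_token(field: &'static str, value: &str) -> Result<(), SubjectError> {
    if value.is_empty() {
        return Err(SubjectError::EmptyToken(field));
    }
    let bad = value
        .chars()
        .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace());
    if bad {
        return Err(SubjectError::InvalidToken(field));
    }
    Ok(())
}

/// Build a concrete subject of the form tenant.<tenant>.aggregate.<type>.<id>.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    check_token("tenant", tenant)?;
    check_token("aggregate_type", aggregate_type)?;
    check_token("aggregate_id", aggregate_id)?;
    Ok(format!("tenant.{tenant}.aggregate.{aggregate_type}.{aggregate_id}"))
}
```

A subject formatting test per service can then assert both the happy-path string and the rejection of wildcard or empty tokens.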
Required Tests
- Workspace verification commands
Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
Dependencies
- Milestone 1
Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
Tasks
- Define stream policies:
  - AGGREGATE_EVENTS (subjects, retention, duplicate window)
  - WORKFLOW_COMMANDS
  - WORKFLOW_EVENTS
- Implement compatibility validation rules:
  - required subjects are present (superset allowed)
  - retention/limits are within allowed ranges
  - dedupe assumptions align with producer Nats-Msg-Id usage
- Add unit tests for stream config builders + validators.
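The compatibility rules above can be expressed as a pure function that compares a policy with the configuration reported for an existing stream, which also makes it testable without NATS. The sketch below is an assumption-heavy illustration: the struct and enum names, the use of max age as the retention limit, and the exact-string subject comparison (wildcard coverage is not evaluated) are all hypothetical choices.

```rust
use std::ops::RangeInclusive;
use std::time::Duration;

/// Hypothetical authoritative policy for one stream.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub allowed_max_age: RangeInclusive<Duration>,
    pub min_duplicate_window: Duration,
}

/// Hypothetical view of what the server reports for an existing stream.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExistingStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Incompatibility {
    MissingSubject(String),
    MaxAgeOutOfRange,
    DuplicateWindowTooShort,
}

/// Validate without mutating anything; an empty result means compatible.
pub fn validate(policy: &StreamPolicy, existing: &ExistingStream) -> Vec<Incompatibility> {
    let mut problems = Vec::new();
    // Superset allowed: extra subjects on the stream are not an error.
    for required in &policy.required_subjects {
        if !existing.subjects.contains(required) {
            problems.push(Incompatibility::MissingSubject(required.clone()));
        }
    }
    if !policy.allowed_max_age.contains(&existing.max_age) {
        problems.push(Incompatibility::MaxAgeOutOfRange);
    }
    // Producers relying on Nats-Msg-Id dedupe need at least this window.
    if existing.duplicate_window < policy.min_duplicate_window {
        problems.push(Incompatibility::DuplicateWindowTooShort);
    }
    problems
}
```

On startup a service would create the stream if it is missing and otherwise refuse to start (or report not-ready) when the returned list is non-empty, rather than rewriting the stream in place.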
Required Tests
- Workspace verification commands
Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
Dependencies
- Milestone 2
Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (ack_wait, max_deliver, max_ack_pending).
- Application-level concurrency is bounded and aligned with max_in_flight.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable
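One way to keep consumer defaults uniform is a single constructor that applies defaults, honors environment overrides, and ties max_ack_pending to worker concurrency. The sketch below assumes hypothetical environment variable names (CONSUMER_ACK_WAIT_SECS, CONSUMER_MAX_DELIVER) and hypothetical default values (30 seconds, 5 deliveries); the plan does not specify either. The lookup is passed in as a closure so the logic can be unit-tested without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical standardized consumer settings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Build consumer settings from defaults plus optional overrides.
/// Invalid or non-positive override values fall back to the default.
pub fn consumer_defaults(
    worker_concurrency: usize,
    lookup: impl Fn(&str) -> Option<String>,
) -> ConsumerDefaults {
    let ack_wait_secs = lookup("CONSUMER_ACK_WAIT_SECS")
        .and_then(|v| v.parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .unwrap_or(30);
    let max_deliver = lookup("CONSUMER_MAX_DELIVER")
        .and_then(|v| v.parse::<i64>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(5);
    ConsumerDefaults {
        ack_wait: Duration::from_secs(ack_wait_secs),
        max_deliver,
        // In-flight work on the server side never exceeds what workers can hold.
        max_ack_pending: worker_concurrency.max(1) as i64,
    }
}
```

In a service this would be called as consumer_defaults(workers, |key| std::env::var(key).ok()), with the result copied into the client's consumer configuration alongside the explicit ack policy.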
Tasks
- Standardize consumer defaults:
  - AckPolicy::Explicit
  - ack_wait default + env override
  - max_deliver default + env override
  - max_ack_pending tied to worker concurrency
- Projection:
  - durable naming collision-free for Single/PerView modes
  - checkpoint gate semantics: “skip still acks”
  - poison handling persists durable records and terminates reliably
- Runner:
  - durable naming collision-free and stable across replicas
  - deliver group rules defined and tested
  - outbox relay exactly-once behavior verified under redelivery
- Aggregate:
  - ad-hoc fetch consumer always unique and bounded
  - best-effort deletion never targets unrelated consumers
- Add gated NATS integration tests and document env flags:
  - Runner ignored tests
  - Projection ignored tests
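The "skip still acks" checkpoint gate and the poison rule can be combined into one decision function per delivered message. This is a sketch under assumptions: the AckAction enum, the use of the stream sequence as the checkpoint key, and the rule "terminate once the delivery count reaches max_deliver" are illustrative, and persisting the quarantine record is left to the caller.

```rust
/// What the consumer should tell JetStream after handling one delivery.
#[derive(Debug, PartialEq, Eq)]
pub enum AckAction {
    /// Acknowledge: applied successfully or already covered by the checkpoint.
    Ack,
    /// Negative-ack so the message is redelivered later.
    Nak,
    /// Terminate redelivery; the caller must persist a quarantine record first.
    TermAndQuarantine,
}

/// Decide the ack action for one delivery.
/// `delivery_count` is 1 for the first delivery of a message.
pub fn decide(
    stream_seq: u64,
    checkpoint: u64,
    delivery_count: u64,
    max_deliver: u64,
    handle: impl FnOnce() -> Result<(), String>,
) -> AckAction {
    // Checkpoint gate: already applied, so skip the handler but still ack,
    // otherwise the message would be redelivered forever.
    if stream_seq <= checkpoint {
        return AckAction::Ack;
    }
    match handle() {
        Ok(()) => AckAction::Ack,
        // Last allowed attempt failed: treat as poison.
        Err(_) if delivery_count >= max_deliver => AckAction::TermAndQuarantine,
        Err(_) => AckAction::Nak,
    }
}
```

Keeping the decision pure makes redelivery idempotency and poison termination checkable in unit tests, while the gated integration tests confirm the same behavior against a real server.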
Required Tests
- Workspace verification commands
- Runner: RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored
- Projection: PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored
Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
Dependencies
- Milestone 0 (context contract)
Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
Exit Criteria
- Projection exposes projection.gateway.v1.QueryService.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
Tasks
- Define protobuf API: projection.gateway.v1.QueryService
- Implement Projection gRPC server for query execution
- Implement Gateway gRPC client routing to Projection
  - deadlines
  - bounded retries (idempotent only)
  - context propagation
- Preserve HTTP /v1/query/* as compatibility/debug:
  - route internally to gRPC or keep as legacy endpoint
- Add tests:
  - authz + forwarding via gRPC
  - tenant isolation enforcement in Projection QueryService
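For the deadline item, one common approach is to derive each internal call's timeout from the time remaining on the edge request, capped by a per-service limit, so an internal gRPC call never outlives the request that triggered it. The helper below is a sketch of that idea; the function name and the choice to pass the current time explicitly (for testability) are assumptions, not part of the plan.

```rust
use std::time::{Duration, Instant};

/// Compute the timeout for an internal call.
/// Returns None when the edge deadline has already passed, in which case
/// the caller should fail fast instead of issuing the call.
pub fn internal_timeout(
    now: Instant,
    edge_deadline: Instant,
    per_service_cap: Duration,
) -> Option<Duration> {
    // checked_duration_since yields None if the deadline is in the past.
    let remaining = edge_deadline.checked_duration_since(now)?;
    if remaining.is_zero() {
        return None;
    }
    // Never wait longer than the service-specific cap.
    Some(remaining.min(per_service_cap))
}
```

The Gateway client would apply the returned duration as the per-call gRPC deadline and skip any retry whose budget comes back as None.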
Required Tests
- Workspace verification commands
Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
Dependencies
- Milestone 0 (context contract)
Goal
Replace Gateway’s /admin/runner/* HTTP proxy usage with a first-class gRPC admin service.
Exit Criteria
- Runner exposes runner.admin.v1.RunnerAdmin.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
Tasks
- Define protobuf API: runner.admin.v1.RunnerAdmin
- Implement Runner gRPC admin server
- Implement Gateway gRPC client integration for admin operations
- Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- Add tests:
  - Gateway: rejects without rights
  - Gateway: rejects tenant spoof attempts
  - Runner: idempotency and drain semantics
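The two Gateway rejection cases can be sketched as a deny-by-default check that runs before any admin call is forwarded. Everything here is hypothetical: the Principal type, the string-based rights model, and the right name used in the usage note are assumptions made only to show the shape of the check, where the tenant asserted in x-tenant-id must match the authenticated tenant.

```rust
/// Why a call was refused.
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    MissingRight,
    TenantMismatch,
}

/// Hypothetical authenticated caller, as established at the edge.
pub struct Principal {
    pub tenant_id: String,
    pub rights: Vec<String>,
}

/// Deny by default: the call is allowed only if the caller holds the
/// required right AND the tenant it claims matches its authenticated tenant.
pub fn authorize_admin_call(
    principal: &Principal,
    claimed_tenant: &str,
    required_right: &str,
) -> Result<(), Denied> {
    if !principal.rights.iter().any(|r| r == required_right) {
        return Err(Denied::MissingRight);
    }
    if principal.tenant_id != claimed_tenant {
        return Err(Denied::TenantMismatch);
    }
    Ok(())
}
```

A caller authenticated for tenant acme that sends x-tenant-id: other is rejected with TenantMismatch even when it holds the admin right (for example a right named runner:admin), which is the deterministic rejection the exit criteria ask for.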
Required Tests
- Workspace verification commands
Milestone 6: Gateway Upstream Performance + Operational Guardrails
Dependencies
- Milestones 4–5 (gRPC internal RPC surfaces available)
Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
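A bounded channel pool with LRU eviction and a TTL can be sketched with only the standard library. The version below is generic over the channel type so it stays independent of any gRPC client; the type and method names are assumptions, the eviction scan is linear (acceptable for a small, bounded pool), and the current time is passed in so behavior is deterministic in tests.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: Instant,
}

/// Hypothetical bounded pool keyed by upstream endpoint.
pub struct ChannelPool<C> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity: capacity.max(1), ttl, entries: HashMap::new() }
    }

    /// Reuse a live channel for `endpoint`, or create one with `connect`.
    pub fn get_or_connect(
        &mut self,
        endpoint: &str,
        now: Instant,
        connect: impl FnOnce() -> C,
    ) -> C {
        // TTL eviction: drop every entry older than the configured lifetime.
        let ttl = self.ttl;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.created) < ttl);

        // Fast path: reuse and mark as recently used.
        if let Some(entry) = self.entries.get_mut(endpoint) {
            entry.last_used = now;
            return entry.channel.clone();
        }

        // LRU eviction: make room by removing the least recently used entry.
        if self.entries.len() >= self.capacity {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(key, _)| key.clone());
            if let Some(key) = lru {
                self.entries.remove(&key);
            }
        }

        let channel = connect();
        self.entries.insert(
            endpoint.to_string(),
            Entry { channel: channel.clone(), created: now, last_used: now },
        );
        channel
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The Clone bound reflects the assumption that channels are cheap handles over a shared multiplexed connection; a real pool would also need a lock or sharding for concurrent access.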
Tasks
- Implement upstream channel pool:
  - bounded LRU
  - TTL/eviction
  - fast-path reuse under load
- Standardize retry profiles:
  - read-only: limited retry with jitter
  - mutations: no retry unless idempotency key is present and semantics are safe
- Standardize timeouts/deadlines:
  - edge timeout limits
  - internal per-service deadlines
- Fanout controls:
  - concurrency limiters for probes/snapshots
  - short TTL caching where safe
- Ensure probes carry context (correlation/trace) for observability.
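The retry profiles can be written down as data plus a small backoff function. The sketch below uses capped exponential backoff with caller-supplied jitter; the attempt counts, delays, and type names are placeholder assumptions, and the idempotent_and_safe flag stands in for "idempotency key is present and semantics are safe". Jitter is passed as permille (0 to 1000) so the arithmetic stays in integers and the function stays deterministic under test.

```rust
use std::time::Duration;

/// Kind of upstream operation, as seen by the Gateway.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    ReadOnly,
    Mutation { idempotent_and_safe: bool },
}

/// Hypothetical retry profile; all numbers are illustrative defaults.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetryProfile {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

pub fn profile_for(op: OpKind) -> RetryProfile {
    let max_attempts = match op {
        OpKind::ReadOnly => 3,
        OpKind::Mutation { idempotent_and_safe: true } => 2,
        // A single attempt means no retry at all.
        OpKind::Mutation { idempotent_and_safe: false } => 1,
    };
    RetryProfile {
        max_attempts,
        base_delay: Duration::from_millis(50),
        max_delay: Duration::from_millis(400),
    }
}

/// Delay before the next attempt, or None when attempts are exhausted.
/// `failed_attempt` is 1 for the first failure; `jitter_permille` scales
/// the capped exponential delay (1000 = full delay, 0 = retry immediately).
pub fn backoff(
    profile: &RetryProfile,
    failed_attempt: u32,
    jitter_permille: u32,
) -> Option<Duration> {
    if failed_attempt == 0 || failed_attempt >= profile.max_attempts {
        return None;
    }
    let factor = 2u32.saturating_pow(failed_attempt - 1);
    let capped = profile
        .base_delay
        .saturating_mul(factor)
        .min(profile.max_delay);
    Some(capped / 1000 * jitter_permille.min(1000))
}
```

A caller would draw jitter_permille from its random source on each failure and sleep for the returned duration, stopping as soon as backoff returns None or the request deadline is exhausted.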
Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
Dependencies
- Milestone 6
Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
Tasks
- Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
- Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
- Ensure Control UI + Control API rely only on standardized surfaces
- Harden metrics and readiness probes to match the standard contract everywhere
Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
Workspace Verification Commands (Run for Every Milestone)
- cargo fmt --check
- cargo clippy --workspace --all-targets -- -D warnings
- cargo test --workspace
- npm ci && npm run lint && npm run typecheck && npm run test && npm run build (in control/ui)