Transport Development Plan
Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`
Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
  - Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
  - Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
  - Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented
Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers
Standard Contracts
Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
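As an illustration of the mapping above, the following is a minimal sketch of a NATS context header builder. The header names come from the list above; the `RequestContext` struct, its field names, and `nats_context_headers` are hypothetical and are not the actual `shared` API.

```rust
use std::collections::BTreeMap;

// Header names taken from the Context Fields list.
pub const NATS_TENANT_ID: &str = "tenant-id";
pub const X_CORRELATION_ID: &str = "x-correlation-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const TRACEPARENT: &str = "traceparent";
pub const NATS_TRACE_ID: &str = "trace-id";

/// Hypothetical context model; the real types in `shared` may differ.
#[derive(Debug, Clone)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: String,
    /// Present only when it could be derived from `traceparent`.
    pub trace_id: Option<String>,
}

/// Builds the NATS header set for one outgoing message.
pub fn nats_context_headers(ctx: &RequestContext) -> BTreeMap<&'static str, String> {
    let mut headers = BTreeMap::new();
    headers.insert(NATS_TENANT_ID, ctx.tenant_id.clone());
    // Correlation is written under both names, as the contract lists both for NATS.
    headers.insert(X_CORRELATION_ID, ctx.correlation_id.clone());
    headers.insert(NATS_CORRELATION_ID, ctx.correlation_id.clone());
    headers.insert(TRACEPARENT, ctx.traceparent.clone());
    // `trace-id` is "derived when possible", so it is optional here.
    if let Some(trace_id) = &ctx.trace_id {
        headers.insert(NATS_TRACE_ID, trace_id.clone());
    }
    headers
}
```

Keeping the builder free of any NATS client type makes it testable without a broker; converting the map into the client's header type is left to the call site.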
Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus
Milestone 0: Shared Transport Contract (Headers + Context + Trace)
Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
Tasks
- Add shared ID types in `shared`:
  - `TenantId`
  - `CorrelationId`
  - `TraceId`
- Consolidate header constants in `shared`:
  - HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - HTTP: `x-tenant-id`, `x-request-id`
  - NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - NATS: `tenant-id`, `Nats-Msg-Id`
- Add shared helpers:
  - derive `trace-id` from `traceparent`
  - derive `traceparent` from `trace-id` when valid
  - normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
  - normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- Add unit tests in `shared` for:
  - traceparent parsing validity
  - serialization shape for correlation/trace id newtypes
  - additional validation cases (invalid traceparents, all-zero ids)
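The derivation helpers can be sketched as follows, assuming the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`, lowercase hex, all-zero ids invalid). The function names match the helpers named in this plan, but the signatures, the `parent_id` and `sampled` parameters, and the version-`00`-only restriction are assumptions; the real `shared` helpers may differ.

```rust
/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Returns the trace id of a valid version-00 traceparent, else `None`.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    let valid = version == "00"
        && is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id)
        && is_lower_hex(flags, 2);
    if valid {
        Some(trace_id.to_string())
    } else {
        None
    }
}

/// Builds a version-00 traceparent when both ids are valid.
/// Taking the parent id and sampled flag as inputs is an assumption of this sketch.
pub fn traceparent_from_trace_id(trace_id: &str, parent_id: &str, sampled: bool) -> Option<String> {
    let ids_ok = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id);
    if !ids_ok {
        return None;
    }
    let flags = if sampled { "01" } else { "00" };
    Some(format!("00-{trace_id}-{parent_id}-{flags}"))
}
```

Returning `Option` rather than panicking lets a normalization step fall back to generating a fresh traceparent when the incoming value is missing or invalid.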
Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
Dependencies
- Milestone 0
Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
Tasks
- Create/standardize subject builder helpers (prefer `shared`):
  - Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
  - Runner effect/effect_result subject builders
  - Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- Aggregate publishes:
  - `tenant-id` header always present
  - correlation + trace headers always present; generated when missing/invalid
  - `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- Runner publishes (commands/results):
  - correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
  - trace headers always present/derived when possible; generated when missing/invalid
  - `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
  - outbox metadata → NATS headers mapping standardized via shared helpers
- Projection consumption:
  - envelope decoding remains tolerant (unknown fields ignored)
  - correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- Add unit tests:
  - subject formatting tests (shared builders)
  - required header presence tests per publisher (Aggregate + Runner)
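A subject builder for the Aggregate event format can be sketched as below. The subject layout `tenant.<tenant>.aggregate.<type>.<id>` is from this plan; the function name, the `SubjectError` type, and the exact set of rejected characters are assumptions. Rejecting `.`, `*`, `>`, and whitespace in each token is what keeps producers publishing concrete subjects only.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken,
    InvalidChar(char),
}

/// Accepts a token only if it cannot change the subject's shape:
/// no separators, no wildcards, no whitespace.
fn check_token(token: &str) -> Result<&str, SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken);
    }
    match token
        .chars()
        .find(|c| matches!(c, '.' | '*' | '>') || c.is_whitespace())
    {
        Some(c) => Err(SubjectError::InvalidChar(c)),
        None => Ok(token),
    }
}

/// Builds `tenant.<tenant>.aggregate.<type>.<id>` from validated tokens.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        check_token(tenant)?,
        check_token(aggregate_type)?,
        check_token(aggregate_id)?
    ))
}
```

A subject formatting test then reduces to asserting the exact string for known inputs and an error for any token containing a wildcard or separator.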
Required Tests
- Workspace verification commands
Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
Dependencies
- Milestone 1
Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
Tasks
- Define stream policies:
  - `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
  - `WORKFLOW_COMMANDS` is defined and validated on startup
  - `WORKFLOW_EVENTS` is defined and validated on startup
  - Centralize stream policy builders/validators in `shared`
- Implement compatibility validation rules:
  - required subjects are present (superset allowed)
  - limits/max_age/duplicate window validated against minimums
  - dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- Add unit tests for stream config builders + validators.
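The compatibility rules above can be expressed as a pure function that needs no NATS connection, which is what makes config-only tests possible. This is a sketch under assumptions: `StreamPolicy`, `ExistingStream`, and their fields are hypothetical shapes, subject comparison is by exact string (wildcard coverage is not evaluated), and a zero `max_age` is read as "no age limit", which is how JetStream treats it.

```rust
use std::time::Duration;

/// Hypothetical policy shape; field names are illustrative.
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub min_max_age: Duration,
    pub min_duplicate_window: Duration,
}

/// The subset of an existing stream's config that the check needs.
pub struct ExistingStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

/// Reports every incompatibility; never mutates or recreates the stream.
pub fn validate_stream(policy: &StreamPolicy, existing: &ExistingStream) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    // Extra subjects on the stream are fine (superset allowed).
    for subject in &policy.required_subjects {
        if !existing.subjects.contains(subject) {
            problems.push(format!("{}: missing subject {}", policy.name, subject));
        }
    }
    // A zero max_age means unlimited retention by age, which satisfies any minimum.
    if !existing.max_age.is_zero() && existing.max_age < policy.min_max_age {
        problems.push(format!("{}: max_age below minimum", policy.name));
    }
    // The duplicate window must cover the period in which producers may resend a Nats-Msg-Id.
    if existing.duplicate_window < policy.min_duplicate_window {
        problems.push(format!("{}: duplicate window below minimum", policy.name));
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

On a validation failure a service would refuse to start rather than update the stream, which is consistent with the "no destructive startup" exit criterion.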
Required Tests
- Workspace verification commands
Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
Dependencies
- Milestone 2
Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable
Tasks
- Standardize consumer defaults:
  - `AckPolicy::Explicit`
  - `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
  - `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
  - `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- Projection:
  - durable naming collision-free for Single/PerView modes
  - checkpoint gate semantics: “skip still acks”
  - poison handling persists durable records and terminates reliably (poison record + term)
- Runner:
  - durable naming collision-free and stable across replicas
  - deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
  - outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- Aggregate:
  - ad-hoc fetch consumer always unique and bounded
  - best-effort deletion never targets unrelated consumers
- Add gated NATS integration tests and document env flags:
  - Runner ignored tests
  - Projection ignored tests
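The consumer defaults can be resolved in one place, as in this sketch. The `*_ACK_TIMEOUT_MS` and `*_MAX_DELIVER` suffixes and the link between `max_ack_pending` and `max_in_flight` come from the task list; the `ConsumerParams` struct, the `consumer_params` function, the prefix argument, and the default values (30 s, 5 deliveries) are assumptions chosen for illustration.

```rust
use std::time::Duration;

// Illustrative defaults only; the plan does not state the real values.
const DEFAULT_ACK_WAIT_MS: u64 = 30_000;
const DEFAULT_MAX_DELIVER: u32 = 5;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConsumerParams {
    pub ack_wait: Duration,
    pub max_deliver: u32,
    pub max_ack_pending: usize,
}

/// Resolves bounded consumer parameters for one service.
/// `lookup` abstracts the environment so the function is testable without
/// touching process state.
pub fn consumer_params(
    prefix: &str,
    max_in_flight: usize,
    lookup: impl Fn(&str) -> Option<String>,
) -> ConsumerParams {
    // Unparseable or zero overrides fall back to the default instead of
    // producing an unbounded or instantly expiring consumer.
    let ack_ms = lookup(&format!("{prefix}_ACK_TIMEOUT_MS"))
        .and_then(|v| v.parse::<u64>().ok())
        .filter(|ms| *ms > 0)
        .unwrap_or(DEFAULT_ACK_WAIT_MS);
    let max_deliver = lookup(&format!("{prefix}_MAX_DELIVER"))
        .and_then(|v| v.parse::<u32>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(DEFAULT_MAX_DELIVER);
    ConsumerParams {
        ack_wait: Duration::from_millis(ack_ms),
        max_deliver,
        // The server never hands out more unacked messages than workers can hold.
        max_ack_pending: max_in_flight.max(1),
    }
}
```

In a service, the lookup would typically be `|key| std::env::var(key).ok()`; tests can pass a closure over a fixed map instead.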
Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
Dependencies
- Milestone 0 (context contract)
Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
Tasks
- Define protobuf API: `projection.gateway.v1.QueryService`
- Implement Projection gRPC server for query execution
- Implement Gateway gRPC client routing to Projection
  - deadlines
  - bounded retries (idempotent only)
  - context propagation
- Preserve HTTP `/v1/query/*` as compatibility/debug:
  - route internally to gRPC
- Add tests:
  - authz + forwarding via gRPC
  - tenant isolation enforcement in Projection QueryService
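The combination of deadlines and bounded retries for idempotent calls only can be sketched with a small synchronous wrapper. Everything here is hypothetical (`CallKind`, `CallError`, `call_with_deadline`); a real Gateway client would be async and would attach the remaining budget as the gRPC deadline. Jitter is omitted to keep the sketch free of external crates.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallKind {
    ReadOnly,
    Mutation,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CallError<E> {
    DeadlineExceeded,
    Upstream(E),
}

/// Runs `op` under one overall deadline. `op` receives the remaining budget so
/// it can set a per-attempt timeout. Only read-only calls are retried.
pub fn call_with_deadline<T, E>(
    kind: CallKind,
    deadline: Instant,
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut(Duration) -> Result<T, E>,
) -> Result<T, CallError<E>> {
    let attempts = match kind {
        CallKind::ReadOnly => max_attempts.max(1),
        CallKind::Mutation => 1, // never retried
    };
    let mut attempt = 1;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(CallError::DeadlineExceeded);
        }
        match op(remaining) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= attempts => return Err(CallError::Upstream(err)),
            Err(_) => {
                attempt += 1;
                // Never sleep past the deadline.
                sleep(backoff.min(deadline.saturating_duration_since(Instant::now())));
            }
        }
    }
}
```

Because the deadline is absolute, retries consume the same budget as the first attempt, so the edge timeout stays an upper bound regardless of how many attempts are made.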
Required Tests
- Workspace verification commands
Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
Dependencies
- Milestone 0 (context contract)
Goal
Replace Gateway’s /admin/runner/* HTTP proxy usage with a first-class gRPC admin service.
Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
Tasks
- Define protobuf API: `runner.admin.v1.RunnerAdmin`
- Implement Runner gRPC admin server
- Implement Gateway gRPC client integration for admin operations
- Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- Add tests:
  - Gateway: rejects without rights
  - Gateway: rejects tenant spoof attempts
  - Runner: idempotency and drain semantics
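The two Gateway rejection cases can be illustrated with a deny-by-default check that runs before any call is forwarded to Runner. This is a sketch only: `Principal`, `Denied`, `authorize_runner_admin`, and the idea of rights as plain strings are assumptions, not the Gateway's actual authz model.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    Unauthenticated,
    TenantMismatch,
    MissingRight,
}

/// Hypothetical authenticated caller, as established at the edge.
pub struct Principal {
    pub tenant_id: String,
    pub rights: Vec<String>,
}

/// Deny-by-default: the call is allowed only if every check passes.
pub fn authorize_runner_admin(
    principal: Option<&Principal>,
    requested_tenant: &str,
    required_right: &str,
) -> Result<(), Denied> {
    let principal = principal.ok_or(Denied::Unauthenticated)?;
    // Tenant spoof: the tenant named in the request must match the
    // authenticated tenant, not merely be well-formed.
    if principal.tenant_id != requested_tenant {
        return Err(Denied::TenantMismatch);
    }
    if !principal.rights.iter().any(|r| r == required_right) {
        return Err(Denied::MissingRight);
    }
    Ok(())
}
```

Returning a distinct reason per failure keeps rejections deterministic and makes the "rejects without rights" and "rejects tenant spoof attempts" tests assert on exact outcomes.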
Required Tests
- Workspace verification commands
Milestone 6: Gateway Upstream Performance + Operational Guardrails
Dependencies
- Milestones 4–5 (gRPC internal RPC surfaces available)
Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
Tasks
- Implement upstream channel pool
  - bounded LRU
  - TTL/eviction
  - fast-path reuse under load (cached gRPC channels)
- Standardize retry profiles
  - read-only: limited retry with jitter (Gateway gRPC calls)
  - mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- Standardize timeouts/deadlines:
  - edge timeout limits
  - internal per-service deadlines
- Fanout controls:
  - concurrency limiters for probes/snapshots
  - short TTL caching where safe
- Ensure probes carry context (correlation/trace) for observability.
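A bounded LRU + TTL channel pool can be sketched with the standard library alone. `ChannelPool` and its methods are hypothetical, the channel type is left generic, and measuring TTL from creation time (rather than last use) is an assumption. Eviction scans all entries, which is acceptable only because the pool is small and bounded.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: Instant,
}

/// Bounded pool keyed by upstream address.
pub struct ChannelPool<C> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity: capacity.max(1), ttl, entries: HashMap::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns a cached channel, or calls `connect` and caches the result.
    /// `now` is passed in so expiry can be tested without sleeping.
    pub fn get_or_connect(&mut self, addr: &str, now: Instant, connect: impl FnOnce() -> C) -> C {
        // Drop entries older than the TTL.
        let ttl = self.ttl;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.created) < ttl);
        // Fast path: reuse and mark as recently used.
        if let Some(entry) = self.entries.get_mut(addr) {
            entry.last_used = now;
            return entry.channel.clone();
        }
        // At capacity: evict the least recently used entry.
        if self.entries.len() >= self.capacity {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(key) = oldest {
                self.entries.remove(&key);
            }
        }
        let channel = connect();
        self.entries.insert(
            addr.to_string(),
            Entry { channel: channel.clone(), created: now, last_used: now },
        );
        channel
    }
}
```

The pool assumes channels are cheap to clone and multiplex requests, so the Gateway would hold one pool behind a lock and hand out clones on the request path.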
Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
Dependencies
- Milestone 6
Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
Tasks
- Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- Ensure Control UI + Control API rely only on standardized surfaces
- Harden metrics and readiness probes to match the standard contract everywhere
Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)