# NATS Transport Plan

## Purpose

Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:

- Simplicity (few primitives, consistent naming, minimal per-service divergence)
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
- High performance (high throughput, low tail latency, reliable backpressure)
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)

## Non-Negotiable Rules (Global)

- Every JetStream stream/consumer MUST have an explicit contract:
  - name, subjects, retention, storage, replication, max sizes
  - ack policy, ack wait, max deliver, max in flight
- Every node MUST run with bounded work:
  - bounded pull batch sizes
  - bounded concurrency
  - bounded retry/backoff
- Every message MUST be tenant-scoped in subject and/or headers.
- Every milestone below is “stop-the-line” gated:
  - all tasks completed
  - all tests passing
  - workspace lint/format checks passing
  - required NATS-gated integration tests for the milestone passing (when gated by env)

## Current State (Baseline)

- Streams:
  - `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume)
  - `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner)
- Subject conventions:
  - Aggregate events: `tenant..aggregate..`
  - Defaults often use filters like `tenant.*.aggregate.*.*`
- Durable consumers:
  - Projection uses a durable name (configurable)
  - Runner uses configurable durable prefix per role
  - Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
- Headers:
  - Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist

## Target Architecture (End State)

- A single “NATS wire protocol” contract shared across services:
  - subject naming
  - required headers (tenant/correlation/trace)
  - message envelope compatibility rules (tolerant decoding, optional fields)
- Stable, minimal set of JetStream streams:
  - one stream per message class (aggregate events, workflow commands, workflow events)
  - no per-tenant streams unless there is a strong operational reason
- Stable, limited consumers:
  - durable consumers for long-lived processors (Projection, Runner)
  - ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
- Uniform backpressure + reliability defaults:
  - explicit ack
  - bounded `max_ack_pending` and application-level concurrency
  - bounded redelivery via `max_deliver` + poison policy

## Definitions

### Message Context (Headers)

Standard headers for NATS published messages:

- `tenant-id` (required)
- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing)
- `traceparent` (optional but recommended; generated/propagated if present upstream)
- `trace-id` (optional; derived from `traceparent` when possible)
- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable)

### Subject Naming Rules

- Tenant-first prefix: `tenant..…`
- Stable message class token:
  - `aggregate` for domain events
  - `effect`, `effect_result`, `workflow`, `workflow_event` for Runner
- No ambiguous wildcard publishing:
  - producers publish concrete subjects only
  - consumers may filter with wildcards

### Consumer Naming Rules

- Durable consumer names must be stable and collision-free:
  - include role + mode + optional view/saga name + shard/group
- Ephemeral consumer names must be unique per operation:
  - include tenant + purpose + uuid
  - must be deleted best-effort when operation completes

## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)

### Goal

Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.

### Exit Criteria

- `shared` exposes NATS header constants and helpers for inject/extract/derive.
- All producers set required headers consistently.
- All consumers tolerate unknown fields and missing optional fields.
- A single, documented subject naming convention is enforced in code (builder functions).
- Workspace fmt/clippy/tests pass.
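
The header helpers and subject builder named in these exit criteria could look roughly like the sketch below. It uses only the standard library and is not the actual `shared` API: the function names (`derive_trace_id`, `inject_headers`, `build_subject`) and the header map type are hypothetical. The `traceparent` parsing assumes the W3C Trace Context layout (`version-traceid-parentid-flags`). The subject builder takes the trailing tokens as a slice because this plan does not spell out their names.

```rust
use std::collections::BTreeMap;

// Header names taken from the "Message Context (Headers)" list.
pub const HDR_TENANT_ID: &str = "tenant-id";
pub const HDR_X_CORRELATION_ID: &str = "x-correlation-id";
pub const HDR_CORRELATION_ID: &str = "correlation-id";
pub const HDR_TRACEPARENT: &str = "traceparent";
pub const HDR_TRACE_ID: &str = "trace-id";

/// Best-effort: extract the 32-hex-digit trace id from a W3C `traceparent`.
/// Returns `None` for malformed values or the all-zero (invalid) trace id.
pub fn derive_trace_id(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.trim().split('-').collect();
    if parts.len() < 4 {
        return None;
    }
    let is_hex = |s: &str, len: usize| s.len() == len && s.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex(parts[0], 2) || !is_hex(parts[1], 32) || !is_hex(parts[2], 16) || !is_hex(parts[3], 2) {
        return None;
    }
    if parts[1].bytes().all(|b| b == b'0') {
        return None;
    }
    Some(parts[1].to_ascii_lowercase())
}

/// Hypothetical publish-side helper: always sets tenant and both correlation
/// headers; sets trace headers only when a `traceparent` is available.
pub fn inject_headers(
    tenant_id: &str,
    correlation_id: &str,
    traceparent: Option<&str>,
) -> BTreeMap<&'static str, String> {
    let mut headers = BTreeMap::new();
    headers.insert(HDR_TENANT_ID, tenant_id.to_string());
    headers.insert(HDR_X_CORRELATION_ID, correlation_id.to_string());
    headers.insert(HDR_CORRELATION_ID, correlation_id.to_string());
    if let Some(tp) = traceparent {
        headers.insert(HDR_TRACEPARENT, tp.to_string());
        if let Some(trace_id) = derive_trace_id(tp) {
            headers.insert(HDR_TRACE_ID, trace_id);
        }
    }
    headers
}

/// A token is concrete if it is non-empty and has no separators or wildcards.
fn is_concrete_token(token: &str) -> bool {
    !token.is_empty()
        && !token.chars().any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Hypothetical builder: `tenant.<tenant_id>.<class>.<rest...>`, rejecting
/// wildcard or empty tokens so producers only publish concrete subjects.
pub fn build_subject(tenant_id: &str, class: &str, rest: &[&str]) -> Result<String, String> {
    let mut tokens = vec!["tenant", tenant_id, class];
    tokens.extend_from_slice(rest);
    if let Some(bad) = tokens.iter().find(|t| !is_concrete_token(t)) {
        return Err(format!("invalid subject token: {bad:?}"));
    }
    Ok(tokens.join("."))
}
```

Keeping wildcard rejection inside the builder means a producer cannot accidentally publish to a filter-style subject, while consumers remain free to build filters by other means.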

### Tasks

- [ ] Centralize NATS header constants and helpers in `shared`:
  - [ ] inject headers for publish (tenant, correlation, trace)
  - [ ] extract headers on receive (best-effort)
  - [ ] derive `trace-id` from `traceparent`
- [ ] Aggregate:
  - [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers
  - [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test)
- [ ] Projection:
  - [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
  - [ ] Ensure correlation/trace context is carried into spans/metrics consistently
- [ ] Runner:
  - [ ] Ensure publish paths include correlation/trace headers consistently for commands and results
  - [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested
- [ ] Tests:
  - [ ] Unit tests for header injection/extraction in `shared`
  - [ ] Per-service unit tests asserting produced headers include required keys

### Required Tests

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`

## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)

### Goal

Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.

### Exit Criteria

- Stream config for each stream is explicitly defined and validated at startup.
- Limits (max messages/bytes/age) are explicit and have defaults.
- Duplicate windows and dedupe behavior are explicit and tested.
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).
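
One way to make the "create if missing; do not silently replace" policy testable without a NATS server is to keep the comparison logic in a pure function. The sketch below is illustrative only: `StreamPolicy`, `StartupAction`, and `reconcile` are hypothetical names, the struct is a simplified stand-in for a real JetStream stream config, and it assumes zero means "unbounded" for the limit fields. Subjects are compared by exact string match, so an existing stream with extra subjects (a compatible superset) is accepted.

```rust
use std::time::Duration;

/// Simplified, hypothetical view of the stream settings this plan cares about.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamPolicy {
    pub name: String,
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub max_bytes: u64,
    pub duplicate_window: Duration,
}

#[derive(Debug, PartialEq)]
pub enum StartupAction {
    /// The stream does not exist yet and may be created from the policy.
    CreateMissing,
    /// The stream exists and is compatible; leave it untouched.
    KeepExisting,
}

/// Decide what startup should do. Never returns "replace": an incompatible
/// existing stream is reported as a list of problems for an operator to fix.
pub fn reconcile(
    required: &StreamPolicy,
    existing: Option<&StreamPolicy>,
) -> Result<StartupAction, Vec<String>> {
    let Some(actual) = existing else {
        return Ok(StartupAction::CreateMissing);
    };
    let mut problems = Vec::new();
    for subject in &required.subjects {
        if !actual.subjects.contains(subject) {
            problems.push(format!("stream {} is missing subject {}", required.name, subject));
        }
    }
    if actual.duplicate_window < required.duplicate_window {
        problems.push(format!(
            "stream {} duplicate window {:?} is shorter than required {:?}",
            required.name, actual.duplicate_window, required.duplicate_window
        ));
    }
    // Assumption of this sketch: zero means "no limit configured".
    if actual.max_age.is_zero() || actual.max_bytes == 0 {
        problems.push(format!("stream {} has an unbounded age or size limit", required.name));
    }
    if problems.is_empty() {
        Ok(StartupAction::KeepExisting)
    } else {
        Err(problems)
    }
}
```

Returning every problem at once, rather than failing on the first, gives the operator a complete picture in a single startup log line.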

### Tasks

- [ ] Define a single “stream config policy” module per service (or shared helper):
  - [ ] `AGGREGATE_EVENTS` subjects + retention policy
  - [ ] `WORKFLOW_COMMANDS` subjects + retention policy
  - [ ] `WORKFLOW_EVENTS` subjects + retention policy
- [ ] Standardize defaults:
  - [ ] retention: limits appropriate for replay + rebuild
  - [ ] `duplicate_window` aligned with producer idempotency strategy
  - [ ] storage type and replication policy documented and configurable
- [ ] Add startup validations:
  - [ ] verify stream exists and matches required subject set (compatible superset allowed)
  - [ ] verify required ack/dedupe assumptions hold
- [ ] Add tests that parse and validate configs without NATS.

### Required Tests

- Unit tests for stream config builders
- Existing crate tests

## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)

### Goal

Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.

### Exit Criteria

- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`.
- Application concurrency is bounded and tied to `max_in_flight`.
- Poison policy is consistent:
  - after `max_deliver`, term + deadletter/quarantine record is written
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
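
The ack, retry, and poison rules in these exit criteria can be captured as a small decision function that is independent of any NATS client. The sketch below is an assumption-laden illustration: the type and function names are hypothetical, and the default values (30 s ack wait, 5 deliveries) are placeholders, not values chosen by this plan. It shows `max_ack_pending` derived from application concurrency, and that a message skipped because it is at or below the checkpoint is still acknowledged.

```rust
use std::time::Duration;

/// Hypothetical defaults; the numeric values here are placeholders only.
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: u64,
    pub max_ack_pending: usize,
}

impl ConsumerDefaults {
    /// Tie the server-side in-flight bound to the application's concurrency.
    pub fn for_concurrency(concurrency: usize) -> Self {
        Self {
            ack_wait: Duration::from_secs(30),
            max_deliver: 5,
            max_ack_pending: concurrency.max(1),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Applied,
    SkippedByCheckpoint,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Disposition {
    /// Acknowledge: the message needs no further delivery.
    Ack,
    /// Negative-acknowledge (or let ack wait expire) so it is redelivered.
    Retry,
    /// Terminate delivery and write a deadletter/quarantine record.
    TermAndQuarantine,
}

/// A message is already covered by the checkpoint if its sequence is not newer.
pub fn is_already_applied(stream_seq: u64, checkpoint: u64) -> bool {
    stream_seq <= checkpoint
}

/// `delivery_count` starts at 1 for the first delivery attempt.
pub fn disposition(outcome: Outcome, delivery_count: u64, max_deliver: u64) -> Disposition {
    match outcome {
        // Skipped messages are acked too, otherwise they would be redelivered forever.
        Outcome::Applied | Outcome::SkippedByCheckpoint => Disposition::Ack,
        Outcome::Failed if delivery_count >= max_deliver => Disposition::TermAndQuarantine,
        Outcome::Failed => Disposition::Retry,
    }
}
```

Because the decision is a pure function of outcome and delivery count, the poison path can be unit tested without the NATS-gated integration suite.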

### Tasks

- [ ] Define standard consumer config defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to application concurrency
- [ ] Projection:
  - [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
  - [ ] Ensure checkpoint gates ack correctly (skip still acks)
  - [ ] Ensure poison policy writes durable records and terminates reliably
- [ ] Runner:
  - [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
  - [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
- [ ] Aggregate:
  - [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
  - [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers
- [ ] Tests:
  - [ ] Unit tests for consumer name generation (sanitization + uniqueness)
  - [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)

### Required Tests

- Workspace fmt/clippy/tests
- NATS-gated integration tests for:
  - redelivery idempotency
  - poison termination behavior
  - scale-out with deliver group (where supported)

## Milestone 3: Connection Management + Failure Semantics (Operational Frugality)

### Goal

Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.

### Exit Criteria

- One NATS connection per process (or bounded pool only if justified).
- Reconnect/backoff policy is explicit and consistent.
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
- No busy-looping on NATS outages.
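
An explicit reconnect policy might be expressed as capped exponential backoff with an optional attempt limit, as sketched below. All names (`ReconnectPolicy`, `next_delay`, `is_ready`) are hypothetical and the sketch omits jitter for brevity. A `max_attempts` of `None` models the "infinite with backoff" strategy, and a small `Some(n)` models a fast-fail mode for tests and dev. The policy assumes `base` is non-zero; a zero base would defeat the "no busy-looping" criterion.

```rust
use std::time::Duration;

/// Hypothetical reconnect policy: delay doubles per attempt up to `cap`.
#[derive(Debug, Clone)]
pub struct ReconnectPolicy {
    pub base: Duration,
    pub cap: Duration,
    /// `None` retries forever (with backoff); `Some(n)` gives up after n attempts.
    pub max_attempts: Option<u32>,
}

impl ReconnectPolicy {
    /// Delay before reconnect attempt number `attempt` (0-based), or `None`
    /// when the attempt budget is exhausted and the caller should fail fast.
    pub fn next_delay(&self, attempt: u32) -> Option<Duration> {
        if let Some(max) = self.max_attempts {
            if attempt >= max {
                return None;
            }
        }
        // 2^attempt, saturating instead of overflowing for large attempt counts.
        let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
        let delay = self.base.checked_mul(factor).unwrap_or(self.cap);
        Some(delay.min(self.cap))
    }
}

/// Readiness as described in the tasks: a node that depends on NATS is not
/// ready while NATS is unavailable; liveness is reported separately.
pub fn is_ready(depends_on_nats: bool, nats_connected: bool) -> bool {
    !depends_on_nats || nats_connected
}
```

A caller would sleep for the returned delay between attempts and surface `None` as a startup error, which keeps the loop bounded in fast-fail mode.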

### Tasks

- [ ] Standardize connection options:
  - [ ] reconnect delays/backoff
  - [ ] max reconnect attempts or “infinite with backoff” strategy (explicit)
  - [ ] request timeouts around JetStream operations
- [ ] Standardize readiness semantics:
  - [ ] `ready=false` when NATS is unavailable and the node depends on it
  - [ ] `health` stays “process alive” but reports NATS connectivity in payload
- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
- [ ] Tests:
  - [ ] unit tests for backoff behavior (where possible)
  - [ ] gated integration test: temporary NATS outage does not crash-loop and recovers

## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)

### Goal

Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.

### Exit Criteria

- Durable names are deterministic and collision-free across replicas.
- Deliver groups are used where appropriate to share work across replicas.
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
- A scale-out test suite exists and is gated but runnable.
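
Deterministic durable names need sanitization that does not map two different inputs to the same output. The sketch below is one possible approach, not the scheme used by Projection or Runner: the function names and the `view_`/`shard_` prefixes are hypothetical. Purely alphanumeric tokens pass through unchanged. Any other token has its unsafe characters replaced and gains a suffix from an FNV-1a hash of the original text, so `a.b` and `a_b` remain distinct. The hash makes collisions very unlikely rather than impossible.

```rust
/// 64-bit FNV-1a, used here only because it is stable across builds and
/// needs no external crate. Not a cryptographic hash.
fn fnv1a64(input: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in input.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Alphanumeric tokens are returned unchanged. Anything else is cleaned and
/// suffixed with a hash of the original, so distinct inputs stay distinct.
pub fn sanitize_token(raw: &str) -> String {
    let cleaned: String = raw
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    if !raw.is_empty() && cleaned == raw {
        cleaned
    } else {
        format!("{cleaned}_{:016x}", fnv1a64(raw))
    }
}

/// Hypothetical durable name layout: `role-mode[-view_<name>][-shard_<n>]`.
/// Sanitized tokens never contain `-`, so the separator is unambiguous.
pub fn durable_name(role: &str, mode: &str, view: Option<&str>, shard: Option<u32>) -> String {
    let mut parts = vec![sanitize_token(role), sanitize_token(mode)];
    if let Some(view) = view {
        parts.push(format!("view_{}", sanitize_token(view)));
    }
    if let Some(shard) = shard {
        parts.push(format!("shard_{shard}"));
    }
    parts.join("-")
}
```

Every replica computes the same name from the same configuration, which lets replicas share one durable consumer instead of each creating its own.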

### Tasks

- [ ] Establish consumer naming scheme per service role:
  - [ ] Projection: per-view durable option uses sanitized names and stable mapping
  - [ ] Runner: durable prefix includes role + shard + optional group
- [ ] Establish deliver group usage rules:
  - [ ] when to enable (scale-out consumers)
  - [ ] how to roll without duplication
- [ ] Strengthen dedupe keys:
  - [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
  - [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id`
- [ ] Add gated tests:
  - [ ] two replicas, same tenant, no duplicate publishes
  - [ ] rolling restart preserves checkpoint correctness

## Verification Commands (Required at Each Milestone)

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- Gated NATS integration tests:
  - Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
  - Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
  - Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests

## Notes / Constraints

- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
- Prefer stable durable consumers; ephemeral consumers must be unique and bounded, and must be cleaned up best-effort.