NATS Transport Plan
Purpose
Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:
- Simplicity (few primitives, consistent naming, minimal per-service divergence)
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
- High performance (high throughput, low tail latency, reliable backpressure)
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)
Non-Negotiable Rules (Global)
- Every JetStream stream/consumer MUST have an explicit contract:
- name, subjects, retention, storage, replication, max sizes
- ack policy, ack wait, max deliver, max in flight
- Every node MUST run with bounded work:
- bounded pull batch sizes
- bounded concurrency
- bounded retry/backoff
- Every message MUST be tenant-scoped in subject and/or headers.
- Every milestone below is “stop-the-line” gated:
- all tasks completed
- all tests passing
- workspace lint/format checks passing
- required NATS-gated integration tests for the milestone passing (when gated by env)
Current State (Baseline)
- Streams:
- AGGREGATE_EVENTS (Aggregate publishes; Projection/Runner consume)
- WORKFLOW_COMMANDS, WORKFLOW_EVENTS (Runner)
- Subject conventions:
- Aggregate events: tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>
- Defaults often use filters like tenant.*.aggregate.*.*
- Durable consumers:
- Projection uses a durable name (configurable)
- Runner uses configurable durable prefix per role
- Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
- Headers:
- Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist
Target Architecture (End State)
- A single “NATS wire protocol” contract shared across services:
- subject naming
- required headers (tenant/correlation/trace)
- message envelope compatibility rules (tolerant decoding, optional fields)
- Stable, minimal set of JetStream streams:
- one stream per message class (aggregate events, workflow commands, workflow events)
- no per-tenant streams unless there is a strong operational reason
- Stable, limited consumers:
- durable consumers for long-lived processors (Projection, Runner)
- ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
- Uniform backpressure + reliability defaults:
- explicit ack
- bounded max_ack_pending and application-level concurrency
- bounded redelivery via max_deliver + poison policy
Definitions
Message Context (Headers)
Standard headers for NATS published messages:
- tenant-id (required)
- x-correlation-id and correlation-id (required for any request-derived message; generated if missing)
- traceparent (optional but recommended; generated/propagated if present upstream)
- trace-id (optional; derived from traceparent when possible)
- Nats-Msg-Id (required for idempotent publish/dedupe when applicable)
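The header set above could be injected by a small helper. The following is a minimal sketch, not the shared crate's actual API: the Headers alias is a hypothetical stand-in for a NATS header map, inject_headers and derive_trace_id are illustrative names, and the traceparent parsing assumes the W3C layout of version, trace-id, parent-id, and flags separated by hyphens. Only the trace-id field is validated here.

```rust
use std::collections::BTreeMap;

// Header names taken from the list above.
pub const TENANT_ID: &str = "tenant-id";
pub const X_CORRELATION_ID: &str = "x-correlation-id";
pub const CORRELATION_ID: &str = "correlation-id";
pub const TRACEPARENT: &str = "traceparent";
pub const TRACE_ID: &str = "trace-id";

/// Hypothetical stand-in for a NATS header map.
pub type Headers = BTreeMap<String, String>;

/// Extracts the trace-id field from a traceparent value
/// ("<version>-<trace-id>-<parent-id>-<flags>").
/// Returns None when the field is not 32 hex digits or is all zeros.
pub fn derive_trace_id(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.split('-');
    let _version = parts.next()?;
    let trace_id = parts.next()?;
    let valid = trace_id.len() == 32
        && trace_id.bytes().all(|b| b.is_ascii_hexdigit())
        && trace_id.bytes().any(|b| b != b'0');
    if valid {
        Some(trace_id)
    } else {
        None
    }
}

/// Builds the headers for one published message.
/// Both correlation header names carry the same value.
pub fn inject_headers(
    tenant_id: &str,
    correlation_id: &str,
    traceparent: Option<&str>,
) -> Headers {
    let mut headers = Headers::new();
    headers.insert(TENANT_ID.to_string(), tenant_id.to_string());
    headers.insert(X_CORRELATION_ID.to_string(), correlation_id.to_string());
    headers.insert(CORRELATION_ID.to_string(), correlation_id.to_string());
    if let Some(tp) = traceparent {
        headers.insert(TRACEPARENT.to_string(), tp.to_string());
        // trace-id is optional: set it only when it can be derived.
        if let Some(id) = derive_trace_id(tp) {
            headers.insert(TRACE_ID.to_string(), id.to_string());
        }
    }
    headers
}
```

Nats-Msg-Id is left out of this sketch on purpose, since its value depends on the producer's idempotency strategy.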
Subject Naming Rules
- Tenant-first prefix: tenant.<tenant_id>.…
- Stable message class token:
- aggregate for domain events
- effect, effect_result, workflow, workflow_event for Runner
- No ambiguous wildcard publishing:
- producers publish concrete subjects only
- consumers may filter with wildcards
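One way to enforce these rules is to route all subject construction through builder functions that reject wildcard or empty tokens. This is a minimal sketch under assumptions: the function names are hypothetical, and the token check (no dots, wildcards, or whitespace) is one plausible reading of "concrete subjects only".

```rust
/// Returns true when a token is safe to place in a concrete subject:
/// non-empty and free of '.', the wildcards '*' and '>', and whitespace.
fn is_concrete_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Builds the concrete publish subject for an aggregate event:
/// tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>
pub fn aggregate_subject(
    tenant_id: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, String> {
    for token in [tenant_id, aggregate_type, aggregate_id] {
        if !is_concrete_token(token) {
            return Err(format!("invalid subject token: {token:?}"));
        }
    }
    Ok(format!(
        "tenant.{tenant_id}.aggregate.{aggregate_type}.{aggregate_id}"
    ))
}

/// Consumer-side filter covering every tenant and aggregate.
/// Only consumers use wildcards; producers go through aggregate_subject.
pub fn aggregate_filter_all() -> &'static str {
    "tenant.*.aggregate.*.*"
}
```

For example, aggregate_subject("t1", "order", "42") yields tenant.t1.aggregate.order.42, while a tenant id such as "t*" is rejected before anything is published.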
Consumer Naming Rules
- Durable consumer names must be stable and collision-free:
- include role + mode + optional view/saga name + shard/group
- Ephemeral consumer names must be unique per operation:
- include tenant + purpose + uuid
- must be deleted best-effort when operation completes
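The naming rules above might be implemented along these lines. This is a sketch only: the name layout, the "eph" prefix, and the shard suffix are assumptions, and the unique id is passed in by the caller (for example a UUID string) so the sketch needs no external crate.

```rust
/// Replaces every character other than ASCII alphanumerics, '-' and '_'
/// with '_', so the result contains no dots, wildcards, or whitespace.
pub fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

/// Durable name: role + mode + optional view/saga name + shard.
/// Deterministic, so every replica of the same role derives the same name.
pub fn durable_name(role: &str, mode: &str, view: Option<&str>, shard: u32) -> String {
    let mut parts = vec![sanitize(role), sanitize(mode)];
    if let Some(v) = view {
        parts.push(sanitize(v));
    }
    parts.push(format!("s{shard}"));
    parts.join("-")
}

/// Ephemeral name: tenant + purpose + caller-supplied unique id.
pub fn ephemeral_name(tenant_id: &str, purpose: &str, unique_id: &str) -> String {
    format!(
        "eph-{}-{}-{}",
        sanitize(tenant_id),
        sanitize(purpose),
        sanitize(unique_id)
    )
}
```

Sanitization is lossy: "a.b" and "a_b" map to the same output, so a real implementation would need an extra guard (such as a short hash suffix) if such inputs can occur. The fixed "eph-" prefix also gives cleanup code a cheap way to refuse to delete anything that is not an ephemeral consumer.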
Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)
Goal
Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.
Exit Criteria
- shared exposes NATS header constants and helpers for inject/extract/derive.
- All producers set required headers consistently.
- All consumers tolerate unknown fields and missing optional fields.
- A single, documented subject naming convention is enforced in code (builder functions).
- Workspace fmt/clippy/tests pass.
Tasks
- Centralize NATS header constants and helpers in shared:
- inject headers for publish (tenant, correlation, trace)
- extract headers on receive (best-effort)
- derive trace-id from traceparent
- Aggregate:
- Ensure event publishing always sets tenant-id, correlation headers, trace headers
- Ensure Nats-Msg-Id strategy is correct for idempotency/dedupe (document and test)
- Projection:
- Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
- Ensure correlation/trace context is carried into spans/metrics consistently
- Runner:
- Ensure publish paths include correlation/trace headers consistently for commands and results
- Ensure outbox metadata → NATS headers mapping is consistent and tested
- Tests:
- Unit tests for header injection/extraction in shared
- Per-service unit tests asserting produced headers include required keys
Required Tests
- cargo fmt --check
- cargo clippy --workspace --all-targets -- -D warnings
- cargo test --workspace
Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)
Goal
Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.
Exit Criteria
- Stream config for each stream is explicitly defined and validated at startup.
- Limits (max messages/bytes/age) are explicit and have defaults.
- Duplicate windows and dedupe behavior are explicit and tested.
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).
Tasks
- Define a single “stream config policy” module per service (or shared helper):
- AGGREGATE_EVENTS subjects + retention policy
- WORKFLOW_COMMANDS subjects + retention policy
- WORKFLOW_EVENTS subjects + retention policy
- Standardize defaults:
- retention: limits appropriate for replay + rebuild
- duplicate_window aligned with producer idempotency strategy
- storage type and replication policy documented and configurable
- Add startup validations:
- verify stream exists and matches required subject set (compatible superset allowed)
- verify required ack/dedupe assumptions hold
- Add tests that parse and validate configs without NATS.
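A stream config policy that can be validated without a NATS server could look roughly like the sketch below. The struct, the function names, and every numeric default are placeholders rather than agreed values, and the subject comparison is a plain string match that does not reason about wildcard coverage.

```rust
use std::time::Duration;

/// Declarative policy for one stream. All values are illustrative.
#[derive(Debug, Clone)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub subjects: Vec<String>,
    pub max_messages: u64,
    pub max_bytes: u64,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

/// Placeholder defaults for AGGREGATE_EVENTS.
pub fn aggregate_events_policy() -> StreamPolicy {
    StreamPolicy {
        name: "AGGREGATE_EVENTS",
        subjects: vec!["tenant.*.aggregate.*.*".to_string()],
        max_messages: 10_000_000,
        max_bytes: 8 * 1024 * 1024 * 1024,
        max_age: Duration::from_secs(30 * 24 * 60 * 60),
        duplicate_window: Duration::from_secs(120),
    }
}

/// Checks the policy itself: every limit explicit and non-zero, and the
/// dedupe window no longer than the retention age.
pub fn validate_policy(p: &StreamPolicy) -> Result<(), String> {
    if p.subjects.is_empty() {
        return Err(format!("{}: no subjects", p.name));
    }
    if p.max_messages == 0 || p.max_bytes == 0 || p.max_age.is_zero() {
        return Err(format!("{}: limits must be explicit and non-zero", p.name));
    }
    if p.duplicate_window > p.max_age {
        return Err(format!("{}: duplicate_window exceeds max_age", p.name));
    }
    Ok(())
}

/// Compares the policy with an existing stream. A superset of subjects is
/// accepted; nothing is modified, mismatches are only reported.
pub fn check_existing(
    p: &StreamPolicy,
    existing_subjects: &[String],
    existing_duplicate_window: Duration,
) -> Result<(), String> {
    for subject in &p.subjects {
        if !existing_subjects.contains(subject) {
            return Err(format!("{}: missing subject {subject}", p.name));
        }
    }
    if existing_duplicate_window < p.duplicate_window {
        return Err(format!("{}: duplicate_window too short", p.name));
    }
    Ok(())
}
```

Returning an error instead of updating the stream keeps the "no destructive changes on startup" policy: the service refuses to start and an operator decides how to reconcile.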
Required Tests
- Unit tests for stream config builders
- Existing crate tests
Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)
Goal
Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.
Exit Criteria
- All long-lived consumers use explicit ack with consistent ack_wait, max_deliver, max_ack_pending.
- Application concurrency is bounded and tied to max_in_flight.
- Poison policy is consistent:
- after max_deliver, term + deadletter/quarantine record is written
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
Tasks
- Define standard consumer config defaults:
- AckPolicy::Explicit
- ack_wait default + env override
- max_deliver default + env override
- max_ack_pending tied to application concurrency
- Projection:
- Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
- Ensure checkpoint gates ack correctly (skip still acks)
- Ensure poison policy writes durable records and terminates reliably
- Runner:
- Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
- Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
- Aggregate:
- Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
- Ensure best-effort cleanup is performed and cannot delete unrelated consumers
- Tests:
- Unit tests for consumer name generation (sanitization + uniqueness)
- NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
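The consumer defaults and poison rule from this milestone can be sketched as a small pure module. Assumptions are labeled in the code: the environment variable names, the default values (30 s ack wait, 5 deliveries), and the Disposition enum are all hypothetical. The delivery count is assumed to start at 1 for the first delivery.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerPolicy {
    pub ack_wait: Duration,
    pub max_deliver: u32,
    pub max_ack_pending: usize,
}

/// Parses an optional override; falls back to the default when the value
/// is missing or does not parse.
fn parse_or<T: std::str::FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

/// Builds the policy from raw override strings. max_ack_pending is tied
/// to application concurrency so the server never hands out more
/// unacknowledged messages than there are workers.
pub fn policy_from(
    ack_wait_secs: Option<&str>,
    max_deliver: Option<&str>,
    concurrency: usize,
) -> ConsumerPolicy {
    ConsumerPolicy {
        ack_wait: Duration::from_secs(parse_or(ack_wait_secs, 30u64)),
        max_deliver: parse_or(max_deliver, 5u32).max(1),
        max_ack_pending: concurrency.max(1),
    }
}

/// Reads the overrides from hypothetical environment variables.
pub fn policy_from_env(concurrency: usize) -> ConsumerPolicy {
    let ack_wait = std::env::var("CONSUMER_ACK_WAIT_SECS").ok();
    let max_deliver = std::env::var("CONSUMER_MAX_DELIVER").ok();
    policy_from(ack_wait.as_deref(), max_deliver.as_deref(), concurrency)
}

/// What the consumer does with a message after trying to handle it.
#[derive(Debug, PartialEq)]
pub enum Disposition {
    Ack,
    Retry,
    /// Write the quarantine record first, then terminate the message.
    TermAndQuarantine,
}

pub fn disposition(handled_ok: bool, delivery_count: u32, policy: &ConsumerPolicy) -> Disposition {
    if handled_ok {
        Disposition::Ack
    } else if delivery_count >= policy.max_deliver {
        Disposition::TermAndQuarantine
    } else {
        Disposition::Retry
    }
}
```

Keeping the decision in a pure function lets the poison rule be unit tested without NATS, leaving the gated integration tests to cover only the actual ack, nak, and term calls.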
Required Tests
- Workspace fmt/clippy/tests
- NATS-gated integration tests for:
- redelivery idempotency
- poison termination behavior
- scale-out with deliver group (where supported)
Milestone 3: Connection Management + Failure Semantics (Operational Frugality)
Goal
Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.
Exit Criteria
- One NATS connection per process (or bounded pool only if justified).
- Reconnect/backoff policy is explicit and consistent.
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
- No busy-looping on NATS outages.
Tasks
- Standardize connection options:
- reconnect delays/backoff
- max reconnect attempts or “infinite with backoff” strategy (explicit)
- request timeouts around JetStream operations
- Standardize readiness semantics:
- ready=false when NATS is unavailable and the node depends on it
- health stays “process alive” but reports NATS connectivity in payload
- Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
- Tests:
- unit tests for backoff behavior (where possible)
- gated integration test: temporary NATS outage does not crash-loop and recovers
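The reconnect and readiness tasks above reduce to two small pure functions, sketched here under assumptions: capped exponential backoff is one possible "explicit" strategy rather than a decided one, the function and struct names are hypothetical, and jitter is omitted for brevity.

```rust
use std::time::Duration;

/// Capped exponential backoff: base * 2^attempt, never above cap.
/// Saturating arithmetic keeps large attempt numbers from overflowing.
pub fn reconnect_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Probe results for the health and ready endpoints.
#[derive(Debug, PartialEq)]
pub struct Probe {
    /// Liveness: the process is running, regardless of NATS state.
    pub alive: bool,
    /// Readiness: false when the node depends on NATS and NATS is down.
    pub ready: bool,
    /// Reported in the health payload for debugging.
    pub nats_connected: bool,
}

pub fn probe(nats_connected: bool, depends_on_nats: bool) -> Probe {
    Probe {
        alive: true,
        ready: !depends_on_nats || nats_connected,
        nats_connected,
    }
}
```

With a 250 ms base and a 30 s cap, the delay grows 250 ms, 500 ms, 1 s, 2 s and so on until it reaches the cap. Sleeping for that delay between attempts is what prevents busy-looping during an outage.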
Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)
Goal
Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.
Exit Criteria
- Durable names are deterministic and collision-free across replicas.
- Deliver groups are used where appropriate to share work across replicas.
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
- A scale-out test suite exists and is gated but runnable.
Tasks
- Establish consumer naming scheme per service role:
- Projection: per-view durable option uses sanitized names and stable mapping
- Runner: durable prefix includes role + shard + optional group
- Establish deliver group usage rules:
- when to enable (scale-out consumers)
- how to roll without duplication
- Strengthen dedupe keys:
- event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
- outbox relay: verify publish idempotency with Nats-Msg-Id
- Add gated tests:
- two replicas, same tenant, no duplicate publishes
- rolling restart preserves checkpoint correctness
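To illustrate why a deterministic Nats-Msg-Id matters for the two-replica case, the sketch below pairs a hypothetical id scheme with a toy in-memory model of a dedupe window. The model is an illustration for reasoning and unit tests, not JetStream's implementation, and the id format (tenant, a literal "outbox" token, row id) is an assumption.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Derives the message id from stable outbox data only, so any replica
/// relaying the same row produces the same id.
pub fn outbox_msg_id(tenant_id: &str, outbox_row_id: u64) -> String {
    format!("{tenant_id}:outbox:{outbox_row_id}")
}

/// Toy model of a dedupe window keyed by message id.
/// Time is passed in as elapsed time since an arbitrary start.
pub struct DedupeWindow {
    window: Duration,
    seen: HashMap<String, Duration>,
}

impl DedupeWindow {
    pub fn new(window: Duration) -> Self {
        DedupeWindow {
            window,
            seen: HashMap::new(),
        }
    }

    /// Returns true when the message is stored, false when it is dropped
    /// as a duplicate of an id first seen within the window.
    pub fn accept(&mut self, msg_id: &str, now: Duration) -> bool {
        let window = self.window;
        // Forget ids whose first sighting has aged out of the window.
        self.seen
            .retain(|_, first_seen| now.saturating_sub(*first_seen) < window);
        if self.seen.contains_key(msg_id) {
            return false;
        }
        self.seen.insert(msg_id.to_string(), now);
        true
    }
}
```

In this model a second replica publishing the same outbox row 60 seconds later is dropped, but the same publish after the window has elapsed is stored again. Publish-side dedupe therefore only bounds duplicates within duplicate_window; consumers still need their own checkpoint or dedupe marker for anything older.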
Verification Commands (Required at Each Milestone)
- cargo fmt --check
- cargo clippy --workspace --all-targets -- -D warnings
- cargo test --workspace
- Gated NATS integration tests:
- Runner: RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored
- Projection: PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored
- Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests
Notes / Constraints
- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
- Prefer stable durable consumers; ephemeral consumers must be unique, bounded, and cleaned up best-effort.