cloudlysis/NATS_TRANSPORT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00

NATS Transport Plan

Purpose

Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:

  • Simplicity (few primitives, consistent naming, minimal per-service divergence)
  • Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
  • Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
  • Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
  • High performance (high throughput, low tail latency, reliable backpressure)
  • Safety (tenant isolation, idempotency, deterministic replay, poison handling)

Non-Negotiable Rules (Global)

  • Every JetStream stream/consumer MUST have an explicit contract:
    • name, subjects, retention, storage, replication, max sizes
    • ack policy, ack wait, max deliver, max in flight
  • Every node MUST run with bounded work:
    • bounded pull batch sizes
    • bounded concurrency
    • bounded retry/backoff
  • Every message MUST be tenant-scoped in subject and/or headers.
  • Every milestone below is “stop-the-line” gated:
    • all tasks completed
    • all tests passing
    • workspace lint/format checks passing
    • required NATS-gated integration tests for the milestone passing (when gated by env)

Current State (Baseline)

  • Streams:
    • AGGREGATE_EVENTS (Aggregate publishes, Projection/Runner consume)
    • WORKFLOW_COMMANDS, WORKFLOW_EVENTS (Runner)
  • Subject conventions:
    • Aggregate events: tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>
    • Defaults often use filters like tenant.*.aggregate.*.*
  • Durable consumers:
    • Projection uses a durable name (configurable)
    • Runner uses configurable durable prefix per role
    • Aggregate's ad-hoc fetch consumers were previously a risk; this is now mitigated by using a unique consumer name per fetch
  • Headers:
    • Tenant, correlation, and trace headers exist but were historically set inconsistently; shared utilities now exist
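
To make the default filter under the subject conventions above concrete, here is a minimal sketch of NATS wildcard matching (`*` matches exactly one token, `>` matches one or more trailing tokens). The function name `subject_matches` is hypothetical and is not taken from the codebase; the server performs this matching itself, so a helper like this would mainly be useful in unit tests that run without NATS.

```rust
/// Returns true if `subject` matches `filter` under NATS wildcard rules.
/// `*` matches exactly one token; `>` matches one or more trailing tokens.
/// Hypothetical helper for tests; the NATS server does the real matching.
pub fn subject_matches(filter: &str, subject: &str) -> bool {
    let mut f = filter.split('.');
    let mut s = subject.split('.');
    loop {
        match (f.next(), s.next()) {
            // `>` is only valid as the last filter token and needs
            // at least one remaining subject token.
            (Some(">"), Some(_)) => return f.next().is_none(),
            // `*` consumes exactly one non-empty token.
            (Some("*"), Some(tok)) => {
                if tok.is_empty() {
                    return false;
                }
            }
            // Literal tokens must be equal.
            (Some(ft), Some(st)) => {
                if ft != st {
                    return false;
                }
            }
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // One side ran out before the other.
            _ => return false,
        }
    }
}
```

With this sketch, `tenant.*.aggregate.*.*` matches `tenant.t1.aggregate.order.42` but not a subject with fewer or more tokens.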

Target Architecture (End State)

  • A single “NATS wire protocol” contract shared across services:
    • subject naming
    • required headers (tenant/correlation/trace)
    • message envelope compatibility rules (tolerant decoding, optional fields)
  • Stable, minimal set of JetStream streams:
    • one stream per message class (aggregate events, workflow commands, workflow events)
    • no per-tenant streams unless there is a strong operational reason
  • Stable, limited consumers:
    • durable consumers for long-lived processors (Projection, Runner)
    • ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
  • Uniform backpressure + reliability defaults:
    • explicit ack
    • bounded max_ack_pending and application-level concurrency
    • bounded redelivery via max_deliver + poison policy

Definitions

Message Context (Headers)

Standard headers for NATS published messages:

  • tenant-id (required)
  • x-correlation-id and correlation-id (required for any request-derived message; generated if missing)
  • traceparent (optional but recommended; propagated when present upstream, generated otherwise)
  • trace-id (optional; derived from traceparent when possible)
  • Nats-Msg-Id (required for idempotent publish/dedupe when applicable)
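
The header rules above could be enforced by a single builder along the following lines. This is a sketch under stated assumptions: the `MessageContext` struct, the `build_headers` function, and the constant names are hypothetical, and a plain `BTreeMap` stands in for the NATS client's header type. Derivation of trace-id is left out here.

```rust
use std::collections::BTreeMap;

// Header names as listed in this plan.
pub const TENANT_ID: &str = "tenant-id";
pub const X_CORRELATION_ID: &str = "x-correlation-id";
pub const CORRELATION_ID: &str = "correlation-id";
pub const TRACEPARENT: &str = "traceparent";
pub const NATS_MSG_ID: &str = "Nats-Msg-Id";

/// Hypothetical publish context; field names are assumptions.
pub struct MessageContext {
    pub tenant_id: String,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub msg_id: Option<String>,
}

/// Builds the header map for a publish.
/// `new_correlation_id` is only called when the context carries none.
pub fn build_headers(
    ctx: &MessageContext,
    new_correlation_id: impl FnOnce() -> String,
) -> Result<BTreeMap<String, String>, String> {
    // tenant-id is required: refuse to publish without it.
    if ctx.tenant_id.trim().is_empty() {
        return Err("tenant-id is required".to_string());
    }
    let mut headers = BTreeMap::new();
    headers.insert(TENANT_ID.to_string(), ctx.tenant_id.clone());

    // Both correlation headers carry the same value.
    let correlation = ctx
        .correlation_id
        .clone()
        .unwrap_or_else(new_correlation_id);
    headers.insert(X_CORRELATION_ID.to_string(), correlation.clone());
    headers.insert(CORRELATION_ID.to_string(), correlation);

    // Optional headers are only set when a value is available.
    if let Some(tp) = &ctx.traceparent {
        headers.insert(TRACEPARENT.to_string(), tp.clone());
    }
    if let Some(id) = &ctx.msg_id {
        headers.insert(NATS_MSG_ID.to_string(), id.clone());
    }
    Ok(headers)
}
```

Passing the ID generator as a closure keeps the sketch free of external crates; in practice a UUID generator would likely be supplied.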

Subject Naming Rules

  • Tenant-first prefix: tenant.<tenant_id>.…
  • Stable message class token:
    • aggregate for domain events
    • effect, effect_result, workflow, workflow_event for Runner
  • No ambiguous wildcard publishing:
    • producers publish concrete subjects only
    • consumers may filter with wildcards
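
A builder function is one way to guarantee that producers only ever publish concrete subjects. The sketch below follows the aggregate subject layout given earlier in this plan; the function names and the allowed token alphabet (ASCII letters, digits, `-`, `_`) are assumptions chosen for illustration, not rules stated by NATS or by this codebase.

```rust
/// A token is accepted only if it is non-empty and free of `.`, `*`, `>`
/// and whitespace. The exact alphabet is an assumption of this sketch.
fn valid_token(token: &str) -> bool {
    !token.is_empty()
        && token
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Builds `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`,
/// rejecting any input that would produce a wildcard or extra tokens.
pub fn aggregate_subject(
    tenant_id: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, String> {
    let parts = [
        ("tenant_id", tenant_id),
        ("aggregate_type", aggregate_type),
        ("aggregate_id", aggregate_id),
    ];
    for (name, token) in parts {
        if !valid_token(token) {
            return Err(format!("invalid {}: {:?}", name, token));
        }
    }
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        tenant_id, aggregate_type, aggregate_id
    ))
}
```

Rejecting bad tokens at build time means a malformed tenant ID fails loudly in the producer instead of silently landing on an unexpected subject.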

Consumer Naming Rules

  • Durable consumer names must be stable and collision-free:
    • include role + mode + optional view/saga name + shard/group
  • Ephemeral consumer names must be unique per operation:
    • include tenant + purpose + uuid
    • must be deleted best-effort when operation completes
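
The naming rules above might be implemented roughly as follows. All function names, the `-` separator, the `s<shard>` suffix, and the `tmp-` prefix are hypothetical. The caller supplies the unique part (for example a UUID string) so the sketch needs no external crates.

```rust
/// Replaces every character outside [A-Za-z0-9_-] with `_`.
/// Note: distinct inputs can sanitize to the same output (`a.b` and `a_b`),
/// so callers should still check for collisions between configured names.
pub fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

/// Stable durable name: role + mode + optional view/saga name + shard.
pub fn durable_name(role: &str, mode: &str, view: Option<&str>, shard: u32) -> String {
    let mut parts = vec![sanitize(role), sanitize(mode)];
    if let Some(v) = view {
        parts.push(sanitize(v));
    }
    parts.push(format!("s{}", shard));
    parts.join("-")
}

/// Ephemeral name: tenant + purpose + a caller-supplied unique suffix.
pub fn ephemeral_name(tenant: &str, purpose: &str, unique: &str) -> String {
    format!(
        "tmp-{}-{}-{}",
        sanitize(tenant),
        sanitize(purpose),
        sanitize(unique)
    )
}
```

A fixed prefix such as `tmp-` on ephemeral names would also let best-effort cleanup restrict itself to consumers it created.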

Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)

Goal

Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.

Exit Criteria

  • shared exposes NATS header constants and helpers for inject/extract/derive.
  • All producers set required headers consistently.
  • All consumers tolerate unknown fields and missing optional fields.
  • A single, documented subject naming convention is enforced in code (builder functions).
  • Workspace fmt/clippy/tests pass.

Tasks

  • Centralize NATS header constants and helpers in shared:
    • inject headers for publish (tenant, correlation, trace)
    • extract headers on receive (best-effort)
    • derive trace-id from traceparent
  • Aggregate:
    • Ensure event publishing always sets tenant-id, correlation headers, trace headers
    • Ensure Nats-Msg-Id strategy is correct for idempotency/dedupe (document and test)
  • Projection:
    • Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
    • Ensure correlation/trace context is carried into spans/metrics consistently
  • Runner:
    • Ensure publish paths include correlation/trace headers consistently for commands and results
    • Ensure outbox metadata → NATS headers mapping is consistent and tested
  • Tests:
    • Unit tests for header injection/extraction in shared
    • Per-service unit tests asserting produced headers include required keys
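
For the "derive trace-id from traceparent" task, a minimal sketch is shown below. It assumes the W3C Trace Context layout `version-traceid-parentid-flags` with lowercase hex fields of 2, 32, 16, and 2 characters; the function name is hypothetical. Extraction is best-effort, so any malformed value simply yields `None`.

```rust
/// Extracts the 32-hex-character trace ID from a W3C `traceparent` value.
/// Returns None for malformed input instead of failing the message.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Lowercase hex of an exact length.
    let is_hex = |s: &str, len: usize| {
        s.len() == len
            && s.bytes()
                .all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
    };

    let well_formed = is_hex(version, 2)
        && version != "ff" // version ff is reserved as invalid
        && is_hex(trace_id, 32)
        && is_hex(parent_id, 16)
        && is_hex(flags, 2)
        // An all-zero trace ID is not valid.
        && trace_id.bytes().any(|b| b != b'0');

    if well_formed {
        Some(trace_id)
    } else {
        None
    }
}
```

Trailing fields after the flags are ignored in this sketch, which keeps it tolerant of future traceparent versions.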

Required Tests

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace

Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)

Goal

Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.

Exit Criteria

  • Stream config for each stream is explicitly defined and validated at startup.
  • Limits (max messages/bytes/age) are explicit and have defaults.
  • Duplicate windows and dedupe behavior are explicit and tested.
  • A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).

Tasks

  • Define a single “stream config policy” module per service (or shared helper):
    • AGGREGATE_EVENTS subjects + retention policy
    • WORKFLOW_COMMANDS subjects + retention policy
    • WORKFLOW_EVENTS subjects + retention policy
  • Standardize defaults:
    • retention: limits appropriate for replay + rebuild
    • duplicate_window aligned with producer idempotency strategy
    • storage type and replication policy documented and configurable
  • Add startup validations:
    • verify stream exists and matches required subject set (compatible superset allowed)
    • verify required ack/dedupe assumptions hold
  • Add tests that parse and validate configs without NATS.
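
The tasks above call for stream configs that can be validated without a NATS server. One possible shape is sketched here: `StreamPolicy` and both functions are hypothetical, the limits shown are illustrative only, and the superset check compares subject strings exactly rather than reasoning about wildcard coverage. The rule that the duplicate window must not exceed max age, and the upper bound of 5 replicas, reflect JetStream behavior as I understand it and should be confirmed against the server documentation.

```rust
/// Hypothetical explicit stream contract; field set is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamPolicy {
    pub name: String,
    pub subjects: Vec<String>,
    pub max_bytes: i64,
    pub max_age_secs: u64,
    pub duplicate_window_secs: u64,
    pub replicas: usize,
}

/// Checks that every limit is explicit and internally consistent.
pub fn validate_policy(p: &StreamPolicy) -> Result<(), String> {
    if p.name.is_empty() || p.subjects.is_empty() {
        return Err("stream name and subjects must be set".into());
    }
    if p.max_bytes <= 0 || p.max_age_secs == 0 {
        return Err("max_bytes and max_age must be explicit and positive".into());
    }
    if p.duplicate_window_secs == 0 || p.duplicate_window_secs > p.max_age_secs {
        return Err("duplicate_window must be within (0, max_age]".into());
    }
    if p.replicas == 0 || p.replicas > 5 {
        return Err("replicas must be between 1 and 5".into());
    }
    Ok(())
}

/// Non-destructive startup check: the existing stream may carry extra
/// subjects, but every required subject must be present (exact match).
pub fn check_existing(required: &StreamPolicy, existing_subjects: &[String]) -> Result<(), String> {
    let missing: Vec<&String> = required
        .subjects
        .iter()
        .filter(|s| !existing_subjects.contains(*s))
        .collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("stream {} is missing subjects: {:?}", required.name, missing))
    }
}
```

On a mismatch the sketch returns an error rather than rewriting the stream, which matches the "create if missing; do not silently replace" policy.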

Required Tests

  • Unit tests for stream config builders
  • Existing crate tests

Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)

Goal

Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.

Exit Criteria

  • All long-lived consumers use explicit ack with consistent ack_wait, max_deliver, max_ack_pending.
  • Application concurrency is bounded and tied to max_ack_pending (the consumer's max in flight).
  • Poison policy is consistent:
    • after max_deliver attempts, the message is terminated (term) and a deadletter/quarantine record is written
  • Replay behavior is deterministic on restart (checkpoint-based where applicable).
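
The poison policy in the exit criteria can be expressed as a small pure function, which makes it testable without NATS. The `Disposition` enum and `disposition` function below are hypothetical names; `delivered` is assumed to be the 1-based delivery count that JetStream reports for the message.

```rust
/// What the consumer should tell JetStream about a message.
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    /// Processing succeeded (or the message was safely skipped).
    Ack,
    /// Processing failed; ask for redelivery.
    Nak,
    /// Final attempt failed; stop redelivery. The caller is expected to
    /// write the deadletter/quarantine record before sending the term.
    Term,
}

/// Decides the ack action from the handler outcome and the delivery count.
/// `delivered` is the 1-based delivery attempt for this message.
pub fn disposition(handler_ok: bool, delivered: u64, max_deliver: u64) -> Disposition {
    if handler_ok {
        Disposition::Ack
    } else if delivered >= max_deliver {
        // This was the last attempt the consumer config allows.
        Disposition::Term
    } else {
        Disposition::Nak
    }
}
```

Keeping the decision separate from the I/O lets the same rule be shared by Projection and Runner and unit-tested in isolation.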

Tasks

  • Define standard consumer config defaults:
    • AckPolicy::Explicit
    • ack_wait default + env override
    • max_deliver default + env override
    • max_ack_pending tied to application concurrency
  • Projection:
    • Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
    • Ensure the checkpoint gates acks correctly (events skipped as already applied are still acked)
    • Ensure poison policy writes durable records and terminates reliably
  • Runner:
    • Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
    • Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
  • Aggregate:
    • Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
    • Ensure best-effort cleanup is performed and cannot delete unrelated consumers
  • Tests:
    • Unit tests for consumer name generation (sanitization + uniqueness)
    • NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
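
The standard consumer defaults with env overrides could be resolved as sketched below. The environment variable names (`NATS_ACK_WAIT_SECS`, `NATS_MAX_DELIVER`), the default values, and the `ConsumerDefaults` struct are all assumptions for illustration. The environment lookup is injected as a closure so the logic can be tested without touching process state.

```rust
use std::time::Duration;

/// Hypothetical resolved consumer settings (ack policy is always explicit).
#[derive(Debug, PartialEq)]
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Reads a positive integer override; missing, unparsable, or zero values
/// fall back to the default.
fn env_u64(get_env: &impl Fn(&str) -> Option<String>, key: &str, default: u64) -> u64 {
    get_env(key)
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .filter(|v| *v > 0)
        .unwrap_or(default)
}

/// Builds the defaults. `concurrency` is the application's worker count;
/// max_ack_pending is tied to it so the server never hands out more
/// unacked messages than the application can process at once.
pub fn consumer_defaults(
    get_env: impl Fn(&str) -> Option<String>,
    concurrency: usize,
) -> ConsumerDefaults {
    let ack_wait_secs = env_u64(&get_env, "NATS_ACK_WAIT_SECS", 30);
    let max_deliver = env_u64(&get_env, "NATS_MAX_DELIVER", 5);
    ConsumerDefaults {
        ack_wait: Duration::from_secs(ack_wait_secs),
        max_deliver: max_deliver as i64,
        max_ack_pending: concurrency.max(1) as i64,
    }
}
```

In a service, the lookup would likely be `consumer_defaults(|k| std::env::var(k).ok(), worker_count)`.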

Required Tests

  • Workspace fmt/clippy/tests
  • NATS-gated integration tests for:
    • redelivery idempotency
    • poison termination behavior
    • scale-out with deliver group (where supported)

Milestone 3: Connection Management + Failure Semantics (Operational Frugality)

Goal

Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.

Exit Criteria

  • One NATS connection per process (or bounded pool only if justified).
  • Reconnect/backoff policy is explicit and consistent.
  • Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
  • No busy-looping on NATS outages.

Tasks

  • Standardize connection options:
    • reconnect delays/backoff
    • max reconnect attempts or “infinite with backoff” strategy (explicit)
    • request timeouts around JetStream operations
  • Standardize readiness semantics:
    • ready=false when NATS is unavailable and the node depends on it
    • health stays “process alive” but reports NATS connectivity in payload
  • Add a “fast fail” mode for tests and dev (avoid 30x retries when the env is not set).
  • Tests:
    • unit tests for backoff behavior (where possible)
    • gated integration test: temporary NATS outage does not crash-loop and recovers
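
An explicit reconnect policy can be as small as a capped exponential backoff function. The sketch below is an illustration only: the function name and the doubling schedule are assumptions, and jitter is omitted to keep it dependency-free and deterministic for unit tests.

```rust
use std::time::Duration;

/// Capped exponential backoff: base * 2^attempt, never more than `max`.
/// `attempt` is 0 for the first retry. Overflow saturates at `max`.
pub fn reconnect_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Clamp the shift so 2^attempt always fits in a u32.
    let factor = 1u32 << attempt.min(31);
    match base.checked_mul(factor) {
        Some(delay) => delay.min(max),
        None => max,
    }
}
```

Because the delay never drops to zero for a non-zero base and never exceeds the cap, a NATS outage produces neither a busy loop nor unbounded waits. A "fast fail" mode for tests could reuse the same function with a small cap and a low attempt limit.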

Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)

Goal

Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.

Exit Criteria

  • Durable names are deterministic and collision-free across replicas.
  • Deliver groups are used where appropriate to share work across replicas.
  • Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
  • A scale-out test suite exists and is gated but runnable.

Tasks

  • Establish consumer naming scheme per service role:
    • Projection: per-view durable option uses sanitized names and stable mapping
    • Runner: durable prefix includes role + shard + optional group
  • Establish deliver group usage rules:
    • when to enable (scale-out consumers)
    • how to roll without duplication
  • Strengthen dedupe keys:
    • event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
    • outbox relay: verify publish idempotency with Nats-Msg-Id
  • Add gated tests:
    • two replicas, same tenant, no duplicate publishes
    • rolling restart preserves checkpoint correctness
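
The "two replicas, no duplicate publishes" guarantee depends on both replicas deriving the same Nats-Msg-Id for the same logical message. The sketch below shows that idea with an in-memory stand-in for JetStream's duplicate window. The ID format, the `outbox_msg_id` function, and the `DedupeWindow` type are hypothetical, and the stand-in never expires entries, unlike a real duplicate window.

```rust
use std::collections::HashSet;

/// Deterministic message ID for an outbox row: any replica relaying the
/// same row produces the same ID. The format is an assumption.
pub fn outbox_msg_id(tenant_id: &str, outbox_id: u64) -> String {
    format!("outbox:{}:{}", tenant_id, outbox_id)
}

/// In-memory stand-in for a stream's duplicate window (no expiry).
#[derive(Default)]
pub struct DedupeWindow {
    seen: HashSet<String>,
}

impl DedupeWindow {
    /// Returns true if the message was stored, false if it was dropped
    /// as a duplicate of an earlier publish with the same ID.
    pub fn publish(&mut self, msg_id: &str) -> bool {
        self.seen.insert(msg_id.to_string())
    }
}
```

Since a real duplicate window is time-bounded, a retry that arrives after the window closes would be stored again, which is why the plan pairs publish dedupe with consumer-side idempotency.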

Verification Commands (Required at Each Milestone)

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace
  • Gated NATS integration tests:
    • Runner: RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored
    • Projection: PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored
    • Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests

Notes / Constraints

  • Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
  • Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
  • Prefer stable durable consumers; ephemeral consumers must be unique and bounded, and must be cleaned up on a best-effort basis.