# NATS Transport Plan

## Purpose
Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:
- Simplicity (few primitives, consistent naming, minimal per-service divergence)
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
- High performance (high throughput, low tail latency, reliable backpressure)
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)

## Non-Negotiable Rules (Global)
- Every JetStream stream/consumer MUST have an explicit contract:
  - name, subjects, retention, storage, replication, max sizes
  - ack policy, ack wait, max deliver, max ack pending (the in-flight limit)
- Every node MUST run with bounded work:
  - bounded pull batch sizes
  - bounded concurrency
  - bounded retry/backoff
- Every message MUST be tenant-scoped in subject and/or headers.
- Every milestone below is “stop-the-line” gated:
  - all tasks completed
  - all tests passing
  - workspace lint/format checks passing
  - required NATS-gated integration tests for the milestone passing (when gated by env)

## Current State (Baseline)
- Streams:
  - `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume)
  - `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner)
- Subject conventions:
  - Aggregate events: `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`
  - Defaults often use filters like `tenant.*.aggregate.*.*`
- Durable consumers:
  - Projection uses a durable name (configurable)
  - Runner uses configurable durable prefix per role
  - Aggregate's ad-hoc fetch consumers were previously a risk; this is now mitigated by using a unique consumer name per fetch
- Headers:
  - Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist

## Target Architecture (End State)
- A single “NATS wire protocol” contract shared across services:
  - subject naming
  - required headers (tenant/correlation/trace)
  - message envelope compatibility rules (tolerant decoding, optional fields)
- Stable, minimal set of JetStream streams:
  - one stream per message class (aggregate events, workflow commands, workflow events)
  - no per-tenant streams unless there is a strong operational reason
- Stable, limited consumers:
  - durable consumers for long-lived processors (Projection, Runner)
  - ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
- Uniform backpressure + reliability defaults:
  - explicit ack
  - bounded `max_ack_pending` and application-level concurrency
  - bounded redelivery via `max_deliver` + poison policy

## Definitions
### Message Context (Headers)
Standard headers for NATS published messages:
- `tenant-id` (required)
- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing)
- `traceparent` (optional but recommended; propagated if present upstream, generated otherwise)
- `trace-id` (optional; derived from traceparent when possible)
- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable)
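
The header contract above could be centralized roughly as follows. This is a minimal sketch, not the actual `shared` API: the `MessageContext` type, the `inject_headers` function, and the use of a plain `BTreeMap` in place of the NATS client's header map are all assumptions made for illustration. Only the header names come from the list above.

```rust
use std::collections::BTreeMap;

// Header names from the list above.
pub const HDR_TENANT_ID: &str = "tenant-id";
pub const HDR_X_CORRELATION_ID: &str = "x-correlation-id";
pub const HDR_CORRELATION_ID: &str = "correlation-id";
pub const HDR_TRACEPARENT: &str = "traceparent";
pub const HDR_NATS_MSG_ID: &str = "Nats-Msg-Id";

/// Hypothetical publish context; the real type in `shared` may differ.
#[derive(Debug, Clone, Default)]
pub struct MessageContext {
    pub tenant_id: String,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub msg_id: Option<String>,
}

/// Builds the headers for one publish.
/// `fallback_correlation_id` is used when no correlation ID came from upstream.
pub fn inject_headers(
    ctx: &MessageContext,
    fallback_correlation_id: &str,
) -> Result<BTreeMap<String, String>, String> {
    // tenant-id is required, so refuse to build headers without it.
    if ctx.tenant_id.trim().is_empty() {
        return Err("tenant-id is required".to_string());
    }
    let mut headers = BTreeMap::new();
    headers.insert(HDR_TENANT_ID.to_string(), ctx.tenant_id.clone());

    // Both correlation headers carry the same value.
    let correlation = ctx
        .correlation_id
        .clone()
        .unwrap_or_else(|| fallback_correlation_id.to_string());
    headers.insert(HDR_X_CORRELATION_ID.to_string(), correlation.clone());
    headers.insert(HDR_CORRELATION_ID.to_string(), correlation);

    // Optional headers are only set when a value is available.
    if let Some(tp) = &ctx.traceparent {
        headers.insert(HDR_TRACEPARENT.to_string(), tp.clone());
    }
    if let Some(id) = &ctx.msg_id {
        headers.insert(HDR_NATS_MSG_ID.to_string(), id.clone());
    }
    Ok(headers)
}
```

Returning an error for a missing tenant keeps the "tenant-scoped" rule enforceable at the publish call site rather than at the consumer.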

### Subject Naming Rules
- Tenant-first prefix: `tenant.<tenant_id>.…`
- Stable message class token:
  - `aggregate` for domain events
  - `effect`, `effect_result`, `workflow`, `workflow_event` for Runner
- No ambiguous wildcard publishing:
  - producers publish concrete subjects only
  - consumers may filter with wildcards
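
One way to enforce these rules is a builder that rejects wildcard or separator characters in any token, so producers can only emit concrete subjects. The function names below are hypothetical; the subject layout and the filter string are the ones listed under Current State.

```rust
/// Returns true if `token` can be used as a single, concrete subject token.
fn is_valid_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Builds `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`.
/// Fails instead of publishing to an ambiguous or wildcard subject.
pub fn aggregate_event_subject(
    tenant_id: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, String> {
    let tokens = [
        ("tenant_id", tenant_id),
        ("aggregate_type", aggregate_type),
        ("aggregate_id", aggregate_id),
    ];
    for (name, token) in tokens {
        if !is_valid_token(token) {
            return Err(format!("invalid subject token for {name}: {token:?}"));
        }
    }
    Ok(format!(
        "tenant.{tenant_id}.aggregate.{aggregate_type}.{aggregate_id}"
    ))
}

/// Consumer-side filter; wildcards are allowed only here.
pub fn aggregate_event_filter() -> &'static str {
    "tenant.*.aggregate.*.*"
}
```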

### Consumer Naming Rules
- Durable consumer names must be stable and collision-free:
  - include role + mode + optional view/saga name + shard/group
- Ephemeral consumer names must be unique per operation:
  - include tenant + purpose + uuid
  - must be deleted best-effort when operation completes
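
A possible shape for the name builders is sketched below. The function names, the `-` separator, and the `tmp-` prefix are assumptions for illustration. The unique suffix (for example a UUID) is supplied by the caller so the sketch stays free of external crates.

```rust
/// Replaces every character outside [A-Za-z0-9_-] with '_'.
pub fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

/// Durable name: role + mode + optional view/saga name + shard/group.
pub fn durable_name(role: &str, mode: &str, view_or_saga: Option<&str>, shard: &str) -> String {
    let mut parts = vec![sanitize(role), sanitize(mode)];
    if let Some(name) = view_or_saga {
        parts.push(sanitize(name));
    }
    parts.push(sanitize(shard));
    parts.join("-")
}

/// Ephemeral name: tenant + purpose + caller-supplied unique suffix.
pub fn ephemeral_name(tenant: &str, purpose: &str, unique: &str) -> String {
    format!(
        "tmp-{}-{}-{}",
        sanitize(tenant),
        sanitize(purpose),
        sanitize(unique)
    )
}
```

Note that replacing characters can map two distinct inputs to the same name (`orders.v1` and `orders_v1`), so a stricter variant might reject invalid input or append a short hash of the original value. A fixed prefix such as `tmp-` also gives cleanup code a simple way to avoid touching durable consumers.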

## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)

### Goal
Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.

### Exit Criteria
- `shared` exposes NATS header constants and helpers for inject/extract/derive.
- All producers set required headers consistently.
- All consumers tolerate unknown fields and missing optional fields.
- A single, documented subject naming convention is enforced in code (builder functions).
- Workspace fmt/clippy/tests pass.

### Tasks
- [ ] Centralize NATS header constants and helpers in `shared`:
  - [ ] inject headers for publish (tenant, correlation, trace)
  - [ ] extract headers on receive (best-effort)
  - [ ] derive `trace-id` from `traceparent`
- [ ] Aggregate:
  - [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers
  - [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test)
- [ ] Projection:
  - [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
  - [ ] Ensure correlation/trace context is carried into spans/metrics consistently
- [ ] Runner:
  - [ ] Ensure publish paths include correlation/trace headers consistently for commands and results
  - [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested
- [ ] Tests:
  - [ ] Unit tests for header injection/extraction in `shared`
  - [ ] Per-service unit tests asserting produced headers include required keys
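
The `trace-id` derivation task can be illustrated with a small parser for the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). The function name is hypothetical, and this sketch only validates the fields it needs; it returns `None` rather than failing, in line with best-effort extraction.

```rust
/// Extracts the 32-hex-digit trace ID from a `traceparent` value.
/// Returns None when the value is malformed or the trace ID is all zeros.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let _flags = parts.next()?;

    // The spec requires lowercase hex digits.
    let is_hex = |s: &str| {
        s.bytes()
            .all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
    };

    let valid = version.len() == 2
        && is_hex(version)
        && trace_id.len() == 32
        && is_hex(trace_id)
        && trace_id.bytes().any(|b| b != b'0') // all-zero trace IDs are invalid
        && parent_id.len() == 16
        && is_hex(parent_id);

    if valid {
        Some(trace_id)
    } else {
        None
    }
}
```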

### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`

## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)

### Goal
Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.

### Exit Criteria
- Stream config for each stream is explicitly defined and validated at startup.
- Limits (max messages/bytes/age) are explicit and have defaults.
- Duplicate windows and dedupe behavior are explicit and tested.
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).

### Tasks
- [ ] Define a single “stream config policy” module per service (or shared helper):
  - [ ] `AGGREGATE_EVENTS` subjects + retention policy
  - [ ] `WORKFLOW_COMMANDS` subjects + retention policy
  - [ ] `WORKFLOW_EVENTS` subjects + retention policy
- [ ] Standardize defaults:
  - [ ] retention: limits appropriate for replay + rebuild
  - [ ] `duplicate_window` aligned with producer idempotency strategy
  - [ ] storage type and replication policy documented and configurable
- [ ] Add startup validations:
  - [ ] verify stream exists and matches required subject set (compatible superset allowed)
  - [ ] verify required ack/dedupe assumptions hold
- [ ] Add tests that parse and validate configs without NATS.
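
A stream policy that can be validated without a NATS server might look like the sketch below. The `StreamPolicy` type and both functions are hypothetical, and the limit values are placeholders rather than recommended defaults. The subject check uses exact string matching, which is a simplification of the "compatible superset" rule.

```rust
use std::time::Duration;

/// Hypothetical, client-independent description of a stream contract.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub max_bytes: u64,
    pub duplicate_window: Duration,
}

/// Placeholder values; real defaults belong in the policy module.
pub fn aggregate_events_policy() -> StreamPolicy {
    StreamPolicy {
        name: "AGGREGATE_EVENTS",
        subjects: vec!["tenant.*.aggregate.*.*".to_string()],
        max_age: Duration::from_secs(7 * 24 * 3600),
        max_bytes: 1 << 30,
        duplicate_window: Duration::from_secs(120),
    }
}

/// Compares the required policy with what an existing stream reports.
/// Extra subjects on the existing stream are accepted (superset allowed);
/// nothing is modified, so startup stays non-destructive.
pub fn validate_existing(
    required: &StreamPolicy,
    existing_subjects: &[String],
    existing_duplicate_window: Duration,
) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    for subject in &required.subjects {
        if !existing_subjects.contains(subject) {
            problems.push(format!(
                "stream {} is missing subject {}",
                required.name, subject
            ));
        }
    }
    // A shorter window than required would weaken producer dedupe.
    if existing_duplicate_window < required.duplicate_window {
        problems.push(format!(
            "stream {} has duplicate_window {:?}, need at least {:?}",
            required.name, existing_duplicate_window, required.duplicate_window
        ));
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

Collecting all problems before returning lets a startup log show every mismatch at once instead of one per restart.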

### Required Tests
- Unit tests for stream config builders
- Existing crate tests

## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)

### Goal
Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.

### Exit Criteria
- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`.
- Application concurrency is bounded and tied to `max_ack_pending`.
- Poison policy is consistent:
  - once `max_deliver` is reached, the message is terminated (term ack) and a dead-letter/quarantine record is written
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
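
The checkpoint-based replay criterion can be sketched as a pure function. This assumes each message carries a monotonically increasing stream sequence and that the checkpoint stores the last applied sequence; the names are hypothetical. Messages at or below the checkpoint are skipped but still acked, so redelivery after a restart does not reapply them.

```rust
/// What to do with one delivered message.
#[derive(Debug, PartialEq)]
pub enum Step {
    /// Already applied: do not run the handler, but still ack.
    SkipAndAck,
    /// New message: apply, advance the checkpoint, then ack.
    ApplyThenAck,
}

pub fn gate(checkpoint: u64, stream_seq: u64) -> Step {
    if stream_seq <= checkpoint {
        Step::SkipAndAck
    } else {
        Step::ApplyThenAck
    }
}

/// Simulates processing a delivery sequence (including redeliveries)
/// and returns the sequences that were actually applied.
pub fn replay(checkpoint: &mut u64, deliveries: &[u64]) -> Vec<u64> {
    let mut applied = Vec::new();
    for &seq in deliveries {
        if gate(*checkpoint, seq) == Step::ApplyThenAck {
            applied.push(seq);
            *checkpoint = seq;
        }
        // Both branches end with an ack in the real consumer loop.
    }
    applied
}
```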

### Tasks
- [ ] Define standard consumer config defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to application concurrency
- [ ] Projection:
  - [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
  - [ ] Ensure the checkpoint gates ack correctly (messages skipped as already applied are still acked)
  - [ ] Ensure poison policy writes durable records and terminates reliably
- [ ] Runner:
  - [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
  - [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
- [ ] Aggregate:
  - [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
  - [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers
- [ ] Tests:
  - [ ] Unit tests for consumer name generation (sanitization + uniqueness)
  - [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
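
The defaults and the poison decision from the tasks above might be expressed as below. Everything here is illustrative: the `ConsumerPolicy` type, the environment variable names `NATS_ACK_WAIT_MS` and `NATS_MAX_DELIVER`, and the default values are assumptions. The lookup is injected as a closure so unit tests do not mutate the process environment; production code could pass `|k| std::env::var(k).ok()`.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerPolicy {
    pub ack_wait: Duration,
    pub max_deliver: u64,
    pub max_ack_pending: usize,
}

/// Builds the policy from defaults plus optional overrides.
/// `max_ack_pending` is tied directly to application concurrency.
pub fn consumer_policy(
    lookup: impl Fn(&str) -> Option<String>,
    concurrency: usize,
) -> ConsumerPolicy {
    // Invalid or non-positive overrides fall back to the default.
    let parse = |key: &str, default: u64| -> u64 {
        lookup(key)
            .and_then(|v| v.trim().parse::<u64>().ok())
            .filter(|n| *n > 0)
            .unwrap_or(default)
    };
    ConsumerPolicy {
        ack_wait: Duration::from_millis(parse("NATS_ACK_WAIT_MS", 30_000)),
        max_deliver: parse("NATS_MAX_DELIVER", 5),
        max_ack_pending: concurrency.max(1),
    }
}

/// Outcome for one delivery attempt.
#[derive(Debug, PartialEq)]
pub enum Disposition {
    Ack,
    /// Negative ack (or let ack_wait expire) so the message is redelivered.
    Retry,
    /// Final attempt failed: write a quarantine record, then terminate.
    Poison,
}

pub fn disposition(handler_ok: bool, delivery_attempt: u64, max_deliver: u64) -> Disposition {
    if handler_ok {
        Disposition::Ack
    } else if delivery_attempt >= max_deliver {
        Disposition::Poison
    } else {
        Disposition::Retry
    }
}
```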

### Required Tests
- Workspace fmt/clippy/tests
- NATS-gated integration tests for:
  - redelivery idempotency
  - poison termination behavior
  - scale-out with deliver group (where supported)

## Milestone 3: Connection Management + Failure Semantics (Operational Frugality)

### Goal
Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.

### Exit Criteria
- One NATS connection per process (or bounded pool only if justified).
- Reconnect/backoff policy is explicit and consistent.
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
- No busy-looping on NATS outages.

### Tasks
- [ ] Standardize connection options:
  - [ ] reconnect delays/backoff
  - [ ] max reconnect attempts or “infinite with backoff” strategy (explicit)
  - [ ] request timeouts around JetStream operations
- [ ] Standardize readiness semantics:
  - [ ] `ready=false` when NATS is unavailable and the node depends on it
  - [ ] `health` stays “process alive” but reports NATS connectivity in payload
- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
- [ ] Tests:
  - [ ] unit tests for backoff behavior (where possible)
  - [ ] gated integration test: temporary NATS outage does not crash-loop and recovers
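
For the backoff unit tests, the delay calculation can be kept as a pure function independent of the NATS client. The sketch below uses capped exponential backoff without jitter; the function name and the choice of doubling are assumptions, not a description of the current implementation.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// base * 2^attempt, never exceeding `cap`.
pub fn reconnect_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Clamp the exponent so 2^attempt always fits in a u32.
    let factor = 2u32.saturating_pow(attempt.min(31));
    // On overflow of the multiplication, fall back to the cap.
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

Because the delay never drops below `base` and never exceeds `cap`, an "infinite with backoff" strategy built on it cannot busy-loop during an outage.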

## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)

### Goal
Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.

### Exit Criteria
- Durable names are deterministic and collision-free across replicas.
- Deliver groups are used where appropriate to share work across replicas.
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
- A scale-out test suite exists and is gated but runnable.

### Tasks
- [ ] Establish consumer naming scheme per service role:
  - [ ] Projection: per-view durable option uses sanitized names and stable mapping
  - [ ] Runner: durable prefix includes role + shard + optional group
- [ ] Establish deliver group usage rules:
  - [ ] when to enable (scale-out consumers)
  - [ ] how to roll without duplication
- [ ] Strengthen dedupe keys:
  - [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
  - [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id`
- [ ] Add gated tests:
  - [ ] two replicas, same tenant, no duplicate publishes
  - [ ] rolling restart preserves checkpoint correctness
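
The outbox idempotency check can be modeled in memory before running the gated tests. In the sketch below, `outbox_msg_id` and `DedupeModel` are hypothetical: the ID format is an assumption, and the model stands in for JetStream's `Nats-Msg-Id` deduplication while ignoring expiry of the duplicate window. It shows why two replicas relaying the same outbox row should result in a single stored message.

```rust
use std::collections::HashSet;

/// Deterministic message ID derived from the outbox row, so every replica
/// computes the same `Nats-Msg-Id` for the same row.
pub fn outbox_msg_id(tenant_id: &str, outbox_id: u64) -> String {
    format!("outbox:{tenant_id}:{outbox_id}")
}

/// In-memory stand-in for a stream with message-ID deduplication.
#[derive(Default)]
pub struct DedupeModel {
    seen: HashSet<String>,
    pub stored: Vec<String>,
}

impl DedupeModel {
    /// Returns true if the message was stored, false if it was a duplicate.
    pub fn publish(&mut self, msg_id: &str, payload: &str) -> bool {
        if !self.seen.insert(msg_id.to_string()) {
            return false;
        }
        self.stored.push(payload.to_string());
        true
    }
}
```

Including the tenant in the ID keeps rows with the same numeric ID in different tenants from deduplicating against each other.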

## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- Gated NATS integration tests:
  - Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
  - Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
  - Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests

## Notes / Constraints
- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
- Prefer stable durable consumers; ephemeral consumers must be unique, bounded, and cleaned up on a best-effort basis.