cloudlysis/NATS_TRANSPORT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00

# NATS Transport Plan
## Purpose
Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:
- Simplicity (few primitives, consistent naming, minimal per-service divergence)
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
- High performance (high throughput, low tail latency, reliable backpressure)
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)
## Non-Negotiable Rules (Global)
- Every JetStream stream/consumer MUST have an explicit contract:
  - name, subjects, retention, storage, replication, max sizes
  - ack policy, ack wait, max deliver, max in flight
- Every node MUST run with bounded work:
  - bounded pull batch sizes
  - bounded concurrency
  - bounded retry/backoff
- Every message MUST be tenant-scoped in subject and/or headers.
- Every milestone below is “stop-the-line” gated:
  - all tasks completed
  - all tests passing
  - workspace lint/format checks passing
  - required NATS-gated integration tests for the milestone passing (when gated by env)
## Current State (Baseline)
- Streams:
  - `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume)
  - `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner)
- Subject conventions:
  - Aggregate events: `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`
  - Defaults often use filters like `tenant.*.aggregate.*.*`
- Durable consumers:
  - Projection uses a durable name (configurable)
  - Runner uses configurable durable prefix per role
  - Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
- Headers:
  - Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist
## Target Architecture (End State)
- A single “NATS wire protocol” contract shared across services:
  - subject naming
  - required headers (tenant/correlation/trace)
  - message envelope compatibility rules (tolerant decoding, optional fields)
- Stable, minimal set of JetStream streams:
  - one stream per message class (aggregate events, workflow commands, workflow events)
  - no per-tenant streams unless there is a strong operational reason
- Stable, limited consumers:
  - durable consumers for long-lived processors (Projection, Runner)
  - ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
- Uniform backpressure + reliability defaults:
  - explicit ack
  - bounded `max_ack_pending` and application-level concurrency
  - bounded redelivery via `max_deliver` + poison policy
## Definitions
### Message Context (Headers)
Standard headers for NATS published messages:
- `tenant-id` (required)
- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing)
- `traceparent` (optional but recommended; generated/propagated if present upstream)
- `trace-id` (optional; derived from traceparent when possible)
- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable)
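The header rules above could be captured in a small helper along these lines. This is a minimal sketch using only the standard library: the `Headers` alias, `inject_headers`, and `derive_trace_id` are hypothetical names (not the actual `shared` API), and a real implementation would write into the NATS client's own header type. The `traceparent` parsing assumes the W3C layout `version-traceid-parentid-flags`.

```rust
use std::collections::BTreeMap;

// Header names as listed in this section.
pub const TENANT_ID: &str = "tenant-id";
pub const X_CORRELATION_ID: &str = "x-correlation-id";
pub const CORRELATION_ID: &str = "correlation-id";
pub const TRACEPARENT: &str = "traceparent";
pub const TRACE_ID: &str = "trace-id";

/// Hypothetical stand-in for the client's header map.
pub type Headers = BTreeMap<String, String>;

/// Pull the trace id out of a W3C `traceparent` value.
/// Returns None when the value is malformed or the id is all zeros.
pub fn derive_trace_id(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.split('-');
    let _version = parts.next()?;
    let trace_id = parts.next()?;
    let ok = trace_id.len() == 32
        && trace_id.bytes().all(|b| b.is_ascii_hexdigit())
        && trace_id.bytes().any(|b| b != b'0');
    if ok {
        Some(trace_id)
    } else {
        None
    }
}

/// Build the headers for a published message.
/// Both correlation header spellings carry the same value.
pub fn inject_headers(
    tenant_id: &str,
    correlation_id: &str,
    traceparent: Option<&str>,
) -> Headers {
    let mut h = Headers::new();
    h.insert(TENANT_ID.to_string(), tenant_id.to_string());
    h.insert(X_CORRELATION_ID.to_string(), correlation_id.to_string());
    h.insert(CORRELATION_ID.to_string(), correlation_id.to_string());
    if let Some(tp) = traceparent {
        h.insert(TRACEPARENT.to_string(), tp.to_string());
        // trace-id is optional: only set it when it can be derived.
        if let Some(id) = derive_trace_id(tp) {
            h.insert(TRACE_ID.to_string(), id.to_string());
        }
    }
    h
}
```

Generating a correlation ID when none is supplied and setting `Nats-Msg-Id` are left to the caller in this sketch, since both depend on the producer's idempotency strategy.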
### Subject Naming Rules
- Tenant-first prefix: `tenant.<tenant_id>.…`
- Stable message class token:
  - `aggregate` for domain events
  - `effect`, `effect_result`, `workflow`, `workflow_event` for Runner
- No ambiguous wildcard publishing:
  - producers publish concrete subjects only
  - consumers may filter with wildcards
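One way to enforce "concrete subjects only" on the producer side is to route every publish through a builder that rejects separators and wildcards in tokens. The sketch below follows the aggregate subject shape from the baseline section; the function names and the `String` error type are illustrative assumptions, not existing code.

```rust
/// A token is publishable if it is non-empty and contains no NATS
/// separator ('.'), wildcard ('*', '>'), or whitespace.
fn is_concrete_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Producer side: build `tenant.<tenant_id>.aggregate.<type>.<id>`.
pub fn aggregate_subject(
    tenant_id: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, String> {
    for token in [tenant_id, aggregate_type, aggregate_id] {
        if !is_concrete_token(token) {
            return Err(format!("invalid subject token: {:?}", token));
        }
    }
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        tenant_id, aggregate_type, aggregate_id
    ))
}

/// Consumer side: wildcard filter for all aggregate events of one tenant.
pub fn aggregate_filter_for_tenant(tenant_id: &str) -> Result<String, String> {
    if !is_concrete_token(tenant_id) {
        return Err(format!("invalid tenant token: {:?}", tenant_id));
    }
    Ok(format!("tenant.{}.aggregate.*.*", tenant_id))
}
```

Keeping wildcards out of the producer builder and confining them to a separate filter builder makes accidental wildcard publishing a compile-time-visible choice of function rather than a string-formatting mistake.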
### Consumer Naming Rules
- Durable consumer names must be stable and collision-free:
  - include role + mode + optional view/saga name + shard/group
- Ephemeral consumer names must be unique per operation:
  - include tenant + purpose + uuid
  - must be deleted best-effort when operation completes
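The naming rules above might translate into builders like the following. This is a sketch under assumptions: the `-` separator, the `s<shard>` suffix, the `eph-` prefix, and the function names are all hypothetical, and the unique part is passed in by the caller (for example a UUID string) so the sketch stays dependency-free.

```rust
/// Replace every character outside [A-Za-z0-9_-] with '_' so the
/// result is safe to use as a consumer name.
pub fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '-' || c == '_' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

/// Durable name: role + mode + optional view/saga name + shard.
/// Deterministic: the same inputs always give the same name.
pub fn durable_name(role: &str, mode: &str, view: Option<&str>, shard: u32) -> String {
    let mut parts = vec![sanitize(role), sanitize(mode)];
    if let Some(v) = view {
        parts.push(sanitize(v));
    }
    parts.push(format!("s{}", shard));
    parts.join("-")
}

/// Ephemeral name: tenant + purpose + a caller-supplied unique suffix.
pub fn ephemeral_name(tenant_id: &str, purpose: &str, unique: &str) -> String {
    format!(
        "eph-{}-{}-{}",
        sanitize(tenant_id),
        sanitize(purpose),
        sanitize(unique)
    )
}
```

Note that sanitization alone is lossy (`a.b` and `a_b` map to the same string), so a scheme that must be strictly collision-free would need an extra disambiguator such as a short hash of the original input.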
## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)
### Goal
Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.
### Exit Criteria
- `shared` exposes NATS header constants and helpers for inject/extract/derive.
- All producers set required headers consistently.
- All consumers tolerate unknown fields and missing optional fields.
- A single, documented subject naming convention is enforced in code (builder functions).
- Workspace fmt/clippy/tests pass.
### Tasks
- [ ] Centralize NATS header constants and helpers in `shared`:
  - [ ] inject headers for publish (tenant, correlation, trace)
  - [ ] extract headers on receive (best-effort)
  - [ ] derive `trace-id` from `traceparent`
- [ ] Aggregate:
  - [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers
  - [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test)
- [ ] Projection:
  - [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
  - [ ] Ensure correlation/trace context is carried into spans/metrics consistently
- [ ] Runner:
  - [ ] Ensure publish paths include correlation/trace headers consistently for commands and results
  - [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested
- [ ] Tests:
  - [ ] Unit tests for header injection/extraction in `shared`
  - [ ] Per-service unit tests asserting produced headers include required keys
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)
### Goal
Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.
### Exit Criteria
- Stream config for each stream is explicitly defined and validated at startup.
- Limits (max messages/bytes/age) are explicit and have defaults.
- Duplicate windows and dedupe behavior are explicit and tested.
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).
### Tasks
- [ ] Define a single “stream config policy” module per service (or shared helper):
  - [ ] `AGGREGATE_EVENTS` subjects + retention policy
  - [ ] `WORKFLOW_COMMANDS` subjects + retention policy
  - [ ] `WORKFLOW_EVENTS` subjects + retention policy
- [ ] Standardize defaults:
  - [ ] retention: limits appropriate for replay + rebuild
  - [ ] `duplicate_window` aligned with producer idempotency strategy
  - [ ] storage type and replication policy documented and configurable
- [ ] Add startup validations:
  - [ ] verify stream exists and matches required subject set (compatible superset allowed)
  - [ ] verify required ack/dedupe assumptions hold
- [ ] Add tests that parse and validate configs without NATS.
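A stream policy that can be built and validated without a NATS server could look roughly like this. It is a sketch only: `StreamPolicy`, its fields, and every numeric limit are placeholder assumptions rather than agreed defaults, and the superset check compares subjects by exact string match, ignoring wildcard containment.

```rust
use std::time::Duration;

/// Hypothetical, client-independent description of a stream contract.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub subjects: Vec<String>,
    pub max_messages: i64,
    pub max_bytes: i64,
    pub max_age: Duration,
    pub duplicate_window: Duration,
    pub replicas: usize,
}

impl StreamPolicy {
    /// Reject configs with missing subjects or unbounded/zero limits.
    pub fn validate(&self) -> Result<(), String> {
        if self.subjects.is_empty() {
            return Err(format!("{}: no subjects", self.name));
        }
        if self.max_messages <= 0 || self.max_bytes <= 0 || self.max_age.is_zero() {
            return Err(format!("{}: limits must be explicit and positive", self.name));
        }
        // Sanity rule for this sketch: dedupe cannot outlive retention.
        if self.duplicate_window > self.max_age {
            return Err(format!("{}: duplicate_window exceeds max_age", self.name));
        }
        if self.replicas == 0 {
            return Err(format!("{}: replicas must be >= 1", self.name));
        }
        Ok(())
    }

    /// "Compatible superset allowed": the existing stream may carry extra
    /// subjects, but every required subject must be present.
    pub fn is_satisfied_by(&self, existing_subjects: &[String]) -> bool {
        self.subjects.iter().all(|s| existing_subjects.contains(s))
    }
}

/// Placeholder values; real limits belong to the per-environment config.
pub fn aggregate_events_policy() -> StreamPolicy {
    StreamPolicy {
        name: "AGGREGATE_EVENTS",
        subjects: vec!["tenant.*.aggregate.*.*".to_string()],
        max_messages: 10_000_000,
        max_bytes: 10 * 1024 * 1024 * 1024,
        max_age: Duration::from_secs(30 * 24 * 3600),
        duplicate_window: Duration::from_secs(120),
        replicas: 1,
    }
}
```

At startup, a service following the "no destructive changes" policy would call `validate()` on its own policy, fetch the existing stream's subjects, and refuse to start (rather than update the stream) when `is_satisfied_by` returns false.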
### Required Tests
- Unit tests for stream config builders
- Existing crate tests
## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)
### Goal
Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.
### Exit Criteria
- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`.
- Application concurrency is bounded and tied to `max_in_flight`.
- Poison policy is consistent:
  - after `max_deliver` attempts, the message is terminated (`term`) and a deadletter/quarantine record is written
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
### Tasks
- [ ] Define standard consumer config defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to application concurrency
- [ ] Projection:
  - [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
  - [ ] Ensure checkpoint gates ack correctly (skip still acks)
  - [ ] Ensure poison policy writes durable records and terminates reliably
- [ ] Runner:
  - [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
  - [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
- [ ] Aggregate:
  - [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
  - [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers
- [ ] Tests:
  - [ ] Unit tests for consumer name generation (sanitization + uniqueness)
  - [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
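The consumer defaults and poison rule for this milestone can be sketched as plain data plus a pure decision function, which keeps them unit-testable without NATS. Everything named here is an assumption for illustration: the env variable names `NATS_ACK_WAIT_SECS` and `NATS_MAX_DELIVER`, the default values of 30 seconds and 5 deliveries, and the `Outcome` enum.

```rust
use std::time::Duration;

/// Hypothetical standard consumer policy (explicit ack is implied).
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerPolicy {
    pub ack_wait: Duration,
    pub max_deliver: u32,
    pub max_ack_pending: usize,
}

impl ConsumerPolicy {
    /// Build from defaults with overrides read through `lookup`, so tests
    /// can inject values without touching the process environment.
    pub fn from_lookup<F>(lookup: F, concurrency: usize) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let ack_wait_secs = lookup("NATS_ACK_WAIT_SECS")
            .and_then(|v| v.parse::<u64>().ok())
            .filter(|s| *s > 0)
            .unwrap_or(30);
        let max_deliver = lookup("NATS_MAX_DELIVER")
            .and_then(|v| v.parse::<u32>().ok())
            .filter(|n| *n > 0)
            .unwrap_or(5);
        ConsumerPolicy {
            ack_wait: Duration::from_secs(ack_wait_secs),
            max_deliver,
            // Tie server-side in-flight messages to application concurrency.
            max_ack_pending: concurrency.max(1),
        }
    }

    /// Same as `from_lookup`, reading real environment variables.
    pub fn from_env(concurrency: usize) -> Self {
        Self::from_lookup(|key| std::env::var(key).ok(), concurrency)
    }
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Ack,
    Retry,
    TermAndQuarantine,
}

/// Decide what to do after handling a message.
/// `delivery_count` is 1-based (1 = first delivery).
pub fn decide(policy: &ConsumerPolicy, delivery_count: u32, handled_ok: bool) -> Outcome {
    if handled_ok {
        Outcome::Ack
    } else if delivery_count >= policy.max_deliver {
        Outcome::TermAndQuarantine
    } else {
        Outcome::Retry
    }
}
```

Invalid or zero overrides fall back to the defaults in this sketch; a stricter variant could fail startup instead, which would be more in line with the "explicit contract" rule.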
### Required Tests
- Workspace fmt/clippy/tests
- NATS-gated integration tests for:
  - redelivery idempotency
  - poison termination behavior
  - scale-out with deliver group (where supported)
## Milestone 3: Connection Management + Failure Semantics (Operational Frugality)
### Goal
Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.
### Exit Criteria
- One NATS connection per process (or bounded pool only if justified).
- Reconnect/backoff policy is explicit and consistent.
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
- No busy-looping on NATS outages.
### Tasks
- [ ] Standardize connection options:
  - [ ] reconnect delays/backoff
  - [ ] max reconnect attempts or “infinite with backoff” strategy (explicit)
  - [ ] request timeouts around JetStream operations
- [ ] Standardize readiness semantics:
  - [ ] `ready=false` when NATS is unavailable and the node depends on it
  - [ ] `health` stays “process alive” but reports NATS connectivity in payload
- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
- [ ] Tests:
  - [ ] unit tests for backoff behavior (where possible)
  - [ ] gated integration test: temporary NATS outage does not crash-loop and recovers
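The reconnect policy can be expressed as a pure function from attempt number to delay, which is what makes the "unit tests for backoff behavior" task feasible. Below is a minimal sketch of capped exponential backoff; the `Backoff` type and its fields are hypothetical, jitter is deliberately omitted to keep it deterministic, and `max_attempts: None` stands for the "infinite with backoff" strategy.

```rust
use std::time::Duration;

/// Hypothetical reconnect policy: exponential delay with a hard cap.
pub struct Backoff {
    pub base: Duration,
    pub cap: Duration,
    /// None = retry forever (always with a delay); Some(n) = give up after n.
    pub max_attempts: Option<u32>,
}

impl Backoff {
    /// Delay before retry number `attempt` (0-based), or None to stop.
    pub fn delay(&self, attempt: u32) -> Option<Duration> {
        if let Some(max) = self.max_attempts {
            if attempt >= max {
                return None;
            }
        }
        // 2^attempt, saturating for very large attempt numbers.
        let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
        let raw = self.base.checked_mul(factor).unwrap_or(self.cap);
        Some(raw.min(self.cap))
    }
}
```

A "fast fail" profile for tests and dev is then just a different value, for example a base of 10 ms, a cap of 50 ms, and `max_attempts: Some(2)`. Because every returned delay is at least `base`, a loop driven by this function cannot busy-loop during an outage as long as `base` is non-zero.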
## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)
### Goal
Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.
### Exit Criteria
- Durable names are deterministic and collision-free across replicas.
- Deliver groups are used where appropriate to share work across replicas.
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
- A scale-out test suite exists and is gated but runnable.
### Tasks
- [ ] Establish consumer naming scheme per service role:
  - [ ] Projection: per-view durable option uses sanitized names and stable mapping
  - [ ] Runner: durable prefix includes role + shard + optional group
- [ ] Establish deliver group usage rules:
  - [ ] when to enable (scale-out consumers)
  - [ ] how to roll without duplication
- [ ] Strengthen dedupe keys:
  - [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
  - [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id`
- [ ] Add gated tests:
  - [ ] two replicas, same tenant, no duplicate publishes
  - [ ] rolling restart preserves checkpoint correctness
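To reason about the "two replicas, no duplicate publishes" test before wiring up NATS, the dedupe behavior can be modeled in memory. JetStream drops a publish whose `Nats-Msg-Id` was already seen within the stream's duplicate window; the sketch below imitates that with explicit millisecond timestamps. The `outbox:<tenant>:<id>` message ID format and the `DedupeWindow` type are assumptions for illustration, not the documented strategy.

```rust
use std::collections::HashMap;

/// Hypothetical deterministic message ID for an outbox row. Two replicas
/// relaying the same row produce the same ID.
pub fn outbox_msg_id(tenant_id: &str, outbox_id: u64) -> String {
    format!("outbox:{}:{}", tenant_id, outbox_id)
}

/// In-memory model of a duplicate window keyed by message ID.
pub struct DedupeWindow {
    window_ms: u64,
    seen: HashMap<String, u64>,
}

impl DedupeWindow {
    pub fn new(window_ms: u64) -> Self {
        DedupeWindow {
            window_ms,
            seen: HashMap::new(),
        }
    }

    /// Returns true if the message is stored, false if it is dropped as a
    /// duplicate. IDs older than the window are forgotten first.
    pub fn publish(&mut self, msg_id: &str, now_ms: u64) -> bool {
        let window = self.window_ms;
        self.seen
            .retain(|_, first_seen| now_ms.saturating_sub(*first_seen) < window);
        if self.seen.contains_key(msg_id) {
            return false;
        }
        self.seen.insert(msg_id.to_string(), now_ms);
        true
    }
}
```

The model also shows the limit of publish-side dedupe: a retry that arrives after the window has expired is stored again, which is why the plan pairs `Nats-Msg-Id` with consumer-side checkpoints and dedupe markers.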
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- Gated NATS integration tests:
  - Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
  - Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
  - Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests
## Notes / Constraints
- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
- Prefer stable durable consumers; ephemeral consumers must be unique, bounded, and cleaned up on a best-effort basis.