madapes/cloudlysis

Vlad Durnea 1298d9a3df

Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets

2026-03-30 11:40:42 +03:00

NATS Transport Plan

Purpose

Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:

Simplicity (few primitives, consistent naming, minimal per-service divergence)
Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
High performance (high throughput, low tail latency, reliable backpressure)
Safety (tenant isolation, idempotency, deterministic replay, poison handling)

Non-Negotiable Rules (Global)

Every JetStream stream/consumer MUST have an explicit contract:
- name, subjects, retention, storage, replication, max sizes
- ack policy, ack wait, max deliver, max in flight
Every node MUST run with bounded work:
- bounded pull batch sizes
- bounded concurrency
- bounded retry/backoff
Every message MUST be tenant-scoped in subject and/or headers.
Every milestone below is “stop-the-line” gated:
- all tasks completed
- all tests passing
- workspace lint/format checks passing
- required NATS-gated integration tests for the milestone passing (when gated by env)

Current State (Baseline)

Streams:
- AGGREGATE_EVENTS (Aggregate publishes, Projection/Runner consume)
- WORKFLOW_COMMANDS, WORKFLOW_EVENTS (Runner)
Subject conventions:
- Aggregate events: tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>
- Defaults often use filters like tenant.*.aggregate.*.*
Durable consumers:
- Projection uses a durable name (configurable)
- Runner uses configurable durable prefix per role
- Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
Headers:
- Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist

Target Architecture (End State)

A single “NATS wire protocol” contract shared across services:
- subject naming
- required headers (tenant/correlation/trace)
- message envelope compatibility rules (tolerant decoding, optional fields)
Stable, minimal set of JetStream streams:
- one stream per message class (aggregate events, workflow commands, workflow events)
- no per-tenant streams unless there is a strong operational reason
Stable, limited consumers:
- durable consumers for long-lived processors (Projection, Runner)
- ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
Uniform backpressure + reliability defaults:
- explicit ack
- bounded max_ack_pending and application-level concurrency
- bounded redelivery via max_deliver + poison policy

Definitions

Message Context (Headers)

Standard headers for NATS published messages:

tenant-id (required)
x-correlation-id and correlation-id (required for any request-derived message; generated if missing)
traceparent (optional but recommended; propagated if present upstream, otherwise generated)
trace-id (optional; derived from traceparent when possible)
Nats-Msg-Id (required for idempotent publish/dedupe when applicable)
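
The header list above could be centralized roughly as follows. This is a minimal sketch using only the standard library: the header names come from the list, while `MessageContext`, `inject_headers`, and `extract_context` are hypothetical names and are not taken from the `shared` crate. A plain map stands in for the NATS client's header type.

```rust
use std::collections::BTreeMap;

// Header names as listed above.
pub const HDR_TENANT_ID: &str = "tenant-id";
pub const HDR_X_CORRELATION_ID: &str = "x-correlation-id";
pub const HDR_CORRELATION_ID: &str = "correlation-id";
pub const HDR_TRACEPARENT: &str = "traceparent";
pub const HDR_NATS_MSG_ID: &str = "Nats-Msg-Id";

/// Hypothetical message context (assumption, not the real `shared` type).
#[derive(Debug, Clone, PartialEq)]
pub struct MessageContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
}

/// Build the headers for a publish. Both correlation spellings are always set.
pub fn inject_headers(ctx: &MessageContext, msg_id: Option<&str>) -> BTreeMap<String, String> {
    let mut h = BTreeMap::new();
    h.insert(HDR_TENANT_ID.to_string(), ctx.tenant_id.clone());
    h.insert(HDR_X_CORRELATION_ID.to_string(), ctx.correlation_id.clone());
    h.insert(HDR_CORRELATION_ID.to_string(), ctx.correlation_id.clone());
    if let Some(tp) = &ctx.traceparent {
        h.insert(HDR_TRACEPARENT.to_string(), tp.clone());
    }
    if let Some(id) = msg_id {
        h.insert(HDR_NATS_MSG_ID.to_string(), id.to_string());
    }
    h
}

/// Best-effort extraction: fails only when the required tenant header is missing.
/// A missing correlation id falls back to the caller-generated value.
pub fn extract_context(
    h: &BTreeMap<String, String>,
    fallback_correlation_id: &str,
) -> Option<MessageContext> {
    let tenant_id = h.get(HDR_TENANT_ID)?.clone();
    let correlation_id = h
        .get(HDR_X_CORRELATION_ID)
        .or_else(|| h.get(HDR_CORRELATION_ID))
        .cloned()
        .unwrap_or_else(|| fallback_correlation_id.to_string());
    Some(MessageContext {
        tenant_id,
        correlation_id,
        traceparent: h.get(HDR_TRACEPARENT).cloned(),
    })
}
```

Keeping inject and extract in one place means a round trip can be asserted in a unit test without a NATS server.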

Subject Naming Rules

Tenant-first prefix: tenant.<tenant_id>.…
Stable message class token:
- aggregate for domain events
- effect, effect_result, workflow, workflow_event for Runner
No ambiguous wildcard publishing:
- producers publish concrete subjects only
- consumers may filter with wildcards
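
One way to enforce "concrete subjects only" for producers is a builder that rejects wildcard or separator characters in any token. The sketch below assumes the aggregate subject shape from the baseline section; `SubjectError`, `aggregate_event_subject`, and `aggregate_event_filter` are illustrative names, not existing functions.

```rust
/// Hypothetical validation error for subject tokens.
#[derive(Debug, PartialEq)]
pub enum SubjectError {
    EmptyToken,
    InvalidChar(char),
}

/// A token must be non-empty and contain no separator, wildcard, or whitespace.
fn validate_token(token: &str) -> Result<(), SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken);
    }
    match token
        .chars()
        .find(|&c| matches!(c, '.' | '*' | '>') || c.is_whitespace())
    {
        Some(c) => Err(SubjectError::InvalidChar(c)),
        None => Ok(()),
    }
}

/// Producer side: always a concrete subject,
/// `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`.
pub fn aggregate_event_subject(
    tenant_id: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    for token in [tenant_id, aggregate_type, aggregate_id] {
        validate_token(token)?;
    }
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        tenant_id, aggregate_type, aggregate_id
    ))
}

/// Consumer side: wildcards are allowed only in filters.
pub fn aggregate_event_filter() -> &'static str {
    "tenant.*.aggregate.*.*"
}
```

Rejecting `.` in identifiers also keeps the token count fixed, so the default filter continues to match every published aggregate subject.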

Consumer Naming Rules

Durable consumer names must be stable and collision-free:
- include role + mode + optional view/saga name + shard/group
Ephemeral consumer names must be unique per operation:
- include tenant + purpose + uuid
- must be deleted best-effort when operation completes
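
The naming rules above might translate into helpers like these. This is a sketch under assumptions: the `-` separator, the `tmp-` prefix, the `s<shard>` suffix, and all function names are invented for illustration, and the unique suffix (for example a UUID) is supplied by the caller.

```rust
/// Replace anything outside [A-Za-z0-9_-] with '_' so the result is a safe consumer name.
fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' || c == '_' { c } else { '_' })
        .collect()
}

/// Stable durable name: role + mode + optional view/saga name + optional shard.
pub fn durable_name(role: &str, mode: &str, view: Option<&str>, shard: Option<u32>) -> String {
    let mut parts = vec![sanitize(role), sanitize(mode)];
    if let Some(v) = view {
        parts.push(sanitize(v));
    }
    if let Some(s) = shard {
        parts.push(format!("s{}", s));
    }
    parts.join("-")
}

/// Ephemeral name: tenant + purpose + caller-supplied unique id.
pub fn ephemeral_name(tenant: &str, purpose: &str, unique: &str) -> String {
    format!("tmp-{}-{}-{}", sanitize(tenant), sanitize(purpose), sanitize(unique))
}

/// Cleanup guard: only names created for this tenant and purpose may be deleted.
pub fn owns_ephemeral(name: &str, tenant: &str, purpose: &str) -> bool {
    name.starts_with(&format!("tmp-{}-{}-", sanitize(tenant), sanitize(purpose)))
}
```

Sanitization is lossy (`a.b` and `a_b` map to the same name), so a real implementation would need an additional check or a hash suffix to stay collision-free.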

Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)

Goal

Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.

Exit Criteria

shared exposes NATS header constants and helpers for inject/extract/derive.
All producers set required headers consistently.
All consumers tolerate unknown fields and missing optional fields.
A single, documented subject naming convention is enforced in code (builder functions).
Workspace fmt/clippy/tests pass.

Tasks

Centralize NATS header constants and helpers in shared:
- inject headers for publish (tenant, correlation, trace)
- extract headers on receive (best-effort)
- derive trace-id from traceparent
Aggregate:
- Ensure event publishing always sets tenant-id, correlation headers, trace headers
- Ensure Nats-Msg-Id strategy is correct for idempotency/dedupe (document and test)
Projection:
- Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
- Ensure correlation/trace context is carried into spans/metrics consistently
Runner:
- Ensure publish paths include correlation/trace headers consistently for commands and results
- Ensure outbox metadata → NATS headers mapping is consistent and tested
Tests:
- Unit tests for header injection/extraction in shared
- Per-service unit tests asserting produced headers include required keys
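
The "derive trace-id from traceparent" task can be illustrated with a small parser for the W3C Trace Context format (`version-traceid-parentid-flags`, lowercase hex). The function name is hypothetical, and this sketch is deliberately lenient: it ignores any fields after the flags.

```rust
/// Return the 32-hex-char trace id from a W3C `traceparent` value, or None if malformed.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Fields are fixed-width lowercase hex.
    let is_lower_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };

    // Version "ff" is invalid per the spec.
    if !is_lower_hex(version, 2) || version == "ff" {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // An all-zero trace id is invalid.
    if trace_id.bytes().all(|b| b == b'0') {
        return None;
    }
    Some(trace_id)
}
```

Returning `None` rather than an error fits the best-effort extraction described in the tasks: a malformed header simply yields no `trace-id`.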

Required Tests

cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace

Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)

Goal

Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.

Exit Criteria

Stream config for each stream is explicitly defined and validated at startup.
Limits (max messages/bytes/age) are explicit and have defaults.
Duplicate windows and dedupe behavior are explicit and tested.
A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).

Tasks

Define a single “stream config policy” module per service (or shared helper):
- AGGREGATE_EVENTS subjects + retention policy
- WORKFLOW_COMMANDS subjects + retention policy
- WORKFLOW_EVENTS subjects + retention policy
Standardize defaults:
- retention: limits appropriate for replay + rebuild
- duplicate_window aligned with producer idempotency strategy
- storage type and replication policy documented and configurable
Add startup validations:
- verify stream exists and matches required subject set (compatible superset allowed)
- verify required ack/dedupe assumptions hold
Add tests that parse and validate configs without NATS.
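
The "compatible superset" check on stream subjects can be tested without NATS. The sketch below assumes well-formed patterns (`>` only as the last token) and uses hypothetical function names. It treats `*` as exactly one token and `>` as one or more trailing tokens, which is how NATS subject wildcards behave.

```rust
/// True when `pattern` accepts every subject that `required` accepts.
/// Both may contain `*` and `>`; `>` is assumed to appear only as the final token.
pub fn covers(pattern: &str, required: &str) -> bool {
    let mut p = pattern.split('.');
    let mut r = required.split('.');
    loop {
        match (p.next(), r.next()) {
            // `>` swallows one or more remaining tokens of the required subject.
            (Some(">"), Some(_)) => return true,
            // `*` covers any single token, but not a trailing `>`.
            (Some("*"), Some(rt)) => {
                if rt == ">" {
                    return false;
                }
            }
            // Literal tokens must match exactly.
            (Some(pt), Some(rt)) => {
                if pt != rt {
                    return false;
                }
            }
            (None, None) => return true,
            // Different token counts.
            _ => return false,
        }
    }
}

/// Required subjects that no configured stream subject covers (empty means compatible).
pub fn missing_subjects(stream_subjects: &[&str], required: &[&str]) -> Vec<String> {
    required
        .iter()
        .filter(|req| !stream_subjects.iter().any(|s| covers(s, req)))
        .map(|req| req.to_string())
        .collect()
}
```

At startup, a non-empty result from `missing_subjects` would be reported as a validation failure instead of silently replacing the stream, in line with the "no destructive changes on startup" policy.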

Required Tests

Unit tests for stream config builders
Existing crate tests

Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)

Goal

Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.

Exit Criteria

All long-lived consumers use explicit ack with consistent ack_wait, max_deliver, max_ack_pending.
Application concurrency is bounded and tied to max_ack_pending (the consumer's max in flight).
Poison policy is consistent:
- after max_deliver delivery attempts, the message is terminated (term) and a deadletter/quarantine record is written
Replay behavior is deterministic on restart (checkpoint-based where applicable).
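
The poison policy above reduces to a small decision that can be unit tested in isolation. In this sketch the `Disposition` enum and function name are assumptions. `delivery_count` is the 1-based number of deliveries reported in the JetStream message metadata.

```rust
/// What the consumer should do with a message after the handler has run.
#[derive(Debug, PartialEq)]
pub enum Disposition {
    /// Handler succeeded: ack.
    Ack,
    /// Handler failed and attempts remain: nak (or let ack_wait expire) for redelivery.
    Retry,
    /// Handler failed on the final attempt: write a deadletter record, then term.
    Poison,
}

/// Decide the outcome from the handler result and the 1-based delivery count.
pub fn disposition(handler_ok: bool, delivery_count: u64, max_deliver: u64) -> Disposition {
    if handler_ok {
        Disposition::Ack
    } else if delivery_count >= max_deliver {
        Disposition::Poison
    } else {
        Disposition::Retry
    }
}
```

Writing the deadletter record before sending the term keeps the failure visible even if the process crashes between the two steps. The cost is a possible duplicate record, so the record write should itself be idempotent.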

Tasks

Define standard consumer config defaults:
- AckPolicy::Explicit
- ack_wait default + env override
- max_deliver default + env override
- max_ack_pending tied to application concurrency
Projection:
- Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
- Ensure the checkpoint gates ack correctly (messages skipped as already processed are still acked)
- Ensure poison policy writes durable records and terminates reliably
Runner:
- Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
- Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
Aggregate:
- Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
- Ensure best-effort cleanup is performed and cannot delete unrelated consumers
Tests:
- Unit tests for consumer name generation (sanitization + uniqueness)
- NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
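
The "default + env override" tasks could be captured in one policy struct. Everything specific here is an assumption for illustration: the env variable names `NATS_ACK_WAIT_SECS` and `NATS_MAX_DELIVER`, the default values, and the `ConsumerPolicy` type. The lookup is passed in as a closure so the logic can be tested without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical consumer defaults; the values are placeholders, not recommendations.
#[derive(Debug, PartialEq)]
pub struct ConsumerPolicy {
    pub ack_wait: Duration,
    pub max_deliver: u64,
    pub max_ack_pending: usize,
}

const DEFAULT_ACK_WAIT_SECS: u64 = 30;
const DEFAULT_MAX_DELIVER: u64 = 5;

/// Parse a positive integer override; fall back to the default when missing or invalid.
fn parse_positive(lookup: &impl Fn(&str) -> Option<String>, key: &str, default: u64) -> u64 {
    lookup(key)
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|v| *v > 0)
        .unwrap_or(default)
}

impl ConsumerPolicy {
    /// `lookup` is typically `|k| std::env::var(k).ok()`.
    /// `concurrency` is the application's worker count; max_ack_pending is tied to it.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>, concurrency: usize) -> Self {
        ConsumerPolicy {
            ack_wait: Duration::from_secs(parse_positive(
                &lookup,
                "NATS_ACK_WAIT_SECS",
                DEFAULT_ACK_WAIT_SECS,
            )),
            max_deliver: parse_positive(&lookup, "NATS_MAX_DELIVER", DEFAULT_MAX_DELIVER),
            // Never allow zero in flight, which would stall the consumer.
            max_ack_pending: concurrency.max(1),
        }
    }
}
```

Tying `max_ack_pending` to the worker count means the server never hands out more unacknowledged messages than the application can process at once.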

Required Tests

Workspace fmt/clippy/tests
NATS-gated integration tests for:
- redelivery idempotency
- poison termination behavior
- scale-out with deliver group (where supported)

Milestone 3: Connection Management + Failure Semantics (Operational Frugality)

Goal

Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.

Exit Criteria

One NATS connection per process (or bounded pool only if justified).
Reconnect/backoff policy is explicit and consistent.
Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
No busy-looping on NATS outages.

Tasks

Standardize connection options:
- reconnect delays/backoff
- max reconnect attempts or “infinite with backoff” strategy (explicit)
- request timeouts around JetStream operations
Standardize readiness semantics:
- ready=false when NATS is unavailable and the node depends on it
- health stays “process alive” but reports NATS connectivity in payload
Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
Tests:
- unit tests for backoff behavior (where possible)
- gated integration test: temporary NATS outage does not crash-loop and recovers
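
An explicit reconnect policy is easier to unit test when the delay is a pure function of the attempt number. The sketch below shows capped exponential backoff. The function name and the base and cap values used in examples are assumptions, and jitter is left out to keep the function deterministic.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn reconnect_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With a 250 ms base and a 30 s cap, the delays run 250 ms, 500 ms, 1 s, 2 s and so on, then stay at 30 s. That gives an "infinite with backoff" strategy that cannot busy-loop during an outage.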

Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)

Goal

Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.

Exit Criteria

Durable names are deterministic and collision-free across replicas.
Deliver groups are used where appropriate to share work across replicas.
Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
A scale-out test suite exists and is gated but runnable.

Tasks

Establish consumer naming scheme per service role:
- Projection: per-view durable option uses sanitized names and stable mapping
- Runner: durable prefix includes role + shard + optional group
Establish deliver group usage rules:
- when to enable (scale-out consumers)
- how to roll without duplication
Strengthen dedupe keys:
- event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
- outbox relay: verify publish idempotency with Nats-Msg-Id
Add gated tests:
- two replicas, same tenant, no duplicate publishes
- rolling restart preserves checkpoint correctness
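
The "two replicas, no duplicate publishes" guarantee depends on every replica deriving the same `Nats-Msg-Id` for the same outbox row. The sketch below illustrates that idea with an in-memory stand-in for the stream's duplicate window. The id format, `outbox_msg_id`, and `DedupeWindow` are hypothetical, and the real window is time-bounded (`duplicate_window`), which this model ignores.

```rust
use std::collections::HashSet;

/// Deterministic message id: the same outbox row yields the same id on every replica.
pub fn outbox_msg_id(tenant_id: &str, outbox_id: &str) -> String {
    format!("{}:{}", tenant_id, outbox_id)
}

/// In-memory model of server-side dedupe (no expiry, for illustration only).
#[derive(Default)]
pub struct DedupeWindow {
    seen: HashSet<String>,
}

impl DedupeWindow {
    /// Returns true if the message was stored, false if it was dropped as a duplicate.
    pub fn publish(&mut self, msg_id: &str) -> bool {
        self.seen.insert(msg_id.to_string())
    }
}
```

A random id per publish attempt would defeat dedupe entirely, so the id must be derived from stable data. Dedupe only holds within the duplicate window, so relay retries need to complete well inside it.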

Verification Commands (Required at Each Milestone)

cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
Gated NATS integration tests:
- Runner: RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored
- Projection: PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored
- Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests

Notes / Constraints

Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
Prefer stable durable consumers; ephemeral consumers must be unique and bounded, and must be cleaned up on a best-effort basis.
