docs: add docs folder (architecture, developer, usage); update README; wire probe TTL cache + concurrency notes into docs

This commit is contained in:
2026-03-30 14:32:47 +03:00
parent 90c307016d
commit e9a0142396
12 changed files with 203 additions and 805 deletions


@@ -1,216 +0,0 @@
# Gateway Transport Plan
## Purpose
Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:
- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)
## Non-Negotiable Rules (Global)
- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
- All tasks completed
- All tests passing
- Workspace lint/format/type checks passing
- Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)
## Current State (Baseline)
- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)
## Target Architecture (End State)
- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- `shared` is the single source of truth for:
- Header names and propagation rules
- Trace parsing/validation rules (`traceparent`, `trace-id`)
- Request context representation (tenant/correlation/trace)
## Definitions
### Request Context
Fields that must be consistently propagated:
- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
- `request_id` (HTTP: `x-request-id`, optional for NATS)
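The fields above can be sketched as one context struct plus an extraction helper. This is an illustrative sketch only: a plain string map stands in for the real HTTP/NATS header types, and the correlation fallback is a placeholder where production code would mint a UUID.

```rust
use std::collections::HashMap;

// Canonical HTTP-side header names, per the contract above.
pub const H_TENANT_ID: &str = "x-tenant-id";
pub const H_CORRELATION_ID: &str = "x-correlation-id";
pub const H_TRACEPARENT: &str = "traceparent";
pub const H_REQUEST_ID: &str = "x-request-id";

#[derive(Debug, Clone, PartialEq)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub request_id: Option<String>,
}

/// Extract the standard context from incoming headers.
/// `tenant_id` is mandatory; `correlation_id` is generated when missing
/// (placeholder generation here; real code would use a UUID).
pub fn extract_context(headers: &HashMap<String, String>) -> Option<RequestContext> {
    let tenant_id = headers.get(H_TENANT_ID)?.clone();
    let correlation_id = headers
        .get(H_CORRELATION_ID)
        .cloned()
        .unwrap_or_else(|| format!("gen-{tenant_id}"));
    Some(RequestContext {
        tenant_id,
        correlation_id,
        traceparent: headers.get(H_TRACEPARENT).cloned(),
        request_id: headers.get(H_REQUEST_ID).cloned(),
    })
}
```

Returning `None` when the tenant header is absent keeps the deny-by-default posture: a request with no tenant never gets a context at all.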
### Standard Health Endpoints (per service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if applicable)
- `GET /metrics` Prometheus
## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)
### Goal
Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).
### Exit Criteria
- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.
### Tasks
- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
- [ ] Add `shared` helpers for:
- HTTP extract/inject
- gRPC metadata extract/inject
- NATS header extract/inject
- [ ] Gateway: ensure context is injected into:
- gRPC upstream requests to Aggregate
- HTTP upstream requests to Projection
- Runner admin proxy requests
- Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
- `tenant-id`
- `x-correlation-id` + `correlation-id`
- `traceparent`
- `trace-id` (derived when possible)
- [ ] Add transport-level tests:
- [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
- [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
- [ ] NATS publish path: produced headers contain expected keys/values
### Required Tests
- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct
## Milestone 1: Internal RPC Standardization (Projection via gRPC)
### Goal
Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Request includes tenant + view + query payload and metadata
- [ ] Response includes result payload and standard context propagation
- [ ] Implement Projection gRPC server:
- [ ] Parse tenant/view/query
- [ ] Execute query against current projection storage/query engine
- [ ] Enforce tenant scope
- [ ] Implement Gateway gRPC client path for queries:
- [ ] Routing by tenant to Projection endpoint
- [ ] Deadlines, bounded retries (idempotent only)
- [ ] Context propagation (tenant/correlation/trace)
- [ ] Keep HTTP `/v1/query/*`:
- [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
- [ ] Add tests:
- [ ] Gateway query authz + forwarding via gRPC
- [ ] Projection gRPC query contract tests for tenant isolation
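The "deadlines, bounded retries (idempotent only)" task above can be sketched as a small retry wrapper; this is a hedged illustration, not the Gateway's actual client code (names and backoff shape are assumptions).

```rust
use std::time::{Duration, Instant};

/// Retry an idempotent call at most `max_attempts` times within `deadline`,
/// backing off linearly between attempts. Mutations without an idempotency
/// key should pass max_attempts = 1, i.e. no automatic retry.
pub fn retry_idempotent<T, E>(
    max_attempts: u32,
    deadline: Duration,
    base_backoff: Duration,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let mut attempt = 0;
    loop {
        attempt += 1;
        match call() {
            Ok(v) => return Ok(v),
            Err(e) => {
                // Give up once attempts or the overall deadline are exhausted.
                if attempt >= max_attempts || start.elapsed() >= deadline {
                    return Err(e);
                }
                std::thread::sleep(base_backoff * attempt);
            }
        }
    }
}
```

In the real gRPC path the per-attempt deadline would also be attached to the request itself, so a slow upstream is cancelled rather than merely not retried.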
### Required Tests
- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test
## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)
### Goal
Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.
### Exit Criteria
- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Drain/resume/status/reload/tenant-scoped controls
- [ ] Standard error mapping
- [ ] Implement Runner gRPC admin server:
- [ ] Tenant gating enforced for tenant-scoped operations
- [ ] Readiness/drain semantics aligned with platform contracts
- [ ] Implement Gateway gRPC client integration:
- [ ] Route to Runner endpoint via routing table
- [ ] Enforce authz rights (e.g. `runner.admin`)
- [ ] Context propagation
- [ ] Keep HTTP `/admin/*` in Runner optional:
- [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
- [ ] Tests:
- [ ] Gateway: admin calls rejected without rights
- [ ] Gateway: tenant spoof attempts rejected
- [ ] Runner: idempotency and drain semantics validated
### Required Tests
- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test
## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)
### Goal
Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.
### Exit Criteria
- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.
### Tasks
- [ ] Implement a Gateway upstream channel pool:
- [ ] LRU bounded by max endpoints
- [ ] TTL/eviction strategy
- [ ] Fast path reuse under load
- [ ] Standardize retry profiles:
- [ ] Read-only: short retry with jitter
- [ ] Mutations: no automatic retry unless idempotency key present
- [ ] Standardize timeouts:
- [ ] Edge timeout limits
- [ ] Internal per-service deadlines
- [ ] Fanout controls:
- [ ] Concurrency limiters for fleet snapshot/probes
- [ ] Cache results where safe (short TTL)
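The "cache results where safe (short TTL)" item can be sketched as a minimal TTL cache in front of probe calls; a sketch only, assuming a synchronous probe closure where the real Gateway would use an async call plus a concurrency limiter.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A minimal TTL cache for probe/fleet-snapshot results: callers reuse a
/// recent result instead of re-probing every endpoint, which bounds fanout
/// and avoids thundering herds.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return a cached value if still fresh; otherwise run the probe,
    /// store its result with a fresh timestamp, and return it.
    pub fn get_or_insert_with(&mut self, key: &str, probe: impl FnOnce() -> V) -> V {
        if let Some((at, v)) = self.entries.get(key) {
            if at.elapsed() < self.ttl {
                return v.clone();
            }
        }
        let v = probe();
        self.entries.insert(key.to_string(), (Instant::now(), v.clone()));
        v
    }
}
```

A short TTL (a few seconds) is usually enough: probe results only need to be fresher than the rebalance decision loop that consumes them.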
### Required Tests
- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)
## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)
### Goal
Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.
### Tasks
- [ ] Remove Gateway's HTTP query proxy usage (or keep only as compatibility shim).
- [ ] Remove Gateway's runner admin HTTP proxy usage (or keep only as compatibility shim).
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
- [ ] Harden metrics and health probes to always carry context.
### Required Tests
- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
## Notes / Constraints
- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.


@@ -1,246 +0,0 @@
# NATS Transport Plan
## Purpose
Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:
- Simplicity (few primitives, consistent naming, minimal per-service divergence)
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
- High performance (high throughput, low tail latency, reliable backpressure)
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)
## Non-Negotiable Rules (Global)
- Every JetStream stream/consumer MUST have an explicit contract:
- name, subjects, retention, storage, replication, max sizes
- ack policy, ack wait, max deliver, max in flight
- Every node MUST run with bounded work:
- bounded pull batch sizes
- bounded concurrency
- bounded retry/backoff
- Every message MUST be tenant-scoped in subject and/or headers.
- Every milestone below is “stop-the-line” gated:
- all tasks completed
- all tests passing
- workspace lint/format checks passing
- required NATS-gated integration tests for the milestone passing (when gated by env)
## Current State (Baseline)
- Streams:
- `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume)
- `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner)
- Subject conventions:
- Aggregate events: `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`
- Defaults often use filters like `tenant.*.aggregate.*.*`
- Durable consumers:
- Projection uses a durable name (configurable)
- Runner uses configurable durable prefix per role
- Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
- Headers:
- Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist
## Target Architecture (End State)
- A single “NATS wire protocol” contract shared across services:
- subject naming
- required headers (tenant/correlation/trace)
- message envelope compatibility rules (tolerant decoding, optional fields)
- Stable, minimal set of JetStream streams:
- one stream per message class (aggregate events, workflow commands, workflow events)
- no per-tenant streams unless there is a strong operational reason
- Stable, limited consumers:
- durable consumers for long-lived processors (Projection, Runner)
- ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
- Uniform backpressure + reliability defaults:
- explicit ack
- bounded `max_ack_pending` and application-level concurrency
- bounded redelivery via `max_deliver` + poison policy
## Definitions
### Message Context (Headers)
Standard headers for NATS published messages:
- `tenant-id` (required)
- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing)
- `traceparent` (optional but recommended; generated/propagated if present upstream)
- `trace-id` (optional; derived from traceparent when possible)
- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable)
### Subject Naming Rules
- Tenant-first prefix: `tenant.<tenant_id>.…`
- Stable message class token:
- `aggregate` for domain events
- `effect`, `effect_result`, `workflow`, `workflow_event` for Runner
- No ambiguous wildcard publishing:
- producers publish concrete subjects only
- consumers may filter with wildcards
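The subject rules above are what the builder functions enforce; a minimal sketch (the validation rules are illustrative, not the exact `shared` implementation):

```rust
/// Producers publish concrete, tenant-first subjects only.
/// Consumers may still filter with wildcards (e.g. "tenant.*.aggregate.*.*").
pub fn aggregate_event_subject(tenant: &str, agg_type: &str, agg_id: &str) -> String {
    format!("tenant.{tenant}.aggregate.{agg_type}.{agg_id}")
}

/// Reject tokens that would corrupt subject structure: NATS treats '.',
/// '*', and '>' as structural, and whitespace/empty tokens are invalid.
pub fn valid_token(token: &str) -> bool {
    !token.is_empty()
        && token
            .chars()
            .all(|c| !matches!(c, '.' | '*' | '>') && !c.is_whitespace())
}
```

Centralizing this in builders means a malformed tenant or aggregate id fails loudly at publish time instead of silently leaking into a wrong subject.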
### Consumer Naming Rules
- Durable consumer names must be stable and collision-free:
- include role + mode + optional view/saga name + shard/group
- Ephemeral consumer names must be unique per operation:
- include tenant + purpose + uuid
- must be deleted best-effort when operation completes
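The two naming rules can be sketched as two helpers; a sketch under stated assumptions (a timestamp plus a process-local counter stands in for the UUID the rule calls for, and the durable name layout is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

static EPHEMERAL_SEQ: AtomicU64 = AtomicU64::new(0);

/// Durable names: stable and collision-free, derived only from
/// role + mode + shard so every replica computes the same name.
pub fn durable_name(role: &str, mode: &str, shard: u32) -> String {
    format!("{role}-{mode}-shard{shard}")
}

/// Ephemeral names: unique per operation, so best-effort deletion can
/// never target another operation's consumer.
pub fn ephemeral_name(tenant: &str, purpose: &str) -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    let seq = EPHEMERAL_SEQ.fetch_add(1, Ordering::Relaxed);
    format!("eph-{tenant}-{purpose}-{nanos}-{seq}")
}
```

The `eph-` prefix doubles as a safety check for cleanup: deletion code can refuse to touch any consumer whose name does not carry it.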
## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)
### Goal
Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.
### Exit Criteria
- `shared` exposes NATS header constants and helpers for inject/extract/derive.
- All producers set required headers consistently.
- All consumers tolerate unknown fields and missing optional fields.
- A single, documented subject naming convention is enforced in code (builder functions).
- Workspace fmt/clippy/tests pass.
### Tasks
- [ ] Centralize NATS header constants and helpers in `shared`:
- [ ] inject headers for publish (tenant, correlation, trace)
- [ ] extract headers on receive (best-effort)
- [ ] derive `trace-id` from `traceparent`
- [ ] Aggregate:
- [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers
- [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test)
- [ ] Projection:
- [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
- [ ] Ensure correlation/trace context is carried into spans/metrics consistently
- [ ] Runner:
- [ ] Ensure publish paths include correlation/trace headers consistently for commands and results
- [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested
- [ ] Tests:
- [ ] Unit tests for header injection/extraction in `shared`
- [ ] Per-service unit tests asserting produced headers include required keys
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)
### Goal
Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.
### Exit Criteria
- Stream config for each stream is explicitly defined and validated at startup.
- Limits (max messages/bytes/age) are explicit and have defaults.
- Duplicate windows and dedupe behavior are explicit and tested.
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).
### Tasks
- [ ] Define a single “stream config policy” module per service (or shared helper):
- [ ] `AGGREGATE_EVENTS` subjects + retention policy
- [ ] `WORKFLOW_COMMANDS` subjects + retention policy
- [ ] `WORKFLOW_EVENTS` subjects + retention policy
- [ ] Standardize defaults:
- [ ] retention: limits appropriate for replay + rebuild
- [ ] `duplicate_window` aligned with producer idempotency strategy
- [ ] storage type and replication policy documented and configurable
- [ ] Add startup validations:
- [ ] verify stream exists and matches required subject set (compatible superset allowed)
- [ ] verify required ack/dedupe assumptions hold
- [ ] Add tests that parse and validate configs without NATS.
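The "stream config policy + startup validation" tasks can be sketched as a policy struct plus a superset check; the concrete limits below are illustrative defaults, not the project's real values.

```rust
use std::collections::HashSet;

/// Authoritative stream policy: explicit name, subjects, and limits.
pub struct StreamPolicy {
    pub name: &'static str,
    pub subjects: Vec<String>,
    pub max_age_secs: u64,
    pub duplicate_window_secs: u64,
}

pub fn aggregate_events_policy() -> StreamPolicy {
    StreamPolicy {
        name: "AGGREGATE_EVENTS",
        subjects: vec!["tenant.*.aggregate.*.*".to_string()],
        max_age_secs: 14 * 24 * 3600,   // illustrative retention default
        duplicate_window_secs: 120,      // must cover the producer retry horizon
    }
}

/// Startup check: the live stream must carry at least the required subjects
/// (a compatible superset is allowed). On mismatch, fail startup rather than
/// destructively mutating the existing stream.
pub fn subjects_compatible(required: &[String], actual: &[String]) -> bool {
    let actual: HashSet<&str> = actual.iter().map(String::as_str).collect();
    required.iter().all(|s| actual.contains(s.as_str()))
}
```

Because the policy is plain data, these checks are testable without a running NATS server, matching the "config-only tests" task above.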
### Required Tests
- Unit tests for stream config builders
- Existing crate tests
## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)
### Goal
Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.
### Exit Criteria
- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`.
- Application concurrency is bounded and tied to `max_in_flight`.
- Poison policy is consistent:
- after `max_deliver`, term + deadletter/quarantine record is written
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
### Tasks
- [ ] Define standard consumer config defaults:
- [ ] `AckPolicy::Explicit`
- [ ] `ack_wait` default + env override
- [ ] `max_deliver` default + env override
- [ ] `max_ack_pending` tied to application concurrency
- [ ] Projection:
- [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
- [ ] Ensure checkpoint gates ack correctly (skip still acks)
- [ ] Ensure poison policy writes durable records and terminates reliably
- [ ] Runner:
- [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
- [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
- [ ] Aggregate:
- [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
- [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers
- [ ] Tests:
- [ ] Unit tests for consumer name generation (sanitization + uniqueness)
- [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
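The standard consumer defaults above can be sketched as a small config builder with env overrides; the env variable names and default values are assumptions for illustration.

```rust
use std::time::Duration;

/// Standard consumer defaults; each field has an env override so operators
/// can tune without code changes.
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

pub fn consumer_defaults(prefix: &str, concurrency: i64) -> ConsumerDefaults {
    // Read "<PREFIX>_<SUFFIX>" from the environment, falling back to a default.
    let env_u64 = |suffix: &str, default: u64| {
        std::env::var(format!("{prefix}_{suffix}"))
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    };
    ConsumerDefaults {
        ack_wait: Duration::from_millis(env_u64("ACK_TIMEOUT_MS", 30_000)),
        max_deliver: env_u64("MAX_DELIVER", 5) as i64,
        // Tie the server-side in-flight cap to application concurrency so the
        // consumer never holds more unacked messages than workers can process.
        max_ack_pending: concurrency,
    }
}
```

Tying `max_ack_pending` to worker concurrency is the backpressure mechanism: the server stops delivering once the workers are saturated, instead of buffering unbounded work in the process.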
### Required Tests
- Workspace fmt/clippy/tests
- NATS-gated integration tests for:
- redelivery idempotency
- poison termination behavior
- scale-out with deliver group (where supported)
## Milestone 3: Connection Management + Failure Semantics (Operational Frugality)
### Goal
Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.
### Exit Criteria
- One NATS connection per process (or bounded pool only if justified).
- Reconnect/backoff policy is explicit and consistent.
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
- No busy-looping on NATS outages.
### Tasks
- [ ] Standardize connection options:
- [ ] reconnect delays/backoff
- [ ] max reconnect attempts or “infinite with backoff” strategy (explicit)
- [ ] request timeouts around JetStream operations
- [ ] Standardize readiness semantics:
- [ ] `ready=false` when NATS is unavailable and the node depends on it
- [ ] `health` stays “process alive” but reports NATS connectivity in payload
- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
- [ ] Tests:
- [ ] unit tests for backoff behavior (where possible)
- [ ] gated integration test: temporary NATS outage does not crash-loop and recovers
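The reconnect/backoff task can be sketched as a capped exponential delay function (real implementations also add jitter; the base and cap values here are assumptions):

```rust
use std::time::Duration;

/// Capped exponential backoff: the delay doubles per attempt up to `max`,
/// so a long NATS outage never degenerates into a busy-loop.
pub fn reconnect_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let exp = attempt.min(16); // avoid shift overflow on pathological counts
    let factor = 1u64 << exp;
    base.checked_mul(factor as u32)
        .map(|d| d.min(max))
        .unwrap_or(max)
}
```

This pairs with the "fast fail" task above: tests and dev can pass a tiny `max` (or a low attempt cap) so a missing NATS endpoint fails in milliseconds instead of retrying thirty times.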
## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)
### Goal
Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.
### Exit Criteria
- Durable names are deterministic and collision-free across replicas.
- Deliver groups are used where appropriate to share work across replicas.
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
- A scale-out test suite exists and is gated but runnable.
### Tasks
- [ ] Establish consumer naming scheme per service role:
- [ ] Projection: per-view durable option uses sanitized names and stable mapping
- [ ] Runner: durable prefix includes role + shard + optional group
- [ ] Establish deliver group usage rules:
- [ ] when to enable (scale-out consumers)
- [ ] how to roll without duplication
- [ ] Strengthen dedupe keys:
- [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
- [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id`
- [ ] Add gated tests:
- [ ] two replicas, same tenant, no duplicate publishes
- [ ] rolling restart preserves checkpoint correctness
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- Gated NATS integration tests:
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
- Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests
## Notes / Constraints
- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
- Prefer stable durable consumers; ephemeral consumers must be unique and bounded and must cleanup best-effort.


@@ -4,10 +4,12 @@
 - Rust services (Cargo workspace): `aggregate/`, `gateway/`, `projection/`, `runner/`, `control/api/`, `shared/`
 - Control UI: `control/ui/`
 - Docker + Swarm + Compose: `docker/`, `docker-compose.yml`, `swarm/`, `observability/`
-- Transport plans:
-  - `TRANSPORT_DEVELOPMENT_PLAN.md`
-  - `GATEWAY_TRANSPORT_PLAN.md`
-  - `NATS_TRANSPORT_PLAN.md`
+## Documentation
+- docs/README.md
+- Architecture: docs/architecture/overview.md, docs/architecture/transport.md
+- Developer: docs/developer/setup.md, docs/developer/testing.md
+- Usage: docs/usage/quickstart.md, docs/usage/api.md, docs/usage/nats.md
 ## Quick Start (Docker Compose)
@@ -35,4 +37,5 @@ More details: `DOCKER.md`
 cargo fmt --check
 cargo clippy --workspace --all-targets -- -D warnings
 cargo test --workspace
+cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build
 ```


@@ -1,339 +0,0 @@
# Transport Development Plan
## Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`
## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are standardized:
- `shared` provides `TenantId`, `CorrelationId`, `TraceId`
- `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
- `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
- Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
- Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
- Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
- Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
## Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
## Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
- All tasks completed
- All tests required by the milestone pass
- Workspace verification commands pass
- Gated integration tests for the milestone are runnable and documented
## Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
## End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
- header names and injection/extraction rules
- trace parsing/validation (`traceparent`, `trace-id`)
- context object model (tenant/correlation/trace/request ids)
- NATS subject + consumer naming helpers
## Standard Contracts
### Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
### Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus
## Milestone 0: Shared Transport Contract (Headers + Context + Trace)
### Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
### Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
### Tasks
- [x] Add shared ID types in `shared`:
- [x] `TenantId`
- [x] `CorrelationId`
- [x] `TraceId`
- [x] Consolidate header constants in `shared`:
- [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
- [x] HTTP: `x-tenant-id`, `x-request-id`
- [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
- [x] NATS: `tenant-id`, `Nats-Msg-Id`
- [x] Add shared helpers:
- [x] derive `trace-id` from `traceparent`
- [x] derive `traceparent` from `trace-id` when valid
- [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
- [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- [x] Add unit tests in `shared` for:
- [x] traceparent parsing validity
- [x] serialization shape for correlation/trace id newtypes
- [x] additional validation cases (invalid traceparents, all-zero ids)
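A sketch of the derivation the shared helper performs (field layout follows W3C Trace Context; the exact implementation of `trace_id_from_traceparent(...)` in `shared` may differ):

```rust
/// Derive the 32-hex-char trace id from a W3C traceparent
/// ("00-<trace-id>-<parent-id>-<flags>"), rejecting malformed or
/// all-zero ids as the validation tasks above require.
pub fn trace_id_from_traceparent(tp: &str) -> Option<String> {
    let parts: Vec<&str> = tp.split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    // Each field must be lowercase-insensitive hex of a fixed width.
    let hex = |s: &str, n: usize| s.len() == n && s.chars().all(|c| c.is_ascii_hexdigit());
    if !hex(version, 2) || !hex(trace_id, 32) || !hex(parent_id, 16) || !hex(flags, 2) {
        return None;
    }
    if trace_id.chars().all(|c| c == '0') {
        return None; // all-zero trace id is invalid per W3C Trace Context
    }
    Some(trace_id.to_ascii_lowercase())
}
```

This is the piece that lets NATS carry a plain `trace-id` header derived from the HTTP-side `traceparent`, keeping the two transports joined in one trace.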
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
### Dependencies
- Milestone 0
### Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
### Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
### Tasks
- [x] Create/standardize subject builder helpers (prefer `shared`):
- [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
- [x] Runner effect/effect_result subject builders
- [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- [x] Aggregate publishes:
- [x] `tenant-id` header always present
- [x] correlation + trace headers always present; generated when missing/invalid
- [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- [x] Runner publishes (commands/results):
- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
- [x] trace headers always present/derived when possible; generated when missing/invalid
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
- [x] outbox metadata → NATS headers mapping standardized via shared helpers
- [x] Projection consumption:
- [x] envelope decoding remains tolerant (unknown fields ignored)
- [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- [x] Add unit tests:
- [x] subject formatting tests (shared builders)
- [x] required header presence tests per publisher (Aggregate + Runner)
### Required Tests
- Workspace verification commands
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
### Dependencies
- Milestone 1
### Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
### Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
### Tasks
- [x] Define stream policies:
- [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
- [x] `WORKFLOW_COMMANDS` is defined and validated on startup
- [x] `WORKFLOW_EVENTS` is defined and validated on startup
- [x] Centralize stream policy builders/validators in `shared`
- [x] Implement compatibility validation rules:
- [x] required subjects are present (superset allowed)
- [x] limits/max_age/duplicate window validated against minimums
- [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- [x] Add unit tests for stream config builders + validators.
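The compatibility rules above lend themselves to config-only validation, in the spirit of the "no NATS required" tests. A minimal sketch (all names hypothetical, not the real `shared` builders):

```rust
/// Hypothetical stream policy: required subjects must be present on the live
/// stream (a superset is allowed), and the duplicate window must meet a
/// minimum so producer `Nats-Msg-Id` dedupe assumptions hold.
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub min_duplicate_window_secs: u64,
}

pub fn validate_compat(
    policy: &StreamPolicy,
    live_subjects: &[String],
    live_duplicate_window_secs: u64,
) -> Result<(), String> {
    for s in &policy.required_subjects {
        if !live_subjects.contains(s) {
            return Err(format!("{}: missing required subject {s}", policy.name));
        }
    }
    if live_duplicate_window_secs < policy.min_duplicate_window_secs {
        return Err(format!("{}: duplicate window too small", policy.name));
    }
    Ok(())
}
```

A startup check like this fails fast on incompatible streams instead of silently mutating them.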
### Required Tests
- Workspace verification commands
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
### Dependencies
- Milestone 2
### Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
### Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
- redelivery idempotency
- poison termination
- scale-out behavior (deliver group) where applicable
### Tasks
- [x] Standardize consumer defaults:
- [x] `AckPolicy::Explicit`
- [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
- [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
- [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- [x] Projection:
- [x] durable naming collision-free for Single/PerView modes
- [x] checkpoint gate semantics: “skip still acks”
- [x] poison handling persists durable records and terminates reliably (poison record + term)
- [x] Runner:
- [x] durable naming collision-free and stable across replicas
- [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
- [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- [x] Aggregate:
- [x] ad-hoc fetch consumer always unique and bounded
- [x] best-effort deletion never targets unrelated consumers
- [x] Add gated NATS integration tests and document env flags:
- [x] Runner ignored tests
- [x] Projection ignored tests
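The key invariant in these defaults is that the server-side in-flight cap tracks application concurrency. A sketch, assuming the tie between `max_ack_pending` and `max_in_flight` described above (names illustrative):

```rust
/// Standardized explicit-ack consumer defaults. `max_ack_pending` is derived
/// from worker concurrency so JetStream never hands a replica more messages
/// than its workers can process at once.
pub struct ConsumerDefaults {
    pub ack_wait_ms: u64,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

pub fn consumer_defaults(max_in_flight: usize, ack_wait_ms: u64, max_deliver: i64) -> ConsumerDefaults {
    ConsumerDefaults {
        ack_wait_ms,
        // Bounded redelivery, but always at least one delivery attempt.
        max_deliver: max_deliver.max(1),
        // Align the server-side in-flight cap with worker concurrency.
        max_ack_pending: max_in_flight.max(1) as i64,
    }
}
```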
### Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
### Tasks
- [x] Define protobuf API: `projection.gateway.v1.QueryService`
- [x] Implement Projection gRPC server for query execution
- [x] Implement Gateway gRPC client routing to Projection
- [x] deadlines
- [x] bounded retries (idempotent only)
- [x] context propagation
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
- [x] route internally to gRPC
- [x] Add tests:
- [x] authz + forwarding via gRPC
- [x] tenant isolation enforcement in Projection QueryService
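The "bounded retries (idempotent only)" rule reduces to a small decision function. This is a sketch under that policy, not the Gateway's actual code; real backoff would also add random jitter:

```rust
/// Retry only idempotent reads, and only up to a bounded attempt count.
/// Mutations are never retried by this profile.
pub fn should_retry(idempotent: bool, attempt: u32, max_attempts: u32) -> bool {
    idempotent && attempt + 1 < max_attempts
}

/// Exponential backoff, capped so large attempt numbers cannot overflow.
pub fn backoff_ms(attempt: u32, base_ms: u64) -> u64 {
    base_ms.saturating_mul(1u64 << attempt.min(10))
}
```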
### Required Tests
- Workspace verification commands
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway's `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.
### Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
### Tasks
- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [x] Implement Runner gRPC admin server
- [x] Implement Gateway gRPC client integration for admin operations
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [x] Add tests:
- [x] Gateway: rejects without rights
- [x] Gateway: rejects tenant spoof attempts
- [x] Runner: idempotency and drain semantics
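The drain idempotency being tested can be summarized as a tiny state machine. A hypothetical sketch (the real Runner state is richer than this):

```rust
/// Drain is idempotent: a repeated drain request is a no-op, and readiness
/// flips false as soon as draining begins.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RunnerState {
    Running,
    Draining,
    Drained,
}

pub fn request_drain(state: RunnerState) -> RunnerState {
    match state {
        RunnerState::Running => RunnerState::Draining,
        // Already draining/drained: repeated requests change nothing.
        other => other,
    }
}

pub fn is_ready(state: RunnerState) -> bool {
    state == RunnerState::Running
}
```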
### Required Tests
- Workspace verification commands
## Milestone 6: Gateway Upstream Performance + Operational Guardrails
### Dependencies
- Milestones 4 and 5 (gRPC internal RPC surfaces available)
### Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
### Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
### Tasks
- [x] Implement upstream channel pool
- [x] bounded LRU
- [x] TTL/eviction
- [x] fast-path reuse under load (cached gRPC channels)
- [x] Standardize retry profiles
- [x] read-only: limited retry with jitter (Gateway gRPC calls)
- [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- [x] Standardize timeouts/deadlines:
- [x] edge timeout limits
- [x] internal per-service deadlines
- [x] Fanout controls:
- [x] concurrency limiters for probes/snapshots
- [x] short TTL caching where safe
- [x] Ensure probes carry context (correlation/trace) for observability.
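The "short TTL caching where safe" item can be sketched as a probe cache that takes an explicit `now`, which keeps expiry testable without sleeping. Names are illustrative; the real pool also bounds size (LRU), which is omitted here:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal short-TTL cache for probe/snapshot results. Entries are valid for
/// `ttl` from insertion time.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    pub fn put(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Returns the cached value only while it is still fresh.
    pub fn get(&self, key: &str, now: Instant) -> Option<V> {
        self.entries.get(key).and_then(|(at, v)| {
            if now.duration_since(*at) <= self.ttl { Some(v.clone()) } else { None }
        })
    }
}
```

Passing `now` in also makes the cache trivial to drive from a gated load test with a simulated clock.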
### Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
### Dependencies
- Milestone 6
### Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
### Tasks
- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- [x] Ensure Control UI + Control API rely only on standardized surfaces
- [x] Harden metrics and readiness probes to match the standard contract everywhere
### Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
## Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)

File: `docs/README.md`
# Cloudlysis Documentation
## Sections
- Architecture
- docs/architecture/overview.md
- docs/architecture/transport.md
- Developer
- docs/developer/setup.md
- docs/developer/testing.md
- Usage
- docs/usage/quickstart.md
- docs/usage/api.md
- docs/usage/nats.md
## Conventions
- HTTP edge remains JSON over REST via Gateway.
- Internal RPC uses gRPC between Gateway and nodes (Aggregate, Projection, Runner).
- Async backbone uses NATS JetStream and KV with standardized subjects, headers, and stream/consumer policies.

# Architecture Overview
## Monorepo
- Rust workspace: aggregate, projection, runner, gateway, control/api, shared
- Frontend: control/ui
- Infra: docker, observability, swarm
## Data Flow
- Clients → Gateway (HTTP/JSON)
- Gateway ↔ Nodes (gRPC)
- Nodes ↔ NATS (JetStream + KV)
## Services
- Aggregate: command handling + event sourcing; publishes events to JetStream
- Projection: materialized views; consumes aggregate events; exposes QueryService (gRPC)
- Runner: workflow/saga engine + effects/outbox; exposes RunnerAdmin (gRPC)
- Gateway: edge, authn/z, routing to nodes, admin entry points
## Observability
- /health, /ready, /metrics on all services
- Correlation and tracing propagated across HTTP, gRPC, and NATS

# Transport Contracts
## Context
- Tenant: HTTP `x-tenant-id`; NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`; NATS `x-correlation-id` + `correlation-id`
- Trace: HTTP `traceparent`; NATS `traceparent` + `trace-id`
## Internal RPC (gRPC)
- Aggregate: CommandService (submit commands)
- Projection: QueryService (execute queries)
- Runner: RunnerAdmin (drain, status, reload)
- All calls set deadlines and propagate context metadata
## NATS JetStream
- Streams:
  - `AGGREGATE_EVENTS`: `tenant.*.aggregate.*.*`
  - `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
- Producers set headers: `tenant-id`, `Nats-Msg-Id`, correlation, `traceparent`/`trace-id`
- Consumers use `AckPolicy::Explicit` with bounded `ack_wait`, `max_deliver`, and `max_ack_pending`
## Routing
- Gateway routes per-tenant to shards for Aggregate/Projection/Runner
- Routing tables hot-reload atomically
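The atomic hot-reload described above can be sketched with `RwLock<Arc<_>>`: readers clone the current `Arc` (a cheap operation with the lock held only briefly) and keep a consistent snapshot while a reload swaps in a whole new table. This is an assumption about shape, not the real Gateway code, which might instead use a lock-free swap.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Default)]
pub struct RoutingTable {
    pub shard_by_tenant: HashMap<String, String>,
}

pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(table: RoutingTable) -> Self {
        Self { current: RwLock::new(Arc::new(table)) }
    }

    /// Readers take a consistent snapshot; later reloads never mutate it.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        self.current.read().unwrap().clone()
    }

    /// Reload replaces the whole table atomically.
    pub fn reload(&self, table: RoutingTable) {
        *self.current.write().unwrap() = Arc::new(table);
    }
}
```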

File: `docs/developer/setup.md`
# Developer Setup
## Prerequisites
- Rust toolchain (stable)
- Node.js (LTS) for control/ui
- Docker (optional) for local stack
## Build
```bash
cargo build
cd control/ui && npm ci && npm run build
```
## Workspace Verification
```bash
cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build
```
## Environment
- Gateway: routing configuration from file or NATS KV
- Projection: `PROJECTION_GRPC_ADDR`
- Runner: `RUNNER_GRPC_ADDR`
- NATS: connection URLs via service-specific settings

File: `docs/developer/testing.md`
# Testing
## Unit and Integration
```bash
cargo test --workspace
```
## Gated Tests (require external services)
- Runner NATS:
```bash
RUNNER_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p runner -- --ignored
```
- Projection NATS:
```bash
PROJECTION_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p projection -- --ignored
```
- Docker-based gates:
```bash
cargo test -p gateway -- --ignored
```
## Control UI
```bash
cd control/ui
npm ci
npm run test
```

File: `docs/usage/api.md`
# Usage: API Examples
## Projection Query via Gateway (HTTP → gRPC)
```bash
curl -sS -X POST \
-H "content-type: application/json" \
-H "x-tenant-id: tenant-a" \
-H "x-correlation-id: demo" \
-H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \
http://localhost:8080/v1/query/User \
-d '{"uqf":"{\"eq\":{\"id\":\"u1\"}}"}'
```
## Projection Query via gRPC (direct, internal)
```bash
grpcurl -d '{"tenant_id":"tenant-a","view_type":"User","uqf":"{}"}' \
-H 'x-tenant-id: tenant-a' \
-H 'x-correlation-id: demo' \
-H 'traceparent: 00-00000000000000000000000000000001-0000000000000001-01' \
-plaintext localhost:9090 projection.gateway.v1.QueryService/ExecuteQuery
```
## Aggregate Command via Gateway (HTTP → gRPC)
```bash
curl -sS -X POST \
-H "content-type: application/json" \
-H "x-tenant-id: tenant-a" \
-H "x-correlation-id: demo" \
-H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \
http://localhost:8080/v1/aggregate/BankAccount/command \
-d '{"id":"acc-1","command_type":"Open","payload":{"owner":"Alice"}}'
```
## Runner Admin via Gateway (HTTP → gRPC)
```bash
curl -sS -X POST \
-H "x-tenant-id: tenant-a" \
-H "authorization: Bearer <token>" \
"http://localhost:8080/admin/runner/drain?wait_ms=0"
```

File: `docs/usage/nats.md`
# NATS Reference
## Subjects
- Aggregate events: `tenant.<tenant>.aggregate.<type>.<id>`
- Workflow commands/events: shared helpers define the exact formats
## Headers (Producers)
- `tenant-id`: required
- `Nats-Msg-Id`: idempotency key (`event_id`, `command_id`, etc.)
- `x-correlation-id` and `correlation-id`
- `traceparent` and `trace-id`
## Consumers
- `AckPolicy::Explicit`
- `ack_wait`: bounded ack timeout
- `max_deliver`: bounded redelivery attempts
- `max_ack_pending`: aligned with worker concurrency
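The aggregate-event subject shape above can be illustrated with a build/parse pair. The authoritative format lives in the `shared` helpers; these function names are hypothetical:

```rust
/// Build `tenant.<tenant>.aggregate.<type>.<id>`.
pub fn aggregate_subject(tenant: &str, agg_type: &str, id: &str) -> String {
    format!("tenant.{tenant}.aggregate.{agg_type}.{id}")
}

/// Parse a subject back into (tenant, aggregate type, aggregate id).
pub fn parse_aggregate_subject(s: &str) -> Option<(String, String, String)> {
    let parts: Vec<&str> = s.split('.').collect();
    match parts.as_slice() {
        ["tenant", t, "aggregate", a, i] => {
            Some((t.to_string(), a.to_string(), i.to_string()))
        }
        _ => None,
    }
}
```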

File: `docs/usage/quickstart.md`
# Quick Start
## Compose
```bash
docker compose up -d --build
```
Full stack with observability:
```bash
docker compose -f docker-compose.yml -f observability/docker-compose.yml up -d --build
```
## Local Dev (minimal)
- Start NATS locally
- Run services:
```bash
cargo run -p aggregate
cargo run -p projection
cargo run -p runner
cargo run -p gateway
```
## Verify
- Gateway: GET /health, /ready, /metrics
- Projection: GET /health, /ready, /metrics
- Runner: GET /health, /ready, /metrics