diff --git a/GATEWAY_TRANSPORT_PLAN.md b/GATEWAY_TRANSPORT_PLAN.md deleted file mode 100644 index 4328cde..0000000 --- a/GATEWAY_TRANSPORT_PLAN.md +++ /dev/null @@ -1,216 +0,0 @@ -# Gateway Transport Plan - -## Purpose -Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles: -- Simplicity (few patterns, minimal bespoke conventions) -- Ease of operation (consistent health/ready/metrics, consistent failure modes) -- Frugality (bounded connections, bounded fanout, low overhead) -- High performance (low tail latency, backpressure-aware, predictable routing) -- Safety (tenant isolation, deny-by-default authz, consistent context propagation) - -## Non-Negotiable Rules (Global) -- Every cross-service request MUST carry tenant + trace context. -- Every transport path MUST have explicit timeouts/deadlines and bounded retries. -- Every milestone below is “stop-the-line” gated: - - All tasks completed - - All tests passing - - Workspace lint/format/type checks passing - - Required integration tests for the milestone passing (when gated by env, they must be runnable and documented) - -## Current State (Baseline) -- Gateway → Aggregate: gRPC command submission -- Gateway → Projection: HTTP query proxy (`/v1/query/*`) -- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`) -- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent) - -## Target Architecture (End State) -- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly) -- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack) -- Async/event backbone: NATS JetStream remains for event/work distribution -- `shared` is the single source of truth for: - - Header names and propagation rules - - Trace parsing/validation rules (`traceparent`, `trace-id`) - - Request context representation (tenant/correlation/trace) - -## Definitions -### Request Context -Fields that must be consistently propagated: -- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`) -- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`) -- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`) -- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`) -- `request_id` (HTTP: `x-request-id`, optional for NATS) - -### Standard Health Endpoints (per service) -- `GET /health` liveness -- `GET /ready` readiness (includes tenant gating if applicable) -- `GET /metrics` Prometheus - -## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere) - -### Goal -Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes). - -### Exit Criteria -- A single shared contract exists for header names and trace parsing. -- Gateway injects context into all upstream calls (including rebalance/health probes). -- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own. -- Unit tests prove propagation behavior for each transport. -- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass. - -### Tasks -- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible. -- [ ] Add `shared` helpers for: - - HTTP extract/inject - - gRPC metadata extract/inject - - NATS header extract/inject -- [ ] Gateway: ensure context is injected into: - - gRPC upstream requests to Aggregate - - HTTP upstream requests to Projection - - Runner admin proxy requests - - Any “probe” calls (rebalance gates, fleet snapshots, health checks) -- [ ] Projection/Runner/Aggregate: ensure NATS published messages include: - - `tenant-id` - - `x-correlation-id` + `correlation-id` - - `traceparent` - - `trace-id` (derived when possible) -- [ ] Add transport-level tests: - - [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved - - [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved - - [ ] NATS publish path: produced headers contain expected keys/values - -### Required Tests -- Unit tests for shared parsing/derivation utilities -- Existing per-crate test suites -- At least one per-service “transport contract” test verifying headers are present and correct - -## Milestone 1: Internal RPC Standardization (Projection via gRPC) - -### Goal -Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use. - -### Exit Criteria -- A Projection gRPC service exists for query execution. -- Gateway routes queries to Projection via gRPC by default. -- Authorization semantics remain enforced in Gateway (deny-by-default). -- Response shapes are stable and match the existing UI expectations. -- All tests pass, including new gRPC query integration tests. - -### Tasks -- [ ] Define protobuf API: `projection.gateway.v1.QueryService` - - [ ] Request includes tenant + view + query payload and metadata - - [ ] Response includes result payload and standard context propagation -- [ ] Implement Projection gRPC server: - - [ ] Parse tenant/view/query - - [ ] Execute query against current projection storage/query engine - - [ ] Enforce tenant scope -- [ ] Implement Gateway gRPC client path for queries: - - [ ] Routing by tenant to Projection endpoint - - [ ] Deadlines, bounded retries (idempotent only) - - [ ] Context propagation (tenant/correlation/trace) -- [ ] Keep HTTP `/v1/query/*`: - - [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint -- [ ] Add tests: - - [ ] Gateway query authz + forwarding via gRPC - - [ ] Projection gRPC query contract tests for tenant isolation - -### Required Tests -- New gRPC QueryService tests (unit + integration) -- Existing query/authz tests in Gateway -- Workspace fmt/clippy/test - -## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC) - -### Goal -Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations. - -### Exit Criteria -- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway. -- Gateway uses gRPC to call Runner admin APIs. -- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary. -- Admin operations are idempotent where appropriate and include audit hooks where required. -- All tests pass and include negative/tenant-spoof cases. - -### Tasks -- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin` - - [ ] Drain/resume/status/reload/tenant-scoped controls - - [ ] Standard error mapping -- [ ] Implement Runner gRPC admin server: - - [ ] Tenant gating enforced for tenant-scoped operations - - [ ] Readiness/drain semantics aligned with platform contracts -- [ ] Implement Gateway gRPC client integration: - - [ ] Route to Runner endpoint via routing table - - [ ] Enforce authz rights (e.g. `runner.admin`) - - [ ] Context propagation -- [ ] Keep HTTP `/admin/*` in Runner optional: - - [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network -- [ ] Tests: - - [ ] Gateway: admin calls rejected without rights - - [ ] Gateway: tenant spoof attempts rejected - - [ ] Runner: idempotency and drain semantics validated - -### Required Tests -- gRPC RunnerAdmin unit/integration tests -- Gateway proxy-to-gRPC tests -- Workspace fmt/clippy/test - -## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality) - -### Goal -Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes. - -### Exit Criteria -- Gateway maintains bounded upstream connection pools for gRPC endpoints. -- All gRPC calls have deadlines; retries are only for idempotent operations. -- All probe/fanout calls are bounded and do not cause thundering herds. -- Load/soak tests show stable behavior under partial failure. - -### Tasks -- [ ] Implement a Gateway upstream channel pool: - - [ ] LRU bounded by max endpoints - - [ ] TTL/eviction strategy - - [ ] Fast path reuse under load -- [ ] Standardize retry profiles: - - [ ] Read-only: short retry with jitter - - [ ] Mutations: no automatic retry unless idempotency key present -- [ ] Standardize timeouts: - - [ ] Edge timeout limits - - [ ] Internal per-service deadlines -- [ ] Fanout controls: - - [ ] Concurrency limiters for fleet snapshot/probes - - [ ] Cache results where safe (short TTL) - -### Required Tests -- Unit tests for pool eviction/TTL -- Gateway integration tests for deadline propagation -- Gated load tests (document env + how to run) - -## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths) - -### Goal -Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async. - -### Exit Criteria -- Gateway no longer depends on HTTP for Projection queries or Runner admin. -- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control. -- All operational playbooks rely on standardized endpoints. - -### Tasks -- [ ] Remove Gateway’s HTTP query proxy usage (or keep only as compatibility shim). -- [ ] Remove Gateway’s runner admin HTTP proxy usage (or keep only as compatibility shim). -- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces. -- [ ] Harden metrics and health probes to always carry context. - -### Required Tests -- End-to-end smoke tests (gated) -- Workspace fmt/clippy/test - -## Verification Commands (Required at Each Milestone) -- `cargo fmt --check` -- `cargo clippy --workspace --all-targets -- -D warnings` -- `cargo test --workspace` -- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`) - -## Notes / Constraints -- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding. -- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical. diff --git a/NATS_TRANSPORT_PLAN.md b/NATS_TRANSPORT_PLAN.md deleted file mode 100644 index aed03a1..0000000 --- a/NATS_TRANSPORT_PLAN.md +++ /dev/null @@ -1,246 +0,0 @@ -# NATS Transport Plan - -## Purpose -Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles: -- Simplicity (few primitives, consistent naming, minimal per-service divergence) -- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging) -- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage) -- Low resource usage (stable durable consumers, controlled ack waits, limited fanout) -- High performance (high throughput, low tail latency, reliable backpressure) -- Safety (tenant isolation, idempotency, deterministic replay, poison handling) - -## Non-Negotiable Rules (Global) -- Every JetStream stream/consumer MUST have an explicit contract: - - name, subjects, retention, storage, replication, max sizes - - ack policy, ack wait, max deliver, max in flight -- Every node MUST run with bounded work: - - bounded pull batch sizes - - bounded concurrency - - bounded retry/backoff -- Every message MUST be tenant-scoped in subject and/or headers. -- Every milestone below is “stop-the-line” gated: - - all tasks completed - - all tests passing - - workspace lint/format checks passing - - required NATS-gated integration tests for the milestone passing (when gated by env) - -## Current State (Baseline) -- Streams: - - `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume) - - `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner) -- Subject conventions: - - Aggregate events: `tenant..aggregate..` - - Defaults often use filters like `tenant.*.aggregate.*.*` -- Durable consumers: - - Projection uses a durable name (configurable) - - Runner uses configurable durable prefix per role - - Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch -- Headers: - - Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist - -## Target Architecture (End State) -- A single “NATS wire protocol” contract shared across services: - - subject naming - - required headers (tenant/correlation/trace) - - message envelope compatibility rules (tolerant decoding, optional fields) -- Stable, minimal set of JetStream streams: - - one stream per message class (aggregate events, workflow commands, workflow events) - - no per-tenant streams unless there is a strong operational reason -- Stable, limited consumers: - - durable consumers for long-lived processors (Projection, Runner) - - ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion -- Uniform backpressure + reliability defaults: - - explicit ack - - bounded `max_ack_pending` and application-level concurrency - - bounded redelivery via `max_deliver` + poison policy - -## Definitions -### Message Context (Headers) -Standard headers for NATS published messages: -- `tenant-id` (required) -- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing) -- `traceparent` (optional but recommended; generated/propagated if present upstream) -- `trace-id` (optional; derived from traceparent when possible) -- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable) - -### Subject Naming Rules -- Tenant-first prefix: `tenant..…` -- Stable message class token: - - `aggregate` for domain events - - `effect`, `effect_result`, `workflow`, `workflow_event` for Runner -- No ambiguous wildcard publishing: - - producers publish concrete subjects only - - consumers may filter with wildcards - -### Consumer Naming Rules -- Durable consumer names must be stable and collision-free: - - include role + mode + optional view/saga name + shard/group -- Ephemeral consumer names must be unique per operation: - - include tenant + purpose + uuid - - must be deleted best-effort when operation completes - -## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes) - -### Goal -Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts. - -### Exit Criteria -- `shared` exposes NATS header constants and helpers for inject/extract/derive. -- All producers set required headers consistently. -- All consumers tolerate unknown fields and missing optional fields. -- A single, documented subject naming convention is enforced in code (builder functions). -- Workspace fmt/clippy/tests pass. - -### Tasks -- [ ] Centralize NATS header constants and helpers in `shared`: - - [ ] inject headers for publish (tenant, correlation, trace) - - [ ] extract headers on receive (best-effort) - - [ ] derive `trace-id` from `traceparent` -- [ ] Aggregate: - - [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers - - [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test) -- [ ] Projection: - - [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported) - - [ ] Ensure correlation/trace context is carried into spans/metrics consistently -- [ ] Runner: - - [ ] Ensure publish paths include correlation/trace headers consistently for commands and results - - [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested -- [ ] Tests: - - [ ] Unit tests for header injection/extraction in `shared` - - [ ] Per-service unit tests asserting produced headers include required keys - -### Required Tests -- `cargo fmt --check` -- `cargo clippy --workspace --all-targets -- -D warnings` -- `cargo test --workspace` - -## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage) - -### Goal -Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage. - -### Exit Criteria -- Stream config for each stream is explicitly defined and validated at startup. -- Limits (max messages/bytes/age) are explicit and have defaults. -- Duplicate windows and dedupe behavior are explicit and tested. -- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace). - -### Tasks -- [ ] Define a single “stream config policy” module per service (or shared helper): - - [ ] `AGGREGATE_EVENTS` subjects + retention policy - - [ ] `WORKFLOW_COMMANDS` subjects + retention policy - - [ ] `WORKFLOW_EVENTS` subjects + retention policy -- [ ] Standardize defaults: - - [ ] retention: limits appropriate for replay + rebuild - - [ ] `duplicate_window` aligned with producer idempotency strategy - - [ ] storage type and replication policy documented and configurable -- [ ] Add startup validations: - - [ ] verify stream exists and matches required subject set (compatible superset allowed) - - [ ] verify required ack/dedupe assumptions hold -- [ ] Add tests that parse and validate configs without NATS. - -### Required Tests -- Unit tests for stream config builders -- Existing crate tests - -## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison) - -### Goal -Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling. - -### Exit Criteria -- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`. -- Application concurrency is bounded and tied to `max_in_flight`. -- Poison policy is consistent: - - after `max_deliver`, term + deadletter/quarantine record is written -- Replay behavior is deterministic on restart (checkpoint-based where applicable). - -### Tasks -- [ ] Define standard consumer config defaults: - - [ ] `AckPolicy::Explicit` - - [ ] `ack_wait` default + env override - - [ ] `max_deliver` default + env override - - [ ] `max_ack_pending` tied to application concurrency -- [ ] Projection: - - [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView) - - [ ] Ensure checkpoint gates ack correctly (skip still acks) - - [ ] Ensure poison policy writes durable records and terminates reliably -- [ ] Runner: - - [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out - - [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish -- [ ] Aggregate: - - [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required) - - [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers -- [ ] Tests: - - [ ] Unit tests for consumer name generation (sanitization + uniqueness) - - [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag) - -### Required Tests -- Workspace fmt/clippy/tests -- NATS-gated integration tests for: - - redelivery idempotency - - poison termination behavior - - scale-out with deliver group (where supported) - -## Milestone 3: Connection Management + Failure Semantics (Operational Frugality) - -### Goal -Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages. - -### Exit Criteria -- One NATS connection per process (or bounded pool only if justified). -- Reconnect/backoff policy is explicit and consistent. -- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly. -- No busy-looping on NATS outages. - -### Tasks -- [ ] Standardize connection options: - - [ ] reconnect delays/backoff - - [ ] max reconnect attempts or “infinite with backoff” strategy (explicit) - - [ ] request timeouts around JetStream operations -- [ ] Standardize readiness semantics: - - [ ] `ready=false` when NATS is unavailable and the node depends on it - - [ ] `health` stays “process alive” but reports NATS connectivity in payload -- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set). -- [ ] Tests: - - [ ] unit tests for backoff behavior (where possible) - - [ ] gated integration test: temporary NATS outage does not crash-loop and recovers - -## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable) - -### Goal -Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage. - -### Exit Criteria -- Durable names are deterministic and collision-free across replicas. -- Deliver groups are used where appropriate to share work across replicas. -- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking). -- A scale-out test suite exists and is gated but runnable. - -### Tasks -- [ ] Establish consumer naming scheme per service role: - - [ ] Projection: per-view durable option uses sanitized names and stable mapping - - [ ] Runner: durable prefix includes role + shard + optional group -- [ ] Establish deliver group usage rules: - - [ ] when to enable (scale-out consumers) - - [ ] how to roll without duplication -- [ ] Strengthen dedupe keys: - - [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery - - [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id` -- [ ] Add gated tests: - - [ ] two replicas, same tenant, no duplicate publishes - - [ ] rolling restart preserves checkpoint correctness - -## Verification Commands (Required at Each Milestone) -- `cargo fmt --check` -- `cargo clippy --workspace --all-targets -- -D warnings` -- `cargo test --workspace` -- Gated NATS integration tests: - - Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored` - - Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored` - - Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests - -## Notes / Constraints -- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups. -- Prefer backward-compatible envelope changes (optional fields, tolerant decoding). -- Prefer stable durable consumers; ephemeral consumers must be unique and bounded and must cleanup best-effort. diff --git a/README.md b/README.md index 7670628..3214387 100644 --- a/README.md +++ b/README.md @@ -4,10 +4,12 @@ - Rust services (Cargo workspace): `aggregate/`, `gateway/`, `projection/`, `runner/`, `control/api/`, `shared/` - Control UI: `control/ui/` - Docker + Swarm + Compose: `docker/`, `docker-compose.yml`, `swarm/`, `observability/` -- Transport plans: - - `TRANSPORT_DEVELOPMENT_PLAN.md` - - `GATEWAY_TRANSPORT_PLAN.md` - - `NATS_TRANSPORT_PLAN.md` + +## Documentation +- docs/README.md +- Architecture: docs/architecture/overview.md, docs/architecture/transport.md +- Developer: docs/developer/setup.md, docs/developer/testing.md +- Usage: docs/usage/quickstart.md, docs/usage/api.md, docs/usage/nats.md ## Quick Start (Docker Compose) @@ -35,4 +37,5 @@ More details: `DOCKER.md` cargo fmt --check cargo clippy --workspace --all-targets -- -D warnings cargo test --workspace +cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build ``` diff --git a/TRANSPORT_DEVELOPMENT_PLAN.md b/TRANSPORT_DEVELOPMENT_PLAN.md deleted file mode 100644 index 29cc495..0000000 --- a/TRANSPORT_DEVELOPMENT_PLAN.md +++ /dev/null @@ -1,339 +0,0 @@ -# Transport Development Plan - -## Purpose -Unify and optimize the platform transport layer end-to-end: -- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes -- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate - -This plan merges and supersedes: -- `GATEWAY_TRANSPORT_PLAN.md` -- `NATS_TRANSPORT_PLAN.md` - -## Current Status (Codebase Reality) -- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway. -- Request context pieces are standardized: - - `shared` provides `TenantId`, `CorrelationId`, `TraceId` - - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)` - - `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers - - Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated -- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`. -- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`). -- Node → NATS header propagation is improved and closer to consistent: - - Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing. - - Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing. - - Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them. -- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes. - -## Principles -- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone. -- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes. -- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources. -- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops. -- High performance: multiplexing, backpressure, low tail latency, predictable routing. -- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay. - -## Non-Negotiable Rules (Global) -- Every cross-component hop MUST carry tenant + correlation + trace context. -- Every transport path MUST have explicit timeouts/deadlines and bounded retries. -- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy). -- Every milestone is stop-the-line gated: - - All tasks completed - - All tests required by the milestone pass - - Workspace verification commands pass - - Gated integration tests for the milestone are runnable and documented - -## Baseline (Today) -- Gateway → Aggregate: gRPC (command submission) -- Gateway → Projection: HTTP (query proxy) -- Gateway → Runner: HTTP (admin proxy) -- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` - -## End State (Target Architecture) -- Edge contract (clients ↔ Gateway): HTTP/JSON -- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin -- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement -- `shared` is the single source of truth for: - - header names and injection/extraction rules - - trace parsing/validation (`traceparent`, `trace-id`) - - context object model (tenant/correlation/trace/request ids) - - NATS subject + consumer naming helpers - -## Standard Contracts -### Context Fields -- Tenant: HTTP `x-tenant-id`, NATS `tenant-id` -- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id` -- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible) -- Request id: HTTP `x-request-id` (optional for NATS) - -### Standard Service Endpoints (every service) -- `GET /health` liveness -- `GET /ready` readiness (includes tenant gating if relevant) -- `GET /metrics` Prometheus - -## Milestone 0: Shared Transport Contract (Headers + Context + Trace) - -### Goal -Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract. - -### Exit Criteria -- `shared` contains canonical constants for header names and NATS header names. -- `shared` contains canonical trace parsing/validation and trace derivation helpers. -- Library-level unit tests cover parsing/derivation behavior. -- All crates build and tests pass for the workspace. - -### Tasks -- [x] Add shared ID types in `shared`: - - [x] `TenantId` - - [x] `CorrelationId` - - [x] `TraceId` -- [x] Consolidate header constants in `shared`: - - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop) - - [x] HTTP: `x-tenant-id`, `x-request-id` - - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible) - - [x] NATS: `tenant-id`, `Nats-Msg-Id` -- [x] Add shared helpers: - - [x] derive `trace-id` from `traceparent` - - [x] derive `traceparent` from `trace-id` when valid - - [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`) - - [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`) -- [x] Add unit tests in `shared` for: - - [x] traceparent parsing validity - - [x] serialization shape for correlation/trace id newtypes - - [x] additional validation cases (invalid traceparents, all-zero ids) - -### Required Tests -- `cargo fmt --check` -- `cargo clippy --workspace --all-targets -- -D warnings` -- `cargo test --workspace` - -## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes) - -### Dependencies -- Milestone 0 - -### Goal -Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts. - -### Exit Criteria -- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only). -- All NATS producers set required headers consistently. -- All NATS consumers tolerate unknown fields and missing optional fields. -- “Contract tests” exist per service to verify produced headers and subject formats. - -### Tasks -- [x] Create/standardize subject builder helpers (prefer `shared`): - - [x] Aggregate event subject builder (`tenant..aggregate..`) - - [x] Runner effect/effect_result subject builders - - [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work) -- [x] Aggregate publishes: - - [x] `tenant-id` header always present - - [x] correlation + trace headers always present; generated when missing/invalid - - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path) - - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`) -- [x] Runner publishes (commands/results): - - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing - - [x] trace headers always present/derived when possible; generated when missing/invalid - - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`) - - [x] outbox metadata → NATS headers mapping standardized via shared helpers -- [x] Projection consumption: - - [x] envelope decoding remains tolerant (unknown fields ignored) - - [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback) -- [x] Add unit tests: - - [x] subject formatting tests (shared builders) - - [x] required header presence tests per publisher (Aggregate + Runner) - -### Required Tests -- Workspace verification commands - -## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup) - -### Dependencies -- Milestone 1 - -### Goal -Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes. - -### Exit Criteria -- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window). -- Services create streams if missing, and validate compatibility on startup. -- Startup does not silently replace or destructively mutate existing streams. -- Config-only tests validate stream config builders without requiring NATS. - -### Tasks -- [x] Define stream policies: - - [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup - - [x] `WORKFLOW_COMMANDS` is defined and validated on startup - - [x] `WORKFLOW_EVENTS` is defined and validated on startup - - [x] Centralize stream policy builders/validators in `shared` -- [x] Implement compatibility validation rules: - - [x] required subjects are present (superset allowed) - - [x] limits/max_age/duplicate window validated against minimums - - [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy) -- [x] Add unit tests for stream config builders + validators. - -### Required Tests -- Workspace verification commands - -## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load) - -### Dependencies -- Milestone 2 - -### Goal -Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling. - -### Exit Criteria -- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`). -- Application-level concurrency is bounded and aligned with `max_in_flight`. -- Poison policy is consistent across consumers (term + durable quarantine/deadletter record). -- Gated NATS integration tests prove: - - redelivery idempotency - - poison termination - - scale-out behavior (deliver group) where applicable - -### Tasks -- [x] Standardize consumer defaults: - - [x] `AckPolicy::Explicit` - - [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`) - - [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`) - - [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`) -- [x] Projection: - - [x] durable naming collision-free for Single/PerView modes - - [x] checkpoint gate semantics: “skip still acks” - - [x] poison handling persists durable records and terminates reliably (poison record + term) -- [x] Runner: - - [x] durable naming collision-free and stable across replicas - - [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured) - - [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default) -- [x] Aggregate: - - [x] ad-hoc fetch consumer always unique and bounded - - [x] best-effort deletion never targets unrelated consumers -- [x] Add gated NATS integration tests and document env flags: - - [x] Runner ignored tests - - [x] Projection ignored tests - -### Required Tests -- Workspace verification commands -- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored` -- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored` - -## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService) - -### Dependencies -- Milestone 0 (context contract) - -### Goal -Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use. - -### Exit Criteria -- Projection exposes `projection.gateway.v1.QueryService`. -- Gateway routes queries via gRPC by default. -- Authz remains enforced in Gateway (deny-by-default). -- Query responses remain stable for Control UI expectations. -- New gRPC query tests pass (unit + integration). - -### Tasks -- [x] Define protobuf API: `projection.gateway.v1.QueryService` -- [x] Implement Projection gRPC server for query execution -- [x] Implement Gateway gRPC client routing to Projection - - [x] deadlines - - [x] bounded retries (idempotent only) - - [x] context propagation -- [x] Preserve HTTP `/v1/query/*` as compatibility/debug: - - [x] route internally to gRPC -- [x] Add tests: - - [x] authz + forwarding via gRPC - - [x] tenant isolation enforcement in Projection QueryService - -### Required Tests -- Workspace verification commands - -## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin) - -### Dependencies -- Milestone 0 (context contract) - -### Goal -Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service. - -### Exit Criteria -- Runner exposes `runner.admin.v1.RunnerAdmin`. -- Gateway calls Runner admin via gRPC (authz enforced in Gateway). -- Tenant-spoof and unauthorized calls are rejected deterministically. -- Runner drain/readiness semantics validated and tested. - -### Tasks -- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin` -- [x] Implement Runner gRPC admin server -- [x] Implement Gateway gRPC client integration for admin operations -- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway -- [x] Add tests: - - [x] Gateway: rejects without rights - - [x] Gateway: rejects tenant spoof attempts - - [x] Runner: idempotency and drain semantics - -### Required Tests -- Workspace verification commands - -## Milestone 6: Gateway Upstream Performance + Operational Guardrails - -### Dependencies -- Milestones 4–5 (gRPC internal RPC surfaces available) - -### Goal -Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load. - -### Exit Criteria -- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction). -- Deadlines everywhere; retries only for idempotent operations. -- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context. -- Gated load/soak tests exist and are runnable. - -### Tasks -- [x] Implement upstream channel pool - - [x] bounded LRU - - [x] TTL/eviction - - [x] fast-path reuse under load (cached gRPC channels) -- [x] Standardize retry profiles - - [x] read-only: limited retry with jitter (Gateway gRPC calls) - - [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations) -- [x] Standardize timeouts/deadlines: - - [x] edge timeout limits - - [x] internal per-service deadlines -- [x] Fanout controls: - - [x] concurrency limiters for probes/snapshots - - [x] short TTL caching where safe -- [x] Ensure probes carry context (correlation/trace) for observability. - -### Required Tests -- Workspace verification commands -- Gated load/soak tests (document env + how to run) - -## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths) - -### Dependencies -- Milestone 6 - -### Goal -Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only. - -### Exit Criteria -- Gateway no longer depends on HTTP for Projection queries or Runner admin. -- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control. -- End-to-end smoke tests pass (gated). - -### Tasks -- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC) -- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC) -- [x] Ensure Control UI + Control API rely only on standardized surfaces -- [x] Harden metrics and readiness probes to match the standard contract everywhere - -### Required Tests -- Workspace verification commands -- End-to-end smoke tests (gated) - -## Workspace Verification Commands (Run for Every Milestone) -- `cargo fmt --check` -- `cargo clippy --workspace --all-targets -- -D warnings` -- `cargo test --workspace` -- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`) diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..46e4ebf --- /dev/null +++ b/docs/README.md @@ -0,0 +1,18 @@ +# Cloudlysis Documentation + +## Sections +- Architecture + - docs/architecture/overview.md + - docs/architecture/transport.md +- Developer + - docs/developer/setup.md + - docs/developer/testing.md +- Usage + - docs/usage/quickstart.md + - docs/usage/api.md + - docs/usage/nats.md + +## Conventions +- HTTP edge remains JSON over REST via Gateway. +- Internal RPC uses gRPC between Gateway and nodes (Aggregate, Projection, Runner). +- Async backbone uses NATS JetStream and KV with standardized subjects, headers, and stream/consumer policies. diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md new file mode 100644 index 0000000..13b732c --- /dev/null +++ b/docs/architecture/overview.md @@ -0,0 +1,21 @@ +# Architecture Overview + +## Monorepo +- Rust workspace: aggregate, projection, runner, gateway, control/api, shared +- Frontend: control/ui +- Infra: docker, observability, swarm + +## Data Flow +- Clients → Gateway (HTTP/JSON) +- Gateway ↔ Nodes (gRPC) +- Nodes ↔ NATS (JetStream + KV) + +## Services +- Aggregate: command handling + event sourcing; publishes events to JetStream +- Projection: materialized views; consumes aggregate events; exposes QueryService (gRPC) +- Runner: workflow/saga engine + effects/outbox; exposes RunnerAdmin (gRPC) +- Gateway: edge, authn/z, routing to nodes, admin entry points + +## Observability +- /health, /ready, /metrics on all services +- Correlation and tracing propagated across HTTP, gRPC, and NATS diff --git a/docs/architecture/transport.md b/docs/architecture/transport.md new file mode 100644 index 0000000..d43765b --- /dev/null +++ b/docs/architecture/transport.md @@ -0,0 +1,23 @@ +# Transport Contracts + +## Context +- Tenant: HTTP x-tenant-id; NATS tenant-id +- Correlation: HTTP x-correlation-id; NATS x-correlation-id + correlation-id +- Trace: HTTP traceparent; NATS traceparent + trace-id + +## Internal RPC (gRPC) +- Aggregate: CommandService (submit commands) +- Projection: QueryService (execute queries) +- Runner: RunnerAdmin (drain, status, reload) +- All calls set deadlines and propagate context metadata + +## NATS JetStream +- Streams: + - AGGREGATE_EVENTS: tenant.*.aggregate.*.* + - WORKFLOW_COMMANDS, WORKFLOW_EVENTS +- Producers set headers: tenant-id, Nats-Msg-Id, correlation, traceparent/trace-id +- Consumers use AckPolicy::Explicit, bounded ack_wait, max_deliver, max_ack_pending + +## Routing +- Gateway routes per-tenant to shards for Aggregate/Projection/Runner +- Routing tables hot-reload atomically diff --git a/docs/developer/setup.md b/docs/developer/setup.md new file mode 100644 index 0000000..ca36264 --- /dev/null +++ b/docs/developer/setup.md @@ -0,0 +1,26 @@ +# Developer Setup + +## Prerequisites +- Rust toolchain (stable) +- Node.js (LTS) for control/ui +- Docker (optional) for local stack + +## Build +```bash +cargo build +cd control/ui && npm ci && npm run build +``` + +## Workspace Verification +```bash +cargo fmt --check +cargo clippy --workspace --all-targets -- -D warnings +cargo test --workspace +cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build +``` + +## Environment +- Gateway: routing config using file or KV +- Projection: PROJECTION_GRPC_ADDR +- Runner: RUNNER_GRPC_ADDR +- NATS: URLs via service-specific settings diff --git a/docs/developer/testing.md b/docs/developer/testing.md new file mode 100644 index 0000000..882a938 --- /dev/null +++ b/docs/developer/testing.md @@ -0,0 +1,27 @@ +# Testing + +## Unit and Integration +```bash +cargo test --workspace +``` + +## Gated Tests (require external services) +- Runner NATS: +```bash +RUNNER_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p runner -- --ignored +``` +- Projection NATS: +```bash +PROJECTION_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p projection -- --ignored +``` +- Docker-based gates: +```bash +cargo test -p gateway -- --ignored +``` + +## Control UI +```bash +cd control/ui +npm ci +npm run test +``` diff --git a/docs/usage/api.md b/docs/usage/api.md new file mode 100644 index 0000000..017edcc --- /dev/null +++ b/docs/usage/api.md @@ -0,0 +1,38 @@ +# Usage: API Examples + +## Projection Query via Gateway (HTTP → gRPC) +```bash +curl -sS -X POST \ + -H "x-tenant-id: tenant-a" \ + -H "x-correlation-id: demo" \ + -H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \ + http://localhost:8080/v1/query/User \ + -d '{"uqf":"{\"eq\":{\"id\":\"u1\"}}"}' +``` + +## Projection Query via gRPC (direct, internal) +```bash +grpcurl -d '{"tenant_id":"tenant-a","view_type":"User","uqf":"{}"}' \ + -H 'x-tenant-id: tenant-a' \ + -H 'x-correlation-id: demo' \ + -H 'traceparent: 00-00000000000000000000000000000001-0000000000000001-01' \ + -plaintext localhost:9090 projection.gateway.v1.QueryService/ExecuteQuery +``` + +## Aggregate Command via Gateway (HTTP → gRPC) +```bash +curl -sS -X POST \ + -H "x-tenant-id: tenant-a" \ + -H "x-correlation-id: demo" \ + -H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \ + http://localhost:8080/v1/aggregate/BankAccount/command \ + -d '{"id":"acc-1","command_type":"Open","payload":{"owner":"Alice"}}' +``` + +## Runner Admin via Gateway (HTTP → gRPC) +```bash +curl -sS -X POST \ + -H "x-tenant-id: tenant-a" \ + -H "authorization: Bearer " \ + http://localhost:8080/admin/runner/drain?wait_ms=0 +``` diff --git a/docs/usage/nats.md b/docs/usage/nats.md new file mode 100644 index 0000000..bb2eeeb --- /dev/null +++ b/docs/usage/nats.md @@ -0,0 +1,17 @@ +# NATS Reference + +## Subjects +- Aggregate events: tenant..aggregate.. +- Workflow commands/events: shared helpers define exact formats + +## Headers (Producers) +- tenant-id: required +- Nats-Msg-Id: idempotency key (event_id, command_id, etc.) +- x-correlation-id and correlation-id +- traceparent and trace-id + +## Consumers +- AckPolicy::Explicit +- ack_wait: bounded timeout +- max_deliver: bounded +- max_ack_pending: aligned with concurrency diff --git a/docs/usage/quickstart.md b/docs/usage/quickstart.md new file mode 100644 index 0000000..f74c52e --- /dev/null +++ b/docs/usage/quickstart.md @@ -0,0 +1,26 @@ +# Quick Start + +## Compose +```bash +docker compose up -d --build +``` + +Full stack with observability: +```bash +docker compose -f docker-compose.yml -f observability/docker-compose.yml up -d --build +``` + +## Local Dev (minimal) +- Start NATS locally +- Run services: +```bash +cargo run -p aggregate +cargo run -p projection +cargo run -p runner +cargo run -p gateway +``` + +## Verify +- Gateway: GET /health, /ready, /metrics +- Projection: GET /health, /ready, /metrics +- Runner: GET /health, /ready, /metrics