docs: add docs folder (architecture, developer, usage); update README; wire probe TTL cache + concurrency notes into docs

This commit is contained in:
2026-03-30 14:32:47 +03:00
parent 90c307016d
commit e9a0142396
12 changed files with 203 additions and 805 deletions


@@ -1,216 +0,0 @@
# Gateway Transport Plan
## Purpose
Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:
- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)
## Non-Negotiable Rules (Global)
- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
- All tasks completed
- All tests passing
- Workspace lint/format/type checks passing
- Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)
## Current State (Baseline)
- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)
## Target Architecture (End State)
- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- `shared` is the single source of truth for:
- Header names and propagation rules
- Trace parsing/validation rules (`traceparent`, `trace-id`)
- Request context representation (tenant/correlation/trace)
## Definitions
### Request Context
Fields that must be consistently propagated:
- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
- `request_id` (HTTP: `x-request-id`, optional for NATS)
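The fields above can be sketched as one context struct plus an extraction helper. This is an illustrative sketch only: a plain string map stands in for the real HTTP/NATS header types, and the correlation fallback is a placeholder where production code would mint a UUID.

```rust
use std::collections::HashMap;

// Canonical HTTP-side header names, per the contract above.
pub const H_TENANT_ID: &str = "x-tenant-id";
pub const H_CORRELATION_ID: &str = "x-correlation-id";
pub const H_TRACEPARENT: &str = "traceparent";
pub const H_REQUEST_ID: &str = "x-request-id";

#[derive(Debug, Clone, PartialEq)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub request_id: Option<String>,
}

/// Extract the standard context from incoming headers.
/// `tenant_id` is mandatory; `correlation_id` is generated when missing
/// (placeholder generation here; real code would use a UUID).
pub fn extract_context(headers: &HashMap<String, String>) -> Option<RequestContext> {
    let tenant_id = headers.get(H_TENANT_ID)?.clone();
    let correlation_id = headers
        .get(H_CORRELATION_ID)
        .cloned()
        .unwrap_or_else(|| format!("gen-{tenant_id}"));
    Some(RequestContext {
        tenant_id,
        correlation_id,
        traceparent: headers.get(H_TRACEPARENT).cloned(),
        request_id: headers.get(H_REQUEST_ID).cloned(),
    })
}
```

Returning `None` when the tenant header is absent keeps the deny-by-default posture: a request with no tenant never gets a context at all.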
### Standard Health Endpoints (per service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if applicable)
- `GET /metrics` Prometheus
## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)
### Goal
Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).
### Exit Criteria
- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.
### Tasks
- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
- [ ] Add `shared` helpers for:
- HTTP extract/inject
- gRPC metadata extract/inject
- NATS header extract/inject
- [ ] Gateway: ensure context is injected into:
- gRPC upstream requests to Aggregate
- HTTP upstream requests to Projection
- Runner admin proxy requests
- Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
- `tenant-id`
- `x-correlation-id` + `correlation-id`
- `traceparent`
- `trace-id` (derived when possible)
- [ ] Add transport-level tests:
- [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
- [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
- [ ] NATS publish path: produced headers contain expected keys/values
### Required Tests
- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct
## Milestone 1: Internal RPC Standardization (Projection via gRPC)
### Goal
Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Request includes tenant + view + query payload and metadata
- [ ] Response includes result payload and standard context propagation
- [ ] Implement Projection gRPC server:
- [ ] Parse tenant/view/query
- [ ] Execute query against current projection storage/query engine
- [ ] Enforce tenant scope
- [ ] Implement Gateway gRPC client path for queries:
- [ ] Routing by tenant to Projection endpoint
- [ ] Deadlines, bounded retries (idempotent only)
- [ ] Context propagation (tenant/correlation/trace)
- [ ] Keep HTTP `/v1/query/*`:
- [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
- [ ] Add tests:
- [ ] Gateway query authz + forwarding via gRPC
- [ ] Projection gRPC query contract tests for tenant isolation
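The "deadlines, bounded retries (idempotent only)" task above can be sketched as a small retry wrapper; this is a hedged illustration, not the Gateway's actual client code (names and backoff shape are assumptions).

```rust
use std::time::{Duration, Instant};

/// Retry an idempotent call at most `max_attempts` times within `deadline`,
/// backing off linearly between attempts. Mutations without an idempotency
/// key should pass max_attempts = 1, i.e. no automatic retry.
pub fn retry_idempotent<T, E>(
    max_attempts: u32,
    deadline: Duration,
    base_backoff: Duration,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let mut attempt = 0;
    loop {
        attempt += 1;
        match call() {
            Ok(v) => return Ok(v),
            Err(e) => {
                // Give up once attempts or the overall deadline are exhausted.
                if attempt >= max_attempts || start.elapsed() >= deadline {
                    return Err(e);
                }
                std::thread::sleep(base_backoff * attempt);
            }
        }
    }
}
```

In the real gRPC path the per-attempt deadline would also be attached to the request itself, so a slow upstream is cancelled rather than merely not retried.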
### Required Tests
- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test
## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)
### Goal
Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.
### Exit Criteria
- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Drain/resume/status/reload/tenant-scoped controls
- [ ] Standard error mapping
- [ ] Implement Runner gRPC admin server:
- [ ] Tenant gating enforced for tenant-scoped operations
- [ ] Readiness/drain semantics aligned with platform contracts
- [ ] Implement Gateway gRPC client integration:
- [ ] Route to Runner endpoint via routing table
- [ ] Enforce authz rights (e.g. `runner.admin`)
- [ ] Context propagation
- [ ] Keep HTTP `/admin/*` in Runner optional:
- [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
- [ ] Tests:
- [ ] Gateway: admin calls rejected without rights
- [ ] Gateway: tenant spoof attempts rejected
- [ ] Runner: idempotency and drain semantics validated
### Required Tests
- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test
## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)
### Goal
Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.
### Exit Criteria
- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.
### Tasks
- [ ] Implement a Gateway upstream channel pool:
- [ ] LRU bounded by max endpoints
- [ ] TTL/eviction strategy
- [ ] Fast path reuse under load
- [ ] Standardize retry profiles:
- [ ] Read-only: short retry with jitter
- [ ] Mutations: no automatic retry unless idempotency key present
- [ ] Standardize timeouts:
- [ ] Edge timeout limits
- [ ] Internal per-service deadlines
- [ ] Fanout controls:
- [ ] Concurrency limiters for fleet snapshot/probes
- [ ] Cache results where safe (short TTL)
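The "cache results where safe (short TTL)" item can be sketched as a minimal TTL cache in front of probe calls; a sketch only, assuming a synchronous probe closure where the real Gateway would use an async call plus a concurrency limiter.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A minimal TTL cache for probe/fleet-snapshot results: callers reuse a
/// recent result instead of re-probing every endpoint, which bounds fanout
/// and avoids thundering herds.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return a cached value if still fresh; otherwise run the probe,
    /// store its result with a fresh timestamp, and return it.
    pub fn get_or_insert_with(&mut self, key: &str, probe: impl FnOnce() -> V) -> V {
        if let Some((at, v)) = self.entries.get(key) {
            if at.elapsed() < self.ttl {
                return v.clone();
            }
        }
        let v = probe();
        self.entries.insert(key.to_string(), (Instant::now(), v.clone()));
        v
    }
}
```

A short TTL (a few seconds) is usually enough: probe results only need to be fresher than the rebalance decision loop that consumes them.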
### Required Tests
- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)
## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)
### Goal
Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.
### Tasks
- [ ] Remove Gateway's HTTP query proxy usage (or keep only as compatibility shim).
- [ ] Remove Gateway's runner admin HTTP proxy usage (or keep only as compatibility shim).
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
- [ ] Harden metrics and health probes to always carry context.
### Required Tests
- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
## Notes / Constraints
- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.


@@ -1,246 +0,0 @@
# NATS Transport Plan
## Purpose
Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:
- Simplicity (few primitives, consistent naming, minimal per-service divergence)
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
- High performance (high throughput, low tail latency, reliable backpressure)
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)
## Non-Negotiable Rules (Global)
- Every JetStream stream/consumer MUST have an explicit contract:
- name, subjects, retention, storage, replication, max sizes
- ack policy, ack wait, max deliver, max in flight
- Every node MUST run with bounded work:
- bounded pull batch sizes
- bounded concurrency
- bounded retry/backoff
- Every message MUST be tenant-scoped in subject and/or headers.
- Every milestone below is “stop-the-line” gated:
- all tasks completed
- all tests passing
- workspace lint/format checks passing
- required NATS-gated integration tests for the milestone passing (when gated by env)
## Current State (Baseline)
- Streams:
- `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume)
- `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner)
- Subject conventions:
- Aggregate events: `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`
- Defaults often use filters like `tenant.*.aggregate.*.*`
- Durable consumers:
- Projection uses a durable name (configurable)
- Runner uses configurable durable prefix per role
- Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
- Headers:
- Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist
## Target Architecture (End State)
- A single “NATS wire protocol” contract shared across services:
- subject naming
- required headers (tenant/correlation/trace)
- message envelope compatibility rules (tolerant decoding, optional fields)
- Stable, minimal set of JetStream streams:
- one stream per message class (aggregate events, workflow commands, workflow events)
- no per-tenant streams unless there is a strong operational reason
- Stable, limited consumers:
- durable consumers for long-lived processors (Projection, Runner)
- ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
- Uniform backpressure + reliability defaults:
- explicit ack
- bounded `max_ack_pending` and application-level concurrency
- bounded redelivery via `max_deliver` + poison policy
## Definitions
### Message Context (Headers)
Standard headers for NATS published messages:
- `tenant-id` (required)
- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing)
- `traceparent` (optional but recommended; generated/propagated if present upstream)
- `trace-id` (optional; derived from traceparent when possible)
- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable)
### Subject Naming Rules
- Tenant-first prefix: `tenant.<tenant_id>.…`
- Stable message class token:
- `aggregate` for domain events
- `effect`, `effect_result`, `workflow`, `workflow_event` for Runner
- No ambiguous wildcard publishing:
- producers publish concrete subjects only
- consumers may filter with wildcards
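The subject rules above are what the builder functions enforce; a minimal sketch (the validation rules are illustrative, not the exact `shared` implementation):

```rust
/// Producers publish concrete, tenant-first subjects only.
/// Consumers may still filter with wildcards (e.g. "tenant.*.aggregate.*.*").
pub fn aggregate_event_subject(tenant: &str, agg_type: &str, agg_id: &str) -> String {
    format!("tenant.{tenant}.aggregate.{agg_type}.{agg_id}")
}

/// Reject tokens that would corrupt subject structure: NATS treats '.',
/// '*', and '>' as structural, and whitespace/empty tokens are invalid.
pub fn valid_token(token: &str) -> bool {
    !token.is_empty()
        && token
            .chars()
            .all(|c| !matches!(c, '.' | '*' | '>') && !c.is_whitespace())
}
```

Centralizing this in builders means a malformed tenant or aggregate id fails loudly at publish time instead of silently leaking into a wrong subject.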
### Consumer Naming Rules
- Durable consumer names must be stable and collision-free:
- include role + mode + optional view/saga name + shard/group
- Ephemeral consumer names must be unique per operation:
- include tenant + purpose + uuid
- must be deleted best-effort when operation completes
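The two naming rules can be sketched as two helpers; a sketch under stated assumptions (a timestamp plus a process-local counter stands in for the UUID the rule calls for, and the durable name layout is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

static EPHEMERAL_SEQ: AtomicU64 = AtomicU64::new(0);

/// Durable names: stable and collision-free, derived only from
/// role + mode + shard so every replica computes the same name.
pub fn durable_name(role: &str, mode: &str, shard: u32) -> String {
    format!("{role}-{mode}-shard{shard}")
}

/// Ephemeral names: unique per operation, so best-effort deletion can
/// never target another operation's consumer.
pub fn ephemeral_name(tenant: &str, purpose: &str) -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    let seq = EPHEMERAL_SEQ.fetch_add(1, Ordering::Relaxed);
    format!("eph-{tenant}-{purpose}-{nanos}-{seq}")
}
```

The `eph-` prefix doubles as a safety check for cleanup: deletion code can refuse to touch any consumer whose name does not carry it.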
## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)
### Goal
Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.
### Exit Criteria
- `shared` exposes NATS header constants and helpers for inject/extract/derive.
- All producers set required headers consistently.
- All consumers tolerate unknown fields and missing optional fields.
- A single, documented subject naming convention is enforced in code (builder functions).
- Workspace fmt/clippy/tests pass.
### Tasks
- [ ] Centralize NATS header constants and helpers in `shared`:
- [ ] inject headers for publish (tenant, correlation, trace)
- [ ] extract headers on receive (best-effort)
- [ ] derive `trace-id` from `traceparent`
- [ ] Aggregate:
- [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers
- [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test)
- [ ] Projection:
- [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
- [ ] Ensure correlation/trace context is carried into spans/metrics consistently
- [ ] Runner:
- [ ] Ensure publish paths include correlation/trace headers consistently for commands and results
- [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested
- [ ] Tests:
- [ ] Unit tests for header injection/extraction in `shared`
- [ ] Per-service unit tests asserting produced headers include required keys
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)
### Goal
Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.
### Exit Criteria
- Stream config for each stream is explicitly defined and validated at startup.
- Limits (max messages/bytes/age) are explicit and have defaults.
- Duplicate windows and dedupe behavior are explicit and tested.
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).
### Tasks
- [ ] Define a single “stream config policy” module per service (or shared helper):
- [ ] `AGGREGATE_EVENTS` subjects + retention policy
- [ ] `WORKFLOW_COMMANDS` subjects + retention policy
- [ ] `WORKFLOW_EVENTS` subjects + retention policy
- [ ] Standardize defaults:
- [ ] retention: limits appropriate for replay + rebuild
- [ ] `duplicate_window` aligned with producer idempotency strategy
- [ ] storage type and replication policy documented and configurable
- [ ] Add startup validations:
- [ ] verify stream exists and matches required subject set (compatible superset allowed)
- [ ] verify required ack/dedupe assumptions hold
- [ ] Add tests that parse and validate configs without NATS.
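The "stream config policy + startup validation" tasks can be sketched as a policy struct plus a superset check; the concrete limits below are illustrative defaults, not the project's real values.

```rust
use std::collections::HashSet;

/// Authoritative stream policy: explicit name, subjects, and limits.
pub struct StreamPolicy {
    pub name: &'static str,
    pub subjects: Vec<String>,
    pub max_age_secs: u64,
    pub duplicate_window_secs: u64,
}

pub fn aggregate_events_policy() -> StreamPolicy {
    StreamPolicy {
        name: "AGGREGATE_EVENTS",
        subjects: vec!["tenant.*.aggregate.*.*".to_string()],
        max_age_secs: 14 * 24 * 3600,   // illustrative retention default
        duplicate_window_secs: 120,      // must cover the producer retry horizon
    }
}

/// Startup check: the live stream must carry at least the required subjects
/// (a compatible superset is allowed). On mismatch, fail startup rather than
/// destructively mutating the existing stream.
pub fn subjects_compatible(required: &[String], actual: &[String]) -> bool {
    let actual: HashSet<&str> = actual.iter().map(String::as_str).collect();
    required.iter().all(|s| actual.contains(s.as_str()))
}
```

Because the policy is plain data, these checks are testable without a running NATS server, matching the "config-only tests" task above.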
### Required Tests
- Unit tests for stream config builders
- Existing crate tests
## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)
### Goal
Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.
### Exit Criteria
- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`.
- Application concurrency is bounded and tied to `max_in_flight`.
- Poison policy is consistent:
- after `max_deliver`, term + deadletter/quarantine record is written
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
### Tasks
- [ ] Define standard consumer config defaults:
- [ ] `AckPolicy::Explicit`
- [ ] `ack_wait` default + env override
- [ ] `max_deliver` default + env override
- [ ] `max_ack_pending` tied to application concurrency
- [ ] Projection:
- [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
- [ ] Ensure checkpoint gates ack correctly (skip still acks)
- [ ] Ensure poison policy writes durable records and terminates reliably
- [ ] Runner:
- [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
- [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
- [ ] Aggregate:
- [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
- [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers
- [ ] Tests:
- [ ] Unit tests for consumer name generation (sanitization + uniqueness)
- [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
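The standard consumer defaults above can be sketched as a small config builder with env overrides; the env variable names and default values are assumptions for illustration.

```rust
use std::time::Duration;

/// Standard consumer defaults; each field has an env override so operators
/// can tune without code changes.
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

pub fn consumer_defaults(prefix: &str, concurrency: i64) -> ConsumerDefaults {
    // Read "<PREFIX>_<SUFFIX>" from the environment, falling back to a default.
    let env_u64 = |suffix: &str, default: u64| {
        std::env::var(format!("{prefix}_{suffix}"))
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    };
    ConsumerDefaults {
        ack_wait: Duration::from_millis(env_u64("ACK_TIMEOUT_MS", 30_000)),
        max_deliver: env_u64("MAX_DELIVER", 5) as i64,
        // Tie the server-side in-flight cap to application concurrency so the
        // consumer never holds more unacked messages than workers can process.
        max_ack_pending: concurrency,
    }
}
```

Tying `max_ack_pending` to worker concurrency is the backpressure mechanism: the server stops delivering once the workers are saturated, instead of buffering unbounded work in the process.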
### Required Tests
- Workspace fmt/clippy/tests
- NATS-gated integration tests for:
- redelivery idempotency
- poison termination behavior
- scale-out with deliver group (where supported)
## Milestone 3: Connection Management + Failure Semantics (Operational Frugality)
### Goal
Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.
### Exit Criteria
- One NATS connection per process (or bounded pool only if justified).
- Reconnect/backoff policy is explicit and consistent.
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
- No busy-looping on NATS outages.
### Tasks
- [ ] Standardize connection options:
- [ ] reconnect delays/backoff
- [ ] max reconnect attempts or “infinite with backoff” strategy (explicit)
- [ ] request timeouts around JetStream operations
- [ ] Standardize readiness semantics:
- [ ] `ready=false` when NATS is unavailable and the node depends on it
- [ ] `health` stays “process alive” but reports NATS connectivity in payload
- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
- [ ] Tests:
- [ ] unit tests for backoff behavior (where possible)
- [ ] gated integration test: temporary NATS outage does not crash-loop and recovers
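The reconnect/backoff task can be sketched as a capped exponential delay function (real implementations also add jitter; the base and cap values here are assumptions):

```rust
use std::time::Duration;

/// Capped exponential backoff: the delay doubles per attempt up to `max`,
/// so a long NATS outage never degenerates into a busy-loop.
pub fn reconnect_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let exp = attempt.min(16); // avoid shift overflow on pathological counts
    let factor = 1u64 << exp;
    base.checked_mul(factor as u32)
        .map(|d| d.min(max))
        .unwrap_or(max)
}
```

This pairs with the "fast fail" task above: tests and dev can pass a tiny `max` (or a low attempt cap) so a missing NATS endpoint fails in milliseconds instead of retrying thirty times.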
## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)
### Goal
Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.
### Exit Criteria
- Durable names are deterministic and collision-free across replicas.
- Deliver groups are used where appropriate to share work across replicas.
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
- A scale-out test suite exists and is gated but runnable.
### Tasks
- [ ] Establish consumer naming scheme per service role:
- [ ] Projection: per-view durable option uses sanitized names and stable mapping
- [ ] Runner: durable prefix includes role + shard + optional group
- [ ] Establish deliver group usage rules:
- [ ] when to enable (scale-out consumers)
- [ ] how to roll without duplication
- [ ] Strengthen dedupe keys:
- [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
- [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id`
- [ ] Add gated tests:
- [ ] two replicas, same tenant, no duplicate publishes
- [ ] rolling restart preserves checkpoint correctness
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- Gated NATS integration tests:
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
- Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests
## Notes / Constraints
- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
- Prefer stable durable consumers; ephemeral consumers must be unique and bounded and must cleanup best-effort.


@@ -4,10 +4,12 @@
 - Rust services (Cargo workspace): `aggregate/`, `gateway/`, `projection/`, `runner/`, `control/api/`, `shared/`
 - Control UI: `control/ui/`
 - Docker + Swarm + Compose: `docker/`, `docker-compose.yml`, `swarm/`, `observability/`
-- Transport plans:
-  - `TRANSPORT_DEVELOPMENT_PLAN.md`
-  - `GATEWAY_TRANSPORT_PLAN.md`
-  - `NATS_TRANSPORT_PLAN.md`
+## Documentation
+- docs/README.md
+- Architecture: docs/architecture/overview.md, docs/architecture/transport.md
+- Developer: docs/developer/setup.md, docs/developer/testing.md
+- Usage: docs/usage/quickstart.md, docs/usage/api.md, docs/usage/nats.md
 ## Quick Start (Docker Compose)
@@ -35,4 +37,5 @@ More details: `DOCKER.md`
 cargo fmt --check
 cargo clippy --workspace --all-targets -- -D warnings
 cargo test --workspace
+cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build
 ```


@@ -1,339 +0,0 @@
# Transport Development Plan
## Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`
## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are standardized:
- `shared` provides `TenantId`, `CorrelationId`, `TraceId`
- `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
- `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
- Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
- Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
- Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
- Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
## Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
## Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
- All tasks completed
- All tests required by the milestone pass
- Workspace verification commands pass
- Gated integration tests for the milestone are runnable and documented
## Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
## End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
- header names and injection/extraction rules
- trace parsing/validation (`traceparent`, `trace-id`)
- context object model (tenant/correlation/trace/request ids)
- NATS subject + consumer naming helpers
## Standard Contracts
### Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
### Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus
## Milestone 0: Shared Transport Contract (Headers + Context + Trace)
### Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
### Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
### Tasks
- [x] Add shared ID types in `shared`:
- [x] `TenantId`
- [x] `CorrelationId`
- [x] `TraceId`
- [x] Consolidate header constants in `shared`:
- [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
- [x] HTTP: `x-tenant-id`, `x-request-id`
- [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
- [x] NATS: `tenant-id`, `Nats-Msg-Id`
- [x] Add shared helpers:
- [x] derive `trace-id` from `traceparent`
- [x] derive `traceparent` from `trace-id` when valid
- [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
- [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- [x] Add unit tests in `shared` for:
- [x] traceparent parsing validity
- [x] serialization shape for correlation/trace id newtypes
- [x] additional validation cases (invalid traceparents, all-zero ids)
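A sketch of the derivation the shared helper performs (field layout follows W3C Trace Context; the exact implementation of `trace_id_from_traceparent(...)` in `shared` may differ):

```rust
/// Derive the 32-hex-char trace id from a W3C traceparent
/// ("00-<trace-id>-<parent-id>-<flags>"), rejecting malformed or
/// all-zero ids as the validation tasks above require.
pub fn trace_id_from_traceparent(tp: &str) -> Option<String> {
    let parts: Vec<&str> = tp.split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    // Each field must be lowercase-insensitive hex of a fixed width.
    let hex = |s: &str, n: usize| s.len() == n && s.chars().all(|c| c.is_ascii_hexdigit());
    if !hex(version, 2) || !hex(trace_id, 32) || !hex(parent_id, 16) || !hex(flags, 2) {
        return None;
    }
    if trace_id.chars().all(|c| c == '0') {
        return None; // all-zero trace id is invalid per W3C Trace Context
    }
    Some(trace_id.to_ascii_lowercase())
}
```

This is the piece that lets NATS carry a plain `trace-id` header derived from the HTTP-side `traceparent`, keeping the two transports joined in one trace.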
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
### Dependencies
- Milestone 0
### Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
### Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
### Tasks
- [x] Create/standardize subject builder helpers (prefer `shared`):
- [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
- [x] Runner effect/effect_result subject builders
- [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- [x] Aggregate publishes:
- [x] `tenant-id` header always present
- [x] correlation + trace headers always present; generated when missing/invalid
- [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- [x] Runner publishes (commands/results):
- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
- [x] trace headers always present/derived when possible; generated when missing/invalid
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
- [x] outbox metadata → NATS headers mapping standardized via shared helpers
- [x] Projection consumption:
- [x] envelope decoding remains tolerant (unknown fields ignored)
- [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- [x] Add unit tests:
- [x] subject formatting tests (shared builders)
- [x] required header presence tests per publisher (Aggregate + Runner)
### Required Tests
- Workspace verification commands
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
### Dependencies
- Milestone 1
### Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
### Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
### Tasks
- [x] Define stream policies:
- [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
- [x] `WORKFLOW_COMMANDS` is defined and validated on startup
- [x] `WORKFLOW_EVENTS` is defined and validated on startup
- [x] Centralize stream policy builders/validators in `shared`
- [x] Implement compatibility validation rules:
- [x] required subjects are present (superset allowed)
- [x] limits/max_age/duplicate window validated against minimums
- [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- [x] Add unit tests for stream config builders + validators.
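The compatibility rules above lend themselves to config-only validation, in the spirit of the "no NATS required" tests. A minimal sketch (all names hypothetical, not the real `shared` builders):

```rust
/// Hypothetical stream policy: required subjects must be present on the live
/// stream (a superset is allowed), and the duplicate window must meet a
/// minimum so producer `Nats-Msg-Id` dedupe assumptions hold.
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub min_duplicate_window_secs: u64,
}

pub fn validate_compat(
    policy: &StreamPolicy,
    live_subjects: &[String],
    live_duplicate_window_secs: u64,
) -> Result<(), String> {
    for s in &policy.required_subjects {
        if !live_subjects.contains(s) {
            return Err(format!("{}: missing required subject {s}", policy.name));
        }
    }
    if live_duplicate_window_secs < policy.min_duplicate_window_secs {
        return Err(format!("{}: duplicate window too small", policy.name));
    }
    Ok(())
}
```

A startup check like this fails fast on incompatible streams instead of silently mutating them.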
### Required Tests
- Workspace verification commands
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
### Dependencies
- Milestone 2
### Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
### Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
- redelivery idempotency
- poison termination
- scale-out behavior (deliver group) where applicable
### Tasks
- [x] Standardize consumer defaults:
- [x] `AckPolicy::Explicit`
- [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
- [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
- [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- [x] Projection:
- [x] durable naming collision-free for Single/PerView modes
- [x] checkpoint gate semantics: “skip still acks”
- [x] poison handling persists durable records and terminates reliably (poison record + term)
- [x] Runner:
- [x] durable naming collision-free and stable across replicas
- [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
- [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- [x] Aggregate:
- [x] ad-hoc fetch consumer always unique and bounded
- [x] best-effort deletion never targets unrelated consumers
- [x] Add gated NATS integration tests and document env flags:
- [x] Runner ignored tests
- [x] Projection ignored tests
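The key invariant in these defaults is that the server-side in-flight cap tracks application concurrency. A sketch, assuming the tie between `max_ack_pending` and `max_in_flight` described above (names illustrative):

```rust
/// Standardized explicit-ack consumer defaults. `max_ack_pending` is derived
/// from worker concurrency so JetStream never hands a replica more messages
/// than its workers can process at once.
pub struct ConsumerDefaults {
    pub ack_wait_ms: u64,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

pub fn consumer_defaults(max_in_flight: usize, ack_wait_ms: u64, max_deliver: i64) -> ConsumerDefaults {
    ConsumerDefaults {
        ack_wait_ms,
        // Bounded redelivery, but always at least one delivery attempt.
        max_deliver: max_deliver.max(1),
        // Align the server-side in-flight cap with worker concurrency.
        max_ack_pending: max_in_flight.max(1) as i64,
    }
}
```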
### Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
### Tasks
- [x] Define protobuf API: `projection.gateway.v1.QueryService`
- [x] Implement Projection gRPC server for query execution
- [x] Implement Gateway gRPC client routing to Projection
- [x] deadlines
- [x] bounded retries (idempotent only)
- [x] context propagation
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
- [x] route internally to gRPC
- [x] Add tests:
- [x] authz + forwarding via gRPC
- [x] tenant isolation enforcement in Projection QueryService
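The "bounded retries (idempotent only)" rule reduces to a small decision function. This is a sketch under that policy, not the Gateway's actual code; real backoff would also add random jitter:

```rust
/// Retry only idempotent reads, and only up to a bounded attempt count.
/// Mutations are never retried by this profile.
pub fn should_retry(idempotent: bool, attempt: u32, max_attempts: u32) -> bool {
    idempotent && attempt + 1 < max_attempts
}

/// Exponential backoff, capped so large attempt numbers cannot overflow.
pub fn backoff_ms(attempt: u32, base_ms: u64) -> u64 {
    base_ms.saturating_mul(1u64 << attempt.min(10))
}
```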
### Required Tests
- Workspace verification commands
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway's `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.
### Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
### Tasks
- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [x] Implement Runner gRPC admin server
- [x] Implement Gateway gRPC client integration for admin operations
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [x] Add tests:
- [x] Gateway: rejects without rights
- [x] Gateway: rejects tenant spoof attempts
- [x] Runner: idempotency and drain semantics
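The drain idempotency being tested can be summarized as a tiny state machine. A hypothetical sketch (the real Runner state is richer than this):

```rust
/// Drain is idempotent: a repeated drain request is a no-op, and readiness
/// flips false as soon as draining begins.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RunnerState {
    Running,
    Draining,
    Drained,
}

pub fn request_drain(state: RunnerState) -> RunnerState {
    match state {
        RunnerState::Running => RunnerState::Draining,
        // Already draining/drained: repeated requests change nothing.
        other => other,
    }
}

pub fn is_ready(state: RunnerState) -> bool {
    state == RunnerState::Running
}
```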
### Required Tests
- Workspace verification commands
## Milestone 6: Gateway Upstream Performance + Operational Guardrails
### Dependencies
- Milestones 4 and 5 (gRPC internal RPC surfaces available)
### Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
### Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
### Tasks
- [x] Implement upstream channel pool
- [x] bounded LRU
- [x] TTL/eviction
- [x] fast-path reuse under load (cached gRPC channels)
- [x] Standardize retry profiles
- [x] read-only: limited retry with jitter (Gateway gRPC calls)
- [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- [x] Standardize timeouts/deadlines:
- [x] edge timeout limits
- [x] internal per-service deadlines
- [x] Fanout controls:
- [x] concurrency limiters for probes/snapshots
- [x] short TTL caching where safe
- [x] Ensure probes carry context (correlation/trace) for observability.
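The "short TTL caching where safe" item can be sketched as a probe cache that takes an explicit `now`, which keeps expiry testable without sleeping. Names are illustrative; the real pool also bounds size (LRU), which is omitted here:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal short-TTL cache for probe/snapshot results. Entries are valid for
/// `ttl` from insertion time.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    pub fn put(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Returns the cached value only while it is still fresh.
    pub fn get(&self, key: &str, now: Instant) -> Option<V> {
        self.entries.get(key).and_then(|(at, v)| {
            if now.duration_since(*at) <= self.ttl { Some(v.clone()) } else { None }
        })
    }
}
```

Passing `now` in also makes the cache trivial to drive from a gated load test with a simulated clock.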
### Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
### Dependencies
- Milestone 6
### Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
### Tasks
- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- [x] Ensure Control UI + Control API rely only on standardized surfaces
- [x] Harden metrics and readiness probes to match the standard contract everywhere
### Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
## Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)

File: `docs/README.md`
# Cloudlysis Documentation
## Sections
- Architecture
- docs/architecture/overview.md
- docs/architecture/transport.md
- Developer
- docs/developer/setup.md
- docs/developer/testing.md
- Usage
- docs/usage/quickstart.md
- docs/usage/api.md
- docs/usage/nats.md
## Conventions
- HTTP edge remains JSON over REST via Gateway.
- Internal RPC uses gRPC between Gateway and nodes (Aggregate, Projection, Runner).
- Async backbone uses NATS JetStream and KV with standardized subjects, headers, and stream/consumer policies.

# Architecture Overview
## Monorepo
- Rust workspace: aggregate, projection, runner, gateway, control/api, shared
- Frontend: control/ui
- Infra: docker, observability, swarm
## Data Flow
- Clients → Gateway (HTTP/JSON)
- Gateway ↔ Nodes (gRPC)
- Nodes ↔ NATS (JetStream + KV)
## Services
- Aggregate: command handling + event sourcing; publishes events to JetStream
- Projection: materialized views; consumes aggregate events; exposes QueryService (gRPC)
- Runner: workflow/saga engine + effects/outbox; exposes RunnerAdmin (gRPC)
- Gateway: edge, authn/z, routing to nodes, admin entry points
## Observability
- /health, /ready, /metrics on all services
- Correlation and tracing propagated across HTTP, gRPC, and NATS

# Transport Contracts
## Context
- Tenant: HTTP `x-tenant-id`; NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`; NATS `x-correlation-id` + `correlation-id`
- Trace: HTTP `traceparent`; NATS `traceparent` + `trace-id`
## Internal RPC (gRPC)
- Aggregate: CommandService (submit commands)
- Projection: QueryService (execute queries)
- Runner: RunnerAdmin (drain, status, reload)
- All calls set deadlines and propagate context metadata
## NATS JetStream
- Streams:
  - `AGGREGATE_EVENTS`: `tenant.*.aggregate.*.*`
  - `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
- Producers set headers: `tenant-id`, `Nats-Msg-Id`, correlation, `traceparent`/`trace-id`
- Consumers use `AckPolicy::Explicit` with bounded `ack_wait`, `max_deliver`, and `max_ack_pending`
## Routing
- Gateway routes per-tenant to shards for Aggregate/Projection/Runner
- Routing tables hot-reload atomically
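The atomic hot-reload described above can be sketched with `RwLock<Arc<_>>`: readers clone the current `Arc` (a cheap operation with the lock held only briefly) and keep a consistent snapshot while a reload swaps in a whole new table. This is an assumption about shape, not the real Gateway code, which might instead use a lock-free swap.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Default)]
pub struct RoutingTable {
    pub shard_by_tenant: HashMap<String, String>,
}

pub struct Router {
    current: RwLock<Arc<RoutingTable>>,
}

impl Router {
    pub fn new(table: RoutingTable) -> Self {
        Self { current: RwLock::new(Arc::new(table)) }
    }

    /// Readers take a consistent snapshot; later reloads never mutate it.
    pub fn snapshot(&self) -> Arc<RoutingTable> {
        self.current.read().unwrap().clone()
    }

    /// Reload replaces the whole table atomically.
    pub fn reload(&self, table: RoutingTable) {
        *self.current.write().unwrap() = Arc::new(table);
    }
}
```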

File: `docs/developer/setup.md`
# Developer Setup
## Prerequisites
- Rust toolchain (stable)
- Node.js (LTS) for control/ui
- Docker (optional) for local stack
## Build
```bash
cargo build
cd control/ui && npm ci && npm run build
```
## Workspace Verification
```bash
cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build
```
## Environment
- Gateway: routing configuration from file or NATS KV
- Projection: `PROJECTION_GRPC_ADDR`
- Runner: `RUNNER_GRPC_ADDR`
- NATS: connection URLs via service-specific settings

File: `docs/developer/testing.md`
# Testing
## Unit and Integration
```bash
cargo test --workspace
```
## Gated Tests (require external services)
- Runner NATS:
```bash
RUNNER_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p runner -- --ignored
```
- Projection NATS:
```bash
PROJECTION_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p projection -- --ignored
```
- Docker-based gates:
```bash
cargo test -p gateway -- --ignored
```
## Control UI
```bash
cd control/ui
npm ci
npm run test
```

File: `docs/usage/api.md`
# Usage: API Examples
## Projection Query via Gateway (HTTP → gRPC)
```bash
curl -sS -X POST \
-H "content-type: application/json" \
-H "x-tenant-id: tenant-a" \
-H "x-correlation-id: demo" \
-H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \
http://localhost:8080/v1/query/User \
-d '{"uqf":"{\"eq\":{\"id\":\"u1\"}}"}'
```
## Projection Query via gRPC (direct, internal)
```bash
grpcurl -d '{"tenant_id":"tenant-a","view_type":"User","uqf":"{}"}' \
-H 'x-tenant-id: tenant-a' \
-H 'x-correlation-id: demo' \
-H 'traceparent: 00-00000000000000000000000000000001-0000000000000001-01' \
-plaintext localhost:9090 projection.gateway.v1.QueryService/ExecuteQuery
```
## Aggregate Command via Gateway (HTTP → gRPC)
```bash
curl -sS -X POST \
-H "content-type: application/json" \
-H "x-tenant-id: tenant-a" \
-H "x-correlation-id: demo" \
-H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \
http://localhost:8080/v1/aggregate/BankAccount/command \
-d '{"id":"acc-1","command_type":"Open","payload":{"owner":"Alice"}}'
```
## Runner Admin via Gateway (HTTP → gRPC)
```bash
curl -sS -X POST \
-H "x-tenant-id: tenant-a" \
-H "authorization: Bearer <token>" \
"http://localhost:8080/admin/runner/drain?wait_ms=0"
```

File: `docs/usage/nats.md`
# NATS Reference
## Subjects
- Aggregate events: `tenant.<tenant>.aggregate.<type>.<id>`
- Workflow commands/events: shared helpers define the exact formats
## Headers (Producers)
- `tenant-id`: required
- `Nats-Msg-Id`: idempotency key (`event_id`, `command_id`, etc.)
- `x-correlation-id` and `correlation-id`
- `traceparent` and `trace-id`
## Consumers
- `AckPolicy::Explicit`
- `ack_wait`: bounded ack timeout
- `max_deliver`: bounded redelivery attempts
- `max_ack_pending`: aligned with worker concurrency
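The aggregate-event subject shape above can be illustrated with a build/parse pair. The authoritative format lives in the `shared` helpers; these function names are hypothetical:

```rust
/// Build `tenant.<tenant>.aggregate.<type>.<id>`.
pub fn aggregate_subject(tenant: &str, agg_type: &str, id: &str) -> String {
    format!("tenant.{tenant}.aggregate.{agg_type}.{id}")
}

/// Parse a subject back into (tenant, aggregate type, aggregate id).
pub fn parse_aggregate_subject(s: &str) -> Option<(String, String, String)> {
    let parts: Vec<&str> = s.split('.').collect();
    match parts.as_slice() {
        ["tenant", t, "aggregate", a, i] => {
            Some((t.to_string(), a.to_string(), i.to_string()))
        }
        _ => None,
    }
}
```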

File: `docs/usage/quickstart.md`
# Quick Start
## Compose
```bash
docker compose up -d --build
```
Full stack with observability:
```bash
docker compose -f docker-compose.yml -f observability/docker-compose.yml up -d --build
```
## Local Dev (minimal)
- Start NATS locally
- Run services:
```bash
cargo run -p aggregate
cargo run -p projection
cargo run -p runner
cargo run -p gateway
```
## Verify
- Gateway: GET /health, /ready, /metrics
- Projection: GET /health, /ready, /metrics
- Runner: GET /health, /ready, /metrics