docs: add docs folder (architecture, developer, usage); update README; wire probe TTL cache + concurrency notes into docs
This commit is contained in:
@@ -1,216 +0,0 @@
|
|||||||
# Gateway Transport Plan
|
|
||||||
|
|
||||||
## Purpose
|
|
||||||
Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:
|
|
||||||
- Simplicity (few patterns, minimal bespoke conventions)
|
|
||||||
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
|
|
||||||
- Frugality (bounded connections, bounded fanout, low overhead)
|
|
||||||
- High performance (low tail latency, backpressure-aware, predictable routing)
|
|
||||||
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)
|
|
||||||
|
|
||||||
## Non-Negotiable Rules (Global)
|
|
||||||
- Every cross-service request MUST carry tenant + trace context.
|
|
||||||
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
|
|
||||||
- Every milestone below is “stop-the-line” gated:
|
|
||||||
- All tasks completed
|
|
||||||
- All tests passing
|
|
||||||
- Workspace lint/format/type checks passing
|
|
||||||
- Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)
|
|
||||||
|
|
||||||
## Current State (Baseline)
|
|
||||||
- Gateway → Aggregate: gRPC command submission
|
|
||||||
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
|
|
||||||
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
|
|
||||||
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)
|
|
||||||
|
|
||||||
## Target Architecture (End State)
|
|
||||||
- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
|
|
||||||
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
|
|
||||||
- Async/event backbone: NATS JetStream remains for event/work distribution
|
|
||||||
- `shared` is the single source of truth for:
|
|
||||||
- Header names and propagation rules
|
|
||||||
- Trace parsing/validation rules (`traceparent`, `trace-id`)
|
|
||||||
- Request context representation (tenant/correlation/trace)
|
|
||||||
|
|
||||||
## Definitions
|
|
||||||
### Request Context
|
|
||||||
Fields that must be consistently propagated:
|
|
||||||
- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
|
|
||||||
- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
|
|
||||||
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
|
|
||||||
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
|
|
||||||
- `request_id` (HTTP: `x-request-id`, optional for NATS)
|
|
||||||
|
|
||||||
### Standard Health Endpoints (per service)
|
|
||||||
- `GET /health` liveness
|
|
||||||
- `GET /ready` readiness (includes tenant gating if applicable)
|
|
||||||
- `GET /metrics` Prometheus
|
|
||||||
|
|
||||||
## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- A single shared contract exists for header names and trace parsing.
|
|
||||||
- Gateway injects context into all upstream calls (including rebalance/health probes).
|
|
||||||
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
|
|
||||||
- Unit tests prove propagation behavior for each transport.
|
|
||||||
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
|
|
||||||
- [ ] Add `shared` helpers for:
|
|
||||||
- HTTP extract/inject
|
|
||||||
- gRPC metadata extract/inject
|
|
||||||
- NATS header extract/inject
|
|
||||||
- [ ] Gateway: ensure context is injected into:
|
|
||||||
- gRPC upstream requests to Aggregate
|
|
||||||
- HTTP upstream requests to Projection
|
|
||||||
- Runner admin proxy requests
|
|
||||||
- Any “probe” calls (rebalance gates, fleet snapshots, health checks)
|
|
||||||
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
|
|
||||||
- `tenant-id`
|
|
||||||
- `x-correlation-id` + `correlation-id`
|
|
||||||
- `traceparent`
|
|
||||||
- `trace-id` (derived when possible)
|
|
||||||
- [ ] Add transport-level tests:
|
|
||||||
- [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
|
|
||||||
- [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
|
|
||||||
- [ ] NATS publish path: produced headers contain expected keys/values
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Unit tests for shared parsing/derivation utilities
|
|
||||||
- Existing per-crate test suites
|
|
||||||
- At least one per-service “transport contract” test verifying headers are present and correct
|
|
||||||
|
|
||||||
## Milestone 1: Internal RPC Standardization (Projection via gRPC)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- A Projection gRPC service exists for query execution.
|
|
||||||
- Gateway routes queries to Projection via gRPC by default.
|
|
||||||
- Authorization semantics remain enforced in Gateway (deny-by-default).
|
|
||||||
- Response shapes are stable and match the existing UI expectations.
|
|
||||||
- All tests pass, including new gRPC query integration tests.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
|
|
||||||
- [ ] Request includes tenant + view + query payload and metadata
|
|
||||||
- [ ] Response includes result payload and standard context propagation
|
|
||||||
- [ ] Implement Projection gRPC server:
|
|
||||||
- [ ] Parse tenant/view/query
|
|
||||||
- [ ] Execute query against current projection storage/query engine
|
|
||||||
- [ ] Enforce tenant scope
|
|
||||||
- [ ] Implement Gateway gRPC client path for queries:
|
|
||||||
- [ ] Routing by tenant to Projection endpoint
|
|
||||||
- [ ] Deadlines, bounded retries (idempotent only)
|
|
||||||
- [ ] Context propagation (tenant/correlation/trace)
|
|
||||||
- [ ] Keep HTTP `/v1/query/*`:
|
|
||||||
- [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
|
|
||||||
- [ ] Add tests:
|
|
||||||
- [ ] Gateway query authz + forwarding via gRPC
|
|
||||||
- [ ] Projection gRPC query contract tests for tenant isolation
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- New gRPC QueryService tests (unit + integration)
|
|
||||||
- Existing query/authz tests in Gateway
|
|
||||||
- Workspace fmt/clippy/test
|
|
||||||
|
|
||||||
## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
|
|
||||||
- Gateway uses gRPC to call Runner admin APIs.
|
|
||||||
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
|
|
||||||
- Admin operations are idempotent where appropriate and include audit hooks where required.
|
|
||||||
- All tests pass and include negative/tenant-spoof cases.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
|
|
||||||
- [ ] Drain/resume/status/reload/tenant-scoped controls
|
|
||||||
- [ ] Standard error mapping
|
|
||||||
- [ ] Implement Runner gRPC admin server:
|
|
||||||
- [ ] Tenant gating enforced for tenant-scoped operations
|
|
||||||
- [ ] Readiness/drain semantics aligned with platform contracts
|
|
||||||
- [ ] Implement Gateway gRPC client integration:
|
|
||||||
- [ ] Route to Runner endpoint via routing table
|
|
||||||
- [ ] Enforce authz rights (e.g. `runner.admin`)
|
|
||||||
- [ ] Context propagation
|
|
||||||
- [ ] Keep HTTP `/admin/*` in Runner optional:
|
|
||||||
- [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
|
|
||||||
- [ ] Tests:
|
|
||||||
- [ ] Gateway: admin calls rejected without rights
|
|
||||||
- [ ] Gateway: tenant spoof attempts rejected
|
|
||||||
- [ ] Runner: idempotency and drain semantics validated
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- gRPC RunnerAdmin unit/integration tests
|
|
||||||
- Gateway proxy-to-gRPC tests
|
|
||||||
- Workspace fmt/clippy/test
|
|
||||||
|
|
||||||
## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Gateway maintains bounded upstream connection pools for gRPC endpoints.
|
|
||||||
- All gRPC calls have deadlines; retries are only for idempotent operations.
|
|
||||||
- All probe/fanout calls are bounded and do not cause thundering herds.
|
|
||||||
- Load/soak tests show stable behavior under partial failure.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Implement a Gateway upstream channel pool:
|
|
||||||
- [ ] LRU bounded by max endpoints
|
|
||||||
- [ ] TTL/eviction strategy
|
|
||||||
- [ ] Fast path reuse under load
|
|
||||||
- [ ] Standardize retry profiles:
|
|
||||||
- [ ] Read-only: short retry with jitter
|
|
||||||
- [ ] Mutations: no automatic retry unless idempotency key present
|
|
||||||
- [ ] Standardize timeouts:
|
|
||||||
- [ ] Edge timeout limits
|
|
||||||
- [ ] Internal per-service deadlines
|
|
||||||
- [ ] Fanout controls:
|
|
||||||
- [ ] Concurrency limiters for fleet snapshot/probes
|
|
||||||
- [ ] Cache results where safe (short TTL)
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Unit tests for pool eviction/TTL
|
|
||||||
- Gateway integration tests for deadline propagation
|
|
||||||
- Gated load tests (document env + how to run)
|
|
||||||
|
|
||||||
## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
|
|
||||||
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
|
|
||||||
- All operational playbooks rely on standardized endpoints.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Remove Gateway’s HTTP query proxy usage (or keep only as compatibility shim).
|
|
||||||
- [ ] Remove Gateway’s runner admin HTTP proxy usage (or keep only as compatibility shim).
|
|
||||||
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
|
|
||||||
- [ ] Harden metrics and health probes to always carry context.
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- End-to-end smoke tests (gated)
|
|
||||||
- Workspace fmt/clippy/test
|
|
||||||
|
|
||||||
## Verification Commands (Required at Each Milestone)
|
|
||||||
- `cargo fmt --check`
|
|
||||||
- `cargo clippy --workspace --all-targets -- -D warnings`
|
|
||||||
- `cargo test --workspace`
|
|
||||||
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
|
|
||||||
|
|
||||||
## Notes / Constraints
|
|
||||||
- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
|
|
||||||
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.
|
|
||||||
@@ -1,246 +0,0 @@
|
|||||||
# NATS Transport Plan
|
|
||||||
|
|
||||||
## Purpose
|
|
||||||
Standardize and optimize how nodes (Aggregate, Projection, Runner, Gateway where applicable) use NATS JetStream and NATS KV, under these principles:
|
|
||||||
- Simplicity (few primitives, consistent naming, minimal per-service divergence)
|
|
||||||
- Ease of operation (predictable streams/consumers, clear runbooks, easy debugging)
|
|
||||||
- Frugality (bounded consumers, bounded in-flight work, minimal churn, minimal storage)
|
|
||||||
- Low resource usage (stable durable consumers, controlled ack waits, limited fanout)
|
|
||||||
- High performance (high throughput, low tail latency, reliable backpressure)
|
|
||||||
- Safety (tenant isolation, idempotency, deterministic replay, poison handling)
|
|
||||||
|
|
||||||
## Non-Negotiable Rules (Global)
|
|
||||||
- Every JetStream stream/consumer MUST have an explicit contract:
|
|
||||||
- name, subjects, retention, storage, replication, max sizes
|
|
||||||
- ack policy, ack wait, max deliver, max in flight
|
|
||||||
- Every node MUST run with bounded work:
|
|
||||||
- bounded pull batch sizes
|
|
||||||
- bounded concurrency
|
|
||||||
- bounded retry/backoff
|
|
||||||
- Every message MUST be tenant-scoped in subject and/or headers.
|
|
||||||
- Every milestone below is “stop-the-line” gated:
|
|
||||||
- all tasks completed
|
|
||||||
- all tests passing
|
|
||||||
- workspace lint/format checks passing
|
|
||||||
- required NATS-gated integration tests for the milestone passing (when gated by env)
|
|
||||||
|
|
||||||
## Current State (Baseline)
|
|
||||||
- Streams:
|
|
||||||
- `AGGREGATE_EVENTS` (Aggregate publishes, Projection/Runner consume)
|
|
||||||
- `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS` (Runner)
|
|
||||||
- Subject conventions:
|
|
||||||
- Aggregate events: `tenant.<tenant_id>.aggregate.<aggregate_type>.<aggregate_id>`
|
|
||||||
- Defaults often use filters like `tenant.*.aggregate.*.*`
|
|
||||||
- Durable consumers:
|
|
||||||
- Projection uses a durable name (configurable)
|
|
||||||
- Runner uses configurable durable prefix per role
|
|
||||||
- Aggregate had ad-hoc fetch consumer risks; now mitigated with unique consumer names per fetch
|
|
||||||
- Headers:
|
|
||||||
- Tenant + correlation + trace headers exist but were historically inconsistent; shared utilities now exist
|
|
||||||
|
|
||||||
## Target Architecture (End State)
|
|
||||||
- A single “NATS wire protocol” contract shared across services:
|
|
||||||
- subject naming
|
|
||||||
- required headers (tenant/correlation/trace)
|
|
||||||
- message envelope compatibility rules (tolerant decoding, optional fields)
|
|
||||||
- Stable, minimal set of JetStream streams:
|
|
||||||
- one stream per message class (aggregate events, workflow commands, workflow events)
|
|
||||||
- no per-tenant streams unless there is a strong operational reason
|
|
||||||
- Stable, limited consumers:
|
|
||||||
- durable consumers for long-lived processors (Projection, Runner)
|
|
||||||
- ephemeral consumers only for bounded ad-hoc operations (Aggregate fetch), always unique + best-effort deletion
|
|
||||||
- Uniform backpressure + reliability defaults:
|
|
||||||
- explicit ack
|
|
||||||
- bounded `max_ack_pending` and application-level concurrency
|
|
||||||
- bounded redelivery via `max_deliver` + poison policy
|
|
||||||
|
|
||||||
## Definitions
|
|
||||||
### Message Context (Headers)
|
|
||||||
Standard headers for NATS published messages:
|
|
||||||
- `tenant-id` (required)
|
|
||||||
- `x-correlation-id` and `correlation-id` (required for any request-derived message; generated if missing)
|
|
||||||
- `traceparent` (optional but recommended; generated/propagated if present upstream)
|
|
||||||
- `trace-id` (optional; derived from traceparent when possible)
|
|
||||||
- `Nats-Msg-Id` (required for idempotent publish/dedupe when applicable)
|
|
||||||
|
|
||||||
### Subject Naming Rules
|
|
||||||
- Tenant-first prefix: `tenant.<tenant_id>.…`
|
|
||||||
- Stable message class token:
|
|
||||||
- `aggregate` for domain events
|
|
||||||
- `effect`, `effect_result`, `workflow`, `workflow_event` for Runner
|
|
||||||
- No ambiguous wildcard publishing:
|
|
||||||
- producers publish concrete subjects only
|
|
||||||
- consumers may filter with wildcards
|
|
||||||
|
|
||||||
### Consumer Naming Rules
|
|
||||||
- Durable consumer names must be stable and collision-free:
|
|
||||||
- include role + mode + optional view/saga name + shard/group
|
|
||||||
- Ephemeral consumer names must be unique per operation:
|
|
||||||
- include tenant + purpose + uuid
|
|
||||||
- must be deleted best-effort when operation completes
|
|
||||||
|
|
||||||
## Milestone 0: NATS Wire Contract Lock-in (Names, Headers, Envelopes)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make the NATS/JetStream wire contract explicit and enforced in code so all producers/consumers interoperate safely across scale-out and rolling restarts.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- `shared` exposes NATS header constants and helpers for inject/extract/derive.
|
|
||||||
- All producers set required headers consistently.
|
|
||||||
- All consumers tolerate unknown fields and missing optional fields.
|
|
||||||
- A single, documented subject naming convention is enforced in code (builder functions).
|
|
||||||
- Workspace fmt/clippy/tests pass.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Centralize NATS header constants and helpers in `shared`:
|
|
||||||
- [ ] inject headers for publish (tenant, correlation, trace)
|
|
||||||
- [ ] extract headers on receive (best-effort)
|
|
||||||
- [ ] derive `trace-id` from `traceparent`
|
|
||||||
- [ ] Aggregate:
|
|
||||||
- [ ] Ensure event publishing always sets `tenant-id`, correlation headers, trace headers
|
|
||||||
- [ ] Ensure `Nats-Msg-Id` strategy is correct for idempotency/dedupe (document and test)
|
|
||||||
- [ ] Projection:
|
|
||||||
- [ ] Ensure EventEnvelope decoding remains tolerant (unknown fields ignored, optional IDs supported)
|
|
||||||
- [ ] Ensure correlation/trace context is carried into spans/metrics consistently
|
|
||||||
- [ ] Runner:
|
|
||||||
- [ ] Ensure publish paths include correlation/trace headers consistently for commands and results
|
|
||||||
- [ ] Ensure outbox metadata → NATS headers mapping is consistent and tested
|
|
||||||
- [ ] Tests:
|
|
||||||
- [ ] Unit tests for header injection/extraction in `shared`
|
|
||||||
- [ ] Per-service unit tests asserting produced headers include required keys
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- `cargo fmt --check`
|
|
||||||
- `cargo clippy --workspace --all-targets -- -D warnings`
|
|
||||||
- `cargo test --workspace`
|
|
||||||
|
|
||||||
## Milestone 1: Stream Configuration Standardization (Retention, Limits, Storage)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make stream configs consistent, explicit, and operationally sane across environments (dev → prod), minimizing surprise and preventing runaway resource usage.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Stream config for each stream is explicitly defined and validated at startup.
|
|
||||||
- Limits (max messages/bytes/age) are explicit and have defaults.
|
|
||||||
- Duplicate windows and dedupe behavior are explicit and tested.
|
|
||||||
- A “no destructive changes on startup” policy is enforced (create if missing; do not silently replace).
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Define a single “stream config policy” module per service (or shared helper):
|
|
||||||
- [ ] `AGGREGATE_EVENTS` subjects + retention policy
|
|
||||||
- [ ] `WORKFLOW_COMMANDS` subjects + retention policy
|
|
||||||
- [ ] `WORKFLOW_EVENTS` subjects + retention policy
|
|
||||||
- [ ] Standardize defaults:
|
|
||||||
- [ ] retention: limits appropriate for replay + rebuild
|
|
||||||
- [ ] `duplicate_window` aligned with producer idempotency strategy
|
|
||||||
- [ ] storage type and replication policy documented and configurable
|
|
||||||
- [ ] Add startup validations:
|
|
||||||
- [ ] verify stream exists and matches required subject set (compatible superset allowed)
|
|
||||||
- [ ] verify required ack/dedupe assumptions hold
|
|
||||||
- [ ] Add tests that parse and validate configs without NATS.
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Unit tests for stream config builders
|
|
||||||
- Existing crate tests
|
|
||||||
|
|
||||||
## Milestone 2: Consumer Policy Standardization (Ack, Backpressure, Poison)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make consumption reliable and cheap under load by standardizing ack policy, concurrency, and poison/deadletter handling.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- All long-lived consumers use explicit ack with consistent `ack_wait`, `max_deliver`, `max_ack_pending`.
|
|
||||||
- Application concurrency is bounded and tied to `max_in_flight`.
|
|
||||||
- Poison policy is consistent:
|
|
||||||
- after `max_deliver`, term + deadletter/quarantine record is written
|
|
||||||
- Replay behavior is deterministic on restart (checkpoint-based where applicable).
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Define standard consumer config defaults:
|
|
||||||
- [ ] `AckPolicy::Explicit`
|
|
||||||
- [ ] `ack_wait` default + env override
|
|
||||||
- [ ] `max_deliver` default + env override
|
|
||||||
- [ ] `max_ack_pending` tied to application concurrency
|
|
||||||
- [ ] Projection:
|
|
||||||
- [ ] Ensure durable consumer naming is collision-free in all modes (Single vs PerView)
|
|
||||||
- [ ] Ensure checkpoint gates ack correctly (skip still acks)
|
|
||||||
- [ ] Ensure poison policy writes durable records and terminates reliably
|
|
||||||
- [ ] Runner:
|
|
||||||
- [ ] Ensure saga/effect consumers use consistent durable naming + deliver groups when scaling out
|
|
||||||
- [ ] Ensure outbox relay preserves exactly-once semantics via dedupe keys + idempotent publish
|
|
||||||
- [ ] Aggregate:
|
|
||||||
- [ ] Ensure ad-hoc fetch consumer is bounded (timeouts) and unique per operation (already required)
|
|
||||||
- [ ] Ensure best-effort cleanup is performed and cannot delete unrelated consumers
|
|
||||||
- [ ] Tests:
|
|
||||||
- [ ] Unit tests for consumer name generation (sanitization + uniqueness)
|
|
||||||
- [ ] NATS-gated tests for ack/redelivery/poison behavior (must be runnable with env flag)
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace fmt/clippy/tests
|
|
||||||
- NATS-gated integration tests for:
|
|
||||||
- redelivery idempotency
|
|
||||||
- poison termination behavior
|
|
||||||
- scale-out with deliver group (where supported)
|
|
||||||
|
|
||||||
## Milestone 3: Connection Management + Failure Semantics (Operational Frugality)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make NATS connection handling stable under partial failure while minimizing resource churn and cascading outages.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- One NATS connection per process (or bounded pool only if justified).
|
|
||||||
- Reconnect/backoff policy is explicit and consistent.
|
|
||||||
- Circuit breaker behavior is consistent (when used), and health/ready reflect NATS state correctly.
|
|
||||||
- No busy-looping on NATS outages.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Standardize connection options:
|
|
||||||
- [ ] reconnect delays/backoff
|
|
||||||
- [ ] max reconnect attempts or “infinite with backoff” strategy (explicit)
|
|
||||||
- [ ] request timeouts around JetStream operations
|
|
||||||
- [ ] Standardize readiness semantics:
|
|
||||||
- [ ] `ready=false` when NATS is unavailable and the node depends on it
|
|
||||||
- [ ] `health` stays “process alive” but reports NATS connectivity in payload
|
|
||||||
- [ ] Add “fast fail” mode for tests and dev (avoid 30x retries when env not set).
|
|
||||||
- [ ] Tests:
|
|
||||||
- [ ] unit tests for backoff behavior (where possible)
|
|
||||||
- [ ] gated integration test: temporary NATS outage does not crash-loop and recovers
|
|
||||||
|
|
||||||
## Milestone 4: Multi-Tenant Scale-Out Guarantees (Collision-Free + Predictable)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Guarantee safe multi-replica behavior: no consumer collisions, no duplicate side effects, predictable throughput with bounded resource usage.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Durable names are deterministic and collision-free across replicas.
|
|
||||||
- Deliver groups are used where appropriate to share work across replicas.
|
|
||||||
- Exactly-once side effects are enforced via idempotency + dedupe keys (not wishful thinking).
|
|
||||||
- A scale-out test suite exists and is gated but runnable.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [ ] Establish consumer naming scheme per service role:
|
|
||||||
- [ ] Projection: per-view durable option uses sanitized names and stable mapping
|
|
||||||
- [ ] Runner: durable prefix includes role + shard + optional group
|
|
||||||
- [ ] Establish deliver group usage rules:
|
|
||||||
- [ ] when to enable (scale-out consumers)
|
|
||||||
- [ ] how to roll without duplication
|
|
||||||
- [ ] Strengthen dedupe keys:
|
|
||||||
- [ ] event-driven sagas: checkpoint + dedupe marker strategy tested under redelivery
|
|
||||||
- [ ] outbox relay: verify publish idempotency with `Nats-Msg-Id`
|
|
||||||
- [ ] Add gated tests:
|
|
||||||
- [ ] two replicas, same tenant, no duplicate publishes
|
|
||||||
- [ ] rolling restart preserves checkpoint correctness
|
|
||||||
|
|
||||||
## Verification Commands (Required at Each Milestone)
|
|
||||||
- `cargo fmt --check`
|
|
||||||
- `cargo clippy --workspace --all-targets -- -D warnings`
|
|
||||||
- `cargo test --workspace`
|
|
||||||
- Gated NATS integration tests:
|
|
||||||
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
|
|
||||||
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
|
|
||||||
- Control API (if it runs NATS-gated tests): set documented env flags and run ignored tests
|
|
||||||
|
|
||||||
## Notes / Constraints
|
|
||||||
- Do not create per-tenant streams unless scaling evidence requires it; prefer subject partitioning and consumer groups.
|
|
||||||
- Prefer backward-compatible envelope changes (optional fields, tolerant decoding).
|
|
||||||
- Prefer stable durable consumers; ephemeral consumers must be unique and bounded and must cleanup best-effort.
|
|
||||||
11
README.md
11
README.md
@@ -4,10 +4,12 @@
|
|||||||
- Rust services (Cargo workspace): `aggregate/`, `gateway/`, `projection/`, `runner/`, `control/api/`, `shared/`
|
- Rust services (Cargo workspace): `aggregate/`, `gateway/`, `projection/`, `runner/`, `control/api/`, `shared/`
|
||||||
- Control UI: `control/ui/`
|
- Control UI: `control/ui/`
|
||||||
- Docker + Swarm + Compose: `docker/`, `docker-compose.yml`, `swarm/`, `observability/`
|
- Docker + Swarm + Compose: `docker/`, `docker-compose.yml`, `swarm/`, `observability/`
|
||||||
- Transport plans:
|
|
||||||
- `TRANSPORT_DEVELOPMENT_PLAN.md`
|
## Documentation
|
||||||
- `GATEWAY_TRANSPORT_PLAN.md`
|
- docs/README.md
|
||||||
- `NATS_TRANSPORT_PLAN.md`
|
- Architecture: docs/architecture/overview.md, docs/architecture/transport.md
|
||||||
|
- Developer: docs/developer/setup.md, docs/developer/testing.md
|
||||||
|
- Usage: docs/usage/quickstart.md, docs/usage/api.md, docs/usage/nats.md
|
||||||
|
|
||||||
## Quick Start (Docker Compose)
|
## Quick Start (Docker Compose)
|
||||||
|
|
||||||
@@ -35,4 +37,5 @@ More details: `DOCKER.md`
|
|||||||
cargo fmt --check
|
cargo fmt --check
|
||||||
cargo clippy --workspace --all-targets -- -D warnings
|
cargo clippy --workspace --all-targets -- -D warnings
|
||||||
cargo test --workspace
|
cargo test --workspace
|
||||||
|
cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build
|
||||||
```
|
```
|
||||||
|
|||||||
@@ -1,339 +0,0 @@
|
|||||||
# Transport Development Plan
|
|
||||||
|
|
||||||
## Purpose
|
|
||||||
Unify and optimize the platform transport layer end-to-end:
|
|
||||||
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
|
|
||||||
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
|
|
||||||
|
|
||||||
This plan merges and supersedes:
|
|
||||||
- `GATEWAY_TRANSPORT_PLAN.md`
|
|
||||||
- `NATS_TRANSPORT_PLAN.md`
|
|
||||||
|
|
||||||
## Current Status (Codebase Reality)
|
|
||||||
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
|
|
||||||
- Request context pieces are standardized:
|
|
||||||
- `shared` provides `TenantId`, `CorrelationId`, `TraceId`
|
|
||||||
- `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
|
|
||||||
- `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
|
|
||||||
- Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
|
|
||||||
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
|
|
||||||
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
|
|
||||||
- Node → NATS header propagation is improved and closer to consistent:
|
|
||||||
- Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
|
|
||||||
- Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
|
|
||||||
- Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
|
|
||||||
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
|
|
||||||
|
|
||||||
## Principles
|
|
||||||
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
|
|
||||||
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
|
|
||||||
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
|
|
||||||
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
|
|
||||||
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
|
|
||||||
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
|
|
||||||
|
|
||||||
## Non-Negotiable Rules (Global)
|
|
||||||
- Every cross-component hop MUST carry tenant + correlation + trace context.
|
|
||||||
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
|
|
||||||
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
|
|
||||||
- Every milestone is stop-the-line gated:
|
|
||||||
- All tasks completed
|
|
||||||
- All tests required by the milestone pass
|
|
||||||
- Workspace verification commands pass
|
|
||||||
- Gated integration tests for the milestone are runnable and documented
|
|
||||||
|
|
||||||
## Baseline (Today)
|
|
||||||
- Gateway → Aggregate: gRPC (command submission)
|
|
||||||
- Gateway → Projection: HTTP (query proxy)
|
|
||||||
- Gateway → Runner: HTTP (admin proxy)
|
|
||||||
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
|
|
||||||
|
|
||||||
## End State (Target Architecture)
|
|
||||||
- Edge contract (clients ↔ Gateway): HTTP/JSON
|
|
||||||
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
|
|
||||||
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
|
|
||||||
- `shared` is the single source of truth for:
|
|
||||||
- header names and injection/extraction rules
|
|
||||||
- trace parsing/validation (`traceparent`, `trace-id`)
|
|
||||||
- context object model (tenant/correlation/trace/request ids)
|
|
||||||
- NATS subject + consumer naming helpers
|
|
||||||
|
|
||||||
## Standard Contracts
|
|
||||||
### Context Fields
|
|
||||||
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
|
|
||||||
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
|
|
||||||
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
|
|
||||||
- Request id: HTTP `x-request-id` (optional for NATS)
|
|
||||||
|
|
||||||
### Standard Service Endpoints (every service)
|
|
||||||
- `GET /health` liveness
|
|
||||||
- `GET /ready` readiness (includes tenant gating if relevant)
|
|
||||||
- `GET /metrics` Prometheus
|
|
||||||
|
|
||||||
## Milestone 0: Shared Transport Contract (Headers + Context + Trace)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- `shared` contains canonical constants for header names and NATS header names.
|
|
||||||
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
|
|
||||||
- Library-level unit tests cover parsing/derivation behavior.
|
|
||||||
- All crates build and tests pass for the workspace.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Add shared ID types in `shared`:
|
|
||||||
- [x] `TenantId`
|
|
||||||
- [x] `CorrelationId`
|
|
||||||
- [x] `TraceId`
|
|
||||||
- [x] Consolidate header constants in `shared`:
|
|
||||||
- [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
|
|
||||||
- [x] HTTP: `x-tenant-id`, `x-request-id`
|
|
||||||
- [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
|
|
||||||
- [x] NATS: `tenant-id`, `Nats-Msg-Id`
|
|
||||||
- [x] Add shared helpers:
|
|
||||||
- [x] derive `trace-id` from `traceparent`
|
|
||||||
- [x] derive `traceparent` from `trace-id` when valid
|
|
||||||
- [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
|
|
||||||
- [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
|
|
||||||
- [x] Add unit tests in `shared` for:
|
|
||||||
- [x] traceparent parsing validity
|
|
||||||
- [x] serialization shape for correlation/trace id newtypes
|
|
||||||
- [x] additional validation cases (invalid traceparents, all-zero ids)
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- `cargo fmt --check`
|
|
||||||
- `cargo clippy --workspace --all-targets -- -D warnings`
|
|
||||||
- `cargo test --workspace`
|
|
||||||
|
|
||||||
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestone 0
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
|
|
||||||
- All NATS producers set required headers consistently.
|
|
||||||
- All NATS consumers tolerate unknown fields and missing optional fields.
|
|
||||||
- “Contract tests” exist per service to verify produced headers and subject formats.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Create/standardize subject builder helpers (prefer `shared`):
|
|
||||||
- [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
|
|
||||||
- [x] Runner effect/effect_result subject builders
|
|
||||||
- [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
|
|
||||||
- [x] Aggregate publishes:
|
|
||||||
- [x] `tenant-id` header always present
|
|
||||||
- [x] correlation + trace headers always present; generated when missing/invalid
|
|
||||||
- [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
|
|
||||||
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
|
|
||||||
- [x] Runner publishes (commands/results):
|
|
||||||
- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
|
|
||||||
- [x] trace headers always present/derived when possible; generated when missing/invalid
|
|
||||||
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
|
|
||||||
- [x] outbox metadata → NATS headers mapping standardized via shared helpers
|
|
||||||
- [x] Projection consumption:
|
|
||||||
- [x] envelope decoding remains tolerant (unknown fields ignored)
|
|
||||||
- [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
|
|
||||||
- [x] Add unit tests:
|
|
||||||
- [x] subject formatting tests (shared builders)
|
|
||||||
- [x] required header presence tests per publisher (Aggregate + Runner)
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
|
|
||||||
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestone 1
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
|
|
||||||
- Services create streams if missing, and validate compatibility on startup.
|
|
||||||
- Startup does not silently replace or destructively mutate existing streams.
|
|
||||||
- Config-only tests validate stream config builders without requiring NATS.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Define stream policies:
|
|
||||||
- [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
|
|
||||||
- [x] `WORKFLOW_COMMANDS` is defined and validated on startup
|
|
||||||
- [x] `WORKFLOW_EVENTS` is defined and validated on startup
|
|
||||||
- [x] Centralize stream policy builders/validators in `shared`
|
|
||||||
- [x] Implement compatibility validation rules:
|
|
||||||
- [x] required subjects are present (superset allowed)
|
|
||||||
- [x] limits/max_age/duplicate window validated against minimums
|
|
||||||
- [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
|
|
||||||
- [x] Add unit tests for stream config builders + validators.
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
|
|
||||||
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestone 2
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
|
|
||||||
- Application-level concurrency is bounded and aligned with `max_in_flight`.
|
|
||||||
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
|
|
||||||
- Gated NATS integration tests prove:
|
|
||||||
- redelivery idempotency
|
|
||||||
- poison termination
|
|
||||||
- scale-out behavior (deliver group) where applicable
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Standardize consumer defaults:
|
|
||||||
- [x] `AckPolicy::Explicit`
|
|
||||||
- [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
|
|
||||||
- [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
|
|
||||||
- [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
|
|
||||||
- [x] Projection:
|
|
||||||
- [x] durable naming collision-free for Single/PerView modes
|
|
||||||
- [x] checkpoint gate semantics: “skip still acks”
|
|
||||||
- [x] poison handling persists durable records and terminates reliably (poison record + term)
|
|
||||||
- [x] Runner:
|
|
||||||
- [x] durable naming collision-free and stable across replicas
|
|
||||||
- [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
|
|
||||||
- [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
|
|
||||||
- [x] Aggregate:
|
|
||||||
- [x] ad-hoc fetch consumer always unique and bounded
|
|
||||||
- [x] best-effort deletion never targets unrelated consumers
|
|
||||||
- [x] Add gated NATS integration tests and document env flags:
|
|
||||||
- [x] Runner ignored tests
|
|
||||||
- [x] Projection ignored tests
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
|
|
||||||
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
|
|
||||||
|
|
||||||
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestone 0 (context contract)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Projection exposes `projection.gateway.v1.QueryService`.
|
|
||||||
- Gateway routes queries via gRPC by default.
|
|
||||||
- Authz remains enforced in Gateway (deny-by-default).
|
|
||||||
- Query responses remain stable for Control UI expectations.
|
|
||||||
- New gRPC query tests pass (unit + integration).
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Define protobuf API: `projection.gateway.v1.QueryService`
|
|
||||||
- [x] Implement Projection gRPC server for query execution
|
|
||||||
- [x] Implement Gateway gRPC client routing to Projection
|
|
||||||
- [x] deadlines
|
|
||||||
- [x] bounded retries (idempotent only)
|
|
||||||
- [x] context propagation
|
|
||||||
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
|
|
||||||
- [x] route internally to gRPC
|
|
||||||
- [x] Add tests:
|
|
||||||
- [x] authz + forwarding via gRPC
|
|
||||||
- [x] tenant isolation enforcement in Projection QueryService
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
|
|
||||||
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestone 0 (context contract)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Runner exposes `runner.admin.v1.RunnerAdmin`.
|
|
||||||
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
|
|
||||||
- Tenant-spoof and unauthorized calls are rejected deterministically.
|
|
||||||
- Runner drain/readiness semantics validated and tested.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
|
|
||||||
- [x] Implement Runner gRPC admin server
|
|
||||||
- [x] Implement Gateway gRPC client integration for admin operations
|
|
||||||
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
|
|
||||||
- [x] Add tests:
|
|
||||||
- [x] Gateway: rejects without rights
|
|
||||||
- [x] Gateway: rejects tenant spoof attempts
|
|
||||||
- [x] Runner: idempotency and drain semantics
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
|
|
||||||
## Milestone 6: Gateway Upstream Performance + Operational Guardrails
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestones 4–5 (gRPC internal RPC surfaces available)
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
|
|
||||||
- Deadlines everywhere; retries only for idempotent operations.
|
|
||||||
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
|
|
||||||
- Gated load/soak tests exist and are runnable.
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Implement upstream channel pool
|
|
||||||
- [x] bounded LRU
|
|
||||||
- [x] TTL/eviction
|
|
||||||
- [x] fast-path reuse under load (cached gRPC channels)
|
|
||||||
- [x] Standardize retry profiles
|
|
||||||
- [x] read-only: limited retry with jitter (Gateway gRPC calls)
|
|
||||||
- [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
|
|
||||||
- [x] Standardize timeouts/deadlines:
|
|
||||||
- [x] edge timeout limits
|
|
||||||
- [x] internal per-service deadlines
|
|
||||||
- [x] Fanout controls:
|
|
||||||
- [x] concurrency limiters for probes/snapshots
|
|
||||||
- [x] short TTL caching where safe
|
|
||||||
- [x] Ensure probes carry context (correlation/trace) for observability.
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
- Gated load/soak tests (document env + how to run)
|
|
||||||
|
|
||||||
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
|
|
||||||
|
|
||||||
### Dependencies
|
|
||||||
- Milestone 6
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
|
|
||||||
|
|
||||||
### Exit Criteria
|
|
||||||
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
|
|
||||||
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
|
|
||||||
- End-to-end smoke tests pass (gated).
|
|
||||||
|
|
||||||
### Tasks
|
|
||||||
- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
|
|
||||||
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
|
|
||||||
- [x] Ensure Control UI + Control API rely only on standardized surfaces
|
|
||||||
- [x] Harden metrics and readiness probes to match the standard contract everywhere
|
|
||||||
|
|
||||||
### Required Tests
|
|
||||||
- Workspace verification commands
|
|
||||||
- End-to-end smoke tests (gated)
|
|
||||||
|
|
||||||
## Workspace Verification Commands (Run for Every Milestone)
|
|
||||||
- `cargo fmt --check`
|
|
||||||
- `cargo clippy --workspace --all-targets -- -D warnings`
|
|
||||||
- `cargo test --workspace`
|
|
||||||
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
|
|
||||||
18
docs/README.md
Normal file
18
docs/README.md
Normal file
@@ -0,0 +1,18 @@
|
|||||||
|
# Cloudlysis Documentation
|
||||||
|
|
||||||
|
## Sections
|
||||||
|
- Architecture
|
||||||
|
- docs/architecture/overview.md
|
||||||
|
- docs/architecture/transport.md
|
||||||
|
- Developer
|
||||||
|
- docs/developer/setup.md
|
||||||
|
- docs/developer/testing.md
|
||||||
|
- Usage
|
||||||
|
- docs/usage/quickstart.md
|
||||||
|
- docs/usage/api.md
|
||||||
|
- docs/usage/nats.md
|
||||||
|
|
||||||
|
## Conventions
|
||||||
|
- HTTP edge remains JSON over REST via Gateway.
|
||||||
|
- Internal RPC uses gRPC between Gateway and nodes (Aggregate, Projection, Runner).
|
||||||
|
- Async backbone uses NATS JetStream and KV with standardized subjects, headers, and stream/consumer policies.
|
||||||
21
docs/architecture/overview.md
Normal file
21
docs/architecture/overview.md
Normal file
@@ -0,0 +1,21 @@
|
|||||||
|
# Architecture Overview
|
||||||
|
|
||||||
|
## Monorepo
|
||||||
|
- Rust workspace: aggregate, projection, runner, gateway, control/api, shared
|
||||||
|
- Frontend: control/ui
|
||||||
|
- Infra: docker, observability, swarm
|
||||||
|
|
||||||
|
## Data Flow
|
||||||
|
- Clients → Gateway (HTTP/JSON)
|
||||||
|
- Gateway ↔ Nodes (gRPC)
|
||||||
|
- Nodes ↔ NATS (JetStream + KV)
|
||||||
|
|
||||||
|
## Services
|
||||||
|
- Aggregate: command handling + event sourcing; publishes events to JetStream
|
||||||
|
- Projection: materialized views; consumes aggregate events; exposes QueryService (gRPC)
|
||||||
|
- Runner: workflow/saga engine + effects/outbox; exposes RunnerAdmin (gRPC)
|
||||||
|
- Gateway: edge, authn/z, routing to nodes, admin entry points
|
||||||
|
|
||||||
|
## Observability
|
||||||
|
- /health, /ready, /metrics on all services
|
||||||
|
- Correlation and tracing propagated across HTTP, gRPC, and NATS
|
||||||
23
docs/architecture/transport.md
Normal file
23
docs/architecture/transport.md
Normal file
@@ -0,0 +1,23 @@
|
|||||||
|
# Transport Contracts
|
||||||
|
|
||||||
|
## Context
|
||||||
|
- Tenant: HTTP x-tenant-id; NATS tenant-id
|
||||||
|
- Correlation: HTTP x-correlation-id; NATS x-correlation-id + correlation-id
|
||||||
|
- Trace: HTTP traceparent; NATS traceparent + trace-id
|
||||||
|
|
||||||
|
## Internal RPC (gRPC)
|
||||||
|
- Aggregate: CommandService (submit commands)
|
||||||
|
- Projection: QueryService (execute queries)
|
||||||
|
- Runner: RunnerAdmin (drain, status, reload)
|
||||||
|
- All calls set deadlines and propagate context metadata
|
||||||
|
|
||||||
|
## NATS JetStream
|
||||||
|
- Streams:
|
||||||
|
- AGGREGATE_EVENTS: tenant.*.aggregate.*.*
|
||||||
|
- WORKFLOW_COMMANDS, WORKFLOW_EVENTS
|
||||||
|
- Producers set headers: tenant-id, Nats-Msg-Id, correlation, traceparent/trace-id
|
||||||
|
- Consumers use AckPolicy::Explicit, bounded ack_wait, max_deliver, max_ack_pending
|
||||||
|
|
||||||
|
## Routing
|
||||||
|
- Gateway routes per-tenant to shards for Aggregate/Projection/Runner
|
||||||
|
- Routing tables hot-reload atomically
|
||||||
26
docs/developer/setup.md
Normal file
26
docs/developer/setup.md
Normal file
@@ -0,0 +1,26 @@
|
|||||||
|
# Developer Setup
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
- Rust toolchain (stable)
|
||||||
|
- Node.js (LTS) for control/ui
|
||||||
|
- Docker (optional) for local stack
|
||||||
|
|
||||||
|
## Build
|
||||||
|
```bash
|
||||||
|
cargo build
|
||||||
|
cd control/ui && npm ci && npm run build
|
||||||
|
```
|
||||||
|
|
||||||
|
## Workspace Verification
|
||||||
|
```bash
|
||||||
|
cargo fmt --check
|
||||||
|
cargo clippy --workspace --all-targets -- -D warnings
|
||||||
|
cargo test --workspace
|
||||||
|
cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build
|
||||||
|
```
|
||||||
|
|
||||||
|
## Environment
|
||||||
|
- Gateway: routing config using file or KV
|
||||||
|
- Projection: PROJECTION_GRPC_ADDR
|
||||||
|
- Runner: RUNNER_GRPC_ADDR
|
||||||
|
- NATS: URLs via service-specific settings
|
||||||
27
docs/developer/testing.md
Normal file
27
docs/developer/testing.md
Normal file
@@ -0,0 +1,27 @@
|
|||||||
|
# Testing
|
||||||
|
|
||||||
|
## Unit and Integration
|
||||||
|
```bash
|
||||||
|
cargo test --workspace
|
||||||
|
```
|
||||||
|
|
||||||
|
## Gated Tests (require external services)
|
||||||
|
- Runner NATS:
|
||||||
|
```bash
|
||||||
|
RUNNER_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p runner -- --ignored
|
||||||
|
```
|
||||||
|
- Projection NATS:
|
||||||
|
```bash
|
||||||
|
PROJECTION_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -p projection -- --ignored
|
||||||
|
```
|
||||||
|
- Docker-based gates:
|
||||||
|
```bash
|
||||||
|
cargo test -p gateway -- --ignored
|
||||||
|
```
|
||||||
|
|
||||||
|
## Control UI
|
||||||
|
```bash
|
||||||
|
cd control/ui
|
||||||
|
npm ci
|
||||||
|
npm run test
|
||||||
|
```
|
||||||
38
docs/usage/api.md
Normal file
38
docs/usage/api.md
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
# Usage: API Examples
|
||||||
|
|
||||||
|
## Projection Query via Gateway (HTTP → gRPC)
|
||||||
|
```bash
|
||||||
|
curl -sS -X POST \
|
||||||
|
-H "x-tenant-id: tenant-a" \
|
||||||
|
-H "x-correlation-id: demo" \
|
||||||
|
-H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \
|
||||||
|
http://localhost:8080/v1/query/User \
|
||||||
|
-d '{"uqf":"{\"eq\":{\"id\":\"u1\"}}"}'
|
||||||
|
```
|
||||||
|
|
||||||
|
## Projection Query via gRPC (direct, internal)
|
||||||
|
```bash
|
||||||
|
grpcurl -d '{"tenant_id":"tenant-a","view_type":"User","uqf":"{}"}' \
|
||||||
|
-H 'x-tenant-id: tenant-a' \
|
||||||
|
-H 'x-correlation-id: demo' \
|
||||||
|
-H 'traceparent: 00-00000000000000000000000000000001-0000000000000001-01' \
|
||||||
|
-plaintext localhost:9090 projection.gateway.v1.QueryService/ExecuteQuery
|
||||||
|
```
|
||||||
|
|
||||||
|
## Aggregate Command via Gateway (HTTP → gRPC)
|
||||||
|
```bash
|
||||||
|
curl -sS -X POST \
|
||||||
|
-H "x-tenant-id: tenant-a" \
|
||||||
|
-H "x-correlation-id: demo" \
|
||||||
|
-H "traceparent: 00-00000000000000000000000000000001-0000000000000001-01" \
|
||||||
|
http://localhost:8080/v1/aggregate/BankAccount/command \
|
||||||
|
-d '{"id":"acc-1","command_type":"Open","payload":{"owner":"Alice"}}'
|
||||||
|
```
|
||||||
|
|
||||||
|
## Runner Admin via Gateway (HTTP → gRPC)
|
||||||
|
```bash
|
||||||
|
curl -sS -X POST \
|
||||||
|
-H "x-tenant-id: tenant-a" \
|
||||||
|
-H "authorization: Bearer <token>" \
|
||||||
|
http://localhost:8080/admin/runner/drain?wait_ms=0
|
||||||
|
```
|
||||||
17
docs/usage/nats.md
Normal file
17
docs/usage/nats.md
Normal file
@@ -0,0 +1,17 @@
|
|||||||
|
# NATS Reference
|
||||||
|
|
||||||
|
## Subjects
|
||||||
|
- Aggregate events: tenant.<tenant>.aggregate.<type>.<id>
|
||||||
|
- Workflow commands/events: shared helpers define exact formats
|
||||||
|
|
||||||
|
## Headers (Producers)
|
||||||
|
- tenant-id: required
|
||||||
|
- Nats-Msg-Id: idempotency key (event_id, command_id, etc.)
|
||||||
|
- x-correlation-id and correlation-id
|
||||||
|
- traceparent and trace-id
|
||||||
|
|
||||||
|
## Consumers
|
||||||
|
- AckPolicy::Explicit
|
||||||
|
- ack_wait: bounded timeout
|
||||||
|
- max_deliver: bounded
|
||||||
|
- max_ack_pending: aligned with concurrency
|
||||||
26
docs/usage/quickstart.md
Normal file
26
docs/usage/quickstart.md
Normal file
@@ -0,0 +1,26 @@
|
|||||||
|
# Quick Start
|
||||||
|
|
||||||
|
## Compose
|
||||||
|
```bash
|
||||||
|
docker compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
Full stack with observability:
|
||||||
|
```bash
|
||||||
|
docker compose -f docker-compose.yml -f observability/docker-compose.yml up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
## Local Dev (minimal)
|
||||||
|
- Start NATS locally
|
||||||
|
- Run services:
|
||||||
|
```bash
|
||||||
|
cargo run -p aggregate
|
||||||
|
cargo run -p projection
|
||||||
|
cargo run -p runner
|
||||||
|
cargo run -p gateway
|
||||||
|
```
|
||||||
|
|
||||||
|
## Verify
|
||||||
|
- Gateway: GET /health, /ready, /metrics
|
||||||
|
- Projection: GET /health, /ready, /metrics
|
||||||
|
- Runner: GET /health, /ready, /metrics
|
||||||
Reference in New Issue
Block a user