# Transport Development Plan

## Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate

This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`

## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
  - Most call sites now use `shared` constants/helpers; remaining gaps should be treated as milestone-gated
- Gateway → Aggregate is already HTTP (edge) → gRPC (internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection (`/v1/query/...`) and Gateway → Runner admin (`/admin/runner/...`) keep their HTTP edge paths but now route internally over gRPC (see Milestones 4, 5, and 7).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating them when missing.
  - Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating them when missing.
  - Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

## Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.

## Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented

## Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`

## End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers

## Standard Contracts
### Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)

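
As a rough illustration of the NATS side of the Context Fields contract, the sketch below turns a context object into the header pairs listed above. The header names come from this plan; `RequestContext`, `nats_context_headers`, and the constant names are hypothetical and are not the actual `shared` API.

```rust
// Header names as listed under Context Fields (NATS column).
pub const NATS_TENANT_ID: &str = "tenant-id";
pub const NATS_X_CORRELATION_ID: &str = "x-correlation-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const NATS_TRACEPARENT: &str = "traceparent";
pub const NATS_TRACE_ID: &str = "trace-id";

/// Hypothetical context carrier; the real `shared` types are
/// `TenantId`, `CorrelationId`, and `TraceId` newtypes.
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: String,
}

/// Builds the header pairs a producer would attach to a NATS publish.
/// Correlation is emitted under both names; `trace-id` is added only
/// when it can be derived from a well-formed `traceparent`.
pub fn nats_context_headers(ctx: &RequestContext) -> Vec<(&'static str, String)> {
    let mut headers = vec![
        (NATS_TENANT_ID, ctx.tenant_id.clone()),
        (NATS_X_CORRELATION_ID, ctx.correlation_id.clone()),
        (NATS_CORRELATION_ID, ctx.correlation_id.clone()),
        (NATS_TRACEPARENT, ctx.traceparent.clone()),
    ];
    // The trace id is the second dash-separated field of `traceparent`.
    if let Some(trace_id) = ctx.traceparent.split('-').nth(1) {
        if trace_id.len() == 32 && trace_id.chars().all(|c| c.is_ascii_hexdigit()) {
            headers.push((NATS_TRACE_ID, trace_id.to_string()));
        }
    }
    headers
}
```

A publisher would copy these pairs into whatever header map its NATS client exposes; `Nats-Msg-Id` is left out here because its value depends on the message (see Milestone 1).
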
### Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus

## Milestone 0: Shared Transport Contract (Headers + Context + Trace)

### Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.

### Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.

### Tasks
- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [x] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [x] HTTP: `x-tenant-id`, `x-request-id`
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [x] NATS: `tenant-id`, `Nats-Msg-Id`
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
  - [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [x] additional validation cases (invalid traceparents, all-zero ids)

### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`

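
The trace derivation helpers named in the Milestone 0 tasks could look roughly like the sketch below. It follows the W3C `traceparent` version `00` layout (`00-<32 hex>-<16 hex>-<2 hex>`) and rejects all-zero ids, matching the validation cases listed above. The real signatures in `shared` are not shown in this plan, so the parameters here (in particular the explicit `span_id` argument and the fixed `01` flags) are assumptions.

```rust
/// True when `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// True when every byte is the ASCII digit zero.
fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Extracts the trace id from a version-00 `traceparent`, or `None`
/// when the value is malformed or carries an all-zero id.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
    let valid = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id)
        && is_lower_hex(flags, 2);
    if valid {
        Some(trace_id.to_string())
    } else {
        None
    }
}

/// Rebuilds a `traceparent` from a trace id and a caller-supplied span id.
/// Assumption: the sampled flag (`01`) is always set in this sketch.
pub fn traceparent_from_trace_id(trace_id: &str, span_id: &str) -> Option<String> {
    let valid = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(span_id, 16)
        && !is_all_zero(span_id);
    if valid {
        Some(format!("00-{}-{}-01", trace_id, span_id))
    } else {
        None
    }
}
```

Returning `Option` keeps the "generate when missing/invalid" decision in the normalization helpers rather than in the parser.
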
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)

### Dependencies
- Milestone 0

### Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.

### Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.

### Tasks
- [x] Create/standardize subject builder helpers (prefer `shared`):
  - [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
  - [x] Runner effect/effect_result subject builders
  - [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- [x] Aggregate publishes:
  - [x] `tenant-id` header always present
  - [x] correlation + trace headers always present; generated when missing/invalid
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- [x] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
  - [x] trace headers always present/derived when possible; generated when missing/invalid
  - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
  - [x] outbox metadata → NATS headers mapping standardized via shared helpers
- [x] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- [x] Add unit tests:
  - [x] subject formatting tests (shared builders)
  - [x] required header presence tests per publisher (Aggregate + Runner)

### Required Tests
- Workspace verification commands

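
To make the "concrete subjects only" rule from Milestone 1 tangible, here is a minimal sketch of an Aggregate event subject builder for the `tenant.<tenant>.aggregate.<type>.<id>` shape. The function and error names are hypothetical; the token rules (no empty tokens, no `.`, `*`, `>`, or whitespace) are an assumption about what such a builder would reject, based on how NATS treats those characters in subjects.

```rust
#[derive(Debug, PartialEq)]
pub enum SubjectError {
    EmptyToken,
    InvalidToken(String),
}

/// Rejects tokens that would change the subject's shape: empty tokens,
/// the token separator, wildcards, and whitespace.
fn validate_token(token: &str) -> Result<&str, SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken);
    }
    let bad = |c: char| c == '.' || c == '*' || c == '>' || c.is_whitespace();
    if token.chars().any(bad) {
        return Err(SubjectError::InvalidToken(token.to_string()));
    }
    Ok(token)
}

/// Builds `tenant.<tenant>.aggregate.<type>.<id>` from validated tokens,
/// so a producer can never publish to a wildcard subject by accident.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        validate_token(tenant)?,
        validate_token(aggregate_type)?,
        validate_token(aggregate_id)?
    ))
}
```

Validating each token also protects tenant isolation: a tenant id containing `.` or `>` could otherwise produce a subject that lands in another tenant's namespace.
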
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)

### Dependencies
- Milestone 1

### Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.

### Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.

### Tasks
- [x] Define stream policies:
  - [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
  - [x] `WORKFLOW_COMMANDS` is defined and validated on startup
  - [x] `WORKFLOW_EVENTS` is defined and validated on startup
- [x] Centralize stream policy builders/validators in `shared`
- [x] Implement compatibility validation rules:
  - [x] required subjects are present (superset allowed)
  - [x] limits/max_age/duplicate window validated against minimums
  - [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- [x] Add unit tests for stream config builders + validators.

### Required Tests
- Workspace verification commands

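
The compatibility rules in the Milestone 2 tasks (required subjects present with supersets allowed, values checked against minimums) can be expressed without a NATS connection, which is what makes config-only tests possible. The sketch below is one way to do that; all type and field names are hypothetical, subject comparison is exact-string rather than wildcard-aware, and treating a zero `max_age` as "unlimited" is an assumption about how the observed config is reported.

```rust
use std::time::Duration;

/// What the service requires of a stream (hypothetical shape).
#[derive(Debug, Clone)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub min_max_age: Duration,
    pub min_duplicate_window: Duration,
}

/// What the server reports for an existing stream (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ObservedStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

#[derive(Debug, PartialEq)]
pub enum PolicyViolation {
    MissingSubject(String),
    MaxAgeTooSmall,
    DuplicateWindowTooSmall,
}

/// Returns every violation instead of mutating the stream, so startup
/// can fail loudly rather than silently replace an existing stream.
pub fn validate_stream(policy: &StreamPolicy, observed: &ObservedStream) -> Vec<PolicyViolation> {
    let mut violations = Vec::new();
    // Extra subjects on the stream are fine; missing ones are not.
    for subject in &policy.required_subjects {
        if !observed.subjects.contains(subject) {
            violations.push(PolicyViolation::MissingSubject(subject.clone()));
        }
    }
    // Assumption: a zero max_age means "no age limit" and always passes.
    if observed.max_age != Duration::ZERO && observed.max_age < policy.min_max_age {
        violations.push(PolicyViolation::MaxAgeTooSmall);
    }
    // The duplicate window must cover the producers' Nats-Msg-Id retry horizon.
    if observed.duplicate_window < policy.min_duplicate_window {
        violations.push(PolicyViolation::DuplicateWindowTooSmall);
    }
    violations
}
```

Collecting all violations in one pass gives operators a complete picture in a single startup log line instead of one failure per restart.
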
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)

### Dependencies
- Milestone 2

### Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.

### Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable

### Tasks
- [x] Standardize consumer defaults:
  - [x] `AckPolicy::Explicit`
  - [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
  - [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
  - [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- [x] Projection:
  - [x] durable naming collision-free for Single/PerView modes
  - [x] checkpoint gate semantics: “skip still acks”
  - [x] poison handling persists durable records and terminates reliably (poison record + term)
- [x] Runner:
  - [x] durable naming collision-free and stable across replicas
  - [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
  - [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- [x] Aggregate:
  - [x] ad-hoc fetch consumer always unique and bounded
  - [x] best-effort deletion never targets unrelated consumers
- [x] Add gated NATS integration tests and document env flags:
  - [x] Runner ignored tests
  - [x] Projection ignored tests

### Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`

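
The consumer defaults in Milestone 3 combine three ideas: a default per setting, an environment override, and `max_ack_pending` tied to worker concurrency. A minimal sketch follows. The struct, function names, and the default values (30 s `ack_wait`, 5 deliveries) are illustrative assumptions, not the values used by Runner or Projection; the overrides are passed in as raw strings so the logic can be tested without touching the process environment.

```rust
use std::time::Duration;

/// Resolved consumer settings (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerSettings {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Parses an optional override, falling back to the default when the
/// value is absent or does not parse.
fn parse_or<T: std::str::FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

/// Resolves settings from optional overrides (for example the values of
/// the `*_ACK_TIMEOUT_MS` and `*_MAX_DELIVER` variables) and the worker
/// concurrency limit.
pub fn consumer_settings(
    ack_timeout_ms: Option<&str>,
    max_deliver: Option<&str>,
    max_in_flight: usize,
) -> ConsumerSettings {
    // Illustrative defaults only.
    let ack_ms: u64 = parse_or(ack_timeout_ms, 30_000);
    let deliveries: i64 = parse_or(max_deliver, 5);
    ConsumerSettings {
        ack_wait: Duration::from_millis(ack_ms.max(1)),
        max_deliver: deliveries.max(1),
        // The server never hands out more unacked messages than the
        // application is willing to process concurrently.
        max_ack_pending: max_in_flight.max(1) as i64,
    }
}
```

A caller would pass something like `std::env::var(name).ok().as_deref()` for each override, where `name` is the service-specific variable.
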
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)

### Dependencies
- Milestone 0 (context contract)

### Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.

### Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).

### Tasks
- [x] Define protobuf API: `projection.gateway.v1.QueryService`
- [x] Implement Projection gRPC server for query execution
- [x] Implement Gateway gRPC client routing to Projection
  - [x] deadlines
  - [x] bounded retries (idempotent only)
  - [x] context propagation
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [x] route internally to gRPC
- [x] Add tests:
  - [x] authz + forwarding via gRPC
  - [x] tenant isolation enforcement in Projection QueryService
- Workspace verification commands (required tests)

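
The "bounded retries (idempotent only)" rule for the Gateway gRPC client can be captured as a small retry profile: read-only calls get a few attempts with capped, jittered backoff, and mutations get exactly one attempt. The sketch below is illustrative; the type names, attempt counts, and backoff values are assumptions, and the jitter is supplied by the caller (in permille) so the function stays deterministic and testable.

```rust
use std::time::Duration;

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum CallKind {
    ReadOnly,
    Mutation,
}

/// Retry limits for one class of call (hypothetical shape).
pub struct RetryProfile {
    /// Total attempts, including the first call.
    pub max_attempts: u32,
    pub base_backoff: Duration,
    pub max_backoff: Duration,
}

impl RetryProfile {
    /// Illustrative values: three attempts for reads, one for mutations.
    pub fn for_kind(kind: CallKind) -> Self {
        match kind {
            CallKind::ReadOnly => RetryProfile {
                max_attempts: 3,
                base_backoff: Duration::from_millis(50),
                max_backoff: Duration::from_millis(400),
            },
            CallKind::Mutation => RetryProfile {
                max_attempts: 1,
                base_backoff: Duration::ZERO,
                max_backoff: Duration::ZERO,
            },
        }
    }

    /// Delay before retry number `attempt` (1 = first retry), or `None`
    /// when no further attempt is allowed. `jitter_permille` in 0..=1000
    /// scales the capped exponential delay ("full jitter").
    pub fn backoff(&self, attempt: u32, jitter_permille: u32) -> Option<Duration> {
        if attempt == 0 || attempt >= self.max_attempts {
            return None;
        }
        let factor = 1u32 << (attempt - 1).min(16);
        let capped = self.base_backoff.saturating_mul(factor).min(self.max_backoff);
        Some(capped.saturating_mul(jitter_permille.min(1000)) / 1000)
    }
}
```

The overall request deadline still bounds the whole sequence; a retry loop would stop as soon as the remaining deadline is shorter than the next delay.
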
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)

### Dependencies
- Milestone 0 (context contract)

### Goal
Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.

### Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.

### Tasks
- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [x] Implement Runner gRPC admin server
- [x] Implement Gateway gRPC client integration for admin operations
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [x] Add tests:
  - [x] Gateway: rejects without rights
  - [x] Gateway: rejects tenant spoof attempts
  - [x] Runner: idempotency and drain semantics

### Required Tests
- Workspace verification commands

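
This plan does not spell out the Runner drain semantics, so the following is only one plausible reading of "idempotency and drain semantics": draining is a one-way switch, repeated drain calls are harmless, and readiness turns false once draining starts. `DrainState` and its methods are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Shared drain flag (hypothetical); a gRPC admin handler and the
/// `/ready` endpoint would both consult the same instance.
pub struct DrainState {
    draining: AtomicBool,
}

impl DrainState {
    pub const fn new() -> Self {
        DrainState {
            draining: AtomicBool::new(false),
        }
    }

    /// Starts draining. Returns `true` only for the call that actually
    /// flipped the state, so repeated requests are safe no-ops.
    pub fn drain(&self) -> bool {
        !self.draining.swap(true, Ordering::SeqCst)
    }

    /// Readiness as it would be reported while draining is in effect.
    pub fn is_ready(&self) -> bool {
        !self.draining.load(Ordering::SeqCst)
    }
}
```

Using an atomic swap means two concurrent drain requests cannot both believe they initiated the drain, which keeps any one-time shutdown work from running twice.
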
## Milestone 6: Gateway Upstream Performance + Operational Guardrails

### Dependencies
- Milestones 4–5 (gRPC internal RPC surfaces available)

### Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.

### Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.

### Tasks
- [x] Implement upstream channel pool
  - [x] bounded LRU
  - [x] TTL/eviction
  - [x] fast-path reuse under load (cached gRPC channels)
- [x] Standardize retry profiles
  - [x] read-only: limited retry with jitter (Gateway gRPC calls)
  - [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- [x] Standardize timeouts/deadlines:
  - [x] edge timeout limits
  - [x] internal per-service deadlines
- [x] Fanout controls:
  - [x] concurrency limiters for probes/snapshots
  - [x] short TTL caching where safe
- [x] Ensure probes carry context (correlation/trace) for observability.

### Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)

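
The bounded LRU + TTL channel pool from Milestone 6 can be sketched with a plain map keyed by upstream address. This is not the Gateway's implementation: `BoundedPool` is a hypothetical name, the value type is generic (`V: Clone` stands in for a cheaply clonable gRPC channel handle), eviction is a linear scan that is only reasonable for small pools, and the current time is passed in so expiry can be tested deterministically.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<V> {
    value: V,
    created: Instant,
    last_used: Instant,
}

/// Pool with a hard size bound (LRU eviction) and a per-entry TTL.
pub struct BoundedPool<V> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<V>>,
}

impl<V: Clone> BoundedPool<V> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        BoundedPool {
            capacity: capacity.max(1),
            ttl,
            entries: HashMap::new(),
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns the cached value for `key`, or builds one with `make`.
    pub fn get_or_insert_with(&mut self, key: &str, now: Instant, make: impl FnOnce() -> V) -> V {
        // Drop the entry first if it has outlived its TTL.
        let expired = self
            .entries
            .get(key)
            .map_or(false, |e| now.duration_since(e.created) >= self.ttl);
        if expired {
            self.entries.remove(key);
        }
        // Fast path: reuse the cached value and refresh its recency.
        if let Some(entry) = self.entries.get_mut(key) {
            entry.last_used = now;
            return entry.value.clone();
        }
        // At capacity: evict the least recently used entry.
        if self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = victim {
                self.entries.remove(&k);
            }
        }
        let value = make();
        let entry = Entry { value: value.clone(), created: now, last_used: now };
        self.entries.insert(key.to_string(), entry);
        value
    }
}
```

In a concurrent Gateway the pool would sit behind a mutex or similar guard; the TTL forces periodic reconnects so stale upstream addresses do not linger, while the LRU bound caps the number of open connections.
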
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)

### Dependencies
- Milestone 6

### Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.

### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).

### Tasks
- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- [x] Ensure Control UI + Control API rely only on standardized surfaces
- [x] Harden metrics and readiness probes to match the standard contract everywhere

### Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)

## Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)