cloudlysis/TRANSPORT_DEVELOPMENT_PLAN.md
Vlad Durnea 90c307016d
transport: complete M0–M7
shared: add stream+consumer policy helpers; NATS context header builder

aggregate/runner/projection: centralize stream validation and header usage; set bounded consumer params

projection: add QueryService gRPC and wire into main; settings include PROJECTION_GRPC_ADDR

gateway: gRPC routing to Projection/Runner with deadlines; bounded read-only retries; pooled gRPC channels (bounded LRU+TTL); admin proxy forwards to gRPC; probes use concurrency limiter + TTL cache

runner: add RunnerAdmin gRPC server (drain, status, reload) and wire into main; settings include RUNNER_GRPC_ADDR

tests: add gateway authz for runner admin, projection tenant isolation, runner admin drain semantics

docs: update TRANSPORT_DEVELOPMENT_PLAN to reflect completed milestones and details
2026-03-30 14:24:14 +03:00
# Transport Development Plan
## Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`
## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
  - Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
  - Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
  - Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
## Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
## Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented
## Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
## End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers
## Standard Contracts
### Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
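The field mapping above can be sketched as a small header builder. This is an illustration only: the constant names, the `RequestContext` struct, and the `nats_headers` function are hypothetical and are not taken from the `shared` crate; only the header name strings come from this plan.

```rust
use std::collections::BTreeMap;

// Header names as listed under "Context Fields". Constant names are hypothetical.
pub const HTTP_TENANT_ID: &str = "x-tenant-id";
pub const HTTP_CORRELATION_ID: &str = "x-correlation-id";
pub const HTTP_TRACEPARENT: &str = "traceparent";
pub const HTTP_REQUEST_ID: &str = "x-request-id";
pub const NATS_TENANT_ID: &str = "tenant-id";
pub const NATS_X_CORRELATION_ID: &str = "x-correlation-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const NATS_TRACEPARENT: &str = "traceparent";
pub const NATS_TRACE_ID: &str = "trace-id";
pub const NATS_MSG_ID: &str = "Nats-Msg-Id";

/// Hypothetical context object; the real `shared` model may differ.
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: String,
}

/// Builds the NATS header set for one published message.
pub fn nats_headers(ctx: &RequestContext, msg_id: &str) -> BTreeMap<&'static str, String> {
    let mut headers = BTreeMap::new();
    headers.insert(NATS_TENANT_ID, ctx.tenant_id.clone());
    // Correlation is emitted under both names, as the contract requires.
    headers.insert(NATS_X_CORRELATION_ID, ctx.correlation_id.clone());
    headers.insert(NATS_CORRELATION_ID, ctx.correlation_id.clone());
    headers.insert(NATS_TRACEPARENT, ctx.traceparent.clone());
    headers.insert(NATS_MSG_ID, msg_id.to_string());
    // `trace-id` is derived when possible: second field of a traceparent.
    // Full validation is left out of this sketch.
    if let Some(trace_id) = ctx.traceparent.split('-').nth(1).filter(|t| t.len() == 32) {
        headers.insert(NATS_TRACE_ID, trace_id.to_string());
    }
    headers
}
```

A publisher would copy the resulting map into its NATS client's header type before publishing; that conversion depends on the client library and is not shown.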
### Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus
## Milestone 0: Shared Transport Contract (Headers + Context + Trace)
### Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
### Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
### Tasks
- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [x] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [x] HTTP: `x-tenant-id`, `x-request-id`
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [x] NATS: `tenant-id`, `Nats-Msg-Id`
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
  - [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [x] additional validation cases (invalid traceparents, all-zero ids)
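As an illustration of the trace helpers and their validation cases, here is a minimal sketch of the two derivations. The function names match those mentioned in this plan, but the signatures, return types, and the explicit `parent_id` parameter are assumptions; the real `shared` helpers may differ. The layout checked is the W3C `traceparent` form `version-traceid-parentid-flags`.

```rust
/// True when `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Derives the 32-hex-digit trace id from a `traceparent` value.
/// Returns `None` for malformed input or all-zero trace/parent ids.
/// Assumption: only the four-field layout is accepted in this sketch.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    if !is_lower_hex(version, 2) || version == "ff" {
        return None; // version ff is reserved as invalid
    }
    if !is_lower_hex(trace_id, 32) || is_all_zero(trace_id) {
        return None;
    }
    if !is_lower_hex(parent_id, 16) || is_all_zero(parent_id) {
        return None;
    }
    if !is_lower_hex(flags, 2) {
        return None;
    }
    Some(trace_id.to_string())
}

/// Builds a version-00 `traceparent` from a valid trace id.
/// Assumption: the caller supplies the 16-hex-digit parent (span) id,
/// and the sampled flag is always set.
pub fn traceparent_from_trace_id(trace_id: &str, parent_id: &str) -> Option<String> {
    let valid = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id);
    valid.then(|| format!("00-{trace_id}-{parent_id}-01"))
}
```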
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
### Dependencies
- Milestone 0
### Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
### Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
### Tasks
- [x] Create/standardize subject builder helpers (prefer `shared`):
  - [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
  - [x] Runner effect/effect_result subject builders
  - [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- [x] Aggregate publishes:
  - [x] `tenant-id` header always present
  - [x] correlation + trace headers always present; generated when missing/invalid
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- [x] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
  - [x] trace headers always present/derived when possible; generated when missing/invalid
  - [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
  - [x] outbox metadata → NATS headers mapping standardized via shared helpers
- [x] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- [x] Add unit tests:
  - [x] subject formatting tests (shared builders)
  - [x] required header presence tests per publisher (Aggregate + Runner)
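A subject builder that only ever yields concrete subjects might look like the sketch below. The subject shape `tenant.<tenant>.aggregate.<type>.<id>` comes from this plan; the function name, error type, and token rules (reject empty tokens, `.`, `*`, `>`, and whitespace) are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken(&'static str),
    IllegalChar { field: &'static str, ch: char },
}

/// Rejects tokens that would change the subject's shape or introduce wildcards.
fn check_token(field: &'static str, token: &str) -> Result<(), SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken(field));
    }
    let illegal = token
        .chars()
        .find(|c| matches!(c, '.' | '*' | '>') || c.is_whitespace());
    match illegal {
        Some(ch) => Err(SubjectError::IllegalChar { field, ch }),
        None => Ok(()),
    }
}

/// Builds `tenant.<tenant>.aggregate.<type>.<id>` from validated tokens.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    check_token("tenant", tenant)?;
    check_token("type", aggregate_type)?;
    check_token("id", aggregate_id)?;
    Ok(format!("tenant.{tenant}.aggregate.{aggregate_type}.{aggregate_id}"))
}
```

Validating each token before formatting keeps a malformed tenant or aggregate id from silently producing a wildcard or an extra subject level.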
### Required Tests
- Workspace verification commands
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
### Dependencies
- Milestone 1
### Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
### Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
### Tasks
- [x] Define stream policies:
  - [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
  - [x] `WORKFLOW_COMMANDS` is defined and validated on startup
  - [x] `WORKFLOW_EVENTS` is defined and validated on startup
- [x] Centralize stream policy builders/validators in `shared`
- [x] Implement compatibility validation rules:
  - [x] required subjects are present (superset allowed)
  - [x] limits/max_age/duplicate window validated against minimums
  - [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- [x] Add unit tests for stream config builders + validators.
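The compatibility rules can be exercised without a NATS server by validating plain config structs. The sketch below is an assumption about how such a validator could be shaped; `StreamPolicy`, `ObservedStream`, and `validate_stream` are hypothetical names, and subject matching is by exact string only (wildcard coverage is not evaluated).

```rust
use std::time::Duration;

/// What the service requires of a stream (hypothetical shape).
pub struct StreamPolicy {
    pub name: String,
    pub required_subjects: Vec<String>,
    pub min_max_age: Duration,
    pub min_duplicate_window: Duration,
}

/// What an existing stream currently reports (hypothetical shape).
pub struct ObservedStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

/// Returns every violation found; never mutates or replaces the stream.
pub fn validate_stream(policy: &StreamPolicy, observed: &ObservedStream) -> Result<(), Vec<String>> {
    let mut violations = Vec::new();
    // Required subjects must be present; extra subjects are allowed (superset).
    for subject in &policy.required_subjects {
        if !observed.subjects.contains(subject) {
            violations.push(format!("{}: missing required subject {subject}", policy.name));
        }
    }
    // A max_age of zero means "unlimited" in JetStream, so it always satisfies a minimum.
    if !observed.max_age.is_zero() && observed.max_age < policy.min_max_age {
        violations.push(format!("{}: max_age below minimum", policy.name));
    }
    // Dedupe by Nats-Msg-Id only holds within the duplicate window.
    if observed.duplicate_window < policy.min_duplicate_window {
        violations.push(format!("{}: duplicate window below minimum", policy.name));
    }
    if violations.is_empty() { Ok(()) } else { Err(violations) }
}
```

On startup a service would fail fast on `Err` instead of updating the stream, which matches the "no destructive startup" rule.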
### Required Tests
- Workspace verification commands
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
### Dependencies
- Milestone 2
### Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
### Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable
### Tasks
- [x] Standardize consumer defaults:
  - [x] `AckPolicy::Explicit`
  - [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
  - [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
  - [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- [x] Projection:
  - [x] durable naming collision-free for Single/PerView modes
  - [x] checkpoint gate semantics: “skip still acks”
  - [x] poison handling persists durable records and terminates reliably (poison record + term)
- [x] Runner:
  - [x] durable naming collision-free and stable across replicas
  - [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
  - [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- [x] Aggregate:
  - [x] ad-hoc fetch consumer always unique and bounded
  - [x] best-effort deletion never targets unrelated consumers
- [x] Add gated NATS integration tests and document env flags:
  - [x] Runner ignored tests
  - [x] Projection ignored tests
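The consumer defaults and the poison rule can be sketched as pure functions, which keeps them unit-testable without NATS. Everything here is illustrative: the type and function names are hypothetical, and the default values (30 s ack wait, 5 deliveries) are placeholders, not the values used in the codebase. Only the env suffixes `_ACK_TIMEOUT_MS` and `_MAX_DELIVER` and the `max_in_flight` coupling come from this plan.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConsumerPolicy {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Parses a strictly positive integer; anything else falls back to the default.
fn positive(raw: Option<String>) -> Option<u64> {
    raw?.trim().parse::<u64>().ok().filter(|v| *v > 0)
}

/// Builds the policy for a service prefix such as "RUNNER" or "PROJECTION".
/// `env` is injected so tests do not touch the process environment.
pub fn consumer_policy(
    prefix: &str,
    max_in_flight: usize,
    env: impl Fn(&str) -> Option<String>,
) -> ConsumerPolicy {
    // Placeholder defaults; the real defaults are not stated in this plan.
    let ack_ms = positive(env(&format!("{prefix}_ACK_TIMEOUT_MS"))).unwrap_or(30_000);
    let max_deliver = positive(env(&format!("{prefix}_MAX_DELIVER"))).unwrap_or(5);
    ConsumerPolicy {
        ack_wait: Duration::from_millis(ack_ms),
        max_deliver: max_deliver as i64,
        // The server never hands out more unacked messages than workers can hold.
        max_ack_pending: max_in_flight.max(1) as i64,
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum FailureAction {
    /// Leave the message for redelivery.
    Retry,
    /// Persist a poison record, then terminate the message.
    TermAndQuarantine,
}

/// Decides what to do after a handler failure on delivery number `delivered`.
pub fn on_failure(delivered: i64, policy: &ConsumerPolicy) -> FailureAction {
    if delivered >= policy.max_deliver {
        FailureAction::TermAndQuarantine
    } else {
        FailureAction::Retry
    }
}
```

In a service, `env` would typically be `|k| std::env::var(k).ok()`.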
### Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
### Tasks
- [x] Define protobuf API: `projection.gateway.v1.QueryService`
- [x] Implement Projection gRPC server for query execution
- [x] Implement Gateway gRPC client routing to Projection
  - [x] deadlines
  - [x] bounded retries (idempotent only)
  - [x] context propagation
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [x] route internally to gRPC
- [x] Add tests:
  - [x] authz + forwarding via gRPC
  - [x] tenant isolation enforcement in Projection QueryService
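The combination of deadlines and bounded, idempotent-only retries can be sketched as follows. This is a blocking, standard-library-only illustration; the Gateway's actual client is gRPC-based and asynchronous, and the names `CallKind`, `CallError`, and `call_with_deadline` are hypothetical. Jitter is omitted to keep the sketch deterministic.

```rust
use std::time::{Duration, Instant};

pub enum CallKind {
    /// Safe to repeat (queries).
    ReadOnly,
    /// Never retried in this sketch.
    Mutation,
}

#[derive(Debug, PartialEq)]
pub enum CallError<E> {
    DeadlineExceeded,
    Upstream(E),
}

/// Runs `op` until it succeeds, attempts run out, or the deadline passes.
/// `op` receives the remaining time budget so it can set its own timeout.
pub fn call_with_deadline<T, E>(
    kind: CallKind,
    deadline: Instant,
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut(Duration) -> Result<T, E>,
) -> Result<T, CallError<E>> {
    let attempts = match kind {
        CallKind::ReadOnly => max_attempts.max(1),
        CallKind::Mutation => 1,
    };
    let mut last_error = None;
    for attempt in 0..attempts {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            break; // out of budget: do not start another attempt
        }
        match op(remaining) {
            Ok(value) => return Ok(value),
            Err(e) => last_error = Some(e),
        }
        if attempt + 1 < attempts {
            // Never sleep past the deadline.
            let remaining = deadline.saturating_duration_since(Instant::now());
            std::thread::sleep(backoff.min(remaining));
        }
    }
    Err(match last_error {
        Some(e) => CallError::Upstream(e),
        None => CallError::DeadlineExceeded,
    })
}
```

Passing the remaining budget into each attempt keeps the total time bounded by the edge deadline regardless of how many retries are configured.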
### Required Tests
- Workspace verification commands
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway's `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.
### Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
### Tasks
- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [x] Implement Runner gRPC admin server
- [x] Implement Gateway gRPC client integration for admin operations
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [x] Add tests:
  - [x] Gateway: rejects without rights
  - [x] Gateway: rejects tenant spoof attempts
  - [x] Runner: idempotency and drain semantics
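One way to read "idempotency and drain semantics" is sketched below: a drain request can be repeated safely, readiness turns false once draining starts, and no new work is admitted while in-flight work finishes. This is an assumed interpretation, not the RunnerAdmin implementation; `RunnerState`, `DrainStatus`, and the method names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

#[derive(Default)]
pub struct RunnerState {
    draining: AtomicBool,
    in_flight: AtomicUsize,
}

#[derive(Debug, PartialEq, Eq)]
pub struct DrainStatus {
    /// True only for the call that actually started the drain.
    pub newly_requested: bool,
    pub draining: bool,
    pub in_flight: usize,
}

impl RunnerState {
    /// Idempotent: repeated calls leave the runner draining and report the same state.
    pub fn drain(&self) -> DrainStatus {
        let was_draining = self.draining.swap(true, Ordering::SeqCst);
        DrainStatus {
            newly_requested: !was_draining,
            draining: true,
            in_flight: self.in_flight.load(Ordering::SeqCst),
        }
    }

    /// Readiness goes false as soon as a drain is requested.
    pub fn is_ready(&self) -> bool {
        !self.draining.load(Ordering::SeqCst)
    }

    /// Admits one unit of work unless draining. The counter is bumped first and
    /// rolled back on rejection so a concurrent drain cannot miss admitted work.
    pub fn try_begin_work(&self) -> bool {
        self.in_flight.fetch_add(1, Ordering::SeqCst);
        if self.draining.load(Ordering::SeqCst) {
            self.in_flight.fetch_sub(1, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Must be called once for every successful `try_begin_work`.
    pub fn finish_work(&self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }

    /// Drained means: drain requested and nothing left in flight.
    pub fn is_drained(&self) -> bool {
        self.draining.load(Ordering::SeqCst) && self.in_flight.load(Ordering::SeqCst) == 0
    }
}
```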
### Required Tests
- Workspace verification commands
## Milestone 6: Gateway Upstream Performance + Operational Guardrails
### Dependencies
- Milestones 4–5 (gRPC internal RPC surfaces available)
### Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
### Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
### Tasks
- [x] Implement upstream channel pool
  - [x] bounded LRU
  - [x] TTL/eviction
  - [x] fast-path reuse under load (cached gRPC channels)
- [x] Standardize retry profiles
  - [x] read-only: limited retry with jitter (Gateway gRPC calls)
  - [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- [x] Standardize timeouts/deadlines:
  - [x] edge timeout limits
  - [x] internal per-service deadlines
- [x] Fanout controls:
  - [x] concurrency limiters for probes/snapshots
  - [x] short TTL caching where safe
- [x] Ensure probes carry context (correlation/trace) for observability.
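A bounded LRU + TTL pool can be sketched generically, with the channel type left as a parameter. This is not the Gateway's pool: `BoundedPool` and its methods are hypothetical, the clock is passed in explicitly to keep the sketch testable, and eviction is a linear scan, which is only reasonable for small pools.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<V> {
    value: V,
    created: Instant,
    last_used: Instant,
}

/// Keeps at most `capacity` entries, each for at most `ttl`.
pub struct BoundedPool<V> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<V>>,
}

impl<V: Clone> BoundedPool<V> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity: capacity.max(1), ttl, entries: HashMap::new() }
    }

    /// Returns the cached value for `key`, or builds one with `connect`.
    /// `connect` runs only on a miss, so hot keys never reconnect.
    pub fn get_or_insert_with(&mut self, key: &str, now: Instant, connect: impl FnOnce() -> V) -> V {
        // TTL: drop everything older than `ttl`, regardless of use.
        let ttl = self.ttl;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.created) < ttl);

        // Fast path: reuse and mark as recently used.
        if let Some(entry) = self.entries.get_mut(key) {
            entry.last_used = now;
            return entry.value.clone();
        }

        // LRU: at capacity, evict the least recently used entry.
        if self.entries.len() >= self.capacity {
            let lru_key = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = lru_key {
                self.entries.remove(&k);
            }
        }

        let value = connect();
        self.entries.insert(
            key.to_string(),
            Entry { value: value.clone(), created: now, last_used: now },
        );
        value
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The `V: Clone` bound assumes the pooled handle is cheap to clone, as is typical for multiplexed channel handles; shared access from multiple tasks would additionally need a lock around the pool.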
### Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
### Dependencies
- Milestone 6
### Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
### Tasks
- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- [x] Ensure Control UI + Control API rely only on standardized surfaces
- [x] Harden metrics and readiness probes to match the standard contract everywhere
### Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
## Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)