cloudlysis/TRANSPORT_DEVELOPMENT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00

# Transport Development Plan
## Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`
## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are partially standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - Some header names are centralized in `shared` but not all call sites use constants yet.
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes `x-correlation-id` and `correlation-id`, and ensures `traceparent`/`trace-id` are present/derived when possible.
  - Aggregate publishes `trace-id` when `traceparent` is present.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
## Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
## Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented
## Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
## End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers
## Standard Contracts
### Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
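
The context fields above could be centralized along the following lines. This is a minimal sketch, not the actual `shared` API: the header strings come from the contract above, while the module names, constant names, `RequestContext`, and `to_nats_headers` are hypothetical.

```rust
// Header strings are taken from the contract above.
// Module, constant, type, and function names are hypothetical.
pub mod http_headers {
    pub const TENANT_ID: &str = "x-tenant-id";
    pub const CORRELATION_ID: &str = "x-correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const REQUEST_ID: &str = "x-request-id";
}

pub mod nats_headers {
    pub const TENANT_ID: &str = "tenant-id";
    pub const X_CORRELATION_ID: &str = "x-correlation-id";
    pub const CORRELATION_ID: &str = "correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const TRACE_ID: &str = "trace-id";
}

/// Hypothetical context object carried across hops.
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
}

/// Maps a context to NATS header pairs: tenant and both correlation
/// headers are always emitted, trace headers only when present.
pub fn to_nats_headers(ctx: &RequestContext) -> Vec<(&'static str, String)> {
    let mut out = vec![
        (nats_headers::TENANT_ID, ctx.tenant_id.clone()),
        (nats_headers::X_CORRELATION_ID, ctx.correlation_id.clone()),
        (nats_headers::CORRELATION_ID, ctx.correlation_id.clone()),
    ];
    if let Some(tp) = &ctx.traceparent {
        out.push((nats_headers::TRACEPARENT, tp.clone()));
    }
    if let Some(id) = &ctx.trace_id {
        out.push((nats_headers::TRACE_ID, id.clone()));
    }
    out
}
```

Returning plain pairs keeps the sketch independent of any particular NATS client; a real helper would probably write into the client's header map type instead.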
### Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus
## Milestone 0: Shared Transport Contract (Headers + Context + Trace)
### Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
### Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
### Tasks
- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [~] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [ ] HTTP: `x-tenant-id`, `x-request-id` (missing constants)
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [ ] NATS: `tenant-id` constant, `Nats-Msg-Id` constant (missing constants)
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [ ] normalize/generate correlation id when missing across all transports (helper exists for `CorrelationId::generate()`; adoption incomplete)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [ ] additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
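
For reference, the derivation helpers and the stricter validation cases listed above might look roughly like this. It is a sketch under assumptions: the function names match those mentioned in this plan, but the signatures are guesses, the span id is passed in by the caller, the flags are fixed to `01`, and only version `00` of the W3C `traceparent` format is accepted.

```rust
/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// True when every character is '0' (invalid for trace and parent ids).
fn all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Extracts the trace id from a version-00 `traceparent`
/// (`00-<32 hex>-<16 hex>-<2 hex>`). Assumed signature.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // Version 00 has exactly four fields.
    if parts.next().is_some() || version != "00" {
        return None;
    }
    let valid = is_lower_hex(trace_id, 32)
        && !all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !all_zero(parent_id)
        && is_lower_hex(flags, 2);
    if valid { Some(trace_id) } else { None }
}

/// Builds a `traceparent` from a valid trace id and a caller-supplied span id.
/// Flags are fixed to `01` (sampled) in this sketch. Assumed signature.
pub fn traceparent_from_trace_id(trace_id: &str, span_id: &str) -> Option<String> {
    let valid = is_lower_hex(trace_id, 32)
        && !all_zero(trace_id)
        && is_lower_hex(span_id, 16)
        && !all_zero(span_id);
    if valid {
        Some(format!("00-{trace_id}-{span_id}-01"))
    } else {
        None
    }
}
```

Rejecting all-zero ids and wrong lengths covers the invalid traceparent and invalid trace-id length cases named in the unchecked task.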
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
### Dependencies
- Milestone 0
### Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
### Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
### Tasks
- [ ] Create/standardize subject builder helpers (prefer `shared`):
  - [ ] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
  - [ ] Runner effect/effect_result/workflow subject builders
- [~] Aggregate publishes:
  - [ ] `tenant-id` header always present (still needs enforcement everywhere)
  - [ ] correlation + trace headers always present when available, generated when required
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [ ] `Nats-Msg-Id` strategy explicitly defined and tested
- [~] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`)
  - [x] trace headers derived consistently when possible (`traceparent` from `trace-id`, `trace-id` from `traceparent`)
  - [ ] outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
- [~] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
- [ ] Add unit tests:
  - [ ] subject formatting tests per service (once builders exist)
  - [ ] required header presence tests per publisher (enforce required keys)
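
One possible shape for the Aggregate event subject builder is sketched below. The subject layout `tenant.<tenant>.aggregate.<type>.<id>` is from the task list; the function name, error type, and token rules (no empty tokens, no `.`, `*`, `>`, or whitespace) are assumptions chosen so that producers can only emit concrete subjects.

```rust
/// Hypothetical error type for subject construction.
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken,
    IllegalChar(char),
}

/// Rejects tokens that would change the subject's shape: separators,
/// wildcards, and whitespace are not allowed inside a single token.
fn check_token(token: &str) -> Result<&str, SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken);
    }
    match token
        .chars()
        .find(|c| matches!(c, '.' | '*' | '>') || c.is_whitespace())
    {
        Some(c) => Err(SubjectError::IllegalChar(c)),
        None => Ok(token),
    }
}

/// Builds `tenant.<tenant>.aggregate.<type>.<id>` from validated tokens.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        check_token(tenant)?,
        check_token(aggregate_type)?,
        check_token(aggregate_id)?
    ))
}
```

A subject formatting test per service could then assert both the happy path and the rejection of wildcard or dotted tokens.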
### Required Tests
- Workspace verification commands
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
### Dependencies
- Milestone 1
### Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
### Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
### Tasks
- [ ] Define stream policies:
  - [ ] `AGGREGATE_EVENTS` (subjects, retention, duplicate window)
  - [ ] `WORKFLOW_COMMANDS`
  - [ ] `WORKFLOW_EVENTS`
- [ ] Implement compatibility validation rules:
  - [ ] required subjects are present (superset allowed)
  - [ ] retention/limits are within allowed ranges
  - [ ] dedupe assumptions align with producer `Nats-Msg-Id` usage
- [ ] Add unit tests for stream config builders + validators.
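
The compatibility rules above can be exercised without a NATS server if the policy and the observed stream config are plain data. The following is a sketch only: `StreamPolicy`, `ObservedStream`, and `validate` are hypothetical names, the fields do not mirror any client library's config type, and subjects are compared as exact strings rather than by wildcard coverage.

```rust
use std::time::Duration;

/// Hypothetical authoritative policy for one stream.
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<&'static str>,
    pub max_age_min: Duration,
    pub max_age_max: Duration,
    pub min_duplicate_window: Duration,
}

/// Hypothetical snapshot of what the server currently reports for the stream.
pub struct ObservedStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

/// Returns human-readable violations; an empty list means compatible.
/// Extra subjects on the observed stream are allowed (superset rule).
pub fn validate(policy: &StreamPolicy, observed: &ObservedStream) -> Vec<String> {
    let mut violations = Vec::new();
    for required in &policy.required_subjects {
        if !observed.subjects.iter().any(|s| s.as_str() == *required) {
            violations.push(format!("{}: missing subject {}", policy.name, required));
        }
    }
    if observed.max_age < policy.max_age_min || observed.max_age > policy.max_age_max {
        violations.push(format!("{}: max_age out of allowed range", policy.name));
    }
    if observed.duplicate_window < policy.min_duplicate_window {
        violations.push(format!(
            "{}: duplicate window shorter than producers assume",
            policy.name
        ));
    }
    violations
}
```

On startup a service would create the stream if it is missing and otherwise call something like `validate`, refusing to start (rather than mutating the stream) when the list is non-empty.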
### Required Tests
- Workspace verification commands
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
### Dependencies
- Milestone 2
### Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
### Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable
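
A consistent poison policy can be reduced to one decision function that every consumer shares. The sketch below is one interpretation of the criteria above, not a settled design: `Outcome`, `AckAction`, and `decide` are hypothetical names, and terminating a transient failure once `max_deliver` is reached is an assumption.

```rust
/// Hypothetical result of handling one delivered message.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Applied,
    /// Already covered by the checkpoint; nothing to do.
    SkippedByCheckpoint,
    TransientFailure,
    PermanentFailure,
}

/// Hypothetical acknowledgement decision.
#[derive(Debug, PartialEq, Eq)]
pub enum AckAction {
    Ack,
    /// Negative ack: ask for redelivery.
    Nak,
    /// Terminate the message and persist a durable quarantine record.
    TermAndQuarantine,
}

/// Maps a handling outcome to an ack action.
/// `delivery_count` is 1 on the first delivery.
pub fn decide(outcome: Outcome, delivery_count: u64, max_deliver: u64) -> AckAction {
    match outcome {
        // A checkpoint skip still acks so the message is not redelivered.
        Outcome::Applied | Outcome::SkippedByCheckpoint => AckAction::Ack,
        Outcome::PermanentFailure => AckAction::TermAndQuarantine,
        // Assumption: exhausting the delivery budget is treated as poison.
        Outcome::TransientFailure if delivery_count >= max_deliver => {
            AckAction::TermAndQuarantine
        }
        Outcome::TransientFailure => AckAction::Nak,
    }
}
```

Keeping the decision pure makes it unit-testable without NATS, leaving only the actual ack/term calls for the gated integration tests.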
### Tasks
- [ ] Standardize consumer defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to worker concurrency
- [ ] Projection:
  - [ ] durable naming collision-free for Single/PerView modes
  - [ ] checkpoint gate semantics: “skip still acks”
  - [ ] poison handling persists durable records and terminates reliably
- [ ] Runner:
  - [ ] durable naming collision-free and stable across replicas
  - [ ] deliver group rules defined and tested
  - [ ] outbox relay exactly-once behavior verified under redelivery
- [ ] Aggregate:
  - [ ] ad-hoc fetch consumer always unique and bounded
  - [ ] best-effort deletion never targets unrelated consumers
- [ ] Add gated NATS integration tests and document env flags:
  - [ ] Runner ignored tests
  - [ ] Projection ignored tests
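
The "default + env override" pattern for consumer settings could be sketched as follows. Everything here is illustrative: the environment variable names (`CONSUMER_ACK_WAIT_SECS`, `CONSUMER_MAX_DELIVER`), the default values, and the `ConsumerDefaults` type are hypothetical and not taken from the codebase. The lookup is injected so the logic can be tested without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical standardized consumer settings.
#[derive(Debug, PartialEq, Eq)]
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Resolves settings from defaults plus optional overrides.
/// In production `lookup` would be `|k| std::env::var(k).ok()`.
pub fn consumer_defaults(
    lookup: impl Fn(&str) -> Option<String>,
    worker_concurrency: usize,
) -> ConsumerDefaults {
    // Invalid, empty, or zero overrides fall back to the default.
    let parse = |key: &str, default: u64| -> u64 {
        lookup(key)
            .and_then(|v| v.trim().parse::<u64>().ok())
            .filter(|v| *v > 0)
            .unwrap_or(default)
    };
    ConsumerDefaults {
        // Default values below are placeholders, not agreed numbers.
        ack_wait: Duration::from_secs(parse("CONSUMER_ACK_WAIT_SECS", 30)),
        max_deliver: parse("CONSUMER_MAX_DELIVER", 5) as i64,
        // In-flight messages never exceed what the workers can process.
        max_ack_pending: worker_concurrency.max(1) as i64,
    }
}
```

Tying `max_ack_pending` to the worker count is what keeps server-side in-flight work aligned with application-level concurrency.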
### Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Implement Projection gRPC server for query execution
- [ ] Implement Gateway gRPC client routing to Projection
  - [ ] deadlines
  - [ ] bounded retries (idempotent only)
  - [ ] context propagation
- [ ] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [ ] route internally to gRPC or keep as legacy endpoint
- [ ] Add tests:
  - [ ] authz + forwarding via gRPC
  - [ ] tenant isolation enforcement in Projection QueryService
### Required Tests
- Workspace verification commands
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway's `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.
### Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Implement Runner gRPC admin server
- [ ] Implement Gateway gRPC client integration for admin operations
- [ ] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [ ] Add tests:
  - [ ] Gateway: rejects without rights
  - [ ] Gateway: rejects tenant spoof attempts
  - [ ] Runner: idempotency and drain semantics
### Required Tests
- Workspace verification commands
## Milestone 6: Gateway Upstream Performance + Operational Guardrails
### Dependencies
- Milestones 4 and 5 (gRPC internal RPC surfaces available)
### Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
### Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
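
To make the "bounded LRU + TTL" criterion concrete, here is a small std-only sketch of such a pool. It is generic over the channel type so it does not assume any gRPC library; `ChannelPool`, `get_or_connect`, and the linear-scan eviction are hypothetical choices, and a capacity of at least 1 is assumed. The current time is passed in so that eviction is deterministic in tests.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: Instant,
}

/// Hypothetical bounded pool keyed by upstream endpoint.
pub struct ChannelPool<C> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity, ttl, entries: HashMap::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns a cached channel, or connects and caches a new one.
    pub fn get_or_connect(
        &mut self,
        endpoint: &str,
        now: Instant,
        connect: impl FnOnce() -> C,
    ) -> C {
        // TTL: drop every entry older than `ttl`.
        let ttl = self.ttl;
        self.entries.retain(|_, e| now.duration_since(e.created) < ttl);

        // Fast path: reuse and mark as recently used.
        if let Some(e) = self.entries.get_mut(endpoint) {
            e.last_used = now;
            return e.channel.clone();
        }

        // LRU: at capacity, evict the least recently used entry.
        // A linear scan is fine for the small pools assumed here.
        if self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(key) = victim {
                self.entries.remove(&key);
            }
        }

        let channel = connect();
        self.entries.insert(
            endpoint.to_string(),
            Entry { channel: channel.clone(), created: now, last_used: now },
        );
        channel
    }
}
```

In the Gateway the pool would sit behind a lock or be sharded, and `C` would be a cheaply cloneable multiplexed channel handle; both of those are design decisions this sketch leaves open.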
### Tasks
- [ ] Implement upstream channel pool
  - [ ] bounded LRU
  - [ ] TTL/eviction
  - [ ] fast-path reuse under load
- [ ] Standardize retry profiles
  - [ ] read-only: limited retry with jitter
  - [ ] mutations: no retry unless idempotency key is present and semantics are safe
- [ ] Standardize timeouts/deadlines:
  - [ ] edge timeout limits
  - [ ] internal per-service deadlines
- [ ] Fanout controls:
  - [ ] concurrency limiters for probes/snapshots
  - [ ] short TTL caching where safe
- [ ] Ensure probes carry context (correlation/trace) for observability.
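
The retry profiles in the task list could be expressed as data plus a pure backoff function, roughly as below. This is a sketch with assumed names and numbers: `OpKind`, `RetryProfile`, `profile_for`, and `backoff` are hypothetical, the attempt counts and delays are placeholders, and the jitter value is supplied by the caller (for example from a random number generator) so the code stays dependency-free.

```rust
use std::time::Duration;

/// Hypothetical classification of an upstream call.
pub enum OpKind {
    ReadOnly,
    Mutation { has_idempotency_key: bool },
}

/// Hypothetical retry profile; `max_attempts` includes the first try.
pub struct RetryProfile {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

/// Read-only calls get a few retries; mutations retry only with an
/// idempotency key. All numbers are placeholders.
pub fn profile_for(kind: &OpKind) -> RetryProfile {
    let max_attempts = match kind {
        OpKind::ReadOnly => 3,
        OpKind::Mutation { has_idempotency_key: true } => 2,
        OpKind::Mutation { has_idempotency_key: false } => 1,
    };
    RetryProfile {
        max_attempts,
        base_delay: Duration::from_millis(50),
        max_delay: Duration::from_secs(1),
    }
}

/// Delay before the next attempt, or `None` when the budget is spent.
/// `attempts_made` starts at 1 after the first failure; `jitter` is a
/// caller-supplied value in `[0.0, 1.0]` ("full jitter" backoff).
pub fn backoff(profile: &RetryProfile, attempts_made: u32, jitter: f64) -> Option<Duration> {
    if attempts_made == 0 || attempts_made >= profile.max_attempts {
        return None;
    }
    // Exponential ceiling, capped at `max_delay`.
    let shift = (attempts_made - 1).min(16);
    let ceiling = profile
        .base_delay
        .saturating_mul(1u32 << shift)
        .min(profile.max_delay);
    Some(ceiling.mul_f64(jitter.clamp(0.0, 1.0)))
}
```

Callers would still bound the whole operation with a deadline, so the retry budget can never extend a request beyond its edge timeout.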
### Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
### Dependencies
- Milestone 6
### Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
### Tasks
- [ ] Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
- [ ] Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
- [ ] Ensure Control UI + Control API rely only on standardized surfaces
- [ ] Harden metrics and readiness probes to match the standard contract everywhere
### Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
## Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)