cloudlysis/TRANSPORT_DEVELOPMENT_PLAN.md
Vlad Durnea 90c307016d
transport: complete M0–M7
shared: add stream+consumer policy helpers; NATS context header builder

aggregate/runner/projection: centralize stream validation and header usage; set bounded consumer params

projection: add QueryService gRPC and wire into main; settings include PROJECTION_GRPC_ADDR

gateway: gRPC routing to Projection/Runner with deadlines; bounded read-only retries; pooled gRPC channels (bounded LRU+TTL); admin proxy forwards to gRPC; probes use concurrency limiter + TTL cache

runner: add RunnerAdmin gRPC server (drain, status, reload) and wire into main; settings include RUNNER_GRPC_ADDR

tests: add gateway authz for runner admin, projection tenant isolation, runner admin drain semantics

docs: update TRANSPORT_DEVELOPMENT_PLAN to reflect completed milestones and details
2026-03-30 14:24:14 +03:00


Transport Development Plan

Purpose

Unify and optimize the platform transport layer end-to-end:

  • Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
  • Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate

This plan merges and supersedes:

  • GATEWAY_TRANSPORT_PLAN.md
  • NATS_TRANSPORT_PLAN.md

Current Status (Codebase Reality)

  • Monorepo workspace exists; shared crate exists and is used by Aggregate/Projection/Runner/Gateway.
  • Request context pieces are standardized:
    • shared provides TenantId, CorrelationId, TraceId
    • shared provides trace_id_from_traceparent(...) and traceparent_from_trace_id(...)
    • shared provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
    • Most call sites now use shared constants/helpers; remaining gaps should be treated as Milestone-gated
  • Gateway → Aggregate is already HTTP (edge) → gRPC (internal) and propagates x-tenant-id, x-correlation-id, and traceparent.
  • Gateway → Projection remains HTTP proxy (/v1/query/...) and Gateway → Runner remains HTTP admin proxy (/admin/runner/...).
  • Node → NATS header propagation is improved and closer to consistent:
    • Runner publishes required headers for effect commands/results (tenant-id, Nats-Msg-Id, correlation, traceparent/trace-id), generating when missing.
    • Aggregate publishes required headers for events (tenant-id, Nats-Msg-Id, correlation, traceparent/trace-id), generating when missing.
    • Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
  • Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

Principles

  • Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
  • Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
  • Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
  • Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
  • High performance: multiplexing, backpressure, low tail latency, predictable routing.
  • Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.

Non-Negotiable Rules (Global)

  • Every cross-component hop MUST carry tenant + correlation + trace context.
  • Every transport path MUST have explicit timeouts/deadlines and bounded retries.
  • Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
  • Every milestone is stop-the-line gated:
    • All tasks completed
    • All tests required by the milestone pass
    • Workspace verification commands pass
    • Gated integration tests for the milestone are runnable and documented

Baseline (Today)

  • Gateway → Aggregate: gRPC (command submission)
  • Gateway → Projection: HTTP (query proxy)
  • Gateway → Runner: HTTP (admin proxy)
  • Node ↔ NATS JetStream: AGGREGATE_EVENTS, WORKFLOW_COMMANDS, WORKFLOW_EVENTS

End State (Target Architecture)

  • Edge contract (clients ↔ Gateway): HTTP/JSON
  • Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
  • Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
  • shared is the single source of truth for:
    • header names and injection/extraction rules
    • trace parsing/validation (traceparent, trace-id)
    • context object model (tenant/correlation/trace/request ids)
    • NATS subject + consumer naming helpers

Standard Contracts

Context Fields

  • Tenant: HTTP x-tenant-id, NATS tenant-id
  • Correlation: HTTP x-correlation-id, NATS x-correlation-id and correlation-id
  • Trace: HTTP traceparent, NATS traceparent and trace-id (derived when possible)
  • Request id: HTTP x-request-id (optional for NATS)
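The mapping above can be captured as a small constants module. This is a hedged sketch: the string values come from the plan itself, but the constant names are assumptions and the real shared crate may organize them differently.

```rust
// Canonical header constants implied by the context-field mapping above.
// String values are taken from the plan; constant names are illustrative.
pub const HTTP_TENANT_ID: &str = "x-tenant-id";
pub const HTTP_CORRELATION_ID: &str = "x-correlation-id";
pub const HTTP_TRACEPARENT: &str = "traceparent";
pub const HTTP_REQUEST_ID: &str = "x-request-id";

pub const NATS_TENANT_ID: &str = "tenant-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const NATS_TRACEPARENT: &str = "traceparent";
pub const NATS_TRACE_ID: &str = "trace-id";
pub const NATS_MSG_ID: &str = "Nats-Msg-Id";

fn main() {
    // Demonstrate the HTTP -> NATS correlation mapping in one place.
    println!("{HTTP_CORRELATION_ID} maps to {NATS_CORRELATION_ID}");
}
```

Centralizing these in one module is what lets "most call sites use shared constants" become "all call sites": any literal header string elsewhere becomes a lint target.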

Standard Service Endpoints (every service)

  • GET /health: liveness
  • GET /ready: readiness (includes tenant gating if relevant)
  • GET /metrics: Prometheus

Milestone 0: Shared Transport Contract (Headers + Context + Trace)

Goal

Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.

Exit Criteria

  • shared contains canonical constants for header names and NATS header names.
  • shared contains canonical trace parsing/validation and trace derivation helpers.
  • Library-level unit tests cover parsing/derivation behavior.
  • All crates build and tests pass for the workspace.

Tasks

  • Add shared ID types in shared:
    • TenantId
    • CorrelationId
    • TraceId
  • Consolidate header constants in shared:
    • HTTP: x-correlation-id, traceparent, trace-id (for NATS/interop)
    • HTTP: x-tenant-id, x-request-id
    • NATS: correlation-id (used in Runner), trace-id (now emitted where possible)
    • NATS: tenant-id, Nats-Msg-Id
  • Add shared helpers:
    • derive trace-id from traceparent
    • derive traceparent from trace-id when valid
    • normalize/generate correlation id when missing (normalize_correlation_id(...))
    • normalize/generate traceparent when missing/invalid (normalize_traceparent(...))
  • Add unit tests in shared for:
    • traceparent parsing validity
    • serialization shape for correlation/trace id newtypes
    • additional validation cases (invalid traceparents, all-zero ids)
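The trace helpers above can be sketched as follows. This is a minimal illustration, not the shared crate's actual code: function names mirror the plan, the W3C traceparent layout (`00-<32 hex>-<16 hex>-<2 hex>`) is standard, but the synthetic parent span id used when rebuilding a traceparent is a placeholder assumption.

```rust
/// Extract the 32-hex-char trace-id from a W3C traceparent value,
/// rejecting malformed, uppercase, or all-zero inputs.
fn derive_trace_id(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    // traceparent fields must be lowercase hex of fixed width.
    let is_hex = |s: &str, n: usize| {
        s.len() == n && s.bytes().all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
    };
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    if !is_hex(version, 2) || !is_hex(trace_id, 32) || !is_hex(parent_id, 16) || !is_hex(flags, 2) {
        return None;
    }
    // All-zero trace-id or parent-id is invalid per the W3C Trace Context spec.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    Some(trace_id.to_string())
}

/// Rebuild a traceparent from a bare trace-id when valid.
/// The parent span id here is a synthetic placeholder for illustration only.
fn traceparent_from_trace_id(trace_id: &str) -> Option<String> {
    let is_hex = trace_id.bytes().all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b));
    if trace_id.len() != 32 || !is_hex || trace_id.bytes().all(|b| b == b'0') {
        return None;
    }
    Some(format!("00-{trace_id}-0000000000000001-01"))
}

fn main() {
    let tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    println!("{:?}", derive_trace_id(tp));
}
```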

Required Tests

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace

Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)

Dependencies

  • Milestone 0

Goal

Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.

Exit Criteria

  • Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
  • All NATS producers set required headers consistently.
  • All NATS consumers tolerate unknown fields and missing optional fields.
  • “Contract tests” exist per service to verify produced headers and subject formats.

Tasks

  • Create/standardize subject builder helpers (prefer shared):
    • Aggregate event subject builder (tenant.<tenant>.aggregate.<type>.<id>)
    • Runner effect/effect_result subject builders
    • Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
  • Aggregate publishes:
    • tenant-id header always present
    • correlation + trace headers always present; generated when missing/invalid
    • trace-id is derived when traceparent is present (now emitted in publish path)
    • Nats-Msg-Id strategy explicitly defined and tested (Aggregate events use event_id)
  • Runner publishes (commands/results):
    • correlation headers emitted consistently (x-correlation-id + correlation-id) and generated when missing
    • trace headers always present/derived when possible; generated when missing/invalid
    • Nats-Msg-Id strategy explicitly defined and tested (Runner commands/results use command_id)
    • outbox metadata → NATS headers mapping standardized via shared helpers
  • Projection consumption:
    • envelope decoding remains tolerant (unknown fields ignored)
    • correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
  • Add unit tests:
    • subject formatting tests (shared builders)
    • required header presence tests per publisher (Aggregate + Runner)
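A subject builder of the kind described above might look like the following sketch. The `tenant.<tenant>.aggregate.<type>.<id>` shape comes from the plan; the token validation rules (no empty tokens, no `.`, `*`, `>`, or whitespace, so producers can only publish concrete subjects) are an assumption based on NATS subject syntax.

```rust
/// A NATS subject token may not be empty or contain separators/wildcards.
fn valid_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Build a concrete (never wildcard) aggregate event subject, rejecting bad tokens.
fn aggregate_event_subject(tenant: &str, agg_type: &str, agg_id: &str) -> Result<String, String> {
    for (name, token) in [("tenant", tenant), ("type", agg_type), ("id", agg_id)] {
        if !valid_token(token) {
            return Err(format!("invalid subject token for {name}: {token:?}"));
        }
    }
    Ok(format!("tenant.{tenant}.aggregate.{agg_type}.{agg_id}"))
}

fn main() {
    println!("{:?}", aggregate_event_subject("acme", "order", "42"));
}
```

Returning `Result` instead of silently formatting means a bad tenant or id fails loudly at the publish site rather than producing a subject that matches the wrong stream filter.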

Required Tests

  • Workspace verification commands

Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)

Dependencies

  • Milestone 1

Goal

Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.

Exit Criteria

  • Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
  • Services create streams if missing, and validate compatibility on startup.
  • Startup does not silently replace or destructively mutate existing streams.
  • Config-only tests validate stream config builders without requiring NATS.

Tasks

  • Define stream policies:
    • AGGREGATE_EVENTS (subjects, limits, duplicate window) is defined and validated on startup
    • WORKFLOW_COMMANDS is defined and validated on startup
    • WORKFLOW_EVENTS is defined and validated on startup
    • Centralize stream policy builders/validators in shared
  • Implement compatibility validation rules:
    • required subjects are present (superset allowed)
    • limits/max_age/duplicate window validated against minimums
    • dedupe assumptions align with producer Nats-Msg-Id usage (duplicate window + msg-id strategy)
  • Add unit tests for stream config builders + validators.
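The compatibility rules above can be sketched as a config-only validator, testable without a NATS server. This is an illustration under stated assumptions: the struct and field names are invented here, and only the two rules from the plan (required subjects as a subset of the live stream's subjects, duplicate window above a minimum) are modeled.

```rust
use std::time::Duration;

/// Authoritative policy for one stream (names/fields are illustrative).
struct StreamPolicy {
    name: &'static str,
    required_subjects: Vec<&'static str>,
    min_duplicate_window: Duration,
}

/// Validate a live stream's config against the policy without mutating it.
/// A superset of subjects on the live stream is allowed; missing ones are not.
fn validate_stream(
    policy: &StreamPolicy,
    existing_subjects: &[&str],
    existing_duplicate_window: Duration,
) -> Result<(), String> {
    for required in &policy.required_subjects {
        if !existing_subjects.contains(required) {
            return Err(format!("stream {} missing subject {required}", policy.name));
        }
    }
    // The duplicate window must be long enough for the producer's
    // Nats-Msg-Id dedupe assumptions to hold across retries.
    if existing_duplicate_window < policy.min_duplicate_window {
        return Err(format!(
            "stream {} duplicate window {existing_duplicate_window:?} below minimum {:?}",
            policy.name, policy.min_duplicate_window
        ));
    }
    Ok(())
}

fn main() {
    let policy = StreamPolicy {
        name: "AGGREGATE_EVENTS",
        required_subjects: vec!["tenant.*.aggregate.>"],
        min_duplicate_window: Duration::from_secs(120),
    };
    let ok = validate_stream(&policy, &["tenant.*.aggregate.>"], Duration::from_secs(120));
    println!("{ok:?}");
}
```

Validation-only startup is the key property: on mismatch the service refuses to start instead of "helpfully" replacing the stream, which is what prevents accidental destructive mutation.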

Required Tests

  • Workspace verification commands

Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)

Dependencies

  • Milestone 2

Goal

Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.

Exit Criteria

  • All long-lived consumers use explicit ack with standardized defaults (ack_wait, max_deliver, max_ack_pending).
  • Application-level concurrency is bounded and aligned with max_in_flight.
  • Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
  • Gated NATS integration tests prove:
    • redelivery idempotency
    • poison termination
    • scale-out behavior (deliver group) where applicable

Tasks

  • Standardize consumer defaults:
    • AckPolicy::Explicit
    • ack_wait default + env override (Runner/Projection: *_ACK_TIMEOUT_MS)
    • max_deliver default + env override (Runner/Projection: *_MAX_DELIVER)
    • max_ack_pending tied to worker concurrency (Runner/Projection: max_in_flight)
  • Projection:
    • durable naming collision-free for Single/PerView modes
    • checkpoint gate semantics: “skip still acks”
    • poison handling persists durable records and terminates reliably (poison record + term)
  • Runner:
    • durable naming collision-free and stable across replicas
    • deliver group rules defined (pull consumers; deliver_group is rejected if configured)
    • outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
  • Aggregate:
    • ad-hoc fetch consumer always unique and bounded
    • best-effort deletion never targets unrelated consumers
  • Add gated NATS integration tests and document env flags:
    • Runner ignored tests
    • Projection ignored tests
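The "defaults + env override + concurrency-tied bound" pattern above can be sketched like this. Env var names follow the plan's `*_ACK_TIMEOUT_MS` / `*_MAX_DELIVER` convention (shown here with a hypothetical `RUNNER_` prefix); the default values are assumptions for illustration.

```rust
#[derive(Debug, PartialEq)]
struct ConsumerDefaults {
    ack_wait_ms: u64,
    max_deliver: u32,
    max_ack_pending: usize,
}

/// Resolve consumer defaults from an env lookup, falling back to fixed
/// defaults on missing/unparseable values. `env` is injected for testability.
fn consumer_defaults(
    max_in_flight: usize,
    env: impl Fn(&str) -> Option<String>,
) -> ConsumerDefaults {
    let parse = |key: &str, default: u64| -> u64 {
        env(key).and_then(|v| v.parse().ok()).unwrap_or(default)
    };
    ConsumerDefaults {
        ack_wait_ms: parse("RUNNER_ACK_TIMEOUT_MS", 30_000),
        max_deliver: parse("RUNNER_MAX_DELIVER", 5) as u32,
        // Tie the server-side ack-pending bound to the application's worker
        // concurrency so redeliveries cannot outrun processing capacity.
        max_ack_pending: max_in_flight,
    }
}

fn main() {
    let defaults = consumer_defaults(16, |_| None);
    println!("{defaults:?}");
}
```

Keeping `max_ack_pending == max_in_flight` is the backpressure invariant: the server never holds more unacked deliveries than the workers can actually be processing.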

Required Tests

  • Workspace verification commands
  • Runner: RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored
  • Projection: PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored

Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)

Dependencies

  • Milestone 0 (context contract)

Goal

Replace the Gateway → Projection HTTP proxy with a gRPC QueryService as the default path, keeping HTTP available for human/debug use.

Exit Criteria

  • Projection exposes projection.gateway.v1.QueryService.
  • Gateway routes queries via gRPC by default.
  • Authz remains enforced in Gateway (deny-by-default).
  • Query responses remain stable for Control UI expectations.
  • New gRPC query tests pass (unit + integration).

Tasks

  • Define protobuf API: projection.gateway.v1.QueryService
  • Implement Projection gRPC server for query execution
  • Implement Gateway gRPC client routing to Projection
    • deadlines
    • bounded retries (idempotent only)
    • context propagation
  • Preserve HTTP /v1/query/* as compatibility/debug:
    • route internally to gRPC
  • Add tests:
    • authz + forwarding via gRPC
    • tenant isolation enforcement in Projection QueryService
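The "deadlines + bounded retries (idempotent only)" client profile above can be sketched transport-agnostically. This is not the Gateway's actual client code: the helper name and the fixed-step backoff are assumptions (a real client would likely add jitter, and attach the deadline to the gRPC request itself).

```rust
use std::time::{Duration, Instant};

/// Retry a read-only (idempotent) operation with a bounded attempt budget
/// and an overall deadline. Mutations must NOT go through this path.
fn retry_read_only<T, E>(
    max_attempts: u32,
    deadline: Instant,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(v) => return Ok(v),
            // Budget exhausted or deadline passed: surface the last error.
            Err(e) if attempt >= max_attempts || Instant::now() >= deadline => return Err(e),
            Err(_) => {
                // Small fixed backoff between attempts, bounded by the deadline.
                std::thread::sleep(Duration::from_millis(10));
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let out = retry_read_only(3, Instant::now() + Duration::from_secs(1), || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(calls) }
    });
    println!("{out:?}");
}
```

Restricting retries to reads is what keeps this safe: a query replayed after a timeout is harmless, while a replayed command would need an idempotency key to be safe.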

Required Tests

  • Workspace verification commands

Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)

Dependencies

  • Milestone 0 (context contract)

Goal

Replace the Gateway's /admin/runner/* HTTP proxy usage with a first-class gRPC admin service.

Exit Criteria

  • Runner exposes runner.admin.v1.RunnerAdmin.
  • Gateway calls Runner admin via gRPC (authz enforced in Gateway).
  • Tenant-spoof and unauthorized calls are rejected deterministically.
  • Runner drain/readiness semantics validated and tested.

Tasks

  • Define protobuf API: runner.admin.v1.RunnerAdmin
  • Implement Runner gRPC admin server
  • Implement Gateway gRPC client integration for admin operations
  • Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
  • Add tests:
    • Gateway: rejects without rights
    • Gateway: rejects tenant spoof attempts
    • Runner: idempotency and drain semantics
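The idempotency and drain semantics to be tested can be sketched independently of the gRPC plumbing. This is an assumed model, not the Runner's real state machine: drain flips a flag exactly once, repeated drain calls are no-ops, and readiness fails while draining.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal drain state a RunnerAdmin handler could wrap (illustrative).
struct DrainState {
    draining: AtomicBool,
}

impl DrainState {
    fn new() -> Self {
        Self { draining: AtomicBool::new(false) }
    }

    /// Returns true only on the transition into draining, so a second
    /// drain request is observably idempotent.
    fn drain(&self) -> bool {
        !self.draining.swap(true, Ordering::SeqCst)
    }

    /// A draining runner must fail readiness so routing stops sending it work.
    fn ready(&self) -> bool {
        !self.draining.load(Ordering::SeqCst)
    }
}

fn main() {
    let state = DrainState::new();
    println!("ready={} first_drain={}", state.ready(), state.drain());
}
```

The atomic `swap` makes the transition race-free across concurrent admin calls, which is what the "drain semantics" tests should pin down.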

Required Tests

  • Workspace verification commands

Milestone 6: Gateway Upstream Performance + Operational Guardrails

Dependencies

  • Milestones 4–5 (gRPC internal RPC surfaces available)

Goal

Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.

Exit Criteria

  • Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
  • Deadlines everywhere; retries only for idempotent operations.
  • Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
  • Gated load/soak tests exist and are runnable.

Tasks

  • Implement upstream channel pool
    • bounded LRU
    • TTL/eviction
    • fast-path reuse under load (cached gRPC channels)
  • Standardize retry profiles
    • read-only: limited retry with jitter (Gateway gRPC calls)
    • mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
  • Standardize timeouts/deadlines:
    • edge timeout limits
    • internal per-service deadlines
  • Fanout controls:
    • concurrency limiters for probes/snapshots
    • short TTL caching where safe
  • Ensure probes carry context (correlation/trace) for observability.
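The bounded LRU + TTL pool above can be sketched with std collections. This is a simplified model under stated assumptions: the "channel" is a placeholder `String` standing in for a cloned gRPC channel, eviction scans linearly for the least-recently-used entry (fine at small capacities), and the real pool would also need interior locking for concurrent use.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Capacity-bounded pool keyed by upstream address, with idle TTL.
struct Pool {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, (String, Instant)>, // addr -> (channel, last_used)
}

impl Pool {
    fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity, ttl, entries: HashMap::new() }
    }

    fn get(&mut self, addr: &str) -> String {
        let now = Instant::now();
        // Drop entries idle longer than the TTL.
        self.entries.retain(|_, (_, used)| now.duration_since(*used) < self.ttl);
        // Fast path: reuse a cached channel and refresh its recency.
        if let Some((chan, used)) = self.entries.get_mut(addr) {
            *used = now;
            return chan.clone();
        }
        // At capacity: evict the least-recently-used entry before inserting.
        if self.entries.len() >= self.capacity {
            if let Some(oldest) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, used))| *used)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&oldest);
            }
        }
        let chan = format!("channel:{addr}"); // stand-in for dialing a new channel
        self.entries.insert(addr.to_string(), (chan.clone(), now));
        chan
    }
}

fn main() {
    let mut pool = Pool::new(2, Duration::from_secs(60));
    println!("{}", pool.get("projection:9090"));
}
```

The bound matters more than the eviction strategy: without a capacity and TTL, a churn of upstream addresses turns the pool into a slow connection leak.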

Required Tests

  • Workspace verification commands
  • Gated load/soak tests (document env + how to run)

Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)

Dependencies

  • Milestone 6

Goal

Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.

Exit Criteria

  • Gateway no longer depends on HTTP for Projection queries or Runner admin.
  • Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
  • End-to-end smoke tests pass (gated).

Tasks

  • Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
  • Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
  • Ensure Control UI + Control API rely only on standardized surfaces
  • Harden metrics and readiness probes to match the standard contract everywhere

Required Tests

  • Workspace verification commands
  • End-to-end smoke tests (gated)

Workspace Verification Commands (Run for Every Milestone)

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace
  • npm ci && npm run lint && npm run typecheck && npm run test && npm run build (in control/ui)