cloudlysis/TRANSPORT_DEVELOPMENT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00


Transport Development Plan

Purpose

Unify and optimize the platform transport layer end-to-end:

  • Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
  • Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate

This plan merges and supersedes:

  • GATEWAY_TRANSPORT_PLAN.md
  • NATS_TRANSPORT_PLAN.md

Current Status (Codebase Reality)

  • Monorepo workspace exists; shared crate exists and is used by Aggregate/Projection/Runner/Gateway.
  • Request context pieces are partially standardized:
    • shared provides TenantId, CorrelationId, TraceId
    • shared provides trace_id_from_traceparent(...) and traceparent_from_trace_id(...)
    • Some header names are centralized in shared but not all call sites use constants yet.
  • Gateway → Aggregate is already HTTP (edge) → gRPC (internal) and propagates x-tenant-id, x-correlation-id, and traceparent.
  • Gateway → Projection remains HTTP proxy (/v1/query/...) and Gateway → Runner remains HTTP admin proxy (/admin/runner/...).
  • Node → NATS header propagation has improved and is closer to consistent:
    • Runner publishes x-correlation-id and correlation-id, and ensures traceparent/trace-id are present/derived when possible.
    • Aggregate publishes trace-id when traceparent is present.
  • Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

Principles

  • Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
  • Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
  • Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
  • Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
  • High performance: multiplexing, backpressure, low tail latency, predictable routing.
  • Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.

Non-Negotiable Rules (Global)

  • Every cross-component hop MUST carry tenant + correlation + trace context.
  • Every transport path MUST have explicit timeouts/deadlines and bounded retries.
  • Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
  • Every milestone is stop-the-line gated:
    • All tasks completed
    • All tests required by the milestone pass
    • Workspace verification commands pass
    • Gated integration tests for the milestone are runnable and documented

Baseline (Today)

  • Gateway → Aggregate: gRPC (command submission)
  • Gateway → Projection: HTTP (query proxy)
  • Gateway → Runner: HTTP (admin proxy)
  • Node ↔ NATS JetStream: AGGREGATE_EVENTS, WORKFLOW_COMMANDS, WORKFLOW_EVENTS

End State (Target Architecture)

  • Edge contract (clients ↔ Gateway): HTTP/JSON
  • Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
  • Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
  • shared is the single source of truth for:
    • header names and injection/extraction rules
    • trace parsing/validation (traceparent, trace-id)
    • context object model (tenant/correlation/trace/request ids)
    • NATS subject + consumer naming helpers

Standard Contracts

Context Fields

  • Tenant: HTTP x-tenant-id, NATS tenant-id
  • Correlation: HTTP x-correlation-id, NATS x-correlation-id and correlation-id
  • Trace: HTTP traceparent, NATS traceparent and trace-id (derived when possible)
  • Request id: HTTP x-request-id (optional for NATS)
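
The mapping above can be sketched as a single context type that emits the per-transport header sets. This is a minimal illustration using only the standard library: the header names come from the list above, while the RequestContext struct, its field layout, and the method names are assumptions and may not match what shared actually exposes.

```rust
// Header names taken from the Context Fields list.
pub const HTTP_TENANT_ID: &str = "x-tenant-id";
pub const HTTP_CORRELATION_ID: &str = "x-correlation-id";
pub const HTTP_TRACEPARENT: &str = "traceparent";
pub const HTTP_REQUEST_ID: &str = "x-request-id";
pub const NATS_TENANT_ID: &str = "tenant-id";
pub const NATS_X_CORRELATION_ID: &str = "x-correlation-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const NATS_TRACEPARENT: &str = "traceparent";
pub const NATS_TRACE_ID: &str = "trace-id";

/// Hypothetical context model; the real `shared` types may differ.
#[derive(Debug, Clone)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
    pub request_id: Option<String>,
}

impl RequestContext {
    /// Headers for an HTTP hop: tenant and correlation are always sent,
    /// trace and request id only when known.
    pub fn http_headers(&self) -> Vec<(&'static str, String)> {
        let mut headers = vec![
            (HTTP_TENANT_ID, self.tenant_id.clone()),
            (HTTP_CORRELATION_ID, self.correlation_id.clone()),
        ];
        if let Some(tp) = &self.traceparent {
            headers.push((HTTP_TRACEPARENT, tp.clone()));
        }
        if let Some(id) = &self.request_id {
            headers.push((HTTP_REQUEST_ID, id.clone()));
        }
        headers
    }

    /// Headers for a NATS publish: correlation is written under both names,
    /// and the trace headers are included when present.
    pub fn nats_headers(&self) -> Vec<(&'static str, String)> {
        let mut headers = vec![
            (NATS_TENANT_ID, self.tenant_id.clone()),
            (NATS_X_CORRELATION_ID, self.correlation_id.clone()),
            (NATS_CORRELATION_ID, self.correlation_id.clone()),
        ];
        if let Some(tp) = &self.traceparent {
            headers.push((NATS_TRACEPARENT, tp.clone()));
        }
        if let Some(id) = &self.trace_id {
            headers.push((NATS_TRACE_ID, id.clone()));
        }
        headers
    }
}
```

Keeping both emitters on one type means a new context field only has to be wired in one place.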

Standard Service Endpoints (every service)

  • GET /health: liveness
  • GET /ready: readiness (includes tenant gating if relevant)
  • GET /metrics: Prometheus metrics

Milestone 0: Shared Transport Contract (Headers + Context + Trace)

Goal

Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.

Exit Criteria

  • shared contains canonical constants for header names and NATS header names.
  • shared contains canonical trace parsing/validation and trace derivation helpers.
  • Library-level unit tests cover parsing/derivation behavior.
  • All crates build and tests pass for the workspace.

Tasks

  • Add shared ID types in shared:
    • TenantId
    • CorrelationId
    • TraceId
  • [~] Consolidate header constants in shared:
    • HTTP: x-correlation-id, traceparent, trace-id (for NATS/interop)
    • HTTP: x-tenant-id, x-request-id (missing constants)
    • NATS: correlation-id (used in Runner), trace-id (now emitted where possible)
    • NATS: tenant-id constant, Nats-Msg-Id constant (missing constants)
  • Add shared helpers:
    • derive trace-id from traceparent
    • derive traceparent from trace-id when valid
    • normalize/generate correlation id when missing across all transports (a CorrelationId::generate() helper exists; adoption is incomplete)
  • Add unit tests in shared for:
    • traceparent parsing validity
    • serialization shape for correlation/trace id newtypes
    • additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
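
The trace helpers and their stricter validation cases might look roughly like the sketch below. The function names follow the ones this plan attributes to shared, but the signatures here are assumptions. In particular, taking the span id as a parameter and fixing the flags to "01" are illustrative choices, not confirmed behavior. Validation follows the W3C traceparent layout: version, 32-hex trace id, 16-hex parent id, 2-hex flags.

```rust
/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Extracts the trace id from a traceparent value, or None when it is invalid.
/// Signature is an assumption; the real helper in `shared` may differ.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<String> {
    let mut parts = traceparent.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // Version 00 has exactly four fields; later versions may append more.
    if version == "00" && parts.next().is_some() {
        return None;
    }
    let valid = is_lower_hex(version, 2)
        && version != "ff"
        && is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id)
        && is_lower_hex(flags, 2);
    if valid { Some(trace_id.to_string()) } else { None }
}

/// Builds a version-00 traceparent from a trace id and a caller-supplied span id.
/// Assumption: flags are fixed to "01" (sampled) purely for illustration.
pub fn traceparent_from_trace_id(trace_id: &str, span_id: &str) -> Option<String> {
    let valid = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(span_id, 16)
        && !is_all_zero(span_id);
    if valid { Some(format!("00-{}-{}-01", trace_id, span_id)) } else { None }
}
```

Returning Option rather than an error type keeps call sites simple when a missing or malformed trace header should just mean "no trace context".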

Required Tests

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace

Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)

Dependencies

  • Milestone 0

Goal

Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.

Exit Criteria

  • Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
  • All NATS producers set required headers consistently.
  • All NATS consumers tolerate unknown fields and missing optional fields.
  • “Contract tests” exist per service to verify produced headers and subject formats.

Tasks

  • Create/standardize subject builder helpers (prefer shared):
    • Aggregate event subject builder (tenant.<tenant>.aggregate.<type>.<id>)
    • Runner effect/effect_result/workflow subject builders
  • [~] Aggregate publishes:
    • tenant-id header always present (still needs enforcement everywhere)
    • correlation + trace headers always present when available, generated when required
    • trace-id is derived when traceparent is present (now emitted in publish path)
    • Nats-Msg-Id strategy explicitly defined and tested
  • [~] Runner publishes (commands/results):
    • correlation headers emitted consistently (x-correlation-id + correlation-id)
    • trace headers derived consistently when possible (traceparent from trace-id, trace-id from traceparent)
    • outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
  • [~] Projection consumption:
    • envelope decoding remains tolerant (unknown fields ignored)
    • [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
  • Add unit tests:
    • subject formatting tests per service (once builders exist)
    • required header presence tests per publisher (enforce required keys)

Required Tests

  • Workspace verification commands

Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)

Dependencies

  • Milestone 1

Goal

Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.

Exit Criteria

  • Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
  • Services create streams if missing, and validate compatibility on startup.
  • Startup does not silently replace or destructively mutate existing streams.
  • Config-only tests validate stream config builders without requiring NATS.

Tasks

  • Define stream policies:
    • AGGREGATE_EVENTS (subjects, retention, duplicate window)
    • WORKFLOW_COMMANDS
    • WORKFLOW_EVENTS
  • Implement compatibility validation rules:
    • required subjects are present (superset allowed)
    • retention/limits are within allowed ranges
    • dedupe assumptions align with producer Nats-Msg-Id usage
  • Add unit tests for stream config builders + validators.
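
The compatibility rules above can be validated without a NATS connection by comparing a policy against a plain description of the existing stream. The sketch below assumes hypothetical StreamPolicy and ObservedStream types; the subject pattern and the numeric limits in aggregate_events_policy are illustrative placeholders, not the project's real values. Subject comparison is by exact string, so wildcard containment is not evaluated.

```rust
use std::time::Duration;

/// Hypothetical policy description; all values below are illustrative.
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<&'static str>,
    pub max_age_min: Duration,
    pub max_age_max: Duration,
    pub min_duplicate_window: Duration,
}

/// What startup code would read back from an existing stream.
pub struct ObservedStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Incompatibility {
    MissingSubject(String),
    MaxAgeOutOfRange(Duration),
    DuplicateWindowTooShort(Duration),
}

/// Returns every rule the observed stream violates; empty means compatible.
/// Extra subjects on the stream are allowed (superset rule).
pub fn validate(policy: &StreamPolicy, observed: &ObservedStream) -> Vec<Incompatibility> {
    let mut problems = Vec::new();
    for required in &policy.required_subjects {
        if !observed.subjects.iter().any(|s| s.as_str() == *required) {
            problems.push(Incompatibility::MissingSubject((*required).to_string()));
        }
    }
    if observed.max_age < policy.max_age_min || observed.max_age > policy.max_age_max {
        problems.push(Incompatibility::MaxAgeOutOfRange(observed.max_age));
    }
    if observed.duplicate_window < policy.min_duplicate_window {
        problems.push(Incompatibility::DuplicateWindowTooShort(observed.duplicate_window));
    }
    problems
}

/// Example policy for AGGREGATE_EVENTS with placeholder subjects and limits.
pub fn aggregate_events_policy() -> StreamPolicy {
    StreamPolicy {
        name: "AGGREGATE_EVENTS",
        required_subjects: vec!["tenant.*.aggregate.>"],
        max_age_min: Duration::from_secs(24 * 3600),
        max_age_max: Duration::from_secs(30 * 24 * 3600),
        min_duplicate_window: Duration::from_secs(120),
    }
}
```

Because the validator only reports problems and never mutates anything, startup can fail fast on a mismatch instead of rewriting the stream.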

Required Tests

  • Workspace verification commands

Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)

Dependencies

  • Milestone 2

Goal

Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.

Exit Criteria

  • All long-lived consumers use explicit ack with standardized defaults (ack_wait, max_deliver, max_ack_pending).
  • Application-level concurrency is bounded and aligned with max_ack_pending (the in-flight bound).
  • Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
  • Gated NATS integration tests prove:
    • redelivery idempotency
    • poison termination
    • scale-out behavior (deliver group) where applicable

Tasks

  • Standardize consumer defaults:
    • AckPolicy::Explicit
    • ack_wait default + env override
    • max_deliver default + env override
    • max_ack_pending tied to worker concurrency
  • Projection:
    • durable naming collision-free for Single/PerView modes
    • checkpoint gate semantics: a message skipped by the gate is still acked
    • poison handling persists durable records and terminates reliably
  • Runner:
    • durable naming collision-free and stable across replicas
    • deliver group rules defined and tested
    • outbox relay exactly-once behavior verified under redelivery
  • Aggregate:
    • ad-hoc fetch consumer always unique and bounded
    • best-effort deletion never targets unrelated consumers
  • Add gated NATS integration tests and document env flags:
    • Runner ignored tests
    • Projection ignored tests

Required Tests

  • Workspace verification commands
  • Runner: RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored
  • Projection: PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored

Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)

Dependencies

  • Milestone 0 (context contract)

Goal

Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.

Exit Criteria

  • Projection exposes projection.gateway.v1.QueryService.
  • Gateway routes queries via gRPC by default.
  • Authz remains enforced in Gateway (deny-by-default).
  • Query responses remain stable for Control UI expectations.
  • New gRPC query tests pass (unit + integration).

Tasks

  • Define protobuf API: projection.gateway.v1.QueryService
  • Implement Projection gRPC server for query execution
  • Implement Gateway gRPC client routing to Projection
    • deadlines
    • bounded retries (idempotent only)
    • context propagation
  • Preserve HTTP /v1/query/* as compatibility/debug:
    • route internally to gRPC or keep as legacy endpoint
  • Add tests:
    • authz + forwarding via gRPC
    • tenant isolation enforcement in Projection QueryService
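
The "deadlines" and "bounded retries (idempotent only)" requirements for the Gateway client can be expressed as a pure decision function, independent of any gRPC library. The sketch below is an assumption about shape only: RetryProfile, its fields, and the permille-based jitter input are hypothetical. The caller supplies the remaining deadline budget and a random jitter value in 0..=1000.

```rust
use std::time::Duration;

/// Hypothetical retry profile; values are chosen by the caller.
#[derive(Debug, Clone, Copy)]
pub struct RetryProfile {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryProfile {
    /// Decides whether to retry and how long to wait first.
    ///
    /// - `idempotent`: only idempotent calls are ever retried.
    /// - `attempts_made`: attempts already performed (1 after the first failure).
    /// - `remaining`: time left before the request deadline.
    /// - `jitter_permille`: random value in 0..=1000 scaling the backoff.
    ///
    /// Returns None when no retry is allowed or the wait would not fit
    /// inside the remaining deadline budget.
    pub fn next_delay(
        &self,
        idempotent: bool,
        attempts_made: u32,
        remaining: Duration,
        jitter_permille: u32,
    ) -> Option<Duration> {
        if !idempotent || attempts_made >= self.max_attempts {
            return None;
        }
        // Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
        let exponent = attempts_made.saturating_sub(1).min(16);
        let backoff = self
            .base_delay
            .saturating_mul(1u32 << exponent)
            .min(self.max_delay);
        let delay = backoff * jitter_permille.min(1000) / 1000;
        if delay < remaining { Some(delay) } else { None }
    }
}
```

Checking the delay against the remaining budget keeps retries from pushing a request past its deadline, so the retry bound and the deadline bound reinforce each other.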

Required Tests

  • Workspace verification commands

Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)

Dependencies

  • Milestone 0 (context contract)

Goal

Replace Gateway's /admin/runner/* HTTP proxy usage with a first-class gRPC admin service.

Exit Criteria

  • Runner exposes runner.admin.v1.RunnerAdmin.
  • Gateway calls Runner admin via gRPC (authz enforced in Gateway).
  • Tenant-spoof and unauthorized calls are rejected deterministically.
  • Runner drain/readiness semantics validated and tested.

Tasks

  • Define protobuf API: runner.admin.v1.RunnerAdmin
  • Implement Runner gRPC admin server
  • Implement Gateway gRPC client integration for admin operations
  • Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
  • Add tests:
    • Gateway: rejects without rights
    • Gateway: rejects tenant spoof attempts
    • Runner: idempotency and drain semantics
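
The two Gateway rejection cases in the test list (missing rights, tenant spoofing) can be reduced to one deterministic check, sketched below. The Principal type, the Denied enum, and the idea of a single named right are assumptions. The point of the sketch is the fixed evaluation order, which is what makes the rejection reason deterministic and testable.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    MissingRight,
    TenantMismatch,
}

/// Hypothetical authenticated caller, as established by the Gateway.
pub struct Principal {
    pub tenant_id: String,
    pub rights: Vec<String>,
}

/// Deny-by-default check for a Runner admin call.
/// `requested_tenant` is the tenant named in the request (for example the
/// x-tenant-id header); it must match the authenticated tenant.
pub fn authorize_runner_admin(
    principal: &Principal,
    requested_tenant: &str,
    required_right: &str,
) -> Result<(), Denied> {
    // Rights are checked first so the outcome does not depend on header contents
    // for callers that are not allowed to use the admin surface at all.
    if !principal.rights.iter().any(|r| r == required_right) {
        return Err(Denied::MissingRight);
    }
    if principal.tenant_id != requested_tenant {
        return Err(Denied::TenantMismatch);
    }
    Ok(())
}
```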

Required Tests

  • Workspace verification commands

Milestone 6: Gateway Upstream Performance + Operational Guardrails

Dependencies

  • Milestones 4 and 5 (gRPC internal RPC surfaces available)

Goal

Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.

Exit Criteria

  • Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
  • Deadlines everywhere; retries only for idempotent operations.
  • Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
  • Gated load/soak tests exist and are runnable.

Tasks

  • Implement upstream channel pool
    • bounded LRU
    • TTL/eviction
    • fast-path reuse under load
  • Standardize retry profiles
    • read-only: limited retry with jitter
    • mutations: no retry unless idempotency key is present and semantics are safe
  • Standardize timeouts/deadlines:
    • edge timeout limits
    • internal per-service deadlines
  • Fanout controls:
    • concurrency limiters for probes/snapshots
    • short TTL caching where safe
  • Ensure probes carry context (correlation/trace) for observability.
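
A bounded upstream pool with LRU eviction and a TTL can be sketched with the standard library alone. In the sketch below, ChannelPool and its methods are hypothetical names, the channel type is a generic parameter standing in for a real gRPC channel handle, and eviction uses a linear scan, which is an assumption that the pool stays small. Time is passed in explicitly so the behavior is testable.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: Instant,
}

/// Hypothetical bounded pool keyed by upstream address.
pub struct ChannelPool<C> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity: capacity.max(1), ttl, entries: HashMap::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns the pooled channel for `key`, creating one with `connect`
    /// when it is absent or expired. `now` is injected for testability.
    pub fn get_or_insert_with(
        &mut self,
        key: &str,
        now: Instant,
        connect: impl FnOnce() -> C,
    ) -> C {
        // Drop entries older than the TTL.
        let ttl = self.ttl;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.created) < ttl);

        // Fast path: reuse an existing channel and mark it recently used.
        if let Some(entry) = self.entries.get_mut(key) {
            entry.last_used = now;
            return entry.channel.clone();
        }

        // At capacity: evict the least recently used entry (linear scan).
        if self.entries.len() >= self.capacity {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(lru_key) = lru {
                self.entries.remove(&lru_key);
            }
        }

        let channel = connect();
        self.entries.insert(
            key.to_string(),
            Entry { channel: channel.clone(), created: now, last_used: now },
        );
        channel
    }
}
```

In a real Gateway the pool would sit behind a lock or be sharded; that concurrency layer is left out here to keep the eviction logic visible.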

Required Tests

  • Workspace verification commands
  • Gated load/soak tests (document env + how to run)

Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)

Dependencies

  • Milestone 6

Goal

Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.

Exit Criteria

  • Gateway no longer depends on HTTP for Projection queries or Runner admin.
  • Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
  • End-to-end smoke tests pass (gated).

Tasks

  • Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
  • Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
  • Ensure Control UI + Control API rely only on standardized surfaces
  • Harden metrics and readiness probes to match the standard contract everywhere

Required Tests

  • Workspace verification commands
  • End-to-end smoke tests (gated)

Workspace Verification Commands (Run for Every Milestone)

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace
  • npm ci && npm run lint && npm run typecheck && npm run test && npm run build (in control/ui)