Monorepo consolidation: workspace, shared types, transport plans, docker/swarm assets
Some checks failed
ci / rust (push) Failing after 2m34s
ci / ui (push) Failing after 30s

2026-03-30 11:40:42 +03:00
parent 7e7041cf8b
commit 1298d9a3df
246 changed files with 55434 additions and 0 deletions

@@ -0,0 +1,333 @@
# Transport Development Plan
## Purpose
Unify and optimize the platform transport layer end-to-end:
- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate
This plan merges and supersedes:
- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`
## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are partially standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - Some header names are centralized in `shared`, but not all call sites use the constants yet.
- Gateway → Aggregate is already HTTP (edge) → gRPC (internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains an HTTP proxy (`/v1/query/...`), and Gateway → Runner remains an HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation has improved and is closer to consistent:
  - Runner publishes `x-correlation-id` and `correlation-id`, and ensures `traceparent`/`trace-id` are present/derived when possible.
  - Aggregate publishes `trace-id` when `traceparent` is present.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
## Principles
- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.
## Non-Negotiable Rules (Global)
- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented
## Baseline (Today)
- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`
## End State (Target Architecture)
- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers
## Standard Contracts
### Context Fields
- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
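
The field mapping above could be centralized roughly as follows. This is a minimal sketch, not the `shared` crate's actual API: the module layout, the constant names, and `nats_headers_from_http` are all hypothetical, and only the header strings come from the list above. It covers the fields that have a NATS counterpart and skips trace validation.

```rust
use std::collections::BTreeMap;

/// Hypothetical constant names; the header strings are the ones listed above.
pub mod http {
    pub const TENANT_ID: &str = "x-tenant-id";
    pub const CORRELATION_ID: &str = "x-correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
}

pub mod nats {
    pub const TENANT_ID: &str = "tenant-id";
    pub const X_CORRELATION_ID: &str = "x-correlation-id";
    pub const CORRELATION_ID: &str = "correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const TRACE_ID: &str = "trace-id";
}

/// Map inbound HTTP headers to NATS headers.
/// Assumes the keys of `http_headers` are already lower-cased.
pub fn nats_headers_from_http(
    http_headers: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    if let Some(tenant) = http_headers.get(http::TENANT_ID) {
        out.insert(nats::TENANT_ID.to_string(), tenant.clone());
    }
    if let Some(corr) = http_headers.get(http::CORRELATION_ID) {
        // Both spellings are emitted, matching the contract above.
        out.insert(nats::X_CORRELATION_ID.to_string(), corr.clone());
        out.insert(nats::CORRELATION_ID.to_string(), corr.clone());
    }
    if let Some(tp) = http_headers.get(http::TRACEPARENT) {
        out.insert(nats::TRACEPARENT.to_string(), tp.clone());
        // The trace id is the second dash-separated field of a traceparent.
        // No validation is done in this sketch.
        if let Some(trace_id) = tp.split('-').nth(1) {
            out.insert(nats::TRACE_ID.to_string(), trace_id.to_string());
        }
    }
    out
}
```

Keeping the mapping in one function means a publisher cannot forget one of the two correlation spellings, which is the kind of drift the contract is meant to prevent.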
### Standard Service Endpoints (every service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus
## Milestone 0: Shared Transport Contract (Headers + Context + Trace)
### Goal
Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.
### Exit Criteria
- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.
### Tasks
- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [~] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [ ] HTTP: `x-tenant-id`, `x-request-id` (missing constants)
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [ ] NATS: `tenant-id` constant, `Nats-Msg-Id` constant (missing constants)
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [ ] normalize/generate correlation id when missing across all transports (the `CorrelationId::generate()` helper exists; adoption is incomplete)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [ ] additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
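
For the stricter validation cases in the last task, a stand-alone sketch of the two derivation helpers might look like this. It is not the code in `shared`: the signatures are assumptions (in particular, the caller-supplied `span_id` parameter and the fixed `01` flags), and only version `00` of the W3C `traceparent` format is handled.

```rust
/// True if `s` is exactly `len` lowercase hex characters and not all zeros.
fn is_valid_hex_id(s: &str, len: usize) -> bool {
    s.len() == len
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        && s.bytes().any(|b| b != b'0')
}

/// Extract the 32-character trace id from a version-00 `traceparent`
/// (`00-<trace-id>-<parent-id>-<flags>`). Returns `None` if any field is malformed.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<&str> {
    let mut parts = traceparent.split('-');
    let (version, trace_id, parent_id, flags) =
        (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
    // Version 00 has exactly four fields.
    if parts.next().is_some() || version != "00" {
        return None;
    }
    let flags_ok = flags.len() == 2
        && flags.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if is_valid_hex_id(trace_id, 32) && is_valid_hex_id(parent_id, 16) && flags_ok {
        Some(trace_id)
    } else {
        None
    }
}

/// Build a `traceparent` from a trace id and a caller-supplied span id.
/// Returns `None` when either id is invalid. The `01` (sampled) flag is an assumption.
pub fn traceparent_from_trace_id(trace_id: &str, span_id: &str) -> Option<String> {
    if is_valid_hex_id(trace_id, 32) && is_valid_hex_id(span_id, 16) {
        Some(format!("00-{trace_id}-{span_id}-01"))
    } else {
        None
    }
}
```

Rejecting all-zero ids and wrong lengths up front keeps invalid trace context from being re-emitted on the NATS side as if it were valid.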
### Required Tests
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)
### Dependencies
- Milestone 0
### Goal
Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.
### Exit Criteria
- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.
### Tasks
- [ ] Create/standardize subject builder helpers (prefer `shared`):
  - [ ] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
  - [ ] Runner effect/effect_result/workflow subject builders
- [~] Aggregate publishes:
  - [ ] `tenant-id` header always present (still needs enforcement everywhere)
  - [ ] correlation + trace headers always present when available, generated when required
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [ ] `Nats-Msg-Id` strategy explicitly defined and tested
- [~] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`)
  - [x] trace headers derived consistently when possible (`traceparent` from `trace-id`, `trace-id` from `traceparent`)
  - [ ] outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
- [~] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
- [ ] Add unit tests:
  - [ ] subject formatting tests per service (once builders exist)
  - [ ] required header presence tests per publisher (enforce required keys)
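
The Aggregate subject builder from the first task could be sketched like this. The function name, the error type, and the token rules are assumptions; only the `tenant.<tenant>.aggregate.<type>.<id>` shape comes from the plan. Rejecting `.`, `*`, `>` and whitespace is what guarantees that producers publish concrete subjects only.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken(&'static str),
    InvalidToken(&'static str),
}

/// A token must be non-empty and free of NATS separators, wildcards and whitespace.
fn check_token(field: &'static str, value: &str) -> Result<(), SubjectError> {
    if value.is_empty() {
        return Err(SubjectError::EmptyToken(field));
    }
    if value
        .chars()
        .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
    {
        return Err(SubjectError::InvalidToken(field));
    }
    Ok(())
}

/// Build `tenant.<tenant>.aggregate.<type>.<id>` or fail on an unsafe token.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    check_token("tenant", tenant)?;
    check_token("aggregate_type", aggregate_type)?;
    check_token("aggregate_id", aggregate_id)?;
    Ok(format!(
        "tenant.{tenant}.aggregate.{aggregate_type}.{aggregate_id}"
    ))
}
```

A subject formatting test per service then reduces to a handful of assertions against this one function.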
### Required Tests
- Workspace verification commands
## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)
### Dependencies
- Milestone 1
### Goal
Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.
### Exit Criteria
- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.
### Tasks
- [ ] Define stream policies:
  - [ ] `AGGREGATE_EVENTS` (subjects, retention, duplicate window)
  - [ ] `WORKFLOW_COMMANDS`
  - [ ] `WORKFLOW_EVENTS`
- [ ] Implement compatibility validation rules:
  - [ ] required subjects are present (superset allowed)
  - [ ] retention/limits are within allowed ranges
  - [ ] dedupe assumptions align with producer `Nats-Msg-Id` usage
- [ ] Add unit tests for stream config builders + validators.
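
The compatibility rules above can be exercised without a NATS server if the policy and the observed stream config are plain data. The sketch below assumes hypothetical `StreamPolicy` and `ExistingStream` types and compares subjects by exact string match (it does not reason about wildcard coverage). The concrete subject and durations in any test would be placeholders, since the plan has not defined them yet.

```rust
use std::time::Duration;

/// What the service requires of a stream (hypothetical shape).
#[derive(Debug, Clone)]
pub struct StreamPolicy {
    pub required_subjects: Vec<String>,
    /// Allowed (min, max) retention age.
    pub max_age_range: (Duration, Duration),
    /// Minimum duplicate window the producers' `Nats-Msg-Id` usage relies on.
    pub min_duplicate_window: Duration,
}

/// What was found on the server at startup (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ExistingStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Incompatible {
    MissingSubject(String),
    MaxAgeOutOfRange,
    DuplicateWindowTooShort,
}

/// Validate only; never mutates or replaces the existing stream.
pub fn validate(policy: &StreamPolicy, existing: &ExistingStream) -> Result<(), Incompatible> {
    // Required subjects must be present; extra subjects (a superset) are allowed.
    for subject in &policy.required_subjects {
        if !existing.subjects.contains(subject) {
            return Err(Incompatible::MissingSubject(subject.clone()));
        }
    }
    let (min_age, max_age) = policy.max_age_range;
    if existing.max_age < min_age || existing.max_age > max_age {
        return Err(Incompatible::MaxAgeOutOfRange);
    }
    if existing.duplicate_window < policy.min_duplicate_window {
        return Err(Incompatible::DuplicateWindowTooShort);
    }
    Ok(())
}
```

Returning an error instead of updating the stream matches the "no destructive startup" goal: the service can refuse to start and leave the decision to an operator.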
### Required Tests
- Workspace verification commands
## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)
### Dependencies
- Milestone 2
### Goal
Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.
### Exit Criteria
- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable
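
One way to keep the poison policy consistent is to route every consumer through a single pure decision function. The sketch below is an assumption about how that policy might be expressed, not the current behavior of any service: the `Outcome` and `Disposition` names are hypothetical, and the rule that a transient failure is quarantined once `delivery_count` reaches `max_deliver` is one possible reading of "term + durable quarantine".

```rust
/// Result of handling one message (hypothetical classification).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Processed,
    /// Already covered by a checkpoint; nothing to do.
    SkippedByCheckpoint,
    TransientFailure,
    PermanentFailure,
}

/// What the consumer should tell JetStream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    Ack,
    /// Negative-ack so the message is redelivered.
    Nak,
    /// Terminate redelivery and persist a quarantine/dead-letter record.
    TermAndQuarantine,
}

/// `delivery_count` is 1 on first delivery.
pub fn disposition(outcome: Outcome, delivery_count: u64, max_deliver: u64) -> Disposition {
    match outcome {
        // A skipped message is still acked so it is not redelivered.
        Outcome::Processed | Outcome::SkippedByCheckpoint => Disposition::Ack,
        Outcome::PermanentFailure => Disposition::TermAndQuarantine,
        Outcome::TransientFailure if delivery_count >= max_deliver => {
            Disposition::TermAndQuarantine
        }
        Outcome::TransientFailure => Disposition::Nak,
    }
}
```

Because the function has no I/O, the redelivery and poison cases can be unit-tested in the workspace run, leaving the gated NATS tests to prove the wiring.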
### Tasks
- [ ] Standardize consumer defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to worker concurrency
- [ ] Projection:
  - [ ] durable naming collision-free for Single/PerView modes
  - [ ] checkpoint gate semantics: “skip still acks”
  - [ ] poison handling persists durable records and terminates reliably
- [ ] Runner:
  - [ ] durable naming collision-free and stable across replicas
  - [ ] deliver group rules defined and tested
  - [ ] outbox relay exactly-once behavior verified under redelivery
- [ ] Aggregate:
  - [ ] ad-hoc fetch consumer always unique and bounded
  - [ ] best-effort deletion never targets unrelated consumers
- [ ] Add gated NATS integration tests and document env flags:
  - [ ] Runner ignored tests
  - [ ] Projection ignored tests
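
The "default + env override" tasks could share one small loader. In this sketch the environment variable names (`CONSUMER_ACK_WAIT_SECS`, `CONSUMER_MAX_DELIVER`) and the default values (30 seconds, 5 deliveries) are placeholders chosen for illustration, not values the plan has settled on. The lookup is injected so the behavior can be tested without touching the process environment.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

impl ConsumerDefaults {
    /// Build defaults from an arbitrary key lookup (env in production, a closure in tests).
    pub fn from_env_with(
        lookup: impl Fn(&str) -> Option<String>,
        worker_concurrency: usize,
    ) -> Self {
        // Unparseable or non-positive overrides fall back to the default.
        let ack_wait_secs = lookup("CONSUMER_ACK_WAIT_SECS")
            .and_then(|v| v.parse::<u64>().ok())
            .filter(|secs| *secs > 0)
            .unwrap_or(30);
        let max_deliver = lookup("CONSUMER_MAX_DELIVER")
            .and_then(|v| v.parse::<i64>().ok())
            .filter(|n| *n > 0)
            .unwrap_or(5);
        Self {
            ack_wait: Duration::from_secs(ack_wait_secs),
            max_deliver,
            // Tie in-flight messages to the number of workers (at least one).
            max_ack_pending: worker_concurrency.max(1) as i64,
        }
    }

    /// Convenience wrapper reading the real process environment.
    pub fn from_env(worker_concurrency: usize) -> Self {
        Self::from_env_with(|key| std::env::var(key).ok(), worker_concurrency)
    }
}
```

Deriving `max_ack_pending` from the worker count rather than a separate knob is one way to keep server-side and application-side bounds from drifting apart.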
### Required Tests
- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`
## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Implement Projection gRPC server for query execution
- [ ] Implement Gateway gRPC client routing to Projection
  - [ ] deadlines
  - [ ] bounded retries (idempotent only)
  - [ ] context propagation
- [ ] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [ ] route internally to gRPC or keep as legacy endpoint
- [ ] Add tests:
  - [ ] authz + forwarding via gRPC
  - [ ] tenant isolation enforcement in Projection QueryService
### Required Tests
- Workspace verification commands
## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)
### Dependencies
- Milestone 0 (context contract)
### Goal
Replace Gateway's `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.
### Exit Criteria
- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Implement Runner gRPC admin server
- [ ] Implement Gateway gRPC client integration for admin operations
- [ ] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [ ] Add tests:
  - [ ] Gateway: rejects without rights
  - [ ] Gateway: rejects tenant spoof attempts
  - [ ] Runner: idempotency and drain semantics
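
The two Gateway rejection tests imply a check that runs before any admin call is forwarded. A minimal sketch, assuming a hypothetical `Principal` type and a hypothetical right name such as `runner:admin` (neither is defined by this plan):

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum AdminDenied {
    MissingRight,
    TenantMismatch,
}

/// Authenticated caller as seen by the Gateway (hypothetical shape).
pub struct Principal<'a> {
    pub tenant: &'a str,
    pub rights: &'a [&'a str],
}

/// Deny by default: the right must be explicitly granted, and the tenant named
/// in the request must match the tenant the caller authenticated as.
pub fn authorize_admin(
    principal: &Principal<'_>,
    requested_tenant: &str,
    required_right: &str,
) -> Result<(), AdminDenied> {
    if !principal.rights.iter().any(|r| *r == required_right) {
        return Err(AdminDenied::MissingRight);
    }
    if principal.tenant != requested_tenant {
        // A caller-supplied tenant that differs from the authenticated one is a spoof attempt.
        return Err(AdminDenied::TenantMismatch);
    }
    Ok(())
}
```

Checking the right before the tenant gives a deterministic error for callers that fail both checks, which helps make the rejection tests stable.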
### Required Tests
- Workspace verification commands
## Milestone 6: Gateway Upstream Performance + Operational Guardrails
### Dependencies
- Milestones 4-5 (gRPC internal RPC surfaces available)
### Goal
Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.
### Exit Criteria
- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.
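
The bounded pool in the first criterion might be shaped like the sketch below. It is generic over the pooled value (for example a cloneable channel handle), treats TTL as idle time since last use, and takes `now` as a parameter so eviction can be tested deterministically. The `BoundedPool` name and its API are hypothetical, and the linear scan for the least recently used entry assumes a small pool.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

pub struct BoundedPool<V> {
    capacity: usize,
    ttl: Duration,
    /// key -> (value, last-used time)
    entries: HashMap<String, (V, Instant)>,
}

impl<V: Clone> BoundedPool<V> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity: capacity.max(1), ttl, entries: HashMap::new() }
    }

    /// Return the pooled value for `key`, creating it with `connect` if absent.
    pub fn get_or_insert_with(
        &mut self,
        key: &str,
        now: Instant,
        connect: impl FnOnce() -> V,
    ) -> V {
        // TTL eviction: drop entries that have been idle longer than `ttl`.
        let ttl = self.ttl;
        self.entries
            .retain(|_, (_, last_used)| now.saturating_duration_since(*last_used) <= ttl);

        // Fast path: reuse and refresh the last-used time.
        if let Some((value, last_used)) = self.entries.get_mut(key) {
            *last_used = now;
            return value.clone();
        }

        // LRU eviction: make room by removing the least recently used entry.
        if self.entries.len() >= self.capacity {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (_, last_used))| *last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }

        let value = connect();
        self.entries.insert(key.to_string(), (value.clone(), now));
        value
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

A production version would likely sit behind a lock or use a sharded map; the sketch only shows the two eviction rules and the reuse path.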
### Tasks
- [ ] Implement upstream channel pool
  - [ ] bounded LRU
  - [ ] TTL/eviction
  - [ ] fast-path reuse under load
- [ ] Standardize retry profiles
  - [ ] read-only: limited retry with jitter
  - [ ] mutations: no retry unless idempotency key is present and semantics are safe
- [ ] Standardize timeouts/deadlines:
  - [ ] edge timeout limits
  - [ ] internal per-service deadlines
- [ ] Fanout controls:
  - [ ] concurrency limiters for probes/snapshots
  - [ ] short TTL caching where safe
- [ ] Ensure probes carry context (correlation/trace) for observability.
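
The retry profiles above can be captured as data plus two pure functions. This is a sketch under assumptions: `RetryProfile` and `OpKind` are hypothetical names, the backoff is capped exponential with the jitter supplied by the caller as an integer in `0..=1000` (so the code needs no random number crate), and "semantics are safe" is reduced here to the presence of an idempotency key.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    ReadOnly,
    Mutation { has_idempotency_key: bool },
}

pub struct RetryProfile {
    /// Total attempts allowed, including the first one.
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryProfile {
    /// `attempts_made` is how many attempts have already failed (1 after the first failure).
    pub fn should_retry(&self, op: OpKind, attempts_made: u32) -> bool {
        let retryable = match op {
            OpKind::ReadOnly => true,
            // Mutations are retried only when an idempotency key makes that safe.
            OpKind::Mutation { has_idempotency_key } => has_idempotency_key,
        };
        retryable && attempts_made < self.max_attempts
    }

    /// Capped exponential backoff scaled by `jitter_permille / 1000`.
    /// The caller supplies the random value (0..=1000).
    pub fn backoff(&self, attempts_made: u32, jitter_permille: u32) -> Duration {
        let factor = 2u32.saturating_pow(attempts_made.saturating_sub(1));
        let capped = self.base_delay.saturating_mul(factor).min(self.max_delay);
        capped.saturating_mul(jitter_permille.min(1000)) / 1000
    }
}
```

Because the delay is always at most `max_delay` and the attempt count is bounded, the worst-case time spent retrying is known up front and can be checked against the edge timeout.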
### Required Tests
- Workspace verification commands
- Gated load/soak tests (document env + how to run)
## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)
### Dependencies
- Milestone 6
### Goal
Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).
### Tasks
- [ ] Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
- [ ] Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
- [ ] Ensure Control UI + Control API rely only on standardized surfaces
- [ ] Harden metrics and readiness probes to match the standard contract everywhere
### Required Tests
- Workspace verification commands
- End-to-end smoke tests (gated)
## Workspace Verification Commands (Run for Every Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)