Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
333
TRANSPORT_DEVELOPMENT_PLAN.md
Normal file
@@ -0,0 +1,333 @@
# Transport Development Plan

## Purpose

Unify and optimize the platform transport layer end-to-end:

- Gateway ↔ nodes (Aggregate, Projection, Runner): routing + RPC/proxying + probes
- Node ↔ NATS JetStream/KV: event/work distribution + configuration substrate

This plan merges and supersedes:

- `GATEWAY_TRANSPORT_PLAN.md`
- `NATS_TRANSPORT_PLAN.md`

## Current Status (Codebase Reality)

- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are partially standardized:
  - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
  - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
  - Some header names are centralized in `shared` but not all call sites use constants yet.
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
  - Runner publishes `x-correlation-id` and `correlation-id`, and ensures `traceparent`/`trace-id` are present/derived when possible.
  - Aggregate publishes `trace-id` when `traceparent` is present.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

## Principles

- Simplicity: minimize distinct patterns; prefer one internal RPC stack + one async backbone.
- Ease of operation: consistent health/ready/metrics; consistent naming; predictable failure modes.
- Frugality: bounded connections, bounded consumers, bounded in-flight work; no churny resources.
- Low resource usage: stable durables; avoid per-request reconnects; avoid unbounded loops.
- High performance: multiplexing, backpressure, low tail latency, predictable routing.
- Safety: tenant isolation, deny-by-default authz at the edge, idempotency, deterministic replay.

## Non-Negotiable Rules (Global)

- Every cross-component hop MUST carry tenant + correlation + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every JetStream stream/consumer MUST have an explicit contract (name/subjects/retention/ack policy).
- Every milestone is stop-the-line gated:
  - All tasks completed
  - All tests required by the milestone pass
  - Workspace verification commands pass
  - Gated integration tests for the milestone are runnable and documented

## Baseline (Today)

- Gateway → Aggregate: gRPC (command submission)
- Gateway → Projection: HTTP (query proxy)
- Gateway → Runner: HTTP (admin proxy)
- Node ↔ NATS JetStream: `AGGREGATE_EVENTS`, `WORKFLOW_COMMANDS`, `WORKFLOW_EVENTS`

## End State (Target Architecture)

- Edge contract (clients ↔ Gateway): HTTP/JSON
- Internal RPC (Gateway ↔ nodes): gRPC for Aggregate + Projection + Runner admin
- Async backbone: NATS JetStream for events/work distribution; NATS KV for routing/placement
- `shared` is the single source of truth for:
  - header names and injection/extraction rules
  - trace parsing/validation (`traceparent`, `trace-id`)
  - context object model (tenant/correlation/trace/request ids)
  - NATS subject + consumer naming helpers

## Standard Contracts

### Context Fields

- Tenant: HTTP `x-tenant-id`, NATS `tenant-id`
- Correlation: HTTP `x-correlation-id`, NATS `x-correlation-id` and `correlation-id`
- Trace: HTTP `traceparent`, NATS `traceparent` and `trace-id` (derived when possible)
- Request id: HTTP `x-request-id` (optional for NATS)
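
The context fields above could be captured as constants plus one context type, roughly as sketched here. The header values come from the contract; the module names, constant names, and the `RequestContext` type are assumptions for illustration and are not the actual `shared` API.

```rust
/// Hypothetical HTTP header-name constants (values taken from the contract).
pub mod http {
    pub const TENANT_ID: &str = "x-tenant-id";
    pub const CORRELATION_ID: &str = "x-correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const REQUEST_ID: &str = "x-request-id";
}

/// Hypothetical NATS header-name constants (values taken from the contract).
pub mod nats {
    pub const TENANT_ID: &str = "tenant-id";
    pub const CORRELATION_ID: &str = "correlation-id";
    pub const CORRELATION_ID_COMPAT: &str = "x-correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const TRACE_ID: &str = "trace-id";
}

/// Illustrative context object; field types are simplified to strings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: String,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
}

impl RequestContext {
    /// Header pairs a NATS publisher would attach for this context.
    /// Tenant and both correlation headers are always emitted; trace
    /// headers are emitted only when a value is available.
    pub fn to_nats_headers(&self) -> Vec<(&'static str, String)> {
        let mut headers = vec![
            (nats::TENANT_ID, self.tenant_id.clone()),
            (nats::CORRELATION_ID_COMPAT, self.correlation_id.clone()),
            (nats::CORRELATION_ID, self.correlation_id.clone()),
        ];
        if let Some(tp) = &self.traceparent {
            headers.push((nats::TRACEPARENT, tp.clone()));
        }
        if let Some(id) = &self.trace_id {
            headers.push((nats::TRACE_ID, id.clone()));
        }
        headers
    }
}
```

Keeping the mapping in one function means a publisher cannot emit one correlation header without the other, which is the kind of drift the contract is meant to prevent.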

### Standard Service Endpoints (every service)

- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if relevant)
- `GET /metrics` Prometheus

## Milestone 0: Shared Transport Contract (Headers + Context + Trace)

### Goal

Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so every later milestone builds on one contract.

### Exit Criteria

- `shared` contains canonical constants for header names and NATS header names.
- `shared` contains canonical trace parsing/validation and trace derivation helpers.
- Library-level unit tests cover parsing/derivation behavior.
- All crates build and tests pass for the workspace.

### Tasks

- [x] Add shared ID types in `shared`:
  - [x] `TenantId`
  - [x] `CorrelationId`
  - [x] `TraceId`
- [~] Consolidate header constants in `shared`:
  - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
  - [ ] HTTP: `x-tenant-id`, `x-request-id` (missing constants)
  - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
  - [ ] NATS: `tenant-id` constant, `Nats-Msg-Id` constant (missing constants)
- [x] Add shared helpers:
  - [x] derive `trace-id` from `traceparent`
  - [x] derive `traceparent` from `trace-id` when valid
  - [ ] normalize/generate correlation id when missing across all transports (helper exists for `CorrelationId::generate()`; adoption incomplete)
- [x] Add unit tests in `shared` for:
  - [x] traceparent parsing validity
  - [x] serialization shape for correlation/trace id newtypes
  - [ ] additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
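
For the stricter validation cases, the derivation in both directions might look like the sketch below. It follows the W3C Trace Context field layout (`version-traceid-parentid-flags`, lowercase hex, all-zero ids invalid). The function names `derive_trace_id` and `derive_traceparent` are hypothetical stand-ins and not the signatures of the existing `shared` helpers; accepting only version `00` and emitting the sampled flag `01` are assumptions made here for brevity.

```rust
/// True when `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// True when every byte is the digit zero (an invalid id per the spec).
fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Extracts the trace id from a `traceparent` value, or `None` if any
/// field is malformed. Assumption: only version `00` is accepted.
pub fn derive_trace_id(traceparent: &str) -> Option<String> {
    let mut parts = traceparent.split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    if parts.next().is_some() {
        return None; // version 00 has exactly four fields
    }
    let valid = version == "00"
        && is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id)
        && is_lower_hex(flags, 2);
    if valid {
        Some(trace_id.to_string())
    } else {
        None
    }
}

/// Builds a `traceparent` from a trace id and a caller-supplied span id.
/// Assumption: the sampled flag (`01`) is always set.
pub fn derive_traceparent(trace_id: &str, span_id: &str) -> Option<String> {
    let valid = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(span_id, 16)
        && !is_all_zero(span_id);
    if valid {
        Some(format!("00-{trace_id}-{span_id}-01"))
    } else {
        None
    }
}
```

Returning `None` for every malformed input keeps the "derived when possible" rule explicit: callers either get a valid value or fall back to omitting the header.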

### Required Tests

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`

## Milestone 1: NATS Wire Protocol Lock-In (Subjects + Headers + Envelopes)

### Dependencies

- Milestone 0

### Goal

Make the JetStream/NATS “wire protocol” explicit and uniform so interop is safe across scale-out and rolling restarts.

### Exit Criteria

- Subject naming is standardized and enforced via builder functions (producers publish concrete subjects only).
- All NATS producers set required headers consistently.
- All NATS consumers tolerate unknown fields and missing optional fields.
- “Contract tests” exist per service to verify produced headers and subject formats.

### Tasks

- [ ] Create/standardize subject builder helpers (prefer `shared`):
  - [ ] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
  - [ ] Runner effect/effect_result/workflow subject builders
- [~] Aggregate publishes:
  - [ ] `tenant-id` header always present (still needs enforcement everywhere)
  - [ ] correlation + trace headers always present when available, generated when required
  - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
  - [ ] `Nats-Msg-Id` strategy explicitly defined and tested
- [~] Runner publishes (commands/results):
  - [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`)
  - [x] trace headers derived consistently when possible (`traceparent` from `trace-id`, `trace-id` from `traceparent`)
  - [ ] outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
- [~] Projection consumption:
  - [x] envelope decoding remains tolerant (unknown fields ignored)
  - [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
- [ ] Add unit tests:
  - [ ] subject formatting tests per service (once builders exist)
  - [ ] required header presence tests per publisher (enforce required keys)
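
A subject builder and a required-header check of the kind listed above could be sketched as follows. The subject shape `tenant.<tenant>.aggregate.<type>.<id>` comes from the task list; the function names, the error type, and the exact token rules (no `.`, `*`, `>`, or whitespace, so producers can only emit concrete subjects) are assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken,
    InvalidChar(char),
}

/// Rejects tokens that would change the subject depth or act as wildcards.
fn validate_token(token: &str) -> Result<(), SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken);
    }
    match token
        .chars()
        .find(|c| matches!(c, '.' | '*' | '>') || c.is_whitespace())
    {
        Some(c) => Err(SubjectError::InvalidChar(c)),
        None => Ok(()),
    }
}

/// Builds `tenant.<tenant>.aggregate.<type>.<id>` from validated tokens.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    for token in [tenant, aggregate_type, aggregate_id] {
        validate_token(token)?;
    }
    Ok(format!(
        "tenant.{tenant}.aggregate.{aggregate_type}.{aggregate_id}"
    ))
}

/// Returns the required header names that are absent or empty in `present`.
/// Names are compared exactly; no case folding is applied in this sketch.
pub fn missing_required_headers<'a>(
    required: &[&'a str],
    present: &[(&str, &str)],
) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|name| !present.iter().any(|(k, v)| *k == *name && !v.is_empty()))
        .collect()
}
```

A per-publisher contract test can then assert that `missing_required_headers` returns an empty list for every message the service produces.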

### Required Tests

- Workspace verification commands

## Milestone 2: JetStream Stream Policy (Create/Validate, No Destructive Startup)

### Dependencies

- Milestone 1

### Goal

Make stream definitions explicit, validated, and safe in all environments, preventing resource runaway and accidental destructive changes.

### Exit Criteria

- Each stream has a single authoritative config policy (name/subjects/retention/limits/duplicate window).
- Services create streams if missing, and validate compatibility on startup.
- Startup does not silently replace or destructively mutate existing streams.
- Config-only tests validate stream config builders without requiring NATS.

### Tasks

- [ ] Define stream policies:
  - [ ] `AGGREGATE_EVENTS` (subjects, retention, duplicate window)
  - [ ] `WORKFLOW_COMMANDS`
  - [ ] `WORKFLOW_EVENTS`
- [ ] Implement compatibility validation rules:
  - [ ] required subjects are present (superset allowed)
  - [ ] retention/limits are within allowed ranges
  - [ ] dedupe assumptions align with producer `Nats-Msg-Id` usage
- [ ] Add unit tests for stream config builders + validators.
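
The compatibility rules above lend themselves to a pure function that needs no NATS connection, which is what makes config-only tests possible. The sketch below is one way to express it; `StreamPolicy`, `ObservedStream`, and `Incompatibility` are hypothetical types, and the subject check is a literal comparison (it does not reason about wildcard coverage).

```rust
use std::time::Duration;

/// Hypothetical authoritative policy for one stream.
#[derive(Debug, Clone)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    /// Inclusive (min, max) bounds for the stream's max age.
    pub max_age_range: (Duration, Duration),
    /// Shortest duplicate window that still covers producer retries.
    pub min_duplicate_window: Duration,
}

/// The subset of an existing stream's config that validation inspects.
#[derive(Debug, Clone)]
pub struct ObservedStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Incompatibility {
    MissingSubject(String),
    MaxAgeOutOfRange,
    DuplicateWindowTooShort,
}

/// Reports every incompatibility instead of mutating the stream.
/// Extra subjects on the observed stream are allowed (superset rule).
pub fn validate(policy: &StreamPolicy, observed: &ObservedStream) -> Vec<Incompatibility> {
    let mut problems = Vec::new();
    for subject in &policy.required_subjects {
        if !observed.subjects.contains(subject) {
            problems.push(Incompatibility::MissingSubject(subject.clone()));
        }
    }
    let (min_age, max_age) = policy.max_age_range;
    if observed.max_age < min_age || observed.max_age > max_age {
        problems.push(Incompatibility::MaxAgeOutOfRange);
    }
    if observed.duplicate_window < policy.min_duplicate_window {
        problems.push(Incompatibility::DuplicateWindowTooShort);
    }
    problems
}
```

On startup a service would create the stream when it is missing and otherwise call `validate`, failing readiness on a non-empty result rather than rewriting the existing stream.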

### Required Tests

- Workspace verification commands

## Milestone 3: Consumer Policy + Backpressure + Poison (Reliable and Cheap Under Load)

### Dependencies

- Milestone 2

### Goal

Standardize consumer configs and runtime behavior to guarantee bounded in-flight work, predictable redelivery behavior, and consistent poison handling.

### Exit Criteria

- All long-lived consumers use explicit ack with standardized defaults (`ack_wait`, `max_deliver`, `max_ack_pending`).
- Application-level concurrency is bounded and aligned with `max_in_flight`.
- Poison policy is consistent across consumers (term + durable quarantine/deadletter record).
- Gated NATS integration tests prove:
  - redelivery idempotency
  - poison termination
  - scale-out behavior (deliver group) where applicable

### Tasks

- [ ] Standardize consumer defaults:
  - [ ] `AckPolicy::Explicit`
  - [ ] `ack_wait` default + env override
  - [ ] `max_deliver` default + env override
  - [ ] `max_ack_pending` tied to worker concurrency
- [ ] Projection:
  - [ ] durable naming collision-free for Single/PerView modes
  - [ ] checkpoint gate semantics: “skip still acks”
  - [ ] poison handling persists durable records and terminates reliably
- [ ] Runner:
  - [ ] durable naming collision-free and stable across replicas
  - [ ] deliver group rules defined and tested
  - [ ] outbox relay exactly-once behavior verified under redelivery
- [ ] Aggregate:
  - [ ] ad-hoc fetch consumer always unique and bounded
  - [ ] best-effort deletion never targets unrelated consumers
- [ ] Add gated NATS integration tests and document env flags:
  - [ ] Runner ignored tests
  - [ ] Projection ignored tests
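
The "default + env override" items could be resolved by one small function, sketched below under assumptions: the environment variable names (`CONSUMER_ACK_WAIT_SECS`, `CONSUMER_MAX_DELIVER`), the default values (30 seconds, 5 deliveries), and the `ConsumerDefaults` type are all hypothetical. The lookup is passed in as a closure so the logic can be unit-tested without touching the process environment.

```rust
use std::time::Duration;

/// Illustrative resolved defaults for a long-lived consumer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConsumerDefaults {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Parses an optional raw value, falling back to `default` when the
/// value is missing or does not parse.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Resolves defaults from a lookup function (e.g. the environment).
/// `max_ack_pending` is tied to worker concurrency so the server never
/// hands out more unacked messages than the workers can process.
pub fn consumer_defaults(
    lookup: impl Fn(&str) -> Option<String>,
    worker_concurrency: usize,
) -> ConsumerDefaults {
    // Assumed variable names and default values; not from the plan.
    let ack_wait_secs: u64 = parse_or(lookup("CONSUMER_ACK_WAIT_SECS"), 30);
    let max_deliver: i64 = parse_or(lookup("CONSUMER_MAX_DELIVER"), 5);
    ConsumerDefaults {
        ack_wait: Duration::from_secs(ack_wait_secs.max(1)),
        max_deliver: max_deliver.max(1),
        max_ack_pending: worker_concurrency.max(1) as i64,
    }
}
```

In a service this would be called as `consumer_defaults(|key| std::env::var(key).ok(), workers)`, with the result copied into the JetStream consumer config alongside the explicit ack policy.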

### Required Tests

- Workspace verification commands
- Runner: `RUNNER_TEST_NATS_URL=... cargo test -p runner -- --ignored`
- Projection: `PROJECTION_TEST_NATS_URL=... cargo test -p projection -- --ignored`

## Milestone 4: Gateway → Projection Internal RPC (gRPC QueryService)

### Dependencies

- Milestone 0 (context contract)

### Goal

Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query service, keeping HTTP optional for human/debug use.

### Exit Criteria

- Projection exposes `projection.gateway.v1.QueryService`.
- Gateway routes queries via gRPC by default.
- Authz remains enforced in Gateway (deny-by-default).
- Query responses remain stable for Control UI expectations.
- New gRPC query tests pass (unit + integration).

### Tasks

- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Implement Projection gRPC server for query execution
- [ ] Implement Gateway gRPC client routing to Projection
  - [ ] deadlines
  - [ ] bounded retries (idempotent only)
  - [ ] context propagation
- [ ] Preserve HTTP `/v1/query/*` as compatibility/debug:
  - [ ] route internally to gRPC or keep as legacy endpoint
- [ ] Add tests:
  - [ ] authz + forwarding via gRPC
  - [ ] tenant isolation enforcement in Projection QueryService
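
The "deadlines" and "bounded retries (idempotent only)" items for the Gateway client can be illustrated independently of any gRPC library. The sketch below is synchronous and uses only the standard library; a real client would be async and would map the remaining budget onto the per-call gRPC timeout. The names `call_with_deadline` and `CallError` are hypothetical.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum CallError<E> {
    DeadlineExceeded,
    Failed(E),
}

/// Runs `op` until it succeeds, the attempt budget is spent, or the
/// overall deadline passes. Non-idempotent calls get exactly one attempt.
/// `op` receives the remaining time so it can set its own per-call timeout.
pub fn call_with_deadline<T, E>(
    idempotent: bool,
    max_attempts: u32,
    deadline: Instant,
    backoff: Duration,
    mut op: impl FnMut(Duration) -> Result<T, E>,
) -> Result<T, CallError<E>> {
    let attempts = if idempotent { max_attempts.max(1) } else { 1 };
    let mut attempt = 0;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(CallError::DeadlineExceeded);
        }
        attempt += 1;
        match op(remaining) {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= attempts => return Err(CallError::Failed(e)),
            // Never sleep past the deadline; the next loop turn re-checks it.
            Err(_) => std::thread::sleep(backoff.min(remaining)),
        }
    }
}
```

Queries are read-only, so the Gateway would pass `idempotent = true` for them; the same helper with `idempotent = false` gives mutations a deadline without any retry.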

### Required Tests

- Workspace verification commands

## Milestone 5: Gateway → Runner Admin Internal RPC (gRPC RunnerAdmin)

### Dependencies

- Milestone 0 (context contract)

### Goal

Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC admin service.

### Exit Criteria

- Runner exposes `runner.admin.v1.RunnerAdmin`.
- Gateway calls Runner admin via gRPC (authz enforced in Gateway).
- Tenant-spoof and unauthorized calls are rejected deterministically.
- Runner drain/readiness semantics validated and tested.

### Tasks

- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Implement Runner gRPC admin server
- [ ] Implement Gateway gRPC client integration for admin operations
- [ ] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [ ] Add tests:
  - [ ] Gateway: rejects without rights
  - [ ] Gateway: rejects tenant spoof attempts
  - [ ] Runner: idempotency and drain semantics
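
The two Gateway rejection tests above imply a deny-by-default check that runs before any admin call is forwarded. A minimal sketch follows, assuming a hypothetical `Principal` produced by authentication and hypothetical right names; it is not the Gateway's actual authz model.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum AdminReject {
    Unauthenticated,
    TenantMismatch,
    MissingRight,
}

/// Hypothetical authenticated caller: the tenant it belongs to and the
/// rights it was granted.
pub struct Principal<'a> {
    pub tenant_id: &'a str,
    pub rights: &'a [&'a str],
}

/// Allows the call only when the caller is authenticated, the tenant in
/// the request matches the caller's own tenant, and the required right
/// is present. Anything else is rejected with a deterministic reason.
pub fn authorize_admin(
    principal: Option<&Principal<'_>>,
    requested_tenant: &str,
    required_right: &str,
) -> Result<(), AdminReject> {
    let p = principal.ok_or(AdminReject::Unauthenticated)?;
    if p.tenant_id != requested_tenant {
        // The `x-tenant-id` sent by the client does not match the
        // authenticated tenant: treat it as a spoof attempt.
        return Err(AdminReject::TenantMismatch);
    }
    if !p.rights.iter().any(|r| *r == required_right) {
        return Err(AdminReject::MissingRight);
    }
    Ok(())
}
```

Because the function is pure, both rejection cases can be covered by unit tests in the Gateway without a running Runner.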

### Required Tests

- Workspace verification commands

## Milestone 6: Gateway Upstream Performance + Operational Guardrails

### Dependencies

- Milestones 4–5 (gRPC internal RPC surfaces available)

### Goal

Make Gateway upstream connection handling, retry behavior, and probe/fanout operations consistent, bounded, and cheap under load.

### Exit Criteria

- Bounded upstream gRPC channel pool exists (LRU + TTL/eviction).
- Deadlines everywhere; retries only for idempotent operations.
- Probe/fanout calls are bounded (timeouts + concurrency limits) and carry context.
- Gated load/soak tests exist and are runnable.

### Tasks

- [ ] Implement upstream channel pool
  - [ ] bounded LRU
  - [ ] TTL/eviction
  - [ ] fast-path reuse under load
- [ ] Standardize retry profiles
  - [ ] read-only: limited retry with jitter
  - [ ] mutations: no retry unless idempotency key is present and semantics are safe
- [ ] Standardize timeouts/deadlines:
  - [ ] edge timeout limits
  - [ ] internal per-service deadlines
- [ ] Fanout controls:
  - [ ] concurrency limiters for probes/snapshots
  - [ ] short TTL caching where safe
- [ ] Ensure probes carry context (correlation/trace) for observability.
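
The channel pool task (bounded LRU, TTL/eviction, fast-path reuse) can be sketched with the standard library alone. In the sketch below `ChannelPool` is a hypothetical type, generic over a cloneable handle `C` standing in for a gRPC channel; the current time is passed in explicitly so eviction can be tested deterministically. Eviction scans all entries, which is acceptable only because the pool is small and bounded.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Bounded pool keyed by endpoint; each entry stores the handle and the
/// time it was last used.
pub struct ChannelPool<C> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, (C, Instant)>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self {
            capacity: capacity.max(1),
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Returns a pooled handle for `endpoint`, calling `connect` only on a
    /// miss. Idle entries past the TTL are dropped first; when the pool is
    /// full, the least recently used entry is evicted.
    pub fn get_or_connect(
        &mut self,
        endpoint: &str,
        now: Instant,
        connect: impl FnOnce() -> C,
    ) -> C {
        let ttl = self.ttl;
        self.entries
            .retain(|_, (_, last_used)| now.saturating_duration_since(*last_used) <= ttl);

        // Fast path: reuse and refresh the last-used time.
        if let Some((channel, last_used)) = self.entries.get_mut(endpoint) {
            *last_used = now;
            return channel.clone();
        }

        if self.entries.len() >= self.capacity {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (_, last_used))| *last_used)
                .map(|(key, _)| key.clone());
            if let Some(key) = oldest {
                self.entries.remove(&key);
            }
        }

        let channel = connect();
        self.entries
            .insert(endpoint.to_string(), (channel.clone(), now));
        channel
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The capacity bound caps open connections per Gateway instance, and the TTL keeps channels to retired nodes from lingering after placement changes.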

### Required Tests

- Workspace verification commands
- Gated load/soak tests (document env + how to run)

## Milestone 7: Transport Cleanup (Remove Legacy Internal Paths)

### Dependencies

- Milestone 6

### Goal

Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS async, with legacy internal HTTP proxy paths removed or clearly debug-only.

### Exit Criteria

- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy paths are removed or explicitly debug-only and not referenced by Gateway/Control.
- End-to-end smoke tests pass (gated).

### Tasks

- [ ] Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
- [ ] Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
- [ ] Ensure Control UI + Control API rely only on standardized surfaces
- [ ] Harden metrics and readiness probes to match the standard contract everywhere

### Required Tests

- Workspace verification commands
- End-to-end smoke tests (gated)

## Workspace Verification Commands (Run for Every Milestone)

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)