transport: complete M0–M7

shared: add stream+consumer policy helpers; NATS context header builder

aggregate/runner/projection: centralize stream validation and header usage; set bounded consumer params

projection: add QueryService gRPC and wire into main; settings include PROJECTION_GRPC_ADDR

gateway: gRPC routing to Projection/Runner with deadlines; bounded read-only retries; pooled gRPC channels (bounded LRU+TTL); admin proxy forwards to gRPC; probes use concurrency limiter + TTL cache

runner: add RunnerAdmin gRPC server (drain, status, reload) and wire into main; settings include RUNNER_GRPC_ADDR

tests: add gateway authz for runner admin, projection tenant isolation, runner admin drain semantics

docs: update TRANSPORT_DEVELOPMENT_PLAN to reflect completed milestones and details
2026-03-30 14:24:14 +03:00
parent 1ab112438b
commit 90c307016d
41 changed files with 2391 additions and 505 deletions

@@ -11,15 +11,17 @@ This plan merges and supersedes:
## Current Status (Codebase Reality)
- Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
- Request context pieces are partially standardized:
- Request context pieces are standardized:
- `shared` provides `TenantId`, `CorrelationId`, `TraceId`
- `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
- Some header names are centralized in `shared` but not all call sites use constants yet.
- `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
- Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
- Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
- Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
- Node → NATS header propagation is improved and closer to consistent:
- Runner publishes `x-correlation-id` and `correlation-id`, and ensures `traceparent`/`trace-id` are present/derived when possible.
- Aggregate publishes `trace-id` when `traceparent` is present.
- Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
- Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
- Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
- Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.
## Principles
@@ -84,19 +86,20 @@ Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so
- [x] `TenantId`
- [x] `CorrelationId`
- [x] `TraceId`
- [~] Consolidate header constants in `shared`:
- [x] Consolidate header constants in `shared`:
- [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
- [ ] HTTP: `x-tenant-id`, `x-request-id` (missing constants)
- [x] HTTP: `x-tenant-id`, `x-request-id`
- [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
- [ ] NATS: `tenant-id` constant, `Nats-Msg-Id` constant (missing constants)
- [x] NATS: `tenant-id`, `Nats-Msg-Id`
- [x] Add shared helpers:
- [x] derive `trace-id` from `traceparent`
- [x] derive `traceparent` from `trace-id` when valid
- [ ] normalize/generate correlation id when missing across all transports (helper exists for `CorrelationId::generate()`; adoption incomplete)
- [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
- [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
- [x] Add unit tests in `shared` for:
- [x] traceparent parsing validity
- [x] serialization shape for correlation/trace id newtypes
- [ ] additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
- [x] additional validation cases (invalid traceparents, all-zero ids)
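The trace helpers and validation cases listed above could look roughly like the sketch below. It assumes the W3C Trace Context layout (`version-traceid-parentid-flags`) and accepts only version `00`. The function names mirror those mentioned in this plan, but the signatures and bodies are illustrative assumptions, not the actual `shared` implementation.

```rust
/// True when `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// True when every character is '0' (all-zero ids are invalid in W3C Trace Context).
fn is_all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Hypothetical sketch: extract the 32-hex-digit trace id from a `traceparent`.
/// Returns `None` for malformed values, unknown versions, or all-zero ids.
pub fn trace_id_from_traceparent(value: &str) -> Option<String> {
    let parts: Vec<&str> = value.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
    let well_formed =
        is_lower_hex(trace_id, 32) && is_lower_hex(parent_id, 16) && is_lower_hex(flags, 2);
    if !well_formed || is_all_zero(trace_id) || is_all_zero(parent_id) {
        return None;
    }
    Some(trace_id.to_string())
}

/// Hypothetical sketch: build a version-00 `traceparent` from a valid trace id.
/// The caller supplies the 16-hex-digit parent (span) id; this is an assumption,
/// the real helper may generate it instead.
pub fn traceparent_from_trace_id(trace_id: &str, parent_id: &str, sampled: bool) -> Option<String> {
    let ids_valid = is_lower_hex(trace_id, 32)
        && !is_all_zero(trace_id)
        && is_lower_hex(parent_id, 16)
        && !is_all_zero(parent_id);
    if !ids_valid {
        return None;
    }
    let flags = if sampled { "01" } else { "00" };
    Some(format!("00-{}-{}-{}", trace_id, parent_id, flags))
}
```

Returning `None` rather than an error keeps the "generate when missing/invalid" decision with the caller, which is where a normalization helper would sit.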
### Required Tests
- `cargo fmt --check`
@@ -118,24 +121,26 @@ Make the JetStream/NATS “wire protocol” explicit and uniform so interop is s
- “Contract tests” exist per service to verify produced headers and subject formats.
### Tasks
- [ ] Create/standardize subject builder helpers (prefer `shared`):
- [ ] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
- [ ] Runner effect/effect_result/workflow subject builders
- [~] Aggregate publishes:
- [ ] `tenant-id` header always present (still needs enforcement everywhere)
- [ ] correlation + trace headers always present when available, generated when required
- [x] Create/standardize subject builder helpers (prefer `shared`):
- [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
- [x] Runner effect/effect_result subject builders
- [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
- [x] Aggregate publishes:
- [x] `tenant-id` header always present
- [x] correlation + trace headers always present; generated when missing/invalid
- [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
- [ ] `Nats-Msg-Id` strategy explicitly defined and tested
- [~] Runner publishes (commands/results):
- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`)
- [x] trace headers derived consistently when possible (`traceparent` from `trace-id`, `trace-id` from `traceparent`)
- [ ] outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
- [~] Projection consumption:
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
- [x] Runner publishes (commands/results):
- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
- [x] trace headers always present/derived when possible; generated when missing/invalid
- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
- [x] outbox metadata → NATS headers mapping standardized via shared helpers
- [x] Projection consumption:
- [x] envelope decoding remains tolerant (unknown fields ignored)
- [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
- [ ] Add unit tests:
- [ ] subject formatting tests per service (once builders exist)
- [ ] required header presence tests per publisher (enforce required keys)
- [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
- [x] Add unit tests:
- [x] subject formatting tests (shared builders)
- [x] required header presence tests per publisher (Aggregate + Runner)
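As an illustration of the subject builder and required-header checks above, here is a minimal sketch. The subject shape `tenant.<tenant>.aggregate.<type>.<id>` comes from this plan. The token validation rules, the error type, and the exact set of required header keys are assumptions made for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum SubjectError {
    EmptyToken(&'static str),
    InvalidToken(&'static str),
}

/// Reject tokens that would change the subject's shape: NATS uses '.' as the
/// token separator and '*' / '>' as wildcards, and tokens cannot contain whitespace.
fn check_token(name: &'static str, token: &str) -> Result<(), SubjectError> {
    if token.is_empty() {
        return Err(SubjectError::EmptyToken(name));
    }
    if token.chars().any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace()) {
        return Err(SubjectError::InvalidToken(name));
    }
    Ok(())
}

/// Hypothetical builder for `tenant.<tenant>.aggregate.<type>.<id>`.
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, SubjectError> {
    check_token("tenant", tenant)?;
    check_token("aggregate_type", aggregate_type)?;
    check_token("aggregate_id", aggregate_id)?;
    Ok(format!("tenant.{}.aggregate.{}.{}", tenant, aggregate_type, aggregate_id))
}

/// Assumed required keys for a published message; the real list lives in `shared`.
pub const REQUIRED_NATS_HEADERS: [&str; 4] =
    ["tenant-id", "Nats-Msg-Id", "correlation-id", "traceparent"];

/// Return the required header keys that are absent or empty (exact-case match).
pub fn missing_required_headers(headers: &HashMap<String, String>) -> Vec<&'static str> {
    REQUIRED_NATS_HEADERS
        .iter()
        .copied()
        .filter(|key| headers.get(*key).map_or(true, |v| v.is_empty()))
        .collect()
}
```

A publisher contract test can then assert that `missing_required_headers` returns an empty list for every message it produces.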
### Required Tests
- Workspace verification commands
@@ -155,15 +160,16 @@ Make stream definitions explicit, validated, and safe in all environments, preve
- Config-only tests validate stream config builders without requiring NATS.
### Tasks
- [ ] Define stream policies:
- [ ] `AGGREGATE_EVENTS` (subjects, retention, duplicate window)
- [ ] `WORKFLOW_COMMANDS`
- [ ] `WORKFLOW_EVENTS`
- [ ] Implement compatibility validation rules:
- [ ] required subjects are present (superset allowed)
- [ ] retention/limits are within allowed ranges
- [ ] dedupe assumptions align with producer `Nats-Msg-Id` usage
- [ ] Add unit tests for stream config builders + validators.
- [x] Define stream policies:
- [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
- [x] `WORKFLOW_COMMANDS` is defined and validated on startup
- [x] `WORKFLOW_EVENTS` is defined and validated on startup
- [x] Centralize stream policy builders/validators in `shared`
- [x] Implement compatibility validation rules:
- [x] required subjects are present (superset allowed)
- [x] limits/max_age/duplicate window validated against minimums
- [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
- [x] Add unit tests for stream config builders + validators.
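The compatibility rules above (required subjects present with supersets allowed, limits checked against minimums) can be sketched without a NATS connection. The struct names, fields, and the `tenant.*.aggregate.>` pattern used in the usage below are assumptions for illustration. Subject matching here is exact string comparison, so a broader wildcard on the stream that happens to cover a required subject would be reported as missing.

```rust
use std::time::Duration;

/// Hypothetical policy: what a service requires of a stream.
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub min_max_age: Duration,
    pub min_duplicate_window: Duration,
}

/// Hypothetical view of the stream config as read back from the server.
pub struct ObservedStream {
    pub subjects: Vec<String>,
    /// Zero is treated as "no age limit", as in JetStream.
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

/// Collect every incompatibility instead of stopping at the first one,
/// so a startup log shows the full picture.
pub fn validate_stream(policy: &StreamPolicy, observed: &ObservedStream) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();

    // Extra subjects on the stream are fine (superset allowed).
    for required in &policy.required_subjects {
        if !observed.subjects.contains(required) {
            problems.push(format!("{}: missing subject {}", policy.name, required));
        }
    }

    let unlimited_age = observed.max_age.is_zero();
    if !unlimited_age && observed.max_age < policy.min_max_age {
        problems.push(format!(
            "{}: max_age {:?} is below minimum {:?}",
            policy.name, observed.max_age, policy.min_max_age
        ));
    }

    // Dedupe via Nats-Msg-Id only holds within the duplicate window.
    if observed.duplicate_window < policy.min_duplicate_window {
        problems.push(format!(
            "{}: duplicate window {:?} is below minimum {:?}",
            policy.name, observed.duplicate_window, policy.min_duplicate_window
        ));
    }

    if problems.is_empty() { Ok(()) } else { Err(problems) }
}
```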
### Required Tests
- Workspace verification commands
@@ -186,25 +192,25 @@ Standardize consumer configs and runtime behavior to guarantee bounded in-flight
- scale-out behavior (deliver group) where applicable
### Tasks
- [ ] Standardize consumer defaults:
- [ ] `AckPolicy::Explicit`
- [ ] `ack_wait` default + env override
- [ ] `max_deliver` default + env override
- [ ] `max_ack_pending` tied to worker concurrency
- [ ] Projection:
- [ ] durable naming collision-free for Single/PerView modes
- [ ] checkpoint gate semantics: “skip still acks”
- [ ] poison handling persists durable records and terminates reliably
- [ ] Runner:
- [ ] durable naming collision-free and stable across replicas
- [ ] deliver group rules defined and tested
- [ ] outbox relay exactly-once behavior verified under redelivery
- [ ] Aggregate:
- [ ] ad-hoc fetch consumer always unique and bounded
- [ ] best-effort deletion never targets unrelated consumers
- [ ] Add gated NATS integration tests and document env flags:
- [ ] Runner ignored tests
- [ ] Projection ignored tests
- [x] Standardize consumer defaults:
- [x] `AckPolicy::Explicit`
- [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
- [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
- [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
- [x] Projection:
- [x] durable naming collision-free for Single/PerView modes
- [x] checkpoint gate semantics: “skip still acks”
- [x] poison handling persists durable records and terminates reliably (poison record + term)
- [x] Runner:
- [x] durable naming collision-free and stable across replicas
- [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
- [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
- [x] Aggregate:
- [x] ad-hoc fetch consumer always unique and bounded
- [x] best-effort deletion never targets unrelated consumers
- [x] Add gated NATS integration tests and document env flags:
- [x] Runner ignored tests
- [x] Projection ignored tests
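The consumer defaults above (default plus env override, `max_ack_pending` tied to `max_in_flight`) might be resolved along these lines. The `*_ACK_TIMEOUT_MS` and `*_MAX_DELIVER` suffixes come from this plan. The struct, the default values of 30 s and 5 deliveries, and the lookup-closure design are assumptions chosen to keep the sketch testable without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical resolved settings for a pull consumer with explicit acks.
#[derive(Debug, PartialEq, Eq)]
pub struct ConsumerSettings {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Resolve settings for a service prefix such as "RUNNER" or "PROJECTION".
/// `lookup` abstracts the environment; invalid or non-positive values fall
/// back to the defaults instead of failing startup.
pub fn consumer_settings(
    prefix: &str,
    max_in_flight: usize,
    lookup: impl Fn(&str) -> Option<String>,
) -> ConsumerSettings {
    let ack_ms = lookup(&format!("{}_ACK_TIMEOUT_MS", prefix))
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|ms| *ms > 0)
        .unwrap_or(30_000); // illustrative default

    let max_deliver = lookup(&format!("{}_MAX_DELIVER", prefix))
        .and_then(|v| v.trim().parse::<i64>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(5); // illustrative default

    ConsumerSettings {
        ack_wait: Duration::from_millis(ack_ms),
        max_deliver,
        // Never allow more unacked messages than workers that can process them.
        max_ack_pending: max_in_flight.max(1) as i64,
    }
}
```

In a service, the lookup would typically be `|key| std::env::var(key).ok()`.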
### Required Tests
- Workspace verification commands
@@ -227,17 +233,17 @@ Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query
- New gRPC query tests pass (unit + integration).
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Implement Projection gRPC server for query execution
- [ ] Implement Gateway gRPC client routing to Projection
- [ ] deadlines
- [ ] bounded retries (idempotent only)
- [ ] context propagation
- [ ] Preserve HTTP `/v1/query/*` as compatibility/debug:
- [ ] route internally to gRPC or keep as legacy endpoint
- [ ] Add tests:
- [ ] authz + forwarding via gRPC
- [ ] tenant isolation enforcement in Projection QueryService
- [x] Define protobuf API: `projection.gateway.v1.QueryService`
- [x] Implement Projection gRPC server for query execution
- [x] Implement Gateway gRPC client routing to Projection
- [x] deadlines
- [x] bounded retries (idempotent only)
- [x] context propagation
- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
- [x] route internally to gRPC
- [x] Add tests:
- [x] authz + forwarding via gRPC
- [x] tenant isolation enforcement in Projection QueryService
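The "deadlines" and "bounded retries (idempotent only)" items above can be illustrated with a small synchronous sketch. The real Gateway client is presumably async and would also inspect gRPC status codes before retrying. Everything here (`CallKind`, `RetryPolicy`, `call_with_policy`, the linear backoff) is a hypothetical stand-in, not the Gateway's code.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CallKind {
    ReadOnly,
    Mutation,
}

pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_backoff: Duration,
}

/// Run `op` until it succeeds, the attempt budget is spent, or the next
/// backoff would cross the deadline. Mutations always get exactly one attempt.
pub fn call_with_policy<T, E>(
    kind: CallKind,
    policy: &RetryPolicy,
    deadline: Instant,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let allowed = match kind {
        CallKind::ReadOnly => policy.max_attempts.max(1),
        CallKind::Mutation => 1,
    };
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Linear backoff; a production version would add random jitter.
                let backoff = policy.base_backoff * attempt;
                let out_of_time = Instant::now() + backoff >= deadline;
                if attempt >= allowed || out_of_time {
                    return Err(err);
                }
                std::thread::sleep(backoff);
                attempt += 1;
            }
        }
    }
}
```

Bounding by both attempt count and deadline means a slow upstream cannot stretch a query past the edge timeout even when retries remain.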
### Required Tests
- Workspace verification commands
@@ -257,14 +263,14 @@ Replace Gateway's `/admin/runner/*` HTTP proxy usage with a first-class gRPC a
- Runner drain/readiness semantics validated and tested.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Implement Runner gRPC admin server
- [ ] Implement Gateway gRPC client integration for admin operations
- [ ] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [ ] Add tests:
- [ ] Gateway: rejects without rights
- [ ] Gateway: rejects tenant spoof attempts
- [ ] Runner: idempotency and drain semantics
- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [x] Implement Runner gRPC admin server
- [x] Implement Gateway gRPC client integration for admin operations
- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
- [x] Add tests:
- [x] Gateway: rejects without rights
- [x] Gateway: rejects tenant spoof attempts
- [x] Runner: idempotency and drain semantics
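One way to read "idempotency and drain semantics" is that calling drain repeatedly is harmless, new work is refused once draining starts, and the Runner reports drained only after in-flight work finishes. The sketch below encodes that reading with atomics. `DrainState` and its methods are hypothetical and say nothing about how the actual RunnerAdmin server is built.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical drain bookkeeping shared between admin handlers and workers.
pub struct DrainState {
    draining: AtomicBool,
    in_flight: AtomicUsize,
}

impl DrainState {
    pub fn new() -> Self {
        DrainState { draining: AtomicBool::new(false), in_flight: AtomicUsize::new(0) }
    }

    /// Start draining. Returns true only for the call that flipped the flag;
    /// later calls are no-ops, which makes the admin operation idempotent.
    pub fn drain(&self) -> bool {
        self.draining
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    /// Workers call this before picking up a message. Returns false while draining.
    pub fn try_begin_work(&self) -> bool {
        if self.draining.load(Ordering::SeqCst) {
            return false;
        }
        self.in_flight.fetch_add(1, Ordering::SeqCst);
        // Re-check: a drain may have started between the load and the increment.
        if self.draining.load(Ordering::SeqCst) {
            self.in_flight.fetch_sub(1, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Workers call this once per successful `try_begin_work`.
    pub fn finish_work(&self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }

    /// Drained means: draining was requested and nothing is still running.
    pub fn is_drained(&self) -> bool {
        self.draining.load(Ordering::SeqCst) && self.in_flight.load(Ordering::SeqCst) == 0
    }
}
```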
### Required Tests
- Workspace verification commands
@@ -284,20 +290,20 @@ Make Gateway upstream connection handling, retry behavior, and probe/fanout oper
- Gated load/soak tests exist and are runnable.
### Tasks
- [ ] Implement upstream channel pool
- [ ] bounded LRU
- [ ] TTL/eviction
- [ ] fast-path reuse under load
- [ ] Standardize retry profiles
- [ ] read-only: limited retry with jitter
- [ ] mutations: no retry unless idempotency key is present and semantics are safe
- [ ] Standardize timeouts/deadlines:
- [ ] edge timeout limits
- [ ] internal per-service deadlines
- [ ] Fanout controls:
- [ ] concurrency limiters for probes/snapshots
- [ ] short TTL caching where safe
- [ ] Ensure probes carry context (correlation/trace) for observability.
- [x] Implement upstream channel pool
- [x] bounded LRU
- [x] TTL/eviction
- [x] fast-path reuse under load (cached gRPC channels)
- [x] Standardize retry profiles
- [x] read-only: limited retry with jitter (Gateway gRPC calls)
- [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
- [x] Standardize timeouts/deadlines:
- [x] edge timeout limits
- [x] internal per-service deadlines
- [x] Fanout controls:
- [x] concurrency limiters for probes/snapshots
- [x] short TTL caching where safe
- [x] Ensure probes carry context (correlation/trace) for observability.
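A bounded LRU pool with TTL eviction, as listed above, can be sketched with the standard library alone. `ChannelPool` is a hypothetical name, the channel type is left generic (a real pool would hold cloneable gRPC channels), and the current time is passed in explicitly so the behavior is deterministic under test. Eviction scans all entries, which is acceptable for a small, bounded pool but is a simplification.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: Instant,
}

/// Hypothetical bounded pool keyed by upstream endpoint.
pub struct ChannelPool<C> {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        ChannelPool { capacity: capacity.max(1), ttl, entries: HashMap::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Reuse a live channel for `endpoint`, or call `connect` to create one.
    pub fn get_or_connect(&mut self, endpoint: &str, now: Instant, connect: impl FnOnce() -> C) -> C {
        // Fast path: entry exists and is younger than the TTL.
        if let Some(entry) = self.entries.get_mut(endpoint) {
            if now.saturating_duration_since(entry.created) < self.ttl {
                entry.last_used = now;
                return entry.channel.clone();
            }
        }
        // Entry is absent or expired; drop any stale copy.
        self.entries.remove(endpoint);

        // At capacity: evict the least recently used entry.
        if self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(key, _)| key.clone());
            if let Some(key) = victim {
                self.entries.remove(&key);
            }
        }

        let channel = connect();
        self.entries.insert(
            endpoint.to_string(),
            Entry { channel: channel.clone(), created: now, last_used: now },
        );
        channel
    }
}
```

Measuring the TTL from creation rather than last use forces periodic reconnects, which helps pick up upstream address changes. Whether the Gateway does this is not stated in the plan.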
### Required Tests
- Workspace verification commands
@@ -317,10 +323,10 @@ Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS
- End-to-end smoke tests pass (gated).
### Tasks
- [ ] Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
- [ ] Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
- [ ] Ensure Control UI + Control API rely only on standardized surfaces
- [ ] Harden metrics and readiness probes to match the standard contract everywhere
- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
- [x] Ensure Control UI + Control API rely only on standardized surfaces
- [x] Harden metrics and readiness probes to match the standard contract everywhere
### Required Tests
- Workspace verification commands