transport: complete M0–M7
shared: add stream+consumer policy helpers; NATS context header builder
aggregate/runner/projection: centralize stream validation and header usage; set bounded consumer params
projection: add QueryService gRPC and wire into main; settings include PROJECTION_GRPC_ADDR
gateway: gRPC routing to Projection/Runner with deadlines; bounded read-only retries; pooled gRPC channels (bounded LRU+TTL); admin proxy forwards to gRPC; probes use concurrency limiter + TTL cache
runner: add RunnerAdmin gRPC server (drain, status, reload) and wire into main; settings include RUNNER_GRPC_ADDR
tests: add gateway authz for runner admin, projection tenant isolation, runner admin drain semantics
docs: update TRANSPORT_DEVELOPMENT_PLAN to reflect completed milestones and details
@@ -11,15 +11,17 @@ This plan merges and supersedes:
 ## Current Status (Codebase Reality)
 - Monorepo workspace exists; `shared` crate exists and is used by Aggregate/Projection/Runner/Gateway.
-- Request context pieces are partially standardized:
+- Request context pieces are standardized:
 - `shared` provides `TenantId`, `CorrelationId`, `TraceId`
 - `shared` provides `trace_id_from_traceparent(...)` and `traceparent_from_trace_id(...)`
-- Some header names are centralized in `shared` but not all call sites use constants yet.
+- `shared` provides canonical header constants (HTTP + NATS) and trace/correlation normalization helpers
+- Most call sites now use `shared` constants/helpers; remaining gaps should be treated as Milestone-gated
 - Gateway → Aggregate is already HTTP(edge) → gRPC(internal) and propagates `x-tenant-id`, `x-correlation-id`, and `traceparent`.
 - Gateway → Projection remains HTTP proxy (`/v1/query/...`) and Gateway → Runner remains HTTP admin proxy (`/admin/runner/...`).
 - Node → NATS header propagation is improved and closer to consistent:
-- Runner publishes `x-correlation-id` and `correlation-id`, and ensures `traceparent`/`trace-id` are present/derived when possible.
-- Aggregate publishes `trace-id` when `traceparent` is present.
+- Runner publishes required headers for effect commands/results (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
+- Aggregate publishes required headers for events (`tenant-id`, `Nats-Msg-Id`, correlation, traceparent/trace-id), generating when missing.
+- Projection hydrates correlation/trace context from NATS headers when the JSON envelope omits them.
 - Many “hard” NATS tests already exist but are gated/ignored by default; they should be treated as milestone gates when enabling changes.

 ## Principles
@@ -84,19 +86,20 @@ Make propagation rules consistent and enforceable across HTTP, gRPC, and NATS so
 - [x] `TenantId`
 - [x] `CorrelationId`
 - [x] `TraceId`
-- [~] Consolidate header constants in `shared`:
+- [x] Consolidate header constants in `shared`:
 - [x] HTTP: `x-correlation-id`, `traceparent`, `trace-id` (for NATS/interop)
-- [ ] HTTP: `x-tenant-id`, `x-request-id` (missing constants)
+- [x] HTTP: `x-tenant-id`, `x-request-id`
 - [x] NATS: `correlation-id` (used in Runner), `trace-id` (now emitted where possible)
-- [ ] NATS: `tenant-id` constant, `Nats-Msg-Id` constant (missing constants)
+- [x] NATS: `tenant-id`, `Nats-Msg-Id`
 - [x] Add shared helpers:
 - [x] derive `trace-id` from `traceparent`
 - [x] derive `traceparent` from `trace-id` when valid
-- [ ] normalize/generate correlation id when missing across all transports (helper exists for `CorrelationId::generate()`; adoption incomplete)
+- [x] normalize/generate correlation id when missing (`normalize_correlation_id(...)`)
+- [x] normalize/generate traceparent when missing/invalid (`normalize_traceparent(...)`)
 - [x] Add unit tests in `shared` for:
 - [x] traceparent parsing validity
 - [x] serialization shape for correlation/trace id newtypes
-- [ ] additional validation cases (invalid traceparents, invalid trace-id lengths) if needed for stricter enforcement
+- [x] additional validation cases (invalid traceparents, all-zero ids)

 ### Required Tests
 - `cargo fmt --check`
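The `trace-id` / `traceparent` derivation helpers ticked off in this hunk are not shown in the diff. As a rough sketch of the rules they imply, assuming the W3C Trace Context layout (`version-traceid-parentid-flags`) and hypothetical `String`-based signatures (the real `shared` helpers may use the `TraceId` newtype and may generate the parent id themselves):

```rust
/// True when `s` is exactly `len` lowercase hex characters and not all zeros.
fn is_valid_hex_id(s: &str, len: usize) -> bool {
    s.len() == len
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        && s.bytes().any(|b| b != b'0')
}

/// Extracts the trace id from a version-00 `traceparent` value.
/// Assumed signature: the real helper may return a `TraceId` instead of `String`.
pub fn trace_id_from_traceparent(traceparent: &str) -> Option<String> {
    let parts: Vec<&str> = traceparent.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let flags_ok = parts[3].len() == 2
        && parts[3].bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if is_valid_hex_id(parts[1], 32) && is_valid_hex_id(parts[2], 16) && flags_ok {
        Some(parts[1].to_string())
    } else {
        None
    }
}

/// Builds a `traceparent` from a valid trace id and a caller-supplied parent id.
/// Assumption: the parent id is passed in and the flags are fixed to "01" (sampled).
pub fn traceparent_from_trace_id(trace_id: &str, parent_id: &str) -> Option<String> {
    if is_valid_hex_id(trace_id, 32) && is_valid_hex_id(parent_id, 16) {
        Some(format!("00-{}-{}-01", trace_id, parent_id))
    } else {
        None
    }
}
```

Rejecting all-zero ids mirrors the "all-zero ids" validation case listed above; anything that fails validation returns `None` so a caller can fall back to generating fresh context.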
@@ -118,24 +121,26 @@ Make the JetStream/NATS “wire protocol” explicit and uniform so interop is s
 - “Contract tests” exist per service to verify produced headers and subject formats.

 ### Tasks
-- [ ] Create/standardize subject builder helpers (prefer `shared`):
-- [ ] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
-- [ ] Runner effect/effect_result/workflow subject builders
-- [~] Aggregate publishes:
-- [ ] `tenant-id` header always present (still needs enforcement everywhere)
-- [ ] correlation + trace headers always present when available, generated when required
+- [x] Create/standardize subject builder helpers (prefer `shared`):
+- [x] Aggregate event subject builder (`tenant.<tenant>.aggregate.<type>.<id>`)
+- [x] Runner effect/effect_result subject builders
+- [x] Runner workflow/workflow_event subject builders (helpers exist; concrete publishers/consumers are future work)
+- [x] Aggregate publishes:
+- [x] `tenant-id` header always present
+- [x] correlation + trace headers always present; generated when missing/invalid
 - [x] `trace-id` is derived when `traceparent` is present (now emitted in publish path)
-- [ ] `Nats-Msg-Id` strategy explicitly defined and tested
-- [~] Runner publishes (commands/results):
-- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`)
-- [x] trace headers derived consistently when possible (`traceparent` from `trace-id`, `trace-id` from `traceparent`)
-- [ ] outbox metadata → NATS headers mapping standardized via shared helpers (adoption incomplete)
-- [~] Projection consumption:
+- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Aggregate events use `event_id`)
+- [x] Runner publishes (commands/results):
+- [x] correlation headers emitted consistently (`x-correlation-id` + `correlation-id`) and generated when missing
+- [x] trace headers always present/derived when possible; generated when missing/invalid
+- [x] `Nats-Msg-Id` strategy explicitly defined and tested (Runner commands/results use `command_id`)
+- [x] outbox metadata → NATS headers mapping standardized via shared helpers
+- [x] Projection consumption:
 - [x] envelope decoding remains tolerant (unknown fields ignored)
-- [~] correlation/trace context flows into spans/metrics consistently (types are shared; header extraction remains best-effort and should be unified)
-- [ ] Add unit tests:
-- [ ] subject formatting tests per service (once builders exist)
-- [ ] required header presence tests per publisher (enforce required keys)
+- [x] correlation/trace context flows into spans/metrics consistently (envelope + NATS header fallback)
+- [x] Add unit tests:
+- [x] subject formatting tests (shared builders)
+- [x] required header presence tests per publisher (Aggregate + Runner)

 ### Required Tests
 - Workspace verification commands
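The subject format and required headers listed in this hunk could be centralized along the following lines. This is a minimal sketch only: the function names, the token validation rules, and the use of a plain `BTreeMap` in place of a NATS client header type are assumptions, and only the single `correlation-id` key is emitted here for brevity.

```rust
use std::collections::BTreeMap;

/// A subject token must be non-empty and free of NATS separators, wildcards and whitespace.
fn is_valid_token(token: &str) -> bool {
    !token.is_empty()
        && !token
            .chars()
            .any(|c| c == '.' || c == '*' || c == '>' || c.is_whitespace())
}

/// Builds `tenant.<tenant>.aggregate.<type>.<id>` (hypothetical helper name).
pub fn aggregate_event_subject(
    tenant: &str,
    aggregate_type: &str,
    aggregate_id: &str,
) -> Result<String, String> {
    for (name, token) in [("tenant", tenant), ("type", aggregate_type), ("id", aggregate_id)] {
        if !is_valid_token(token) {
            return Err(format!("invalid {} token: {:?}", name, token));
        }
    }
    Ok(format!(
        "tenant.{}.aggregate.{}.{}",
        tenant, aggregate_type, aggregate_id
    ))
}

/// Collects the required headers for an Aggregate event.
/// `Nats-Msg-Id` is set to the event id, matching the dedupe strategy named in the plan.
pub fn aggregate_event_headers(
    tenant: &str,
    event_id: &str,
    correlation_id: &str,
    traceparent: &str,
    trace_id: &str,
) -> BTreeMap<&'static str, String> {
    BTreeMap::from([
        ("tenant-id", tenant.to_string()),
        ("Nats-Msg-Id", event_id.to_string()),
        ("correlation-id", correlation_id.to_string()),
        ("traceparent", traceparent.to_string()),
        ("trace-id", trace_id.to_string()),
    ])
}
```

Validating tokens before formatting keeps a tenant or id containing `.` or a wildcard from silently changing the subject hierarchy.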
@@ -155,15 +160,16 @@ Make stream definitions explicit, validated, and safe in all environments, preve
 - Config-only tests validate stream config builders without requiring NATS.

 ### Tasks
-- [ ] Define stream policies:
-- [ ] `AGGREGATE_EVENTS` (subjects, retention, duplicate window)
-- [ ] `WORKFLOW_COMMANDS`
-- [ ] `WORKFLOW_EVENTS`
-- [ ] Implement compatibility validation rules:
-- [ ] required subjects are present (superset allowed)
-- [ ] retention/limits are within allowed ranges
-- [ ] dedupe assumptions align with producer `Nats-Msg-Id` usage
-- [ ] Add unit tests for stream config builders + validators.
+- [x] Define stream policies:
+- [x] `AGGREGATE_EVENTS` (subjects, limits, duplicate window) is defined and validated on startup
+- [x] `WORKFLOW_COMMANDS` is defined and validated on startup
+- [x] `WORKFLOW_EVENTS` is defined and validated on startup
+- [x] Centralize stream policy builders/validators in `shared`
+- [x] Implement compatibility validation rules:
+- [x] required subjects are present (superset allowed)
+- [x] limits/max_age/duplicate window validated against minimums
+- [x] dedupe assumptions align with producer `Nats-Msg-Id` usage (duplicate window + msg-id strategy)
+- [x] Add unit tests for stream config builders + validators.

 ### Required Tests
 - Workspace verification commands
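The compatibility rules in this hunk (required subjects present with supersets allowed, `max_age` and duplicate window checked against minimums) can be illustrated without a NATS connection. The sketch below uses hypothetical `StreamPolicy` and `ObservedStream` types rather than any client library's stream config, and compares subjects by exact string match only:

```rust
use std::time::Duration;

/// What a service requires of a stream (hypothetical type, not the `shared` API).
#[derive(Debug, Clone)]
pub struct StreamPolicy {
    pub name: &'static str,
    pub required_subjects: Vec<String>,
    pub min_max_age: Duration,
    pub min_duplicate_window: Duration,
}

/// The configuration actually found on the server (hypothetical type).
#[derive(Debug, Clone)]
pub struct ObservedStream {
    pub subjects: Vec<String>,
    pub max_age: Duration,
    pub duplicate_window: Duration,
}

/// Returns every incompatibility found, so startup can report them all at once.
pub fn validate_stream(
    policy: &StreamPolicy,
    observed: &ObservedStream,
) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    // Superset allowed: extra observed subjects are fine, missing required ones are not.
    for subject in &policy.required_subjects {
        if !observed.subjects.contains(subject) {
            problems.push(format!("{}: missing required subject {}", policy.name, subject));
        }
    }
    // In JetStream a max_age of zero means "unlimited", which satisfies any minimum.
    if !observed.max_age.is_zero() && observed.max_age < policy.min_max_age {
        problems.push(format!("{}: max_age below minimum", policy.name));
    }
    // The duplicate window must cover the period in which producers may resend a Nats-Msg-Id.
    if observed.duplicate_window < policy.min_duplicate_window {
        problems.push(format!("{}: duplicate window below minimum", policy.name));
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

Collecting all problems instead of failing on the first makes a misconfigured environment easier to fix in one pass.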
@@ -186,25 +192,25 @@ Standardize consumer configs and runtime behavior to guarantee bounded in-flight
 - scale-out behavior (deliver group) where applicable

 ### Tasks
-- [ ] Standardize consumer defaults:
-- [ ] `AckPolicy::Explicit`
-- [ ] `ack_wait` default + env override
-- [ ] `max_deliver` default + env override
-- [ ] `max_ack_pending` tied to worker concurrency
-- [ ] Projection:
-- [ ] durable naming collision-free for Single/PerView modes
-- [ ] checkpoint gate semantics: “skip still acks”
-- [ ] poison handling persists durable records and terminates reliably
-- [ ] Runner:
-- [ ] durable naming collision-free and stable across replicas
-- [ ] deliver group rules defined and tested
-- [ ] outbox relay exactly-once behavior verified under redelivery
-- [ ] Aggregate:
-- [ ] ad-hoc fetch consumer always unique and bounded
-- [ ] best-effort deletion never targets unrelated consumers
-- [ ] Add gated NATS integration tests and document env flags:
-- [ ] Runner ignored tests
-- [ ] Projection ignored tests
+- [x] Standardize consumer defaults:
+- [x] `AckPolicy::Explicit`
+- [x] `ack_wait` default + env override (Runner/Projection: `*_ACK_TIMEOUT_MS`)
+- [x] `max_deliver` default + env override (Runner/Projection: `*_MAX_DELIVER`)
+- [x] `max_ack_pending` tied to worker concurrency (Runner/Projection: `max_in_flight`)
+- [x] Projection:
+- [x] durable naming collision-free for Single/PerView modes
+- [x] checkpoint gate semantics: “skip still acks”
+- [x] poison handling persists durable records and terminates reliably (poison record + term)
+- [x] Runner:
+- [x] durable naming collision-free and stable across replicas
+- [x] deliver group rules defined (pull consumers; `deliver_group` is rejected if configured)
+- [x] outbox relay exactly-once behavior verified under redelivery (unit tests exist; gated NATS e2e tests remain ignored-by-default)
+- [x] Aggregate:
+- [x] ad-hoc fetch consumer always unique and bounded
+- [x] best-effort deletion never targets unrelated consumers
+- [x] Add gated NATS integration tests and document env flags:
+- [x] Runner ignored tests
+- [x] Projection ignored tests

 ### Required Tests
 - Workspace verification commands
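To make the consumer defaults in this hunk concrete, here is a small sketch of how `ack_wait`, `max_deliver`, and `max_ack_pending` might be resolved from environment overrides and worker concurrency. The `ConsumerPolicy` type, the `RUNNER`/`PROJECTION` prefixes, and the default values (30 000 ms, 5 deliveries) are illustrative assumptions, not values taken from the codebase:

```rust
use std::time::Duration;

/// Resolved bounded-consumer parameters (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct ConsumerPolicy {
    pub ack_wait: Duration,
    pub max_deliver: i64,
    pub max_ack_pending: i64,
}

/// Parses an optional raw value, falling back to `default` when unset or unparsable.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Builds the policy for a service prefix such as "RUNNER" or "PROJECTION".
/// `lookup` abstracts the environment so the function can be tested without real env vars.
pub fn consumer_policy(
    prefix: &str,
    max_in_flight: usize,
    lookup: impl Fn(&str) -> Option<String>,
) -> ConsumerPolicy {
    // Assumed defaults; the real services may use different values.
    let ack_ms: u64 = parse_or(lookup(&format!("{}_ACK_TIMEOUT_MS", prefix)), 30_000);
    let max_deliver: i64 = parse_or(lookup(&format!("{}_MAX_DELIVER", prefix)), 5);
    ConsumerPolicy {
        ack_wait: Duration::from_millis(ack_ms.max(1)),
        max_deliver: max_deliver.max(1),
        // Tie in-flight messages to worker concurrency so the consumer stays bounded.
        max_ack_pending: max_in_flight.max(1) as i64,
    }
}
```

In a service this could be called as `consumer_policy("RUNNER", 16, |k| std::env::var(k).ok())`; unparsable overrides fall back to the default rather than aborting startup, which is a design choice a stricter implementation might reverse.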
@@ -227,17 +233,17 @@ Replace Gateway → Projection HTTP proxy as the default path with a gRPC Query
 - New gRPC query tests pass (unit + integration).

 ### Tasks
-- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
-- [ ] Implement Projection gRPC server for query execution
-- [ ] Implement Gateway gRPC client routing to Projection
-- [ ] deadlines
-- [ ] bounded retries (idempotent only)
-- [ ] context propagation
-- [ ] Preserve HTTP `/v1/query/*` as compatibility/debug:
-- [ ] route internally to gRPC or keep as legacy endpoint
-- [ ] Add tests:
-- [ ] authz + forwarding via gRPC
-- [ ] tenant isolation enforcement in Projection QueryService
+- [x] Define protobuf API: `projection.gateway.v1.QueryService`
+- [x] Implement Projection gRPC server for query execution
+- [x] Implement Gateway gRPC client routing to Projection
+- [x] deadlines
+- [x] bounded retries (idempotent only)
+- [x] context propagation
+- [x] Preserve HTTP `/v1/query/*` as compatibility/debug:
+- [x] route internally to gRPC
+- [x] Add tests:
+- [x] authz + forwarding via gRPC
+- [x] tenant isolation enforcement in Projection QueryService

 ### Required Tests
 - Workspace verification commands
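The "deadlines" and "bounded retries (idempotent only)" items in this hunk describe a retry loop that never retries mutations and never sleeps past the call deadline. The following is a blocking, std-only sketch of that shape; the actual Gateway code is presumably async, and every name here (`CallKind`, `RetryPolicy`, `call_with_retry`) is hypothetical. For simplicity it treats every error as retryable, whereas a real client would likely retry only transient failures.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CallKind {
    ReadOnly,
    Mutation,
}

pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

/// Exponential backoff capped at `max_delay`, scaled by a jitter factor in [0, 1].
pub fn backoff_delay(policy: &RetryPolicy, retry_index: u32, jitter: f64) -> Duration {
    let exp = policy
        .base_delay
        .saturating_mul(2u32.saturating_pow(retry_index));
    let capped = exp.min(policy.max_delay);
    let j = if jitter.is_finite() { jitter.clamp(0.0, 1.0) } else { 0.0 };
    Duration::from_nanos((capped.as_nanos() as f64 * j) as u64)
}

/// Runs `call`, retrying read-only calls until attempts or the deadline run out.
/// `jitter` supplies a value in [0, 1]; pass a random source in real use.
pub fn call_with_retry<T, E>(
    kind: CallKind,
    policy: &RetryPolicy,
    deadline: Instant,
    mut jitter: impl FnMut() -> f64,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    // Mutations get exactly one attempt; only read-only calls are retried.
    let attempts = match kind {
        CallKind::ReadOnly => policy.max_attempts.max(1),
        CallKind::Mutation => 1,
    };
    let mut attempt = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                let delay = backoff_delay(policy, attempt - 1, jitter());
                // Give up when attempts are exhausted or the sleep would cross the deadline.
                if attempt >= attempts || Instant::now() + delay >= deadline {
                    return Err(err);
                }
                std::thread::sleep(delay);
            }
        }
    }
}
```

Passing the jitter source as a closure keeps the sketch free of external crates and makes the backoff deterministic under test.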
@@ -257,14 +263,14 @@ Replace Gateway’s `/admin/runner/*` HTTP proxy usage with a first-class gRPC a
 - Runner drain/readiness semantics validated and tested.

 ### Tasks
-- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
-- [ ] Implement Runner gRPC admin server
-- [ ] Implement Gateway gRPC client integration for admin operations
-- [ ] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
-- [ ] Add tests:
-- [ ] Gateway: rejects without rights
-- [ ] Gateway: rejects tenant spoof attempts
-- [ ] Runner: idempotency and drain semantics
+- [x] Define protobuf API: `runner.admin.v1.RunnerAdmin`
+- [x] Implement Runner gRPC admin server
+- [x] Implement Gateway gRPC client integration for admin operations
+- [x] Keep Runner HTTP admin endpoints optional for direct debugging, not required by Gateway
+- [x] Add tests:
+- [x] Gateway: rejects without rights
+- [x] Gateway: rejects tenant spoof attempts
+- [x] Runner: idempotency and drain semantics

 ### Required Tests
 - Workspace verification commands
@@ -284,20 +290,20 @@ Make Gateway upstream connection handling, retry behavior, and probe/fanout oper
 - Gated load/soak tests exist and are runnable.

 ### Tasks
-- [ ] Implement upstream channel pool
-- [ ] bounded LRU
-- [ ] TTL/eviction
-- [ ] fast-path reuse under load
-- [ ] Standardize retry profiles
-- [ ] read-only: limited retry with jitter
-- [ ] mutations: no retry unless idempotency key is present and semantics are safe
-- [ ] Standardize timeouts/deadlines:
-- [ ] edge timeout limits
-- [ ] internal per-service deadlines
-- [ ] Fanout controls:
-- [ ] concurrency limiters for probes/snapshots
-- [ ] short TTL caching where safe
-- [ ] Ensure probes carry context (correlation/trace) for observability.
+- [x] Implement upstream channel pool
+- [x] bounded LRU
+- [x] TTL/eviction
+- [x] fast-path reuse under load (cached gRPC channels)
+- [x] Standardize retry profiles
+- [x] read-only: limited retry with jitter (Gateway gRPC calls)
+- [x] mutations: no retry unless idempotency key is present and semantics are safe (Gateway does not retry mutations)
+- [x] Standardize timeouts/deadlines:
+- [x] edge timeout limits
+- [x] internal per-service deadlines
+- [x] Fanout controls:
+- [x] concurrency limiters for probes/snapshots
+- [x] short TTL caching where safe
+- [x] Ensure probes carry context (correlation/trace) for observability.

 ### Required Tests
 - Workspace verification commands
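The upstream channel pool in this hunk combines a bounded LRU with TTL eviction. A minimal, std-only sketch of that combination follows; `ChannelPool` and its methods are hypothetical names, the value type `V` stands in for a cheaply clonable gRPC channel handle, and the TTL is measured from creation time (an assumption; an idle-based TTL is equally plausible).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<V> {
    value: V,
    created: Instant,
    last_used: u64,
}

/// Bounded pool keyed by upstream address, with LRU eviction and a creation-time TTL.
pub struct ChannelPool<V> {
    capacity: usize,
    ttl: Duration,
    tick: u64,
    entries: HashMap<String, Entry<V>>,
}

impl<V: Clone> ChannelPool<V> {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity: capacity.max(1), ttl, tick: 0, entries: HashMap::new() }
    }

    /// Returns a cached value for `addr`, or calls `connect` and caches the result.
    /// `now` is passed in so expiry can be tested without sleeping.
    pub fn get_or_connect(&mut self, addr: &str, now: Instant, connect: impl FnOnce() -> V) -> V {
        self.tick += 1;
        let (tick, ttl) = (self.tick, self.ttl);
        // Drop every entry whose TTL has elapsed.
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.created) < ttl);
        // Fast path: reuse the cached value and mark it as most recently used.
        if let Some(entry) = self.entries.get_mut(addr) {
            entry.last_used = tick;
            return entry.value.clone();
        }
        // At capacity: evict the least recently used entry (O(n) scan, fine for small pools).
        if self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(key) = victim {
                self.entries.remove(&key);
            }
        }
        let value = connect();
        self.entries.insert(
            addr.to_string(),
            Entry { value: value.clone(), created: now, last_used: tick },
        );
        value
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

A concurrent gateway would wrap this in a mutex or use a sharded structure; the sketch only shows the eviction logic.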
@@ -317,10 +323,10 @@ Ensure the “happy path” is: HTTP edge → Gateway → gRPC internal → NATS
 - End-to-end smoke tests pass (gated).

 ### Tasks
-- [ ] Remove Gateway HTTP query proxy usage (or keep only as explicit compatibility shim)
-- [ ] Remove Gateway runner admin HTTP proxy usage (or keep only as explicit compatibility shim)
-- [ ] Ensure Control UI + Control API rely only on standardized surfaces
-- [ ] Harden metrics and readiness probes to match the standard contract everywhere
+- [x] Remove Gateway HTTP query proxy usage (kept HTTP edge; Gateway routes internally to Projection gRPC)
+- [x] Remove Gateway runner admin HTTP proxy usage (kept HTTP edge; Gateway routes internally to RunnerAdmin gRPC)
+- [x] Ensure Control UI + Control API rely only on standardized surfaces
+- [x] Harden metrics and readiness probes to match the standard contract everywhere

 ### Required Tests
 - Workspace verification commands