# Gateway Transport Plan

## Purpose

Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:

- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)

## Non-Negotiable Rules (Global)

- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
  - All tasks completed
  - All tests passing
  - Workspace lint/format/type checks passing
  - Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)

## Current State (Baseline)

- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)

## Target Architecture (End State)

- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- `shared` is the single source of truth for:
  - Header names and propagation rules
  - Trace parsing/validation rules (`traceparent`, `trace-id`)
  - Request context representation (tenant/correlation/trace)

## Definitions

### Request Context

Fields that must be consistently propagated:

- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
- `correlation_id` (HTTP:
`x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
- `request_id` (HTTP: `x-request-id`, optional for NATS)

### Standard Health Endpoints (per service)

- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if applicable)
- `GET /metrics` Prometheus

## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)

### Goal

Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).

### Exit Criteria

- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.

### Tasks

- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
- [ ] Add `shared` helpers for:
  - HTTP extract/inject
  - gRPC metadata extract/inject
  - NATS header extract/inject
- [ ] Gateway: ensure context is injected into:
  - gRPC upstream requests to Aggregate
  - HTTP upstream requests to Projection
  - Runner admin proxy requests
  - Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
  - `tenant-id`
  - `x-correlation-id` + `correlation-id`
  - `traceparent`
  - `trace-id` (derived when possible)
- [ ] Add transport-level tests:
  - [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
  - [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
  - [ ] NATS publish path: produced headers contain expected keys/values

### Required Tests

- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct

## Milestone 1: Internal RPC Standardization (Projection via gRPC)

### Goal

Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.

### Exit Criteria

- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.
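
The deny-by-default rule in the exit criteria above could be sketched roughly as follows. This is a minimal illustration that assumes a hypothetical in-memory grant table keyed by tenant and principal; the `Grants` type, the `Decision` enum, and the right name `query.read` are invented for the example and are not taken from the existing codebase, so the real Gateway authz model may look different.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical grant table: tenant -> principal -> rights.
/// All names here are assumptions made for illustration only.
#[derive(Default)]
struct Grants {
    by_tenant: HashMap<String, HashMap<String, HashSet<String>>>,
}

/// Outcome of an authorization check, with a short reason on denial.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny(&'static str),
}

impl Grants {
    /// Record an explicit grant for one principal within one tenant.
    fn grant(&mut self, tenant: &str, principal: &str, right: &str) {
        self.by_tenant
            .entry(tenant.to_string())
            .or_default()
            .entry(principal.to_string())
            .or_default()
            .insert(right.to_string());
    }

    /// Deny-by-default: only an explicit grant for this exact tenant
    /// yields `Allow`. A missing tenant, an unknown principal, or a
    /// missing right all fall through to `Deny`.
    fn authorize(&self, tenant: &str, principal: &str, right: &str) -> Decision {
        if tenant.is_empty() {
            return Decision::Deny("missing tenant");
        }
        match self.by_tenant.get(tenant).and_then(|p| p.get(principal)) {
            Some(rights) if rights.contains(right) => Decision::Allow,
            _ => Decision::Deny("no matching grant"),
        }
    }
}
```

In this sketch the Gateway would call `authorize` before forwarding a query over gRPC, so a principal granted `query.read` in one tenant is still denied when the request names a different tenant.
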
### Tasks

- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
  - [ ] Request includes tenant + view + query payload and metadata
  - [ ] Response includes result payload and standard context propagation
- [ ] Implement Projection gRPC server:
  - [ ] Parse tenant/view/query
  - [ ] Execute query against current projection storage/query engine
  - [ ] Enforce tenant scope
- [ ] Implement Gateway gRPC client path for queries:
  - [ ] Routing by tenant to Projection endpoint
  - [ ] Deadlines, bounded retries (idempotent only)
  - [ ] Context propagation (tenant/correlation/trace)
- [ ] Keep HTTP `/v1/query/*`:
  - [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
- [ ] Add tests:
  - [ ] Gateway query authz + forwarding via gRPC
  - [ ] Projection gRPC query contract tests for tenant isolation

### Required Tests

- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test

## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)

### Goal

Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.

### Exit Criteria

- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.

### Tasks

- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
  - [ ] Drain/resume/status/reload/tenant-scoped controls
  - [ ] Standard error mapping
- [ ] Implement Runner gRPC admin server:
  - [ ] Tenant gating enforced for tenant-scoped operations
  - [ ] Readiness/drain semantics aligned with platform contracts
- [ ] Implement Gateway gRPC client integration:
  - [ ] Route to Runner endpoint via routing table
  - [ ] Enforce authz rights (e.g.
`runner.admin`)
  - [ ] Context propagation
- [ ] Keep HTTP `/admin/*` in Runner optional:
  - [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
- [ ] Tests:
  - [ ] Gateway: admin calls rejected without rights
  - [ ] Gateway: tenant spoof attempts rejected
  - [ ] Runner: idempotency and drain semantics validated

### Required Tests

- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test

## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)

### Goal

Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.

### Exit Criteria

- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.

### Tasks

- [ ] Implement a Gateway upstream channel pool:
  - [ ] LRU bounded by max endpoints
  - [ ] TTL/eviction strategy
  - [ ] Fast path reuse under load
- [ ] Standardize retry profiles:
  - [ ] Read-only: short retry with jitter
  - [ ] Mutations: no automatic retry unless idempotency key present
- [ ] Standardize timeouts:
  - [ ] Edge timeout limits
  - [ ] Internal per-service deadlines
- [ ] Fanout controls:
  - [ ] Concurrency limiters for fleet snapshot/probes
  - [ ] Cache results where safe (short TTL)

### Required Tests

- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)

## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)

### Goal

Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.

### Exit Criteria

- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.

### Tasks

- [ ] Remove Gateway’s HTTP query proxy usage (or keep only as compatibility shim).
- [ ] Remove Gateway’s runner admin HTTP proxy usage (or keep only as compatibility shim).
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
- [ ] Harden metrics and health probes to always carry context.

### Required Tests

- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test

## Verification Commands (Required at Each Milestone)

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)

## Notes / Constraints

- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.
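
For reference, the `trace_id` derivation listed under Request Context (derived from `traceparent` or provided explicitly) could look like the sketch below. It follows the W3C Trace Context layout `version-traceid-parentid-flags`. The function names are hypothetical, and the precedence (prefer `traceparent`, fall back to an explicit `trace-id`) is an assumption, since the plan does not state which source wins.

```rust
/// True if `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// True if every byte of `s` is the character '0'.
fn all_zero(s: &str) -> bool {
    s.bytes().all(|b| b == b'0')
}

/// Derive the 32-character trace id from a W3C `traceparent` value.
/// Returns `None` for malformed values, the invalid version `ff`,
/// and all-zero trace or parent ids.
fn trace_id_from_traceparent(value: &str) -> Option<String> {
    let parts: Vec<&str> = value.trim().split('-').collect();
    if parts.len() < 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    if !is_lower_hex(version, 2) || version == "ff" {
        return None;
    }
    // Version 00 defines exactly four fields; later versions may append more.
    if version == "00" && parts.len() != 4 {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || all_zero(trace_id) {
        return None;
    }
    if !is_lower_hex(parent_id, 16) || all_zero(parent_id) {
        return None;
    }
    if !is_lower_hex(flags, 2) {
        return None;
    }
    Some(trace_id.to_string())
}

/// Resolve `trace_id` from the two possible sources.
/// Assumed precedence: a valid `traceparent` wins, otherwise a
/// well-formed explicit `trace-id` value is used.
fn resolve_trace_id(traceparent: Option<&str>, explicit: Option<&str>) -> Option<String> {
    traceparent.and_then(trace_id_from_traceparent).or_else(|| {
        explicit
            .map(str::trim)
            .filter(|id| is_lower_hex(id, 32) && !all_zero(id))
            .map(str::to_string)
    })
}
```

Keeping this logic in one place, for example in `shared`, would let the HTTP, gRPC, and NATS helpers agree on what counts as a valid trace id.
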