Gateway Transport Plan
Purpose
Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:
- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)
Non-Negotiable Rules (Global)
- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
- All tasks completed
- All tests passing
- Workspace lint/format/type checks passing
- Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)
Current State (Baseline)
- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (/v1/query/*)
- Gateway → Runner: HTTP proxy for admin endpoints (/admin/runner/*)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)
Target Architecture (End State)
- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- shared is the single source of truth for:
- Header names and propagation rules
- Trace parsing/validation rules (traceparent, trace-id)
- Request context representation (tenant/correlation/trace)
Definitions
Request Context
Fields that must be consistently propagated:
- tenant_id (HTTP: x-tenant-id, NATS: tenant-id)
- correlation_id (HTTP: x-correlation-id, NATS: x-correlation-id and correlation-id)
- traceparent (HTTP: traceparent, NATS: traceparent)
- trace_id (derived from traceparent or provided explicitly; NATS: trace-id)
- request_id (HTTP: x-request-id, optional for NATS)
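The field list above could be represented roughly as follows. This is a minimal sketch using only the standard library: the header names come from the list, while the `RequestContext` type, its method names, the use of a plain `HashMap` with lowercase keys, and the choice to treat only `tenant_id` as mandatory are assumptions made for illustration, not the actual shared API.

```rust
use std::collections::HashMap;

// Header names as listed in the Request Context definition.
pub const HTTP_TENANT_ID: &str = "x-tenant-id";
pub const HTTP_CORRELATION_ID: &str = "x-correlation-id";
pub const HTTP_REQUEST_ID: &str = "x-request-id";
pub const TRACEPARENT: &str = "traceparent";
pub const NATS_TENANT_ID: &str = "tenant-id";
pub const NATS_CORRELATION_ID: &str = "correlation-id";
pub const NATS_TRACE_ID: &str = "trace-id";

/// Hypothetical in-memory form of the request context.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
    pub request_id: Option<String>,
}

impl RequestContext {
    /// Extract a context from HTTP headers (keys assumed to be lowercase).
    /// Returns None when the tenant header is missing (assumption: tenant is mandatory).
    pub fn from_http(headers: &HashMap<String, String>) -> Option<Self> {
        let traceparent = headers.get(TRACEPARENT).cloned();
        // The trace id is the second '-'-separated field of a traceparent value.
        // No validation is done here; this only shows the derivation step.
        let trace_id = traceparent
            .as_deref()
            .and_then(|tp| tp.split('-').nth(1))
            .map(str::to_owned);
        Some(Self {
            tenant_id: headers.get(HTTP_TENANT_ID)?.clone(),
            correlation_id: headers.get(HTTP_CORRELATION_ID).cloned(),
            traceparent,
            trace_id,
            request_id: headers.get(HTTP_REQUEST_ID).cloned(),
        })
    }

    /// Inject the context into NATS headers. The correlation id is written
    /// under both x-correlation-id and correlation-id, as the list requires.
    pub fn to_nats_headers(&self) -> HashMap<String, String> {
        let mut out = HashMap::new();
        out.insert(NATS_TENANT_ID.to_owned(), self.tenant_id.clone());
        if let Some(id) = &self.correlation_id {
            out.insert(HTTP_CORRELATION_ID.to_owned(), id.clone());
            out.insert(NATS_CORRELATION_ID.to_owned(), id.clone());
        }
        if let Some(tp) = &self.traceparent {
            out.insert(TRACEPARENT.to_owned(), tp.clone());
        }
        if let Some(id) = &self.trace_id {
            out.insert(NATS_TRACE_ID.to_owned(), id.clone());
        }
        out
    }
}
```

Keeping extraction and injection on one type means a transport adapter (HTTP, gRPC metadata, NATS) only maps names, and the propagation rules live in one place.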
Standard Health Endpoints (per service)
- GET /health: liveness
- GET /ready: readiness (includes tenant gating if applicable)
- GET /metrics: Prometheus
Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)
Goal
Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).
Exit Criteria
- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, and cargo test --workspace all pass.
Tasks
- Standardize header constants in shared and remove string literals from Gateway and nodes where feasible.
- Add shared helpers for:
- HTTP extract/inject
- gRPC metadata extract/inject
- NATS header extract/inject
- Gateway: ensure context is injected into:
- gRPC upstream requests to Aggregate
- HTTP upstream requests to Projection
- Runner admin proxy requests
- Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- Projection/Runner/Aggregate: ensure NATS published messages include:
- tenant-id
- x-correlation-id + correlation-id
- traceparent
- trace-id (derived when possible)
- Add transport-level tests:
- Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
- Gateway HTTP proxy path: incoming context → upstream headers preserved
- NATS publish path: produced headers contain expected keys/values
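The "trace-id (derived when possible)" rule and the shared trace parsing utilities could look roughly like the sketch below. It follows the W3C Trace Context layout for traceparent (version, 32-hex trace id, 16-hex parent id, 2-hex flags, all lowercase, all-zero ids invalid). The function names are hypothetical, and two points are assumptions rather than anything this plan states: that a valid traceparent takes precedence over an explicitly provided trace-id, and that an explicit trace-id must use the same 32-hex format.

```rust
/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// A trace id is 32 lowercase hex characters and not all zeros.
fn is_valid_trace_id(s: &str) -> bool {
    is_lower_hex(s, 32) && s.bytes().any(|b| b != b'0')
}

/// Return the trace id when `value` is a well-formed traceparent.
pub fn trace_id_from_traceparent(value: &str) -> Option<&str> {
    let mut parts = value.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Version 00 has exactly four fields; later versions may append more.
    if version == "00" && parts.next().is_some() {
        return None;
    }
    let ok = is_lower_hex(version, 2)
        && version != "ff"
        && is_valid_trace_id(trace_id)
        && is_lower_hex(parent_id, 16)
        && parent_id.bytes().any(|b| b != b'0')
        && is_lower_hex(flags, 2);
    if ok { Some(trace_id) } else { None }
}

/// Resolve the trace id to publish: derived from traceparent when it parses,
/// otherwise the explicitly provided value if it is valid (assumed precedence).
pub fn resolve_trace_id(traceparent: Option<&str>, explicit: Option<&str>) -> Option<String> {
    traceparent
        .and_then(trace_id_from_traceparent)
        .or_else(|| explicit.filter(|id| is_valid_trace_id(id)))
        .map(str::to_owned)
}
```

A unit test for the shared utilities can then assert on pure functions like these without standing up any transport.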
Required Tests
- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct
Milestone 1: Internal RPC Standardization (Projection via gRPC)
Goal
Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.
Exit Criteria
- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.
Tasks
- Define protobuf API: projection.gateway.v1.QueryService
- Request includes tenant + view + query payload and metadata
- Response includes result payload and standard context propagation
- Implement Projection gRPC server:
- Parse tenant/view/query
- Execute query against current projection storage/query engine
- Enforce tenant scope
- Implement Gateway gRPC client path for queries:
- Routing by tenant to Projection endpoint
- Deadlines, bounded retries (idempotent only)
- Context propagation (tenant/correlation/trace)
- Keep HTTP /v1/query/*:
- Either route to internal gRPC implementation or keep as legacy/debug endpoint
- Add tests:
- Gateway query authz + forwarding via gRPC
- Projection gRPC query contract tests for tenant isolation
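The "enforce tenant scope" step on the Projection side can be illustrated without any gRPC machinery. The sketch below is a stand-in only: `QueryRequest`, `QueryError`, and `ProjectionStore` are hypothetical names, the in-memory map replaces the real projection storage, and substring matching replaces the real query engine. What it shows is the ordering: the tenant from the propagated context is compared with the tenant in the request before any data is touched.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum QueryError {
    /// Context tenant missing or different from the request tenant.
    PermissionDenied,
    /// No such view for this tenant.
    NotFound,
}

/// Hypothetical shape of a query request (tenant + view + query payload).
pub struct QueryRequest {
    pub tenant_id: String,
    pub view: String,
    pub query: String,
}

/// Stand-in for projection storage, keyed by (tenant, view).
pub struct ProjectionStore {
    rows: HashMap<(String, String), Vec<String>>,
}

impl ProjectionStore {
    pub fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    pub fn insert(&mut self, tenant: &str, view: &str, row: &str) {
        self.rows
            .entry((tenant.to_owned(), view.to_owned()))
            .or_default()
            .push(row.to_owned());
    }

    /// Execute a query. `ctx_tenant` is the tenant taken from the propagated
    /// request context; it must match the tenant named in the request.
    pub fn execute(&self, ctx_tenant: &str, req: &QueryRequest) -> Result<Vec<String>, QueryError> {
        if ctx_tenant.is_empty() || ctx_tenant != req.tenant_id.as_str() {
            return Err(QueryError::PermissionDenied);
        }
        // Lookups are keyed by tenant, so another tenant's view is unreachable.
        let key = (req.tenant_id.clone(), req.view.clone());
        let rows = self.rows.get(&key).ok_or(QueryError::NotFound)?;
        Ok(rows
            .iter()
            .filter(|row| row.contains(req.query.as_str()))
            .cloned()
            .collect())
    }
}
```

A tenant isolation contract test then reduces to asserting that a request for one tenant, issued under another tenant's context, fails before any rows are read.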
Required Tests
- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test
Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)
Goal
Replace /admin/runner/* HTTP proxying with a first-class gRPC admin service for Runner operations.
Exit Criteria
- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remain in Gateway; Runner trusts the Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.
Tasks
- Define protobuf API: runner.admin.v1.RunnerAdmin
- Drain/resume/status/reload/tenant-scoped controls
- Standard error mapping
- Implement Runner gRPC admin server:
- Tenant gating enforced for tenant-scoped operations
- Readiness/drain semantics aligned with platform contracts
- Implement Gateway gRPC client integration:
- Route to Runner endpoint via routing table
- Enforce authz rights (e.g. runner.admin)
- Context propagation
- Keep HTTP /admin/* in Runner optional:
- Either remove Gateway proxy usage or keep for direct debugging behind secure network
- Tests:
- Gateway: admin calls rejected without rights
- Gateway: tenant spoof attempts rejected
- Runner: idempotency and drain semantics validated
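The two Gateway-side rejections listed in the tests (missing rights, tenant spoofing) can be expressed as one deny-by-default check that runs before any call is forwarded to Runner. In the sketch below the right name runner.admin is the example given above; the `Principal` and `AdminDecision` types, and the reading of "tenant spoof" as a request naming a tenant other than the authenticated principal's, are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Right name taken from the example in the task list.
pub const RUNNER_ADMIN_RIGHT: &str = "runner.admin";

/// Hypothetical authenticated caller as seen by the Gateway.
pub struct Principal {
    pub tenant_id: String,
    pub rights: HashSet<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdminDecision {
    Allow,
    DenyMissingRight,
    DenyTenantMismatch,
}

/// Decide whether an admin call may be forwarded to Runner.
/// `target_tenant` is Some for tenant-scoped operations and None otherwise.
pub fn authorize_runner_admin(principal: &Principal, target_tenant: Option<&str>) -> AdminDecision {
    // Deny by default: nothing is forwarded without the explicit right.
    if !principal.rights.contains(RUNNER_ADMIN_RIGHT) {
        return AdminDecision::DenyMissingRight;
    }
    match target_tenant {
        // A request naming a different tenant is treated as a spoof attempt.
        Some(tenant) if tenant != principal.tenant_id.as_str() => AdminDecision::DenyTenantMismatch,
        _ => AdminDecision::Allow,
    }
}
```

Because the decision is a pure function of the principal and the target, the negative cases can be covered by unit tests without a running Runner.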
Required Tests
- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test
Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)
Goal
Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.
Exit Criteria
- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.
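The rule that retries apply only to idempotent operations can be captured in a small retry profile. The following is a sketch under stated assumptions: the attempt count (3), the backoff bounds (25 ms base, 200 ms cap), and all type and function names are illustrative values, not figures from this plan. Jitter is passed in as an integer per-mille value so the example stays deterministic and free of external crates; a real caller would draw it from an RNG. Deadlines are deliberately left out of this sketch.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallKind {
    ReadOnly,
    Mutation { has_idempotency_key: bool },
}

#[derive(Debug, Clone, Copy)]
pub struct RetryProfile {
    pub max_attempts: u32,
    pub base_backoff: Duration,
    pub max_backoff: Duration,
}

impl RetryProfile {
    /// Read-only calls and mutations with an idempotency key may retry;
    /// any other mutation gets exactly one attempt.
    pub fn for_call(kind: CallKind) -> Self {
        let retryable = match kind {
            CallKind::ReadOnly => true,
            CallKind::Mutation { has_idempotency_key } => has_idempotency_key,
        };
        Self {
            max_attempts: if retryable { 3 } else { 1 }, // illustrative value
            base_backoff: Duration::from_millis(25),     // illustrative value
            max_backoff: Duration::from_millis(200),     // illustrative value
        }
    }

    /// Delay before retry number `retry` (1-based): capped exponential backoff
    /// scaled by `jitter_permille` / 1000 (values above 1000 are clamped).
    pub fn backoff(&self, retry: u32, jitter_permille: u32) -> Duration {
        let exp = self.base_backoff.saturating_mul(1u32 << retry.min(16));
        let capped = exp.min(self.max_backoff);
        capped.saturating_mul(jitter_permille.min(1000)) / 1000
    }
}

/// Run `op` until it succeeds or the profile's attempt budget is spent.
/// `jitter_permille` and `sleep` are injected so tests need no clock or RNG.
pub fn call_with_retry<T, E>(
    profile: &RetryProfile,
    mut op: impl FnMut() -> Result<T, E>,
    mut jitter_permille: impl FnMut() -> u32,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= profile.max_attempts => return Err(err),
            Err(_) => {
                sleep(profile.backoff(attempt, jitter_permille()));
                attempt += 1;
            }
        }
    }
}
```

Deriving the profile from the call kind, rather than letting each call site choose, keeps the "no automatic retry for mutations" rule from being bypassed by accident.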
Tasks
- Implement a Gateway upstream channel pool:
- LRU bounded by max endpoints
- TTL/eviction strategy
- Fast path reuse under load
- Standardize retry profiles:
- Read-only: short retry with jitter
- Mutations: no automatic retry unless idempotency key present
- Standardize timeouts:
- Edge timeout limits
- Internal per-service deadlines
- Fanout controls:
- Concurrency limiters for fleet snapshot/probes
- Cache results where safe (short TTL)
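The upstream channel pool described in the tasks above (LRU bounded by max endpoints, TTL eviction, reuse on the fast path) could be sketched as follows. The `ChannelPool` type and its methods are hypothetical, the channel type is left generic so no gRPC library is assumed, and the current time is passed in so the TTL logic can be tested without sleeping. The linear scan for the least recently used entry is a simplification that is reasonable only because the pool is small and bounded.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: u64,
}

/// Bounded pool of upstream channels keyed by endpoint.
pub struct ChannelPool<C> {
    max_endpoints: usize,
    ttl: Duration,
    tick: u64, // logical clock used for LRU ordering
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(max_endpoints: usize, ttl: Duration) -> Self {
        Self { max_endpoints: max_endpoints.max(1), ttl, tick: 0, entries: HashMap::new() }
    }

    /// Return the pooled channel for `endpoint`, calling `connect` only when
    /// there is no live entry. Expired entries are replaced, and the least
    /// recently used entry is evicted when the pool is full.
    pub fn get_or_connect(
        &mut self,
        endpoint: &str,
        now: Instant,
        connect: impl FnOnce(&str) -> C,
    ) -> C {
        self.tick += 1;
        let (tick, ttl) = (self.tick, self.ttl);

        // Fast path: reuse a channel that has not outlived its TTL.
        if let Some(entry) = self.entries.get_mut(endpoint) {
            if now.saturating_duration_since(entry.created) < ttl {
                entry.last_used = tick;
                return entry.channel.clone();
            }
        }
        // Either absent or expired: drop any stale entry before reconnecting.
        self.entries.remove(endpoint);

        if self.entries.len() >= self.max_endpoints {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, entry)| entry.last_used)
                .map(|(key, _)| key.clone());
            if let Some(key) = lru {
                self.entries.remove(&key);
            }
        }

        let channel = connect(endpoint);
        self.entries.insert(
            endpoint.to_owned(),
            Entry { channel: channel.clone(), created: now, last_used: tick },
        );
        channel
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn contains(&self, endpoint: &str) -> bool {
        self.entries.contains_key(endpoint)
    }
}
```

The pool hands out clones, which suits channel types that are cheap to clone and multiplex requests over one connection; a channel type without that property would need a different ownership scheme.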
Required Tests
- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)
Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)
Goal
Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.
Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.
Tasks
- Remove Gateway’s HTTP query proxy usage (or keep only as compatibility shim).
- Remove Gateway’s runner admin HTTP proxy usage (or keep only as compatibility shim).
- Ensure Control UI + Control API use the standardized Gateway surfaces.
- Harden metrics and health probes to always carry context.
Required Tests
- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test
Verification Commands (Required at Each Milestone)
- cargo fmt --check
- cargo clippy --workspace --all-targets -- -D warnings
- cargo test --workspace
- npm ci && npm run lint && npm run typecheck && npm run test && npm run build (in control/ui)
Notes / Constraints
- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.