# Gateway Transport Plan

## Purpose

Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:

- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)

## Non-Negotiable Rules (Global)

- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
  - All tasks completed
  - All tests passing
  - Workspace lint/format/type checks passing
  - Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)

## Current State (Baseline)

- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)

## Target Architecture (End State)

- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- `shared` is the single source of truth for:
  - Header names and propagation rules
  - Trace parsing/validation rules (`traceparent`, `trace-id`)
  - Request context representation (tenant/correlation/trace)

## Definitions

### Request Context

Fields that must be consistently propagated:

- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
- `request_id` (HTTP: `x-request-id`, optional for NATS)

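To make the `trace_id` derivation concrete, here is a minimal sketch of how a request context type and a `traceparent` parser could look in `shared`. The type name `RequestContext`, the function `trace_id_from_traceparent`, and the method `with_derived_trace_id` are hypothetical names chosen for illustration, not existing code. The sketch assumes the four-field W3C Trace Context format (`version-traceid-parentid-flags`) and assumes that an explicitly provided `trace_id` takes precedence over a derived one, which this plan does not state.

```rust
/// Hypothetical representation of the request context fields listed above.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
    pub request_id: Option<String>,
}

/// Extracts the 32-hex-digit trace id from a `traceparent` value.
/// This sketch only accepts the four-field form and returns `None`
/// for anything malformed.
pub fn trace_id_from_traceparent(value: &str) -> Option<String> {
    let parts: Vec<&str> = value.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    let is_lower_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    let well_formed = is_lower_hex(version, 2)
        && version != "ff"
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2);
    // All-zero trace ids and parent ids are invalid in W3C Trace Context.
    let non_zero = |s: &str| s.bytes().any(|b| b != b'0');
    if well_formed && non_zero(trace_id) && non_zero(parent_id) {
        Some(trace_id.to_string())
    } else {
        None
    }
}

impl RequestContext {
    /// Fills `trace_id` from `traceparent` only when no explicit value is set
    /// (assumed precedence; the plan does not specify it).
    pub fn with_derived_trace_id(mut self) -> Self {
        if self.trace_id.is_none() {
            self.trace_id = self
                .traceparent
                .as_deref()
                .and_then(trace_id_from_traceparent);
        }
        self
    }
}
```

A caller would build the context from incoming headers and then call `with_derived_trace_id()` once, so every downstream transport sees the same `trace_id`.
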
### Standard Health Endpoints (per service)

- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if applicable)
- `GET /metrics` Prometheus

## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)

### Goal

Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).

### Exit Criteria

- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.

### Tasks

- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
- [ ] Add `shared` helpers for:
  - HTTP extract/inject
  - gRPC metadata extract/inject
  - NATS header extract/inject
- [ ] Gateway: ensure context is injected into:
  - gRPC upstream requests to Aggregate
  - HTTP upstream requests to Projection
  - Runner admin proxy requests
  - Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
  - `tenant-id`
  - `x-correlation-id` + `correlation-id`
  - `traceparent`
  - `trace-id` (derived when possible)
- [ ] Add transport-level tests:
  - [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
  - [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
  - [ ] NATS publish path: produced headers contain expected keys/values

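The NATS header task above could be sketched roughly as follows. This is an illustration only: `Headers` is a plain `BTreeMap` standing in for whatever header type the NATS client exposes, and `PublishContext`, `inject_nats`, and `extract_nats` are hypothetical names. The sketch assumes that a message without `tenant-id` is rejected on extraction, in line with the deny-by-default principle, and that `x-correlation-id` is preferred over `correlation-id` when both are present.

```rust
use std::collections::BTreeMap;

/// Stand-in for a transport header map (assumption for this sketch).
pub type Headers = BTreeMap<String, String>;

/// Hypothetical subset of the request context carried on NATS messages.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct PublishContext {
    pub tenant_id: String,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
}

/// Writes the standard NATS headers named in the task list above.
pub fn inject_nats(ctx: &PublishContext, headers: &mut Headers) {
    headers.insert("tenant-id".to_string(), ctx.tenant_id.clone());
    if let Some(id) = &ctx.correlation_id {
        // Both spellings are written, as the contract lists both for NATS.
        headers.insert("x-correlation-id".to_string(), id.clone());
        headers.insert("correlation-id".to_string(), id.clone());
    }
    if let Some(tp) = &ctx.traceparent {
        headers.insert("traceparent".to_string(), tp.clone());
    }
    if let Some(id) = &ctx.trace_id {
        headers.insert("trace-id".to_string(), id.clone());
    }
}

/// Reads the context back; returns `None` when `tenant-id` is missing or empty.
pub fn extract_nats(headers: &Headers) -> Option<PublishContext> {
    let get = |name: &str| headers.get(name).filter(|v| !v.is_empty()).cloned();
    Some(PublishContext {
        tenant_id: get("tenant-id")?,
        correlation_id: get("x-correlation-id").or_else(|| get("correlation-id")),
        traceparent: get("traceparent"),
        trace_id: get("trace-id"),
    })
}
```

A round trip through `inject_nats` and `extract_nats` is the kind of property the "NATS publish path" transport test could assert.
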
### Required Tests

- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct

## Milestone 1: Internal RPC Standardization (Projection via gRPC)

### Goal

Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.

### Exit Criteria

- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.

### Tasks

- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
  - [ ] Request includes tenant + view + query payload and metadata
  - [ ] Response includes result payload and standard context propagation
- [ ] Implement Projection gRPC server:
  - [ ] Parse tenant/view/query
  - [ ] Execute query against current projection storage/query engine
  - [ ] Enforce tenant scope
- [ ] Implement Gateway gRPC client path for queries:
  - [ ] Routing by tenant to Projection endpoint
  - [ ] Deadlines, bounded retries (idempotent only)
  - [ ] Context propagation (tenant/correlation/trace)
- [ ] Keep HTTP `/v1/query/*`:
  - [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
- [ ] Add tests:
  - [ ] Gateway query authz + forwarding via gRPC
  - [ ] Projection gRPC query contract tests for tenant isolation

### Required Tests

- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test

## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)

### Goal

Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.

### Exit Criteria

- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.

### Tasks

- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
  - [ ] Drain/resume/status/reload/tenant-scoped controls
  - [ ] Standard error mapping
- [ ] Implement Runner gRPC admin server:
  - [ ] Tenant gating enforced for tenant-scoped operations
  - [ ] Readiness/drain semantics aligned with platform contracts
- [ ] Implement Gateway gRPC client integration:
  - [ ] Route to Runner endpoint via routing table
  - [ ] Enforce authz rights (e.g. `runner.admin`)
  - [ ] Context propagation
- [ ] Keep HTTP `/admin/*` in Runner optional:
  - [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
- [ ] Tests:
  - [ ] Gateway: admin calls rejected without rights
  - [ ] Gateway: tenant spoof attempts rejected
  - [ ] Runner: idempotency and drain semantics validated

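The two Gateway test cases above (missing rights, tenant spoofing) suggest a check shaped roughly like the sketch below. `Caller`, `Denied`, and `authorize_admin` are hypothetical names, and the sketch assumes the caller's tenant and rights have already been established by authentication; how the Gateway actually represents them is not specified in this plan.

```rust
use std::collections::BTreeSet;

/// Hypothetical authenticated caller, as seen by the Gateway.
#[derive(Debug, Clone)]
pub struct Caller {
    pub tenant_id: String,
    pub rights: BTreeSet<String>,
}

/// Reasons a Runner admin call is rejected before it is forwarded.
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    MissingRight,
    TenantMismatch,
}

/// Deny-by-default check: the call is forwarded only if the caller holds
/// the required right AND the tenant named in the request matches the
/// caller's own tenant (rejecting tenant-spoof attempts).
pub fn authorize_admin(
    caller: &Caller,
    requested_tenant: &str,
    required_right: &str,
) -> Result<(), Denied> {
    if !caller.rights.contains(required_right) {
        return Err(Denied::MissingRight);
    }
    if caller.tenant_id != requested_tenant {
        return Err(Denied::TenantMismatch);
    }
    Ok(())
}
```

Keeping the check as a pure function makes the negative cases cheap to cover in unit tests, independent of the gRPC client wiring.
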
### Required Tests

- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test

## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)

### Goal

Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.

### Exit Criteria

- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.

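As an illustration of the "retries are only for idempotent operations" rule, a retry profile could be expressed as a small value type. Everything here is a hypothetical sketch: `OpKind`, `RetryProfile`, `may_retry`, and `backoff` are invented names, and the "full jitter" formula (a random fraction of a capped exponential delay) is one common choice, not something this plan prescribes. The random fraction is passed in by the caller so the sketch needs no RNG dependency.

```rust
use std::time::Duration;

/// Whether an operation only reads state or changes it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    ReadOnly,
    Mutation,
}

/// Hypothetical retry profile with a hard cap on attempts and delay.
#[derive(Debug, Clone, Copy)]
pub struct RetryProfile {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryProfile {
    /// Returns true if another attempt is allowed after `attempts_made`
    /// attempts. Mutations are retried only with an idempotency key.
    pub fn may_retry(&self, kind: OpKind, has_idempotency_key: bool, attempts_made: u32) -> bool {
        let idempotent = kind == OpKind::ReadOnly || has_idempotency_key;
        idempotent && attempts_made < self.max_attempts
    }

    /// Delay before the next attempt: `unit_random` (expected in [0, 1])
    /// times the exponential delay capped at `max_delay`.
    pub fn backoff(&self, attempts_made: u32, unit_random: f64) -> Duration {
        let exp = attempts_made.saturating_sub(1).min(16);
        let capped = self
            .base_delay
            .saturating_mul(1u32 << exp)
            .min(self.max_delay);
        // Guard against NaN/out-of-range input so `mul_f64` cannot panic.
        let fraction = if unit_random.is_finite() {
            unit_random.clamp(0.0, 1.0)
        } else {
            0.0
        };
        capped.mul_f64(fraction)
    }
}
```

The overall request deadline would still bound the total time spent; the profile only decides whether and when another attempt fits inside it.
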
### Tasks

- [ ] Implement a Gateway upstream channel pool:
  - [ ] LRU bounded by max endpoints
  - [ ] TTL/eviction strategy
  - [ ] Fast path reuse under load
- [ ] Standardize retry profiles:
  - [ ] Read-only: short retry with jitter
  - [ ] Mutations: no automatic retry unless idempotency key present
- [ ] Standardize timeouts:
  - [ ] Edge timeout limits
  - [ ] Internal per-service deadlines
- [ ] Fanout controls:
  - [ ] Concurrency limiters for fleet snapshot/probes
  - [ ] Cache results where safe (short TTL)

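The channel pool task could start from a structure like the one below. This is a sketch under stated assumptions, not the intended implementation: `ChannelPool` and `get_or_connect` are hypothetical names, the channel type `C` is generic so no particular gRPC library is implied, the TTL is measured from creation (a maximum connection age), and the current time is passed in so eviction can be unit tested deterministically. The LRU victim is found with a linear scan, which is acceptable only because the pool is bounded by a small `max_endpoints`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    created: Instant,
    last_used: Instant,
}

/// Hypothetical bounded pool of upstream channels keyed by endpoint.
pub struct ChannelPool<C> {
    max_endpoints: usize,
    ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(max_endpoints: usize, ttl: Duration) -> Self {
        Self {
            max_endpoints: max_endpoints.max(1),
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Returns a cached channel for `endpoint`, or calls `connect` to
    /// create one, evicting expired and least-recently-used entries.
    pub fn get_or_connect(
        &mut self,
        endpoint: &str,
        now: Instant,
        connect: impl FnOnce() -> C,
    ) -> C {
        // Drop entries older than the TTL.
        let ttl = self.ttl;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.created) < ttl);

        // Fast path: reuse an existing channel and refresh its recency.
        if let Some(entry) = self.entries.get_mut(endpoint) {
            entry.last_used = now;
            return entry.channel.clone();
        }

        // At capacity: evict the least recently used endpoint.
        if self.entries.len() >= self.max_endpoints {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(key) = lru {
                self.entries.remove(&key);
            }
        }

        let channel = connect();
        self.entries.insert(
            endpoint.to_string(),
            Entry { channel: channel.clone(), created: now, last_used: now },
        );
        channel
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn contains(&self, endpoint: &str) -> bool {
        self.entries.contains_key(endpoint)
    }
}
```

In the Gateway this would sit behind a lock or be sharded per worker; that choice affects the "fast path reuse under load" task and is left open here.
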
### Required Tests

- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)

## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)

### Goal

Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.

### Exit Criteria

- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.

### Tasks

- [ ] Remove Gateway’s HTTP query proxy usage (or keep only as compatibility shim).
- [ ] Remove Gateway’s runner admin HTTP proxy usage (or keep only as compatibility shim).
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
- [ ] Harden metrics and health probes to always carry context.

### Required Tests

- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test

## Verification Commands (Required at Each Milestone)

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)

## Notes / Constraints

- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.