cloudlysis/GATEWAY_TRANSPORT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00

# Gateway Transport Plan
## Purpose
Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:
- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)
## Non-Negotiable Rules (Global)
- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
  - All tasks completed
  - All tests passing
  - Workspace lint/format/type checks passing
  - Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)
## Current State (Baseline)
- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)
## Target Architecture (End State)
- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- `shared` is the single source of truth for:
  - Header names and propagation rules
  - Trace parsing/validation rules (`traceparent`, `trace-id`)
  - Request context representation (tenant/correlation/trace)
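
The trace parsing/validation rules could look roughly like the sketch below. It assumes the W3C Trace Context layout for `traceparent` (`version-traceid-parentid-flags`, lowercase hex). The names `TraceParent`, `parse_traceparent`, and `derive_trace_id` are hypothetical and not taken from the existing `shared` crate. Preferring `traceparent` over an explicitly supplied trace id is also an assumption.

```rust
/// Parsed fields of a `traceparent` header (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,
    pub parent_id: String,
    pub sampled: bool,
}

/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses and validates a `traceparent` value; returns `None` when malformed.
pub fn parse_traceparent(value: &str) -> Option<TraceParent> {
    let mut parts = value.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Version 00 has exactly four fields; "ff" is reserved as invalid.
    if version == "00" && parts.next().is_some() {
        return None;
    }
    if !is_lower_hex(version, 2) || version == "ff" {
        return None;
    }
    // All-zero ids are invalid.
    if !is_lower_hex(trace_id, 32) || trace_id.bytes().all(|b| b == b'0') {
        return None;
    }
    if !is_lower_hex(parent_id, 16) || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    if !is_lower_hex(flags, 2) {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;

    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: flag_bits & 0x01 == 0x01,
    })
}

/// Derives `trace-id`: a valid `traceparent` wins, otherwise a well-formed
/// explicit id is used (assumed precedence).
pub fn derive_trace_id(traceparent: Option<&str>, explicit: Option<&str>) -> Option<String> {
    traceparent
        .and_then(parse_traceparent)
        .map(|tp| tp.trace_id)
        .or_else(|| explicit.filter(|id| is_lower_hex(id, 32)).map(str::to_string))
}
```

Returning `Option` rather than an error keeps callers free to decide whether a malformed header is rejected or silently replaced with a fresh trace.
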
## Definitions
### Request Context
Fields that must be consistently propagated:
- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
- `request_id` (HTTP: `x-request-id`, optional for NATS)
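
As an illustration, the field list above might map onto shared constants and a context type as follows. This is a minimal sketch: the module names, `RequestContext`, `MissingTenant`, and `from_http` are hypothetical, and headers are modeled as a plain `HashMap` with lowercase keys rather than any particular HTTP library's type. Only the header names themselves come from this plan.

```rust
use std::collections::HashMap;

/// HTTP header names from the Request Context definition.
pub mod http {
    pub const TENANT_ID: &str = "x-tenant-id";
    pub const CORRELATION_ID: &str = "x-correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const REQUEST_ID: &str = "x-request-id";
}

/// NATS header names from the Request Context definition.
pub mod nats {
    pub const TENANT_ID: &str = "tenant-id";
    /// Correlation id is written under both keys.
    pub const CORRELATION_ID_X: &str = "x-correlation-id";
    pub const CORRELATION_ID: &str = "correlation-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const TRACE_ID: &str = "trace-id";
}

/// Returned when a request carries no usable tenant id (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub struct MissingTenant;

/// Transport-independent request context (hypothetical shape).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RequestContext {
    pub tenant_id: String,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
    pub request_id: Option<String>,
}

impl RequestContext {
    /// Extracts the context from HTTP headers keyed by lowercase name.
    /// Blank values are treated as absent; a missing tenant is an error.
    pub fn from_http(headers: &HashMap<String, String>) -> Result<Self, MissingTenant> {
        let get = |name: &str| {
            headers
                .get(name)
                .map(|v| v.trim().to_string())
                .filter(|v| !v.is_empty())
        };
        let tenant_id = get(http::TENANT_ID).ok_or(MissingTenant)?;
        Ok(Self {
            tenant_id,
            correlation_id: get(http::CORRELATION_ID),
            traceparent: get(http::TRACEPARENT),
            // Deriving trace_id from traceparent is left out of this sketch.
            trace_id: None,
            request_id: get(http::REQUEST_ID),
        })
    }
}
```
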
### Standard Health Endpoints (per service)
- `GET /health`: liveness
- `GET /ready`: readiness (includes tenant gating if applicable)
- `GET /metrics`: Prometheus
## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)
### Goal
Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).
### Exit Criteria
- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.
### Tasks
- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
- [ ] Add `shared` helpers for:
  - HTTP extract/inject
  - gRPC metadata extract/inject
  - NATS header extract/inject
- [ ] Gateway: ensure context is injected into:
  - gRPC upstream requests to Aggregate
  - HTTP upstream requests to Projection
  - Runner admin proxy requests
  - Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
  - `tenant-id`
  - `x-correlation-id` + `correlation-id`
  - `traceparent`
  - `trace-id` (derived when possible)
- [ ] Add transport-level tests:
  - [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
  - [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
  - [ ] NATS publish path: produced headers contain expected keys/values
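
One possible shape for the inject helpers and the NATS publish-path check is sketched below. It assumes a small carrier trait so the same injection logic can target HTTP headers, gRPC metadata, or NATS headers. `HeaderCarrier`, `Ctx`, `inject_nats`, and `missing_required` are hypothetical names, and a `HashMap` stands in for the real header types.

```rust
use std::collections::HashMap;

/// Minimal view of the request context used by this sketch (hypothetical).
pub struct Ctx<'a> {
    pub tenant_id: &'a str,
    pub correlation_id: &'a str,
    pub traceparent: Option<&'a str>,
    pub trace_id: Option<&'a str>,
}

/// Anything that can carry string key/value headers.
pub trait HeaderCarrier {
    fn set_header(&mut self, key: &str, value: &str);
    fn get_header(&self, key: &str) -> Option<&str>;
}

/// Stand-in carrier; keys are normalized to lowercase.
impl HeaderCarrier for HashMap<String, String> {
    fn set_header(&mut self, key: &str, value: &str) {
        self.insert(key.to_ascii_lowercase(), value.to_string());
    }
    fn get_header(&self, key: &str) -> Option<&str> {
        HashMap::get(self, key.to_ascii_lowercase().as_str()).map(|v| v.as_str())
    }
}

/// Writes the NATS headers listed in the task above, including both
/// correlation keys.
pub fn inject_nats<C: HeaderCarrier>(ctx: &Ctx, carrier: &mut C) {
    carrier.set_header("tenant-id", ctx.tenant_id);
    carrier.set_header("x-correlation-id", ctx.correlation_id);
    carrier.set_header("correlation-id", ctx.correlation_id);
    if let Some(tp) = ctx.traceparent {
        carrier.set_header("traceparent", tp);
    }
    if let Some(id) = ctx.trace_id {
        carrier.set_header("trace-id", id);
    }
}

/// Contract check for tests: names of expected keys that are absent or empty.
pub fn missing_required<C: HeaderCarrier>(carrier: &C) -> Vec<&'static str> {
    ["tenant-id", "x-correlation-id", "correlation-id", "traceparent", "trace-id"]
        .iter()
        .copied()
        .filter(|k| carrier.get_header(k).map_or(true, |v| v.is_empty()))
        .collect()
}
```

A transport contract test can then publish through the real code path, capture the headers, and assert that `missing_required` returns an empty list.
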
### Required Tests
- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct
## Milestone 1: Internal RPC Standardization (Projection via gRPC)
### Goal
Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
  - [ ] Request includes tenant + view + query payload and metadata
  - [ ] Response includes result payload and standard context propagation
- [ ] Implement Projection gRPC server:
  - [ ] Parse tenant/view/query
  - [ ] Execute query against current projection storage/query engine
  - [ ] Enforce tenant scope
- [ ] Implement Gateway gRPC client path for queries:
  - [ ] Routing by tenant to Projection endpoint
  - [ ] Deadlines, bounded retries (idempotent only)
  - [ ] Context propagation (tenant/correlation/trace)
- [ ] Keep HTTP `/v1/query/*`:
  - [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
- [ ] Add tests:
  - [ ] Gateway query authz + forwarding via gRPC
  - [ ] Projection gRPC query contract tests for tenant isolation
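
The deny-by-default and tenant isolation checks that the tests above target can be sketched as a single pre-forwarding gate. All names here (`Principal`, `QueryRequest`, `Denied`, `authorize_query`) and the right string `projection.query` are hypothetical. The plan only states that authorization stays in Gateway and that tenant scope is enforced.

```rust
use std::collections::HashSet;

/// Reasons a query is refused before it is forwarded (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    TenantMismatch,
    MissingRight,
}

/// The authenticated caller as seen by Gateway (hypothetical shape).
pub struct Principal {
    pub tenant_id: String,
    pub rights: HashSet<String>,
}

/// Mirrors the request fields named in the protobuf task: tenant + view + payload.
pub struct QueryRequest {
    pub tenant_id: String,
    pub view: String,
    pub payload: Vec<u8>,
}

/// Deny-by-default gate: the request is allowed only when the tenant matches
/// the authenticated principal AND the required right is explicitly granted.
pub fn authorize_query(
    principal: &Principal,
    req: &QueryRequest,
    required_right: &str,
) -> Result<(), Denied> {
    if principal.tenant_id != req.tenant_id {
        return Err(Denied::TenantMismatch);
    }
    if !principal.rights.contains(required_right) {
        return Err(Denied::MissingRight);
    }
    Ok(())
}
```

An empty rights set denies everything, which is the property the Gateway authz tests would pin down. A Projection-side contract test could apply the same tenant comparison to re-validate scope.
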
### Required Tests
- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test
## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)
### Goal
Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.
### Exit Criteria
- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
  - [ ] Drain/resume/status/reload/tenant-scoped controls
  - [ ] Standard error mapping
- [ ] Implement Runner gRPC admin server:
  - [ ] Tenant gating enforced for tenant-scoped operations
  - [ ] Readiness/drain semantics aligned with platform contracts
- [ ] Implement Gateway gRPC client integration:
  - [ ] Route to Runner endpoint via routing table
  - [ ] Enforce authz rights (e.g. `runner.admin`)
  - [ ] Context propagation
- [ ] Keep HTTP `/admin/*` in Runner optional:
  - [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
- [ ] Tests:
  - [ ] Gateway: admin calls rejected without rights
  - [ ] Gateway: tenant spoof attempts rejected
  - [ ] Runner: idempotency and drain semantics validated
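
To make the idempotency requirement for drain/resume concrete, here is a small state sketch. The types `RunnerState`, `Outcome`, and `RunnerAdminState` are hypothetical. The choice that a draining Runner reports not ready is an assumption, since the platform readiness contract is not spelled out in this plan.

```rust
/// Coarse Runner lifecycle state (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunnerState {
    Active,
    Draining,
}

/// Result of an admin transition; repeated calls are not errors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Changed,
    AlreadyInState,
}

/// Holds the state that the admin service mutates.
pub struct RunnerAdminState {
    state: RunnerState,
}

impl RunnerAdminState {
    pub fn new() -> Self {
        Self { state: RunnerState::Active }
    }

    /// Moves to `target`, reporting whether anything actually changed.
    fn transition(&mut self, target: RunnerState) -> Outcome {
        if self.state == target {
            Outcome::AlreadyInState
        } else {
            self.state = target;
            Outcome::Changed
        }
    }

    /// Idempotent: draining an already draining Runner is a no-op.
    pub fn drain(&mut self) -> Outcome {
        self.transition(RunnerState::Draining)
    }

    /// Idempotent: resuming an active Runner is a no-op.
    pub fn resume(&mut self) -> Outcome {
        self.transition(RunnerState::Active)
    }

    /// Assumed readiness rule: only an active Runner is ready.
    pub fn is_ready(&self) -> bool {
        self.state == RunnerState::Active
    }
}
```

Reporting `AlreadyInState` instead of failing lets Gateway retry an admin call safely after a timeout, and the distinction remains available for audit hooks.
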
### Required Tests
- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test
## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)
### Goal
Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.
### Exit Criteria
- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.
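
The retry rule in the exit criteria (retries only for idempotent operations, always bounded) might be expressed as a small policy type. This is a sketch under assumptions: `CallKind`, `RetryProfile`, and the full-jitter exponential backoff are illustrative choices, not something this plan prescribes. The random value is supplied by the caller because the standard library has no random number generator.

```rust
use std::time::Duration;

/// Whether a call can be repeated safely without extra guarantees.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallKind {
    ReadOnly,
    Mutation,
}

/// Bounded retry settings (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct RetryProfile {
    /// Total attempts allowed, including the first one.
    pub max_attempts: u32,
    pub base: Duration,
    pub cap: Duration,
}

impl RetryProfile {
    /// Read-only calls may retry until the bound is hit; mutations retry
    /// only when an idempotency key is present.
    pub fn may_retry(&self, kind: CallKind, has_idempotency_key: bool, attempts_made: u32) -> bool {
        if attempts_made >= self.max_attempts {
            return false;
        }
        match kind {
            CallKind::ReadOnly => true,
            CallKind::Mutation => has_idempotency_key,
        }
    }

    /// Full-jitter backoff: `unit_random * min(cap, base * 2^retry_index)`.
    /// `unit_random` is expected in [0, 1]; out-of-range values are clamped
    /// and non-finite values are treated as 0.
    pub fn backoff(&self, retry_index: u32, unit_random: f64) -> Duration {
        let factor = 2u32.saturating_pow(retry_index);
        let ceiling = self.base.saturating_mul(factor).min(self.cap);
        let r = if unit_random.is_finite() {
            unit_random.clamp(0.0, 1.0)
        } else {
            0.0
        };
        Duration::from_nanos((ceiling.as_nanos() as f64 * r) as u64)
    }
}
```

Jitter spreads retries from many callers over time, which is what keeps a partial outage from turning into a thundering herd.
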
### Tasks
- [ ] Implement a Gateway upstream channel pool:
  - [ ] LRU bounded by max endpoints
  - [ ] TTL/eviction strategy
  - [ ] Fast path reuse under load
- [ ] Standardize retry profiles:
  - [ ] Read-only: short retry with jitter
  - [ ] Mutations: no automatic retry unless idempotency key present
- [ ] Standardize timeouts:
  - [ ] Edge timeout limits
  - [ ] Internal per-service deadlines
- [ ] Fanout controls:
  - [ ] Concurrency limiters for fleet snapshot/probes
  - [ ] Cache results where safe (short TTL)
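
The upstream channel pool task could start from something like the sketch below. It is generic over the channel type so no particular gRPC library is assumed. `ChannelPool` and `get_or_connect` are hypothetical names, the TTL is interpreted as an idle timeout (an assumption), and the current time is passed in explicitly so eviction can be unit tested.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Entry<C> {
    channel: C,
    last_used: Instant,
}

/// Pool of upstream channels, bounded by endpoint count and idle TTL.
pub struct ChannelPool<C> {
    max_endpoints: usize,
    idle_ttl: Duration,
    entries: HashMap<String, Entry<C>>,
}

impl<C: Clone> ChannelPool<C> {
    pub fn new(max_endpoints: usize, idle_ttl: Duration) -> Self {
        Self {
            max_endpoints: max_endpoints.max(1), // always allow one endpoint
            idle_ttl,
            entries: HashMap::new(),
        }
    }

    /// Returns a pooled channel for `endpoint`, or creates one via `connect`.
    pub fn get_or_connect(
        &mut self,
        endpoint: &str,
        now: Instant,
        connect: impl FnOnce(&str) -> C,
    ) -> C {
        // Drop channels that have been idle for the TTL or longer.
        let ttl = self.idle_ttl;
        self.entries
            .retain(|_, e| now.saturating_duration_since(e.last_used) < ttl);

        // Fast path: reuse and refresh recency.
        if let Some(e) = self.entries.get_mut(endpoint) {
            e.last_used = now;
            return e.channel.clone();
        }

        // At capacity: evict the least recently used endpoint.
        // A linear scan is acceptable for a small, bounded pool.
        if self.entries.len() >= self.max_endpoints {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = lru {
                self.entries.remove(&k);
            }
        }

        let channel = connect(endpoint);
        self.entries.insert(
            endpoint.to_string(),
            Entry { channel: channel.clone(), last_used: now },
        );
        channel
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn contains(&self, endpoint: &str) -> bool {
        self.entries.contains_key(endpoint)
    }
}
```

In a real Gateway the pool would sit behind a lock or be sharded, and `C` would be a cheaply cloneable channel handle. Both points are outside this sketch.
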
### Required Tests
- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)
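
For the deadline propagation tests, it may help to model how an edge timeout turns into per-service deadlines. The sketch below is one way to do that. `Deadline`, `upstream_timeout`, and the idea of reserving a small slice of the budget for Gateway's own response handling are assumptions, not part of this plan.

```rust
use std::time::{Duration, Instant};

/// Absolute deadline established at the edge (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct Deadline {
    expires_at: Instant,
}

impl Deadline {
    /// Starts the clock when the edge request arrives.
    pub fn after(now: Instant, edge_timeout: Duration) -> Self {
        Self { expires_at: now + edge_timeout }
    }

    /// Time left before the edge deadline; zero once it has passed.
    pub fn remaining(&self, now: Instant) -> Duration {
        self.expires_at.saturating_duration_since(now)
    }

    /// Timeout to use for one upstream call: the remaining budget minus a
    /// reserve, capped by the per-service limit. `None` means the budget
    /// is exhausted and the call should not be made.
    pub fn upstream_timeout(
        &self,
        now: Instant,
        service_cap: Duration,
        reserve: Duration,
    ) -> Option<Duration> {
        let budget = self.remaining(now).checked_sub(reserve)?;
        if budget.is_zero() {
            return None;
        }
        Some(budget.min(service_cap))
    }
}
```

Because the deadline is absolute, a retry automatically gets a smaller timeout than the first attempt, so retries cannot extend the total time a client waits.
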
## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)
### Goal
Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.
### Tasks
- [ ] Remove Gateway's HTTP query proxy usage (or keep only as compatibility shim).
- [ ] Remove Gateway's Runner admin HTTP proxy usage (or keep only as compatibility shim).
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
- [ ] Harden metrics and health probes to always carry context.
### Required Tests
- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
## Notes / Constraints
- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.