Monorepo consolidation: workspace, shared types, transport plans, docker/swarm assets

2026-03-30 11:40:42 +03:00
parent 7e7041cf8b
commit 1298d9a3df
246 changed files with 55434 additions and 0 deletions

GATEWAY_TRANSPORT_PLAN.md Normal file

@@ -0,0 +1,216 @@
# Gateway Transport Plan
## Purpose
Standardize and optimize how the Gateway communicates with Aggregate, Projection, and Runner, and how nodes communicate via NATS JetStream, under these principles:
- Simplicity (few patterns, minimal bespoke conventions)
- Ease of operation (consistent health/ready/metrics, consistent failure modes)
- Frugality (bounded connections, bounded fanout, low overhead)
- High performance (low tail latency, backpressure-aware, predictable routing)
- Safety (tenant isolation, deny-by-default authz, consistent context propagation)
## Non-Negotiable Rules (Global)
- Every cross-service request MUST carry tenant + trace context.
- Every transport path MUST have explicit timeouts/deadlines and bounded retries.
- Every milestone below is “stop-the-line” gated:
- All tasks completed
- All tests passing
- Workspace lint/format/type checks passing
- Required integration tests for the milestone passing (when gated by env, they must be runnable and documented)
## Current State (Baseline)
- Gateway → Aggregate: gRPC command submission
- Gateway → Projection: HTTP query proxy (`/v1/query/*`)
- Gateway → Runner: HTTP proxy for admin endpoints (`/admin/runner/*`)
- Nodes ↔ NATS JetStream: events/workflow streams with headers for tenant/correlation/trace (now more consistent)
## Target Architecture (End State)
- Edge contract (clients ↔ Gateway): HTTP/JSON (stable, debuggable, browser + ops friendly)
- Internal RPC (Gateway ↔ services): gRPC for Aggregate + Projection + Runner (single internal RPC stack)
- Async/event backbone: NATS JetStream remains for event/work distribution
- `shared` is the single source of truth for:
- Header names and propagation rules
- Trace parsing/validation rules (`traceparent`, `trace-id`)
- Request context representation (tenant/correlation/trace)
## Definitions
### Request Context
Fields that must be consistently propagated:
- `tenant_id` (HTTP: `x-tenant-id`, NATS: `tenant-id`)
- `correlation_id` (HTTP: `x-correlation-id`, NATS: `x-correlation-id` and `correlation-id`)
- `traceparent` (HTTP: `traceparent`, NATS: `traceparent`)
- `trace_id` (derived from `traceparent` or provided explicitly; NATS: `trace-id`)
- `request_id` (HTTP: `x-request-id`, optional for NATS)
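The context fields above can be sketched as a plain struct plus a `trace_id` derivation from the W3C `traceparent` format (`version-traceid-parentid-flags`). This is a minimal illustration; the names `RequestContext` and `derive_trace_id` are assumptions, not the actual `shared` crate API.

```rust
/// Illustrative request context; the real `shared` type may differ.
#[derive(Debug, Clone, Default)]
pub struct RequestContext {
    pub tenant_id: Option<String>,
    pub correlation_id: Option<String>,
    pub traceparent: Option<String>,
    pub trace_id: Option<String>,
    pub request_id: Option<String>,
}

/// Derive `trace_id` from a W3C `traceparent` value
/// (`version-traceid-parentid-flags`), validating the 32-hex-char field.
pub fn derive_trace_id(traceparent: &str) -> Option<String> {
    let mut parts = traceparent.split('-');
    let _version = parts.next()?;
    let trace_id = parts.next()?;
    if trace_id.len() == 32 && trace_id.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(trace_id.to_ascii_lowercase())
    } else {
        None
    }
}

fn main() {
    let tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    let ctx = RequestContext {
        traceparent: Some(tp.to_string()),
        trace_id: derive_trace_id(tp),
        ..Default::default()
    };
    println!("{:?}", ctx.trace_id);
}
```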
### Standard Health Endpoints (per service)
- `GET /health` liveness
- `GET /ready` readiness (includes tenant gating if applicable)
- `GET /metrics` Prometheus
## Milestone 0: Transport Contract Lock-in (Context + Headers Everywhere)
### Goal
Make context propagation and header naming consistent and enforceable across HTTP, gRPC, and NATS, including “background” Gateway calls (health checks, rebalance probes).
### Exit Criteria
- A single shared contract exists for header names and trace parsing.
- Gateway injects context into all upstream calls (including rebalance/health probes).
- Aggregate/Projection/Runner consistently emit/consume the standard context on all transport paths they own.
- Unit tests prove propagation behavior for each transport.
- `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo test --workspace` all pass.
### Tasks
- [ ] Standardize header constants in `shared` and remove string literals from Gateway and nodes where feasible.
- [ ] Add `shared` helpers for:
- HTTP extract/inject
- gRPC metadata extract/inject
- NATS header extract/inject
- [ ] Gateway: ensure context is injected into:
- gRPC upstream requests to Aggregate
- HTTP upstream requests to Projection
- Runner admin proxy requests
- Any “probe” calls (rebalance gates, fleet snapshots, health checks)
- [ ] Projection/Runner/Aggregate: ensure NATS published messages include:
- `tenant-id`
- `x-correlation-id` + `correlation-id`
- `traceparent`
- `trace-id` (derived when possible)
- [ ] Add transport-level tests:
- [ ] Gateway gRPC path: incoming context → upstream metadata → response metadata preserved
- [ ] Gateway HTTP proxy path: incoming context → upstream headers preserved
- [ ] NATS publish path: produced headers contain expected keys/values
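A sketch of the "standardize header constants" task: one `headers` module in `shared`, plus an inject helper that writes the NATS key set (including the dual `x-correlation-id`/`correlation-id` convention). Constant and function names are assumptions for illustration; real helpers would target concrete header-map types rather than a `HashMap`.

```rust
use std::collections::HashMap;

/// Assumed centralized header constants; the real `shared` module may differ.
pub mod headers {
    pub const HTTP_TENANT_ID: &str = "x-tenant-id";
    pub const HTTP_CORRELATION_ID: &str = "x-correlation-id";
    pub const HTTP_REQUEST_ID: &str = "x-request-id";
    pub const TRACEPARENT: &str = "traceparent";
    pub const NATS_TENANT_ID: &str = "tenant-id";
    pub const NATS_CORRELATION_ID: &str = "correlation-id";
    pub const NATS_TRACE_ID: &str = "trace-id";
}

/// Inject the standard context into a NATS-style header map.
pub fn inject_nats(
    map: &mut HashMap<String, String>,
    tenant: &str,
    correlation: &str,
    traceparent: &str,
    trace_id: &str,
) {
    map.insert(headers::NATS_TENANT_ID.into(), tenant.into());
    // Correlation is published under both keys, per the contract above.
    map.insert(headers::HTTP_CORRELATION_ID.into(), correlation.into());
    map.insert(headers::NATS_CORRELATION_ID.into(), correlation.into());
    map.insert(headers::TRACEPARENT.into(), traceparent.into());
    map.insert(headers::NATS_TRACE_ID.into(), trace_id.into());
}

fn main() {
    let mut h = HashMap::new();
    inject_nats(&mut h, "t1", "c1", "00-abc-def-01", "abc");
    println!("{} headers injected", h.len());
}
```

A transport-contract test can then assert key presence against these constants instead of string literals.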
### Required Tests
- Unit tests for shared parsing/derivation utilities
- Existing per-crate test suites
- At least one per-service “transport contract” test verifying headers are present and correct
## Milestone 1: Internal RPC Standardization (Projection via gRPC)
### Goal
Eliminate Gateway → Projection HTTP proxy as the default path by introducing an internal gRPC Query service, keeping HTTP optional for human/debug use.
### Exit Criteria
- A Projection gRPC service exists for query execution.
- Gateway routes queries to Projection via gRPC by default.
- Authorization semantics remain enforced in Gateway (deny-by-default).
- Response shapes are stable and match the existing UI expectations.
- All tests pass, including new gRPC query integration tests.
### Tasks
- [ ] Define protobuf API: `projection.gateway.v1.QueryService`
- [ ] Request includes tenant + view + query payload and metadata
- [ ] Response includes result payload and standard context propagation
- [ ] Implement Projection gRPC server:
- [ ] Parse tenant/view/query
- [ ] Execute query against current projection storage/query engine
- [ ] Enforce tenant scope
- [ ] Implement Gateway gRPC client path for queries:
- [ ] Routing by tenant to Projection endpoint
- [ ] Deadlines, bounded retries (idempotent only)
- [ ] Context propagation (tenant/correlation/trace)
- [ ] Keep HTTP `/v1/query/*`:
- [ ] Either route to internal gRPC implementation or keep as legacy/debug endpoint
- [ ] Add tests:
- [ ] Gateway query authz + forwarding via gRPC
- [ ] Projection gRPC query contract tests for tenant isolation
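To make the message shapes concrete, here is an illustrative rendering of the `projection.gateway.v1.QueryService` request/response as plain Rust structs (the real types would be generated from protobuf, e.g. by tonic-build), with the tenant-scope check the Projection server would apply. All field and function names here are assumptions.

```rust
/// Assumed shape of the QueryService request; real type is proto-generated.
#[derive(Debug, Clone)]
pub struct QueryRequest {
    pub tenant_id: String,
    pub view: String,
    /// Opaque query payload (e.g. JSON bytes) so the query language can
    /// evolve without proto changes.
    pub query: Vec<u8>,
}

/// Assumed response shape carrying the standard context back to the caller.
#[derive(Debug, Clone)]
pub struct QueryResponse {
    pub result: Vec<u8>,
    pub correlation_id: String,
}

/// Tenant-scope enforcement: the tenant in the request must match the
/// tenant the Gateway authenticated, and must be non-empty.
pub fn tenant_allowed(req: &QueryRequest, authenticated_tenant: &str) -> bool {
    !req.tenant_id.is_empty() && req.tenant_id == authenticated_tenant
}

fn main() {
    let req = QueryRequest {
        tenant_id: "acme".into(),
        view: "orders".into(),
        query: b"{}".to_vec(),
    };
    println!("allowed: {}", tenant_allowed(&req, "acme"));
}
```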
### Required Tests
- New gRPC QueryService tests (unit + integration)
- Existing query/authz tests in Gateway
- Workspace fmt/clippy/test
## Milestone 2: Internal RPC Standardization (Runner Admin via gRPC)
### Goal
Replace `/admin/runner/*` HTTP proxying with a first-class gRPC admin service for Runner operations.
### Exit Criteria
- Runner exposes a gRPC admin service for the admin surface required by Control/Gateway.
- Gateway uses gRPC to call Runner admin APIs.
- Authentication/authorization remains in Gateway; Runner trusts Gateway boundary.
- Admin operations are idempotent where appropriate and include audit hooks where required.
- All tests pass and include negative/tenant-spoof cases.
### Tasks
- [ ] Define protobuf API: `runner.admin.v1.RunnerAdmin`
- [ ] Drain/resume/status/reload/tenant-scoped controls
- [ ] Standard error mapping
- [ ] Implement Runner gRPC admin server:
- [ ] Tenant gating enforced for tenant-scoped operations
- [ ] Readiness/drain semantics aligned with platform contracts
- [ ] Implement Gateway gRPC client integration:
- [ ] Route to Runner endpoint via routing table
- [ ] Enforce authz rights (e.g. `runner.admin`)
- [ ] Context propagation
- [ ] Keep HTTP `/admin/*` in Runner optional:
- [ ] Either remove Gateway proxy usage or keep for direct debugging behind secure network
- [ ] Tests:
- [ ] Gateway: admin calls rejected without rights
- [ ] Gateway: tenant spoof attempts rejected
- [ ] Runner: idempotency and drain semantics validated
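The "standard error mapping" task above could look like the following: a Runner admin error enum mapped onto gRPC status codes. `AdminError` and its variants are assumptions for illustration; real code would return `tonic::Status` rather than raw code numbers.

```rust
/// Hypothetical Runner admin domain errors.
#[derive(Debug, PartialEq)]
pub enum AdminError {
    TenantDenied,
    AlreadyDraining,
    NotFound,
    Internal(String),
}

/// Map domain errors to gRPC status-code numbers (per the gRPC spec).
/// Callers may choose to treat `AlreadyDraining` as success to keep
/// drain requests idempotent.
pub fn grpc_code(err: &AdminError) -> u32 {
    match err {
        AdminError::TenantDenied => 7,    // PERMISSION_DENIED
        AdminError::AlreadyDraining => 9, // FAILED_PRECONDITION
        AdminError::NotFound => 5,        // NOT_FOUND
        AdminError::Internal(_) => 13,    // INTERNAL
    }
}

fn main() {
    println!("{}", grpc_code(&AdminError::TenantDenied));
}
```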
### Required Tests
- gRPC RunnerAdmin unit/integration tests
- Gateway proxy-to-gRPC tests
- Workspace fmt/clippy/test
## Milestone 3: Connection + Retry Policy Unification (Performance + Frugality)
### Goal
Make upstream connection management and retry behavior consistent and bounded across Gateway and nodes.
### Exit Criteria
- Gateway maintains bounded upstream connection pools for gRPC endpoints.
- All gRPC calls have deadlines; retries are only for idempotent operations.
- All probe/fanout calls are bounded and do not cause thundering herds.
- Load/soak tests show stable behavior under partial failure.
### Tasks
- [ ] Implement a Gateway upstream channel pool:
- [ ] LRU bounded by max endpoints
- [ ] TTL/eviction strategy
- [ ] Fast path reuse under load
- [ ] Standardize retry profiles:
- [ ] Read-only: short retry with jitter
- [ ] Mutations: no automatic retry unless idempotency key present
- [ ] Standardize timeouts:
- [ ] Edge timeout limits
- [ ] Internal per-service deadlines
- [ ] Fanout controls:
- [ ] Concurrency limiters for fleet snapshot/probes
- [ ] Cache results where safe (short TTL)
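The channel-pool task can be sketched with std-only types: a map bounded by `max` endpoints, TTL-based expiry, and least-recently-used eviction when full. `Channel` here is a stand-in for a real gRPC channel (e.g. tonic's); the pool logic is the point, the types are assumptions.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stand-in for a real gRPC channel.
#[derive(Clone, Debug)]
pub struct Channel {
    pub endpoint: String,
}

/// Bounded upstream channel pool with TTL expiry and LRU eviction.
pub struct ChannelPool {
    max: usize,
    ttl: Duration,
    entries: HashMap<String, (Channel, Instant)>, // endpoint -> (channel, last_used)
}

impl ChannelPool {
    pub fn new(max: usize, ttl: Duration) -> Self {
        Self { max, ttl, entries: HashMap::new() }
    }

    /// Fast path: reuse a live channel; otherwise "dial" a new one,
    /// evicting the least-recently-used entry if at capacity.
    pub fn get(&mut self, endpoint: &str) -> Channel {
        let now = Instant::now();
        // Drop entries whose TTL has expired.
        self.entries.retain(|_, (_, used)| now.duration_since(*used) < self.ttl);
        if let Some((ch, used)) = self.entries.get_mut(endpoint) {
            *used = now;
            return ch.clone();
        }
        if self.entries.len() >= self.max {
            if let Some(lru) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, used))| *used)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&lru);
            }
        }
        let ch = Channel { endpoint: endpoint.to_string() };
        self.entries.insert(endpoint.to_string(), (ch.clone(), now));
        ch
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}

fn main() {
    let mut pool = ChannelPool::new(2, Duration::from_secs(60));
    pool.get("aggregate-0:4000");
    pool.get("projection-0:4000");
    pool.get("runner-0:4000"); // at capacity: evicts the LRU entry
    println!("pool size {}", pool.len());
}
```

The unit tests for this milestone would exercise exactly the eviction and TTL branches above, which is why keeping the pool synchronous and self-contained (with the channel type injected) pays off.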
### Required Tests
- Unit tests for pool eviction/TTL
- Gateway integration tests for deadline propagation
- Gated load tests (document env + how to run)
## Milestone 4: Transport Simplification Cleanup (Remove Legacy Paths)
### Goal
Remove or de-prioritize legacy HTTP internal paths so the “happy path” uses: HTTP edge → Gateway → gRPC internal → NATS async.
### Exit Criteria
- Gateway no longer depends on HTTP for Projection queries or Runner admin.
- Legacy endpoints are either removed or explicitly marked “debug-only” and not used by Gateway/Control.
- All operational playbooks rely on standardized endpoints.
### Tasks
- [ ] Remove Gateway's HTTP query proxy usage (or keep only as a compatibility shim).
- [ ] Remove Gateway's runner admin HTTP proxy usage (or keep only as a compatibility shim).
- [ ] Ensure Control UI + Control API use the standardized Gateway surfaces.
- [ ] Harden metrics and health probes to always carry context.
### Required Tests
- End-to-end smoke tests (gated)
- Workspace fmt/clippy/test
## Verification Commands (Required at Each Milestone)
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `npm ci && npm run lint && npm run typecheck && npm run test && npm run build` (in `control/ui`)
## Notes / Constraints
- Do not break wire compatibility for NATS subjects or event payloads; evolve via optional fields and tolerant decoding.
- Keep tenant isolation rules enforced at the Gateway boundary and re-validated at nodes where it is safety-critical.