Development Plan: Control Plane (Admin UI + Observability + Production Ops)
Overview
This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:
- Tasks with clear deliverables
- Test Requirements (unit tests + tautological tests + integration tests where applicable)
- Dependencies on previous milestones
Development Approach:
- Complete one milestone at a time
- Write tests before implementation (TDD where applicable)
- All tests must pass before moving to the next milestone
- Mark tasks complete with `[x]` as you progress
This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: gateway/DEVELOPMENT_PLAN.md, runner/DEVELOPMENT_PLAN.md).
Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)
Goal: Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.
Tasks
- 0.1 Define canonical local commands for the repo
- UI: `npm run lint`, `npm run typecheck`, `npm run test`, `npm run build`
- Control Plane API: `cargo test`, `cargo fmt --check`, `cargo clippy -- -D warnings`, `cargo run -- --help`
- Docker/Swarm:
  - `docker compose config` validation for local stacks (if used)
  - `docker stack deploy ...` smoke validation for Swarm (gated, see Tests)
- 0.2 Add a minimal CI workflow that runs the same commands as 0.1
- 0.3 Define integration-test gating conventions
- Docker/Swarm integration tests:
  - Mark as ignored by default and run only when `CONTROL_TEST_DOCKER=1` is set
  - Example: `CONTROL_TEST_DOCKER=1 cargo test -- --ignored`
- NATS-dependent integration tests:
  - Mark as ignored by default and run only when `CONTROL_TEST_NATS_URL` is set
  - Example: `CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored`
- 0.4 Define baseline operational invariants (checklist for later milestones)
- No privileged action without RBAC + audit event
- No multi-step operation without idempotency key + job record
- Always propagate `tenant_id` (when applicable) end-to-end
- Always propagate request/flow identifiers end-to-end (logs + downstream calls):
  - `x-request-id` (per HTTP request)
  - `x-correlation-id` (per user-visible flow/job; generated by the Gateway when missing)
  - `traceparent` (W3C trace context; started by the Gateway when missing)
- Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
- No tenant-level metrics without bounded cardinality rules
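The gating convention from 0.3 reduces to a small environment predicate that test harnesses in both subprojects can share. A minimal TypeScript sketch (helper names are hypothetical, not part of the plan):

```typescript
// Hypothetical gating helpers mirroring the 0.3 conventions: integration
// tests run only when the corresponding environment variable is set.
export function dockerTestsEnabled(env: Record<string, string | undefined>): boolean {
  // Docker/Swarm tests require an explicit CONTROL_TEST_DOCKER=1 opt-in.
  return env.CONTROL_TEST_DOCKER === "1";
}

export function natsTestUrl(env: Record<string, string | undefined>): string | undefined {
  // Any non-empty CONTROL_TEST_NATS_URL enables NATS tests and doubles as the URL.
  const url = env.CONTROL_TEST_NATS_URL;
  return url !== undefined && url.length > 0 ? url : undefined;
}
```

On the Rust side the same predicate decides whether `#[ignore]`-marked tests are exercised via `cargo test -- --ignored`.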
Tests
- T0.1 Tautological test: test harness runs for both subprojects (UI + API)
- T0.2 Lint + typecheck + unit tests pass
- T0.3 Docker config validation passes (compose/stack linting tests)
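The "secrets never appear in logs" invariant from 0.4 is easiest to enforce with a single redaction choke point that every logger passes headers through. A minimal sketch, assuming a flat header map (the sensitive-key list here is illustrative, not exhaustive):

```typescript
// Keys whose values must never reach log output (case-insensitive match).
const SENSITIVE_KEYS = new Set(["authorization", "cookie", "set-cookie", "x-api-key"]);

// Return a copy of the headers with sensitive values masked.
export function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(headers)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return out;
}
```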
Milestone 1: Admin UI Foundation (UltraBase UX Reuse)
Goal: Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.
Dependencies
- Milestone 0 (repo bootstrap)
Exit Criteria
- Admin UI builds successfully and passes unit/type checks
- UI navigation skeleton matches the PRD information architecture
Tasks
- 1.1 Initialize Admin UI project (Vite + React + TypeScript)
- Choose and wire lint/typecheck/test/build tooling to match the canonical commands in 0.1
- Adopt the baseline dependencies used by UltraBase control-plane admin UI where available
- Establish UI module layout for: components, pages, routes, API client, auth/session utilities
- 1.2 Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
- Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
- 1.3 Implement navigation skeleton and empty pages (route wiring only)
- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings
- 1.3a Add correlation-first investigation affordances in the UI skeleton
- Global search box that accepts `x-request-id`, `x-correlation-id`, or `trace_id`
- “Investigate” links that open Grafana Explore prefilled for:
  - Loki query scoped to `x-correlation-id` (and `x-request-id` when available)
  - Tempo trace view when a `trace_id` is present
- Ensure jobs and audit log rows display and copy the relevant ids
- 1.4 Implement API client stub with consistent error handling and request-id propagation
- Send `x-request-id` on every request (generate one when missing)
- Send `x-correlation-id` when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
- Send `traceparent` when continuing an existing trace; otherwise omit and use the Gateway-started trace
- Echo `x-request-id` and `x-correlation-id` on responses and surface them in error UX
- Persist the most recent ids in the UI so operators can copy/paste them into support tickets
Tests
- T1.1 UI typecheck passes
- T1.2 UI build passes
- T1.3 Routing smoke test: each route renders without runtime errors (headless DOM test)
Milestone 2: Control Plane API Foundation (BFF / Admin API)
Goal: Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.
Dependencies
- Milestone 0 (repo bootstrap)
Exit Criteria
- Control plane API runs as a container and exposes `/health`, `/ready`, `/metrics`
- Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints
Tasks
- 2.1 Initialize Control Plane API service
- Rust (Axum + Tokio + tracing) to align with the platform's existing service ecosystem
- Baseline endpoints: `GET /health`, `GET /ready`, `GET /metrics`
- 2.2 Add request logging and correlation identifiers
- `x-request-id` propagation and structured logs (match Gateway conventions)
- Propagate `x-correlation-id` and `traceparent` on outbound calls
- Log fields: `request_id`, `correlation_id`, `trace_id`, `principal_id`, `tenant_id` (when applicable)
- Never log Authorization headers or tokens
- 2.3 Implement authentication and authorization boundary
- Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
- Extract principal identity from token claims (at minimum: `sub`, `session_id`)
- Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
- Align `x-tenant-id` semantics with Gateway:
  - Tenant-scoped endpoints require `x-tenant-id` and must reject missing/invalid values with 400
  - Platform-scoped endpoints must not depend on `x-tenant-id`
- Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
  - Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
- 2.4 Define “job” model for multi-step operations (API contract)
- `POST /admin/v1/jobs/*` returns `job_id`
- `GET /admin/v1/jobs/{job_id}` returns status + structured steps + errors
- Require an idempotency key for job creation (`Idempotency-Key` header), and make repeated creates safe
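The "repeated creates are safe" requirement in 2.4 amounts to keying job creation on the idempotency key and returning the original `job_id` on replays. A minimal in-memory sketch (a real implementation would persist this mapping; names are illustrative):

```typescript
// In-memory job store demonstrating idempotent creation per 2.4.
export class JobStore {
  private byIdempotencyKey = new Map<string, string>();
  private nextId = 0;

  // Repeated calls with the same Idempotency-Key return the same job_id
  // and do not create a second job.
  createJob(idempotencyKey: string): { jobId: string; created: boolean } {
    const existing = this.byIdempotencyKey.get(idempotencyKey);
    if (existing !== undefined) return { jobId: existing, created: false };
    const jobId = `job-${++this.nextId}`;
    this.byIdempotencyKey.set(idempotencyKey, jobId);
    return { jobId, created: true };
  }
}
```

This is also exactly what T5.1 later asserts: same key, same effects, no duplicates.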
Tests
- T2.1 `GET /health` and `GET /ready` return 200
- T2.2 Unauthorized admin calls return 401/403 consistently
- T2.3 `x-tenant-id` behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
- T2.4 Tautological tests: core state types are Send + Sync
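The `x-tenant-id` rules from 2.3 (and asserted by T2.3) can be sketched as one boundary check. The tenant-id format used here (non-empty, URL-safe) is an assumption, not a Gateway contract:

```typescript
export type TenantCheck =
  | { ok: true; tenantId?: string }
  | { ok: false; status: 400; error: string };

// Tenant-scoped routes reject missing/invalid x-tenant-id with 400;
// platform-scoped routes must not depend on the header at all.
export function checkTenantHeader(tenantScoped: boolean, headerValue: string | undefined): TenantCheck {
  if (!tenantScoped) return { ok: true };
  if (headerValue === undefined || headerValue === "") {
    return { ok: false, status: 400, error: "missing x-tenant-id" };
  }
  if (!/^[A-Za-z0-9_-]+$/.test(headerValue)) {
    return { ok: false, status: 400, error: "invalid x-tenant-id" };
  }
  return { ok: true, tenantId: headerValue };
}
```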
Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)
Goal: Include a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.
Dependencies
- Milestone 0 (repo bootstrap)
Exit Criteria
- Grafana starts with provisioned datasources and dashboards
- vmagent scrapes platform services and VictoriaMetrics can query ingested series
- Loki is available for log queries (when logs are enabled)
Tasks
- 3.1 Add observability deployment assets modeled after UltraBase
- Grafana provisioning for datasources and dashboards
- vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
- Loki configuration (and optional promtail)
- 3.1a Add distributed tracing backend and wiring
- Tempo (or compatible tracing backend) as a Grafana datasource
- OTLP receiver path (collector/agent) so platform services can emit traces
- Grafana Explore is provisioned so operators can jump from logs to traces
- Require the Gateway to accept and propagate `x-correlation-id` and `traceparent` to upstreams, and to include `correlation_id` and `trace_id` in request spans/log fields
- 3.2 Implement the base dashboard set from the PRD
- Operations overview
- HTTP detail (Gateway route-level)
- Logs (Loki)
- Traces (Tempo)
- Event bus / JetStream
- Workers (Runner)
- Storage (libmdbx + node disk)
- Cluster / Orchestrator
- 3.3 Add the chosen production-operability dashboards and document required instrumentation
- Noisy Neighbor & Tenant Health
- API Regression & Deployment
- Storage & Event Bus Bottlenecks
- Infrastructure Exhaustion
- Standardize build/version labeling across services for correlation (`*_build_info{service,version,git_sha}=1`)
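Extracting `trace_id` from a W3C `traceparent` header (3.1a) is what lets services stamp the same id into log fields so operators can pivot from Loki to Tempo. A minimal sketch of the shape check (`version-traceid-parentid-flags`):

```typescript
// Parse a W3C traceparent header and return its trace-id, or undefined when
// the header does not match the spec's version-traceid-parentid-flags shape.
export function traceIdFromTraceparent(traceparent: string): string | undefined {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(traceparent);
  if (!m) return undefined;
  const traceId = m[2];
  // An all-zero trace-id is invalid per the Trace Context spec.
  return traceId === "0".repeat(32) ? undefined : traceId;
}
```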
Tests
- T3.1 Grafana provisioning files are syntactically valid
- T3.2 vmagent config parses and includes all required scrape jobs
- T3.3 Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
- T3.4 Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state
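The `*_build_info` convention from 3.3 is a constant-value gauge whose labels carry the identifying metadata, following the common Prometheus `build_info` pattern. A sketch of the exposition line (label escaping is simplified for illustration):

```typescript
// Render one *_build_info exposition line: a gauge fixed at 1 whose labels
// (service, version, git_sha) enable "what changed when" correlation.
export function buildInfoMetric(service: string, version: string, gitSha: string): string {
  return `${service}_build_info{service="${service}",version="${version}",git_sha="${gitSha}"} 1`;
}
```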
Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)
Goal: Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.
Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
Exit Criteria
- Admin UI can list tenants and show current placement per service kind
- Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback
Tasks
- 4.1 Implement placement read APIs
- Read effective placement from NATS KV (and fallback file for development)
- Match the Gateway routing config model (placement maps + shard directories + revision semantics)
- Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (`aggregate_placement`, `projection_placement`, `runner_placement`)
- 4.2 Implement fleet “health snapshot” APIs
- Query `/health`, `/ready`, `/metrics` from each service endpoint
- Normalize into a stable UI response shape
- 4.3 Implement Admin UI pages:
- Scale & Placement (read-only)
- Tenants (read-only with placement summary)
- Fleet/Topology views (read-only)
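The "normalize into a stable UI response shape" step in 4.2 can be sketched as folding raw per-endpoint probe results into one record per service. Field names here are assumptions about the UI shape, not a fixed contract:

```typescript
// One raw probe against a service endpoint.
export interface ProbeResult { endpoint: "health" | "ready"; status: number }
// The stable per-service shape the UI consumes.
export interface ServiceSnapshot { service: string; healthy: boolean; ready: boolean }

// A service is healthy/ready only when the corresponding probe returned 200.
export function normalizeSnapshot(service: string, probes: ProbeResult[]): ServiceSnapshot {
  const ok = (e: "health" | "ready") => probes.some((p) => p.endpoint === e && p.status === 200);
  return { service, healthy: ok("health"), ready: ok("ready") };
}
```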
Tests
- T4.1 Placement config parsing and snapshot endpoints work
- T4.2 KV watcher hot-reload swaps placement atomically
- T4.3 UI pages render with mocked API responses (component-level tests)
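The atomic-swap behavior asserted by T4.2 amounts to replacing the whole placement map in a single reference assignment and rejecting stale revisions, mirroring the revision semantics of the Gateway routing config model. A minimal sketch (field names are illustrative):

```typescript
// A versioned placement map as delivered by the KV watcher.
export interface PlacementMap { revision: number; tenants: Record<string, string> }

export class PlacementHolder {
  private current: PlacementMap = { revision: 0, tenants: {} };

  get(): PlacementMap { return this.current; }

  // Apply a watched update; stale or duplicate revisions are ignored.
  apply(next: PlacementMap): boolean {
    if (next.revision <= this.current.revision) return false;
    this.current = next; // single assignment: readers see old or new, never a mix
    return true;
  }
}
```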
Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs
Goal: Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.
Dependencies
- Milestone 4 (read-only ops)
Exit Criteria
- All operational mutations are executed as jobs with audit events
- Every mutation supports preflight planning and clear post-conditions
Tasks
- 5.1 Implement job orchestration primitives in the API
- step model, retries, cancellation, timeouts
- per-tenant locking to avoid concurrent conflicting operations
- 5.2 Implement drain workflow (per service kind where supported)
- Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
- Aggregate/projection drain semantics via admin endpoints where available
- Align drain/readiness semantics with the rebalancing contract in external_prd.md
- 5.3 Implement migration workflow
- Plan: drain tenant → update placement → reload routing/config
- Block unsafe migrations (health/lag/inflight thresholds)
- 5.4 Implement UI mutation flows
- modal confirmation + reason required
- job progress view and audit linkage
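The migration plan in 5.3 (drain tenant → update placement → reload) is a fixed step sequence, which is what makes the T5.2 determinism test possible. A sketch of the preflight plan generator (step and field names are illustrative):

```typescript
export interface PlanStep { action: string; detail: string }

// Produce the deterministic drain -> update-placement -> reload plan for a
// tenant migration. Same inputs always yield the same plan (see T5.2).
export function planMigration(tenantId: string, fromNode: string, toNode: string): PlanStep[] {
  return [
    { action: "drain", detail: `drain tenant ${tenantId} on ${fromNode}` },
    { action: "update-placement", detail: `move ${tenantId}: ${fromNode} -> ${toNode}` },
    { action: "reload", detail: "reload routing/config" },
  ];
}
```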
Tests
- T5.1 Job idempotency: repeated calls with same idempotency key do not duplicate effects
- T5.2 Migration plan preflight produces a deterministic action plan
- T5.3 Safety gates prevent drain/migrate when invariants fail
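The per-tenant locking from 5.1 only needs to reject a second conflicting operation on the same tenant while letting other tenants proceed. A minimal single-process sketch (a clustered control plane would need a distributed lock instead):

```typescript
// Per-tenant operation locks: at most one mutating job per tenant at a time.
export class TenantLocks {
  private held = new Set<string>();

  // Returns false when another operation already holds the tenant's lock.
  tryAcquire(tenantId: string): boolean {
    if (this.held.has(tenantId)) return false;
    this.held.add(tenantId);
    return true;
  }

  release(tenantId: string): void {
    this.held.delete(tenantId);
  }
}
```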
Milestone 6: Deployments + Regression Tooling (Swarm-Aware)
Goal: Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.
Dependencies
- Milestone 3 (observability baseline)
- Milestone 5 (job orchestration)
Exit Criteria
- Deployments can be initiated (or at least observed) via the control plane
- Grafana shows deploy markers; dashboards can compare old vs new versions
Tasks
- 6.1 Implement Swarm integration (read-only first, then mutations)
- list services, tasks, images, versions
- watch update events (start/finish/fail)
- 6.2 Implement deployment annotations/events
- write Grafana annotations (or emit a deploy event metric) for vertical markers
- 6.3 Implement “API Regression & Deployment” dashboard wiring prerequisites
- enforce build/version labeling (`*_build_info{service,version,git_sha}=1` pattern)
- ensure scrape relabeling includes `image_tag` where possible
- 6.4 UI pages
- Deployments list + detail
- Per-service “what changed” and “rollback” actions (guarded)
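The deploy markers in 6.2 map onto Grafana's annotations API, which takes an epoch-milliseconds `time`, `tags`, and `text`. A sketch of the payload builder asserted by T6.2 (dashboard/panel scoping fields are omitted for brevity):

```typescript
// Build a Grafana annotation payload for a deploy marker: time in epoch ms,
// tags for filtering on dashboards, and a human-readable text.
export function deployAnnotation(service: string, version: string, timeMs: number) {
  return {
    time: timeMs,
    tags: ["deploy", service],
    text: `Deployed ${service} ${version}`,
  };
}
```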
Tests
- T6.1 Swarm client abstraction can be mocked and produces deterministic results
- T6.2 Annotation writer produces expected Grafana payloads
- T6.3 Version labels are present on all services in a metrics snapshot test
Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)
Goal: Provide a complete Swarm deployment definition for the platform: services in ../ plus the control plane components and the observability stack.
Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
- Milestone 3 (Observability baseline)
- Milestone 5 (safe mutations baseline)
Exit Criteria
- `docker stack deploy` brings up:
  - Gateway + Aggregate + Projection + Runner (from `../`)
  - Control Plane API + Admin UI
  - VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
- All services are reachable via overlay networks and pass health checks
- Smoke and integration tests pass end-to-end (gated, but required before milestone completion)
Tasks
- 7.1 Define Swarm networks, secrets, and configs
- overlay network segmentation (public vs internal)
- secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
- 7.2 Define Swarm stack files
- base platform stack (gateway/aggregate/projection/runner)
- control plane stack (api + ui)
- observability stack (vm/vmagent/grafana/loki/promtail)
- 7.3 Define placement constraints and scaling defaults
- node labels for tenant ranges and infrastructure roles
- replica defaults and update policies
- 7.4 Define deployment verification and rollback playbooks (as executable checks)
- post-deploy checks: `/health`, `/ready`, `/metrics`, dashboard provisioning
- rollbacks: service update rollback hooks and job safety checks
Tests
- T7.1 Stack YAML parses and validates (unit test)
- T7.2 Swarm smoke test (requires `CONTROL_TEST_DOCKER=1`)
  - deploy stacks
- wait for healthy state
- verify Grafana dashboards provisioned and VictoriaMetrics receives samples
- T7.3 End-to-end “control plane can see the fleet” test (requires docker)
- UI/API can query placement + health snapshots for all services