# Development Plan: Control Plane (Admin UI + Observability + Production Ops)

## Overview

This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:

- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones

**Development Approach:**

1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. All tests must pass before moving to the next milestone
4. Mark tasks complete with `[x]` as you progress

This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: [gateway/DEVELOPMENT_PLAN.md](file:///Users/vlad/Developer/cloudlysis/gateway/DEVELOPMENT_PLAN.md), [runner/DEVELOPMENT_PLAN.md](file:///Users/vlad/Developer/cloudlysis/runner/DEVELOPMENT_PLAN.md)).

---

## Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)

**Goal:** Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.

### Tasks

- [x] **0.1** Define canonical local commands for the repo
  - UI:
    - `npm run lint`
    - `npm run typecheck`
    - `npm run test`
    - `npm run build`
  - Control Plane API:
    - `cargo test`
    - `cargo fmt --check`
    - `cargo clippy -- -D warnings`
    - `cargo run -- --help`
  - Docker/Swarm:
    - `docker compose config` validation for local stacks (if used)
    - `docker stack deploy ...` smoke validation for Swarm (gated, see Tests)
- [x] **0.2** Add a minimal CI workflow that runs the same commands as **0.1**
- [x] **0.3** Define integration-test gating conventions
  - Docker/Swarm integration tests:
    - Mark as ignored by default and run only when `CONTROL_TEST_DOCKER=1` is set
    - Example: `CONTROL_TEST_DOCKER=1 cargo test -- --ignored`
  - NATS-dependent integration tests:
    - Mark as ignored by default and run only when `CONTROL_TEST_NATS_URL` is set
    - Example: `CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored`
- [x] **0.4** Define baseline operational invariants (checklist for later milestones)
  - No privileged action without RBAC + audit event
  - No multi-step operation without idempotency key + job record
  - Always propagate `tenant_id` (when applicable) end-to-end
  - Always propagate request/flow identifiers end-to-end (logs + downstream calls):
    - `x-request-id` (per HTTP request)
    - `x-correlation-id` (per user-visible flow/job; generated by the Gateway when missing)
    - `traceparent` (W3C trace context; started by the Gateway when missing)
  - Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
  - No tenant-level metrics without bounded cardinality rules

### Tests

- [x] **T0.1** Tautological test: test harness runs for both subprojects (UI + API)
- [x] **T0.2** Lint + typecheck + unit tests pass
- [x] **T0.3** Docker config validation passes (compose/stack linting tests)

---

## Milestone 1: Admin UI Foundation (UltraBase UX Reuse)

**Goal:** Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.

### Dependencies

- Milestone 0 (repo bootstrap)

### Exit Criteria

- Admin UI builds successfully and passes unit/type checks
- UI navigation skeleton matches the PRD information architecture

### Tasks

- [x] **1.1** Initialize Admin UI project (Vite + React + TypeScript)
  - Choose and wire lint/typecheck/test/build tooling to match the canonical commands in **0.1**
  - Adopt the baseline dependencies used by UltraBase control-plane admin UI where available
  - Establish UI module layout for: components, pages, routes, API client, auth/session utilities
- [x] **1.2** Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
  - Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
- [x] **1.3** Implement navigation skeleton and empty pages (route wiring only)
  - Overview
  - Tenants
  - Users
  - Sessions
  - Roles & Permissions
  - Config
  - Definitions
  - Scale & Placement
  - Deployments
  - Observability
  - Audit Log
  - Settings
- [x] **1.3a** Add correlation-first investigation affordances in the UI skeleton
  - Global search box that accepts `x-request-id`, `x-correlation-id`, or `trace_id`
  - “Investigate” links that open Grafana Explore prefilled for:
    - Loki query scoped to `x-correlation-id` (and `x-request-id` when available)
    - Tempo trace view when a `trace_id` is present
  - Ensure jobs and audit log rows display and copy the relevant ids
- [x] **1.4** Implement API client stub with consistent error handling and request-id propagation
  - Send `x-request-id` on every request (generate one when missing)
  - Send `x-correlation-id` when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
  - Send `traceparent` when continuing an existing trace; otherwise omit and use the Gateway-started trace
  - Echo `x-request-id` and `x-correlation-id` on responses and surface them in error UX
  - Persist the most recent ids in the UI so operators can copy/paste them into support tickets

### Tests

- [x] **T1.1** UI typecheck passes
- [x] **T1.2** UI build passes
- [x] **T1.3** Routing smoke test: each route renders without runtime errors (headless DOM test)

---

## Milestone 2: Control Plane API Foundation (BFF / Admin API)

**Goal:** Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.

### Dependencies

- Milestone 0 (repo bootstrap)

### Exit Criteria

- Control plane API runs as a container and exposes `/health`, `/ready`, `/metrics`
- Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints

### Tasks

- [x] **2.1** Initialize Control Plane API service
  - Rust (Axum + Tokio + tracing) to align with node ecosystem
  - Baseline endpoints: `GET /health`, `GET /ready`, `GET /metrics`
- [x] **2.2** Add request logging and correlation identifiers
  - `x-request-id` propagation and structured logs (match Gateway conventions)
  - Propagate `x-correlation-id` and `traceparent` on outbound calls
  - Log fields: `request_id`, `correlation_id`, `trace_id`, `principal_id`, `tenant_id` (when applicable)
  - Never log Authorization headers or tokens
- [x] **2.3** Implement authentication and authorization boundary
  - Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
  - Extract principal identity from token claims (at minimum: `sub`, `session_id`)
  - Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
  - Align `x-tenant-id` semantics with Gateway:
    - Tenant-scoped endpoints require `x-tenant-id` and must reject missing/invalid values with 400
    - Platform-scoped endpoints must not depend on `x-tenant-id`
  - Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
    - Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
- [x] **2.4** Define “job” model for multi-step operations (API contract)
  - `POST /admin/v1/jobs/*` returns `job_id`
  - `GET /admin/v1/jobs/{job_id}` returns status + structured steps + errors
  - Require an idempotency key for job creation (`Idempotency-Key` header), and make repeated creates safe

### Tests

- [x] **T2.1** `GET /health` and `GET /ready` return 200
- [x] **T2.2** Unauthorized admin calls return 401/403 consistently
- [x] **T2.3** `x-tenant-id` behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
- [x] **T2.4** Tautological tests: core state types are Send + Sync

---

## Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)

**Goal:** Include a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.

### Dependencies

- Milestone 0 (repo bootstrap)

### Exit Criteria

- Grafana starts with provisioned datasources and dashboards
- vmagent scrapes platform services and VictoriaMetrics can query ingested series
- Loki is available for log queries (when logs are enabled)

### Tasks

- [x] **3.1** Add observability deployment assets modeled after UltraBase
  - Grafana provisioning for datasources and dashboards
  - vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
  - Loki configuration (and optional promtail)
- [x] **3.1a** Add distributed tracing backend and wiring
  - Tempo (or compatible tracing backend) as a Grafana datasource
  - OTLP receiver path (collector/agent) so platform services can emit traces
  - Grafana Explore is provisioned so operators can jump from logs to traces
  - Require the Gateway to accept and propagate `x-correlation-id` and `traceparent` to upstreams, and to include `correlation_id` and `trace_id` in request spans/log fields
- [x] **3.2** Implement the base dashboard set from the PRD
  - Operations overview
  - HTTP detail (Gateway route-level)
  - Logs (Loki)
  - Traces (Tempo)
  - Event bus / JetStream
  - Workers (Runner)
  - Storage (libmdbx + node disk)
  - Cluster / Orchestrator
- [x] **3.3** Add the chosen production-operability dashboards and document required instrumentation
  - Noisy Neighbor & Tenant Health
  - API Regression & Deployment
  - Storage & Event Bus Bottlenecks
  - Infrastructure Exhaustion
  - Standardize build/version labeling across services for correlation (`*_build_info{service,version,git_sha}=1`)

### Tests

- [x] **T3.1** Grafana provisioning files are syntactically valid
- [x] **T3.2** vmagent config parses and includes all required scrape jobs
- [x] **T3.3** Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
- [x] **T3.4** Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state

---

## Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)

**Goal:** Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.

### Dependencies

- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)

### Exit Criteria

- Admin UI can list tenants and show current placement per service kind
- Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback

### Tasks

- [x] **4.1** Implement placement read APIs
  - Read effective placement from NATS KV (and fallback file for development)
  - Match the Gateway routing config model (placement maps + shard directories + revision semantics)
  - Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (`aggregate_placement`, `projection_placement`, `runner_placement`)
- [x] **4.2** Implement fleet “health snapshot” APIs
  - Query `/health`, `/ready`, `/metrics` from each service endpoint
  - Normalize into a stable UI response shape
- [x] **4.3** Implement Admin UI pages:
  - Scale & Placement (read-only)
  - Tenants (read-only with placement summary)
  - Fleet/Topology views (read-only)

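The normalization step in task **4.2** can be illustrated with a small sketch. It assumes the probes have already been performed and only their HTTP status codes (or a failure) are known. The type and function names (`ProbeResult`, `ServiceStatus`, `ServiceSnapshot`, `normalize`) are hypothetical and are not taken from the Control Plane API, and the status rules are one possible reading of "stable UI response shape".

```rust
/// Raw outcome of probing one service (hypothetical shape).
/// `None` means the request itself failed (timeout, connection refused);
/// `Some(code)` is the HTTP status the endpoint returned.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ProbeResult {
    pub health: Option<u16>,
    pub ready: Option<u16>,
    pub metrics: Option<u16>,
}

/// Coarse status the UI can render without knowing probe details.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceStatus {
    Healthy,
    NotReady,
    Unhealthy,
    Unreachable,
}

/// Stable response shape for one service (assumed, not the real contract).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ServiceSnapshot {
    pub service: String,
    pub status: ServiceStatus,
    pub metrics_available: bool,
}

/// True only for 2xx responses.
fn is_ok(code: Option<u16>) -> bool {
    matches!(code, Some(200..=299))
}

/// Collapse the three probe outcomes into a single snapshot.
pub fn normalize(service: &str, probe: ProbeResult) -> ServiceSnapshot {
    let nothing_answered =
        probe.health.is_none() && probe.ready.is_none() && probe.metrics.is_none();
    let status = if nothing_answered {
        ServiceStatus::Unreachable
    } else if !is_ok(probe.health) {
        ServiceStatus::Unhealthy
    } else if !is_ok(probe.ready) {
        ServiceStatus::NotReady
    } else {
        ServiceStatus::Healthy
    };
    ServiceSnapshot {
        service: service.to_string(),
        status,
        metrics_available: is_ok(probe.metrics),
    }
}
```

Keeping `normalize` a pure function means the mapping can be unit tested with synthetic probe results, without any running services.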
### Tests

- [x] **T4.1** Placement config parsing and snapshot endpoints work
- [x] **T4.2** KV watcher hot-reload swaps placement atomically
- [x] **T4.3** UI pages render with mocked API responses (component-level tests)

---

## Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs

**Goal:** Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.

### Dependencies

- Milestone 4 (read-only ops)

### Exit Criteria

- All operational mutations are executed as jobs with audit events
- Every mutation supports preflight planning and clear post-conditions

### Tasks

- [x] **5.1** Implement job orchestration primitives in the API
  - step model, retries, cancellation, timeouts
  - per-tenant locking to avoid concurrent conflicting operations
- [x] **5.2** Implement drain workflow (per service kind where supported)
  - Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
  - Aggregate/projection drain semantics via admin endpoints where available
  - Align drain/readiness semantics with the rebalancing contract in [external_prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/external_prd.md)
- [x] **5.3** Implement migration workflow
  - Plan: drain tenant → update placement → reload routing/config
  - Block unsafe migrations (health/lag/inflight thresholds)
- [x] **5.4** Implement UI mutation flows
  - modal confirmation + reason required
  - job progress view and audit linkage

### Tests

- [x] **T5.1** Job idempotency: repeated calls with same idempotency key do not duplicate effects
- [x] **T5.2** Migration plan preflight produces a deterministic action plan
- [x] **T5.3** Safety gates prevent drain/migrate when invariants fail

---

## Milestone 6: Deployments + Regression Tooling (Swarm-Aware)

**Goal:** Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.

### Dependencies

- Milestone 3 (observability baseline)
- Milestone 5 (job orchestration)

### Exit Criteria

- Deployments can be initiated (or at least observed) via the control plane
- Grafana shows deploy markers; dashboards can compare old vs new versions

### Tasks

- [x] **6.1** Implement Swarm integration (read-only first, then mutations)
  - list services, tasks, images, versions
  - watch update events (start/finish/fail)
- [x] **6.2** Implement deployment annotations/events
  - write Grafana annotations (or emit a deploy event metric) for vertical markers
- [x] **6.3** Implement “API Regression & Deployment” dashboard wiring prerequisites
  - enforce build/version labeling (`*_build_info{service,version,git_sha}=1` pattern)
  - ensure scrape relabeling includes `image_tag` where possible
- [x] **6.4** UI pages
  - Deployments list + detail
  - Per-service “what changed” and “rollback” actions (guarded)

### Tests

- [x] **T6.1** Swarm client abstraction can be mocked and produces deterministic results
- [x] **T6.2** Annotation writer produces expected Grafana payloads
- [x] **T6.3** Version labels are present on all services in a metrics snapshot test

---

## Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)

**Goal:** Provide a complete Swarm deployment definition for the platform: services in `../` plus the control plane components and the observability stack.

### Dependencies

- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
- Milestone 3 (Observability baseline)
- Milestone 5 (safe mutations baseline)

### Exit Criteria

- `docker stack deploy` brings up:
  - Gateway + Aggregate + Projection + Runner (from `../`)
  - Control Plane API + Admin UI
  - VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
- All services are reachable via overlay networks and pass health checks
- Smoke and integration tests pass end-to-end (gated, but required before milestone completion)

### Tasks

- [x] **7.1** Define Swarm networks, secrets, and configs
  - overlay network segmentation (public vs internal)
  - secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
- [x] **7.2** Define Swarm stack files
  - base platform stack (gateway/aggregate/projection/runner)
  - control plane stack (api + ui)
  - observability stack (vm/vmagent/grafana/loki/promtail)
- [x] **7.3** Define placement constraints and scaling defaults
  - node labels for tenant ranges and infrastructure roles
  - replica defaults and update policies
- [x] **7.4** Define deployment verification and rollback playbooks (as executable checks)
  - post-deploy checks: `/health`, `/ready`, `/metrics`, dashboard provisioning
  - rollbacks: service update rollback hooks and job safety checks

### Tests

- [x] **T7.1** Stack YAML parses and validates (unit test)
- [x] **T7.2** Swarm smoke test (requires `CONTROL_TEST_DOCKER=1`)
  - deploy stacks
  - wait for healthy state
  - verify Grafana dashboards provisioned and VictoriaMetrics receives samples
- [x] **T7.3** End-to-end “control plane can see the fleet” test (requires docker)
  - UI/API can query placement + health snapshots for all services

---

## Milestone 8: Config Registry + Safe Change Management (Plan/Apply/Rollback)

**Goal:** Make configuration first-class, versioned, validated, and safely mutable from the control plane, while keeping production and development sources consistent.

### Dependencies

- Milestone 2 (Control Plane API foundation)
- Milestone 5 (safe mutations baseline)
- Milestone 7 (Swarm deployment baseline)

### Exit Criteria

- Operators can list, view, validate, and safely apply config changes with audit + idempotent jobs
- Config changes have revision semantics and are roll-backable
- Gatekeeper safety checks prevent applying invalid or unsafe configs

### Tasks

- [x] **8.1** Inventory and classify configuration surfaces (platform-wide)
  - classify as: static boot config (env/secrets), dynamic runtime config (KV), large immutable artifacts (S3/docs)
  - map current sources per domain:
    - Gateway routing config (`config/routing/dev.json` / production KV)
    - Placement config (`config/placement/dev.json` / production KV)
    - Runner definitions (effects/sagas) (documents/S3) and activation config (KV)
    - Observability provisioning (Swarm configs + repo-managed assets)
    - Control plane feature flags (KV)
- [~] **8.2** Define a Config Registry contract in the Control API
  - **Implemented (initial)**:
    - config identity: `{domain}` (routing|placement)
    - metadata: `revision` (KV revision when using NATS), and `source` info (file vs nats)
    - storage policy per config: `source=dev_file | nats_kv`
  - **Still needed**:
    - `{domain, name, scope}` and richer metadata (`updated_at`, `updated_by`, `sha256`)
    - history API for KV-backed configs
- [x] **8.3** Implement config storage abstraction (dev + prod)
  - dev: file-backed, atomic write (tmp + rename), hot-reload where applicable
  - prod: NATS KV for dynamic configs (revisioned values + watch streams)
  - consistent error model: decode/validate/source errors are distinguishable and safe
- [x] **8.4** Add read-only config APIs
  - `GET /admin/v1/config` list domains
  - `GET /admin/v1/config/{domain}` fetch current value + revision + source
  - (history not implemented yet)
- [~] **8.5** Add validate/plan/apply/rollback mutation workflows as jobs
  - **Implemented**:
    - `POST /admin/v1/jobs/config/validate` (job, idempotency key required)
    - `POST /admin/v1/jobs/config/apply` (job, idempotency key required, backup + apply)
    - `POST /admin/v1/jobs/config/rollback` (job, idempotency key required, restore last backup)
    - per-domain locking to avoid concurrent config mutations
  - **Still needed**:
    - `POST /admin/v1/plan/config/apply` deterministic plan (diff + impacted services)
    - richer post-conditions (routing resolution sampling, fleet consistency checks, etc.)
- [~] **8.6** Implement initial config domains end-to-end
  - **Gateway routing config**:
    - implemented: schema validation via JSON decode
    - still needed: semantic validation (tenant entries/shard directories/endpoints URL parsing) + sampled routing verification
  - **Placement config**:
    - implemented: schema validation via JSON decode
    - still needed: semantic validation (targets non-empty, etc.) + fleet snapshot consistency checks
- [x] **8.7** Implement Admin UI “Config” page for safe operations
  - list + view configs with revision/sha/audit linkage
  - editor for JSON (and YAML when supported by the domain)
  - validate button (server-side) and apply/rollback flows as jobs with reason required

### Tests

- [x] **T8.1** Unit tests: config decode/encode stability for each config domain
  - routing/placement decode is enforced by server-side validate job (schema-level)
- [ ] **T8.2** Unit tests: validation rejects unsafe configs with stable error codes/messages
- [ ] **T8.3** Unit tests: plan generation is deterministic for same inputs
- [x] **T8.4** Integration tests (env-gated):
  - NATS KV config apply + rollback via Control API (requires `CONTROL_TEST_NATS=1` + `CONTROL_TEST_NATS_URL`)
  - (Gateway route-resolution E2E verification still pending)
- [x] **T8.5** UI tests: config page renders, validate/apply/rollback flows navigate to job progress

---

## Milestone 9: Control Node Management (Inventory, Drift, and Safer Ops)

**Goal:** Improve how the control plane understands and manages the live control node and platform state: node inventory, config drift detection, and safer operational guardrails.

### Dependencies

- Milestone 7 (Swarm deployment baseline)
- Milestone 8 (config registry + safe change management)

### Exit Criteria

- Control plane provides a reliable “what is running vs what should be running” view
- Config drift is detectable and actionable
- Core operational actions are guarded by preflight checks and produce audit trails

### Tasks

- [x] **9.1** Define a “desired vs observed” model for platform state
  - desired: Swarm stacks + config registry revisions
  - observed: live service/task state + effective runtime configs
  - drift categories: missing, extra, version mismatch, config mismatch, unhealthy
- [~] **9.2** Improve Swarm observation fidelity
  - implemented (initial): docker-cli-backed Swarm observation (`CONTROL_SWARM_MODE=docker`)
  - still needed: direct Docker API client (avoid shelling out), richer normalization, and wiring into production stacks
  - keep file source as a dev fallback for deterministic tests
  - normalize service identity: `{service, image_tag, git_sha, updated_at}`
- [x] **9.3** Add drift APIs and UI views
  - `GET /admin/v1/platform/drift` returns drift summary + actionable items
  - UI: “Platform Drift” page with filters and links to remediate jobs
- [ ] **9.4** Add safer operational guardrails as reusable checks
  - preflight checks for:
    - service unhealthy / crashloop
    - tenant migration safety thresholds (lag/inflight)
    - config apply safety (impact radius, sampled verify)
  - consistent failure modes: clear reason + audit entry, no partial side effects
- [ ] **9.5** Add operational playbooks as executable checks
  - post-deploy verification suite callable as an idempotent job
  - rollback verification suite callable as an idempotent job

### Tests

- [x] **T9.1** Unit tests: drift classification for synthetic desired/observed fixtures
- [x] **T9.2** Integration tests (docker-gated): drift view detects intentional mismatches in a local Swarm
  - requires `CONTROL_TEST_DOCKER=1` and an active local Swarm node
- [x] **T9.3** UI tests: drift page renders in route smoke test
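
The “desired vs observed” model from task **9.1** and the synthetic-fixture tests in **T9.1** can be sketched as a pure classification function. This is a minimal illustration under assumptions: the `Desired`, `Observed`, and `Drift` types and the `classify` function are hypothetical names, and reducing a service to an image tag, a config revision, and a health flag is a simplification of whatever the real model tracks.

```rust
use std::collections::BTreeMap;

/// What should be running for one service (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Desired {
    pub image_tag: String,
    pub config_revision: u64,
}

/// What is actually running for one service (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Observed {
    pub image_tag: String,
    pub config_revision: u64,
    pub healthy: bool,
}

/// The drift categories named in task 9.1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Drift {
    Missing,
    Extra,
    VersionMismatch,
    ConfigMismatch,
    Unhealthy,
}

/// Compare desired and observed state keyed by service name.
/// BTreeMap iteration is sorted by key, so the output order is deterministic:
/// findings for desired services first, then extra services.
pub fn classify(
    desired: &BTreeMap<String, Desired>,
    observed: &BTreeMap<String, Observed>,
) -> Vec<(String, Drift)> {
    let mut findings = Vec::new();
    for (name, want) in desired {
        match observed.get(name) {
            None => findings.push((name.clone(), Drift::Missing)),
            Some(got) => {
                // A single service may drift in more than one way.
                if got.image_tag != want.image_tag {
                    findings.push((name.clone(), Drift::VersionMismatch));
                }
                if got.config_revision != want.config_revision {
                    findings.push((name.clone(), Drift::ConfigMismatch));
                }
                if !got.healthy {
                    findings.push((name.clone(), Drift::Unhealthy));
                }
            }
        }
    }
    for name in observed.keys() {
        if !desired.contains_key(name) {
            findings.push((name.clone(), Drift::Extra));
        }
    }
    findings
}
```

Because the function takes plain maps and returns a sorted, deterministic list, it can be exercised with file-based fixtures as well as with live Swarm observations.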