cloudlysis/control/DEVELOPMENT_PLAN.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swarm assets
2026-03-30 11:40:42 +03:00


Development Plan: Control Plane (Admin UI + Observability + Production Ops)

Overview

This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:

  • Tasks with clear deliverables
  • Test Requirements (unit tests + tautological tests + integration tests where applicable)
  • Dependencies on previous milestones

Development Approach:

  1. Complete one milestone at a time
  2. Write tests before implementation (TDD where applicable)
  3. All tests must pass before moving to the next milestone
  4. Mark tasks complete with [x] as you progress

This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: gateway/DEVELOPMENT_PLAN.md, runner/DEVELOPMENT_PLAN.md).


Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)

Goal: Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.

Tasks

  • 0.1 Define canonical local commands for the repo
    • UI:
      • npm run lint
      • npm run typecheck
      • npm run test
      • npm run build
    • Control Plane API:
      • cargo test
      • cargo fmt --check
      • cargo clippy -- -D warnings
      • cargo run -- --help
    • Docker/Swarm:
      • docker compose config validation for local stacks (if used)
      • docker stack deploy ... smoke validation for Swarm (gated, see Tests)
  • 0.2 Add a minimal CI workflow that runs the same commands as 0.1
  • 0.3 Define integration-test gating conventions
    • Docker/Swarm integration tests:
      • Mark as ignored by default and run only when CONTROL_TEST_DOCKER=1 is set
      • Example: CONTROL_TEST_DOCKER=1 cargo test -- --ignored
    • NATS-dependent integration tests:
      • Mark as ignored by default and run only when CONTROL_TEST_NATS_URL is set
      • Example: CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored
  • 0.4 Define baseline operational invariants (checklist for later milestones)
    • No privileged action without RBAC + audit event
    • No multi-step operation without idempotency key + job record
    • Always propagate tenant_id (when applicable) end-to-end
    • Always propagate request/flow identifiers end-to-end (logs + downstream calls):
      • x-request-id (per HTTP request)
      • x-correlation-id (per user-visible flow/job; generated by the Gateway when missing)
      • traceparent (W3C trace context; started by the Gateway when missing)
    • Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
    • No tenant-level metrics without bounded cardinality rules
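
The gating convention in 0.3 can be sketched as a pair of tiny helpers; the function names here are assumptions, but the environment variables are the ones defined above:

```rust
use std::env;

/// True when the gate variable is set to exactly "1".
fn gate_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// True when Docker/Swarm integration tests may run.
fn docker_tests_enabled() -> bool {
    gate_enabled(env::var("CONTROL_TEST_DOCKER").ok().as_deref())
}

/// The NATS URL for NATS-gated tests, if configured.
fn nats_test_url() -> Option<String> {
    env::var("CONTROL_TEST_NATS_URL").ok()
}

fn main() {
    if !docker_tests_enabled() {
        eprintln!("skipping Docker tests: set CONTROL_TEST_DOCKER=1 and run `cargo test -- --ignored`");
    }
    if nats_test_url().is_none() {
        eprintln!("skipping NATS tests: set CONTROL_TEST_NATS_URL");
    }
}
```

Tests marked #[ignore] stay invisible to a plain cargo test, so CI remains green without Docker or NATS; the explicit --ignored run is the opt-in.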

Tests

  • T0.1 Tautological test: test harness runs for both subprojects (UI + API)
  • T0.2 Lint + typecheck + unit tests pass
  • T0.3 Docker config validation passes (compose/stack linting tests)

Milestone 1: Admin UI Foundation (UltraBase UX Reuse)

Goal: Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.

Dependencies

  • Milestone 0 (repo bootstrap)

Exit Criteria

  • Admin UI builds successfully and passes unit/type checks
  • UI navigation skeleton matches the PRD information architecture

Tasks

  • 1.1 Initialize Admin UI project (Vite + React + TypeScript)
    • Choose and wire lint/typecheck/test/build tooling to match the canonical commands in 0.1
    • Adopt the baseline dependencies used by UltraBase control-plane admin UI where available
    • Establish UI module layout for: components, pages, routes, API client, auth/session utilities
  • 1.2 Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
    • Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
  • 1.3 Implement navigation skeleton and empty pages (route wiring only)
    • Overview
    • Tenants
    • Users
    • Sessions
    • Roles & Permissions
    • Config
    • Definitions
    • Scale & Placement
    • Deployments
    • Observability
    • Audit Log
    • Settings
  • 1.3a Add correlation-first investigation affordances in the UI skeleton
    • Global search box that accepts x-request-id, x-correlation-id, or trace_id
    • “Investigate” links that open Grafana Explore prefilled for:
      • Loki query scoped to x-correlation-id (and x-request-id when available)
      • Tempo trace view when a trace_id is present
    • Ensure jobs and audit log rows display and copy the relevant ids
  • 1.4 Implement API client stub with consistent error handling and request-id propagation
    • Send x-request-id on every request (generate one when missing)
    • Send x-correlation-id when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
    • Send traceparent when continuing an existing trace; otherwise omit and use the Gateway-started trace
    • Echo x-request-id and x-correlation-id on responses and surface them in error UX
    • Persist the most recent ids in the UI so operators can copy/paste them into support tickets
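
The id rules in 1.4 reduce to a pure function over header maps. A minimal sketch (the function name and map representation are assumptions, not the real client):

```rust
use std::collections::HashMap;

/// Sketch of the 1.4 header contract: always send an x-request-id
/// (generating one when missing), and pass x-correlation-id / traceparent
/// through only when continuing an existing flow or trace.
fn outbound_headers(
    current: &HashMap<String, String>,
    fresh_request_id: &str,
) -> HashMap<String, String> {
    let mut out = HashMap::new();
    let request_id = current
        .get("x-request-id")
        .cloned()
        .unwrap_or_else(|| fresh_request_id.to_string());
    out.insert("x-request-id".to_string(), request_id);
    // Omit these when absent: the Gateway generates/starts them, and the
    // client picks the values up from the echoed response headers.
    for key in ["x-correlation-id", "traceparent"] {
        if let Some(value) = current.get(key) {
            out.insert(key.to_string(), value.clone());
        }
    }
    out
}

fn main() {
    let h = outbound_headers(&HashMap::new(), "req-123");
    println!("{:?}", h.get("x-request-id"));
}
```

Omitting rather than fabricating correlation/trace ids keeps the Gateway the single origin of those identifiers.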

Tests

  • T1.1 UI typecheck passes
  • T1.2 UI build passes
  • T1.3 Routing smoke test: each route renders without runtime errors (headless DOM test)

Milestone 2: Control Plane API Foundation (BFF / Admin API)

Goal: Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.

Dependencies

  • Milestone 0 (repo bootstrap)

Exit Criteria

  • Control plane API runs as a container and exposes /health, /ready, /metrics
  • Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints

Tasks

  • 2.1 Initialize Control Plane API service
    • Rust (Axum + Tokio + tracing) to align with the sibling platform services (Gateway, Runner)
    • Baseline endpoints: GET /health, GET /ready, GET /metrics
  • 2.2 Add request logging and correlation identifiers
    • x-request-id propagation and structured logs (match Gateway conventions)
    • Propagate x-correlation-id and traceparent on outbound calls
    • Log fields: request_id, correlation_id, trace_id, principal_id, tenant_id (when applicable)
    • Never log Authorization headers or tokens
  • 2.3 Implement authentication and authorization boundary
    • Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
    • Extract principal identity from token claims (at minimum: sub, session_id)
    • Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
    • Align x-tenant-id semantics with Gateway:
      • Tenant-scoped endpoints require x-tenant-id and must reject missing/invalid values with 400
      • Platform-scoped endpoints must not depend on x-tenant-id
    • Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
      • Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
  • 2.4 Define “job” model for multi-step operations (API contract)
    • POST /admin/v1/jobs/* returns job_id
    • GET /admin/v1/jobs/{job_id} returns status + structured steps + errors
    • Require an idempotency key for job creation (Idempotency-Key header), and make repeated creates safe
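
The 2.4 idempotency contract can be sketched with an in-memory store (types and names are assumptions; a real implementation would persist the key mapping):

```rust
use std::collections::HashMap;

/// Repeated job creation with the same Idempotency-Key must return the
/// same job_id instead of creating a second job.
#[derive(Default)]
struct JobStore {
    by_key: HashMap<String, u64>,
    next_id: u64,
}

impl JobStore {
    /// Returns (job_id, created); `created` is false on an idempotent replay.
    fn create_job(&mut self, idempotency_key: &str) -> (u64, bool) {
        if let Some(&existing) = self.by_key.get(idempotency_key) {
            return (existing, false);
        }
        self.next_id += 1;
        self.by_key.insert(idempotency_key.to_string(), self.next_id);
        (self.next_id, true)
    }
}

fn main() {
    let mut store = JobStore::default();
    let (id, created) = store.create_job("drain-tenant-42");
    println!("job {id} created={created}");
}
```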

Tests

  • T2.1 GET /health and GET /ready return 200
  • T2.2 Unauthorized admin calls return 401/403 consistently
  • T2.3 x-tenant-id behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
  • T2.4 Tautological tests: core state types are Send + Sync
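
The x-tenant-id boundary exercised by T2.3 can be sketched as one decision function (the function name is an assumption):

```rust
/// Tenant-scoped routes reject missing or blank ids with HTTP 400, while
/// platform-scoped routes must not depend on the header at all.
fn resolve_tenant(tenant_scoped: bool, x_tenant_id: Option<&str>) -> Result<Option<String>, u16> {
    if !tenant_scoped {
        return Ok(None); // platform-scoped: ignore x-tenant-id entirely
    }
    match x_tenant_id {
        Some(id) if !id.trim().is_empty() => Ok(Some(id.to_string())),
        _ => Err(400), // missing/invalid -> 400, matching the Gateway rules
    }
}

fn main() {
    println!("{:?}", resolve_tenant(true, None));
}
```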

Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)

Goal: Include a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.

Dependencies

  • Milestone 0 (repo bootstrap)

Exit Criteria

  • Grafana starts with provisioned datasources and dashboards
  • vmagent scrapes platform services and VictoriaMetrics can query ingested series
  • Loki is available for log queries (when logs are enabled)

Tasks

  • 3.1 Add observability deployment assets modeled after UltraBase
    • Grafana provisioning for datasources and dashboards
    • vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
    • Loki configuration (and optional promtail)
  • 3.1a Add distributed tracing backend and wiring
    • Tempo (or compatible tracing backend) as a Grafana datasource
    • OTLP receiver path (collector/agent) so platform services can emit traces
    • Grafana Explore is provisioned so operators can jump from logs to traces
    • Require the Gateway to accept and propagate x-correlation-id and traceparent to upstreams, and to include correlation_id and trace_id in request spans/log fields
  • 3.2 Implement the base dashboard set from the PRD
    • Operations overview
    • HTTP detail (Gateway route-level)
    • Logs (Loki)
    • Traces (Tempo)
    • Event bus / JetStream
    • Workers (Runner)
    • Storage (libmdbx + node disk)
    • Cluster / Orchestrator
  • 3.3 Add the chosen production-operability dashboards and document required instrumentation
    • Noisy Neighbor & Tenant Health
    • API Regression & Deployment
    • Storage & Event Bus Bottlenecks
    • Infrastructure Exhaustion
    • Standardize build/version labeling across services for correlation (*_build_info{service,version,git_sha}=1)
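
The *_build_info convention from 3.3 is a constant gauge set to 1 whose labels carry the version info, so dashboards can join behavior changes against "what was deployed". A dependency-free sketch of the exposition line (the metric and function names are illustrative, not the real instrumentation):

```rust
/// Render one Prometheus exposition line for a build-info gauge.
fn build_info_line(metric: &str, service: &str, version: &str, git_sha: &str) -> String {
    format!("{metric}{{service=\"{service}\",version=\"{version}\",git_sha=\"{git_sha}\"}} 1")
}

fn main() {
    println!("{}", build_info_line("control_build_info", "control-api", "1.4.2", "9f3c1ab"));
}
```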

Tests

  • T3.1 Grafana provisioning files are syntactically valid
  • T3.2 vmagent config parses and includes all required scrape jobs
  • T3.3 Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
  • T3.4 Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state

Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)

Goal: Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.

Dependencies

  • Milestone 1 (Admin UI foundation)
  • Milestone 2 (Control Plane API foundation)

Exit Criteria

  • Admin UI can list tenants and show current placement per service kind
  • Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback

Tasks

  • 4.1 Implement placement read APIs
    • Read effective placement from NATS KV (and fallback file for development)
    • Match the Gateway routing config model (placement maps + shard directories + revision semantics)
    • Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (aggregate_placement, projection_placement, runner_placement)
  • 4.2 Implement fleet “health snapshot” APIs
    • Query /health, /ready, /metrics from each service endpoint
    • Normalize into a stable UI response shape
  • 4.3 Implement Admin UI pages:
    • Scale & Placement (read-only)
    • Tenants (read-only with placement summary)
    • Fleet/Topology views (read-only)
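
The read model for 4.1 might look like the sketch below: one map per service kind (matching the aggregate_placement / projection_placement / runner_placement naming) plus a revision so watchers can swap snapshots atomically. The struct shape is an assumption, not the KV schema:

```rust
use std::collections::HashMap;

struct PlacementSnapshot {
    revision: u64,
    aggregate_placement: HashMap<String, String>, // tenant_id -> node
    projection_placement: HashMap<String, String>,
    runner_placement: HashMap<String, String>,
}

impl PlacementSnapshot {
    /// Look up the placement map for one service kind.
    fn placement_for(&self, kind: &str) -> Option<&HashMap<String, String>> {
        match kind {
            "aggregate" => Some(&self.aggregate_placement),
            "projection" => Some(&self.projection_placement),
            "runner" => Some(&self.runner_placement),
            _ => None,
        }
    }
}

fn main() {
    let snap = PlacementSnapshot {
        revision: 7,
        aggregate_placement: HashMap::from([("t-1".into(), "node-a".into())]),
        projection_placement: HashMap::new(),
        runner_placement: HashMap::new(),
    };
    println!("rev {} -> {:?}", snap.revision, snap.placement_for("aggregate"));
}
```

Replacing the whole snapshot (rather than mutating maps in place) is what makes the T4.2 hot-reload swap atomic.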

Tests

  • T4.1 Placement config parsing and snapshot endpoints work
  • T4.2 KV watcher hot-reload swaps placement atomically
  • T4.3 UI pages render with mocked API responses (component-level tests)

Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs

Goal: Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.

Dependencies

  • Milestone 4 (read-only ops)

Exit Criteria

  • All operational mutations are executed as jobs with audit events
  • Every mutation supports preflight planning and clear post-conditions

Tasks

  • 5.1 Implement job orchestration primitives in the API
    • step model, retries, cancellation, timeouts
    • per-tenant locking to avoid concurrent conflicting operations
  • 5.2 Implement drain workflow (per service kind where supported)
    • Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
    • Aggregate/projection drain semantics via admin endpoints where available
    • Align drain/readiness semantics with the rebalancing contract in external_prd.md
  • 5.3 Implement migration workflow
    • Plan: drain tenant → update placement → reload routing/config
    • Block unsafe migrations (health/lag/inflight thresholds)
  • 5.4 Implement UI mutation flows
    • modal confirmation + reason required
    • job progress view and audit linkage
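
The 5.3 preflight can be sketched as a safety gate plus a deterministic step list (names and thresholds are assumptions; the ordering is the plan stated above):

```rust
/// Gate: block the migration before any step runs if invariants fail.
fn migration_allowed(healthy: bool, lag: u64, inflight: u64, max_lag: u64, max_inflight: u64) -> bool {
    healthy && lag <= max_lag && inflight <= max_inflight
}

/// Deterministic, ordered plan: drain -> update placement -> reload.
fn migration_plan(tenant: &str, target_node: &str) -> Vec<String> {
    vec![
        format!("drain tenant {tenant}"),
        format!("update placement: {tenant} -> {target_node}"),
        "reload routing/config".to_string(),
    ]
}

fn main() {
    if migration_allowed(true, 10, 0, 1000, 5) {
        for step in migration_plan("t-42", "node-b") {
            println!("{step}");
        }
    }
}
```

A deterministic plan is what makes T5.2 testable: the same inputs must always yield the same ordered steps.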

Tests

  • T5.1 Job idempotency: repeated calls with same idempotency key do not duplicate effects
  • T5.2 Migration plan preflight produces a deterministic action plan
  • T5.3 Safety gates prevent drain/migrate when invariants fail

Milestone 6: Deployments + Regression Tooling (Swarm-Aware)

Goal: Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.

Dependencies

  • Milestone 3 (observability baseline)
  • Milestone 5 (job orchestration)

Exit Criteria

  • Deployments can be initiated (or at least observed) via the control plane
  • Grafana shows deploy markers; dashboards can compare old vs new versions

Tasks

  • 6.1 Implement Swarm integration (read-only first, then mutations)
    • list services, tasks, images, versions
    • watch update events (start/finish/fail)
  • 6.2 Implement deployment annotations/events
    • write Grafana annotations (or emit a deploy event metric) for vertical markers
  • 6.3 Implement “API Regression & Deployment” dashboard wiring prerequisites
    • enforce build/version labeling (*_build_info{service,version,git_sha}=1 pattern)
    • ensure scrape relabeling includes image_tag where possible
  • 6.4 UI pages
    • Deployments list + detail
    • Per-service “what changed” and “rollback” actions (guarded)
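
For 6.2, the deploy marker could be a body for Grafana's annotations HTTP API (POST /api/annotations, with time in epoch milliseconds). A sketch; building the JSON with format! keeps the example dependency-free, where real code would use a JSON library:

```rust
/// Render an annotation payload marking a deploy of `service` at `version`.
fn deploy_annotation(time_ms: u64, service: &str, version: &str) -> String {
    format!(
        "{{\"time\":{time_ms},\"tags\":[\"deploy\",\"{service}\"],\"text\":\"deploy {service} {version}\"}}"
    )
}

fn main() {
    println!("{}", deploy_annotation(1_735_000_000_000, "gateway", "1.4.2"));
}
```

Tagging annotations with the service name lets dashboards show only the markers relevant to the panels on screen.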

Tests

  • T6.1 Swarm client abstraction can be mocked and produces deterministic results
  • T6.2 Annotation writer produces expected Grafana payloads
  • T6.3 Version labels are present on all services in a metrics snapshot test

Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)

Goal: Provide a complete Swarm deployment definition for the platform: services in ../ plus the control plane components and the observability stack.

Dependencies

  • Milestone 1 (Admin UI foundation)
  • Milestone 2 (Control Plane API foundation)
  • Milestone 3 (Observability baseline)
  • Milestone 5 (safe mutations baseline)

Exit Criteria

  • docker stack deploy brings up:
    • Gateway + Aggregate + Projection + Runner (from ../)
    • Control Plane API + Admin UI
    • VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
  • All services are reachable via overlay networks and pass health checks
  • Smoke and integration tests pass end-to-end (gated, but required before milestone completion)

Tasks

  • 7.1 Define Swarm networks, secrets, and configs
    • overlay network segmentation (public vs internal)
    • secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
  • 7.2 Define Swarm stack files
    • base platform stack (gateway/aggregate/projection/runner)
    • control plane stack (api + ui)
    • observability stack (vm/vmagent/grafana/loki/promtail)
  • 7.3 Define placement constraints and scaling defaults
    • node labels for tenant ranges and infrastructure roles
    • replica defaults and update policies
  • 7.4 Define deployment verification and rollback playbooks (as executable checks)
    • post-deploy checks: /health, /ready, /metrics, dashboard provisioning
    • rollbacks: service update rollback hooks and job safety checks
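
The post-deploy checks in 7.4 boil down to "which services failed their probe". A sketch of the decision (the probe-result shape is an assumption):

```rust
/// Given the HTTP status observed on each service's /health endpoint,
/// return the services that must block the rollout (anything non-200).
fn failing_services(probes: &[(&str, u16)]) -> Vec<String> {
    probes
        .iter()
        .filter(|(_, status)| *status != 200)
        .map(|(name, _)| name.to_string())
        .collect()
}

fn main() {
    let probes = [("gateway", 200), ("runner", 503), ("grafana", 200)];
    println!("failing: {:?}", failing_services(&probes));
}
```

An empty result is the executable form of "all services pass health checks" from the exit criteria.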

Tests

  • T7.1 Stack YAML parses and validates (unit test)
  • T7.2 Swarm smoke test (requires CONTROL_TEST_DOCKER=1)
    • deploy stacks
    • wait for healthy state
    • verify Grafana dashboards provisioned and VictoriaMetrics receives samples
  • T7.3 End-to-end “control plane can see the fleet” test (requires docker)
    • UI/API can query placement + health snapshots for all services