Development Plan: Control Plane (Admin UI + Observability + Production Ops)

Overview

This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:

  • Tasks with clear deliverables
  • Test Requirements (unit tests + tautological tests + integration tests where applicable)
  • Dependencies on previous milestones

Development Approach:

  1. Complete one milestone at a time
  2. Write tests before implementation (TDD where applicable)
  3. All tests must pass before moving to the next milestone
  4. Mark tasks complete with [x] as you progress

This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: gateway/DEVELOPMENT_PLAN.md, runner/DEVELOPMENT_PLAN.md).


Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)

Goal: Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.

Tasks

  • 0.1 Define canonical local commands for the repo
    • UI:
      • npm run lint
      • npm run typecheck
      • npm run test
      • npm run build
    • Control Plane API:
      • cargo test
      • cargo fmt --check
      • cargo clippy -- -D warnings
      • cargo run -- --help
    • Docker/Swarm:
      • docker compose config validation for local stacks (if used)
      • docker stack deploy ... smoke validation for Swarm (gated, see Tests)
  • 0.2 Add a minimal CI workflow that runs the same commands as 0.1
  • 0.3 Define integration-test gating conventions
    • Docker/Swarm integration tests:
      • Mark as ignored by default and run only when CONTROL_TEST_DOCKER=1 is set
      • Example: CONTROL_TEST_DOCKER=1 cargo test -- --ignored
    • NATS-dependent integration tests:
      • Mark as ignored by default and run only when CONTROL_TEST_NATS_URL is set
      • Example: CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored
  • 0.4 Define baseline operational invariants (checklist for later milestones)
    • No privileged action without RBAC + audit event
    • No multi-step operation without idempotency key + job record
    • Always propagate tenant_id (when applicable) end-to-end
    • Always propagate request/flow identifiers end-to-end (logs + downstream calls):
      • x-request-id (per HTTP request)
      • x-correlation-id (per user-visible flow/job; generated by the Gateway when missing)
      • traceparent (W3C trace context; started by the Gateway when missing)
    • Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
    • No tenant-level metrics without bounded cardinality rules
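
The gating convention in 0.3 can be sketched as follows. This is a minimal illustration (module and test names are hypothetical): the test is `#[ignore]`d by default, and even when invoked with `--ignored` it checks its gate variable and skips early when unset.

```rust
/// True when an integration-test gate variable opts the test in.
fn gate_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

#[cfg(test)]
mod docker_integration {
    use super::gate_enabled;

    #[test]
    #[ignore] // run via: CONTROL_TEST_DOCKER=1 cargo test -- --ignored
    fn swarm_stack_smoke() {
        let gate = std::env::var("CONTROL_TEST_DOCKER").ok();
        if !gate_enabled(gate.as_deref()) {
            eprintln!("skipping: CONTROL_TEST_DOCKER is not set");
            return;
        }
        // ... deploy the stack and assert /health, /ready here ...
    }
}
```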

Tests

  • T0.1 Tautological test: test harness runs for both subprojects (UI + API)
  • T0.2 Lint + typecheck + unit tests pass
  • T0.3 Docker config validation passes (compose/stack linting tests)

Milestone 1: Admin UI Foundation (UltraBase UX Reuse)

Goal: Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.

Dependencies

  • Milestone 0 (repo bootstrap)

Exit Criteria

  • Admin UI builds successfully and passes unit/type checks
  • UI navigation skeleton matches the PRD information architecture

Tasks

  • 1.1 Initialize Admin UI project (Vite + React + TypeScript)
    • Choose and wire lint/typecheck/test/build tooling to match the canonical commands in 0.1
    • Adopt the baseline dependencies used by the UltraBase control-plane admin UI where available
    • Establish UI module layout for: components, pages, routes, API client, auth/session utilities
  • 1.2 Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
    • Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
  • 1.3 Implement navigation skeleton and empty pages (route wiring only)
    • Overview
    • Tenants
    • Users
    • Sessions
    • Roles & Permissions
    • Config
    • Definitions
    • Scale & Placement
    • Deployments
    • Observability
    • Audit Log
    • Settings
  • 1.3a Add correlation-first investigation affordances in the UI skeleton
    • Global search box that accepts x-request-id, x-correlation-id, or trace_id
    • “Investigate” links that open Grafana Explore prefilled for:
      • Loki query scoped to x-correlation-id (and x-request-id when available)
      • Tempo trace view when a trace_id is present
    • Ensure jobs and audit log rows display and copy the relevant ids
  • 1.4 Implement API client stub with consistent error handling and request-id propagation
    • Send x-request-id on every request (generate one when missing)
    • Send x-correlation-id when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
    • Send traceparent when continuing an existing trace; otherwise omit and use the Gateway-started trace
    • Echo x-request-id and x-correlation-id on responses and surface them in error UX
    • Persist the most recent ids in the UI so operators can copy/paste them into support tickets

Tests

  • T1.1 UI typecheck passes
  • T1.2 UI build passes
  • T1.3 Routing smoke test: each route renders without runtime errors (headless DOM test)

Milestone 2: Control Plane API Foundation (BFF / Admin API)

Goal: Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.

Dependencies

  • Milestone 0 (repo bootstrap)

Exit Criteria

  • Control plane API runs as a container and exposes /health, /ready, /metrics
  • Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints

Tasks

  • 2.1 Initialize Control Plane API service
    • Rust (Axum + Tokio + tracing) to align with the platform's existing Rust services
    • Baseline endpoints: GET /health, GET /ready, GET /metrics
  • 2.2 Add request logging and correlation identifiers
    • x-request-id propagation and structured logs (match Gateway conventions)
    • Propagate x-correlation-id and traceparent on outbound calls
    • Log fields: request_id, correlation_id, trace_id, principal_id, tenant_id (when applicable)
    • Never log Authorization headers or tokens
  • 2.3 Implement authentication and authorization boundary
    • Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
    • Extract principal identity from token claims (at minimum: sub, session_id)
    • Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
    • Align x-tenant-id semantics with Gateway:
      • Tenant-scoped endpoints require x-tenant-id and must reject missing/invalid values with 400
      • Platform-scoped endpoints must not depend on x-tenant-id
    • Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
      • Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
  • 2.4 Define “job” model for multi-step operations (API contract)
    • POST /admin/v1/jobs/* returns job_id
    • GET /admin/v1/jobs/{job_id} returns status + structured steps + errors
    • Require an idempotency key for job creation (Idempotency-Key header), and make repeated creates safe
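
The idempotency contract in 2.4 can be sketched with a std-only store (types and names are illustrative, not the real implementation): repeated creates with the same `Idempotency-Key` return the existing `job_id` instead of spawning a duplicate job.

```rust
use std::collections::HashMap;

/// Minimal sketch of idempotent job creation: the store remembers which
/// job_id each Idempotency-Key produced, so replays are side-effect free.
struct JobStore {
    by_idempotency_key: HashMap<String, u64>,
    next_job_id: u64,
}

impl JobStore {
    fn new() -> Self {
        Self { by_idempotency_key: HashMap::new(), next_job_id: 1 }
    }

    /// Create a job, or return the existing job_id for a replayed key.
    fn create_job(&mut self, idempotency_key: &str) -> u64 {
        if let Some(&existing) = self.by_idempotency_key.get(idempotency_key) {
            return existing; // replay: no new job, no duplicate effects
        }
        let id = self.next_job_id;
        self.next_job_id += 1;
        self.by_idempotency_key.insert(idempotency_key.to_string(), id);
        id
    }
}
```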

Tests

  • T2.1 GET /health and GET /ready return 200
  • T2.2 Unauthorized admin calls return 401/403 consistently
  • T2.3 x-tenant-id behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
  • T2.4 Tautological tests: core state types are Send + Sync

Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)

Goal: Stand up a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.

Dependencies

  • Milestone 0 (repo bootstrap)

Exit Criteria

  • Grafana starts with provisioned datasources and dashboards
  • vmagent scrapes platform services and VictoriaMetrics can query ingested series
  • Loki is available for log queries (when logs are enabled)

Tasks

  • 3.1 Add observability deployment assets modeled after UltraBase
    • Grafana provisioning for datasources and dashboards
    • vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
    • Loki configuration (and optional promtail)
  • 3.1a Add distributed tracing backend and wiring
    • Tempo (or compatible tracing backend) as a Grafana datasource
    • OTLP receiver path (collector/agent) so platform services can emit traces
    • Grafana Explore is provisioned so operators can jump from logs to traces
    • Require the Gateway to accept and propagate x-correlation-id and traceparent to upstreams, and to include correlation_id and trace_id in request spans/log fields
  • 3.2 Implement the base dashboard set from the PRD
    • Operations overview
    • HTTP detail (Gateway route-level)
    • Logs (Loki)
    • Traces (Tempo)
    • Event bus / JetStream
    • Workers (Runner)
    • Storage (libmdbx + node disk)
    • Cluster / Orchestrator
  • 3.3 Add the chosen production-operability dashboards and document required instrumentation
    • Noisy Neighbor & Tenant Health
    • API Regression & Deployment
    • Storage & Event Bus Bottlenecks
    • Infrastructure Exhaustion
    • Standardize build/version labeling across services for correlation (*_build_info{service,version,git_sha}=1)
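
The `*_build_info` convention above is a constant gauge of value 1 whose labels carry the build identity, so dashboards can join any metric to a version. A minimal sketch of the exposition line (metric name `control_build_info` is an example of the pattern, not a fixed name):

```rust
/// Format a Prometheus exposition line for the build-info gauge.
/// Assumes label values are already sanitized (no quotes/backslashes).
fn build_info_line(service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "control_build_info{{service=\"{service}\",version=\"{version}\",git_sha=\"{git_sha}\"}} 1"
    )
}
```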

Tests

  • T3.1 Grafana provisioning files are syntactically valid
  • T3.2 vmagent config parses and includes all required scrape jobs
  • T3.3 Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
  • T3.4 Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state

Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)

Goal: Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.

Dependencies

  • Milestone 1 (Admin UI foundation)
  • Milestone 2 (Control Plane API foundation)

Exit Criteria

  • Admin UI can list tenants and show current placement per service kind
  • Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback

Tasks

  • 4.1 Implement placement read APIs
    • Read effective placement from NATS KV (and fallback file for development)
    • Match the Gateway routing config model (placement maps + shard directories + revision semantics)
    • Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (aggregate_placement, projection_placement, runner_placement)
  • 4.2 Implement fleet “health snapshot” APIs
    • Query /health, /ready, /metrics from each service endpoint
    • Normalize into a stable UI response shape
  • 4.3 Implement Admin UI pages:
    • Scale & Placement (read-only)
    • Tenants (read-only with placement summary)
    • Fleet/Topology views (read-only)
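
The per-service-kind placement maps in 4.1 can be sketched as a read model (the field names mirror the plan's naming conventions, but the shapes here are illustrative, not the real KV schema): a snapshot with a revision, and a resolver from (service kind, tenant) to a target.

```rust
use std::collections::HashMap;

/// Illustrative read-model for an effective placement snapshot.
struct PlacementSnapshot {
    revision: u64,
    aggregate_placement: HashMap<String, String>, // tenant_id -> target node
    projection_placement: HashMap<String, String>,
    runner_placement: HashMap<String, String>,
}

impl PlacementSnapshot {
    /// Resolve a tenant's target for a given service kind, if placed.
    fn resolve(&self, kind: &str, tenant_id: &str) -> Option<&str> {
        let map = match kind {
            "aggregate" => &self.aggregate_placement,
            "projection" => &self.projection_placement,
            "runner" => &self.runner_placement,
            _ => return None,
        };
        map.get(tenant_id).map(String::as_str)
    }
}
```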

Tests

  • T4.1 Placement config parsing and snapshot endpoints work
  • T4.2 KV watcher hot-reload swaps placement atomically
  • T4.3 UI pages render with mocked API responses (component-level tests)

Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs

Goal: Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.

Dependencies

  • Milestone 4 (read-only ops)

Exit Criteria

  • All operational mutations are executed as jobs with audit events
  • Every mutation supports preflight planning and clear post-conditions

Tasks

  • 5.1 Implement job orchestration primitives in the API
    • step model, retries, cancellation, timeouts
    • per-tenant locking to avoid concurrent conflicting operations
  • 5.2 Implement drain workflow (per service kind where supported)
    • Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
    • Aggregate/projection drain semantics via admin endpoints where available
    • Align drain/readiness semantics with the rebalancing contract in external_prd.md
  • 5.3 Implement migration workflow
    • Plan: drain tenant → update placement → reload routing/config
    • Block unsafe migrations (health/lag/inflight thresholds)
  • 5.4 Implement UI mutation flows
    • modal confirmation + reason required
    • job progress view and audit linkage
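
The per-tenant locking in 5.1 can be sketched with a std-only structure (illustrative; the real implementation would tie lock lifetime to job records): a second conflicting job is rejected immediately rather than queued, so operators see the conflict.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Tracks which tenants currently have a mutating job in flight.
struct TenantLocks {
    held: Mutex<HashSet<String>>,
}

impl TenantLocks {
    fn new() -> Self {
        Self { held: Mutex::new(HashSet::new()) }
    }

    /// Try to take the tenant's slot; false means a conflicting job exists.
    fn try_acquire(&self, tenant_id: &str) -> bool {
        self.held.lock().unwrap().insert(tenant_id.to_string())
    }

    /// Release when the job reaches a terminal state (done/failed/cancelled).
    fn release(&self, tenant_id: &str) {
        self.held.lock().unwrap().remove(tenant_id);
    }
}
```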

Tests

  • T5.1 Job idempotency: repeated calls with same idempotency key do not duplicate effects
  • T5.2 Migration plan preflight produces a deterministic action plan
  • T5.3 Safety gates prevent drain/migrate when invariants fail

Milestone 6: Deployments + Regression Tooling (Swarm-Aware)

Goal: Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.

Dependencies

  • Milestone 3 (observability baseline)
  • Milestone 5 (job orchestration)

Exit Criteria

  • Deployments can be initiated (or at least observed) via the control plane
  • Grafana shows deploy markers; dashboards can compare old vs new versions

Tasks

  • 6.1 Implement Swarm integration (read-only first, then mutations)
    • list services, tasks, images, versions
    • watch update events (start/finish/fail)
  • 6.2 Implement deployment annotations/events
    • write Grafana annotations (or emit a deploy event metric) for vertical markers
  • 6.3 Implement “API Regression & Deployment” dashboard wiring prerequisites
    • enforce build/version labeling (*_build_info{service,version,git_sha}=1 pattern)
    • ensure scrape relabeling includes image_tag where possible
  • 6.4 UI pages
    • Deployments list + detail
    • Per-service “what changed” and “rollback” actions (guarded)
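
The deploy markers in 6.2 can target Grafana's annotations HTTP API (`POST /api/annotations`, which accepts `time` in epoch milliseconds plus `tags` and `text`). A hand-built sketch of the payload follows; a real implementation should use a JSON serializer, and this version assumes tag/text values need no escaping.

```rust
/// Build the JSON body for a Grafana deploy annotation (illustrative).
fn deploy_annotation(time_ms: u64, service: &str, version: &str) -> String {
    format!(
        "{{\"time\":{time_ms},\"tags\":[\"deploy\",\"{service}\"],\"text\":\"deploy {service} {version}\"}}"
    )
}
```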

Tests

  • T6.1 Swarm client abstraction can be mocked and produces deterministic results
  • T6.2 Annotation writer produces expected Grafana payloads
  • T6.3 Version labels are present on all services in a metrics snapshot test

Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)

Goal: Provide a complete Swarm deployment definition for the platform: the services in ../, plus the control plane components and the observability stack.

Dependencies

  • Milestone 1 (Admin UI foundation)
  • Milestone 2 (Control Plane API foundation)
  • Milestone 3 (Observability baseline)
  • Milestone 5 (safe mutations baseline)

Exit Criteria

  • docker stack deploy brings up:
    • Gateway + Aggregate + Projection + Runner (from ../)
    • Control Plane API + Admin UI
    • VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
  • All services are reachable via overlay networks and pass health checks
  • Smoke and integration tests pass end-to-end (gated, but required before milestone completion)

Tasks

  • 7.1 Define Swarm networks, secrets, and configs
    • overlay network segmentation (public vs internal)
    • secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
  • 7.2 Define Swarm stack files
    • base platform stack (gateway/aggregate/projection/runner)
    • control plane stack (api + ui)
    • observability stack (vm/vmagent/grafana/loki/promtail)
  • 7.3 Define placement constraints and scaling defaults
    • node labels for tenant ranges and infrastructure roles
    • replica defaults and update policies
  • 7.4 Define deployment verification and rollback playbooks (as executable checks)
    • post-deploy checks: /health, /ready, /metrics, dashboard provisioning
    • rollbacks: service update rollback hooks and job safety checks

Tests

  • T7.1 Stack YAML parses and validates (unit test)
  • T7.2 Swarm smoke test (requires CONTROL_TEST_DOCKER=1)
    • deploy stacks
    • wait for healthy state
    • verify Grafana dashboards provisioned and VictoriaMetrics receives samples
  • T7.3 End-to-end “control plane can see the fleet” test (requires docker)
    • UI/API can query placement + health snapshots for all services

Milestone 8: Config Registry + Safe Change Management (Plan/Apply/Rollback)

Goal: Make configuration first-class, versioned, validated, and safely mutable from the control plane, while keeping production and development sources consistent.

Dependencies

  • Milestone 2 (Control Plane API foundation)
  • Milestone 5 (safe mutations baseline)
  • Milestone 7 (Swarm deployment baseline)

Exit Criteria

  • Operators can list, view, validate, and safely apply config changes with audit + idempotent jobs
  • Config changes have revision semantics and are roll-backable
  • Gatekeeper safety checks prevent applying invalid or unsafe configs

Tasks

  • 8.1 Inventory and classify configuration surfaces (platform-wide)
    • classify as: static boot config (env/secrets), dynamic runtime config (KV), large immutable artifacts (S3/docs)
    • map current sources per domain:
      • Gateway routing config (config/routing/dev.json / production KV)
      • Placement config (config/placement/dev.json / production KV)
      • Runner definitions (effects/sagas) (documents/S3) and activation config (KV)
      • Observability provisioning (Swarm configs + repo-managed assets)
      • Control plane feature flags (KV)
  • [~] 8.2 Define a Config Registry contract in the Control API
    • Implemented (initial):
      • config identity: {domain} (routing|placement)
      • metadata: revision (KV revision when using NATS), and source info (file vs nats)
      • storage policy per config: source=dev_file | nats_kv
    • Still needed:
      • {domain, name, scope} and richer metadata (updated_at, updated_by, sha256)
      • history API for KV-backed configs
  • 8.3 Implement config storage abstraction (dev + prod)
    • dev: file-backed, atomic write (tmp + rename), hot-reload where applicable
    • prod: NATS KV for dynamic configs (revisioned values + watch streams)
    • consistent error model: decode/validate/source errors are distinguishable and safe
  • 8.4 Add read-only config APIs
    • GET /admin/v1/config list domains
    • GET /admin/v1/config/{domain} fetch current value + revision + source
    • (history not implemented yet)
  • [~] 8.5 Add validate/plan/apply/rollback mutation workflows as jobs
    • Implemented:
      • POST /admin/v1/jobs/config/validate (job, idempotency key required)
      • POST /admin/v1/jobs/config/apply (job, idempotency key required, backup + apply)
      • POST /admin/v1/jobs/config/rollback (job, idempotency key required, restore last backup)
      • per-domain locking to avoid concurrent config mutations
    • Still needed:
      • POST /admin/v1/plan/config/apply deterministic plan (diff + impacted services)
      • richer post-conditions (routing resolution sampling, fleet consistency checks, etc.)
  • [~] 8.6 Implement initial config domains end-to-end
    • Gateway routing config:
      • implemented: schema validation via JSON decode
      • still needed: semantic validation (tenant entries/shard directories/endpoints URL parsing) + sampled routing verification
    • Placement config:
      • implemented: schema validation via JSON decode
      • still needed: semantic validation (targets non-empty, etc.) + fleet snapshot consistency checks
  • 8.7 Implement Admin UI “Config” page for safe operations
    • list + view configs with revision/sha/audit linkage
    • editor for JSON (and YAML when supported by the domain)
    • validate button (server-side) and apply/rollback flows as jobs with reason required
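
The dev-file storage policy in 8.3 (atomic write via tmp + rename) can be sketched with std only; error handling is simplified for illustration, and the rename is atomic on POSIX filesystems, so readers never observe a torn config.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Write config contents atomically: temp file, flush to disk, then rename.
fn atomic_write(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp"); // sibling temp file, same directory
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(contents)?;
        f.sync_all()?; // ensure bytes hit disk before the rename
    }
    fs::rename(&tmp, path) // atomic replace on POSIX filesystems
}
```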

Tests

  • T8.1 Unit tests: config decode/encode stability for each config domain
    • routing/placement decode is enforced by server-side validate job (schema-level)
  • T8.2 Unit tests: validation rejects unsafe configs with stable error codes/messages
  • T8.3 Unit tests: plan generation is deterministic for same inputs
  • T8.4 Integration tests (env-gated):
    • NATS KV config apply + rollback via Control API (requires CONTROL_TEST_NATS=1 + CONTROL_TEST_NATS_URL)
    • (Gateway route-resolution E2E verification still pending)
  • T8.5 UI tests: config page renders, validate/apply/rollback flows navigate to job progress

Milestone 9: Control Node Management (Inventory, Drift, and Safer Ops)

Goal: Improve how the control plane understands and manages the live control node and platform state: node inventory, config drift detection, and safer operational guardrails.

Dependencies

  • Milestone 7 (Swarm deployment baseline)
  • Milestone 8 (config registry + safe change management)

Exit Criteria

  • Control plane provides a reliable “what is running vs what should be running” view
  • Config drift is detectable and actionable
  • Core operational actions are guarded by preflight checks and produce audit trails

Tasks

  • 9.1 Define a “desired vs observed” model for platform state
    • desired: Swarm stacks + config registry revisions
    • observed: live service/task state + effective runtime configs
    • drift categories: missing, extra, version mismatch, config mismatch, unhealthy
  • [~] 9.2 Improve Swarm observation fidelity
    • implemented (initial): docker-cli-backed Swarm observation (CONTROL_SWARM_MODE=docker)
    • still needed: direct Docker API client (avoid shelling out), richer normalization, and wiring into production stacks
    • keep file source as a dev fallback for deterministic tests
    • normalize service identity: {service, image_tag, git_sha, updated_at}
  • 9.3 Add drift APIs and UI views
    • GET /admin/v1/platform/drift returns drift summary + actionable items
    • UI: “Platform Drift” page with filters and links to remediate jobs
  • 9.4 Add safer operational guardrails as reusable checks
    • preflight checks for:
      • service unhealthy / crashloop
      • tenant migration safety thresholds (lag/inflight)
      • config apply safety (impact radius, sampled verify)
    • consistent failure modes: clear reason + audit entry, no partial side effects
  • 9.5 Add operational playbooks as executable checks
    • post-deploy verification suite callable as an idempotent job
    • rollback verification suite callable as an idempotent job
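
The drift categories in 9.1 can be sketched as a pure classification over desired vs observed service→version maps (category names mirror the plan; "config mismatch" and "unhealthy" would need richer inputs than versions alone, so they are omitted here).

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Drift {
    Missing,         // desired but not observed
    Extra,           // observed but not desired
    VersionMismatch, // running the wrong image/version
    InSync,
}

/// Classify each service by comparing desired and observed versions.
fn classify(
    desired: &HashMap<String, String>,
    observed: &HashMap<String, String>,
) -> HashMap<String, Drift> {
    let mut out = HashMap::new();
    for (svc, want) in desired {
        out.insert(
            svc.clone(),
            match observed.get(svc) {
                None => Drift::Missing,
                Some(have) if have != want => Drift::VersionMismatch,
                Some(_) => Drift::InSync,
            },
        );
    }
    for svc in observed.keys() {
        if !desired.contains_key(svc) {
            out.insert(svc.clone(), Drift::Extra);
        }
    }
    out
}
```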

Tests

  • T9.1 Unit tests: drift classification for synthetic desired/observed fixtures
  • T9.2 Integration tests (docker-gated): drift view detects intentional mismatches in a local Swarm
    • requires CONTROL_TEST_DOCKER=1 and an active local Swarm node
  • T9.3 UI tests: drift page renders in route smoke test