Development Plan: Control Plane (Admin UI + Observability + Production Ops)
Overview
This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:
- Tasks with clear deliverables
- Test Requirements (unit tests + tautological tests + integration tests where applicable)
- Dependencies on previous milestones
Development Approach:
- Complete one milestone at a time
- Write tests before implementation (TDD where applicable)
- All tests must pass before moving to the next milestone
- Mark tasks complete with `[x]` as you progress
This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: gateway/DEVELOPMENT_PLAN.md, runner/DEVELOPMENT_PLAN.md).
Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)
Goal: Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.
Tasks
- 0.1 Define canonical local commands for the repo
- UI: `npm run lint`, `npm run typecheck`, `npm run test`, `npm run build`
- Control Plane API: `cargo test`, `cargo fmt --check`, `cargo clippy -- -D warnings`, `cargo run -- --help`
- Docker/Swarm:
  - `docker compose config` validation for local stacks (if used)
  - `docker stack deploy ...` smoke validation for Swarm (gated, see Tests)
- 0.2 Add a minimal CI workflow that runs the same commands as 0.1
- 0.3 Define integration-test gating conventions
- Docker/Swarm integration tests:
  - Mark as ignored by default and run only when `CONTROL_TEST_DOCKER=1` is set
  - Example: `CONTROL_TEST_DOCKER=1 cargo test -- --ignored`
- NATS-dependent integration tests:
  - Mark as ignored by default and run only when `CONTROL_TEST_NATS_URL` is set
  - Example: `CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored`
- 0.4 Define baseline operational invariants (checklist for later milestones)
- No privileged action without RBAC + audit event
- No multi-step operation without idempotency key + job record
- Always propagate `tenant_id` (when applicable) end-to-end
- Always propagate request/flow identifiers end-to-end (logs + downstream calls):
  - `x-request-id` (per HTTP request)
  - `x-correlation-id` (per user-visible flow/job; generated by the Gateway when missing)
  - `traceparent` (W3C trace context; started by the Gateway when missing)
- Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
- No tenant-level metrics without bounded cardinality rules
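The gating convention from 0.3 reduces to a small environment predicate that test harnesses in both subprojects can share. A minimal TypeScript sketch (helper names are hypothetical, not part of the plan):

```typescript
// Hypothetical gating helpers mirroring the 0.3 conventions: integration
// tests run only when the corresponding environment variable is set.
export function dockerTestsEnabled(env: Record<string, string | undefined>): boolean {
  // Docker/Swarm tests require an explicit CONTROL_TEST_DOCKER=1 opt-in.
  return env.CONTROL_TEST_DOCKER === "1";
}

export function natsTestUrl(env: Record<string, string | undefined>): string | undefined {
  // Any non-empty CONTROL_TEST_NATS_URL enables NATS tests and doubles as the URL.
  const url = env.CONTROL_TEST_NATS_URL;
  return url !== undefined && url.length > 0 ? url : undefined;
}
```

On the Rust side the same predicate decides whether `#[ignore]`-marked tests are exercised via `cargo test -- --ignored`.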
Tests
- T0.1 Tautological test: test harness runs for both subprojects (UI + API)
- T0.2 Lint + typecheck + unit tests pass
- T0.3 Docker config validation passes (compose/stack linting tests)
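The "secrets never appear in logs" invariant from 0.4 is easiest to enforce with a single redaction choke point that every logger passes headers through. A minimal sketch, assuming a flat header map (the sensitive-key list here is illustrative, not exhaustive):

```typescript
// Keys whose values must never reach log output (case-insensitive match).
const SENSITIVE_KEYS = new Set(["authorization", "cookie", "set-cookie", "x-api-key"]);

// Return a copy of the headers with sensitive values masked.
export function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(headers)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return out;
}
```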
Milestone 1: Admin UI Foundation (UltraBase UX Reuse)
Goal: Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.
Dependencies
- Milestone 0 (repo bootstrap)
Exit Criteria
- Admin UI builds successfully and passes unit/type checks
- UI navigation skeleton matches the PRD information architecture
Tasks
- 1.1 Initialize Admin UI project (Vite + React + TypeScript)
- Choose and wire lint/typecheck/test/build tooling to match the canonical commands in 0.1
- Adopt the baseline dependencies used by UltraBase control-plane admin UI where available
- Establish UI module layout for: components, pages, routes, API client, auth/session utilities
- 1.2 Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
- Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
- 1.3 Implement navigation skeleton and empty pages (route wiring only)
- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings
- 1.3a Add correlation-first investigation affordances in the UI skeleton
- Global search box that accepts `x-request-id`, `x-correlation-id`, or `trace_id`
- “Investigate” links that open Grafana Explore prefilled for:
  - Loki query scoped to `x-correlation-id` (and `x-request-id` when available)
  - Tempo trace view when a `trace_id` is present
- Ensure jobs and audit log rows display and copy the relevant ids
- 1.4 Implement API client stub with consistent error handling and request-id propagation
- Send `x-request-id` on every request (generate one when missing)
- Send `x-correlation-id` when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
- Send `traceparent` when continuing an existing trace; otherwise omit and use the Gateway-started trace
- Echo `x-request-id` and `x-correlation-id` on responses and surface them in error UX
- Persist the most recent ids in the UI so operators can copy/paste them into support tickets
Tests
- T1.1 UI typecheck passes
- T1.2 UI build passes
- T1.3 Routing smoke test: each route renders without runtime errors (headless DOM test)
Milestone 2: Control Plane API Foundation (BFF / Admin API)
Goal: Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.
Dependencies
- Milestone 0 (repo bootstrap)
Exit Criteria
- Control plane API runs as a container and exposes `/health`, `/ready`, `/metrics`
- Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints
Tasks
- 2.1 Initialize Control Plane API service
- Rust (Axum + Tokio + tracing) to align with the platform's existing service ecosystem
- Baseline endpoints: `GET /health`, `GET /ready`, `GET /metrics`
- 2.2 Add request logging and correlation identifiers
- `x-request-id` propagation and structured logs (match Gateway conventions)
- Propagate `x-correlation-id` and `traceparent` on outbound calls
- Log fields: `request_id`, `correlation_id`, `trace_id`, `principal_id`, `tenant_id` (when applicable)
- Never log Authorization headers or tokens
- 2.3 Implement authentication and authorization boundary
- Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
- Extract principal identity from token claims (at minimum: `sub`, `session_id`)
- Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
- Align `x-tenant-id` semantics with Gateway:
  - Tenant-scoped endpoints require `x-tenant-id` and must reject missing/invalid values with 400
  - Platform-scoped endpoints must not depend on `x-tenant-id`
- Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
  - Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
- 2.4 Define “job” model for multi-step operations (API contract)
- `POST /admin/v1/jobs/*` returns `job_id`
- `GET /admin/v1/jobs/{job_id}` returns status + structured steps + errors
- Require an idempotency key for job creation (`Idempotency-Key` header), and make repeated creates safe
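The "repeated creates are safe" requirement in 2.4 amounts to keying job creation on the idempotency key and returning the original `job_id` on replays. A minimal in-memory sketch (a real implementation would persist this mapping; names are illustrative):

```typescript
// In-memory job store demonstrating idempotent creation per 2.4.
export class JobStore {
  private byIdempotencyKey = new Map<string, string>();
  private nextId = 0;

  // Repeated calls with the same Idempotency-Key return the same job_id
  // and do not create a second job.
  createJob(idempotencyKey: string): { jobId: string; created: boolean } {
    const existing = this.byIdempotencyKey.get(idempotencyKey);
    if (existing !== undefined) return { jobId: existing, created: false };
    const jobId = `job-${++this.nextId}`;
    this.byIdempotencyKey.set(idempotencyKey, jobId);
    return { jobId, created: true };
  }
}
```

This is also exactly what T5.1 later asserts: same key, same effects, no duplicates.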
Tests
- T2.1 `GET /health` and `GET /ready` return 200
- T2.2 Unauthorized admin calls return 401/403 consistently
- T2.3 `x-tenant-id` behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
- T2.4 Tautological tests: core state types are Send + Sync
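The `x-tenant-id` rules from 2.3 (and asserted by T2.3) can be sketched as one boundary check. The tenant-id format used here (non-empty, URL-safe) is an assumption, not a Gateway contract:

```typescript
export type TenantCheck =
  | { ok: true; tenantId?: string }
  | { ok: false; status: 400; error: string };

// Tenant-scoped routes reject missing/invalid x-tenant-id with 400;
// platform-scoped routes must not depend on the header at all.
export function checkTenantHeader(tenantScoped: boolean, headerValue: string | undefined): TenantCheck {
  if (!tenantScoped) return { ok: true };
  if (headerValue === undefined || headerValue === "") {
    return { ok: false, status: 400, error: "missing x-tenant-id" };
  }
  if (!/^[A-Za-z0-9_-]+$/.test(headerValue)) {
    return { ok: false, status: 400, error: "invalid x-tenant-id" };
  }
  return { ok: true, tenantId: headerValue };
}
```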
Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)
Goal: Include a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.
Dependencies
- Milestone 0 (repo bootstrap)
Exit Criteria
- Grafana starts with provisioned datasources and dashboards
- vmagent scrapes platform services and VictoriaMetrics can query ingested series
- Loki is available for log queries (when logs are enabled)
Tasks
- 3.1 Add observability deployment assets modeled after UltraBase
- Grafana provisioning for datasources and dashboards
- vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
- Loki configuration (and optional promtail)
- 3.1a Add distributed tracing backend and wiring
- Tempo (or compatible tracing backend) as a Grafana datasource
- OTLP receiver path (collector/agent) so platform services can emit traces
- Grafana Explore is provisioned so operators can jump from logs to traces
- Require the Gateway to accept and propagate `x-correlation-id` and `traceparent` to upstreams, and to include `correlation_id` and `trace_id` in request spans/log fields
- 3.2 Implement the base dashboard set from the PRD
- Operations overview
- HTTP detail (Gateway route-level)
- Logs (Loki)
- Traces (Tempo)
- Event bus / JetStream
- Workers (Runner)
- Storage (libmdbx + node disk)
- Cluster / Orchestrator
- 3.3 Add the chosen production-operability dashboards and document required instrumentation
- Noisy Neighbor & Tenant Health
- API Regression & Deployment
- Storage & Event Bus Bottlenecks
- Infrastructure Exhaustion
- Standardize build/version labeling across services for correlation (`*_build_info{service,version,git_sha}=1`)
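Extracting `trace_id` from a W3C `traceparent` header (3.1a) is what lets services stamp the same id into log fields so operators can pivot from Loki to Tempo. A minimal sketch of the shape check (`version-traceid-parentid-flags`):

```typescript
// Parse a W3C traceparent header and return its trace-id, or undefined when
// the header does not match the spec's version-traceid-parentid-flags shape.
export function traceIdFromTraceparent(traceparent: string): string | undefined {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(traceparent);
  if (!m) return undefined;
  const traceId = m[2];
  // An all-zero trace-id is invalid per the Trace Context spec.
  return traceId === "0".repeat(32) ? undefined : traceId;
}
```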
Tests
- T3.1 Grafana provisioning files are syntactically valid
- T3.2 vmagent config parses and includes all required scrape jobs
- T3.3 Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
- T3.4 Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state
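The `*_build_info` convention from 3.3 is a constant-value gauge whose labels carry the identifying metadata, following the common Prometheus `build_info` pattern. A sketch of the exposition line (label escaping is simplified for illustration):

```typescript
// Render one *_build_info exposition line: a gauge fixed at 1 whose labels
// (service, version, git_sha) enable "what changed when" correlation.
export function buildInfoMetric(service: string, version: string, gitSha: string): string {
  return `${service}_build_info{service="${service}",version="${version}",git_sha="${gitSha}"} 1`;
}
```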
Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)
Goal: Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.
Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
Exit Criteria
- Admin UI can list tenants and show current placement per service kind
- Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback
Tasks
- 4.1 Implement placement read APIs
- Read effective placement from NATS KV (and fallback file for development)
- Match the Gateway routing config model (placement maps + shard directories + revision semantics)
- Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (`aggregate_placement`, `projection_placement`, `runner_placement`)
- 4.2 Implement fleet “health snapshot” APIs
- Query `/health`, `/ready`, `/metrics` from each service endpoint
- Normalize into a stable UI response shape
- 4.3 Implement Admin UI pages:
- Scale & Placement (read-only)
- Tenants (read-only with placement summary)
- Fleet/Topology views (read-only)
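The "normalize into a stable UI response shape" step in 4.2 can be sketched as folding raw per-endpoint probe results into one record per service. Field names here are assumptions about the UI shape, not a fixed contract:

```typescript
// One raw probe against a service endpoint.
export interface ProbeResult { endpoint: "health" | "ready"; status: number }
// The stable per-service shape the UI consumes.
export interface ServiceSnapshot { service: string; healthy: boolean; ready: boolean }

// A service is healthy/ready only when the corresponding probe returned 200.
export function normalizeSnapshot(service: string, probes: ProbeResult[]): ServiceSnapshot {
  const ok = (e: "health" | "ready") => probes.some((p) => p.endpoint === e && p.status === 200);
  return { service, healthy: ok("health"), ready: ok("ready") };
}
```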
Tests
- T4.1 Placement config parsing and snapshot endpoints work
- T4.2 KV watcher hot-reload swaps placement atomically
- T4.3 UI pages render with mocked API responses (component-level tests)
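The atomic-swap behavior asserted by T4.2 amounts to replacing the whole placement map in a single reference assignment and rejecting stale revisions, mirroring the revision semantics of the Gateway routing config model. A minimal sketch (field names are illustrative):

```typescript
// A versioned placement map as delivered by the KV watcher.
export interface PlacementMap { revision: number; tenants: Record<string, string> }

export class PlacementHolder {
  private current: PlacementMap = { revision: 0, tenants: {} };

  get(): PlacementMap { return this.current; }

  // Apply a watched update; stale or duplicate revisions are ignored.
  apply(next: PlacementMap): boolean {
    if (next.revision <= this.current.revision) return false;
    this.current = next; // single assignment: readers see old or new, never a mix
    return true;
  }
}
```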
Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs
Goal: Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.
Dependencies
- Milestone 4 (read-only ops)
Exit Criteria
- All operational mutations are executed as jobs with audit events
- Every mutation supports preflight planning and clear post-conditions
Tasks
- 5.1 Implement job orchestration primitives in the API
- step model, retries, cancellation, timeouts
- per-tenant locking to avoid concurrent conflicting operations
- 5.2 Implement drain workflow (per service kind where supported)
- Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
- Aggregate/projection drain semantics via admin endpoints where available
- Align drain/readiness semantics with the rebalancing contract in external_prd.md
- 5.3 Implement migration workflow
- Plan: drain tenant → update placement → reload routing/config
- Block unsafe migrations (health/lag/inflight thresholds)
- 5.4 Implement UI mutation flows
- modal confirmation + reason required
- job progress view and audit linkage
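The migration plan in 5.3 (drain tenant → update placement → reload) is a fixed step sequence, which is what makes the T5.2 determinism test possible. A sketch of the preflight plan generator (step and field names are illustrative):

```typescript
export interface PlanStep { action: string; detail: string }

// Produce the deterministic drain -> update-placement -> reload plan for a
// tenant migration. Same inputs always yield the same plan (see T5.2).
export function planMigration(tenantId: string, fromNode: string, toNode: string): PlanStep[] {
  return [
    { action: "drain", detail: `drain tenant ${tenantId} on ${fromNode}` },
    { action: "update-placement", detail: `move ${tenantId}: ${fromNode} -> ${toNode}` },
    { action: "reload", detail: "reload routing/config" },
  ];
}
```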
Tests
- T5.1 Job idempotency: repeated calls with same idempotency key do not duplicate effects
- T5.2 Migration plan preflight produces a deterministic action plan
- T5.3 Safety gates prevent drain/migrate when invariants fail
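The per-tenant locking from 5.1 only needs to reject a second conflicting operation on the same tenant while letting other tenants proceed. A minimal single-process sketch (a clustered control plane would need a distributed lock instead):

```typescript
// Per-tenant operation locks: at most one mutating job per tenant at a time.
export class TenantLocks {
  private held = new Set<string>();

  // Returns false when another operation already holds the tenant's lock.
  tryAcquire(tenantId: string): boolean {
    if (this.held.has(tenantId)) return false;
    this.held.add(tenantId);
    return true;
  }

  release(tenantId: string): void {
    this.held.delete(tenantId);
  }
}
```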
Milestone 6: Deployments + Regression Tooling (Swarm-Aware)
Goal: Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.
Dependencies
- Milestone 3 (observability baseline)
- Milestone 5 (job orchestration)
Exit Criteria
- Deployments can be initiated (or at least observed) via the control plane
- Grafana shows deploy markers; dashboards can compare old vs new versions
Tasks
- 6.1 Implement Swarm integration (read-only first, then mutations)
- list services, tasks, images, versions
- watch update events (start/finish/fail)
- 6.2 Implement deployment annotations/events
- write Grafana annotations (or emit a deploy event metric) for vertical markers
- 6.3 Implement “API Regression & Deployment” dashboard wiring prerequisites
- enforce build/version labeling (`*_build_info{service,version,git_sha}=1` pattern)
- ensure scrape relabeling includes `image_tag` where possible
- 6.4 UI pages
- Deployments list + detail
- Per-service “what changed” and “rollback” actions (guarded)
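The deploy markers in 6.2 map onto Grafana's annotations API, which takes an epoch-milliseconds `time`, `tags`, and `text`. A sketch of the payload builder asserted by T6.2 (dashboard/panel scoping fields are omitted for brevity):

```typescript
// Build a Grafana annotation payload for a deploy marker: time in epoch ms,
// tags for filtering on dashboards, and a human-readable text.
export function deployAnnotation(service: string, version: string, timeMs: number) {
  return {
    time: timeMs,
    tags: ["deploy", service],
    text: `Deployed ${service} ${version}`,
  };
}
```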
Tests
- T6.1 Swarm client abstraction can be mocked and produces deterministic results
- T6.2 Annotation writer produces expected Grafana payloads
- T6.3 Version labels are present on all services in a metrics snapshot test
Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)
Goal: Provide a complete Swarm deployment definition for the platform: services in ../ plus the control plane components and the observability stack.
Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
- Milestone 3 (Observability baseline)
- Milestone 5 (safe mutations baseline)
Exit Criteria
- `docker stack deploy` brings up:
  - Gateway + Aggregate + Projection + Runner (from `../`)
  - Control Plane API + Admin UI
  - VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
- All services are reachable via overlay networks and pass health checks
- Smoke and integration tests pass end-to-end (gated, but required before milestone completion)
Tasks
- 7.1 Define Swarm networks, secrets, and configs
- overlay network segmentation (public vs internal)
- secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
- 7.2 Define Swarm stack files
- base platform stack (gateway/aggregate/projection/runner)
- control plane stack (api + ui)
- observability stack (vm/vmagent/grafana/loki/promtail)
- 7.3 Define placement constraints and scaling defaults
- node labels for tenant ranges and infrastructure roles
- replica defaults and update policies
- 7.4 Define deployment verification and rollback playbooks (as executable checks)
- post-deploy checks: `/health`, `/ready`, `/metrics`, dashboard provisioning
- rollbacks: service update rollback hooks and job safety checks
Tests
- T7.1 Stack YAML parses and validates (unit test)
- T7.2 Swarm smoke test (requires `CONTROL_TEST_DOCKER=1`)
  - deploy stacks
- wait for healthy state
- verify Grafana dashboards provisioned and VictoriaMetrics receives samples
- T7.3 End-to-end “control plane can see the fleet” test (requires docker)
- UI/API can query placement + health snapshots for all services