Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
341
control/DEVELOPMENT_PLAN.md
Normal file
@@ -0,0 +1,341 @@
# Development Plan: Control Plane (Admin UI + Observability + Production Ops)

## Overview

This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:
- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones

**Development Approach:**
1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. All tests must pass before moving to the next milestone
4. Mark tasks complete with `[x]` as you progress

This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: [gateway/DEVELOPMENT_PLAN.md](file:///Users/vlad/Developer/cloudlysis/gateway/DEVELOPMENT_PLAN.md), [runner/DEVELOPMENT_PLAN.md](file:///Users/vlad/Developer/cloudlysis/runner/DEVELOPMENT_PLAN.md)).

---

## Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)

**Goal:** Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.

### Tasks
- [x] **0.1** Define canonical local commands for the repo
  - UI:
    - `npm run lint`
    - `npm run typecheck`
    - `npm run test`
    - `npm run build`
  - Control Plane API:
    - `cargo test`
    - `cargo fmt --check`
    - `cargo clippy -- -D warnings`
    - `cargo run -- --help`
  - Docker/Swarm:
    - `docker compose config` validation for local stacks (if used)
    - `docker stack deploy ...` smoke validation for Swarm (gated, see Tests)
- [x] **0.2** Add a minimal CI workflow that runs the same commands as **0.1**
- [x] **0.3** Define integration-test gating conventions
  - Docker/Swarm integration tests:
    - Mark as ignored by default and run only when `CONTROL_TEST_DOCKER=1` is set
    - Example: `CONTROL_TEST_DOCKER=1 cargo test -- --ignored`
  - NATS-dependent integration tests:
    - Mark as ignored by default and run only when `CONTROL_TEST_NATS_URL` is set
    - Example: `CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored`
- [x] **0.4** Define baseline operational invariants (checklist for later milestones)
  - No privileged action without RBAC + audit event
  - No multi-step operation without idempotency key + job record
  - Always propagate `tenant_id` (when applicable) end-to-end
  - Always propagate request/flow identifiers end-to-end (logs + downstream calls):
    - `x-request-id` (per HTTP request)
    - `x-correlation-id` (per user-visible flow/job; generated by the Gateway when missing)
    - `traceparent` (W3C trace context; started by the Gateway when missing)
  - Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
  - No tenant-level metrics without bounded cardinality rules

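The gating convention in **0.3** can be sketched as follows. Because `cargo test -- --ignored` runs every ignored test, each gated test also has to check its own environment variable and return early when it is absent. The helper names (`docker_tests_enabled`, `nats_test_url`) and the test name `swarm_stack_smoke` are illustrative assumptions, not names taken from the repo.

```rust
/// Docker/Swarm tests run only when CONTROL_TEST_DOCKER is exactly "1".
/// (Hypothetical helper; takes the raw value so it can be unit-tested.)
fn docker_tests_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// NATS tests run only when CONTROL_TEST_NATS_URL holds a non-empty URL.
/// (Hypothetical helper; returns the trimmed URL to connect to.)
fn nats_test_url(value: Option<&str>) -> Option<String> {
    value
        .map(str::trim)
        .filter(|url| !url.is_empty())
        .map(str::to_owned)
}

// Ignored by default; run with: CONTROL_TEST_DOCKER=1 cargo test -- --ignored
#[test]
#[ignore = "requires Docker/Swarm"]
fn swarm_stack_smoke() {
    let flag = std::env::var("CONTROL_TEST_DOCKER").ok();
    if !docker_tests_enabled(flag.as_deref()) {
        // `--ignored` was passed for some other gate (e.g. NATS); skip quietly.
        eprintln!("skipping: CONTROL_TEST_DOCKER is not set to 1");
        return;
    }
    // The real deploy-and-verify steps would go here.
}
```

Keeping the gate logic in plain functions means the convention itself is covered by ordinary unit tests, even on machines without Docker or NATS.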
### Tests
- [x] **T0.1** Tautological test: test harness runs for both subprojects (UI + API)
- [x] **T0.2** Lint + typecheck + unit tests pass
- [x] **T0.3** Docker config validation passes (compose/stack linting tests)

---

## Milestone 1: Admin UI Foundation (UltraBase UX Reuse)

**Goal:** Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.

### Dependencies
- Milestone 0 (repo bootstrap)

### Exit Criteria
- Admin UI builds successfully and passes unit/type checks
- UI navigation skeleton matches the PRD information architecture

### Tasks
- [x] **1.1** Initialize Admin UI project (Vite + React + TypeScript)
  - Choose and wire lint/typecheck/test/build tooling to match the canonical commands in **0.1**
  - Adopt the baseline dependencies used by UltraBase control-plane admin UI where available
  - Establish UI module layout for: components, pages, routes, API client, auth/session utilities
- [x] **1.2** Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
  - Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
- [x] **1.3** Implement navigation skeleton and empty pages (route wiring only)
  - Overview
  - Tenants
  - Users
  - Sessions
  - Roles & Permissions
  - Config
  - Definitions
  - Scale & Placement
  - Deployments
  - Observability
  - Audit Log
  - Settings
- [x] **1.3a** Add correlation-first investigation affordances in the UI skeleton
  - Global search box that accepts `x-request-id`, `x-correlation-id`, or `trace_id`
  - “Investigate” links that open Grafana Explore prefilled for:
    - Loki query scoped to `x-correlation-id` (and `x-request-id` when available)
    - Tempo trace view when a `trace_id` is present
  - Ensure jobs and audit log rows display and copy the relevant ids
- [x] **1.4** Implement API client stub with consistent error handling and request-id propagation
  - Send `x-request-id` on every request (generate one when missing)
  - Send `x-correlation-id` when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
  - Send `traceparent` when continuing an existing trace; otherwise omit and use the Gateway-started trace
  - Echo `x-request-id` and `x-correlation-id` on responses and surface them in error UX
  - Persist the most recent ids in the UI so operators can copy/paste them into support tickets

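The `trace_id` accepted by the search box in **1.3a** is the second field of the W3C `traceparent` header sent in **1.4**. The sketch below shows one way to extract it. It is written in Rust (the API's language) for consistency with the other examples in this plan, although the UI itself would do this in TypeScript. The type and function names are assumptions, and only version `00` of the header is handled.

```rust
/// Fields of a version-00 W3C `traceparent` value:
/// `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
#[derive(Debug, PartialEq)]
struct TraceParent {
    trace_id: String,
    parent_id: String,
    sampled: bool,
}

/// True when `s` is exactly `len` lowercase hexadecimal characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Hypothetical helper: parse a `traceparent` header value, returning None
/// for anything that is not a well-formed version-00 value.
fn parse_traceparent(value: &str) -> Option<TraceParent> {
    let mut parts = value.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // Version 00 has exactly four fields.
    if version != "00" || parts.next().is_some() {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero trace and parent ids are invalid in the specification.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_owned(),
        parent_id: parent_id.to_owned(),
        sampled: flag_bits & 0x01 == 0x01,
    })
}
```

The extracted `trace_id` is the value a Tempo link would be built from; a malformed header simply yields no link.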
### Tests
- [x] **T1.1** UI typecheck passes
- [x] **T1.2** UI build passes
- [x] **T1.3** Routing smoke test: each route renders without runtime errors (headless DOM test)

---

## Milestone 2: Control Plane API Foundation (BFF / Admin API)

**Goal:** Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.

### Dependencies
- Milestone 0 (repo bootstrap)

### Exit Criteria
- Control plane API runs as a container and exposes `/health`, `/ready`, `/metrics`
- Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints

### Tasks
- [x] **2.1** Initialize Control Plane API service
  - Rust (Axum + Tokio + tracing) to align with node ecosystem
  - Baseline endpoints: `GET /health`, `GET /ready`, `GET /metrics`
- [x] **2.2** Add request logging and correlation identifiers
  - `x-request-id` propagation and structured logs (match Gateway conventions)
  - Propagate `x-correlation-id` and `traceparent` on outbound calls
  - Log fields: `request_id`, `correlation_id`, `trace_id`, `principal_id`, `tenant_id` (when applicable)
  - Never log Authorization headers or tokens
- [x] **2.3** Implement authentication and authorization boundary
  - Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
  - Extract principal identity from token claims (at minimum: `sub`, `session_id`)
  - Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
  - Align `x-tenant-id` semantics with Gateway:
    - Tenant-scoped endpoints require `x-tenant-id` and must reject missing/invalid values with 400
    - Platform-scoped endpoints must not depend on `x-tenant-id`
  - Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
    - Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
- [x] **2.4** Define “job” model for multi-step operations (API contract)
  - `POST /admin/v1/jobs/*` returns `job_id`
  - `GET /admin/v1/jobs/{job_id}` returns status + structured steps + errors
  - Require an idempotency key for job creation (`Idempotency-Key` header), and make repeated creates safe

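The `x-tenant-id` rules in **2.3** reduce to a small decision function, sketched below independently of any web framework. The plan does not define what makes a tenant id "invalid", so the rule in `is_valid_tenant_id` (ASCII letters, digits, `-` and `_`, at most 64 bytes) is purely an assumption for illustration, as are all type and function names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RouteScope {
    Tenant,
    Platform,
}

#[derive(Debug, PartialEq)]
enum TenantCheck {
    /// Request may proceed; carries the tenant id to propagate, if any.
    Accepted(Option<String>),
    /// Request must be rejected with HTTP 400 and this message.
    BadRequest(&'static str),
}

/// ASSUMPTION: the real validity rule must match the Gateway's; this one is a placeholder.
fn is_valid_tenant_id(raw: &str) -> bool {
    !raw.is_empty()
        && raw.len() <= 64
        && raw.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_')
}

/// Apply the tenant-scoped vs platform-scoped rules to a raw header value.
fn check_tenant_header(scope: RouteScope, header: Option<&str>) -> TenantCheck {
    match scope {
        // Platform-scoped routes ignore the header entirely.
        RouteScope::Platform => TenantCheck::Accepted(None),
        RouteScope::Tenant => match header.map(str::trim) {
            None => TenantCheck::BadRequest("missing x-tenant-id"),
            Some(id) if is_valid_tenant_id(id) => TenantCheck::Accepted(Some(id.to_owned())),
            Some(_) => TenantCheck::BadRequest("invalid x-tenant-id"),
        },
    }
}
```

In an Axum service this function would typically be called from middleware or an extractor, with `BadRequest` mapped to a 400 response.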
### Tests
- [x] **T2.1** `GET /health` and `GET /ready` return 200
- [x] **T2.2** Unauthorized admin calls return 401/403 consistently
- [x] **T2.3** `x-tenant-id` behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
- [x] **T2.4** Tautological tests: core state types are Send + Sync

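A tautological test like **T2.4** is checked by the compiler rather than at runtime. A minimal sketch, assuming a hypothetical `AppState` type (the real state struct and its fields are not specified in this plan):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared state; field names are placeholders.
#[allow(dead_code)]
struct AppState {
    placement: Arc<RwLock<HashMap<String, String>>>,
    service_name: String,
}

/// Compiles only when `T` is both Send and Sync; does nothing at runtime.
fn assert_send_sync<T: Send + Sync>() {}

/// In a test module this body would sit inside a `#[test]` function.
fn core_state_is_send_sync() {
    assert_send_sync::<AppState>();
    assert_send_sync::<Arc<AppState>>();
}
```

If a later change adds a non-thread-safe field such as `Rc<T>` or `RefCell<T>`, this stops compiling, which is the point of the check.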
---

## Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)

**Goal:** Include a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.

### Dependencies
- Milestone 0 (repo bootstrap)

### Exit Criteria
- Grafana starts with provisioned datasources and dashboards
- vmagent scrapes platform services and VictoriaMetrics can query ingested series
- Loki is available for log queries (when logs are enabled)

### Tasks
- [x] **3.1** Add observability deployment assets modeled after UltraBase
  - Grafana provisioning for datasources and dashboards
  - vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
  - Loki configuration (and optional promtail)
- [x] **3.1a** Add distributed tracing backend and wiring
  - Tempo (or compatible tracing backend) as a Grafana datasource
  - OTLP receiver path (collector/agent) so platform services can emit traces
  - Grafana Explore is provisioned so operators can jump from logs to traces
  - Require the Gateway to accept and propagate `x-correlation-id` and `traceparent` to upstreams, and to include `correlation_id` and `trace_id` in request spans/log fields
- [x] **3.2** Implement the base dashboard set from the PRD
  - Operations overview
  - HTTP detail (Gateway route-level)
  - Logs (Loki)
  - Traces (Tempo)
  - Event bus / JetStream
  - Workers (Runner)
  - Storage (libmdbx + node disk)
  - Cluster / Orchestrator
- [x] **3.3** Add the chosen production-operability dashboards and document required instrumentation
  - Noisy Neighbor & Tenant Health
  - API Regression & Deployment
  - Storage & Event Bus Bottlenecks
  - Infrastructure Exhaustion
  - Standardize build/version labeling across services for correlation (`*_build_info{service,version,git_sha}=1`)

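The `*_build_info{service,version,git_sha}=1` pattern from **3.3** corresponds to one line in the Prometheus text exposition format that vmagent scrapes. The sketch below renders that line by hand; a real service would more likely use a metrics crate. The function names and the `control` metric prefix in the usage example are assumptions.

```rust
/// Escape a label value for the Prometheus text format:
/// backslash, double quote and newline must be escaped.
fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Render `<prefix>_build_info{service="..",version="..",git_sha=".."} 1`.
/// Hypothetical helper; `metric_prefix` is whatever each service chooses.
fn build_info_line(metric_prefix: &str, service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "{}_build_info{{service=\"{}\",version=\"{}\",git_sha=\"{}\"}} 1",
        metric_prefix,
        escape_label_value(service),
        escape_label_value(version),
        escape_label_value(git_sha),
    )
}
```

For example, `build_info_line("control", "control-api", "0.1.0", "abc1234")` yields `control_build_info{service="control-api",version="0.1.0",git_sha="abc1234"} 1`. The value is always `1`; the information lives in the labels, so dashboards can join on `service` to compare versions.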
### Tests
- [x] **T3.1** Grafana provisioning files are syntactically valid
- [x] **T3.2** vmagent config parses and includes all required scrape jobs
- [x] **T3.3** Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
- [x] **T3.4** Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state

---

## Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)

**Goal:** Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.

### Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)

### Exit Criteria
- Admin UI can list tenants and show current placement per service kind
- Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback

### Tasks
- [x] **4.1** Implement placement read APIs
  - Read effective placement from NATS KV (and fallback file for development)
  - Match the Gateway routing config model (placement maps + shard directories + revision semantics)
  - Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (`aggregate_placement`, `projection_placement`, `runner_placement`)
- [x] **4.2** Implement fleet “health snapshot” APIs
  - Query `/health`, `/ready`, `/metrics` from each service endpoint
  - Normalize into a stable UI response shape
- [x] **4.3** Implement Admin UI pages:
  - Scale & Placement (read-only)
  - Tenants (read-only with placement summary)
  - Fleet/Topology views (read-only)

### Tests
- [x] **T4.1** Placement config parsing and snapshot endpoints work
- [x] **T4.2** KV watcher hot-reload swaps placement atomically
- [x] **T4.3** UI pages render with mocked API responses (component-level tests)

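One way to get the atomic swap that **T4.2** checks is to keep the whole placement behind an `Arc` and replace the pointer under a short write lock, so readers always see either the old or the new snapshot and never a mix. The sketch below uses only the standard library; `Placement`, `PlacementStore`, and the field names are assumptions, and the NATS KV watcher that would call `swap_if_newer` is left out.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical immutable placement snapshot with a KV-style revision.
#[derive(Debug, Clone, PartialEq)]
struct Placement {
    revision: u64,
    tenant_to_shard: HashMap<String, String>,
}

struct PlacementStore {
    current: RwLock<Arc<Placement>>,
}

impl PlacementStore {
    fn new(initial: Placement) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Cheap read: clones the Arc, not the map. The caller keeps a
    /// consistent snapshot even if a swap happens afterwards.
    fn snapshot(&self) -> Arc<Placement> {
        let guard = self.current.read().expect("placement lock poisoned");
        Arc::clone(&*guard)
    }

    /// Replace the snapshot only when the revision moves forward.
    /// Returns false for stale or duplicate updates.
    fn swap_if_newer(&self, next: Placement) -> bool {
        let mut guard = self.current.write().expect("placement lock poisoned");
        if next.revision <= guard.revision {
            return false;
        }
        *guard = Arc::new(next);
        true
    }
}
```

The revision check makes replayed or out-of-order watcher events harmless, which fits the "revision semantics" mentioned in **4.1**.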
---

## Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs

**Goal:** Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.

### Dependencies
- Milestone 4 (read-only ops)

### Exit Criteria
- All operational mutations are executed as jobs with audit events
- Every mutation supports preflight planning and clear post-conditions

### Tasks
- [x] **5.1** Implement job orchestration primitives in the API
  - step model, retries, cancellation, timeouts
  - per-tenant locking to avoid concurrent conflicting operations
- [x] **5.2** Implement drain workflow (per service kind where supported)
  - Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
  - Aggregate/projection drain semantics via admin endpoints where available
  - Align drain/readiness semantics with the rebalancing contract in [external_prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/external_prd.md)
- [x] **5.3** Implement migration workflow
  - Plan: drain tenant → update placement → reload routing/config
  - Block unsafe migrations (health/lag/inflight thresholds)
- [x] **5.4** Implement UI mutation flows
  - modal confirmation + reason required
  - job progress view and audit linkage

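The migration plan in **5.3** can be modeled as a pure function from inputs to either an ordered step list or a list of blocking reasons, which is what makes the preflight deterministic. The sketch below is an illustration only: the types, field names, and threshold semantics are assumptions, not the plan's actual contract.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Step {
    DrainTenant { tenant_id: String, from: String },
    UpdatePlacement { tenant_id: String, to: String },
    ReloadRouting,
}

/// Hypothetical health inputs gathered before planning.
struct TenantHealth {
    healthy: bool,
    consumer_lag: u64,
    inflight: u64,
}

/// Hypothetical safety thresholds.
struct SafetyLimits {
    max_lag: u64,
    max_inflight: u64,
}

/// Returns the ordered plan (drain, update placement, reload) or every
/// reason the migration is blocked, always in the same order.
fn plan_migration(
    tenant_id: &str,
    from: &str,
    to: &str,
    health: &TenantHealth,
    limits: &SafetyLimits,
) -> Result<Vec<Step>, Vec<String>> {
    let mut blockers = Vec::new();
    if from == to {
        blockers.push("source and target placement are identical".to_string());
    }
    if !health.healthy {
        blockers.push("tenant is not healthy".to_string());
    }
    if health.consumer_lag > limits.max_lag {
        blockers.push(format!("lag {} exceeds limit {}", health.consumer_lag, limits.max_lag));
    }
    if health.inflight > limits.max_inflight {
        blockers.push(format!("inflight {} exceeds limit {}", health.inflight, limits.max_inflight));
    }
    if !blockers.is_empty() {
        return Err(blockers);
    }
    Ok(vec![
        Step::DrainTenant { tenant_id: tenant_id.to_owned(), from: from.to_owned() },
        Step::UpdatePlacement { tenant_id: tenant_id.to_owned(), to: to.to_owned() },
        Step::ReloadRouting,
    ])
}
```

Because the function has no side effects, the same inputs always produce the same plan, and the UI can show that plan to the operator before a job is created.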
### Tests
- [x] **T5.1** Job idempotency: repeated calls with same idempotency key do not duplicate effects
- [x] **T5.2** Migration plan preflight produces a deterministic action plan
- [x] **T5.3** Safety gates prevent drain/migrate when invariants fail

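The behavior that **T5.1** checks can be sketched with an in-memory store keyed by the `Idempotency-Key` value: a repeated create with the same key and the same parameters returns the original job instead of starting a new one. Treating a reused key with different parameters as a conflict is an assumption (a common convention, not something this plan specifies), and all names below are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct JobRecord {
    job_id: u64,
    kind: String,
    tenant_id: String,
}

#[derive(Debug, PartialEq)]
enum CreateOutcome {
    /// First time this key was seen; a new job was recorded.
    Created(JobRecord),
    /// Same key, same parameters: return the existing job, no new effects.
    Replayed(JobRecord),
    /// Same key, different parameters (ASSUMED to be rejected).
    Conflict,
}

#[derive(Default)]
struct JobStore {
    next_id: u64,
    by_key: HashMap<String, JobRecord>,
}

impl JobStore {
    fn create(&mut self, idempotency_key: &str, kind: &str, tenant_id: &str) -> CreateOutcome {
        if let Some(existing) = self.by_key.get(idempotency_key) {
            if existing.kind == kind && existing.tenant_id == tenant_id {
                return CreateOutcome::Replayed(existing.clone());
            }
            return CreateOutcome::Conflict;
        }
        self.next_id += 1;
        let record = JobRecord {
            job_id: self.next_id,
            kind: kind.to_owned(),
            tenant_id: tenant_id.to_owned(),
        };
        self.by_key.insert(idempotency_key.to_owned(), record.clone());
        CreateOutcome::Created(record)
    }
}
```

A production store would persist records and expire old keys; the in-memory map only demonstrates the lookup-before-create rule.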
---

## Milestone 6: Deployments + Regression Tooling (Swarm-Aware)

**Goal:** Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.

### Dependencies
- Milestone 3 (observability baseline)
- Milestone 5 (job orchestration)

### Exit Criteria
- Deployments can be initiated (or at least observed) via the control plane
- Grafana shows deploy markers; dashboards can compare old vs new versions

### Tasks
- [x] **6.1** Implement Swarm integration (read-only first, then mutations)
  - list services, tasks, images, versions
  - watch update events (start/finish/fail)
- [x] **6.2** Implement deployment annotations/events
  - write Grafana annotations (or emit a deploy event metric) for vertical markers
- [x] **6.3** Implement “API Regression & Deployment” dashboard wiring prerequisites
  - enforce build/version labeling (`*_build_info{service,version,git_sha}=1` pattern)
  - ensure scrape relabeling includes `image_tag` where possible
- [x] **6.4** UI pages
  - Deployments list + detail
  - Per-service “what changed” and “rollback” actions (guarded)

### Tests
- [x] **T6.1** Swarm client abstraction can be mocked and produces deterministic results
- [x] **T6.2** Annotation writer produces expected Grafana payloads
- [x] **T6.3** Version labels are present on all services in a metrics snapshot test

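A metrics snapshot test such as **T6.3** can be approximated by scanning exposition text for `*_build_info` series and reporting which required services have none. The sketch below is a deliberately naive parser: it assumes one sample per line, a `service` label on every build-info series, no other label whose name ends in `service`, and no escaped quotes inside the service name. Function names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Collect the `service` label of every `*_build_info{...}` sample line.
fn services_with_build_info(exposition: &str) -> BTreeSet<String> {
    let mut found = BTreeSet::new();
    for line in exposition.lines() {
        let line = line.trim();
        // Skip comments such as `# HELP` and `# TYPE`.
        if line.starts_with('#') {
            continue;
        }
        // The metric name is everything before the opening brace.
        let Some((name, rest)) = line.split_once('{') else { continue };
        if !name.ends_with("_build_info") {
            continue;
        }
        let Some((_, after)) = rest.split_once("service=\"") else { continue };
        let Some((service, _)) = after.split_once('"') else { continue };
        found.insert(service.to_owned());
    }
    found
}

/// Required services that expose no build-info series, in input order.
fn missing_build_info(exposition: &str, required: &[&str]) -> Vec<String> {
    let found = services_with_build_info(exposition);
    required
        .iter()
        .filter(|service| !found.contains(**service))
        .map(|service| service.to_string())
        .collect()
}
```

The test would then assert that `missing_build_info` returns an empty list for the combined snapshot of all services.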
---

## Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)

**Goal:** Provide a complete Swarm deployment definition for the platform: services in `../` plus the control plane components and the observability stack.

### Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
- Milestone 3 (Observability baseline)
- Milestone 5 (safe mutations baseline)

### Exit Criteria
- `docker stack deploy` brings up:
  - Gateway + Aggregate + Projection + Runner (from `../`)
  - Control Plane API + Admin UI
  - VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
- All services are reachable via overlay networks and pass health checks
- Smoke and integration tests pass end-to-end (gated, but required before milestone completion)

### Tasks
- [x] **7.1** Define Swarm networks, secrets, and configs
  - overlay network segmentation (public vs internal)
  - secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
- [x] **7.2** Define Swarm stack files
  - base platform stack (gateway/aggregate/projection/runner)
  - control plane stack (api + ui)
  - observability stack (vm/vmagent/grafana/loki/promtail)
- [x] **7.3** Define placement constraints and scaling defaults
  - node labels for tenant ranges and infrastructure roles
  - replica defaults and update policies
- [x] **7.4** Define deployment verification and rollback playbooks (as executable checks)
  - post-deploy checks: `/health`, `/ready`, `/metrics`, dashboard provisioning
  - rollbacks: service update rollback hooks and job safety checks

### Tests
- [x] **T7.1** Stack YAML parses and validates (unit test)
- [x] **T7.2** Swarm smoke test (requires `CONTROL_TEST_DOCKER=1`)
  - deploy stacks
  - wait for healthy state
  - verify Grafana dashboards provisioned and VictoriaMetrics receives samples
- [x] **T7.3** End-to-end “control plane can see the fleet” test (requires docker)
  - UI/API can query placement + health snapshots for all services

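The "wait for healthy state" step in **T7.2** is essentially a bounded polling loop. A minimal sketch follows, with the actual health probe (for example an HTTP call to `/ready`) passed in as a closure so the loop stays testable without Docker. The function name and return convention are assumptions.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `check` up to `max_attempts` times, sleeping `delay` between tries.
/// Returns Ok(attempt number that succeeded) or Err(attempts made).
fn wait_until_healthy<F>(mut check: F, max_attempts: u32, delay: Duration) -> Result<u32, u32>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=max_attempts {
        if check() {
            return Ok(attempt);
        }
        // No need to sleep after the final failed attempt.
        if attempt < max_attempts {
            sleep(delay);
        }
    }
    Err(max_attempts)
}
```

A smoke test would call this once per service and fail with the service name when it returns `Err`, which keeps a stuck deploy from hanging CI indefinitely.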