Monorepo consolidation: workspace, shared types, transport plans, docker/swarm assets
Some checks failed
ci / rust (push) Failing after 2m34s
ci / ui (push) Failing after 30s

2026-03-30 11:40:42 +03:00
parent 7e7041cf8b
commit 1298d9a3df
246 changed files with 55434 additions and 0 deletions

control/.gitignore vendored Normal file

@@ -0,0 +1,43 @@
/target/
/target-*/
**/target/
*.rs.bk
*.pdb
*.dSYM/
*.orig
*.rej
*.log
*.swp
*.swo
*~
.DS_Store
.idea/
.vscode/
.env
.env.*
.envrc
.direnv/
docker-compose.override.yml
*.mdbx
*.mdbx-*
*.mdbx-lock
*.mdbx.dat
*.mdbx.lck
*.mdb
*.db
/data/
/tmp/
/ui/node_modules/
/ui/dist/
/ui/dist-ssr/
/ui/.eslintcache
/ui/.vite/
/coverage/
lcov.info
*.profraw
*.profdata

control/DEVELOPMENT_PLAN.md Normal file

@@ -0,0 +1,341 @@
# Development Plan: Control Plane (Admin UI + Observability + Production Ops)
## Overview
This plan breaks down the Control Plane implementation into milestones ordered by dependency. Each milestone includes:
- **Tasks** with clear deliverables
- **Test Requirements** (unit tests + tautological tests + integration tests where applicable)
- **Dependencies** on previous milestones
**Development Approach:**
1. Complete one milestone at a time
2. Write tests before implementation (TDD where applicable)
3. All tests must pass before moving to the next milestone
4. Mark tasks complete with `[x]` as you progress
This plan is intentionally aligned with the style and gating discipline used in sibling repos (see: [gateway/DEVELOPMENT_PLAN.md](file:///Users/vlad/Developer/cloudlysis/gateway/DEVELOPMENT_PLAN.md), [runner/DEVELOPMENT_PLAN.md](file:///Users/vlad/Developer/cloudlysis/runner/DEVELOPMENT_PLAN.md)).
---
## Milestone 0: Repo Bootstrap (Dev Ergonomics + Guardrails)
**Goal:** Establish canonical commands, CI entrypoints, and integration-test gating so later milestones can be executed and verified consistently.
### Tasks
- [x] **0.1** Define canonical local commands for the repo
- UI:
- `npm run lint`
- `npm run typecheck`
- `npm run test`
- `npm run build`
- Control Plane API:
- `cargo test`
- `cargo fmt --check`
- `cargo clippy -- -D warnings`
- `cargo run -- --help`
- Docker/Swarm:
- `docker compose config` validation for local stacks (if used)
- `docker stack deploy ...` smoke validation for Swarm (gated, see Tests)
- [x] **0.2** Add a minimal CI workflow that runs the same commands as **0.1**
- [x] **0.3** Define integration-test gating conventions
- Docker/Swarm integration tests:
- Mark as ignored by default and run only when `CONTROL_TEST_DOCKER=1` is set
- Example: `CONTROL_TEST_DOCKER=1 cargo test -- --ignored`
- NATS-dependent integration tests:
- Mark as ignored by default and run only when `CONTROL_TEST_NATS_URL` is set
- Example: `CONTROL_TEST_NATS_URL=nats://127.0.0.1:4222 cargo test -- --ignored`
- [x] **0.4** Define baseline operational invariants (checklist for later milestones)
- No privileged action without RBAC + audit event
- No multi-step operation without idempotency key + job record
- Always propagate `tenant_id` (when applicable) end-to-end
- Always propagate request/flow identifiers end-to-end (logs + downstream calls):
- `x-request-id` (per HTTP request)
- `x-correlation-id` (per user-visible flow/job; generated by the Gateway when missing)
- `traceparent` (W3C trace context; started by the Gateway when missing)
- Secrets never appear in logs (Authorization headers, tokens, credentials, Grafana admin creds)
- No tenant-level metrics without bounded cardinality rules
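The env-var gating convention from task 0.3 can be sketched as a tiny helper. This is illustrative only — `docker_tests_enabled` is not a real function in the codebase; the actual tests are `#[ignore]`d and keyed off `CONTROL_TEST_DOCKER` as described above.

```rust
use std::env;

// Pure decision function so the gating rule itself is unit-testable
// without mutating process environment (illustrative sketch).
fn docker_tests_enabled_from(val: Option<String>) -> bool {
    val.map(|v| v == "1").unwrap_or(false)
}

// Gated integration tests call this and self-skip when it is false,
// in addition to being marked #[ignore] by default.
fn docker_tests_enabled() -> bool {
    docker_tests_enabled_from(env::var("CONTROL_TEST_DOCKER").ok())
}

fn main() {
    if docker_tests_enabled() {
        println!("docker integration tests enabled");
    } else {
        println!("skipping docker integration tests");
    }
}
```

The same pattern applies to the NATS gate, keyed off `CONTROL_TEST_NATS_URL` being present rather than equal to `"1"`.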
### Tests
- [x] **T0.1** Tautological test: test harness runs for both subprojects (UI + API)
- [x] **T0.2** Lint + typecheck + unit tests pass
- [x] **T0.3** Docker config validation passes (compose/stack linting tests)
---
## Milestone 1: Admin UI Foundation (UltraBase UX Reuse)
**Goal:** Bring up the Admin UI with the UltraBase component system and navigation skeleton, adapted to Cloudlysis page structure.
### Dependencies
- Milestone 0 (repo bootstrap)
### Exit Criteria
- Admin UI builds successfully and passes unit/type checks
- UI navigation skeleton matches the PRD information architecture
### Tasks
- [x] **1.1** Initialize Admin UI project (Vite + React + TypeScript)
- Choose and wire lint/typecheck/test/build tooling to match the canonical commands in **0.1**
- Adopt the baseline dependencies used by UltraBase control-plane admin UI where available
- Establish UI module layout for: components, pages, routes, API client, auth/session utilities
- [x] **1.2** Reuse UltraBase UI primitives and styling tokens (adapted, not forked blindly)
- Buttons, inputs, tables, dropdowns, modal, toast, breadcrumbs
- [x] **1.3** Implement navigation skeleton and empty pages (route wiring only)
- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings
- [x] **1.3a** Add correlation-first investigation affordances in the UI skeleton
- Global search box that accepts `x-request-id`, `x-correlation-id`, or `trace_id`
- “Investigate” links that open Grafana Explore prefilled for:
- Loki query scoped to `x-correlation-id` (and `x-request-id` when available)
- Tempo trace view when a `trace_id` is present
- Ensure jobs and audit log rows display and copy the relevant ids
- [x] **1.4** Implement API client stub with consistent error handling and request-id propagation
- Send `x-request-id` on every request (generate one when missing)
- Send `x-correlation-id` when continuing an existing UI flow; otherwise omit and use the Gateway-generated value returned in responses
- Send `traceparent` when continuing an existing trace; otherwise omit and use the Gateway-started trace
- Echo `x-request-id` and `x-correlation-id` on responses and surface them in error UX
- Persist the most recent ids in the UI so operators can copy/paste them into support tickets
### Tests
- [x] **T1.1** UI typecheck passes
- [x] **T1.2** UI build passes
- [x] **T1.3** Routing smoke test: each route renders without runtime errors (headless DOM test)
---
## Milestone 2: Control Plane API Foundation (BFF / Admin API)
**Goal:** Provide the minimal API surface required for the Admin UI to authenticate, read core state, and display health/metrics.
### Dependencies
- Milestone 0 (repo bootstrap)
### Exit Criteria
- Control plane API runs as a container and exposes `/health`, `/ready`, `/metrics`
- Auth integration contract is defined (Gateway as source of truth) and enforced on admin endpoints
### Tasks
- [x] **2.1** Initialize Control Plane API service
- Rust (Axum + Tokio + tracing) to align with the rest of the platform services
- Baseline endpoints: `GET /health`, `GET /ready`, `GET /metrics`
- [x] **2.2** Add request logging and correlation identifiers
- `x-request-id` propagation and structured logs (match Gateway conventions)
- Propagate `x-correlation-id` and `traceparent` on outbound calls
- Log fields: `request_id`, `correlation_id`, `trace_id`, `principal_id`, `tenant_id` (when applicable)
- Never log Authorization headers or tokens
- [x] **2.3** Implement authentication and authorization boundary
- Validate Gateway-issued access tokens (same signing config as Gateway; Control does not mint tokens)
- Extract principal identity from token claims (at minimum: `sub`, `session_id`)
- Enforce permissions at the API boundary (deny-by-default, rights strings stored in Gateway IAM state)
- Align `x-tenant-id` semantics with Gateway:
- Tenant-scoped endpoints require `x-tenant-id` and must reject missing/invalid values with 400
- Platform-scoped endpoints must not depend on `x-tenant-id`
- Prefer proxying to Gateway for IAM CRUD instead of duplicating identity/RBAC state:
- Control API may expose a thin BFF surface, but must preserve Gateway status codes and error text for pass-through routes
- [x] **2.4** Define “job” model for multi-step operations (API contract)
- `POST /admin/v1/jobs/*` returns `job_id`
- `GET /admin/v1/jobs/{job_id}` returns status + structured steps + errors
- Require an idempotency key for job creation (`Idempotency-Key` header), and make repeated creates safe
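The idempotency contract in 2.4 can be sketched with a minimal in-memory store. Types here are illustrative stand-ins (the real store lives in `control/api/src/jobs.rs` and keys jobs by UUID): repeated creates with the same `Idempotency-Key` must return the same job id.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch of idempotent job creation: the first create
// under a key allocates a job id; later creates under the same key
// return that id instead of producing a duplicate job.
#[derive(Default)]
struct JobStore {
    by_key: Mutex<HashMap<String, u64>>,
    next_id: Mutex<u64>,
}

impl JobStore {
    fn insert_idempotent(&self, key: &str) -> u64 {
        let mut by_key = self.by_key.lock().unwrap();
        *by_key.entry(key.to_string()).or_insert_with(|| {
            let mut next = self.next_id.lock().unwrap();
            *next += 1;
            *next
        })
    }
}
```

In the real API the body under a reused key should also be validated against the original request, so a key collision with different parameters can be rejected rather than silently replayed.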
### Tests
- [x] **T2.1** `GET /health` and `GET /ready` return 200
- [x] **T2.2** Unauthorized admin calls return 401/403 consistently
- [x] **T2.3** `x-tenant-id` behavior matches Gateway rules (400 on missing/invalid for tenant-scoped routes)
- [x] **T2.4** Tautological tests: core state types are Send + Sync
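The "tautological" Send + Sync test in T2.4 is a compile-time assertion: if the call compiles, the property holds. `AppStateLike` below is an illustrative stand-in for the real `AppState`.

```rust
// Compiles only when T: Send + Sync, so the assertion happens at
// compile time; the runtime body is intentionally empty.
fn assert_send_sync<T: Send + Sync>() {}

// Illustrative stand-in for the shared application state type.
#[derive(Clone, Default)]
struct AppStateLike {
    service_name: String,
}

fn main() {
    assert_send_sync::<AppStateLike>();
    println!("state type is Send + Sync: {}", AppStateLike::default().service_name.is_empty());
}
```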
---
## Milestone 3: Observability Stack Baseline (VM + Loki + Grafana)
**Goal:** Include a production-grade observability stack with version-controlled provisioning and Cloudlysis dashboard placeholders wired to existing service metrics.
### Dependencies
- Milestone 0 (repo bootstrap)
### Exit Criteria
- Grafana starts with provisioned datasources and dashboards
- vmagent scrapes platform services and VictoriaMetrics can query ingested series
- Loki is available for log queries (when logs are enabled)
### Tasks
- [x] **3.1** Add observability deployment assets modeled after UltraBase
- Grafana provisioning for datasources and dashboards
- vmagent scrape configs for Cloudlysis services + node/Swarm exporters (where applicable)
- Loki configuration (and optional promtail)
- [x] **3.1a** Add distributed tracing backend and wiring
- Tempo (or compatible tracing backend) as a Grafana datasource
- OTLP receiver path (collector/agent) so platform services can emit traces
- Grafana Explore is provisioned so operators can jump from logs to traces
- Require the Gateway to accept and propagate `x-correlation-id` and `traceparent` to upstreams, and to include `correlation_id` and `trace_id` in request spans/log fields
- [x] **3.2** Implement the base dashboard set from the PRD
- Operations overview
- HTTP detail (Gateway route-level)
- Logs (Loki)
- Traces (Tempo)
- Event bus / JetStream
- Workers (Runner)
- Storage (libmdbx + node disk)
- Cluster / Orchestrator
- [x] **3.3** Add the chosen production-operability dashboards and document required instrumentation
- Noisy Neighbor & Tenant Health
- API Regression & Deployment
- Storage & Event Bus Bottlenecks
- Infrastructure Exhaustion
- Standardize build/version labeling across services for correlation (`*_build_info{service,version,git_sha}=1`)
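The standardized build/version series from 3.3 can be rendered in Prometheus text exposition format like this (label values below are illustrative):

```rust
// Produces one line of Prometheus text exposition for the
// `*_build_info{service,version,git_sha} 1` convention named above.
fn build_info_line(service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "{service}_build_info{{service=\"{service}\",version=\"{version}\",git_sha=\"{git_sha}\"}} 1"
    )
}

fn main() {
    println!("{}", build_info_line("gateway", "1.4.0", "abc1234"));
}
```

Keeping the value constant at `1` makes the series cheap and lets dashboards join it against request metrics by `service` to answer "what changed when".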
### Tests
- [x] **T3.1** Grafana provisioning files are syntactically valid
- [x] **T3.2** vmagent config parses and includes all required scrape jobs
- [x] **T3.3** Tempo (or chosen tracing backend) reaches healthy state in the stack smoke test (gated)
- [x] **T3.4** Container startup smoke test (compose or Swarm, gated): Grafana + VictoriaMetrics + Loki reach healthy state
---
## Milestone 4: Tenant + Placement Visibility (Read-Only Ops First)
**Goal:** Provide safe, read-only visibility into tenant placement and runtime health across Aggregate/Projection/Runner/Gateway, matching existing placement semantics.
### Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
### Exit Criteria
- Admin UI can list tenants and show current placement per service kind
- Placement is sourced from the production control-plane substrate (NATS KV) with a development fallback
### Tasks
- [x] **4.1** Implement placement read APIs
- Read effective placement from NATS KV (and fallback file for development)
- Match the Gateway routing config model (placement maps + shard directories + revision semantics)
- Support per-service-kind placement maps (Aggregate, Projection, Runner) using the same naming conventions used elsewhere (`aggregate_placement`, `projection_placement`, `runner_placement`)
- [x] **4.2** Implement fleet “health snapshot” APIs
- Query `/health`, `/ready`, `/metrics` from each service endpoint
- Normalize into a stable UI response shape
- [x] **4.3** Implement Admin UI pages:
- Scale & Placement (read-only)
- Tenants (read-only with placement summary)
- Fleet/Topology views (read-only)
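The per-service-kind placement naming in 4.1 can be sketched as a small mapping (illustrative; the real enum lives in `control/api/src/placement.rs`):

```rust
// Each service kind maps to the placement map name used in NATS KV
// and the development fallback file, per the conventions above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServiceKind {
    Aggregate,
    Projection,
    Runner,
}

fn placement_key(kind: ServiceKind) -> &'static str {
    match kind {
        ServiceKind::Aggregate => "aggregate_placement",
        ServiceKind::Projection => "projection_placement",
        ServiceKind::Runner => "runner_placement",
    }
}
```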
### Tests
- [x] **T4.1** Placement config parsing and snapshot endpoints work
- [x] **T4.2** KV watcher hot-reload swaps placement atomically
- [x] **T4.3** UI pages render with mocked API responses (component-level tests)
---
## Milestone 5: Safe Mutations (Drain, Migrate, Reload) via Idempotent Jobs
**Goal:** Implement the first high-impact operational workflows with strict guardrails: tenant drain, placement update, and reload.
### Dependencies
- Milestone 4 (read-only ops)
### Exit Criteria
- All operational mutations are executed as jobs with audit events
- Every mutation supports preflight planning and clear post-conditions
### Tasks
- [x] **5.1** Implement job orchestration primitives in the API
- step model, retries, cancellation, timeouts
- per-tenant locking to avoid concurrent conflicting operations
- [x] **5.2** Implement drain workflow (per service kind where supported)
- Runner tenant drain semantics (stop acquiring new work, wait for inflight to converge)
- Aggregate/projection drain semantics via admin endpoints where available
- Align drain/readiness semantics with the rebalancing contract in [external_prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/external_prd.md)
- [x] **5.3** Implement migration workflow
- Plan: drain tenant → update placement → reload routing/config
- Block unsafe migrations (health/lag/inflight thresholds)
- [x] **5.4** Implement UI mutation flows
- modal confirmation + reason required
- job progress view and audit linkage
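The per-tenant locking from 5.1 can be sketched as follows (illustrative: tenant ids are `u64` here, `Uuid` in the real code). A second conflicting start on the same tenant fails fast, which the API maps to HTTP 409.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Holds the set of tenants with an in-flight mutating job; acquisition
// is atomic under the mutex, so two jobs cannot both win the same tenant.
#[derive(Default)]
struct TenantLocks {
    held: Mutex<HashSet<u64>>,
}

impl TenantLocks {
    /// Returns true if the lock was acquired, false if already held.
    fn try_acquire(&self, tenant: u64) -> bool {
        self.held.lock().unwrap().insert(tenant)
    }

    /// Called when the job finishes, fails, or is cancelled.
    fn release(&self, tenant: u64) {
        self.held.lock().unwrap().remove(&tenant);
    }
}
```

Release must be tied to job completion in all paths (success, error, cancel), or a crashed job would leave the tenant permanently locked.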
### Tests
- [x] **T5.1** Job idempotency: repeated calls with same idempotency key do not duplicate effects
- [x] **T5.2** Migration plan preflight produces a deterministic action plan
- [x] **T5.3** Safety gates prevent drain/migrate when invariants fail
---
## Milestone 6: Deployments + Regression Tooling (Swarm-Aware)
**Goal:** Make deployments and regressions observable and controllable from the control plane, with strong “what changed when” correlation.
### Dependencies
- Milestone 3 (observability baseline)
- Milestone 5 (job orchestration)
### Exit Criteria
- Deployments can be initiated (or at least observed) via the control plane
- Grafana shows deploy markers; dashboards can compare old vs new versions
### Tasks
- [x] **6.1** Implement Swarm integration (read-only first, then mutations)
- list services, tasks, images, versions
- watch update events (start/finish/fail)
- [x] **6.2** Implement deployment annotations/events
- write Grafana annotations (or emit a deploy event metric) for vertical markers
- [x] **6.3** Implement “API Regression & Deployment” dashboard wiring prerequisites
- enforce build/version labeling (`*_build_info{service,version,git_sha}=1` pattern)
- ensure scrape relabeling includes `image_tag` where possible
- [x] **6.4** UI pages
- Deployments list + detail
- Per-service “what changed” and “rollback” actions (guarded)
### Tests
- [x] **T6.1** Swarm client abstraction can be mocked and produces deterministic results
- [x] **T6.2** Annotation writer produces expected Grafana payloads
- [x] **T6.3** Version labels are present on all services in a metrics snapshot test
---
## Milestone 7: Full Docker Swarm Deployment (Platform + Observability + Control Plane)
**Goal:** Provide a complete Swarm deployment definition for the platform: services in `../` plus the control plane components and the observability stack.
### Dependencies
- Milestone 1 (Admin UI foundation)
- Milestone 2 (Control Plane API foundation)
- Milestone 3 (Observability baseline)
- Milestone 5 (safe mutations baseline)
### Exit Criteria
- `docker stack deploy` brings up:
- Gateway + Aggregate + Projection + Runner (from `../`)
- Control Plane API + Admin UI
- VictoriaMetrics + vmagent + Grafana + Loki (+ optional promtail)
- All services are reachable via overlay networks and pass health checks
- Smoke and integration tests pass end-to-end (gated, but required before milestone completion)
### Tasks
- [x] **7.1** Define Swarm networks, secrets, and configs
- overlay network segmentation (public vs internal)
- secrets for auth/signing keys, NATS credentials (if used), Grafana admin creds (or provisioning)
- [x] **7.2** Define Swarm stack files
- base platform stack (gateway/aggregate/projection/runner)
- control plane stack (api + ui)
- observability stack (vm/vmagent/grafana/loki/promtail)
- [x] **7.3** Define placement constraints and scaling defaults
- node labels for tenant ranges and infrastructure roles
- replica defaults and update policies
- [x] **7.4** Define deployment verification and rollback playbooks (as executable checks)
- post-deploy checks: `/health`, `/ready`, `/metrics`, dashboard provisioning
- rollbacks: service update rollback hooks and job safety checks
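The post-deploy verification loop from 7.4 reduces to "poll a check until it passes or a deadline expires". A minimal sketch, with `check` standing in for an HTTP GET against `/ready` (illustrative, not the real playbook code):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Polls `check` at the given interval until it returns true or the
// timeout elapses; returns whether the service became healthy in time.
fn wait_healthy<F: FnMut() -> bool>(mut check: F, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

In a rollback playbook the same helper runs against the previous image's tasks after `docker service rollback`, so the rollback is verified by the same criteria as the deploy.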
### Tests
- [x] **T7.1** Stack YAML parses and validates (unit test)
- [x] **T7.2** Swarm smoke test (requires `CONTROL_TEST_DOCKER=1`)
- deploy stacks
- wait for healthy state
- verify Grafana dashboards provisioned and VictoriaMetrics receives samples
- [x] **T7.3** End-to-end “control plane can see the fleet” test (requires docker)
- UI/API can query placement + health snapshots for all services

control/api/Cargo.toml Normal file

@@ -0,0 +1,25 @@
[package]
name = "api"
version = "0.1.0"
edition = "2024"
publish = ["madapes"]
[dependencies]
axum = "0.8.6"
clap = { version = "4.5.48", features = ["derive", "env"] }
jsonwebtoken = "9.3.1"
metrics = "0.23.0"
metrics-exporter-prometheus = "0.16.0"
reqwest = { version = "0.12.23", default-features = false, features = ["json", "rustls-tls"] }
serde = { version = "1.0.228", features = ["derive"] }
serde_json = "1.0.149"
thiserror = "2.0.16"
tokio = { version = "1.45.0", features = ["macros", "net", "process", "rt-multi-thread", "signal", "time"] }
tower-http = { version = "0.6.6", features = ["trace"] }
tracing = "0.1.41"
tracing-subscriber = { version = "0.3.20", features = ["env-filter"] }
uuid = { version = "1.18.1", features = ["serde", "v4"] }
[dev-dependencies]
serde_yaml = "0.9.34"
tower = "0.5.2"

control/api/src/admin.rs Normal file

@@ -0,0 +1,417 @@
use crate::{
AppState, RequestIds,
auth::{Principal, has_permission},
fleet,
job_engine::{JobEngine, StartJobError},
jobs::{Job, JobStatus, JobStep},
placement::{PlacementResponse, ServiceKind},
swarm::{SwarmService, SwarmTask},
};
use axum::{
Json, Router,
extract::{Extension, Path, State},
http::{HeaderMap, StatusCode},
response::IntoResponse,
routing::{get, post},
};
use serde::Deserialize;
use std::time::{SystemTime, UNIX_EPOCH};
use uuid::Uuid;
const HEADER_IDEMPOTENCY_KEY: &str = "idempotency-key";
const HEADER_TENANT_ID: &str = "x-tenant-id";
pub fn admin_router() -> Router<AppState> {
Router::new()
.route("/whoami", get(whoami))
.route("/platform/info", get(platform_info))
.route("/fleet/snapshot", get(fleet_snapshot))
.route("/tenants", get(list_tenants))
.route("/placement/{kind}", get(get_placement))
.route("/tenants/echo", get(tenant_echo))
.route("/jobs/echo", post(create_echo_job))
.route("/jobs/{job_id}", get(get_job))
.route("/jobs/{job_id}/cancel", post(cancel_job))
.route("/jobs/tenant/drain", post(start_tenant_drain))
.route("/jobs/tenant/migrate", post(start_tenant_migrate))
.route("/plan/tenant/migrate", post(plan_tenant_migrate))
.route("/audit", get(list_audit))
.route("/swarm/services", get(list_swarm_services))
.route("/swarm/services/{name}/tasks", get(list_swarm_tasks))
}
async fn whoami(Extension(principal): Extension<Principal>) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
(
StatusCode::OK,
Json(serde_json::json!({
"sub": principal.sub,
"session_id": principal.session_id,
"permissions": principal.permissions,
})),
)
.into_response()
}
async fn platform_info(Extension(principal): Extension<Principal>) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
(
StatusCode::OK,
Json(serde_json::json!({
"service": "control-api",
})),
)
.into_response()
}
async fn fleet_snapshot(
State(state): State<AppState>,
Extension(principal): Extension<Principal>,
Extension(request_ids): Extension<RequestIds>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let services =
fleet::snapshot_with_context(&state.http, &state.fleet_services, Some(&request_ids)).await;
(
StatusCode::OK,
Json(serde_json::json!({ "services": services })),
)
.into_response()
}
async fn get_placement(
State(state): State<AppState>,
Path(kind): Path<String>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let kind = match kind.as_str() {
"aggregate" => ServiceKind::Aggregate,
"projection" => ServiceKind::Projection,
"runner" => ServiceKind::Runner,
_ => return StatusCode::NOT_FOUND.into_response(),
};
let resp: PlacementResponse = state.placement.get_for_kind(kind);
(StatusCode::OK, Json(resp)).into_response()
}
async fn list_tenants(
State(state): State<AppState>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let tenants = state.placement.tenant_summaries();
(
StatusCode::OK,
Json(serde_json::json!({ "tenants": tenants })),
)
.into_response()
}
async fn tenant_echo(
headers: HeaderMap,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let tenant_id = headers
.get(HEADER_TENANT_ID)
.and_then(|v| v.to_str().ok())
.ok_or(StatusCode::BAD_REQUEST)
.and_then(|s| Uuid::parse_str(s).map_err(|_| StatusCode::BAD_REQUEST));
match tenant_id {
Ok(tenant_id) => (
StatusCode::OK,
Json(serde_json::json!({
"tenant_id": tenant_id,
})),
)
.into_response(),
Err(status) => status.into_response(),
}
}
async fn create_echo_job(
State(state): State<AppState>,
headers: HeaderMap,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:write") {
return StatusCode::FORBIDDEN.into_response();
}
let key = headers
.get(HEADER_IDEMPOTENCY_KEY)
.and_then(|v| v.to_str().ok())
.ok_or(StatusCode::BAD_REQUEST);
let key = match key {
Ok(k) if !k.is_empty() => k,
_ => return StatusCode::BAD_REQUEST.into_response(),
};
let now = now_ms();
let job_id = Uuid::new_v4();
let job = Job {
job_id,
status: JobStatus::Succeeded,
steps: vec![JobStep {
name: "echo".to_string(),
status: JobStatus::Succeeded,
attempts: 1,
error: None,
}],
error: None,
created_at_ms: now,
started_at_ms: Some(now),
finished_at_ms: Some(now),
};
let job_id = state.jobs.insert_idempotent(key, job);
state.audit.record(crate::audit::AuditEvent {
ts_ms: now,
principal_sub: principal.sub.clone(),
action: "job.echo".to_string(),
tenant_id: None,
reason: "echo".to_string(),
job_id: Some(job_id),
});
(
StatusCode::OK,
Json(serde_json::json!({
"job_id": job_id,
})),
)
.into_response()
}
async fn get_job(
State(state): State<AppState>,
Path(job_id): Path<Uuid>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
match state.jobs.get(job_id) {
Some(job) => (StatusCode::OK, Json(job)).into_response(),
None => StatusCode::NOT_FOUND.into_response(),
}
}
#[derive(Debug, Deserialize)]
struct TenantDrainRequest {
tenant_id: Uuid,
reason: String,
}
#[derive(Debug, Deserialize)]
struct TenantMigrateRequest {
tenant_id: Uuid,
runner_target: String,
reason: String,
}
async fn start_tenant_drain(
State(state): State<AppState>,
headers: HeaderMap,
Extension(principal): Extension<Principal>,
Json(body): Json<TenantDrainRequest>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:write") {
return StatusCode::FORBIDDEN.into_response();
}
let key = headers
.get(HEADER_IDEMPOTENCY_KEY)
.and_then(|v| v.to_str().ok())
.ok_or(StatusCode::BAD_REQUEST);
let key = match key {
Ok(k) if !k.is_empty() => k,
_ => return StatusCode::BAD_REQUEST.into_response(),
};
let engine = JobEngine::new(
state.jobs.clone(),
state.audit.clone(),
state.tenant_locks.clone(),
);
let job_id = match engine.start_tenant_drain(
state.clone(),
&principal,
body.tenant_id,
body.reason,
key,
) {
Ok(id) => id,
Err(StartJobError::TenantLocked) => return StatusCode::CONFLICT.into_response(),
};
(
StatusCode::OK,
Json(serde_json::json!({ "job_id": job_id })),
)
.into_response()
}
async fn start_tenant_migrate(
State(state): State<AppState>,
headers: HeaderMap,
Extension(principal): Extension<Principal>,
Json(body): Json<TenantMigrateRequest>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:write") {
return StatusCode::FORBIDDEN.into_response();
}
let key = headers
.get(HEADER_IDEMPOTENCY_KEY)
.and_then(|v| v.to_str().ok())
.ok_or(StatusCode::BAD_REQUEST);
let key = match key {
Ok(k) if !k.is_empty() => k,
_ => return StatusCode::BAD_REQUEST.into_response(),
};
let engine = JobEngine::new(
state.jobs.clone(),
state.audit.clone(),
state.tenant_locks.clone(),
);
let job_id = match engine.start_tenant_migrate(
state.clone(),
&principal,
body.tenant_id,
body.runner_target,
body.reason,
key,
) {
Ok(id) => id,
Err(StartJobError::TenantLocked) => return StatusCode::CONFLICT.into_response(),
};
(
StatusCode::OK,
Json(serde_json::json!({ "job_id": job_id })),
)
.into_response()
}
async fn cancel_job(
State(state): State<AppState>,
Path(job_id): Path<Uuid>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:write") {
return StatusCode::FORBIDDEN.into_response();
}
if state.jobs.request_cancel(job_id) {
state.audit.record(crate::audit::AuditEvent {
ts_ms: now_ms(),
principal_sub: principal.sub.clone(),
action: "job.cancel".to_string(),
tenant_id: None,
reason: "cancel requested".to_string(),
job_id: Some(job_id),
});
StatusCode::OK.into_response()
} else {
StatusCode::NOT_FOUND.into_response()
}
}
fn now_ms() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_millis() as u64
}
async fn list_audit(
State(state): State<AppState>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let events = state.audit.list_recent(200);
(
StatusCode::OK,
Json(serde_json::json!({ "events": events })),
)
.into_response()
}
async fn plan_tenant_migrate(
Extension(principal): Extension<Principal>,
Json(body): Json<TenantMigrateRequest>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:write") {
return StatusCode::FORBIDDEN.into_response();
}
let _ = (body.tenant_id, body.runner_target, body.reason);
(
StatusCode::OK,
Json(serde_json::json!({
"steps": ["preflight", "drain", "update_placement", "reload", "verify"]
})),
)
.into_response()
}
async fn list_swarm_services(
State(state): State<AppState>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let services: Vec<SwarmService> = state.swarm.list_services();
(
StatusCode::OK,
Json(serde_json::json!({ "services": services })),
)
.into_response()
}
async fn list_swarm_tasks(
State(state): State<AppState>,
Path(name): Path<String>,
Extension(principal): Extension<Principal>,
) -> impl IntoResponse {
if !has_permission(&principal, "control:read") {
return StatusCode::FORBIDDEN.into_response();
}
let tasks: Vec<SwarmTask> = state.swarm.list_tasks(&name);
(
StatusCode::OK,
Json(serde_json::json!({ "service": name, "tasks": tasks })),
)
.into_response()
}

control/api/src/audit.rs Normal file

@@ -0,0 +1,31 @@
use serde::{Deserialize, Serialize};
use std::sync::{Arc, Mutex};
use uuid::Uuid;
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct AuditEvent {
pub ts_ms: u64,
pub principal_sub: String,
pub action: String,
pub tenant_id: Option<Uuid>,
pub reason: String,
pub job_id: Option<Uuid>,
}
#[derive(Clone, Default)]
pub struct AuditStore {
inner: Arc<Mutex<Vec<AuditEvent>>>,
}
impl AuditStore {
pub fn record(&self, event: AuditEvent) {
let mut events = self.inner.lock().expect("audit lock poisoned");
events.push(event);
}
pub fn list_recent(&self, limit: usize) -> Vec<AuditEvent> {
let events = self.inner.lock().expect("audit lock poisoned");
let start = events.len().saturating_sub(limit);
events[start..].to_vec()
}
}

control/api/src/auth.rs Normal file

@@ -0,0 +1,78 @@
use crate::AppState;
use axum::{
extract::State,
http::{Request, StatusCode},
middleware::Next,
response::{IntoResponse, Response},
};
use jsonwebtoken::{Algorithm, DecodingKey, Validation, decode};
use serde::{Deserialize, Serialize};
#[derive(Clone)]
pub struct AuthConfig {
pub hs256_secret: Option<Vec<u8>>,
}
#[derive(Clone, Debug)]
pub struct Principal {
pub sub: String,
pub session_id: String,
pub permissions: Vec<String>,
}
#[derive(Debug, Serialize, Deserialize)]
struct Claims {
sub: String,
session_id: String,
permissions: Vec<String>,
exp: usize,
}
pub async fn auth_middleware(
State(state): State<AppState>,
mut req: Request<axum::body::Body>,
next: Next,
) -> Response {
match authenticate(
&state.auth,
req.headers().get(axum::http::header::AUTHORIZATION),
) {
Ok(principal) => {
req.extensions_mut().insert(principal);
next.run(req).await
}
Err(status) => status.into_response(),
}
}
fn authenticate(
cfg: &AuthConfig,
auth_header: Option<&axum::http::HeaderValue>,
) -> Result<Principal, StatusCode> {
let secret = cfg
.hs256_secret
.as_ref()
.ok_or(StatusCode::SERVICE_UNAVAILABLE)?;
let header = auth_header.ok_or(StatusCode::UNAUTHORIZED)?;
let header_str = header.to_str().map_err(|_| StatusCode::UNAUTHORIZED)?;
let token = header_str
.strip_prefix("Bearer ")
.ok_or(StatusCode::UNAUTHORIZED)?;
let mut validation = Validation::new(Algorithm::HS256);
validation.required_spec_claims.insert("exp".to_string());
let data = decode::<Claims>(token, &DecodingKey::from_secret(secret), &validation)
.map_err(|_| StatusCode::UNAUTHORIZED)?;
Ok(Principal {
sub: data.claims.sub,
session_id: data.claims.session_id,
permissions: data.claims.permissions,
})
}
pub fn has_permission(principal: &Principal, permission: &str) -> bool {
principal.permissions.iter().any(|p| p == permission)
}


@@ -0,0 +1,57 @@
use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct BuildInfo {
pub service: String,
pub version: String,
pub git_sha: String,
}
pub fn extract_build_info(metrics: &str) -> Vec<BuildInfo> {
let mut out = Vec::new();
for line in metrics.lines() {
let line = line.trim();
if line.is_empty() || line.starts_with('#') {
continue;
}
let Some((metric_and_labels, value)) = line.split_once(' ') else {
continue;
};
if value.trim() != "1" {
continue;
}
if !metric_and_labels.ends_with('}') {
continue;
}
let Some((name, labels)) = metric_and_labels.split_once('{') else {
continue;
};
if !name.ends_with("_build_info") {
continue;
}
let labels = labels.trim_end_matches('}');
let mut service = None;
let mut version = None;
let mut git_sha = None;
for part in labels.split(',') {
let Some((k, v)) = part.split_once('=') else {
continue;
};
let v = v.trim().trim_matches('"');
match k.trim() {
"service" => service = Some(v.to_string()),
"version" => version = Some(v.to_string()),
"git_sha" => git_sha = Some(v.to_string()),
_ => {}
}
}
if let (Some(service), Some(version), Some(git_sha)) = (service, version, git_sha) {
out.push(BuildInfo {
service,
version,
git_sha,
});
}
}
out
}


@@ -0,0 +1,42 @@
use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct GrafanaAnnotation {
pub time: i64,
pub tags: Vec<String>,
pub text: String,
}
pub fn build_grafana_deploy_annotation(args: DeployAnnotationArgs) -> GrafanaAnnotation {
let mut tags = vec![
"cloudlysis".to_string(),
"deploy".to_string(),
format!("service:{}", args.service),
];
if let Some(v) = args.version {
tags.push(format!("version:{v}"));
}
if let Some(sha) = args.git_sha {
tags.push(format!("git_sha:{sha}"));
}
let text = match (args.version, args.git_sha) {
(Some(v), Some(sha)) => format!("deploy {} v={} git_sha={sha}", args.service, v),
(Some(v), None) => format!("deploy {} v={}", args.service, v),
(None, Some(sha)) => format!("deploy {} git_sha={sha}", args.service),
(None, None) => format!("deploy {}", args.service),
};
GrafanaAnnotation {
time: args.time_ms,
tags,
text,
}
}
pub struct DeployAnnotationArgs<'a> {
pub service: &'a str,
pub version: Option<&'a str>,
pub git_sha: Option<&'a str>,
pub time_ms: i64,
}
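The annotation text follows one fixed format per combination of optional fields; the match arms above reduce to the following standalone sketch (the `gateway` service name is just an example):

```rust
// Same match arms as build_grafana_deploy_annotation's `text` field,
// extracted for illustration.
fn annotation_text(service: &str, version: Option<&str>, git_sha: Option<&str>) -> String {
    match (version, git_sha) {
        (Some(v), Some(sha)) => format!("deploy {service} v={v} git_sha={sha}"),
        (Some(v), None) => format!("deploy {service} v={v}"),
        (None, Some(sha)) => format!("deploy {service} git_sha={sha}"),
        (None, None) => format!("deploy {service}"),
    }
}

fn main() {
    assert_eq!(
        annotation_text("gateway", Some("1.2.3"), Some("abc")),
        "deploy gateway v=1.2.3 git_sha=abc"
    );
    assert_eq!(annotation_text("gateway", None, None), "deploy gateway");
}
```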

control/api/src/fleet.rs

@@ -0,0 +1,67 @@
use serde::{Deserialize, Serialize};
use std::time::Duration;
use crate::RequestIds;
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct FleetService {
pub name: String,
pub base_url: String,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct FleetServiceSnapshot {
pub name: String,
pub base_url: String,
pub health_ok: bool,
pub ready_ok: bool,
pub metrics_ok: bool,
}
pub async fn snapshot(
client: &reqwest::Client,
services: &[FleetService],
) -> Vec<FleetServiceSnapshot> {
snapshot_with_context(client, services, None).await
}
pub async fn snapshot_with_context(
client: &reqwest::Client,
services: &[FleetService],
ctx: Option<&RequestIds>,
) -> Vec<FleetServiceSnapshot> {
let mut out = Vec::with_capacity(services.len());
for svc in services {
let base = svc.base_url.trim_end_matches('/');
let health_ok = get_ok(client, &format!("{base}/health"), ctx).await;
let ready_ok = get_ok(client, &format!("{base}/ready"), ctx).await;
let metrics_ok = get_ok(client, &format!("{base}/metrics"), ctx).await;
out.push(FleetServiceSnapshot {
name: svc.name.clone(),
base_url: svc.base_url.clone(),
health_ok,
ready_ok,
metrics_ok,
});
}
out
}
async fn get_ok(client: &reqwest::Client, url: &str, ctx: Option<&RequestIds>) -> bool {
let mut req = client.get(url).timeout(Duration::from_secs(2));
if let Some(ctx) = ctx {
req = req.header("x-request-id", &ctx.request_id);
if let Some(cid) = &ctx.correlation_id {
req = req.header("x-correlation-id", cid);
}
if let Some(tp) = &ctx.traceparent {
req = req.header("traceparent", tp);
}
}
let res = req.send().await;
match res {
Ok(r) => r.status().is_success(),
Err(_) => false,
}
}
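Each fleet service is probed on three well-known paths, with trailing slashes on `base_url` trimmed first so `http://host:port/` and `http://host:port` behave identically. A sketch of that URL construction (the host below is hypothetical):

```rust
// Mirrors the probe-URL construction in fleet::snapshot.
fn probe_urls(base_url: &str) -> [String; 3] {
    let base = base_url.trim_end_matches('/');
    [
        format!("{base}/health"),
        format!("{base}/ready"),
        format!("{base}/metrics"),
    ]
}

fn main() {
    let urls = probe_urls("http://runner:9090/");
    assert_eq!(urls[0], "http://runner:9090/health");
    assert_eq!(urls[2], "http://runner:9090/metrics");
}
```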


@@ -0,0 +1,348 @@
use crate::{
AppState, Principal,
audit::{AuditEvent, AuditStore},
fleet,
jobs::{Job, JobStatus, JobStep, JobStore},
};
use std::{
collections::HashMap,
sync::{Arc, Mutex},
time::{Duration, SystemTime, UNIX_EPOCH},
};
use uuid::Uuid;
#[derive(Clone, Default)]
pub struct TenantLocks {
inner: Arc<Mutex<HashMap<Uuid, Uuid>>>,
}
impl TenantLocks {
pub fn try_lock(&self, tenant_id: Uuid, job_id: Uuid) -> bool {
let mut map = self.inner.lock().expect("tenant locks poisoned");
if map.contains_key(&tenant_id) {
return false;
}
map.insert(tenant_id, job_id);
true
}
pub fn unlock(&self, tenant_id: Uuid, job_id: Uuid) {
let mut map = self.inner.lock().expect("tenant locks poisoned");
if map.get(&tenant_id).copied() == Some(job_id) {
map.remove(&tenant_id);
}
}
}
#[derive(Clone)]
pub struct JobEngine {
pub jobs: JobStore,
pub audit: AuditStore,
pub tenant_locks: TenantLocks,
pub step_timeout: Duration,
}
impl JobEngine {
pub fn new(jobs: JobStore, audit: AuditStore, tenant_locks: TenantLocks) -> Self {
Self {
jobs,
audit,
tenant_locks,
step_timeout: Duration::from_millis(500),
}
}
pub fn start_tenant_drain(
&self,
state: AppState,
principal: &Principal,
tenant_id: Uuid,
reason: String,
idempotency_key: &str,
) -> Result<Uuid, StartJobError> {
if let Some(existing) = self.jobs.get_idempotent(idempotency_key) {
return Ok(existing);
}
let job_id = Uuid::new_v4();
if !self.tenant_locks.try_lock(tenant_id, job_id) {
return Err(StartJobError::TenantLocked);
}
let now = now_ms();
let job = Job {
job_id,
status: JobStatus::Pending,
steps: vec![step("preflight"), step("drain"), step("verify")],
error: None,
created_at_ms: now,
started_at_ms: None,
finished_at_ms: None,
};
let inserted = self.jobs.insert_idempotent(idempotency_key, job);
self.audit.record(AuditEvent {
ts_ms: now,
principal_sub: principal.sub.clone(),
action: "tenant.drain".to_string(),
tenant_id: Some(tenant_id),
reason,
job_id: Some(inserted),
});
let engine = self.clone();
tokio::spawn(async move {
engine
.run_job(state, inserted, Some(tenant_id), RunSpec::Drain)
.await;
});
Ok(inserted)
}
pub fn start_tenant_migrate(
&self,
state: AppState,
principal: &Principal,
tenant_id: Uuid,
runner_target: String,
reason: String,
idempotency_key: &str,
) -> Result<Uuid, StartJobError> {
if let Some(existing) = self.jobs.get_idempotent(idempotency_key) {
return Ok(existing);
}
let job_id = Uuid::new_v4();
if !self.tenant_locks.try_lock(tenant_id, job_id) {
return Err(StartJobError::TenantLocked);
}
let now = now_ms();
let job = Job {
job_id,
status: JobStatus::Pending,
steps: vec![
step("preflight"),
step("drain"),
step("update_placement"),
step("reload"),
step("verify"),
],
error: None,
created_at_ms: now,
started_at_ms: None,
finished_at_ms: None,
};
let inserted = self.jobs.insert_idempotent(idempotency_key, job);
self.audit.record(AuditEvent {
ts_ms: now,
principal_sub: principal.sub.clone(),
action: "tenant.migrate".to_string(),
tenant_id: Some(tenant_id),
reason,
job_id: Some(inserted),
});
let engine = self.clone();
tokio::spawn(async move {
engine
.run_job(
state,
inserted,
Some(tenant_id),
RunSpec::Migrate { runner_target },
)
.await;
});
Ok(inserted)
}
async fn run_job(&self, state: AppState, job_id: Uuid, tenant_id: Option<Uuid>, spec: RunSpec) {
self.jobs.update(job_id, |j| {
j.status = JobStatus::Running;
j.started_at_ms = Some(now_ms());
});
let mut ok = true;
for idx in 0.. {
if self.jobs.cancel_requested(job_id) {
ok = false;
self.jobs.update(job_id, |j| {
j.status = JobStatus::Cancelled;
j.finished_at_ms = Some(now_ms());
j.error = Some("cancelled".to_string());
for step in &mut j.steps {
if step.status == JobStatus::Pending || step.status == JobStatus::Running {
step.status = JobStatus::Cancelled;
}
}
});
break;
}
let step_name = {
let Some(job) = self.jobs.get(job_id) else {
break;
};
let Some(step) = job.steps.get(idx) else {
break;
};
step.name.clone()
};
self.jobs.update(job_id, |j| {
if let Some(step) = j.steps.get_mut(idx) {
step.status = JobStatus::Running;
step.attempts += 1;
}
});
let r = tokio::time::timeout(
self.step_timeout,
run_step(&state, &spec, &step_name, tenant_id),
)
.await;
match r {
Ok(Ok(())) => {
self.jobs.update(job_id, |j| {
if let Some(step) = j.steps.get_mut(idx) {
step.status = JobStatus::Succeeded;
step.error = None;
}
});
}
Ok(Err(e)) => {
ok = false;
self.jobs.update(job_id, |j| {
if let Some(step) = j.steps.get_mut(idx) {
step.status = JobStatus::Failed;
step.error = Some(e.clone());
}
j.status = JobStatus::Failed;
j.error = Some(e);
j.finished_at_ms = Some(now_ms());
});
break;
}
Err(_) => {
ok = false;
self.jobs.update(job_id, |j| {
if let Some(step) = j.steps.get_mut(idx) {
step.status = JobStatus::Failed;
step.error = Some("step timeout".to_string());
}
j.status = JobStatus::Failed;
j.error = Some("step timeout".to_string());
j.finished_at_ms = Some(now_ms());
});
break;
}
}
if !ok {
break;
}
let done = match self.jobs.get(job_id) {
Some(job) => idx + 1 >= job.steps.len(),
None => true,
};
if done {
break;
}
}
if ok {
self.jobs.update(job_id, |j| {
j.status = JobStatus::Succeeded;
j.finished_at_ms = Some(now_ms());
});
}
if let Some(tid) = tenant_id {
self.tenant_locks.unlock(tid, job_id);
}
}
}
#[derive(Debug)]
pub enum StartJobError {
TenantLocked,
}
#[derive(Clone)]
enum RunSpec {
Drain,
Migrate { runner_target: String },
}
fn step(name: &str) -> JobStep {
JobStep {
name: name.to_string(),
status: JobStatus::Pending,
attempts: 0,
error: None,
}
}
fn now_ms() -> u64 {
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_millis() as u64
}
async fn run_step(
state: &AppState,
spec: &RunSpec,
step: &str,
tenant_id: Option<Uuid>,
) -> Result<(), String> {
match step {
"preflight" => {
let snapshots = fleet::snapshot(&state.http, &state.fleet_services).await;
if snapshots.iter().any(|s| !s.ready_ok) {
return Err("preflight failed: fleet not ready".to_string());
}
Ok(())
}
"drain" => {
tokio::time::sleep(Duration::from_millis(50)).await;
Ok(())
}
"update_placement" => match spec {
RunSpec::Migrate { runner_target } => {
let tenant_id = tenant_id.ok_or_else(|| "missing tenant_id".to_string())?;
state
.placement
.update_runner_target(tenant_id, runner_target.clone())
.map(|_| ())
}
_ => Ok(()),
},
"reload" => {
let _ = state.placement.tenant_summaries();
Ok(())
}
"verify" => match spec {
RunSpec::Migrate { runner_target } => {
let tenant_id = tenant_id.ok_or_else(|| "missing tenant_id".to_string())?;
let summaries = state.placement.tenant_summaries();
let found = summaries
.iter()
.find(|t| t.tenant_id == tenant_id)
.map(|t| t.runner_targets.iter().any(|x| x == runner_target))
.unwrap_or(false);
if !found {
return Err("verify failed: placement not updated".to_string());
}
Ok(())
}
_ => Ok(()),
},
_ => Ok(()),
}
}
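The tenant lock is the mutual-exclusion primitive behind `StartJobError::TenantLocked`: at most one in-flight job per tenant, and unlock is a no-op unless the caller owns the lock. A dependency-free sketch of the same invariant, using `u64` ids in place of `Uuid`:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Same invariant as TenantLocks: one job per tenant at a time, and a
// stale unlock (wrong job id) leaves the lock in place.
#[derive(Clone, Default)]
struct Locks(Arc<Mutex<HashMap<u64, u64>>>);

impl Locks {
    fn try_lock(&self, tenant: u64, job: u64) -> bool {
        let mut m = self.0.lock().unwrap();
        if m.contains_key(&tenant) {
            return false;
        }
        m.insert(tenant, job);
        true
    }
    fn unlock(&self, tenant: u64, job: u64) {
        let mut m = self.0.lock().unwrap();
        if m.get(&tenant).copied() == Some(job) {
            m.remove(&tenant);
        }
    }
}

fn main() {
    let locks = Locks::default();
    assert!(locks.try_lock(1, 100)); // first job acquires the tenant
    assert!(!locks.try_lock(1, 200)); // concurrent job is rejected
    locks.unlock(1, 200); // non-owner unlock is ignored
    assert!(!locks.try_lock(1, 200));
    locks.unlock(1, 100); // owner releases
    assert!(locks.try_lock(1, 200));
}
```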

control/api/src/jobs.rs

@@ -0,0 +1,122 @@
use serde::{Deserialize, Serialize};
use std::{
collections::HashMap,
sync::{
Arc, Mutex,
atomic::{AtomicBool, Ordering},
},
};
use uuid::Uuid;
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum JobStatus {
Pending,
Running,
Succeeded,
Failed,
Cancelled,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Job {
pub job_id: Uuid,
pub status: JobStatus,
pub steps: Vec<JobStep>,
pub error: Option<String>,
pub created_at_ms: u64,
pub started_at_ms: Option<u64>,
pub finished_at_ms: Option<u64>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct JobStep {
pub name: String,
pub status: JobStatus,
pub attempts: u32,
pub error: Option<String>,
}
struct JobRecord {
job: Mutex<Job>,
cancel: AtomicBool,
}
#[derive(Clone, Default)]
pub struct JobStore {
inner: Arc<Inner>,
}
#[derive(Default)]
struct Inner {
jobs: Mutex<HashMap<Uuid, Arc<JobRecord>>>,
idempotency: Mutex<HashMap<String, Uuid>>,
}
impl JobStore {
pub fn get(&self, job_id: Uuid) -> Option<Job> {
let jobs = self.inner.jobs.lock().ok()?;
let rec = jobs.get(&job_id)?.clone();
rec.job.lock().ok().map(|j| j.clone())
}
pub fn get_idempotent(&self, key: &str) -> Option<Uuid> {
let map = self.inner.idempotency.lock().ok()?;
map.get(key).copied()
}
pub fn insert_idempotent(&self, key: &str, job: Job) -> Uuid {
let mut idempotency = self
.inner
.idempotency
.lock()
.expect("idempotency lock poisoned");
if let Some(existing) = idempotency.get(key) {
return *existing;
}
let job_id = job.job_id;
let rec = Arc::new(JobRecord {
job: Mutex::new(job),
cancel: AtomicBool::new(false),
});
self.inner
.jobs
.lock()
.expect("jobs lock poisoned")
.insert(job_id, rec);
idempotency.insert(key.to_string(), job_id);
job_id
}
pub fn request_cancel(&self, job_id: Uuid) -> bool {
let jobs = self.inner.jobs.lock().expect("jobs lock poisoned");
let Some(rec) = jobs.get(&job_id) else {
return false;
};
rec.cancel.store(true, Ordering::SeqCst);
true
}
pub fn cancel_requested(&self, job_id: Uuid) -> bool {
let jobs = self.inner.jobs.lock().expect("jobs lock poisoned");
let Some(rec) = jobs.get(&job_id) else {
return false;
};
rec.cancel.load(Ordering::SeqCst)
}
pub fn update<F>(&self, job_id: Uuid, f: F) -> bool
where
F: FnOnce(&mut Job),
{
let jobs = self.inner.jobs.lock().expect("jobs lock poisoned");
let Some(rec) = jobs.get(&job_id) else {
return false;
};
let mut job = rec.job.lock().expect("job lock poisoned");
f(&mut job);
true
}
}
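Cancellation in `JobStore` is cooperative: `request_cancel` only sets an atomic flag, and the job engine polls it between steps rather than interrupting a running step. The flag mechanics reduce to:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Same mechanism as JobRecord.cancel: request_cancel flips a flag,
// cancel_requested reads it; a step already in flight is never aborted.
fn request_cancel(flag: &AtomicBool) {
    flag.store(true, Ordering::SeqCst);
}

fn cancel_requested(flag: &AtomicBool) -> bool {
    flag.load(Ordering::SeqCst)
}

fn main() {
    let cancel = AtomicBool::new(false);
    assert!(!cancel_requested(&cancel));
    request_cancel(&cancel);
    assert!(cancel_requested(&cancel));
}
```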

control/api/src/lib.rs

@@ -0,0 +1,692 @@
mod admin;
mod audit;
mod auth;
mod build_info;
mod deployments;
mod fleet;
mod job_engine;
mod jobs;
mod placement;
mod swarm;
pub use audit::AuditStore;
pub use auth::{AuthConfig, Principal};
use axum::{
Router,
extract::State,
http::{HeaderName, HeaderValue, Request, StatusCode},
middleware::{Next, from_fn, from_fn_with_state},
response::{IntoResponse, Response},
routing::get,
};
pub use build_info::{BuildInfo, extract_build_info};
pub use deployments::{DeployAnnotationArgs, GrafanaAnnotation, build_grafana_deploy_annotation};
pub use fleet::FleetService;
pub use job_engine::TenantLocks;
pub use jobs::JobStore;
use metrics_exporter_prometheus::PrometheusHandle;
pub use placement::PlacementStore;
pub use placement::ServiceKind;
use std::time::Instant;
pub use swarm::SwarmStore;
use tower_http::trace::TraceLayer;
use tracing::{Span, field};
use uuid::Uuid;
#[derive(Clone)]
pub struct AppState {
pub prometheus: PrometheusHandle,
pub auth: AuthConfig,
pub jobs: JobStore,
pub audit: AuditStore,
pub tenant_locks: TenantLocks,
pub http: reqwest::Client,
pub placement: PlacementStore,
pub fleet_services: Vec<FleetService>,
pub swarm: SwarmStore,
}
#[derive(Clone, Debug)]
pub struct RequestIds {
pub request_id: String,
pub correlation_id: Option<String>,
pub traceparent: Option<String>,
}
const HEADER_REQUEST_ID: HeaderName = HeaderName::from_static("x-request-id");
const HEADER_CORRELATION_ID: HeaderName = HeaderName::from_static("x-correlation-id");
const HEADER_TRACEPARENT: HeaderName = HeaderName::from_static("traceparent");
pub fn build_app(state: AppState) -> Router {
let trace = TraceLayer::new_for_http()
.make_span_with(|req: &Request<_>| {
let request_id = req
.headers()
.get(&HEADER_REQUEST_ID)
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.to_owned();
let correlation_id = req
.headers()
.get(&HEADER_CORRELATION_ID)
.and_then(|v| v.to_str().ok())
.unwrap_or("")
.to_owned();
tracing::info_span!(
"http_request",
request.method = %req.method(),
request.path = %req.uri().path(),
request_id = %request_id,
correlation_id = %correlation_id,
trace_id = "",
status = field::Empty,
duration_ms = field::Empty,
)
})
.on_response(
|res: &Response, latency: std::time::Duration, span: &Span| {
span.record("status", field::display(res.status()));
span.record("duration_ms", field::display(latency.as_millis()));
tracing::info!("response");
},
);
let admin =
admin::admin_router().layer(from_fn_with_state(state.clone(), auth::auth_middleware));
Router::new()
.route("/health", get(health))
.route("/ready", get(ready))
.route("/metrics", get(metrics))
.nest("/admin/v1", admin)
.with_state(state)
.layer(trace)
.layer(from_fn(request_id_middleware))
}
async fn health() -> impl IntoResponse {
(StatusCode::OK, "ok")
}
async fn ready() -> impl IntoResponse {
(StatusCode::OK, "ready")
}
async fn metrics(State(state): State<AppState>) -> impl IntoResponse {
(StatusCode::OK, state.prometheus.render())
}
async fn request_id_middleware(mut req: Request<axum::body::Body>, next: Next) -> Response {
let request_id = req
.headers()
.get(&HEADER_REQUEST_ID)
.and_then(|v| v.to_str().ok())
.map(|s| s.to_owned())
.unwrap_or_else(|| Uuid::new_v4().to_string());
let correlation_id = req
.headers()
.get(&HEADER_CORRELATION_ID)
.and_then(|v| v.to_str().ok())
.map(|s| s.to_owned());
let traceparent = req
.headers()
.get(&HEADER_TRACEPARENT)
.and_then(|v| v.to_str().ok())
.map(|s| s.to_owned());
if req.headers().get(&HEADER_REQUEST_ID).is_none()
&& let Ok(v) = HeaderValue::from_str(&request_id)
{
req.headers_mut().insert(HEADER_REQUEST_ID.clone(), v);
}
req.extensions_mut().insert(RequestIds {
request_id: request_id.clone(),
correlation_id: correlation_id.clone(),
traceparent: traceparent.clone(),
});
let start = Instant::now();
let mut res = next.run(req).await;
if let Ok(v) = HeaderValue::from_str(&request_id) {
res.headers_mut().insert(HEADER_REQUEST_ID.clone(), v);
}
if let Some(correlation_id) = correlation_id
&& let Ok(v) = HeaderValue::from_str(&correlation_id)
{
res.headers_mut().insert(HEADER_CORRELATION_ID.clone(), v);
}
metrics::histogram!("http_request_duration_ms").record(start.elapsed().as_millis() as f64);
res
}
#[cfg(test)]
mod tests {
use super::*;
use crate::jobs::JobStatus;
use axum::{
body::Body,
http::{Request, StatusCode, header},
};
use jsonwebtoken::{EncodingKey, Header, encode};
use metrics_exporter_prometheus::PrometheusBuilder;
use serde::Serialize;
use std::fs;
use std::path::PathBuf;
use std::sync::OnceLock;
use tower::ServiceExt;
use uuid::Uuid;
static HANDLE: OnceLock<PrometheusHandle> = OnceLock::new();
#[derive(Serialize)]
struct TestClaims {
sub: String,
session_id: String,
permissions: Vec<String>,
exp: usize,
}
fn test_app() -> Router {
test_app_with_fleet(vec![])
}
fn test_app_with_fleet(fleet_services: Vec<FleetService>) -> Router {
let handle = HANDLE
.get_or_init(|| {
PrometheusBuilder::new()
.install_recorder()
.expect("failed to install prometheus recorder")
})
.clone();
let placement_path = temp_placement_file();
build_app(AppState {
prometheus: handle,
auth: AuthConfig {
hs256_secret: Some(b"test_secret".to_vec()),
},
jobs: JobStore::default(),
audit: AuditStore::default(),
tenant_locks: TenantLocks::default(),
http: reqwest::Client::new(),
placement: PlacementStore::new(placement_path),
fleet_services,
swarm: SwarmStore::new(repo_root().join("swarm/dev.json")),
})
}
fn repo_root() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.and_then(|p| p.parent())
.expect("api crate should live under repo root")
.to_path_buf()
}
fn temp_placement_file() -> PathBuf {
let root = repo_root();
let src = root.join("placement/dev.json");
let mut dst = std::env::temp_dir();
dst.push(format!(
"cloudlysis-control-placement-{}-{}.json",
std::process::id(),
Uuid::new_v4()
));
let raw = fs::read_to_string(src).expect("missing placement/dev.json");
fs::write(&dst, raw).expect("failed to write temp placement file");
dst
}
fn assert_send_sync<T: Send + Sync>() {}
#[test]
fn core_state_types_are_send_sync() {
assert_send_sync::<AppState>();
assert_send_sync::<JobStore>();
assert_send_sync::<AuthConfig>();
}
#[tokio::test]
async fn health_returns_200() {
let res = test_app()
.oneshot(
Request::builder()
.uri("/health")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
}
#[tokio::test]
async fn ready_returns_200() {
let res = test_app()
.oneshot(
Request::builder()
.uri("/ready")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
}
#[tokio::test]
async fn metrics_returns_200() {
let res = test_app()
.oneshot(
Request::builder()
.uri("/metrics")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
}
fn make_token(perms: &[&str]) -> String {
let exp = (std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_secs()
+ 60) as usize;
encode(
&Header::default(),
&TestClaims {
sub: "user_1".to_string(),
session_id: "sess_1".to_string(),
permissions: perms.iter().map(|p| (*p).to_string()).collect(),
exp,
},
&EncodingKey::from_secret(b"test_secret"),
)
.unwrap()
}
#[tokio::test]
async fn unauthorized_admin_calls_return_401() {
let res = test_app()
.oneshot(
Request::builder()
.uri("/admin/v1/platform/info")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::UNAUTHORIZED);
}
#[tokio::test]
async fn forbidden_admin_calls_return_403() {
let token = make_token(&["control:read"]);
let res = test_app()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/echo")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "k1")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::FORBIDDEN);
}
#[tokio::test]
async fn tenant_scoped_endpoints_require_x_tenant_id() {
let token = make_token(&["control:read"]);
let res = test_app()
.oneshot(
Request::builder()
.uri("/admin/v1/tenants/echo")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::BAD_REQUEST);
}
#[tokio::test]
async fn job_create_is_idempotent() {
let token = make_token(&["control:write"]);
let app = test_app();
let res1 = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/echo")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "same-key")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res1.status(), StatusCode::OK);
let body1 = axum::body::to_bytes(res1.into_body(), 1024 * 1024)
.await
.unwrap();
let v1: serde_json::Value = serde_json::from_slice(&body1).unwrap();
let id1 = Uuid::parse_str(v1.get("job_id").unwrap().as_str().unwrap()).unwrap();
let res2 = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/echo")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "same-key")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res2.status(), StatusCode::OK);
let body2 = axum::body::to_bytes(res2.into_body(), 1024 * 1024)
.await
.unwrap();
let v2: serde_json::Value = serde_json::from_slice(&body2).unwrap();
let id2 = Uuid::parse_str(v2.get("job_id").unwrap().as_str().unwrap()).unwrap();
assert_eq!(id1, id2);
}
async fn wait_for_terminal_status(app: Router, job_id: Uuid) -> JobStatus {
let start = tokio::time::Instant::now();
loop {
let res = app
.clone()
.oneshot(
Request::builder()
.uri(format!("/admin/v1/jobs/{job_id}"))
.header(
header::AUTHORIZATION,
format!("Bearer {}", make_token(&["control:read"])),
)
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
if res.status() == StatusCode::OK {
let body = axum::body::to_bytes(res.into_body(), 1024 * 1024)
.await
.unwrap();
let job: crate::jobs::Job = serde_json::from_slice(&body).unwrap();
if job.status != JobStatus::Pending && job.status != JobStatus::Running {
return job.status;
}
}
if start.elapsed() > std::time::Duration::from_millis(500) {
return JobStatus::Failed;
}
tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}
}
#[tokio::test]
async fn tenant_job_idempotency_does_not_duplicate_effects() {
let token = make_token(&["control:write", "control:read"]);
let app = test_app();
let tenant_id = Uuid::new_v4();
let body = serde_json::json!({
"tenant_id": tenant_id,
"reason": "test",
});
let res1 = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/tenant/drain")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "same-key")
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(body.to_string()))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res1.status(), StatusCode::OK);
let res2 = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/tenant/drain")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "same-key")
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(body.to_string()))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res2.status(), StatusCode::OK);
let b1 = axum::body::to_bytes(res1.into_body(), 1024 * 1024)
.await
.unwrap();
let b2 = axum::body::to_bytes(res2.into_body(), 1024 * 1024)
.await
.unwrap();
let v1: serde_json::Value = serde_json::from_slice(&b1).unwrap();
let v2: serde_json::Value = serde_json::from_slice(&b2).unwrap();
assert_eq!(v1.get("job_id"), v2.get("job_id"));
}
#[tokio::test]
async fn tenant_lock_prevents_concurrent_mutations() {
let token = make_token(&["control:write", "control:read"]);
let app = test_app();
let tenant_id = Uuid::new_v4();
let res1 = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/tenant/drain")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "k1")
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(
serde_json::json!({ "tenant_id": tenant_id, "reason": "r" }).to_string(),
))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res1.status(), StatusCode::OK);
let res2 = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/tenant/migrate")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "k2")
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(
serde_json::json!({
"tenant_id": tenant_id,
"runner_target": "node-2",
"reason": "r2"
})
.to_string(),
))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res2.status(), StatusCode::CONFLICT);
}
#[tokio::test]
async fn migrate_preflight_fails_when_fleet_not_ready() {
let token = make_token(&["control:write", "control:read"]);
let app = test_app_with_fleet(vec![FleetService {
name: "unreachable".to_string(),
base_url: "http://127.0.0.1:1".to_string(),
}]);
let tenant_id = Uuid::new_v4();
let res = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/tenant/migrate")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "k3")
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(
serde_json::json!({
"tenant_id": tenant_id,
"runner_target": "node-2",
"reason": "r"
})
.to_string(),
))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
let body = axum::body::to_bytes(res.into_body(), 1024 * 1024)
.await
.unwrap();
let v: serde_json::Value = serde_json::from_slice(&body).unwrap();
let job_id = Uuid::parse_str(v.get("job_id").unwrap().as_str().unwrap()).unwrap();
let status = wait_for_terminal_status(app, job_id).await;
assert_eq!(status, JobStatus::Failed);
}
#[tokio::test]
async fn cancel_marks_job_cancelled() {
let token = make_token(&["control:write", "control:read"]);
let app = test_app();
let tenant_id = Uuid::new_v4();
let res = app
.clone()
.oneshot(
Request::builder()
.uri("/admin/v1/jobs/tenant/migrate")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header("idempotency-key", "k4")
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(
serde_json::json!({
"tenant_id": tenant_id,
"runner_target": "node-2",
"reason": "r"
})
.to_string(),
))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
let body = axum::body::to_bytes(res.into_body(), 1024 * 1024)
.await
.unwrap();
let v: serde_json::Value = serde_json::from_slice(&body).unwrap();
let job_id = Uuid::parse_str(v.get("job_id").unwrap().as_str().unwrap()).unwrap();
let res = app
.clone()
.oneshot(
Request::builder()
.uri(format!("/admin/v1/jobs/{job_id}/cancel"))
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
let status = wait_for_terminal_status(app, job_id).await;
assert_eq!(status, JobStatus::Cancelled);
}
#[tokio::test]
async fn migration_plan_is_deterministic() {
let token = make_token(&["control:write"]);
let app = test_app();
let tenant_id = Uuid::new_v4();
let res = app
.oneshot(
Request::builder()
.uri("/admin/v1/plan/tenant/migrate")
.method("POST")
.header(header::AUTHORIZATION, format!("Bearer {token}"))
.header(header::CONTENT_TYPE, "application/json")
.body(Body::from(
serde_json::json!({
"tenant_id": tenant_id,
"runner_target": "node-2",
"reason": "r"
})
.to_string(),
))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::OK);
let body = axum::body::to_bytes(res.into_body(), 1024 * 1024)
.await
.unwrap();
let v: serde_json::Value = serde_json::from_slice(&body).unwrap();
assert_eq!(
v.get("steps").unwrap(),
&serde_json::json!(["preflight", "drain", "update_placement", "reload", "verify"])
);
}
}

control/api/src/main.rs

@@ -0,0 +1,109 @@
use clap::Parser;
use metrics_exporter_prometheus::PrometheusBuilder;
use std::net::SocketAddr;
use tracing_subscriber::EnvFilter;
#[derive(Parser, Debug)]
#[command(name = "control-api")]
struct Args {
#[arg(long, env = "CONTROL_API_ADDR", default_value = "127.0.0.1:8080")]
addr: SocketAddr,
}
#[tokio::main]
async fn main() {
let args = Args::parse();
tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
)
.init();
let recorder = PrometheusBuilder::new()
.set_buckets(&[
1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0,
])
.expect("invalid prometheus buckets")
.install_recorder()
.expect("failed to install prometheus recorder");
let http = reqwest::Client::builder()
.user_agent("cloudlysis-control-api")
.build()
.expect("failed to build http client");
let placement_path = std::env::var("CONTROL_PLACEMENT_PATH")
.ok()
.unwrap_or_else(|| "placement/dev.json".to_string())
.into();
let swarm_path = std::env::var("CONTROL_SWARM_STATE_PATH")
.ok()
.unwrap_or_else(|| "swarm/dev.json".to_string())
.into();
let self_url = std::env::var("CONTROL_SELF_URL")
.ok()
.unwrap_or_else(|| "http://127.0.0.1:8080".to_string());
let mut fleet_services = vec![api::FleetService {
name: "control-api".to_string(),
base_url: self_url,
}];
if let Ok(spec) = std::env::var("CONTROL_FLEET_SERVICES") {
fleet_services.extend(parse_fleet_services(&spec));
}
let app = api::build_app(api::AppState {
prometheus: recorder,
auth: api::AuthConfig {
hs256_secret: std::env::var("CONTROL_GATEWAY_JWT_HS256_SECRET")
.ok()
.map(|s| s.into_bytes()),
},
jobs: api::JobStore::default(),
audit: api::AuditStore::default(),
tenant_locks: api::TenantLocks::default(),
http,
placement: api::PlacementStore::new(placement_path),
fleet_services,
swarm: api::SwarmStore::new(swarm_path),
});
let listener = tokio::net::TcpListener::bind(args.addr)
.await
.expect("failed to bind");
tracing::info!(addr = %args.addr, "control api listening");
axum::serve(listener, app)
.with_graceful_shutdown(shutdown_signal())
.await
.expect("server failed");
}
async fn shutdown_signal() {
let _ = tokio::signal::ctrl_c().await;
}
fn parse_fleet_services(spec: &str) -> Vec<api::FleetService> {
spec.split(',')
.filter_map(|pair| {
let pair = pair.trim();
if pair.is_empty() {
return None;
}
let (name, url) = pair.split_once('=')?;
let name = name.trim();
let url = url.trim();
if name.is_empty() || url.is_empty() {
return None;
}
Some(api::FleetService {
name: name.to_string(),
base_url: url.to_string(),
})
})
.collect()
}
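`CONTROL_FLEET_SERVICES` uses a `name=url[,name=url...]` format, with whitespace around entries tolerated and malformed or empty entries silently skipped. A sketch that mirrors `parse_fleet_services` but returns plain pairs (the service names and ports below are examples):

```rust
// Mirrors parse_fleet_services: split on commas, split each entry on the
// first '=', trim, and drop entries with an empty name or url.
fn parse_fleet_spec(spec: &str) -> Vec<(String, String)> {
    spec.split(',')
        .filter_map(|pair| {
            let (name, url) = pair.trim().split_once('=')?;
            let (name, url) = (name.trim(), url.trim());
            if name.is_empty() || url.is_empty() {
                return None;
            }
            Some((name.to_string(), url.to_string()))
        })
        .collect()
}

fn main() {
    let v = parse_fleet_spec("gateway=http://127.0.0.1:8081, runner=http://127.0.0.1:9090");
    assert_eq!(v.len(), 2);
    assert_eq!(v[0], ("gateway".to_string(), "http://127.0.0.1:8081".to_string()));
    // Malformed entries are skipped rather than rejected.
    assert!(parse_fleet_spec("no_equals_sign, =http://x").is_empty());
}
```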


@@ -0,0 +1,227 @@
use serde::{Deserialize, Serialize};
use std::{
collections::BTreeMap,
fs,
path::{Path, PathBuf},
sync::{Arc, RwLock},
time::SystemTime,
};
use uuid::Uuid;
#[derive(Clone, Copy, Debug, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ServiceKind {
Aggregate,
Projection,
Runner,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PlacementFile {
pub revision: Option<String>,
pub aggregate_placement: Option<PlacementKind>,
pub projection_placement: Option<PlacementKind>,
pub runner_placement: Option<PlacementKind>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PlacementKind {
pub placements: Vec<TenantPlacement>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct TenantPlacement {
pub tenant_id: Uuid,
pub targets: Vec<String>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PlacementResponse {
pub kind: ServiceKind,
pub revision: String,
pub placements: Vec<TenantPlacement>,
}
impl PlacementFile {
pub fn load(path: &Path) -> Option<Self> {
let raw = fs::read_to_string(path).ok()?;
serde_json::from_str(&raw).ok()
}
pub fn for_kind(&self, kind: ServiceKind) -> PlacementResponse {
let revision = self.revision.clone().unwrap_or_else(|| "dev".to_string());
let placements = match kind {
ServiceKind::Aggregate => self
.aggregate_placement
.as_ref()
.map(|p| p.placements.clone())
.unwrap_or_default(),
ServiceKind::Projection => self
.projection_placement
.as_ref()
.map(|p| p.placements.clone())
.unwrap_or_default(),
ServiceKind::Runner => self
.runner_placement
.as_ref()
.map(|p| p.placements.clone())
.unwrap_or_default(),
};
PlacementResponse {
kind,
revision,
placements,
}
}
}
#[derive(Clone)]
pub struct PlacementStore {
inner: Arc<RwLock<Inner>>,
}
struct Inner {
path: PathBuf,
last_modified: Option<SystemTime>,
cached: Option<PlacementFile>,
}
impl PlacementStore {
pub fn new(path: PathBuf) -> Self {
Self {
inner: Arc::new(RwLock::new(Inner {
path,
last_modified: None,
cached: None,
})),
}
}
pub fn get_for_kind(&self, kind: ServiceKind) -> PlacementResponse {
let mut inner = self.inner.write().expect("placement lock poisoned");
inner.reload_if_changed();
match inner.cached.as_ref() {
Some(p) => p.for_kind(kind),
None => PlacementResponse {
kind,
revision: "dev".to_string(),
placements: vec![],
},
}
}
pub fn tenant_summaries(&self) -> Vec<TenantSummary> {
let mut inner = self.inner.write().expect("placement lock poisoned");
inner.reload_if_changed();
let Some(p) = inner.cached.as_ref() else {
return vec![];
};
let mut map: BTreeMap<Uuid, TenantSummary> = BTreeMap::new();
for (kind, placements) in [
(
ServiceKind::Aggregate,
p.for_kind(ServiceKind::Aggregate).placements,
),
(
ServiceKind::Projection,
p.for_kind(ServiceKind::Projection).placements,
),
(
ServiceKind::Runner,
p.for_kind(ServiceKind::Runner).placements,
),
] {
for tp in placements {
let entry = map.entry(tp.tenant_id).or_insert_with(|| TenantSummary {
tenant_id: tp.tenant_id,
aggregate_targets: vec![],
projection_targets: vec![],
runner_targets: vec![],
});
match kind {
ServiceKind::Aggregate => entry.aggregate_targets = tp.targets,
ServiceKind::Projection => entry.projection_targets = tp.targets,
ServiceKind::Runner => entry.runner_targets = tp.targets,
}
}
}
map.into_values().collect()
}
    /// Set the single runner target for `tenant_id`, bumping the revision and
    /// persisting via write-to-temp + rename so concurrent readers never see a
    /// partially written placement file.
    pub fn update_runner_target(
&self,
tenant_id: Uuid,
runner_target: String,
) -> Result<String, String> {
let mut inner = self.inner.write().expect("placement lock poisoned");
inner.reload_if_changed();
let mut file = inner.cached.clone().unwrap_or(PlacementFile {
revision: Some("dev".to_string()),
aggregate_placement: Some(PlacementKind { placements: vec![] }),
projection_placement: Some(PlacementKind { placements: vec![] }),
runner_placement: Some(PlacementKind { placements: vec![] }),
});
let mut runner = file
.runner_placement
.take()
.unwrap_or(PlacementKind { placements: vec![] });
if let Some(existing) = runner
.placements
.iter_mut()
.find(|p| p.tenant_id == tenant_id)
{
existing.targets = vec![runner_target];
} else {
runner.placements.push(TenantPlacement {
tenant_id,
targets: vec![runner_target],
});
}
runner.placements.sort_by_key(|p| p.tenant_id);
file.runner_placement = Some(runner);
let revision = format!("rev-{}", Uuid::new_v4());
file.revision = Some(revision.clone());
let raw = serde_json::to_string_pretty(&file).map_err(|e| e.to_string())?;
let tmp = inner.path.with_extension("json.tmp");
fs::write(&tmp, raw).map_err(|e| e.to_string())?;
fs::rename(&tmp, &inner.path).map_err(|e| e.to_string())?;
inner.last_modified = None;
inner.cached = Some(file);
Ok(revision)
}
}
impl Inner {
    /// Reload the placement file when its on-disk mtime has changed; if the
    /// mtime cannot be read (e.g. the file is missing), always reload so a
    /// stale snapshot is never served.
    fn reload_if_changed(&mut self) {
let meta = fs::metadata(&self.path).ok();
let modified = meta.and_then(|m| m.modified().ok());
if self.cached.is_some() && modified.is_some() && modified == self.last_modified {
return;
}
self.last_modified = modified;
self.cached = PlacementFile::load(&self.path);
}
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct TenantSummary {
pub tenant_id: Uuid,
pub aggregate_targets: Vec<String>,
pub projection_targets: Vec<String>,
pub runner_targets: Vec<String>,
}

control/api/src/swarm.rs Normal file (62 lines)

@@ -0,0 +1,62 @@
use serde::{Deserialize, Serialize};
use std::{fs, path::Path};
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct SwarmService {
pub name: String,
pub image: Option<String>,
pub mode: Option<String>,
pub replicas: Option<String>,
pub updated_at: Option<String>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct SwarmTask {
pub id: String,
pub service: String,
pub node: Option<String>,
pub desired_state: Option<String>,
pub current_state: Option<String>,
pub error: Option<String>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct SwarmStateFile {
pub services: Vec<SwarmService>,
pub tasks: Vec<SwarmTask>,
}
#[derive(Clone)]
pub struct SwarmStore {
path: std::path::PathBuf,
}
impl SwarmStore {
pub fn new(path: std::path::PathBuf) -> Self {
Self { path }
}
pub fn list_services(&self) -> Vec<SwarmService> {
self.load().map(|s| s.services).unwrap_or_default()
}
pub fn list_tasks(&self, service_name: &str) -> Vec<SwarmTask> {
self.load()
.map(|s| {
s.tasks
.into_iter()
.filter(|t| t.service == service_name)
.collect()
})
.unwrap_or_default()
}
fn load(&self) -> Option<SwarmStateFile> {
load_state(&self.path)
}
}
fn load_state(path: &Path) -> Option<SwarmStateFile> {
let raw = fs::read_to_string(path).ok()?;
serde_json::from_str(&raw).ok()
}
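
For reference, a minimal swarm state file in the shape `load_state` deserializes (field names mirror `SwarmService` and `SwarmTask` above; the concrete values are illustrative):

```json
{
  "services": [
    {
      "name": "gateway",
      "image": "registry.local/gateway:1.2.3",
      "mode": "replicated",
      "replicas": "1/1",
      "updated_at": null
    }
  ],
  "tasks": [
    {
      "id": "t1",
      "service": "gateway",
      "node": "n1",
      "desired_state": "running",
      "current_state": "running",
      "error": null
    }
  ]
}
```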

@@ -0,0 +1,16 @@
#[test]
fn annotation_writer_produces_expected_grafana_payload() {
let a = api::build_grafana_deploy_annotation(api::DeployAnnotationArgs {
service: "gateway",
version: Some("1.2.3"),
git_sha: Some("abc123"),
time_ms: 1234567890,
});
assert_eq!(a.time, 1234567890);
assert!(a.tags.iter().any(|t| t == "deploy"));
assert!(a.tags.iter().any(|t| t == "service:gateway"));
assert!(a.tags.iter().any(|t| t == "version:1.2.3"));
assert!(a.tags.iter().any(|t| t == "git_sha:abc123"));
assert!(a.text.contains("deploy gateway"));
}

@@ -0,0 +1,39 @@
#[test]
fn build_info_parser_extracts_expected_labels() {
let metrics = r#"
# HELP gateway_build_info build info
# TYPE gateway_build_info gauge
gateway_build_info{service="gateway",version="1.2.3",git_sha="abc"} 1
runner_build_info{service="runner",version="2.0.0",git_sha="def"} 1
unrelated_metric 5
"#;
let info = api::extract_build_info(metrics);
assert_eq!(info.len(), 2);
assert!(
info.iter()
.any(|i| i.service == "gateway" && i.version == "1.2.3" && i.git_sha == "abc")
);
assert!(
info.iter()
.any(|i| i.service == "runner" && i.version == "2.0.0" && i.git_sha == "def")
);
}
#[test]
fn build_info_snapshot_has_required_services() {
let metrics = r#"
gateway_build_info{service="gateway",version="1.2.3",git_sha="abc"} 1
aggregate_build_info{service="aggregate",version="1.0.0",git_sha="aaa"} 1
projection_build_info{service="projection",version="1.0.0",git_sha="bbb"} 1
runner_build_info{service="runner",version="2.0.0",git_sha="ccc"} 1
"#;
let info = api::extract_build_info(metrics);
for required in ["gateway", "aggregate", "projection", "runner"] {
assert!(
info.iter().any(|i| i.service == required),
"missing build_info for service={required}"
);
}
}

@@ -0,0 +1,55 @@
use std::{fs, path::PathBuf, time::Duration};
fn repo_root() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.and_then(|p| p.parent())
.expect("api crate should live under repo root")
.to_path_buf()
}
#[test]
fn docker_compose_files_parse_and_include_required_services() {
let root = repo_root();
let compose = fs::read_to_string(root.join("observability/docker-compose.yml")).unwrap();
let v: serde_yaml::Value = serde_yaml::from_str(&compose).unwrap();
let services = v
.get("services")
.and_then(|x| x.as_mapping())
.expect("missing services");
for required in ["grafana", "victoria-metrics", "vmagent", "loki", "tempo"] {
assert!(
services.contains_key(serde_yaml::Value::String(required.to_string())),
"missing service {required}"
);
}
}
#[tokio::test]
#[ignore]
async fn docker_compose_config_validation_is_gated_and_fast() {
let enabled = std::env::var("CONTROL_TEST_DOCKER").ok();
assert_eq!(enabled.as_deref(), Some("1"));
let root = repo_root();
let compose = root.join("observability/docker-compose.yml");
let cmd = tokio::process::Command::new("docker")
.args(["compose", "-f"])
.arg(compose)
.args(["config"])
.output();
let out = tokio::time::timeout(Duration::from_secs(10), cmd)
.await
.expect("docker compose config timed out")
.expect("failed to run docker compose config");
assert!(
out.status.success(),
"docker compose config failed: {}",
String::from_utf8_lossy(&out.stderr)
);
}

@@ -0,0 +1,6 @@
#[test]
#[ignore]
fn docker_integration_tests_are_gated() {
let enabled = std::env::var("CONTROL_TEST_DOCKER").ok();
assert_eq!(enabled.as_deref(), Some("1"));
}

@@ -0,0 +1,183 @@
use jsonwebtoken::{EncodingKey, Header, encode};
use serde::Serialize;
use std::{fs, net::TcpListener, time::Duration};
#[derive(Serialize)]
struct Claims {
sub: String,
session_id: String,
permissions: Vec<String>,
exp: usize,
}
fn free_port() -> u16 {
TcpListener::bind("127.0.0.1:0")
.unwrap()
.local_addr()
.unwrap()
.port()
}
fn token(secret: &[u8], perms: &[&str]) -> String {
let exp = (std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_secs()
+ 60) as usize;
encode(
&Header::default(),
&Claims {
sub: "op_1".to_string(),
session_id: "sess_1".to_string(),
permissions: perms.iter().map(|p| (*p).to_string()).collect(),
exp,
},
&EncodingKey::from_secret(secret),
)
.unwrap()
}
async fn wait_ready(url: &str) {
let client = reqwest::Client::new();
let start = tokio::time::Instant::now();
loop {
let ok = client
.get(format!("{url}/ready"))
.send()
.await
.map(|r| r.status().is_success())
.unwrap_or(false);
if ok {
return;
}
if start.elapsed() > Duration::from_secs(10) {
panic!("control-api did not become ready");
}
tokio::time::sleep(Duration::from_millis(100)).await;
}
}
#[tokio::test]
#[ignore]
async fn control_plane_can_see_the_fleet_via_docker_stubs() {
let enabled = std::env::var("CONTROL_TEST_DOCKER").ok();
assert_eq!(enabled.as_deref(), Some("1"));
let nginx_conf = r#"
server {
listen 80;
server_name _;
location = /health { return 200 "ok\n"; }
location = /ready { return 200 "ready\n"; }
location = /metrics { return 200 "stub_build_info{service=\"stub\",version=\"dev\",git_sha=\"000\"} 1\n"; }
}
"#;
let mut conf_path = std::env::temp_dir();
conf_path.push(format!(
"cloudlysis-control-nginx-{}.conf",
uuid::Uuid::new_v4()
));
fs::write(&conf_path, nginx_conf).unwrap();
let gateway_port = free_port();
let runner_port = free_port();
let aggregate_port = free_port();
let projection_port = free_port();
async fn run_stub(name: &str, port: u16, conf: &std::path::Path) -> String {
let out = tokio::process::Command::new("docker")
.args(["run", "-d", "--rm"])
.args(["-p", &format!("{port}:80")])
.args([
"-v",
&format!("{}:/etc/nginx/conf.d/default.conf:ro", conf.display()),
])
.arg("nginx:1.29-alpine")
.output()
.await
.expect("failed to run docker");
assert!(
out.status.success(),
"{name} stub failed: {}",
String::from_utf8_lossy(&out.stderr)
);
String::from_utf8_lossy(&out.stdout).trim().to_string()
}
let gateway_id = run_stub("gateway", gateway_port, &conf_path).await;
let runner_id = run_stub("runner", runner_port, &conf_path).await;
let aggregate_id = run_stub("aggregate", aggregate_port, &conf_path).await;
let projection_id = run_stub("projection", projection_port, &conf_path).await;
let secret = b"e2e_secret";
let api_port = free_port();
let api_url = format!("http://127.0.0.1:{api_port}");
let mut placement_path = std::env::temp_dir();
placement_path.push(format!(
"cloudlysis-control-placement-{}.json",
uuid::Uuid::new_v4()
));
fs::write(
&placement_path,
r#"{"revision":"e2e","aggregate_placement":{"placements":[]},"projection_placement":{"placements":[]},"runner_placement":{"placements":[]}}"#,
)
.unwrap();
let mut child = tokio::process::Command::new(env!("CARGO_BIN_EXE_api"))
.env("CONTROL_API_ADDR", format!("127.0.0.1:{api_port}"))
.env("CONTROL_GATEWAY_JWT_HS256_SECRET", "e2e_secret")
.env("CONTROL_PLACEMENT_PATH", placement_path.to_string_lossy().to_string())
.env(
"CONTROL_FLEET_SERVICES",
format!(
"gateway=http://127.0.0.1:{gateway_port},aggregate=http://127.0.0.1:{aggregate_port},projection=http://127.0.0.1:{projection_port},runner=http://127.0.0.1:{runner_port}"
),
)
.spawn()
.expect("failed to spawn control-api");
wait_ready(&api_url).await;
let client = reqwest::Client::new();
let t = token(secret, &["control:read"]);
let res = client
.get(format!("{api_url}/admin/v1/fleet/snapshot"))
.header(reqwest::header::AUTHORIZATION, format!("Bearer {t}"))
.send()
.await
.unwrap();
assert!(res.status().is_success());
let v: serde_json::Value = res.json().await.unwrap();
let services = v.get("services").and_then(|x| x.as_array()).unwrap();
assert!(
services.len() >= 5,
"expected at least 5 services (including control-api), got {}",
services.len()
);
let res = client
.get(format!("{api_url}/admin/v1/tenants"))
.header(reqwest::header::AUTHORIZATION, format!("Bearer {t}"))
.send()
.await
.unwrap();
assert!(res.status().is_success());
let _ = child.kill().await;
for id in [gateway_id, runner_id, aggregate_id, projection_id] {
let _ = tokio::process::Command::new("docker")
.args(["stop", &id])
.output()
.await;
}
let _ = fs::remove_file(&conf_path);
let _ = fs::remove_file(&placement_path);
}

@@ -0,0 +1,30 @@
#[test]
fn fleet_services_env_parser_is_lenient() {
let services = {
fn parse(spec: &str) -> Vec<api::FleetService> {
spec.split(',')
.filter_map(|pair| {
let pair = pair.trim();
if pair.is_empty() {
return None;
}
let (name, url) = pair.split_once('=')?;
let name = name.trim();
let url = url.trim();
if name.is_empty() || url.is_empty() {
return None;
}
Some(api::FleetService {
name: name.to_string(),
base_url: url.to_string(),
})
})
.collect()
}
parse(" gateway=http://x , ,runner=http://y,broken, =http://z ")
};
assert_eq!(services.len(), 2);
assert_eq!(services[0].name, "gateway");
assert_eq!(services[1].name, "runner");
}

@@ -0,0 +1,23 @@
use std::time::Duration;
#[tokio::test]
#[ignore]
async fn nats_integration_tests_are_gated_and_fast_fail() {
let url = std::env::var("CONTROL_TEST_NATS_URL").expect("CONTROL_TEST_NATS_URL is required");
let without_scheme = url.strip_prefix("nats://").unwrap_or(url.as_str());
let hostport = without_scheme.split('/').next().unwrap_or(without_scheme);
let mut parts = hostport.split(':');
let host = parts.next().unwrap_or("127.0.0.1");
let port: u16 = parts
.next()
.unwrap_or("4222")
.parse()
.expect("invalid port in CONTROL_TEST_NATS_URL");
let connect = tokio::net::TcpStream::connect((host, port));
tokio::time::timeout(Duration::from_secs(2), connect)
.await
.expect("tcp connect to NATS timed out")
.expect("failed to connect to NATS");
}

@@ -0,0 +1,75 @@
use std::{collections::BTreeSet, fs, path::PathBuf};
fn repo_root() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.and_then(|p| p.parent())
.expect("api crate should live under repo root")
.to_path_buf()
}
#[test]
fn grafana_provisioning_files_are_syntactically_valid() {
let root = repo_root();
let datasources = fs::read_to_string(
root.join("observability/grafana/provisioning/datasources/datasources.yml"),
)
.expect("missing grafana datasources provisioning file");
let dashboards = fs::read_to_string(
root.join("observability/grafana/provisioning/dashboards/dashboards.yml"),
)
.expect("missing grafana dashboards provisioning file");
let _datasources_yaml: serde_yaml::Value =
serde_yaml::from_str(&datasources).expect("invalid grafana datasources yaml");
let _dashboards_yaml: serde_yaml::Value =
serde_yaml::from_str(&dashboards).expect("invalid grafana dashboards yaml");
}
#[test]
fn grafana_dashboards_are_syntactically_valid_json() {
let root = repo_root();
let dashboards_dir = root.join("observability/grafana/dashboards");
let mut found = 0usize;
for entry in fs::read_dir(&dashboards_dir).expect("missing dashboards dir") {
let entry = entry.expect("failed to read dashboards dir entry");
let path = entry.path();
if path.extension().and_then(|e| e.to_str()) != Some("json") {
continue;
}
found += 1;
let raw = fs::read_to_string(&path).expect("failed to read dashboard json");
let _: serde_json::Value =
serde_json::from_str(&raw).unwrap_or_else(|e| panic!("{path:?}: {e}"));
}
assert!(found > 0, "expected at least one dashboard json file");
}
#[test]
fn vmagent_config_parses_and_includes_required_jobs() {
let root = repo_root();
let scrape = fs::read_to_string(root.join("observability/vmagent/scrape.yml"))
.expect("missing vmagent scrape config");
let value: serde_yaml::Value =
serde_yaml::from_str(&scrape).expect("invalid vmagent scrape yaml");
let mut job_names = BTreeSet::<String>::new();
if let Some(scrape_configs) = value.get("scrape_configs").and_then(|v| v.as_sequence()) {
for cfg in scrape_configs {
if let Some(job) = cfg.get("job_name").and_then(|v| v.as_str()) {
job_names.insert(job.to_string());
}
}
}
for required in ["victoria-metrics", "vmagent", "control-api"] {
assert!(
job_names.contains(required),
"vmagent scrape config missing required job_name={required}"
);
}
}

@@ -0,0 +1,61 @@
use std::{
net::TcpStream,
path::PathBuf,
process::Command,
time::{Duration, Instant},
};
fn repo_root() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.and_then(|p| p.parent())
.expect("api crate should live under repo root")
.to_path_buf()
}
fn wait_for_tcp(addr: &str, timeout: Duration) -> bool {
let start = Instant::now();
while start.elapsed() < timeout {
if TcpStream::connect_timeout(
&addr.parse().expect("invalid socket addr"),
Duration::from_secs(1),
)
.is_ok()
{
return true;
}
std::thread::sleep(Duration::from_millis(250));
}
false
}
#[test]
#[ignore]
fn observability_stack_reaches_healthy_state_fast() {
let enabled = std::env::var("CONTROL_TEST_DOCKER").ok();
assert_eq!(enabled.as_deref(), Some("1"));
let root = repo_root();
let compose = root.join("observability/docker-compose.yml");
let up = Command::new("docker")
.args(["compose", "-f"])
.arg(&compose)
.args(["up", "-d"])
.status()
.expect("failed to run docker compose up");
assert!(up.success(), "docker compose up failed");
let ok = wait_for_tcp("127.0.0.1:3000", Duration::from_secs(30))
&& wait_for_tcp("127.0.0.1:8428", Duration::from_secs(30))
&& wait_for_tcp("127.0.0.1:3100", Duration::from_secs(30))
&& wait_for_tcp("127.0.0.1:3200", Duration::from_secs(30));
let _ = Command::new("docker")
.args(["compose", "-f"])
.arg(&compose)
.args(["down", "-v"])
.status();
assert!(ok, "observability stack did not become reachable in time");
}

@@ -0,0 +1,43 @@
use std::{fs, path::PathBuf, thread, time::Duration};
use api::PlacementStore;
fn tmp_file(name: &str) -> PathBuf {
let mut p = std::env::temp_dir();
p.push(format!(
"cloudlysis-control-{name}-{}-{}.json",
std::process::id(),
std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_nanos()
));
p
}
#[test]
fn placement_store_hot_reload_swaps_atomically() {
let path = tmp_file("placement");
fs::write(
&path,
r#"{"revision":"r1","aggregate_placement":{"placements":[]},"projection_placement":{"placements":[]},"runner_placement":{"placements":[]}}"#,
)
.unwrap();
let store = PlacementStore::new(path.clone());
let a1 = store.get_for_kind(api::ServiceKind::Aggregate);
assert_eq!(a1.revision, "r1");
thread::sleep(Duration::from_millis(5));
fs::write(
&path,
r#"{"revision":"r2","aggregate_placement":{"placements":[]},"projection_placement":{"placements":[]},"runner_placement":{"placements":[]}}"#,
)
.unwrap();
let a2 = store.get_for_kind(api::ServiceKind::Aggregate);
assert_eq!(a2.revision, "r2");
let _ = fs::remove_file(&path);
}

@@ -0,0 +1,31 @@
use std::{fs, path::PathBuf};
#[test]
fn swarm_store_is_deterministic_from_file() {
let mut path = std::env::temp_dir();
path.push(format!(
"cloudlysis-control-swarm-{}-{}.json",
std::process::id(),
uuid::Uuid::new_v4()
));
fs::write(
&path,
r#"{"services":[{"name":"gateway","image":"x","mode":"replicated","replicas":"1/1","updated_at":null}],"tasks":[{"id":"t1","service":"gateway","node":"n1","desired_state":"running","current_state":"running","error":null}]}"#,
)
.unwrap();
let store = api::SwarmStore::new(PathBuf::from(&path));
let services = store.list_services();
assert_eq!(services.len(), 1);
assert_eq!(services[0].name, "gateway");
let tasks = store.list_tasks("gateway");
assert_eq!(tasks.len(), 1);
assert_eq!(tasks[0].id, "t1");
let none = store.list_tasks("missing");
assert_eq!(none.len(), 0);
let _ = fs::remove_file(&path);
}

@@ -0,0 +1,42 @@
use std::time::Duration;
#[tokio::test]
#[ignore]
async fn docker_swarm_smoke_test_is_gated_and_times_out() {
let enabled = std::env::var("CONTROL_TEST_DOCKER").ok();
assert_eq!(enabled.as_deref(), Some("1"));
let stack = "cloudlysis_control_test";
let compose = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.and_then(|p| p.parent())
.unwrap()
.join("swarm/stacks/control-plane.yml");
let deploy = tokio::process::Command::new("docker")
.args(["stack", "deploy", "-c"])
.arg(&compose)
.arg(stack)
.output();
let out = tokio::time::timeout(Duration::from_secs(30), deploy)
.await
.expect("docker stack deploy timed out")
.expect("failed to run docker stack deploy");
assert!(
out.status.success(),
"docker stack deploy failed: {}",
String::from_utf8_lossy(&out.stderr)
);
let ls = tokio::process::Command::new("docker")
.args(["service", "ls"])
.output();
let _ = tokio::time::timeout(Duration::from_secs(10), ls).await;
let rm = tokio::process::Command::new("docker")
.args(["stack", "rm"])
.arg(stack)
.output();
let _ = tokio::time::timeout(Duration::from_secs(10), rm).await;
}

@@ -0,0 +1,40 @@
use std::{fs, path::PathBuf};
fn repo_root() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.parent()
.and_then(|p| p.parent())
.expect("api crate should live under repo root")
.to_path_buf()
}
#[test]
fn stack_files_parse_as_yaml() {
let root = repo_root();
for file in [
root.join("swarm/stacks/control-plane.yml"),
root.join("swarm/stacks/observability.yml"),
] {
let raw = fs::read_to_string(&file).unwrap();
let _: serde_yaml::Value = serde_yaml::from_str(&raw).unwrap();
}
}
#[test]
fn control_plane_stack_has_required_services() {
let root = repo_root();
let raw = fs::read_to_string(root.join("swarm/stacks/control-plane.yml")).unwrap();
let v: serde_yaml::Value = serde_yaml::from_str(&raw).unwrap();
let services = v
.get("services")
.and_then(|x| x.as_mapping())
.expect("missing services");
for required in ["control-api", "control-ui"] {
assert!(
services.contains_key(serde_yaml::Value::String(required.to_string())),
"missing service {required}"
);
}
}

control/prd.md Normal file (601 lines)

@@ -0,0 +1,601 @@
### 🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)
**Definition:**
This repository hosts the **platform control plane**:
1) the **Admin UI** used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and
2) the **observability stack** and **production dashboards** (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.
The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.
---
## **Context: Existing Node Repositories (../)**
This PRD is derived from the currently implemented node repos in `../`:
- **Aggregate**: expects a control node to manage tenant placement and scaling operations, including tenant migrations ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L82-L151)). Tenant placement primitives and KV helper exist ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L5-L227)).
- **Gateway**: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production ([gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L13-L175)).
- **Projection**: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L7-L96)).
- **Runner**: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics ([tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L11-L104)) and exposes admin endpoints for drain/reload in its PRD ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L199-L210)).
The control plane also adopts the proven **Admin UI UX + component library** from UltraBase's control-plane admin UI, adapting screens and information architecture to Cloudlysis needs:
- Reusable UI components live under [ui/control-plane-admin/src/components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui).
- Example pages include [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx), [AdminUsersPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminUsersPage.tsx), [AdminSessionsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminSessionsPage.tsx), [FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx), and [ObservabilityPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/ObservabilityPage.tsx).
---
## **Problem Statement**
Operating the platform without a unified control plane forces operators to:
- Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
- Manage tenants, placements, and deployments without a consistent audit trail
- Correlate production incidents across services with incomplete dashboards and unsafe levels of access
The platform needs a control plane that:
- Centralizes **admin workflows** and **production operability**
- Enforces **least-privilege RBAC**, **step-up**, and **auditing**
- Provides a consistent, safe abstraction over **tenant placement**, **scale**, and **production operations**
---
## **Goals**
- Deliver an Admin UI with full admin management over:
- users, sessions, roles/permissions
- configuration (global + per-tenant)
- definitions (aggregates, projections, sagas, effects, manifests)
- scaling and production management (tenant placement, drains, migrations, deployments)
- Package production-grade monitoring:
- metrics via VictoriaMetrics
- logs via Loki
- dashboards and alerting via Grafana (+ vmalert where used)
- Make production operations observable, auditable, and safe by default:
- strong change logging + approvals where needed
- idempotent operations + dry runs + rollback paths
---
## **Non-Goals**
- Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
- Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
- Provide an arbitrary “general API gateway” for third-party upstreams.
---
## **Primary Users**
- **Platform Owner / SRE**: fleet operations, incident response, production change management.
- **Platform Admin**: tenant provisioning, RBAC, config/definition promotion.
- **Security Admin**: access reviews, session revocation, audit trails.
- **Support / On-call**: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).
---
## **Key Concepts**
### Control Plane Scope
- The control plane is the authoritative interface for production operations and admin management.
- The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.
### Tenant-Aware Operations
- All tenant-scoped operations are keyed by `tenant_id` (consistent with `x-tenant-id` usage across nodes and Gateway).
- Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L188-L226), [tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L41-L104)).
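
In development, the file-backed placement source uses the shape below (matching the `PlacementFile` structs in `control/api`; the tenant id and target names are illustrative):

```json
{
  "revision": "rev-123",
  "aggregate_placement": {
    "placements": [
      { "tenant_id": "00000000-0000-0000-0000-000000000001", "targets": ["aggregate-1"] }
    ]
  },
  "projection_placement": { "placements": [] },
  "runner_placement": {
    "placements": [
      { "tenant_id": "00000000-0000-0000-0000-000000000001", "targets": ["runner-1"] }
    ]
  }
}
```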
### Safe Change Management
- Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
- All high-impact operations support:
- validation and preflight checks
- dry-run planning
- idempotency keys
- explicit rollback guidance
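
As one sketch of the idempotency requirement (illustrative, std-only; `OpLog` and its in-memory map are hypothetical, not the control plane's actual job store), replaying a mutating operation under the same idempotency key must return the recorded result instead of re-executing:

```rust
use std::collections::HashMap;

/// Hypothetical in-memory record of completed operations, keyed by the
/// caller-supplied idempotency key. A real store would be durable.
struct OpLog {
    results: HashMap<String, String>,
}

impl OpLog {
    fn new() -> Self {
        Self { results: HashMap::new() }
    }

    /// Run `op` at most once per key. On replay, return the recorded
    /// result and skip side effects entirely.
    fn apply(&mut self, key: &str, op: impl FnOnce() -> String) -> (bool, String) {
        if let Some(prev) = self.results.get(key) {
            return (true, prev.clone()); // replayed: no side effects
        }
        let result = op();
        self.results.insert(key.to_string(), result.clone());
        (false, result)
    }
}

fn main() {
    let mut log = OpLog::new();
    // First call executes; the second call with the same key replays.
    let (replayed, rev) = log.apply("drain-tenant-1", || "rev-1".to_string());
    println!("{replayed} {rev}"); // false rev-1
    let (replayed, rev) = log.apply("drain-tenant-1", || "rev-2".to_string());
    println!("{replayed} {rev}"); // true rev-1
}
```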
### Control Plane Components (In This Repo)
- **Admin UI (React)**:
- Reuse UltraBase's control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements ([components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)).
- The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
- **Control Plane API (BFF / Admin API)**:
- A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
- Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
- **Observability Stack**:
- Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after UltraBase's baseline ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)).
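
A minimal `observability/vmagent/scrape.yml` consistent with the repo's config tests, which require the `victoria-metrics`, `vmagent`, and `control-api` job names (the scrape targets and ports shown are illustrative):

```yaml
scrape_configs:
  - job_name: victoria-metrics
    static_configs:
      - targets: ["victoria-metrics:8428"]
  - job_name: vmagent
    static_configs:
      - targets: ["vmagent:8429"]
  - job_name: control-api
    static_configs:
      - targets: ["control-api:8080"]
```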
---
## **Functional Requirements**
### 1) Admin IAM (Users, Sessions, Roles)
#### 1.1 Users
- CRUD users with lifecycle states:
- invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
- Identity attributes:
- email (primary), optional secondary identities
- display name, avatar, metadata tags
- auth methods enabled (password, OIDC providers), MFA state
- Administrative actions:
- invite/resend invite
- reset password flow initiation
- force MFA reset / revoke recovery codes
- disable login / suspend user
- impersonation (break-glass, audited, time-boxed)
- Security constraints:
- privileged actions require step-up / recent auth
- sensitive events must be audit logged (who, what, when, why, from where)
#### 1.2 Sessions
- View active sessions and refresh token families:
- by user, by tenant, by IP / geo, by device, by time range
- Revoke capabilities:
- revoke a single session
- revoke all sessions for a user
- revoke all sessions for a tenant (incident response)
- Detection surfaces:
- unusual session fanout (many sessions per user)
- repeated failed logins / MFA failures
- suspicious IP changes
#### 1.3 Roles & Permissions (RBAC)
- Roles are sets of permissions; assignments bind principals to roles in a scope.
- Scopes:
- global (platform-level)
- tenant-scoped
- environment-scoped (dev/staging/prod) when applicable
- Required permission domains (minimum):
- iam.users.* (create/update/suspend/delete)
- iam.sessions.* (list/revoke)
- iam.roles.* (create/update/assign)
- tenants.* (create/update/archive)
- configs.* (read/write/approve/apply)
- definitions.* (read/write/validate/promote/rollback)
- scale.* (view/apply/migrate/drain)
- ops.* (deploy/rollback/restart/drain)
- observability.* (view dashboards, manage alert rules)
- audit.* (view/export)
- Role templates:
- owner, admin, operator, support, read-only, security-admin, break-glass
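
The control-api e2e tests authorize with an HS256 JWT whose claims carry the permission list directly, e.g. (values illustrative):

```json
{
  "sub": "op_1",
  "session_id": "sess_1",
  "permissions": ["control:read"],
  "exp": 1767225600
}
```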
---
### 2) Tenant Management
- Create, list, and archive tenants.
- Tenant status model:
- provisioning, active, draining, migrating, degraded, suspended, archived
- Tenant metadata:
- plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
- Tenant operational actions:
- trigger provisioning workflows (create streams/buckets, seed configs, create placement)
- rotate tenant secrets (as definitions/config allow)
- pause/resume workload (soft kill switch via config flags)
Tenant pages should mirror UltraBase's “Tenant Overview + subpages” navigation patterns (example: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx) and [TenantOverviewPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantOverviewPage.tsx)).
---
### 3) Configuration Management (Global + Per-Tenant)
#### 3.1 Config Model
- Config items are versioned, typed documents with:
- scope (global / tenant / environment)
- schema version
- provenance (who/what wrote it)
- effective date and rollout strategy
- Config must support:
- validation against a schema
- diff view (previous vs next)
- staged rollout (preview → apply)
- rollback to a prior version
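A minimal sketch of the versioned config document and the diff view, assuming illustrative field names:

```typescript
// Versioned config document sketch matching the model above; the shape is illustrative.
type ConfigScope = { kind: 'global' } | { kind: 'tenant'; tenantId: string }

type ConfigVersion = {
  scope: ConfigScope
  schemaVersion: string
  version: number              // monotonically increasing per config item
  writtenBy: string            // provenance: who/what wrote it
  effectiveAt: string          // ISO timestamp
  body: Record<string, unknown>
}

// Diff view: keys added, removed, or changed between the previous and next bodies.
function diffConfig(prev: Record<string, unknown>, next: Record<string, unknown>) {
  const keys = new Set([...Object.keys(prev), ...Object.keys(next)])
  const changes: { key: string; from?: unknown; to?: unknown }[] = []
  for (const key of keys) {
    const from = prev[key]
    const to = next[key]
    if (JSON.stringify(from) !== JSON.stringify(to)) changes.push({ key, from, to })
  }
  return changes
}
```

The same diff output can back both the preview step of a staged rollout and the audit record of what actually changed.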
#### 3.2 Node-Related Configuration
Required config surfaces (minimum):
- **Gateway**: routing/placement sources, auth policies, rate limits (see routing expectations in [gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L154-L175)).
- **Aggregate / Projection / Runner**:
- shard identifiers and tenant allowlists/placement settings
- drain/reload toggles and safety thresholds
- resource limits / concurrency caps
---
### 4) Definition Management (System “Blueprints”)
Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.
Required capabilities:
- Upload/edit versioned definitions with:
- validation (schema + semantic checks)
- “impact analysis” (which tenants/services are affected)
- promotion workflow (dev → staging → prod)
- Change controls:
- approvals (role-based) for production promotion
- emergency rollback path (one-click revert to last-known-good definition bundle)
- Tenant overrides:
- allow per-tenant definition overrides only when explicitly permitted by policy
The control plane must present definitions in a way that maps to the node runtime responsibilities:
- Aggregates and deterministic decide/apply programs ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L155-L160))
- Projections and deterministic project programs ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L36-L55))
- Runner sagas and effect provider manifests ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L41-L57))
---
### 5) Scale Management (Tenant Placement, Shards, Fleet)
#### 5.1 Placement Model
- Placement is modeled as:
- a set of nodes/shards and their attributes (labels, capacity, region)
- tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
- Control plane supports both:
- static placement (development)
- dynamic placement (production) backed by NATS KV (consistent with existing client patterns in [swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L79-L227))
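The placement model above can be sketched as plain data; these shapes are assumptions rather than the NATS KV encoding:

```typescript
// Placement sketch: shard attributes plus tenant -> shard assignments per service kind.
type ServiceKind = 'aggregate' | 'projection' | 'runner'
type Shard = { id: string; region: string; labels: string[]; capacity: number }

type Placement = {
  shards: Shard[]
  // assignments[serviceKind][tenantId] = shardId
  assignments: Record<ServiceKind, Record<string, string>>
}

// Resolve the shard serving a tenant for one service kind; undefined means unassigned.
function shardFor(p: Placement, kind: ServiceKind, tenantId: string): Shard | undefined {
  const shardId = p.assignments[kind][tenantId]
  return p.shards.find((s) => s.id === shardId)
}
```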
#### 5.2 Tenant Migration
- Provide guided migration planning and execution:
- show current assignment, target assignment, and a sequenced action plan
- execute “graceful drain → update placement → reload” style plans (see [plan_graceful_tenant_migration](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L41-L65))
- Migration safety:
- require explicit confirmation and reason
- block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
- time-box and alert if drains do not converge
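The "graceful drain → update placement → reload" plan can be represented as an explicit step sequence the UI shows before execution. A sketch with illustrative step names:

```typescript
// Sequenced migration plan sketch following the drain -> update placement -> reload shape.
type MigrationStep =
  | { action: 'drain'; service: string; tenantId: string }
  | { action: 'update_placement'; tenantId: string; from: string; to: string }
  | { action: 'reload'; service: string }

function planTenantMigration(
  tenantId: string,
  service: string,
  from: string,
  to: string,
): MigrationStep[] {
  if (from === to) return [] // already in place: nothing to do
  return [
    { action: 'drain', service, tenantId },           // stop new work, finish inflight
    { action: 'update_placement', tenantId, from, to },
    { action: 'reload', service },                     // pick up the new assignment
  ]
}
```

Because the plan is data, the safety checks (confirmation, reason, drain health) can be evaluated against it before anything executes.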
#### 5.3 Fleet View
- Fleet inventory:
- nodes (labels, region, capacity, version)
- services (replicas, image version, health)
- per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
- Operator actions:
- scale replicas, restart services, cordon/drain nodes (when supported by orchestrator)
UX should align with the UltraBase “Fleet” and “Topology” navigation patterns ([FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx)).
---
### 6) Production Operations (Deployments, Maintenance, Safety)
#### 6.1 Deployments
- Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
- environment-specific rollout policies
- canary/rolling deploy support (when orchestrator supports it)
  - automatic health-check gates and rollback triggers
- Track releases:
- “what is running where” (service version matrix)
- change log links and approvals
#### 6.2 Maintenance Operations
- Drain operations:
- tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in [TenantGate](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L106-L200))
- node drain (aggregate tenant ranges, projection consumers, runner workers)
- Replay / rebuild operations:
- projection rebuild triggers (dangerous, must be guarded and audited)
- workflow replay controls (reset checkpoints only with explicit intent)
#### 6.3 Incident Response Toolkit
- “Safe switches”:
- per-tenant kill switch (disable commands/effects via config)
- global degrade modes (rate limit reductions, disable expensive features)
- Run actions:
- revoke sessions at scale
- freeze deployments
- trigger drain/migrate with guided plan
---
### 7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards
#### 7.1 Stack Requirements
Adopt a production-ready stack consistent with UltraBase's operational baseline:
- **VictoriaMetrics** for metrics storage and Prometheus-compatible query
- **vmagent** for scraping and remote_write
- **Grafana** for dashboards and alert routing
- **Loki** (+ optional **Promtail**) for logs
- Optional **vmalert** for rule evaluation against VictoriaMetrics
UltraBase's observability design is a direct reference implementation to mirror and adapt:
- Stack overview and conventions: [observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)
- Provisioned dashboards and datasources: [grafana provisioning](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning)
#### 7.2 Metrics Conventions
- Every service exports `/metrics` in Prometheus format.
- Required labels:
- `service` (stable, low cardinality)
- `env` (dev/staging/prod)
- `tenant_id` only where safe and bounded; avoid tenant_id on high-frequency per-request series unless cardinality is controlled.
- HTTP metrics must avoid unbounded `path` cardinality; prefer route templates (pattern-based paths).
Tenant-aware metrics guidelines:
- Prefer **tenant-only aggregates** for “who is hurting us?” views:
- `..._requests_total{tenant_id,service,status_class}` (no `path`)
- `..._request_duration_seconds{tenant_id,service}` (no `path`, limited bucket count)
- Prefer **route-only aggregates** for “what endpoint is hurting us?” views:
- `..._requests_total{service,path,status}` (no `tenant_id`)
- Where per-tenant and per-route both matter, implement a **top-k sampling** policy:
- emit `(tenant_id,path)` series only for top N tenants, or only for a fixed allowlist of routes.
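The top-k policy can be sketched as a pure decision function that picks which label set a request observation may carry (helper names are illustrative):

```typescript
// Top-k sampling sketch: emit (tenant_id, path) series only for the top N tenants by volume.
function topKTenants(countsByTenant: Map<string, number>, k: number): Set<string> {
  const ranked = Array.from(countsByTenant.entries()).sort((a, b) => b[1] - a[1])
  return new Set(ranked.slice(0, k).map(([tenant]) => tenant))
}

// Decide which labels a request observation may carry without exploding cardinality.
function labelsFor(
  tenantId: string,
  path: string,
  detailedTenants: Set<string>,
): Record<string, string> {
  return detailedTenants.has(tenantId)
    ? { tenant_id: tenantId, path }  // detailed series for hot tenants only
    : { tenant_id: tenantId }        // tenant-only aggregate otherwise
}
```

Series cardinality then stays bounded by `k × routes` for the detailed view instead of `tenants × routes`.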
#### 7.3 Required Dashboards (Production)
Minimum set of dashboards (provisioned on startup):
- **Platform — Operations overview**
- `up` for core services and observability stack
- RPS, 4xx/5xx ratio, p95/p99 latency per service
- saturation indicators (CPU/mem, inflight, queue depth)
- **Platform — HTTP detail**
- per-service request breakdown by route template, method, status
- top failing paths and latency outliers
- **Platform — Logs**
- Loki stream filtering by `service`, `tenant_id` (where present), and correlation identifiers
- **Platform — Event bus / JetStream**
- consumer lag, redeliveries, ack latency, stream storage pressure
- **Platform — Workers (Runner)**
- outbox depth, effect latency, poison message counts, schedules backlog
- **Platform — Storage (libmdbx)**
- DB size growth, write stalls, fsync latency (where exported), disk usage
- **Platform — Cluster / Orchestrator**
- node health, container restarts, placement distribution by tenant range
Dashboards should be modeled after UltraBase's default set (for structure, not content), e.g. [ultrabase-operations.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-operations.json) and [ultrabase-http-detail.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-http-detail.json).
Additional production-operability dashboards (chosen and adapted):
- **Platform — Noisy Neighbor & Tenant Health**
- Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
- Panels (minimum):
- Top tenants by Gateway RPS (topk of tenant-only request counters).
- Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
- Tenant error ratio (5xx and 429) per tenant.
- Aggregate in-flight commands by tenant (already exported: `aggregate_in_flight_commands{tenant_id}`).
- Projection processing error rate by tenant (from `projection_processing_errors_total{tenant_id,view_type}` aggregated per tenant).
- Loki logs panel with a `tenant_id` variable selector; selecting a tenant syncs RPS/latency/errors + logs.
- Required instrumentation:
- Gateway must expose **tenant-level** HTTP counters/histograms (tenant + status class + service, without `path`) in addition to existing route-level metrics.
- **Platform — API Regression & Deployment**
- Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
- Panels (minimum):
- Error rate comparison “old vs new” by `service` and `version` (or `image_tag`) labels.
- Latency comparison “old vs new” (p95/p99) per service.
- Restart / flapping rate per service (container restarts, crash loops).
- Dependency latency correlation:
- Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
- Loki “new errors” panel:
- errors seen in the last 10m that were not present in the prior 60m window, grouped by `service`.
- Deployment annotations:
- vertical markers when Swarm service updates started/finished (via annotations or a deploy event metric).
- Required instrumentation:
- Every service exports a `*_build_info{service,version,git_sha}` gauge (value=1) or equivalent, and scrape relabeling adds `image_tag` where possible.
- Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes to Grafana annotations).
- **Platform — Storage & Event Bus Bottlenecks**
- Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
- Panels (minimum):
- NATS/JetStream health:
- stream storage pressure, publish/ack latency, consumer lag, redeliveries.
- Projection lag and throughput:
- events processed rate, processing duration, error rate.
- Aggregate write-path pressure:
- command duration, version conflicts, in-flight commands, tenant errors.
- Runner pressure:
- outbox dispatch failure rate, effect timeout rate, deadletter writes.
- Disk saturation on nodes hosting libmdbx:
- disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
- Optional Postgres/Autobase panels only when a managed DB backs any control-plane metadata:
- pool saturation, replica lag, slow queries, long transactions.
- Required instrumentation:
- Ensure JetStream metrics are scraped (NATS server `/varz` exporter or native Prometheus endpoint depending on deployment).
- Ensure node-level disk/IO metrics are scraped (node exporter / cadvisor / equivalent).
- **Platform — Infrastructure Exhaustion**
- Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
- Panels (minimum):
- CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
- OOM kill tracker across the cluster.
- Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
- vmagent health:
- scrape error rate, remote_write errors, queue backlog.
- Loki ingestion health:
- dropped log lines (promtail) and ingestion errors (loki).
- Swarm task hygiene:
- desired_state vs current_state mismatches, orphaned tasks, restart loops.
- Required instrumentation:
- node exporter / cadvisor (or equivalent) must be part of the production scrape plan.
- promtail (or alternative) must expose drop/error metrics when logs are enabled.
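The `*_build_info` requirement from the API Regression dashboard amounts to one line in Prometheus text exposition format. A sketch of the renderer (the label order and metric-name prefix here are assumptions):

```typescript
// Render a build_info gauge (value = 1) in Prometheus text exposition format,
// matching the {service, version, git_sha} label convention described above.
function buildInfoLine(service: string, version: string, gitSha: string): string {
  const labels = `service="${service}",version="${version}",git_sha="${gitSha}"`
  return `${service}_build_info{${labels}} 1`
}
```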
#### 7.4 Alerting Requirements
Minimum alert classes:
- Availability:
- service down (`up == 0`)
- scrape failures, vmagent remote_write errors
- Reliability:
- sustained elevated 5xx ratio
- sustained elevated p95 latency per service
- Backlogs:
- JetStream consumer lag above threshold
- Runner outbox depth above threshold
- Data safety:
- disk usage near full (nodes hosting libmdbx)
- abnormal restart loops
- Security:
- login anomaly detection signals (where instrumented)
- suspicious spike in session revocations / failed MFA
Alert rules can follow UltraBase's approach of version-controlled rules in YAML (reference: [alerts/](file:///Users/vlad/Developer/madapes/ultrabase/observability/alerts)).
#### 7.5 Control Plane → Observability Linking
The Admin UI must embed or deep-link into observability tools:
- per-tenant and per-service quick links to Grafana dashboards and Loki queries
- incident triage shortcuts (operations overview → HTTP detail → logs)
This mirrors UltraBase's "observability links JSON" concept ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L65-L75)), but adapted to Cloudlysis services and dashboards.
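Deep links can be built as plain URLs with dashboard variables preselected; the dashboard UID and the `var-tenant_id` variable name below are assumptions about our Grafana provisioning:

```typescript
// Build a Grafana deep link with a tenant variable preselected.
// Dashboard UID and variable name are assumptions about our provisioned dashboards.
function grafanaTenantLink(baseUrl: string, dashboardUid: string, tenantId: string): string {
  const url = new URL(`/d/${dashboardUid}`, baseUrl)
  url.searchParams.set('var-tenant_id', tenantId) // Grafana template-variable query param
  return url.toString()
}
```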
---
### 8) Audit, Compliance, and Change History
- Audit log is an append-only stream of security and operations events:
- authentication and session events
- RBAC changes and permission grants
- config/definition changes and promotions
- scaling, drain, and migration operations
- deployments and rollbacks
- Audit log must support:
- search and export (bounded and access controlled)
- correlation to production incidents (request ids, trace ids)
- retention policy controls
---
### 9) Control Plane API Surface (Admin API)
The control plane requires a stable API surface for the Admin UI and automation.
Minimum API capabilities:
- **Idempotent jobs for multi-step operations**:
- every mutating operation returns a `job_id`, supports polling and cancellation, and records a full execution trace in the audit log.
- **Preflight endpoints**:
- validate an intended change and return a plan (and “would-change” diff) without applying it.
- **RBAC-first access model**:
- all endpoints enforce permission checks at the API boundary (UI is not trusted).
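The job contract above can be sketched as a small status machine plus a polling helper (field names are illustrative, not a finalized API):

```typescript
// Idempotent-job sketch: every mutating call returns a job_id that can be polled.
type JobStatus = 'pending' | 'running' | 'succeeded' | 'failed' | 'cancelled'
type Job = { jobId: string; status: JobStatus; trace: string[] } // trace feeds the audit log

function isTerminal(status: JobStatus): boolean {
  return status === 'succeeded' || status === 'failed' || status === 'cancelled'
}

// Poll until the job reaches a terminal state or attempts run out.
async function pollJob(
  fetchJob: (jobId: string) => Promise<Job>,
  jobId: string,
  maxAttempts = 10,
): Promise<Job> {
  let job = await fetchJob(jobId)
  for (let i = 1; i < maxAttempts && !isTerminal(job.status); i++) {
    job = await fetchJob(jobId)
  }
  return job
}
```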
Minimum endpoint groups:
- `/admin/v1/iam/*` (users, roles, assignments, sessions)
- `/admin/v1/tenants/*` (tenants lifecycle, status, metadata)
- `/admin/v1/config/*` (versioned config, diff, apply, rollback)
- `/admin/v1/definitions/*` (bundles, validate, promote, rollback)
- `/admin/v1/scale/*` (placement, migrations, drain status)
- `/admin/v1/ops/*` (deployments, rollbacks, service actions)
- `/admin/v1/observability/*` (links, saved queries, dashboard registry)
- `/admin/v1/audit/*` (search, export)
Authentication/authorization integration:
- Prefer using the **Gateway** as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.
---
### 10) Secrets and Credentials Management
The control plane must treat secrets as first-class operational data with strict handling.
Requirements:
- Secret values must never be logged and must be redacted in UI/API responses.
- Secrets must support:
- creation and rotation workflows
- scoped access (global/tenant/environment)
- staged rollout (write new → verify → promote → retire old)
- Rendering rules:
- after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
- Integrations:
- support referencing secrets from config/definitions without embedding values (secret refs).
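Secret refs can be any stable, non-value-bearing reference format; a sketch assuming an illustrative `secretref://<scope>/<name>` URI:

```typescript
// Secret reference sketch: configs embed "secretref://<scope>/<name>" instead of values.
// The URI shape is an assumption; any stable, non-value-bearing format works.
type SecretRef = { scope: 'global' | 'tenant' | 'environment'; name: string }

function parseSecretRef(raw: string): SecretRef | undefined {
  const m = /^secretref:\/\/(global|tenant|environment)\/([A-Za-z0-9_.-]+)$/.exec(raw)
  if (!m) return undefined
  return { scope: m[1] as SecretRef['scope'], name: m[2] }
}

// Redaction rule: references may be shown; anything else that could be a value is masked.
function redactForDisplay(raw: string): string {
  return parseSecretRef(raw) ? raw : '***'
}
```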
---
### 11) Backups, Restore, and Disaster Recovery (Production Operability)
The control plane must provide explicit visibility and guardrails for data safety operations.
Minimum requirements:
- **Backup status**:
- show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
- **Restore readiness**:
- preflight checks that validate a restore plan (target environment, versions, dependencies).
- **Operational playbooks**:
- link to the exact restore procedure and post-restore verification checklist.
- **Key rotation**:
- explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.
This should align with the platform's existing operational patterns (e.g., the explicit "restore / post-restore checks" concept used in UltraBase observability docs).
---
## **Admin UI Requirements (Information Architecture + UX)**
### Navigation (Minimum)
Left navigation sections:
- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings
### Page Patterns (Reuse UltraBase UI)
Adopt the UltraBase component system and page layout patterns:
- Layout, styling tokens, UI primitives: [components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)
- Table + search + action dropdown pattern: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx#L94-L203)
Required page types:
- List pages:
- searchable table, bulk actions, row actions menu, status pills, empty states
- Detail pages:
- header with primary actions (drain, migrate, rollback)
- sub-nav tabs for domain-specific views
- Mutation flows:
- modal confirmation + explicit reason entry for high-impact changes
- toast notifications and “busy” state handling consistent with UltraBase patterns
### Tenant Detail Subpages (Minimum)
- Overview (status, assignments, SLO highlights)
- Placement (per service: Aggregate/Projection/Runner)
- Health (node readiness and dependency checks)
- Config (effective config + diffs)
- Definitions (applied definition bundle + version)
- Activity (audit trail filtered to tenant)
- Observability (embedded links / panels)
---
## **Non-Functional Requirements**
- **Security**:
- strict RBAC everywhere; deny-by-default
- audit every privileged operation
- step-up for sensitive actions
- CSRF protection for browser sessions
- safe secret handling (no secret values rendered after creation unless explicitly permitted)
- allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
- **Reliability**:
- control plane operations are idempotent and resilient to partial failures
- operations have clear “current state” and do not rely on UI assumptions
- **Performance**:
- list pages paginate and filter server-side for large fleets
- dashboards load with bounded query costs and controlled label cardinality
- **Operability**:
- control plane itself must be observable (metrics/logs, dashboards, alerts)
- every operation must surface preflight checks and post-conditions
---
## **Open Questions / Design Constraints (To Resolve During Implementation)**
- Where does the source of truth live for:
- users/sessions/roles (Gateway vs control-plane backing store)?
- configs/definitions (NATS KV vs database vs GitOps)?
- How should production promotions be modeled:
- environment branches, approval workflow, and rollback semantics?
- What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
- Where should the job/execution state for long-running operations live:
- embedded in the control plane API process, durable store, or NATS workflows?

control/ui/.gitignore (vendored, new file)
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
dist
dist-ssr
*.local
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?

control/ui/README.md (new file)
# React + TypeScript + Vite
This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.
Currently, two official plugins are available:
- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react) uses [Oxc](https://oxc.rs)
- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react-swc) uses [SWC](https://swc.rs/)
## React Compiler
The React Compiler is not enabled on this template because of its impact on dev & build performances. To add it, see [this documentation](https://react.dev/learn/react-compiler/installation).
## Expanding the ESLint configuration
If you are developing a production application, we recommend updating the configuration to enable type-aware lint rules:
```js
export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
// Other configs...
// Remove tseslint.configs.recommended and replace with this
tseslint.configs.recommendedTypeChecked,
// Alternatively, use this for stricter rules
tseslint.configs.strictTypeChecked,
// Optionally, add this for stylistic rules
tseslint.configs.stylisticTypeChecked,
// Other configs...
],
languageOptions: {
parserOptions: {
project: ['./tsconfig.node.json', './tsconfig.app.json'],
tsconfigRootDir: import.meta.dirname,
},
// other options...
},
},
])
```
You can also install [eslint-plugin-react-x](https://github.com/Rel1cx/eslint-react/tree/main/packages/plugins/eslint-plugin-react-x) and [eslint-plugin-react-dom](https://github.com/Rel1cx/eslint-react/tree/main/packages/plugins/eslint-plugin-react-dom) for React-specific lint rules:
```js
// eslint.config.js
import reactX from 'eslint-plugin-react-x'
import reactDom from 'eslint-plugin-react-dom'
export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
// Other configs...
// Enable lint rules for React
reactX.configs['recommended-typescript'],
// Enable lint rules for React DOM
reactDom.configs.recommended,
],
languageOptions: {
parserOptions: {
project: ['./tsconfig.node.json', './tsconfig.app.json'],
tsconfigRootDir: import.meta.dirname,
},
// other options...
},
},
])
```

(new file)
import js from '@eslint/js'
import globals from 'globals'
import reactHooks from 'eslint-plugin-react-hooks'
import reactRefresh from 'eslint-plugin-react-refresh'
import tseslint from 'typescript-eslint'
import { defineConfig, globalIgnores } from 'eslint/config'
export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
js.configs.recommended,
tseslint.configs.recommended,
reactHooks.configs.flat.recommended,
reactRefresh.configs.vite,
],
languageOptions: {
ecmaVersion: 2020,
globals: globals.browser,
},
},
])

control/ui/index.html (new file)
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/favicon.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>ui</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

control/ui/nginx.conf (new file)
server {
listen 80;
server_name _;
root /usr/share/nginx/html;
index index.html;
location / {
try_files $uri $uri/ /index.html;
}
}

control/ui/package-lock.json (generated, new file; diff suppressed because it is too large)

control/ui/package.json (new file)
{
"name": "ui",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "tsc -b && vite build",
"lint": "eslint .",
"typecheck": "tsc -b --pretty false",
"test": "vitest run",
"preview": "vite preview"
},
"dependencies": {
"react": "^19.2.4",
"react-dom": "^19.2.4",
"react-router-dom": "^7.9.3"
},
"devDependencies": {
"@eslint/js": "^9.39.4",
"@testing-library/jest-dom": "^6.9.0",
"@testing-library/react": "^16.3.0",
"@types/node": "^24.12.0",
"@types/react": "^19.2.14",
"@types/react-dom": "^19.2.3",
"@vitejs/plugin-react": "^6.0.1",
"eslint": "^9.39.4",
"eslint-plugin-react-hooks": "^7.0.1",
"eslint-plugin-react-refresh": "^0.5.2",
"globals": "^17.4.0",
"jsdom": "^27.0.0",
"typescript": "~5.9.3",
"typescript-eslint": "^8.57.0",
"vite": "^8.0.1",
"vitest": "^3.2.4"
}
}

(new asset, 9.3 KiB; diff suppressed because one or more lines are too long)

(new file)
<svg xmlns="http://www.w3.org/2000/svg">
<symbol id="bluesky-icon" viewBox="0 0 16 17">
<g clip-path="url(#bluesky-clip)"><path fill="#08060d" d="M7.75 7.735c-.693-1.348-2.58-3.86-4.334-5.097-1.68-1.187-2.32-.981-2.74-.79C.188 2.065.1 2.812.1 3.251s.241 3.602.398 4.13c.52 1.744 2.367 2.333 4.07 2.145-2.495.37-4.71 1.278-1.805 4.512 3.196 3.309 4.38-.71 4.987-2.746.608 2.036 1.307 5.91 4.93 2.746 2.72-2.746.747-4.143-1.747-4.512 1.702.189 3.55-.4 4.07-2.145.156-.528.397-3.691.397-4.13s-.088-1.186-.575-1.406c-.42-.19-1.06-.395-2.741.79-1.755 1.24-3.64 3.752-4.334 5.099"/></g>
<defs><clipPath id="bluesky-clip"><path fill="#fff" d="M.1.85h15.3v15.3H.1z"/></clipPath></defs>
</symbol>
<symbol id="discord-icon" viewBox="0 0 20 19">
<path fill="#08060d" d="M16.224 3.768a14.5 14.5 0 0 0-3.67-1.153c-.158.286-.343.67-.47.976a13.5 13.5 0 0 0-4.067 0c-.128-.306-.317-.69-.476-.976A14.4 14.4 0 0 0 3.868 3.77C1.546 7.28.916 10.703 1.231 14.077a14.7 14.7 0 0 0 4.5 2.306q.545-.748.965-1.587a9.5 9.5 0 0 1-1.518-.74q.191-.14.372-.293c2.927 1.369 6.107 1.369 8.999 0q.183.152.372.294-.723.437-1.52.74.418.838.963 1.588a14.6 14.6 0 0 0 4.504-2.308c.37-3.911-.63-7.302-2.644-10.309m-9.13 8.234c-.878 0-1.599-.82-1.599-1.82 0-.998.705-1.82 1.6-1.82.894 0 1.614.82 1.599 1.82.001 1-.705 1.82-1.6 1.82m5.91 0c-.878 0-1.599-.82-1.599-1.82 0-.998.705-1.82 1.6-1.82.893 0 1.614.82 1.599 1.82 0 1-.706 1.82-1.6 1.82"/>
</symbol>
<symbol id="documentation-icon" viewBox="0 0 21 20">
<path fill="none" stroke="#aa3bff" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.35" d="m15.5 13.333 1.533 1.322c.645.555.967.833.967 1.178s-.322.623-.967 1.179L15.5 18.333m-3.333-5-1.534 1.322c-.644.555-.966.833-.966 1.178s.322.623.966 1.179l1.534 1.321"/>
<path fill="none" stroke="#aa3bff" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.35" d="M17.167 10.836v-4.32c0-1.41 0-2.117-.224-2.68-.359-.906-1.118-1.621-2.08-1.96-.599-.21-1.349-.21-2.848-.21-2.623 0-3.935 0-4.983.369-1.684.591-3.013 1.842-3.641 3.428C3 6.449 3 7.684 3 10.154v2.122c0 2.558 0 3.838.706 4.726q.306.383.713.671c.76.536 1.79.64 3.581.66"/>
<path fill="none" stroke="#aa3bff" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.35" d="M3 10a2.78 2.78 0 0 1 2.778-2.778c.555 0 1.209.097 1.748-.047.48-.129.854-.503.982-.982.145-.54.048-1.194.048-1.749a2.78 2.78 0 0 1 2.777-2.777"/>
</symbol>
<symbol id="github-icon" viewBox="0 0 19 19">
<path fill="#08060d" fill-rule="evenodd" d="M9.356 1.85C5.05 1.85 1.57 5.356 1.57 9.694a7.84 7.84 0 0 0 5.324 7.44c.387.079.528-.168.528-.376 0-.182-.013-.805-.013-1.454-2.165.467-2.616-.935-2.616-.935-.349-.91-.864-1.143-.864-1.143-.71-.48.051-.48.051-.48.787.051 1.2.805 1.2.805.695 1.194 1.817.857 2.268.649.064-.507.27-.857.49-1.052-1.728-.182-3.545-.857-3.545-3.87 0-.857.31-1.558.8-2.104-.078-.195-.349-1 .077-2.078 0 0 .657-.208 2.14.805a7.5 7.5 0 0 1 1.946-.26c.657 0 1.328.092 1.946.26 1.483-1.013 2.14-.805 2.14-.805.426 1.078.155 1.883.078 2.078.502.546.799 1.247.799 2.104 0 3.013-1.818 3.675-3.558 3.87.284.247.528.714.528 1.454 0 1.052-.012 1.896-.012 2.156 0 .208.142.455.528.377a7.84 7.84 0 0 0 5.324-7.441c.013-4.338-3.48-7.844-7.773-7.844" clip-rule="evenodd"/>
</symbol>
<symbol id="social-icon" viewBox="0 0 20 20">
<path fill="none" stroke="#aa3bff" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.35" d="M12.5 6.667a4.167 4.167 0 1 0-8.334 0 4.167 4.167 0 0 0 8.334 0"/>
<path fill="none" stroke="#aa3bff" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.35" d="M2.5 16.667a5.833 5.833 0 0 1 8.75-5.053m3.837.474.513 1.035c.07.144.257.282.414.309l.93.155c.596.1.736.536.307.965l-.723.73a.64.64 0 0 0-.152.531l.207.903c.164.715-.213.991-.84.618l-.872-.52a.63.63 0 0 0-.577 0l-.872.52c-.624.373-1.003.094-.84-.618l.207-.903a.64.64 0 0 0-.152-.532l-.723-.729c-.426-.43-.289-.864.306-.964l.93-.156a.64.64 0 0 0 .412-.31l.513-1.034c.28-.562.735-.562 1.012 0"/>
</symbol>
<symbol id="x-icon" viewBox="0 0 19 19">
<path fill="#08060d" fill-rule="evenodd" d="M1.893 1.98c.052.072 1.245 1.769 2.653 3.77l2.892 4.114c.183.261.333.48.333.486s-.068.089-.152.183l-.522.593-.765.867-3.597 4.087c-.375.426-.734.834-.798.905a1 1 0 0 0-.118.148c0 .01.236.017.664.017h.663l.729-.83c.4-.457.796-.906.879-.999a692 692 0 0 0 1.794-2.038c.034-.037.301-.34.594-.675l.551-.624.345-.392a7 7 0 0 1 .34-.374c.006 0 .93 1.306 2.052 2.903l2.084 2.965.045.063h2.275c1.87 0 2.273-.003 2.266-.021-.008-.02-1.098-1.572-3.894-5.547-2.013-2.862-2.28-3.246-2.273-3.266.008-.019.282-.332 2.085-2.38l2-2.274 1.567-1.782c.022-.028-.016-.03-.65-.03h-.674l-.3.342a871 871 0 0 1-1.782 2.025c-.067.075-.405.458-.75.852a100 100 0 0 1-.803.91c-.148.172-.299.344-.99 1.127-.304.343-.32.358-.345.327-.015-.019-.904-1.282-1.976-2.808L6.365 1.85H1.8zm1.782.91 8.078 11.294c.772 1.08 1.413 1.973 1.425 1.984.016.017.241.02 1.05.017l1.03-.004-2.694-3.766L7.796 5.75 5.722 2.852l-1.039-.004-1.039-.004z" clip-rule="evenodd"/>
</symbol>
</svg>


control/ui/src/App.css (new file)
.counter {
font-size: 16px;
padding: 5px 10px;
border-radius: 5px;
color: var(--accent);
background: var(--accent-bg);
border: 2px solid transparent;
transition: border-color 0.3s;
margin-bottom: 24px;
&:hover {
border-color: var(--accent-border);
}
&:focus-visible {
outline: 2px solid var(--accent);
outline-offset: 2px;
}
}
.hero {
position: relative;
.base,
.framework,
.vite {
inset-inline: 0;
margin: 0 auto;
}
.base {
width: 170px;
position: relative;
z-index: 0;
}
.framework,
.vite {
position: absolute;
}
.framework {
z-index: 1;
top: 34px;
height: 28px;
transform: perspective(2000px) rotateZ(300deg) rotateX(44deg) rotateY(39deg)
scale(1.4);
}
.vite {
z-index: 0;
top: 107px;
height: 26px;
width: auto;
transform: perspective(2000px) rotateZ(300deg) rotateX(40deg) rotateY(39deg)
scale(0.8);
}
}
#center {
display: flex;
flex-direction: column;
gap: 25px;
place-content: center;
place-items: center;
flex-grow: 1;
@media (max-width: 1024px) {
padding: 32px 20px 24px;
gap: 18px;
}
}
#next-steps {
display: flex;
border-top: 1px solid var(--border);
text-align: left;
& > div {
flex: 1 1 0;
padding: 32px;
@media (max-width: 1024px) {
padding: 24px 20px;
}
}
.icon {
margin-bottom: 16px;
width: 22px;
height: 22px;
}
@media (max-width: 1024px) {
flex-direction: column;
text-align: center;
}
}
#docs {
border-right: 1px solid var(--border);
@media (max-width: 1024px) {
border-right: none;
border-bottom: 1px solid var(--border);
}
}
#next-steps ul {
list-style: none;
padding: 0;
display: flex;
gap: 8px;
margin: 32px 0 0;
.logo {
height: 18px;
}
a {
color: var(--text-h);
font-size: 16px;
border-radius: 6px;
background: var(--social-bg);
display: flex;
padding: 6px 12px;
align-items: center;
gap: 8px;
text-decoration: none;
transition: box-shadow 0.3s;
&:hover {
box-shadow: var(--shadow);
}
.button-icon {
height: 18px;
width: 18px;
}
}
@media (max-width: 1024px) {
margin-top: 20px;
flex-wrap: wrap;
justify-content: center;
li {
flex: 1 1 calc(50% - 8px);
}
a {
width: 100%;
justify-content: center;
box-sizing: border-box;
}
}
}
#spacer {
height: 88px;
border-top: 1px solid var(--border);
@media (max-width: 1024px) {
height: 48px;
}
}
.ticks {
position: relative;
width: 100%;
&::before,
&::after {
content: '';
position: absolute;
top: -4.5px;
border: 5px solid transparent;
}
&::before {
left: 0;
border-left-color: var(--border);
}
&::after {
right: 0;
border-right-color: var(--border);
}
}

control/ui/src/App.tsx Normal file

@@ -0,0 +1,8 @@
import { RouterProvider } from 'react-router-dom'
import { createBrowserAppRouter } from './app/router'
const router = createBrowserAppRouter()
export default function App() {
return <RouterProvider router={router} />
}


@@ -0,0 +1,122 @@
type RequestIds = {
requestId: string
correlationId?: string
traceparent?: string
}
const LAST_IDS_STORAGE_KEY = 'control:last_request_ids'
export class ApiError extends Error {
status: number
requestId: string
correlationId?: string
traceparent?: string
constructor(args: {
status: number
message: string
requestId: string
correlationId?: string
traceparent?: string
}) {
super(args.message)
this.name = 'ApiError'
this.status = args.status
this.requestId = args.requestId
this.correlationId = args.correlationId
this.traceparent = args.traceparent
}
}
const state: {
last?: RequestIds
} = {}
function isRecord(value: unknown): value is Record<string, unknown> {
return typeof value === 'object' && value !== null
}
function loadLastIds(): RequestIds | undefined {
try {
const raw = localStorage.getItem(LAST_IDS_STORAGE_KEY)
if (!raw) return undefined
const parsed = JSON.parse(raw) as unknown
if (isRecord(parsed) && typeof parsed.requestId === 'string') {
const correlationId =
typeof parsed.correlationId === 'string' ? parsed.correlationId : undefined
const traceparent =
typeof parsed.traceparent === 'string' ? parsed.traceparent : undefined
return { requestId: parsed.requestId, correlationId, traceparent }
}
} catch {
return undefined
}
return undefined
}
function persistLastIds(ids: RequestIds) {
try {
localStorage.setItem(LAST_IDS_STORAGE_KEY, JSON.stringify(ids))
} catch {
return
}
}
function newRequestId(): string {
if (typeof crypto !== 'undefined' && 'randomUUID' in crypto) {
return crypto.randomUUID()
}
return `${Date.now()}-${Math.random().toString(16).slice(2)}`
}
export function getLastRequestIds(): RequestIds | undefined {
return state.last ?? loadLastIds()
}
type ApiRequestInit = RequestInit & {
correlationId?: string
traceparent?: string
useLastCorrelationId?: boolean
useLastTraceparent?: boolean
}
export async function apiFetch(
input: RequestInfo | URL,
init?: ApiRequestInit,
) {
const requestId = newRequestId()
const headers = new Headers(init?.headers)
headers.set('x-request-id', requestId)
const last = getLastRequestIds()
const correlationId =
init?.correlationId ?? (init?.useLastCorrelationId ? last?.correlationId : undefined)
const traceparent =
init?.traceparent ?? (init?.useLastTraceparent ? last?.traceparent : undefined)
if (correlationId) headers.set('x-correlation-id', correlationId)
if (traceparent) headers.set('traceparent', traceparent)
const res = await fetch(input, { ...init, headers })
const resCorrelationId = res.headers.get('x-correlation-id') ?? correlationId ?? undefined
const resTraceparent = res.headers.get('traceparent') ?? traceparent ?? undefined
const ids = { requestId, correlationId: resCorrelationId, traceparent: resTraceparent }
state.last = ids
persistLastIds(ids)
if (!res.ok) {
const text = await res.text().catch(() => '')
const err = new ApiError({
status: res.status,
requestId,
correlationId: resCorrelationId,
traceparent: resTraceparent,
message: `API error ${res.status}${text ? `: ${text}` : ''} (request_id=${requestId}${
resCorrelationId ? ` correlation_id=${resCorrelationId}` : ''
})`,
})
throw err
}
return res
}
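The fallback ID path in `newRequestId` (used when `crypto.randomUUID` is unavailable) and the `ApiError` message shape can be checked in isolation. A minimal sketch, re-declaring both as standalone helpers since the module keeps them private (the names `fallbackRequestId` and `formatApiError` are illustrative, not exports of the file above):

```typescript
// Fallback request-id shape: "<epoch-ms>-<hex>", mirroring newRequestId above.
function fallbackRequestId(): string {
  return `${Date.now()}-${Math.random().toString(16).slice(2)}`
}

// Error-message shape produced by apiFetch on a non-2xx response.
function formatApiError(
  status: number,
  text: string,
  requestId: string,
  correlationId?: string,
): string {
  return `API error ${status}${text ? `: ${text}` : ''} (request_id=${requestId}${
    correlationId ? ` correlation_id=${correlationId}` : ''
  })`
}
```

This keeps the ID and message formats greppable: logs and UI error banners carry the same `request_id=`/`correlation_id=` tokens the backend is expected to echo.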


@@ -0,0 +1,179 @@
import { apiFetch } from './client'
import { getAccessToken } from '../auth/token'
function baseUrl() {
const v = import.meta.env.VITE_CONTROL_API_URL as string | undefined
return (v ?? 'http://127.0.0.1:8080').replace(/\/$/, '')
}
async function apiJson<T>(path: string): Promise<T> {
const controller = new AbortController()
const t = window.setTimeout(() => controller.abort(), 2000)
const token = getAccessToken()
const headers: HeadersInit = token ? { Authorization: `Bearer ${token}` } : {}
try {
const res = await apiFetch(`${baseUrl()}${path}`, {
headers,
signal: controller.signal,
useLastCorrelationId: true,
useLastTraceparent: true,
})
return (await res.json()) as T
} finally {
window.clearTimeout(t)
}
}
async function apiPostJson<T>(path: string, body: unknown, idempotencyKey?: string): Promise<T> {
const controller = new AbortController()
const t = window.setTimeout(() => controller.abort(), 2000)
const token = getAccessToken()
const headers: HeadersInit = {
'content-type': 'application/json',
...(token ? { Authorization: `Bearer ${token}` } : {}),
...(idempotencyKey ? { 'Idempotency-Key': idempotencyKey } : {}),
}
try {
const res = await apiFetch(`${baseUrl()}${path}`, {
method: 'POST',
headers,
body: JSON.stringify(body),
signal: controller.signal,
useLastCorrelationId: true,
useLastTraceparent: true,
})
return (await res.json()) as T
} finally {
window.clearTimeout(t)
}
}
export type FleetSnapshot = {
services: Array<{
name: string
base_url: string
health_ok: boolean
ready_ok: boolean
metrics_ok: boolean
}>
}
export type PlacementResponse = {
kind: 'aggregate' | 'projection' | 'runner'
revision: string
placements: Array<{ tenant_id: string; targets: string[] }>
}
export type TenantsResponse = {
tenants: Array<{
tenant_id: string
aggregate_targets: string[]
projection_targets: string[]
runner_targets: string[]
}>
}
export type Job = {
job_id: string
status: 'pending' | 'running' | 'succeeded' | 'failed' | 'cancelled'
steps: Array<{ name: string; status: Job['status']; attempts: number; error?: string | null }>
error?: string | null
created_at_ms: number
started_at_ms?: number | null
finished_at_ms?: number | null
}
export type AuditEvent = {
ts_ms: number
principal_sub: string
action: string
tenant_id?: string | null
reason: string
job_id?: string | null
}
export function getFleetSnapshot(): Promise<FleetSnapshot> {
return apiJson('/admin/v1/fleet/snapshot')
}
export function getPlacement(kind: 'aggregate' | 'projection' | 'runner'): Promise<PlacementResponse> {
return apiJson(`/admin/v1/placement/${kind}`)
}
export function getTenants(): Promise<TenantsResponse> {
return apiJson('/admin/v1/tenants')
}
export function getJob(jobId: string): Promise<Job> {
return apiJson(`/admin/v1/jobs/${jobId}`)
}
export function cancelJob(jobId: string): Promise<void> {
return apiPostJson(`/admin/v1/jobs/${jobId}/cancel`, {}, undefined).then(() => undefined)
}
export function startTenantDrainJob(args: {
tenantId: string
reason: string
idempotencyKey: string
}): Promise<{ job_id: string }> {
return apiPostJson(
'/admin/v1/jobs/tenant/drain',
{ tenant_id: args.tenantId, reason: args.reason },
args.idempotencyKey,
)
}
export function startTenantMigrateJob(args: {
tenantId: string
runnerTarget: string
reason: string
idempotencyKey: string
}): Promise<{ job_id: string }> {
return apiPostJson(
'/admin/v1/jobs/tenant/migrate',
{ tenant_id: args.tenantId, runner_target: args.runnerTarget, reason: args.reason },
args.idempotencyKey,
)
}
export function planTenantMigrate(args: {
tenantId: string
runnerTarget: string
reason: string
}): Promise<{ steps: string[] }> {
return apiPostJson('/admin/v1/plan/tenant/migrate', {
tenant_id: args.tenantId,
runner_target: args.runnerTarget,
reason: args.reason,
})
}
export function listAudit(): Promise<{ events: AuditEvent[] }> {
return apiJson('/admin/v1/audit')
}
export type SwarmService = {
name: string
image?: string | null
mode?: string | null
replicas?: string | null
updated_at?: string | null
}
export type SwarmTask = {
id: string
service: string
node?: string | null
desired_state?: string | null
current_state?: string | null
error?: string | null
}
export function getSwarmServices(): Promise<{ services: SwarmService[] }> {
return apiJson('/admin/v1/swarm/services')
}
export function getSwarmTasks(serviceName: string): Promise<{ service: string; tasks: SwarmTask[] }> {
return apiJson(`/admin/v1/swarm/services/${encodeURIComponent(serviceName)}/tasks`)
}
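Callers polling `getJob` typically stop once a job reaches a terminal status. A small sketch of that check plus a poll loop (hypothetical helpers, not part of the module; the status union is copied from the `Job` type above, and the fetcher is injected so the loop stays network-free):

```typescript
type JobStatus = 'pending' | 'running' | 'succeeded' | 'failed' | 'cancelled'

// A job is terminal once it can no longer change state.
function isTerminal(status: JobStatus): boolean {
  return status === 'succeeded' || status === 'failed' || status === 'cancelled'
}

// Poll until terminal; fetchJob would wrap getJob(jobId) in real use.
async function waitForJob(
  fetchJob: () => Promise<{ status: JobStatus }>,
  intervalMs = 1000,
): Promise<JobStatus> {
  for (;;) {
    const job = await fetchJob()
    if (isTerminal(job.status)) return job.status
    await new Promise((r) => setTimeout(r, intervalMs))
  }
}
```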


@@ -0,0 +1,183 @@
import { useMemo, useState } from 'react'
import { Link, Outlet, useLocation } from 'react-router-dom'
import { getLastRequestIds } from '../api/client'
import { Button, Code, TextInput } from '../components/primitives'
type NavItem = {
label: string
to: string
}
const navItems: NavItem[] = [
{ label: 'Overview', to: '/' },
{ label: 'Tenants', to: '/tenants' },
{ label: 'Users', to: '/users' },
{ label: 'Sessions', to: '/sessions' },
{ label: 'Roles & Permissions', to: '/roles-permissions' },
{ label: 'Config', to: '/config' },
{ label: 'Definitions', to: '/definitions' },
{ label: 'Scale & Placement', to: '/scale-placement' },
{ label: 'Deployments', to: '/deployments' },
{ label: 'Observability', to: '/observability' },
{ label: 'Audit Log', to: '/audit-log' },
{ label: 'Settings', to: '/settings' },
]
function normalizePath(pathname: string) {
if (pathname === '') return '/'
if (pathname === '/') return '/'
return pathname.endsWith('/') ? pathname.slice(0, -1) : pathname
}
export function Layout() {
const location = useLocation()
const active = normalizePath(location.pathname)
const [query, setQuery] = useState('')
const lastIds = getLastRequestIds()
const grafana = useMemo(() => {
const base = (import.meta.env.VITE_GRAFANA_URL as string | undefined) ?? ''
const loki = (import.meta.env.VITE_GRAFANA_LOKI_DATASOURCE as string | undefined) ?? 'Loki'
const tempo = (import.meta.env.VITE_GRAFANA_TEMPO_DATASOURCE as string | undefined) ?? 'Tempo'
return { base, loki, tempo }
}, [])
function openGrafanaLogs(id: string) {
if (!grafana.base) return
const left = encodeURIComponent(
JSON.stringify({
datasource: grafana.loki,
queries: [{ refId: 'A', expr: `{correlation_id="${id}"}` }],
}),
)
window.open(`${grafana.base.replace(/\/$/, '')}/explore?left=${left}`, '_blank', 'noreferrer')
}
function openGrafanaTrace(id: string) {
if (!grafana.base) return
const left = encodeURIComponent(
JSON.stringify({
datasource: grafana.tempo,
queries: [{ refId: 'A', queryType: 'traceId', traceId: id }],
}),
)
window.open(`${grafana.base.replace(/\/$/, '')}/explore?left=${left}`, '_blank', 'noreferrer')
}
async function copy(text: string) {
try {
await navigator.clipboard.writeText(text)
} catch {
return
}
}
return (
<div style={{ display: 'flex', minHeight: '100vh', fontFamily: 'system-ui, sans-serif' }}>
<aside
style={{
width: 260,
borderRight: '1px solid #eee',
padding: 16,
background: '#fafafa',
}}
>
<div style={{ fontWeight: 700, marginBottom: 16 }}>Cloudlysis Control</div>
<nav style={{ display: 'flex', flexDirection: 'column', gap: 8 }}>
{navItems.map((item) => {
const isActive = active === normalizePath(item.to)
return (
<Link
key={item.to}
to={item.to}
style={{
textDecoration: 'none',
color: '#111',
padding: '6px 10px',
borderRadius: 8,
background: isActive ? '#eaeaea' : 'transparent',
}}
>
{item.label}
</Link>
)
})}
</nav>
</aside>
<div style={{ flex: 1 }}>
<header
style={{
borderBottom: '1px solid #eee',
padding: 16,
display: 'flex',
flexDirection: 'column',
gap: 10,
}}
>
<div style={{ display: 'flex', gap: 12, alignItems: 'center' }}>
<div style={{ width: 420, maxWidth: '100%' }}>
<TextInput
ariaLabel="Global search"
placeholder="Search request/correlation/trace id"
value={query}
onChange={setQuery}
onKeyDown={(e) => {
if (e.key === 'Enter') {
const id = query.trim()
if (!id) return
openGrafanaLogs(id)
}
}}
/>
</div>
<Button
onClick={() => {
const id = query.trim()
if (!id) return
openGrafanaLogs(id)
}}
disabled={!grafana.base}
>
Logs
</Button>
<Button
onClick={() => {
const id = query.trim()
if (!id) return
openGrafanaTrace(id)
}}
disabled={!grafana.base}
>
Trace
</Button>
</div>
{lastIds ? (
<div style={{ display: 'flex', gap: 16, alignItems: 'center', flexWrap: 'wrap' }}>
<div style={{ display: 'flex', gap: 8, alignItems: 'center' }}>
<span style={{ fontSize: 12, color: '#666' }}>request_id</span>
<Code>{lastIds.requestId}</Code>
<Button onClick={() => copy(lastIds.requestId)}>Copy</Button>
</div>
{lastIds.correlationId ? (
<div style={{ display: 'flex', gap: 8, alignItems: 'center' }}>
<span style={{ fontSize: 12, color: '#666' }}>correlation_id</span>
<Code>{lastIds.correlationId}</Code>
<Button onClick={() => copy(lastIds.correlationId ?? '')}>Copy</Button>
<Button
onClick={() => openGrafanaLogs(lastIds.correlationId ?? '')}
disabled={!grafana.base}
>
Investigate
</Button>
</div>
) : null}
</div>
) : null}
</header>
<Outlet />
</div>
</div>
)
}
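The Grafana deep links in `openGrafanaLogs` above are plain Explore URLs carrying a JSON `left` payload. Extracted as a pure helper for clarity (a sketch mirroring the component code, assuming the same Loki datasource and `correlation_id` label convention):

```typescript
// Builds the Grafana Explore URL shape used by openGrafanaLogs above.
function grafanaLogsUrl(base: string, datasource: string, correlationId: string): string {
  const left = encodeURIComponent(
    JSON.stringify({
      datasource,
      queries: [{ refId: 'A', expr: `{correlation_id="${correlationId}"}` }],
    }),
  )
  return `${base.replace(/\/$/, '')}/explore?left=${left}`
}
```

Keeping the URL construction pure makes the payload round-trippable: decoding the `left` parameter recovers the exact datasource and LogQL expression, which is handy in tests.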


@@ -0,0 +1,37 @@
import { cleanup, render, screen } from '@testing-library/react'
import { RouterProvider } from 'react-router-dom'
import { afterEach, describe, expect, it } from 'vitest'
import { createMemoryAppRouter } from './router'
afterEach(() => {
cleanup()
})
const paths = [
'/',
'/tenants',
'/users',
'/sessions',
'/roles-permissions',
'/config',
'/definitions',
'/scale-placement',
'/deployments',
'/observability',
'/audit-log',
'/settings',
]
describe('routing', () => {
it.each(paths)('renders %s without runtime errors', async (path: string) => {
const router = createMemoryAppRouter([path])
render(<RouterProvider router={router} />)
expect(await screen.findByRole('heading', { level: 1 })).toBeInTheDocument()
})
it('renders not found for unknown routes', async () => {
const router = createMemoryAppRouter(['/does-not-exist'])
render(<RouterProvider router={router} />)
expect(await screen.findByText('Not Found')).toBeInTheDocument()
})
})


@@ -0,0 +1,51 @@
import { createBrowserRouter, createMemoryRouter, type RouteObject } from 'react-router-dom'
import { Layout } from './layout'
import {
AuditLogPage,
ConfigPage,
DefinitionsPage,
DeploymentDetailPage,
DeploymentsPage,
JobPage,
NotFoundPage,
ObservabilityPage,
OverviewPage,
RolesPermissionsPage,
ScalePlacementPage,
SessionsPage,
SettingsPage,
TenantsPage,
UsersPage,
} from '../pages'
export const routes: RouteObject[] = [
{
path: '/',
element: <Layout />,
children: [
{ index: true, element: <OverviewPage /> },
{ path: 'tenants', element: <TenantsPage /> },
{ path: 'users', element: <UsersPage /> },
{ path: 'sessions', element: <SessionsPage /> },
{ path: 'roles-permissions', element: <RolesPermissionsPage /> },
{ path: 'config', element: <ConfigPage /> },
{ path: 'definitions', element: <DefinitionsPage /> },
{ path: 'scale-placement', element: <ScalePlacementPage /> },
{ path: 'deployments', element: <DeploymentsPage /> },
{ path: 'deployments/:serviceName', element: <DeploymentDetailPage /> },
{ path: 'observability', element: <ObservabilityPage /> },
{ path: 'audit-log', element: <AuditLogPage /> },
{ path: 'jobs/:jobId', element: <JobPage /> },
{ path: 'settings', element: <SettingsPage /> },
{ path: '*', element: <NotFoundPage /> },
],
},
]
export function createBrowserAppRouter() {
return createBrowserRouter(routes)
}
export function createMemoryAppRouter(initialEntries: string[]) {
return createMemoryRouter(routes, { initialEntries })
}

Binary file not shown (new file, 44 KiB).


@@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" class="iconify iconify--logos" width="35.93" height="32" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 228"><path fill="#00D8FF" d="M210.483 73.824a171.49 171.49 0 0 0-8.24-2.597c.465-1.9.893-3.777 1.273-5.621c6.238-30.281 2.16-54.676-11.769-62.708c-13.355-7.7-35.196.329-57.254 19.526a171.23 171.23 0 0 0-6.375 5.848a155.866 155.866 0 0 0-4.241-3.917C100.759 3.829 77.587-4.822 63.673 3.233C50.33 10.957 46.379 33.89 51.995 62.588a170.974 170.974 0 0 0 1.892 8.48c-3.28.932-6.445 1.924-9.474 2.98C17.309 83.498 0 98.307 0 113.668c0 15.865 18.582 31.778 46.812 41.427a145.52 145.52 0 0 0 6.921 2.165a167.467 167.467 0 0 0-2.01 9.138c-5.354 28.2-1.173 50.591 12.134 58.266c13.744 7.926 36.812-.22 59.273-19.855a145.567 145.567 0 0 0 5.342-4.923a168.064 168.064 0 0 0 6.92 6.314c21.758 18.722 43.246 26.282 56.54 18.586c13.731-7.949 18.194-32.003 12.4-61.268a145.016 145.016 0 0 0-1.535-6.842c1.62-.48 3.21-.974 4.76-1.488c29.348-9.723 48.443-25.443 48.443-41.52c0-15.417-17.868-30.326-45.517-39.844Zm-6.365 70.984c-1.4.463-2.836.91-4.3 1.345c-3.24-10.257-7.612-21.163-12.963-32.432c5.106-11 9.31-21.767 12.459-31.957c2.619.758 5.16 1.557 7.61 2.4c23.69 8.156 38.14 20.213 38.14 29.504c0 9.896-15.606 22.743-40.946 31.14Zm-10.514 20.834c2.562 12.94 2.927 24.64 1.23 33.787c-1.524 8.219-4.59 13.698-8.382 15.893c-8.067 4.67-25.32-1.4-43.927-17.412a156.726 156.726 0 0 1-6.437-5.87c7.214-7.889 14.423-17.06 21.459-27.246c12.376-1.098 24.068-2.894 34.671-5.345a134.17 134.17 0 0 1 1.386 6.193ZM87.276 214.515c-7.882 2.783-14.16 2.863-17.955.675c-8.075-4.657-11.432-22.636-6.853-46.752a156.923 156.923 0 0 1 1.869-8.499c10.486 2.32 22.093 3.988 34.498 4.994c7.084 9.967 14.501 19.128 21.976 27.15a134.668 134.668 0 0 1-4.877 4.492c-9.933 8.682-19.886 14.842-28.658 17.94ZM50.35 144.747c-12.483-4.267-22.792-9.812-29.858-15.863c-6.35-5.437-9.555-10.836-9.555-15.216c0-9.322 
13.897-21.212 37.076-29.293c2.813-.98 5.757-1.905 8.812-2.773c3.204 10.42 7.406 21.315 12.477 32.332c-5.137 11.18-9.399 22.249-12.634 32.792a134.718 134.718 0 0 1-6.318-1.979Zm12.378-84.26c-4.811-24.587-1.616-43.134 6.425-47.789c8.564-4.958 27.502 2.111 47.463 19.835a144.318 144.318 0 0 1 3.841 3.545c-7.438 7.987-14.787 17.08-21.808 26.988c-12.04 1.116-23.565 2.908-34.161 5.309a160.342 160.342 0 0 1-1.76-7.887Zm110.427 27.268a347.8 347.8 0 0 0-7.785-12.803c8.168 1.033 15.994 2.404 23.343 4.08c-2.206 7.072-4.956 14.465-8.193 22.045a381.151 381.151 0 0 0-7.365-13.322Zm-45.032-43.861c5.044 5.465 10.096 11.566 15.065 18.186a322.04 322.04 0 0 0-30.257-.006c4.974-6.559 10.069-12.652 15.192-18.18ZM82.802 87.83a323.167 323.167 0 0 0-7.227 13.238c-3.184-7.553-5.909-14.98-8.134-22.152c7.304-1.634 15.093-2.97 23.209-3.984a321.524 321.524 0 0 0-7.848 12.897Zm8.081 65.352c-8.385-.936-16.291-2.203-23.593-3.793c2.26-7.3 5.045-14.885 8.298-22.6a321.187 321.187 0 0 0 7.257 13.246c2.594 4.48 5.28 8.868 8.038 13.147Zm37.542 31.03c-5.184-5.592-10.354-11.779-15.403-18.433c4.902.192 9.899.29 14.978.29c5.218 0 10.376-.117 15.453-.343c-4.985 6.774-10.018 12.97-15.028 18.486Zm52.198-57.817c3.422 7.8 6.306 15.345 8.596 22.52c-7.422 1.694-15.436 3.058-23.88 4.071a382.417 382.417 0 0 0 7.859-13.026a347.403 347.403 0 0 0 7.425-13.565Zm-16.898 8.101a358.557 358.557 0 0 1-12.281 19.815a329.4 329.4 0 0 1-23.444.823c-7.967 0-15.716-.248-23.178-.732a310.202 310.202 0 0 1-12.513-19.846h.001a307.41 307.41 0 0 1-10.923-20.627a310.278 310.278 0 0 1 10.89-20.637l-.001.001a307.318 307.318 0 0 1 12.413-19.761c7.613-.576 15.42-.876 23.31-.876H128c7.926 0 15.743.303 23.354.883a329.357 329.357 0 0 1 12.335 19.695a358.489 358.489 0 0 1 11.036 20.54a329.472 329.472 0 0 1-11 20.722Zm22.56-122.124c8.572 4.944 11.906 24.881 6.52 51.026c-.344 1.668-.73 3.367-1.15 5.09c-10.622-2.452-22.155-4.275-34.23-5.408c-7.034-10.017-14.323-19.124-21.64-27.008a160.789 160.789 0 0 1 5.888-5.4c18.9-16.447 36.564-22.941 
44.612-18.3ZM128 90.808c12.625 0 22.86 10.235 22.86 22.86s-10.235 22.86-22.86 22.86s-22.86-10.235-22.86-22.86s10.235-22.86 22.86-22.86Z"></path></svg>

New file: 4.0 KiB.

File diff suppressed because one or more lines are too long (new file, 8.5 KiB).


@@ -0,0 +1,23 @@
const TOKEN_KEY = 'control:access_token'
export function getAccessToken(): string | undefined {
try {
const v = localStorage.getItem(TOKEN_KEY)
return v && v.trim() ? v : undefined
} catch {
return undefined
}
}
export function setAccessToken(token: string) {
try {
const v = token.trim()
if (!v) {
localStorage.removeItem(TOKEN_KEY)
return
}
localStorage.setItem(TOKEN_KEY, v)
} catch {
return
}
}


@@ -0,0 +1,148 @@
import type { KeyboardEvent, ReactNode } from 'react'
const colors = {
border: '#ddd',
borderSubtle: '#eee',
text: '#111',
muted: '#666',
danger: '#b00020',
bg: '#fff',
bgSubtle: '#fafafa',
bgActive: '#eaeaea',
}
export function Button(props: {
children: ReactNode
onClick?: () => void
disabled?: boolean
variant?: 'default' | 'danger'
type?: 'button' | 'submit'
}) {
const variant = props.variant ?? 'default'
const borderColor = variant === 'danger' ? colors.danger : colors.border
const textColor = variant === 'danger' ? colors.danger : colors.text
return (
<button
type={props.type ?? 'button'}
onClick={props.onClick}
disabled={props.disabled}
style={{
padding: '8px 10px',
borderRadius: 8,
border: `1px solid ${borderColor}`,
background: colors.bg,
color: textColor,
cursor: props.disabled ? 'not-allowed' : 'pointer',
opacity: props.disabled ? 0.6 : 1,
}}
>
{props.children}
</button>
)
}
export function TextInput(props: {
id?: string
value: string
onChange: (value: string) => void
placeholder?: string
ariaLabel?: string
onKeyDown?: (e: KeyboardEvent<HTMLInputElement>) => void
}) {
return (
<input
id={props.id}
aria-label={props.ariaLabel}
value={props.value}
onChange={(e) => props.onChange(e.target.value)}
placeholder={props.placeholder}
onKeyDown={props.onKeyDown}
style={{
padding: '8px 10px',
borderRadius: 8,
border: `1px solid ${colors.border}`,
width: '100%',
}}
/>
)
}
export function Code(props: { children: ReactNode }) {
return <code style={{ fontSize: 12 }}>{props.children}</code>
}
export function ErrorText(props: { children: ReactNode }) {
return <div style={{ color: colors.danger }}>{props.children}</div>
}
export function MutedText(props: { children: ReactNode }) {
return <div style={{ fontSize: 12, color: colors.muted }}>{props.children}</div>
}
export function Table(props: { columns: ReactNode[]; rows: ReactNode[][] }) {
return (
<div style={{ overflowX: 'auto' }}>
<table style={{ borderCollapse: 'collapse', width: '100%' }}>
<thead>
<tr>
{props.columns.map((c, idx) => (
<th
key={idx}
style={{ textAlign: 'left', padding: 8, borderBottom: `1px solid ${colors.borderSubtle}` }}
>
{c}
</th>
))}
</tr>
</thead>
<tbody>
{props.rows.map((r, ridx) => (
<tr key={ridx}>
{r.map((cell, cidx) => (
<td key={cidx} style={{ padding: 8, borderBottom: `1px solid ${colors.bgActive}` }}>
{cell}
</td>
))}
</tr>
))}
</tbody>
</table>
</div>
)
}
export function Modal(props: {
title: string
open: boolean
onClose: () => void
children: ReactNode
footer?: ReactNode
}) {
if (!props.open) return null
return (
<div
role="dialog"
aria-modal="true"
style={{
position: 'fixed',
inset: 0,
background: 'rgba(0,0,0,0.35)',
display: 'flex',
alignItems: 'center',
justifyContent: 'center',
padding: 24,
}}
onMouseDown={(e) => {
if (e.target === e.currentTarget) props.onClose()
}}
>
<div style={{ background: colors.bg, borderRadius: 12, padding: 16, width: 520, maxWidth: '100%' }}>
<div style={{ fontWeight: 700, marginBottom: 8 }}>{props.title}</div>
<div>{props.children}</div>
{props.footer ? <div style={{ marginTop: 16 }}>{props.footer}</div> : null}
</div>
</div>
)
}

control/ui/src/index.css Normal file

@@ -0,0 +1,111 @@
:root {
--text: #6b6375;
--text-h: #08060d;
--bg: #fff;
--border: #e5e4e7;
--code-bg: #f4f3ec;
--accent: #aa3bff;
--accent-bg: rgba(170, 59, 255, 0.1);
--accent-border: rgba(170, 59, 255, 0.5);
--social-bg: rgba(244, 243, 236, 0.5);
--shadow:
rgba(0, 0, 0, 0.1) 0 10px 15px -3px, rgba(0, 0, 0, 0.05) 0 4px 6px -2px;
--sans: system-ui, 'Segoe UI', Roboto, sans-serif;
--heading: system-ui, 'Segoe UI', Roboto, sans-serif;
--mono: ui-monospace, Consolas, monospace;
font: 18px/145% var(--sans);
letter-spacing: 0.18px;
color-scheme: light dark;
color: var(--text);
background: var(--bg);
font-synthesis: none;
text-rendering: optimizeLegibility;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
@media (max-width: 1024px) {
font-size: 16px;
}
}
@media (prefers-color-scheme: dark) {
:root {
--text: #9ca3af;
--text-h: #f3f4f6;
--bg: #16171d;
--border: #2e303a;
--code-bg: #1f2028;
--accent: #c084fc;
--accent-bg: rgba(192, 132, 252, 0.15);
--accent-border: rgba(192, 132, 252, 0.5);
--social-bg: rgba(47, 48, 58, 0.5);
--shadow:
rgba(0, 0, 0, 0.4) 0 10px 15px -3px, rgba(0, 0, 0, 0.25) 0 4px 6px -2px;
}
#social .button-icon {
filter: invert(1) brightness(2);
}
}
#root {
width: 1126px;
max-width: 100%;
margin: 0 auto;
text-align: center;
border-inline: 1px solid var(--border);
min-height: 100svh;
display: flex;
flex-direction: column;
box-sizing: border-box;
}
body {
margin: 0;
}
h1,
h2 {
font-family: var(--heading);
font-weight: 500;
color: var(--text-h);
}
h1 {
font-size: 56px;
letter-spacing: -1.68px;
margin: 32px 0;
@media (max-width: 1024px) {
font-size: 36px;
margin: 20px 0;
}
}
h2 {
font-size: 24px;
line-height: 118%;
letter-spacing: -0.24px;
margin: 0 0 8px;
@media (max-width: 1024px) {
font-size: 20px;
}
}
p {
margin: 0;
}
code,
.counter {
font-family: var(--mono);
display: inline-flex;
border-radius: 4px;
color: var(--text-h);
}
code {
font-size: 15px;
line-height: 135%;
padding: 4px 8px;
background: var(--code-bg);
}

control/ui/src/main.tsx Normal file

@@ -0,0 +1,10 @@
import { StrictMode } from 'react'
import { createRoot } from 'react-dom/client'
import './index.css'
import App from './App.tsx'
createRoot(document.getElementById('root')!).render(
<StrictMode>
<App />
</StrictMode>,
)

control/ui/src/pages.tsx Normal file

@@ -0,0 +1,527 @@
import { useEffect, useMemo, useState, type ReactNode } from 'react'
import { useNavigate, useParams } from 'react-router-dom'
import {
getFleetSnapshot,
getPlacement,
getTenants,
getJob,
cancelJob,
listAudit,
getSwarmServices,
getSwarmTasks,
startTenantDrainJob,
startTenantMigrateJob,
type FleetSnapshot,
type PlacementResponse,
type TenantsResponse,
type Job,
type AuditEvent,
type SwarmService,
type SwarmTask,
} from './api/control'
import { getAccessToken, setAccessToken } from './auth/token'
import { Button, Code, ErrorText, Modal, MutedText, Table, TextInput } from './components/primitives'
function PageShell(props: { title: string; children?: ReactNode }) {
return (
<main style={{ padding: 24 }}>
<h1 style={{ margin: 0, fontSize: 22 }}>{props.title}</h1>
{props.children ? <div style={{ marginTop: 16 }}>{props.children}</div> : null}
</main>
)
}
export function OverviewPage() {
const [data, setData] = useState<FleetSnapshot | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
useEffect(() => {
let cancelled = false
getFleetSnapshot()
.then((d) => {
if (cancelled) return
setError(undefined)
setData(d)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
return () => {
cancelled = true
}
}, [])
return (
<PageShell title="Overview">
{error ? <ErrorText>{error}</ErrorText> : null}
{!data ? <div>Loading</div> : null}
{data ? (
<Table
columns={['Service', 'Base URL', 'Health', 'Ready', 'Metrics']}
rows={data.services.map((s) => [
s.name,
<Code key="url">{s.base_url}</Code>,
s.health_ok ? 'ok' : 'fail',
s.ready_ok ? 'ok' : 'fail',
s.metrics_ok ? 'ok' : 'fail',
])}
/>
) : null}
</PageShell>
)
}
export function TenantsPage() {
const [data, setData] = useState<TenantsResponse | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
const navigate = useNavigate()
const [action, setAction] = useState<
| { kind: 'drain'; tenantId: string }
| { kind: 'migrate'; tenantId: string }
| undefined
>(undefined)
const [reason, setReason] = useState('')
const [runnerTarget, setRunnerTarget] = useState('')
const [submitting, setSubmitting] = useState(false)
useEffect(() => {
let cancelled = false
getTenants()
.then((d) => {
if (cancelled) return
setError(undefined)
setData(d)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
return () => {
cancelled = true
}
}, [])
const canSubmit =
reason.trim().length > 0 &&
(!action || action.kind !== 'migrate' || runnerTarget.trim().length > 0)
function newIdempotencyKey() {
if (typeof crypto !== 'undefined' && 'randomUUID' in crypto) return crypto.randomUUID()
return `${Date.now()}-${Math.random().toString(16).slice(2)}`
}
return (
<PageShell title="Tenants">
{error ? <ErrorText>{error}</ErrorText> : null}
{!data ? <div>Loading</div> : null}
{data ? (
<Table
columns={['Tenant', 'Aggregate', 'Projection', 'Runner', 'Actions']}
rows={data.tenants.map((t) => [
<Code key="tenant">{t.tenant_id}</Code>,
<Code key="agg">{t.aggregate_targets.join(', ')}</Code>,
<Code key="proj">{t.projection_targets.join(', ')}</Code>,
<Code key="run">{t.runner_targets.join(', ')}</Code>,
<div key="actions" style={{ display: 'flex', gap: 8 }}>
<Button
onClick={() => {
setReason('')
setRunnerTarget('')
setAction({ kind: 'drain', tenantId: t.tenant_id })
}}
>
Drain
</Button>
<Button
onClick={() => {
setReason('')
setRunnerTarget('')
setAction({ kind: 'migrate', tenantId: t.tenant_id })
}}
>
Migrate
</Button>
</div>,
])}
/>
) : null}
<Modal
open={!!action}
title={action?.kind === 'drain' ? 'Confirm drain' : 'Confirm migrate'}
onClose={() => setAction(undefined)}
footer={
<div style={{ display: 'flex', gap: 10, justifyContent: 'flex-end' }}>
<Button onClick={() => setAction(undefined)} disabled={submitting}>
Cancel
</Button>
<Button
disabled={submitting || !canSubmit}
onClick={async () => {
if (!action) return
setSubmitting(true)
try {
const key = newIdempotencyKey()
const job =
action.kind === 'drain'
? await startTenantDrainJob({ tenantId: action.tenantId, reason, idempotencyKey: key })
: await startTenantMigrateJob({
tenantId: action.tenantId,
runnerTarget,
reason,
idempotencyKey: key,
})
setAction(undefined)
navigate(`/jobs/${job.job_id}`)
} catch (e) {
setError(e instanceof Error ? e.message : 'failed to start job')
} finally {
setSubmitting(false)
}
}}
>
Start job
</Button>
</div>
}
>
{action ? (
<div style={{ display: 'flex', flexDirection: 'column', gap: 12 }}>
<MutedText>
Tenant: <Code>{action.tenantId}</Code>
</MutedText>
{action.kind === 'migrate' ? (
<div style={{ display: 'flex', flexDirection: 'column', gap: 6 }}>
<label htmlFor="runnerTarget" style={{ fontSize: 12, color: '#666' }}>
Runner target
</label>
<TextInput
id="runnerTarget"
value={runnerTarget}
onChange={setRunnerTarget}
placeholder="e.g. node-2"
/>
</div>
) : null}
<div style={{ display: 'flex', flexDirection: 'column', gap: 6 }}>
<label htmlFor="reason" style={{ fontSize: 12, color: '#666' }}>
Reason (required)
</label>
<TextInput id="reason" value={reason} onChange={setReason} placeholder="why are you doing this?" />
</div>
</div>
) : null}
</Modal>
</PageShell>
)
}
export function UsersPage() {
return <PageShell title="Users" />
}
export function SessionsPage() {
return <PageShell title="Sessions" />
}
export function RolesPermissionsPage() {
return <PageShell title="Roles & Permissions" />
}
export function ConfigPage() {
return <PageShell title="Config" />
}
export function DefinitionsPage() {
return <PageShell title="Definitions" />
}
export function ScalePlacementPage() {
const [aggregate, setAggregate] = useState<PlacementResponse | undefined>(undefined)
const [projection, setProjection] = useState<PlacementResponse | undefined>(undefined)
const [runner, setRunner] = useState<PlacementResponse | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
useEffect(() => {
let cancelled = false
Promise.all([getPlacement('aggregate'), getPlacement('projection'), getPlacement('runner')])
.then(([a, p, r]) => {
if (cancelled) return
setError(undefined)
setAggregate(a)
setProjection(p)
setRunner(r)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
return () => {
cancelled = true
}
}, [])
const blocks = [
{ title: 'Aggregate', data: aggregate },
{ title: 'Projection', data: projection },
{ title: 'Runner', data: runner },
] as const
return (
<PageShell title="Scale & Placement">
{error ? <ErrorText>{error}</ErrorText> : null}
<div style={{ display: 'flex', flexDirection: 'column', gap: 16 }}>
{blocks.map((b) => (
<section key={b.title} style={{ border: '1px solid #eee', borderRadius: 12, padding: 12 }}>
<div style={{ fontWeight: 700, marginBottom: 8 }}>{b.title}</div>
{!b.data ? (
<div>Loading</div>
) : (
<pre style={{ margin: 0, fontSize: 12, overflowX: 'auto' }}>
{JSON.stringify(b.data, null, 2)}
</pre>
)}
</section>
))}
</div>
</PageShell>
)
}
export function DeploymentsPage() {
const [data, setData] = useState<SwarmService[] | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
const navigate = useNavigate()
useEffect(() => {
let cancelled = false
getSwarmServices()
.then((d) => {
if (cancelled) return
setError(undefined)
setData(d.services)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
return () => {
cancelled = true
}
}, [])
return (
<PageShell title="Deployments">
{error ? <ErrorText>{error}</ErrorText> : null}
{!data ? <div>Loading</div> : null}
{data ? (
<Table
columns={['Service', 'Image', 'Mode', 'Replicas']}
rows={data.map((s) => [
<Button key="svc" onClick={() => navigate(`/deployments/${encodeURIComponent(s.name)}`)}>
{s.name}
</Button>,
<Code key="img">{s.image ?? ''}</Code>,
s.mode ?? '',
s.replicas ?? '',
])}
/>
) : null}
</PageShell>
)
}
export function ObservabilityPage() {
return <PageShell title="Observability" />
}
export function AuditLogPage() {
const [data, setData] = useState<AuditEvent[] | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
useEffect(() => {
let cancelled = false
listAudit()
.then((d) => {
if (cancelled) return
setError(undefined)
setData(d.events)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
return () => {
cancelled = true
}
}, [])
return (
<PageShell title="Audit Log">
{error ? <ErrorText>{error}</ErrorText> : null}
{!data ? <div>Loading</div> : null}
{data ? (
<Table
columns={['ts', 'principal', 'action', 'tenant', 'reason', 'job']}
rows={data.map((e, idx) => [
<Code key={`ts-${idx}`}>{e.ts_ms}</Code>,
e.principal_sub,
e.action,
<Code key={`tenant-${idx}`}>{e.tenant_id ?? ''}</Code>,
e.reason,
e.job_id ? <Code key={`job-${idx}`}>{e.job_id}</Code> : '',
])}
/>
) : null}
</PageShell>
)
}
export function SettingsPage() {
const [token, setToken] = useState(() => getAccessToken() ?? '')
return (
<PageShell title="Settings">
<div style={{ display: 'flex', flexDirection: 'column', gap: 8, maxWidth: 720 }}>
<label htmlFor="token" style={{ fontSize: 12, color: '#666' }}>
Access token (Bearer)
</label>
<TextInput
id="token"
value={token}
onChange={(v) => {
setToken(v)
setAccessToken(v)
}}
placeholder="paste token here"
/>
</div>
</PageShell>
)
}
export function NotFoundPage() {
return <PageShell title="Not Found" />
}
export function JobPage() {
const params = useParams()
const jobId = params.jobId
const [job, setJob] = useState<Job | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
const canCancel = job?.status === 'pending' || job?.status === 'running'
useEffect(() => {
if (!jobId) return
let cancelled = false
const load = () => {
getJob(jobId)
.then((j) => {
if (cancelled) return
setError(undefined)
setJob(j)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
}
load()
const t = window.setInterval(load, 1000)
return () => {
cancelled = true
window.clearInterval(t)
}
}, [jobId])
const steps = useMemo(() => job?.steps ?? [], [job?.steps])
return (
<PageShell title="Job">
{jobId ? (
<MutedText>
job_id: <Code>{jobId}</Code>
</MutedText>
) : null}
{error ? <ErrorText>{error}</ErrorText> : null}
{!job ? <div>Loading</div> : null}
{job ? (
<div style={{ display: 'flex', flexDirection: 'column', gap: 12 }}>
<div>
Status: <Code>{job.status}</Code>
</div>
{job.error ? <ErrorText><Code>{job.error}</Code></ErrorText> : null}
<div style={{ display: 'flex', gap: 10 }}>
<Button
disabled={!canCancel}
onClick={async () => {
if (!jobId) return
await cancelJob(jobId)
}}
>
Cancel job
</Button>
</div>
<Table
columns={['Step', 'Status', 'Attempts', 'Error']}
rows={steps.map((s) => [
s.name,
<Code key={`${s.name}-st`}>{s.status}</Code>,
s.attempts,
s.error ? <Code key={`${s.name}-err`}>{s.error}</Code> : '',
])}
/>
</div>
) : null}
</PageShell>
)
}
export function DeploymentDetailPage() {
const params = useParams()
const name = params.serviceName
const [data, setData] = useState<SwarmTask[] | undefined>(undefined)
const [error, setError] = useState<string | undefined>(undefined)
useEffect(() => {
if (!name) return
let cancelled = false
getSwarmTasks(name)
.then((d) => {
if (cancelled) return
setError(undefined)
setData(d.tasks)
})
.catch((e: unknown) => {
if (cancelled) return
setError(e instanceof Error ? e.message : 'failed to load')
})
return () => {
cancelled = true
}
}, [name])
return (
<PageShell title="Deployment">
{name ? (
<MutedText>
service: <Code>{name}</Code>
</MutedText>
) : null}
{error ? <ErrorText>{error}</ErrorText> : null}
{!data ? <div>Loading</div> : null}
{data ? (
<Table
columns={['Task', 'Node', 'Desired', 'Current', 'Error']}
rows={data.map((t) => [
<Code key={t.id}>{t.id}</Code>,
t.node ?? '',
t.desired_state ?? '',
t.current_state ?? '',
t.error ? <Code key={`${t.id}-e`}>{t.error}</Code> : '',
])}
/>
) : null}
</PageShell>
)
}
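Every page component above repeats the same catch-handler expression, `e instanceof Error ? e.message : 'failed to load'`. A minimal sketch of a helper that could factor it out — `errorMessage` is hypothetical and not part of the source:

```typescript
// Hypothetical helper (not in the original source) mirroring the repeated
// error-normalisation pattern used by each page's .catch handler.
export function errorMessage(e: unknown, fallback = 'failed to load'): string {
  return e instanceof Error ? e.message : fallback
}

console.log(errorMessage(new Error('boom'))) // "boom"
console.log(errorMessage(42)) // "failed to load"
```

Each `.catch((e: unknown) => setError(errorMessage(e)))` would then replace the inline ternary.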


@@ -0,0 +1,127 @@
import '@testing-library/jest-dom/vitest'
import { vi } from 'vitest'
vi.stubGlobal(
'fetch',
vi.fn(async (input: RequestInfo | URL) => {
    // Request.toString() yields "[object Request]"; use .url for Request inputs.
    const url = typeof input === 'string' ? input : input instanceof Request ? input.url : input.toString()
if (url.includes('/admin/v1/fleet/snapshot')) {
return new Response(
JSON.stringify({
services: [
{
name: 'control-api',
base_url: 'http://127.0.0.1:8080',
health_ok: true,
ready_ok: true,
metrics_ok: true,
},
],
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
if (url.includes('/admin/v1/placement/')) {
const kind = url.split('/admin/v1/placement/')[1]?.split('?')[0] ?? 'aggregate'
return new Response(
JSON.stringify({
kind,
revision: 'dev',
placements: [],
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
if (url.includes('/admin/v1/tenants')) {
return new Response(
JSON.stringify({
tenants: [
{
tenant_id: '00000000-0000-0000-0000-000000000000',
aggregate_targets: [],
projection_targets: [],
runner_targets: [],
},
],
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
if (url.includes('/admin/v1/audit')) {
return new Response(
JSON.stringify({
events: [],
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
if (url.includes('/admin/v1/jobs/') && url.includes('/cancel')) {
return new Response('', { status: 200 })
}
if (url.includes('/admin/v1/jobs/tenant/')) {
return new Response(JSON.stringify({ job_id: 'job-1' }), {
status: 200,
headers: { 'content-type': 'application/json' },
})
}
if (url.includes('/admin/v1/jobs/')) {
return new Response(
JSON.stringify({
job_id: 'job-1',
status: 'succeeded',
steps: [{ name: 'echo', status: 'succeeded', attempts: 1, error: null }],
error: null,
created_at_ms: 0,
started_at_ms: 0,
finished_at_ms: 0,
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
if (url.includes('/admin/v1/swarm/services') && url.includes('/tasks')) {
return new Response(
JSON.stringify({
service: 'gateway',
tasks: [
{
id: 'task-1',
service: 'gateway',
node: 'node-1',
desired_state: 'running',
current_state: 'running',
error: null,
},
],
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
if (url.includes('/admin/v1/swarm/services')) {
return new Response(
JSON.stringify({
services: [
{
name: 'gateway',
image: 'cloudlysis/gateway:dev',
mode: 'replicated',
replicas: '1/1',
updated_at: null,
},
],
}),
{ status: 200, headers: { 'content-type': 'application/json' } },
)
}
return new Response('not found', { status: 404 })
}),
)
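The stub above depends on check order for overlapping URL prefixes: `/swarm/services/<name>/tasks` must be tested before the bare `/swarm/services` route, and `/jobs/<id>/cancel` before the generic `/jobs/<id>` match. A simplified, standalone model of that precedence (the route names here are illustrative, not part of the source):

```typescript
// Simplified model of the fetch stub's route precedence: more specific
// URL patterns are checked before the generic ones they overlap with.
const routes: Array<{ name: string; match: (u: string) => boolean }> = [
  { name: 'swarm-tasks', match: (u) => u.includes('/admin/v1/swarm/services') && u.includes('/tasks') },
  { name: 'swarm-services', match: (u) => u.includes('/admin/v1/swarm/services') },
  { name: 'job-cancel', match: (u) => u.includes('/admin/v1/jobs/') && u.includes('/cancel') },
  { name: 'job', match: (u) => u.includes('/admin/v1/jobs/') },
]

export const resolveRoute = (url: string): string =>
  routes.find((r) => r.match(url))?.name ?? 'not-found'

console.log(resolveRoute('/admin/v1/swarm/services/gateway/tasks')) // "swarm-tasks"
console.log(resolveRoute('/admin/v1/jobs/job-1/cancel')) // "job-cancel"
```

Reordering the list (generic before specific) would silently shadow the tasks and cancel handlers — the same hazard applies when adding new branches to the `fetch` stub.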


@@ -0,0 +1,28 @@
{
"compilerOptions": {
"tsBuildInfoFile": "./node_modules/.tmp/tsconfig.app.tsbuildinfo",
"target": "ES2023",
"useDefineForClassFields": true,
"lib": ["ES2023", "DOM", "DOM.Iterable"],
"module": "ESNext",
"types": ["vite/client"],
"skipLibCheck": true,
/* Bundler mode */
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"verbatimModuleSyntax": true,
"moduleDetection": "force",
"noEmit": true,
"jsx": "react-jsx",
/* Linting */
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"erasableSyntaxOnly": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedSideEffectImports": true
},
"include": ["src"]
}

control/ui/tsconfig.json

@@ -0,0 +1,7 @@
{
"files": [],
"references": [
{ "path": "./tsconfig.app.json" },
{ "path": "./tsconfig.node.json" }
]
}


@@ -0,0 +1,26 @@
{
"compilerOptions": {
"tsBuildInfoFile": "./node_modules/.tmp/tsconfig.node.tsbuildinfo",
"target": "ES2023",
"lib": ["ES2023"],
"module": "ESNext",
"types": ["node"],
"skipLibCheck": true,
/* Bundler mode */
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"verbatimModuleSyntax": true,
"moduleDetection": "force",
"noEmit": true,
/* Linting */
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"erasableSyntaxOnly": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedSideEffectImports": true
},
"include": ["vite.config.ts"]
}


@@ -0,0 +1,7 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
// https://vite.dev/config/
export default defineConfig({
plugins: [react()],
})


@@ -0,0 +1,10 @@
import { defineConfig } from 'vitest/config'
export default defineConfig({
test: {
environment: 'jsdom',
setupFiles: ['./src/test/setup.ts'],
testTimeout: 5000,
hookTimeout: 5000,
},
})