🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)
Definition:
This repository hosts the platform control plane:
- the Admin UI used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and
- the observability stack and production dashboards (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.
The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.
Context: Existing Node Repositories (../)
This PRD is derived from the currently implemented node repos in ../:
- Aggregate: expects a control node to manage tenant placement and scaling operations, including tenant migrations (aggregate/prd.md). Tenant placement primitives and KV helper exist (swarm.rs).
- Gateway: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production (gateway/prd.md).
- Projection: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) (projection/prd.md).
- Runner: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics (tenant_placement.rs) and exposes admin endpoints for drain/reload in its PRD (runner/prd.md).
The control plane also adopts the proven Admin UI UX + component library from UltraBase’s control-plane admin UI, adapting screens and information architecture to Cloudlysis needs:
- Reusable UI components live under ui/control-plane-admin/src/components/ui.
- Example pages include TenantsPage, AdminUsersPage, AdminSessionsPage, FleetPage, TopologyPage, and ObservabilityPage.
Problem Statement
Operating the platform without a unified control plane forces operators to:
- Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
- Manage tenants, placements, and deployments without a consistent audit trail
- Correlate production incidents across services with incomplete dashboards and unsafe levels of access
The platform needs a control plane that:
- Centralizes admin workflows and production operability
- Enforces least-privilege RBAC, step-up, and auditing
- Provides a consistent, safe abstraction over tenant placement, scale, and production operations
Goals
- Deliver an Admin UI with full admin management over:
- users, sessions, roles/permissions
- configuration (global + per-tenant)
- definitions (aggregates, projections, sagas, effects, manifests)
- scaling and production management (tenant placement, drains, migrations, deployments)
- Package production-grade monitoring:
- metrics via VictoriaMetrics
- logs via Loki
- dashboards and alerting via Grafana (+ vmalert where used)
- Make production operations observable, auditable, and safe by default:
- strong change logging + approvals where needed
- idempotent operations + dry runs + rollback paths
Non-Goals
- Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
- Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
- Provide an arbitrary “general API gateway” for third-party upstreams.
Primary Users
- Platform Owner / SRE: fleet operations, incident response, production change management.
- Platform Admin: tenant provisioning, RBAC, config/definition promotion.
- Security Admin: access reviews, session revocation, audit trails.
- Support / On-call: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).
Key Concepts
Control Plane Scope
- The control plane is the authoritative interface for production operations and admin management.
- The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.
Tenant-Aware Operations
- All tenant-scoped operations are keyed by `tenant_id` (consistent with `x-tenant-id` usage across nodes and the Gateway).
- Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns (swarm.rs, tenant_placement.rs).
Safe Change Management
- Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
- All high-impact operations support:
- validation and preflight checks
- dry-run planning
- idempotency keys
- explicit rollback guidance
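The intent/dry-run/idempotency guardrails above can be sketched as a small mutation envelope. All names here (`MutationRequest`, `executeMutation`) are illustrative assumptions, not an existing API:

```typescript
// Hypothetical sketch of the "explicit intent + dry run + idempotency" envelope.
type MutationRequest = {
  idempotencyKey: string; // client-supplied; same key => same result, no re-execution
  reason: string;         // recorded in the audit log
  dryRun: boolean;        // true => return the plan, apply nothing
};

type MutationResult = { applied: boolean; plan: string[] };

const seen = new Map<string, MutationResult>();

function executeMutation(req: MutationRequest, plan: string[]): MutationResult {
  if (req.reason.trim() === "") throw new Error("explicit reason required");
  if (req.dryRun) return { applied: false, plan }; // preflight only, nothing applied
  const prior = seen.get(req.idempotencyKey);
  if (prior) return prior;                         // replay-safe: no double apply
  const result = { applied: true, plan };
  seen.set(req.idempotencyKey, result);
  return result;
}
```

A retried request with the same idempotency key returns the original result instead of re-running the operation, which is what makes retries safe during partial failures.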
Control Plane Components (In This Repo)
- Admin UI (React):
- Reuse UltraBase’s control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements (components/ui).
- The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
- Control Plane API (BFF / Admin API):
- A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
- Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
- Observability Stack:
- Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after UltraBase’s baseline (observability/README.md).
Functional Requirements
1) Admin IAM (Users, Sessions, Roles)
1.1 Users
- CRUD users with lifecycle states:
- invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
- Identity attributes:
- email (primary), optional secondary identities
- display name, avatar, metadata tags
- auth methods enabled (password, OIDC providers), MFA state
- Administrative actions:
- invite/resend invite
- reset password flow initiation
- force MFA reset / revoke recovery codes
- disable login / suspend user
- impersonation (break-glass, audited, time-boxed)
- Security constraints:
- privileged actions require step-up / recent auth
- sensitive events must be audit logged (who, what, when, why, from where)
1.2 Sessions
- View active sessions and refresh token families:
- by user, by tenant, by IP / geo, by device, by time range
- Revoke capabilities:
- revoke a single session
- revoke all sessions for a user
- revoke all sessions for a tenant (incident response)
- Detection surfaces:
- unusual session fanout (many sessions per user)
- repeated failed logins / MFA failures
- suspicious IP changes
1.3 Roles & Permissions (RBAC)
- Roles are sets of permissions; assignments bind principals to roles in a scope.
- Scopes:
- global (platform-level)
- tenant-scoped
- environment-scoped (dev/staging/prod) when applicable
- Required permission domains (minimum):
- iam.users.* (create/update/suspend/delete)
- iam.sessions.* (list/revoke)
- iam.roles.* (create/update/assign)
- tenants.* (create/update/archive)
- configs.* (read/write/approve/apply)
- definitions.* (read/write/validate/promote/rollback)
- scale.* (view/apply/migrate/drain)
- ops.* (deploy/rollback/restart/drain)
- observability.* (view dashboards, manage alert rules)
- audit.* (view/export)
- Role templates:
- owner, admin, operator, support, read-only, security-admin, break-glass
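The permission-domain wildcards and scoped assignments above imply a deny-by-default check roughly like the following; the matching rules and type shapes are an illustrative sketch, not the mandated implementation:

```typescript
// Illustrative-only RBAC check: a granted "iam.users.*" matches a required
// "iam.users.create"; assignments bind a set of permissions to a scope.
type Scope = { kind: "global" } | { kind: "tenant"; tenantId: string };

type Assignment = { permissions: string[]; scope: Scope };

function permissionMatches(granted: string, required: string): boolean {
  if (granted === required) return true;
  // "iam.users.*" -> prefix "iam.users."
  return granted.endsWith(".*") && required.startsWith(granted.slice(0, -1));
}

function scopeCovers(granted: Scope, target: Scope): boolean {
  if (granted.kind === "global") return true; // global grants cover tenant scopes
  return target.kind === "tenant" && granted.tenantId === target.tenantId;
}

// Deny-by-default: allowed only if some assignment grants the permission in scope.
function isAllowed(asgs: Assignment[], required: string, target: Scope): boolean {
  return asgs.some(a =>
    scopeCovers(a.scope, target) &&
    a.permissions.some(p => permissionMatches(p, required)));
}
```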
2) Tenant Management
- Create, list, and archive tenants.
- Tenant status model:
- provisioning, active, draining, migrating, degraded, suspended, archived
- Tenant metadata:
- plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
- Tenant operational actions:
- trigger provisioning workflows (create streams/buckets, seed configs, create placement)
- rotate tenant secrets (as definitions/config allow)
- pause/resume workload (soft kill switch via config flags)
Tenant pages should mirror UltraBase’s “Tenant Overview + subpages” navigation patterns (example: TenantsPage and TenantOverviewPage).
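The tenant status model above implies a transition guard so that, for example, an archived tenant cannot silently return to service. The allowed edges below are an assumption for illustration, not a specified state machine:

```typescript
// Hypothetical transition table for the tenant status model.
type TenantStatus =
  | "provisioning" | "active" | "draining" | "migrating"
  | "degraded" | "suspended" | "archived";

const allowed: Record<TenantStatus, TenantStatus[]> = {
  provisioning: ["active"],
  active: ["draining", "migrating", "degraded", "suspended"],
  draining: ["archived", "active"],
  migrating: ["active", "degraded"],
  degraded: ["active", "suspended"],
  suspended: ["active", "archived"],
  archived: [], // terminal: restoring an archived tenant is a separate workflow
};

function transition(from: TenantStatus, to: TenantStatus): TenantStatus {
  if (!allowed[from].includes(to)) {
    throw new Error(`illegal tenant transition ${from} -> ${to}`);
  }
  return to;
}
```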
3) Configuration Management (Global + Per-Tenant)
3.1 Config Model
- Config items are versioned, typed documents with:
- scope (global / tenant / environment)
- schema version
- provenance (who/what wrote it)
- effective date and rollout strategy
- Config must support:
- validation against a schema
- diff view (previous vs next)
- staged rollout (preview → apply)
- rollback to a prior version
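The versioning, diff, and rollback requirements above can be modeled as an append-only version history, where rollback is itself a new version rather than a destructive rewind. `ConfigStore` and its methods are illustrative names:

```typescript
// Minimal sketch of a versioned config store with diff and rollback.
type ConfigDoc = Record<string, unknown>;
type ConfigVersion = { version: number; doc: ConfigDoc; author: string };

class ConfigStore {
  private history: ConfigVersion[] = [];

  apply(doc: ConfigDoc, author: string): ConfigVersion {
    const v = { version: this.history.length + 1, doc, author };
    this.history.push(v); // append-only: prior versions stay addressable
    return v;
  }

  // "Diff view": keys whose values differ between two versions.
  diff(a: number, b: number): string[] {
    const da = this.history[a - 1].doc, db = this.history[b - 1].doc;
    const keys = new Set([...Object.keys(da), ...Object.keys(db)]);
    return [...keys].filter(k => JSON.stringify(da[k]) !== JSON.stringify(db[k]));
  }

  // Rollback is modeled as a new version that restores an old document,
  // so provenance ("who rolled back, when") is preserved in the history.
  rollbackTo(version: number, author: string): ConfigVersion {
    return this.apply(this.history[version - 1].doc, author);
  }
}
```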
3.2 Node-Related Configuration
Required config surfaces (minimum):
- Gateway: routing/placement sources, auth policies, rate limits (see routing expectations in gateway/prd.md).
- Aggregate / Projection / Runner:
- shard identifiers and tenant allowlists/placement settings
- drain/reload toggles and safety thresholds
- resource limits / concurrency caps
4) Definition Management (System “Blueprints”)
Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.
Required capabilities:
- Upload/edit versioned definitions with:
- validation (schema + semantic checks)
- “impact analysis” (which tenants/services are affected)
- promotion workflow (dev → staging → prod)
- Change controls:
- approvals (role-based) for production promotion
- emergency rollback path (one-click revert to last-known-good definition bundle)
- Tenant overrides:
- allow per-tenant definition overrides only when explicitly permitted by policy
The control plane must present definitions in a way that maps to the node runtime responsibilities:
- Aggregates and deterministic decide/apply programs (aggregate/prd.md)
- Projections and deterministic project programs (projection/prd.md)
- Runner sagas and effect provider manifests (runner/prd.md)
5) Scale Management (Tenant Placement, Shards, Fleet)
5.1 Placement Model
- Placement is modeled as:
- a set of nodes/shards and their attributes (labels, capacity, region)
- tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
- Control plane supports both:
- static placement (development)
- dynamic placement (production) backed by NATS KV (consistent with existing client patterns in swarm.rs)
5.2 Tenant Migration
- Provide guided migration planning and execution:
- show current assignment, target assignment, and a sequenced action plan
- execute “graceful drain → update placement → reload” style plans (see plan_graceful_tenant_migration)
- Migration safety:
- require explicit confirmation and reason
- block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
- time-box and alert if drains do not converge
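The guided “drain → update placement → reload” flow above can be sketched as a preflight plan builder, loosely modeled on the plan_graceful_tenant_migration idea referenced earlier. The field names and safety thresholds here are illustrative assumptions:

```typescript
// Sketch of a graceful tenant migration plan with preflight safety checks.
type MigrationInput = {
  tenantId: string;
  fromShard: string;
  toShard: string;
  inflightWork: number;   // current inflight ops for the tenant
  consumerLagSec: number; // projection consumer lag
};

const MAX_INFLIGHT = 100; // illustrative thresholds
const MAX_LAG_SEC = 30;

function planMigration(m: MigrationInput): string[] {
  // Block unsafe drains up front, per the migration-safety requirements.
  if (m.inflightWork > MAX_INFLIGHT) throw new Error("unsafe: inflight work too high");
  if (m.consumerLagSec > MAX_LAG_SEC) throw new Error("unsafe: consumer lag too high");
  return [
    `drain tenant ${m.tenantId} on shard ${m.fromShard} (stop new work, finish inflight)`,
    `wait for drain convergence (time-boxed)`,
    `update placement: ${m.tenantId} -> ${m.toShard}`,
    `reload placement watchers (Gateway, Aggregate, Projection, Runner)`,
    `verify tenant traffic on ${m.toShard}`,
  ];
}
```

The plan is returned as data so the UI can show the sequenced actions before execution (the dry-run requirement) and record them in the audit trail.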
5.3 Fleet View
- Fleet inventory:
- nodes (labels, region, capacity, version)
- services (replicas, image version, health)
- per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
- Operator actions:
- scale replicas, restart services, cordon/drain nodes (when supported by orchestrator)
UX should align with the UltraBase “Fleet” and “Topology” navigation patterns (FleetPage, TopologyPage).
6) Production Operations (Deployments, Maintenance, Safety)
6.1 Deployments
- Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
- environment-specific rollout policies
- canary/rolling deploy support (when orchestrator supports it)
- automatic health-check gates and rollback triggers
- Track releases:
- “what is running where” (service version matrix)
- change log links and approvals
6.2 Maintenance Operations
- Drain operations:
- tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in TenantGate)
- node drain (aggregate tenant ranges, projection consumers, runner workers)
- Replay / rebuild operations:
- projection rebuild triggers (dangerous, must be guarded and audited)
- workflow replay controls (reset checkpoints only with explicit intent)
6.3 Incident Response Toolkit
- “Safe switches”:
- per-tenant kill switch (disable commands/effects via config)
- global degrade modes (rate limit reductions, disable expensive features)
- Run actions:
- revoke sessions at scale
- freeze deployments
- trigger drain/migrate with guided plan
7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards
7.1 Stack Requirements
Adopt a production-ready stack consistent with UltraBase’s operational baseline:
- VictoriaMetrics for metrics storage and Prometheus-compatible query
- vmagent for scraping and remote_write
- Grafana for dashboards and alert routing
- Loki (+ optional Promtail) for logs
- Optional vmalert for rule evaluation against VictoriaMetrics
UltraBase’s observability design is a direct reference implementation to mirror and adapt:
- Stack overview and conventions: observability/README.md
- Provisioned dashboards and datasources: grafana provisioning
7.2 Metrics Conventions
- Every service exports `/metrics` in Prometheus format.
- Required labels:
  - `service` (stable, low cardinality)
  - `env` (dev/staging/prod)
  - `tenant_id` only where safe and bounded; avoid `tenant_id` on high-frequency per-request series unless cardinality is controlled.
- HTTP metrics must avoid unbounded `path` cardinality; prefer route templates (pattern-based paths).
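One way to bound `path` cardinality is to normalize concrete request paths into route templates before using them as label values. This helper and its ID pattern are an illustrative sketch, not an existing utility:

```typescript
// Illustrative route-template normalization: numeric and UUID path segments
// are folded into a ":id" placeholder so the label set stays bounded.
const ID_LIKE =
  /^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$/;

function routeTemplate(path: string): string {
  return path
    .split("/")
    .map(seg => (ID_LIKE.test(seg) ? ":id" : seg))
    .join("/");
}
```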
Tenant-aware metrics guidelines:
- Prefer tenant-only aggregates for “who is hurting us?” views:
  - `..._requests_total{tenant_id,service,status_class}` (no `path`)
  - `..._request_duration_seconds{tenant_id,service}` (no `path`, limited bucket count)
- Prefer route-only aggregates for “what endpoint is hurting us?” views:
  - `..._requests_total{service,path,status}` (no `tenant_id`)
- Where per-tenant and per-route both matter, implement a top-k sampling policy:
  - emit `(tenant_id, path)` series only for the top N tenants, or only for a fixed allowlist of routes.
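The top-k sampling policy can be sketched as a label-selection function: only the current top-N tenants keep their real `tenant_id` label, and the long tail folds into a single `_other` bucket. The names here are illustrative:

```typescript
// Sketch of a top-k tenant label policy for (tenant_id, path) series.
function tenantLabelPolicy(
  counts: Map<string, number>, // tenant_id -> recent request count
  topN: number,
): (tenantId: string) => string {
  const top = new Set(
    [...counts.entries()]
      .sort((a, b) => b[1] - a[1]) // highest-volume tenants first
      .slice(0, topN)
      .map(([t]) => t),
  );
  // Tail tenants share one "_other" series, keeping cardinality bounded.
  return (tenantId) => (top.has(tenantId) ? tenantId : "_other");
}
```

In practice the top-N set would be refreshed periodically (e.g. from a rolling window), so the bound on series count holds even as tenant traffic shifts.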
7.3 Required Dashboards (Production)
Minimum set of dashboards (provisioned on startup):
- Platform — Operations overview
  - `up` for core services and the observability stack
  - RPS, 4xx/5xx ratio, p95/p99 latency per service
  - saturation indicators (CPU/mem, inflight, queue depth)
- Platform — HTTP detail
- per-service request breakdown by route template, method, status
- top failing paths and latency outliers
- Platform — Logs
  - Loki stream filtering by `service`, `tenant_id` (where present), and correlation identifiers
- Platform — Event bus / JetStream
- consumer lag, redeliveries, ack latency, stream storage pressure
- Platform — Workers (Runner)
- outbox depth, effect latency, poison message counts, schedules backlog
- Platform — Storage (libmdbx)
- DB size growth, write stalls, fsync latency (where exported), disk usage
- Platform — Cluster / Orchestrator
- node health, container restarts, placement distribution by tenant range
Dashboards should be modeled after UltraBase’s default set (for structure, not content), e.g. ultrabase-operations.json and ultrabase-http-detail.json.
Additional production-operability dashboards (chosen and adapted):
- Platform — Noisy Neighbor & Tenant Health
  - Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
  - Panels (minimum):
    - Top tenants by Gateway RPS (topk of tenant-only request counters).
    - Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
    - Tenant error ratio (5xx and 429) per tenant.
    - Aggregate in-flight commands by tenant (already exported: `aggregate_in_flight_commands{tenant_id}`).
    - Projection processing error rate by tenant (from `projection_processing_errors_total{tenant_id,view_type}` aggregated per tenant).
    - Loki logs panel with a `tenant_id` variable selector; selecting a tenant syncs RPS/latency/errors + logs.
  - Required instrumentation:
    - Gateway must expose tenant-level HTTP counters/histograms (tenant + status class + service, without `path`) in addition to existing route-level metrics.
- Platform — API Regression & Deployment
  - Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
  - Panels (minimum):
    - Error rate comparison “old vs new” by `service` and `version` (or `image_tag`) labels.
    - Latency comparison “old vs new” (p95/p99) per service.
    - Restart / flapping rate per service (container restarts, crash loops).
    - Dependency latency correlation: Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
    - Loki “new errors” panel: errors seen in the last 10m that were not present in the prior 60m window, grouped by `service`.
    - Deployment annotations: vertical markers when Swarm service updates started/finished (via annotations or a deploy-event metric).
  - Required instrumentation:
    - Every service exports a `*_build_info{service,version,git_sha}` gauge (value=1) or equivalent, and scrape relabeling adds `image_tag` where possible.
    - Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes them to Grafana annotations).
- Platform — Storage & Event Bus Bottlenecks
  - Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
  - Panels (minimum):
    - NATS/JetStream health: stream storage pressure, publish/ack latency, consumer lag, redeliveries.
    - Projection lag and throughput: events processed rate, processing duration, error rate.
    - Aggregate write-path pressure: command duration, version conflicts, in-flight commands, tenant errors.
    - Runner pressure: outbox dispatch failure rate, effect timeout rate, deadletter writes.
    - Disk saturation on nodes hosting libmdbx: disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
    - Optional Postgres/Autobase panels only when a managed DB backs any control-plane metadata: pool saturation, replica lag, slow queries, long transactions.
  - Required instrumentation:
    - Ensure JetStream metrics are scraped (NATS server `/varz` exporter or native Prometheus endpoint, depending on deployment).
    - Ensure node-level disk/IO metrics are scraped (node exporter / cAdvisor / equivalent).
- Platform — Infrastructure Exhaustion
  - Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
  - Panels (minimum):
    - CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
    - OOM kill tracker across the cluster.
    - Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
    - vmagent health: scrape error rate, remote_write errors, queue backlog.
    - Loki ingestion health: dropped log lines (Promtail) and ingestion errors (Loki).
    - Swarm task hygiene: desired_state vs current_state mismatches, orphaned tasks, restart loops.
  - Required instrumentation:
    - node exporter / cAdvisor (or equivalent) must be part of the production scrape plan.
    - Promtail (or an alternative) must expose drop/error metrics when log shipping is enabled.
7.4 Alerting Requirements
Minimum alert classes:
- Availability:
  - service down (`up == 0`)
  - scrape failures, vmagent remote_write errors
- Reliability:
- sustained elevated 5xx ratio
- sustained elevated p95 latency per service
- Backlogs:
- JetStream consumer lag above threshold
- Runner outbox depth above threshold
- Data safety:
- disk usage near full (nodes hosting libmdbx)
- abnormal restart loops
- Security:
- login anomaly detection signals (where instrumented)
- suspicious spike in session revocations / failed MFA
Alert rules can follow UltraBase’s approach of version-controlled rules in YAML (reference: alerts/).
7.5 Control Plane → Observability Linking
The Admin UI must embed or deep-link into observability tools:
- per-tenant and per-service quick links to Grafana dashboards and Loki queries
- incident triage shortcuts (operations overview → HTTP detail → logs)
This mirrors UltraBase’s “observability links JSON” concept (observability/README.md), but adapted to Cloudlysis services and dashboards.
8) Audit, Compliance, and Change History
- Audit log is an append-only stream of security and operations events:
- authentication and session events
- RBAC changes and permission grants
- config/definition changes and promotions
- scaling, drain, and migration operations
- deployments and rollbacks
- Audit log must support:
- search and export (bounded and access controlled)
- correlation to production incidents (request ids, trace ids)
- retention policy controls
9) Control Plane API Surface (Admin API)
The control plane requires a stable API surface for the Admin UI and automation.
Minimum API capabilities:
- Idempotent jobs for multi-step operations:
  - every mutating operation returns a `job_id`, supports polling and cancellation, and records a full execution trace in the audit log.
- Preflight endpoints:
- validate an intended change and return a plan (and “would-change” diff) without applying it.
- RBAC-first access model:
- all endpoints enforce permission checks at the API boundary (UI is not trusted).
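The job model for mutating operations can be sketched as a small state machine that the UI polls, with every transition appended to an execution trace for the audit log. The state names and helpers below are illustrative assumptions:

```typescript
// Sketch of the polled-job lifecycle behind mutating Admin API calls.
type JobState = "pending" | "running" | "succeeded" | "failed" | "cancelled";

type Job = { jobId: string; state: JobState; trace: string[] };

const TERMINAL: JobState[] = ["succeeded", "failed", "cancelled"];

function advance(job: Job, next: JobState, note: string): Job {
  if (TERMINAL.includes(job.state)) {
    throw new Error(`job ${job.jobId} is already terminal (${job.state})`);
  }
  job.trace.push(`${job.state} -> ${next}: ${note}`); // audit execution trace
  job.state = next;
  return job;
}

function cancel(job: Job): Job {
  return advance(job, "cancelled", "operator cancel");
}
```

Cancellation reuses the same guarded transition, so a job that has already finished cannot be "cancelled" into an inconsistent state.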
Minimum endpoint groups:
- `/admin/v1/iam/*` (users, roles, assignments, sessions)
- `/admin/v1/tenants/*` (tenant lifecycle, status, metadata)
- `/admin/v1/config/*` (versioned config, diff, apply, rollback)
- `/admin/v1/definitions/*` (bundles, validate, promote, rollback)
- `/admin/v1/scale/*` (placement, migrations, drain status)
- `/admin/v1/ops/*` (deployments, rollbacks, service actions)
- `/admin/v1/observability/*` (links, saved queries, dashboard registry)
- `/admin/v1/audit/*` (search, export)
Authentication/authorization integration:
- Prefer using the Gateway as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.
10) Secrets and Credentials Management
The control plane must treat secrets as first-class operational data with strict handling.
Requirements:
- Secret values must never be logged and must be redacted in UI/API responses.
- Secrets must support:
- creation and rotation workflows
- scoped access (global/tenant/environment)
- staged rollout (write new → verify → promote → retire old)
- Rendering rules:
- after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
- Integrations:
- support referencing secrets from config/definitions without embedding values (secret refs).
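The secret-ref integration can be sketched as two functions: one resolves refs to values only at apply time, and one produces the redacted form that the UI/API may render. The shapes (`SecretRef`, `resolveSecrets`, `redact`) are illustrative:

```typescript
// Sketch of secret refs: configs carry references, never values.
type SecretRef = { secretRef: string };
type ConfigValue = string | number | SecretRef;

const isRef = (v: ConfigValue): v is SecretRef =>
  typeof v === "object" && v !== null && "secretRef" in v;

// Resolved only at apply time; resolved output is never persisted or logged.
function resolveSecrets(
  cfg: Record<string, ConfigValue>,
  vault: Map<string, string>,
): Record<string, string | number> {
  const out: Record<string, string | number> = {};
  for (const [k, v] of Object.entries(cfg)) {
    if (isRef(v)) {
      const secret = vault.get(v.secretRef);
      if (secret === undefined) throw new Error(`unknown secret ${v.secretRef}`);
      out[k] = secret;
    } else out[k] = v;
  }
  return out;
}

// What the Admin UI/API renders instead of the plaintext (write-only default).
function redact(cfg: Record<string, ConfigValue>): Record<string, ConfigValue | string> {
  return Object.fromEntries(
    Object.entries(cfg).map(([k, v]) => [k, isRef(v) ? `<secret:${v.secretRef}>` : v]),
  );
}
```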
11) Backups, Restore, and Disaster Recovery (Production Operability)
The control plane must provide explicit visibility and guardrails for data safety operations.
Minimum requirements:
- Backup status:
- show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
- Restore readiness:
- preflight checks that validate a restore plan (target environment, versions, dependencies).
- Operational playbooks:
- link to the exact restore procedure and post-restore verification checklist.
- Key rotation:
- explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.
This should align with the platform’s existing operational patterns (e.g., the explicit “restore / post-restore checks” concept used in UltraBase observability docs).
Admin UI Requirements (Information Architecture + UX)
Navigation (Minimum)
Left navigation sections:
- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings
Page Patterns (Reuse UltraBase UI)
Adopt the UltraBase component system and page layout patterns:
- Layout, styling tokens, UI primitives: components/ui
- Table + search + action dropdown pattern: TenantsPage
Required page types:
- List pages:
- searchable table, bulk actions, row actions menu, status pills, empty states
- Detail pages:
- header with primary actions (drain, migrate, rollback)
- sub-nav tabs for domain-specific views
- Mutation flows:
- modal confirmation + explicit reason entry for high-impact changes
- toast notifications and “busy” state handling consistent with UltraBase patterns
Tenant Detail Subpages (Minimum)
- Overview (status, assignments, SLO highlights)
- Placement (per service: Aggregate/Projection/Runner)
- Health (node readiness and dependency checks)
- Config (effective config + diffs)
- Definitions (applied definition bundle + version)
- Activity (audit trail filtered to tenant)
- Observability (embedded links / panels)
Non-Functional Requirements
- Security:
- strict RBAC everywhere; deny-by-default
- audit every privileged operation
- step-up for sensitive actions
- CSRF protection for browser sessions
- safe secret handling (no secret values rendered after creation unless explicitly permitted)
- allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
- Reliability:
- control plane operations are idempotent and resilient to partial failures
- operations have clear “current state” and do not rely on UI assumptions
- Performance:
- list pages paginate and filter server-side for large fleets
- dashboards load with bounded query costs and controlled label cardinality
- Operability:
- control plane itself must be observable (metrics/logs, dashboards, alerts)
- every operation must surface preflight checks and post-conditions
Open Questions / Design Constraints (To Resolve During Implementation)
- Where does the source of truth live for:
- users/sessions/roles (Gateway vs control-plane backing store)?
- configs/definitions (NATS KV vs database vs GitOps)?
- How should production promotions be modeled:
- environment branches, approval workflow, and rollback semantics?
- What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
- Where should the job/execution state for long-running operations live:
- embedded in the control plane API process, durable store, or NATS workflows?