cloudlysis/control/prd.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00

### 🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)
**Definition:**
This repository hosts the **platform control plane**:
1) the **Admin UI** used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and
2) the **observability stack** and **production dashboards** (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.
The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.
---
## **Context: Existing Node Repositories (../)**
This PRD is derived from the currently implemented node repos in `../`:
- **Aggregate**: expects a control node to manage tenant placement and scaling operations, including tenant migrations ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L82-L151)). Tenant placement primitives and KV helper exist ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L5-L227)).
- **Gateway**: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production ([gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L13-L175)).
- **Projection**: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L7-L96)).
- **Runner**: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics ([tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L11-L104)) and exposes admin endpoints for drain/reload in its PRD ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L199-L210)).
The control plane also adopts the proven **Admin UI UX + component library** from UltraBase's control-plane admin UI, adapting screens and information architecture to Cloudlysis needs:
- Reusable UI components live under [ui/control-plane-admin/src/components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui).
- Example pages include [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx), [AdminUsersPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminUsersPage.tsx), [AdminSessionsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminSessionsPage.tsx), [FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx), and [ObservabilityPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/ObservabilityPage.tsx).
---
## **Problem Statement**
Operating the platform without a unified control plane forces operators to:
- Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
- Manage tenants, placements, and deployments without a consistent audit trail
- Correlate production incidents across services with incomplete dashboards and unsafe levels of access
The platform needs a control plane that:
- Centralizes **admin workflows** and **production operability**
- Enforces **least-privilege RBAC**, **step-up**, and **auditing**
- Provides a consistent, safe abstraction over **tenant placement**, **scale**, and **production operations**
---
## **Goals**
- Deliver an Admin UI with full admin management over:
  - users, sessions, roles/permissions
  - configuration (global + per-tenant)
  - definitions (aggregates, projections, sagas, effects, manifests)
  - scaling and production management (tenant placement, drains, migrations, deployments)
- Package production-grade monitoring:
  - metrics via VictoriaMetrics
  - logs via Loki
  - dashboards and alerting via Grafana (+ vmalert where used)
- Make production operations observable, auditable, and safe by default:
  - strong change logging + approvals where needed
  - idempotent operations + dry runs + rollback paths
---
## **Non-Goals**
- Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
- Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
- Provide an arbitrary “general API gateway” for third-party upstreams.
---
## **Primary Users**
- **Platform Owner / SRE**: fleet operations, incident response, production change management.
- **Platform Admin**: tenant provisioning, RBAC, config/definition promotion.
- **Security Admin**: access reviews, session revocation, audit trails.
- **Support / On-call**: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).
---
## **Key Concepts**
### Control Plane Scope
- The control plane is the authoritative interface for production operations and admin management.
- The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.
### Tenant-Aware Operations
- All tenant-scoped operations are keyed by `tenant_id` (consistent with `x-tenant-id` usage across nodes and Gateway).
- Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L188-L226), [tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L41-L104)).
### Safe Change Management
- Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
- All high-impact operations support:
  - validation and preflight checks
  - dry-run planning
  - idempotency keys
  - explicit rollback guidance
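
As a rough illustration of how dry-run planning and idempotency keys can work together, the sketch below uses a hypothetical in-memory `JobRegistry`; the class, its `submit` method, and the returned fields are assumptions made for this example and are not part of any existing Cloudlysis API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobRegistry:
    """Hypothetical in-memory registry mapping idempotency keys to job ids."""
    _jobs: dict = field(default_factory=dict)

    def submit(self, idempotency_key: str, operation: str, dry_run: bool = False) -> dict:
        # A dry run only describes the intended change; nothing is recorded.
        if dry_run:
            return {"job_id": None, "operation": operation, "applied": False}
        # A repeated key returns the original job instead of starting a new one.
        if idempotency_key not in self._jobs:
            self._jobs[idempotency_key] = str(uuid.uuid4())
        return {"job_id": self._jobs[idempotency_key], "operation": operation, "applied": True}
```

A retried request that carries the same key gets the same `job_id` back, so a client that times out can safely resend. A real implementation would likely also reject a reused key whose payload differs from the original request.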
### Control Plane Components (In This Repo)
- **Admin UI (React)**:
  - Reuse UltraBase's control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements ([components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)).
  - The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
- **Control Plane API (BFF / Admin API)**:
  - A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
  - Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
- **Observability Stack**:
  - Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after UltraBase's baseline ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)).
---
## **Functional Requirements**
### 1) Admin IAM (Users, Sessions, Roles)
#### 1.1 Users
- CRUD users with lifecycle states:
  - invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
- Identity attributes:
  - email (primary), optional secondary identities
  - display name, avatar, metadata tags
  - auth methods enabled (password, OIDC providers), MFA state
- Administrative actions:
  - invite/resend invite
  - reset password flow initiation
  - force MFA reset / revoke recovery codes
  - disable login / suspend user
  - impersonation (break-glass, audited, time-boxed)
- Security constraints:
  - privileged actions require step-up / recent auth
  - sensitive events must be audit logged (who, what, when, why, from where)
#### 1.2 Sessions
- View active sessions and refresh token families:
  - by user, by tenant, by IP / geo, by device, by time range
- Revoke capabilities:
  - revoke a single session
  - revoke all sessions for a user
  - revoke all sessions for a tenant (incident response)
- Detection surfaces:
  - unusual session fanout (many sessions per user)
  - repeated failed logins / MFA failures
  - suspicious IP changes
#### 1.3 Roles & Permissions (RBAC)
- Roles are sets of permissions; assignments bind principals to roles in a scope.
- Scopes:
  - global (platform-level)
  - tenant-scoped
  - environment-scoped (dev/staging/prod) when applicable
- Required permission domains (minimum):
  - `iam.users.*` (create/update/suspend/delete)
  - `iam.sessions.*` (list/revoke)
  - `iam.roles.*` (create/update/assign)
  - `tenants.*` (create/update/archive)
  - `configs.*` (read/write/approve/apply)
  - `definitions.*` (read/write/validate/promote/rollback)
  - `scale.*` (view/apply/migrate/drain)
  - `ops.*` (deploy/rollback/restart/drain)
  - `observability.*` (view dashboards, manage alert rules)
  - `audit.*` (view/export)
- Role templates:
  - owner, admin, operator, support, read-only, security-admin, break-glass
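
A minimal sketch of a deny-by-default permission check over wildcard permission domains and scopes might look like the following. The `Assignment` shape, the `"tenant:<tenant_id>"` scope encoding, and the rule that a global assignment covers every scope are assumptions for illustration only; the PRD does not fix any of them.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Assignment:
    principal: str
    permissions: tuple   # patterns such as "iam.users.*"
    scope: str           # "global" or "tenant:<tenant_id>" (hypothetical encoding)


def is_allowed(assignments, principal, permission, scope):
    """Deny by default; allow only if some assignment matches permission and scope."""
    for a in assignments:
        if a.principal != principal:
            continue
        # Assumption: a global assignment covers every scope; otherwise scopes must match exactly.
        if a.scope != "global" and a.scope != scope:
            continue
        if any(fnmatchcase(permission, pattern) for pattern in a.permissions):
            return True
    return False
```

For example, an assignment of `iam.sessions.*` in scope `tenant:t1` would permit `iam.sessions.revoke` for tenant `t1` but not for `t2`.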
---
### 2) Tenant Management
- Create, list, and archive tenants.
- Tenant status model:
  - provisioning, active, draining, migrating, degraded, suspended, archived
- Tenant metadata:
  - plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
- Tenant operational actions:
  - trigger provisioning workflows (create streams/buckets, seed configs, create placement)
  - rotate tenant secrets (as definitions/config allow)
  - pause/resume workload (soft kill switch via config flags)
Tenant pages should mirror UltraBase's “Tenant Overview + subpages” navigation patterns (example: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx) and [TenantOverviewPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantOverviewPage.tsx)).
---
### 3) Configuration Management (Global + Per-Tenant)
#### 3.1 Config Model
- Config items are versioned, typed documents with:
  - scope (global / tenant / environment)
  - schema version
  - provenance (who/what wrote it)
  - effective date and rollout strategy
- Config must support:
  - validation against a schema
  - diff view (previous vs next)
  - staged rollout (preview → apply)
  - rollback to a prior version
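
The preview, apply, and rollback steps above can be sketched with a small append-only version history. `VersionedConfig` and its method names are hypothetical; schema validation, scoping, and provenance are left out to keep the example short.

```python
import copy


class VersionedConfig:
    """Hypothetical append-only version history for one config item."""

    def __init__(self, initial: dict):
        self._versions = [copy.deepcopy(initial)]

    @property
    def current(self) -> dict:
        return copy.deepcopy(self._versions[-1])

    def diff(self, proposed: dict) -> dict:
        """Preview step: map each changed key to (previous, next)."""
        prev = self._versions[-1]
        keys = sorted(set(prev) | set(proposed))
        return {k: (prev.get(k), proposed.get(k)) for k in keys if prev.get(k) != proposed.get(k)}

    def apply(self, proposed: dict) -> int:
        """Apply step: append a new version and return its 1-based version number."""
        self._versions.append(copy.deepcopy(proposed))
        return len(self._versions)

    def rollback(self, version: int) -> int:
        """Rollback re-applies an earlier version as a new version, keeping history."""
        return self.apply(self._versions[version - 1])
```

Modeling rollback as a new version (rather than deleting later versions) is one possible design choice; it keeps the history intact for auditing.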
#### 3.2 Node-Related Configuration
Required config surfaces (minimum):
- **Gateway**: routing/placement sources, auth policies, rate limits (see routing expectations in [gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L154-L175)).
- **Aggregate / Projection / Runner**:
  - shard identifiers and tenant allowlists/placement settings
  - drain/reload toggles and safety thresholds
  - resource limits / concurrency caps
---
### 4) Definition Management (System “Blueprints”)
Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.
Required capabilities:
- Upload/edit versioned definitions with:
  - validation (schema + semantic checks)
  - “impact analysis” (which tenants/services are affected)
  - promotion workflow (dev → staging → prod)
- Change controls:
  - approvals (role-based) for production promotion
  - emergency rollback path (one-click revert to last-known-good definition bundle)
- Tenant overrides:
  - allow per-tenant definition overrides only when explicitly permitted by policy
The control plane must present definitions in a way that maps to the node runtime responsibilities:
- Aggregates and deterministic decide/apply programs ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L155-L160))
- Projections and deterministic project programs ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L36-L55))
- Runner sagas and effect provider manifests ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L41-L57))
---
### 5) Scale Management (Tenant Placement, Shards, Fleet)
#### 5.1 Placement Model
- Placement is modeled as:
  - a set of nodes/shards and their attributes (labels, capacity, region)
  - tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
- Control plane supports both:
  - static placement (development)
  - dynamic placement (production) backed by NATS KV (consistent with existing client patterns in [swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L79-L227))
#### 5.2 Tenant Migration
- Provide guided migration planning and execution:
  - show current assignment, target assignment, and a sequenced action plan
  - execute “graceful drain → update placement → reload” style plans (see [plan_graceful_tenant_migration](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L41-L65))
- Migration safety:
  - require explicit confirmation and reason
  - block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
  - time-box and alert if drains do not converge
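
To make the planning and safety rules concrete, here is a small sketch of a preflight planner that either returns the sequenced steps or a list of blockers. It is not the `plan_graceful_tenant_migration` function referenced above, whose signature is not shown in this document; the `TenantHealth` fields, the function name, and the threshold defaults are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TenantHealth:
    inflight: int
    consumer_lag: int
    storage_healthy: bool


def plan_migration(tenant_id, source, target, reason, health,
                   max_inflight=100, max_lag=1000):
    """Return (steps, blockers). Thresholds are illustrative defaults only."""
    blockers = []
    if not reason.strip():
        blockers.append("reason is required")
    if health.inflight > max_inflight:
        blockers.append("inflight work too high")
    if not health.storage_healthy:
        blockers.append("storage unhealthy")
    if health.consumer_lag > max_lag:
        blockers.append("consumer lag too high")
    if blockers:
        # Unsafe migrations produce no steps, so nothing can be executed by accident.
        return [], blockers
    steps = [
        f"drain {tenant_id} on {source}",
        f"update placement {tenant_id}: {source} -> {target}",
        f"reload {source} and {target}",
    ]
    return steps, []
```

Returning every blocker at once, rather than failing on the first, lets the UI show the operator the full list of conditions to resolve before retrying.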
#### 5.3 Fleet View
- Fleet inventory:
  - nodes (labels, region, capacity, version)
  - services (replicas, image version, health)
  - per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
- Operator actions:
  - scale replicas, restart services, cordon/drain nodes (when supported by orchestrator)
UX should align with the UltraBase “Fleet” and “Topology” navigation patterns ([FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx)).
---
### 6) Production Operations (Deployments, Maintenance, Safety)
#### 6.1 Deployments
- Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
  - environment-specific rollout policies
  - canary/rolling deploy support (when orchestrator supports it)
  - automatic health-check gates and rollback triggers
- Track releases:
  - “what is running where” (service version matrix)
  - change log links and approvals
#### 6.2 Maintenance Operations
- Drain operations:
  - tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in [TenantGate](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L106-L200))
  - node drain (aggregate tenant ranges, projection consumers, runner workers)
- Replay / rebuild operations:
  - projection rebuild triggers (dangerous, must be guarded and audited)
  - workflow replay controls (reset checkpoints only with explicit intent)
#### 6.3 Incident Response Toolkit
- “Safe switches”:
  - per-tenant kill switch (disable commands/effects via config)
  - global degrade modes (rate limit reductions, disable expensive features)
- Run actions:
  - revoke sessions at scale
  - freeze deployments
  - trigger drain/migrate with guided plan
---
### 7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards
#### 7.1 Stack Requirements
Adopt a production-ready stack consistent with UltraBase's operational baseline:
- **VictoriaMetrics** for metrics storage and Prometheus-compatible query
- **vmagent** for scraping and remote_write
- **Grafana** for dashboards and alert routing
- **Loki** (+ optional **Promtail**) for logs
- Optional **vmalert** for rule evaluation against VictoriaMetrics
UltraBase's observability design is a direct reference implementation to mirror and adapt:
- Stack overview and conventions: [observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)
- Provisioned dashboards and datasources: [grafana provisioning](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning)
#### 7.2 Metrics Conventions
- Every service exports `/metrics` in Prometheus format.
- Required labels:
  - `service` (stable, low cardinality)
  - `env` (dev/staging/prod)
  - `tenant_id` only where safe and bounded; avoid `tenant_id` on high-frequency per-request series unless cardinality is controlled.
- HTTP metrics must avoid unbounded `path` cardinality; prefer route templates (pattern-based paths).
Tenant-aware metrics guidelines:
- Prefer **tenant-only aggregates** for “who is hurting us?” views:
  - `..._requests_total{tenant_id,service,status_class}` (no `path`)
  - `..._request_duration_seconds{tenant_id,service}` (no `path`, limited bucket count)
- Prefer **route-only aggregates** for “what endpoint is hurting us?” views:
  - `..._requests_total{service,path,status}` (no `tenant_id`)
- Where per-tenant and per-route both matter, implement a **top-k sampling** policy:
  - emit `(tenant_id,path)` series only for top N tenants, or only for a fixed allowlist of routes.
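
One way to picture the top-k sampling policy is a function that decides which label sets a single request should increment. The sketch below assumes a hypothetical `request_counts` counter of requests per tenant over some recent window; how that window is maintained, and the `top_n` default, are not specified by this document.

```python
from collections import Counter


def label_sets(tenant_id, path, request_counts, top_n=2):
    """Return the label sets to increment for one request.

    request_counts is a Counter of requests per tenant over a recent window.
    """
    series = [
        {"tenant_id": tenant_id},   # tenant-only aggregate, no path
        {"path": path},             # route-only aggregate, no tenant_id
    ]
    top_tenants = {t for t, _ in request_counts.most_common(top_n)}
    # The combined (tenant_id, path) series exists only for the current top-N tenants.
    if tenant_id in top_tenants:
        series.append({"tenant_id": tenant_id, "path": path})
    return series
```

With this policy the number of combined series is bounded by `top_n` times the number of route templates, regardless of how many tenants exist.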
#### 7.3 Required Dashboards (Production)
Minimum set of dashboards (provisioned on startup):
- **Platform — Operations overview**
  - `up` for core services and observability stack
  - RPS, 4xx/5xx ratio, p95/p99 latency per service
  - saturation indicators (CPU/mem, inflight, queue depth)
- **Platform — HTTP detail**
  - per-service request breakdown by route template, method, status
  - top failing paths and latency outliers
- **Platform — Logs**
  - Loki stream filtering by `service`, `tenant_id` (where present), and correlation identifiers
- **Platform — Event bus / JetStream**
  - consumer lag, redeliveries, ack latency, stream storage pressure
- **Platform — Workers (Runner)**
  - outbox depth, effect latency, poison message counts, schedules backlog
- **Platform — Storage (libmdbx)**
  - DB size growth, write stalls, fsync latency (where exported), disk usage
- **Platform — Cluster / Orchestrator**
  - node health, container restarts, placement distribution by tenant range
Dashboards should be modeled after UltraBase's default set (for structure, not content), e.g. [ultrabase-operations.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-operations.json) and [ultrabase-http-detail.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-http-detail.json).
Additional production-operability dashboards (chosen and adapted):
- **Platform — Noisy Neighbor & Tenant Health**
  - Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
  - Panels (minimum):
    - Top tenants by Gateway RPS (topk of tenant-only request counters).
    - Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
    - Tenant error ratio (5xx and 429) per tenant.
    - Aggregate in-flight commands by tenant (already exported: `aggregate_in_flight_commands{tenant_id}`).
    - Projection processing error rate by tenant (from `projection_processing_errors_total{tenant_id,view_type}` aggregated per tenant).
    - Loki logs panel with a `tenant_id` variable selector; selecting a tenant syncs RPS/latency/errors + logs.
  - Required instrumentation:
    - Gateway must expose **tenant-level** HTTP counters/histograms (tenant + status class + service, without `path`) in addition to existing route-level metrics.
- **Platform — API Regression & Deployment**
  - Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
  - Panels (minimum):
    - Error rate comparison “old vs new” by `service` and `version` (or `image_tag`) labels.
    - Latency comparison “old vs new” (p95/p99) per service.
    - Restart / flapping rate per service (container restarts, crash loops).
    - Dependency latency correlation:
      - Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
    - Loki “new errors” panel:
      - errors seen in the last 10m that were not present in the prior 60m window, grouped by `service`.
    - Deployment annotations:
      - vertical markers when Swarm service updates started/finished (via annotations or a deploy event metric).
  - Required instrumentation:
    - Every service exports a `*_build_info{service,version,git_sha}` gauge (value=1) or equivalent, and scrape relabeling adds `image_tag` where possible.
    - Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes to Grafana annotations).
- **Platform — Storage & Event Bus Bottlenecks**
  - Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
  - Panels (minimum):
    - NATS/JetStream health:
      - stream storage pressure, publish/ack latency, consumer lag, redeliveries.
    - Projection lag and throughput:
      - events processed rate, processing duration, error rate.
    - Aggregate write-path pressure:
      - command duration, version conflicts, in-flight commands, tenant errors.
    - Runner pressure:
      - outbox dispatch failure rate, effect timeout rate, deadletter writes.
    - Disk saturation on nodes hosting libmdbx:
      - disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
    - Optional Postgres/Autobase panels only when a managed DB backs any control-plane metadata:
      - pool saturation, replica lag, slow queries, long transactions.
  - Required instrumentation:
    - Ensure JetStream metrics are scraped (NATS server `/varz` exporter or native Prometheus endpoint depending on deployment).
    - Ensure node-level disk/IO metrics are scraped (node exporter / cadvisor / equivalent).
- **Platform — Infrastructure Exhaustion**
  - Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
  - Panels (minimum):
    - CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
    - OOM kill tracker across the cluster.
    - Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
    - vmagent health:
      - scrape error rate, remote_write errors, queue backlog.
    - Loki ingestion health:
      - dropped log lines (promtail) and ingestion errors (loki).
    - Swarm task hygiene:
      - desired_state vs current_state mismatches, orphaned tasks, restart loops.
  - Required instrumentation:
    - node exporter / cadvisor (or equivalent) must be part of the production scrape plan.
    - promtail (or alternative) must expose drop/error metrics when logs are enabled.
#### 7.4 Alerting Requirements
Minimum alert classes:
- Availability:
  - service down (`up == 0`)
  - scrape failures, vmagent remote_write errors
- Reliability:
  - sustained elevated 5xx ratio
  - sustained elevated p95 latency per service
- Backlogs:
  - JetStream consumer lag above threshold
  - Runner outbox depth above threshold
- Data safety:
  - disk usage near full (nodes hosting libmdbx)
  - abnormal restart loops
- Security:
  - login anomaly detection signals (where instrumented)
  - suspicious spike in session revocations / failed MFA
Alert rules can follow UltraBase's approach of version-controlled rules in YAML (reference: [alerts/](file:///Users/vlad/Developer/madapes/ultrabase/observability/alerts)).
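
The “sustained elevated” alert classes imply that a single bad sample should not fire. The sketch below shows that idea in plain Python rather than in any rule language; `sustained_breach`, `error_ratio`, and the example threshold are hypothetical, and in production this logic would normally live in vmalert or Grafana Alerting rules rather than application code.

```python
def error_ratio(total_5xx, total_requests):
    """5xx ratio for one evaluation interval; 0.0 when there was no traffic."""
    return total_5xx / total_requests if total_requests else 0.0


def sustained_breach(samples, threshold, min_consecutive):
    """True if the last `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])
```

For instance, with a 5% threshold and three required consecutive intervals, a series that recovers in any of the last three intervals does not fire.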
#### 7.5 Control Plane → Observability Linking
The Admin UI must embed or deep-link into observability tools:
- per-tenant and per-service quick links to Grafana dashboards and Loki queries
- incident triage shortcuts (operations overview → HTTP detail → logs)
This mirrors UltraBase's “observability links JSON” concept ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L65-L75)), but adapted to Cloudlysis services and dashboards.
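
A per-tenant quick link could be assembled along these lines. The sketch relies on Grafana's convention of passing dashboard template variables as `var-<name>` query parameters; the function name, the dashboard UID and slug, and the assumption that the target dashboard defines `tenant_id` and `service` variables are all illustrative.

```python
from urllib.parse import urlencode


def grafana_dashboard_link(base_url, dashboard_uid, slug, tenant_id=None, service=None):
    """Build a dashboard URL with template variables passed as var-<name> parameters."""
    params = {}
    if tenant_id is not None:
        params["var-tenant_id"] = tenant_id
    if service is not None:
        params["var-service"] = service
    url = f"{base_url.rstrip('/')}/d/{dashboard_uid}/{slug}"
    # urlencode escapes values, so tenant ids with special characters stay safe in the URL.
    return f"{url}?{urlencode(params)}" if params else url
```

Keeping link construction on the server side means the UI never has to concatenate user-controlled values into URLs itself.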
---
### 8) Audit, Compliance, and Change History
- Audit log is an append-only stream of security and operations events:
  - authentication and session events
  - RBAC changes and permission grants
  - config/definition changes and promotions
  - scaling, drain, and migration operations
  - deployments and rollbacks
- Audit log must support:
  - search and export (bounded and access controlled)
  - correlation to production incidents (request ids, trace ids)
  - retention policy controls
---
### 9) Control Plane API Surface (Admin API)
The control plane requires a stable API surface for the Admin UI and automation.
Minimum API capabilities:
- **Idempotent jobs for multi-step operations**:
  - every mutating operation returns a `job_id`, supports polling and cancellation, and records a full execution trace in the audit log.
- **Preflight endpoints**:
  - validate an intended change and return a plan (and “would-change” diff) without applying it.
- **RBAC-first access model**:
  - all endpoints enforce permission checks at the API boundary (UI is not trusted).
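
The job contract described above (a `job_id`, polling, cancellation, and an execution trace) can be sketched as a small state machine. The state names, the `Job` class, and the in-memory trace are assumptions for illustration; where job state actually lives is listed as an open question later in this document.

```python
import uuid
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


TERMINAL = {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED}


class Job:
    def __init__(self, operation):
        self.job_id = str(uuid.uuid4())
        self.operation = operation
        self.state = JobState.PENDING
        self.trace = [JobState.PENDING.value]   # execution trace destined for the audit log

    def transition(self, new_state):
        # Terminal states are final, so late updates cannot resurrect a finished job.
        if self.state in TERMINAL:
            return False
        self.state = new_state
        self.trace.append(new_state.value)
        return True

    def cancel(self):
        return self.transition(JobState.CANCELLED)

    def poll(self):
        return {"job_id": self.job_id, "state": self.state.value}
```

Treating cancellation as just another terminal transition keeps the polling response simple: clients only ever need to check whether the reported state is terminal.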
Minimum endpoint groups:
- `/admin/v1/iam/*` (users, roles, assignments, sessions)
- `/admin/v1/tenants/*` (tenants lifecycle, status, metadata)
- `/admin/v1/config/*` (versioned config, diff, apply, rollback)
- `/admin/v1/definitions/*` (bundles, validate, promote, rollback)
- `/admin/v1/scale/*` (placement, migrations, drain status)
- `/admin/v1/ops/*` (deployments, rollbacks, service actions)
- `/admin/v1/observability/*` (links, saved queries, dashboard registry)
- `/admin/v1/audit/*` (search, export)
Authentication/authorization integration:
- Prefer using the **Gateway** as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.
---
### 10) Secrets and Credentials Management
The control plane must treat secrets as first-class operational data with strict handling.
Requirements:
- Secret values must never be logged and must be redacted in UI/API responses.
- Secrets must support:
  - creation and rotation workflows
  - scoped access (global/tenant/environment)
  - staged rollout (write new → verify → promote → retire old)
- Rendering rules:
  - after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
- Integrations:
  - support referencing secrets from config/definitions without embedding values (secret refs).
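
As a sketch of the redaction and secret-ref requirements, the example below masks known secret fields in API responses and resolves references only at use time. The `{"secret_ref": name}` marker format, the `"***"` mask, and both function names are hypothetical choices for this example.

```python
REDACTED = "***"


def redact(document, secret_keys):
    """Return a copy of a (possibly nested) dict with secret values masked."""
    out = {}
    for key, value in document.items():
        if key in secret_keys:
            out[key] = REDACTED
        elif isinstance(value, dict):
            out[key] = redact(value, secret_keys)
        else:
            out[key] = value
    return out


def resolve_refs(config, secrets):
    """Replace {"secret_ref": name} markers with values at use time only."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, dict) and set(value) == {"secret_ref"}:
            resolved[key] = secrets[value["secret_ref"]]
        else:
            resolved[key] = value
    return resolved
```

Because the stored config only ever contains the reference, diffs, audit entries, and exports of that config cannot leak the secret value.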
---
### 11) Backups, Restore, and Disaster Recovery (Production Operability)
The control plane must provide explicit visibility and guardrails for data safety operations.
Minimum requirements:
- **Backup status**:
  - show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
- **Restore readiness**:
  - preflight checks that validate a restore plan (target environment, versions, dependencies).
- **Operational playbooks**:
  - link to the exact restore procedure and post-restore verification checklist.
- **Key rotation**:
  - explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.
This should align with the platform's existing operational patterns (e.g., the explicit “restore / post-restore checks” concept used in UltraBase observability docs).
---
## **Admin UI Requirements (Information Architecture + UX)**
### Navigation (Minimum)
Left navigation sections:
- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings
### Page Patterns (Reuse UltraBase UI)
Adopt the UltraBase component system and page layout patterns:
- Layout, styling tokens, UI primitives: [components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)
- Table + search + action dropdown pattern: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx#L94-L203)
Required page types:
- List pages:
  - searchable table, bulk actions, row actions menu, status pills, empty states
- Detail pages:
  - header with primary actions (drain, migrate, rollback)
  - sub-nav tabs for domain-specific views
- Mutation flows:
  - modal confirmation + explicit reason entry for high-impact changes
  - toast notifications and “busy” state handling consistent with UltraBase patterns
### Tenant Detail Subpages (Minimum)
- Overview (status, assignments, SLO highlights)
- Placement (per service: Aggregate/Projection/Runner)
- Health (node readiness and dependency checks)
- Config (effective config + diffs)
- Definitions (applied definition bundle + version)
- Activity (audit trail filtered to tenant)
- Observability (embedded links / panels)
---
## **Non-Functional Requirements**
- **Security**:
  - strict RBAC everywhere; deny-by-default
  - audit every privileged operation
  - step-up for sensitive actions
  - CSRF protection for browser sessions
  - safe secret handling (no secret values rendered after creation unless explicitly permitted)
  - allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
- **Reliability**:
  - control plane operations are idempotent and resilient to partial failures
  - operations have clear “current state” and do not rely on UI assumptions
- **Performance**:
  - list pages paginate and filter server-side for large fleets
  - dashboards load with bounded query costs and controlled label cardinality
- **Operability**:
  - control plane itself must be observable (metrics/logs, dashboards, alerts)
  - every operation must surface preflight checks and post-conditions
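
For the outbound allowlist mentioned under Security, one possible check compares the parsed host and port of a target URL against a fixed set of origins. The function name, the https-only rule, and the example hostnames used when calling it are assumptions; a production check would likely also need to consider redirects and DNS resolution.

```python
from urllib.parse import urlsplit


def is_allowed_outbound(url, allowed_origins):
    """Allow only https URLs whose host and port exactly match an allowlisted origin."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname is None:
        return False
    try:
        port = parts.port or 443
    except ValueError:
        # Malformed or out-of-range ports are rejected rather than guessed.
        return False
    return (parts.hostname, port) in allowed_origins
```

Matching on the parsed `hostname` rather than on a string prefix avoids being fooled by URLs that place an allowlisted name in the userinfo part, such as `https://grafana.internal@evil.example/`.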
---
## **Open Questions / Design Constraints (To Resolve During Implementation)**
- Where does the source of truth live for:
  - users/sessions/roles (Gateway vs control-plane backing store)?
  - configs/definitions (NATS KV vs database vs GitOps)?
- How should production promotions be modeled:
  - environment branches, approval workflow, and rollback semantics?
- What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
- Where should the job/execution state for long-running operations live:
  - embedded in the control plane API process, durable store, or NATS workflows?