### 🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)

**Definition:** This repository hosts the **platform control plane**: 1) the **Admin UI** used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and 2) the **observability stack** and **production dashboards** (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.

The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.

---

## **Context: Existing Node Repositories (../)**

This PRD is derived from the currently implemented node repos in `../`:

- **Aggregate**: expects a control node to manage tenant placement and scaling operations, including tenant migrations ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L82-L151)). Tenant placement primitives and a KV helper exist ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L5-L227)).
- **Gateway**: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production ([gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L13-L175)).
- **Projection**: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L7-L96)).
- **Runner**: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics ([tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L11-L104)), and exposes admin endpoints for drain/reload in its PRD ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L199-L210)).

The control plane also adopts the proven **Admin UI UX + component library** from UltraBase’s control-plane admin UI, adapting screens and information architecture to Cloudlysis needs:

- Reusable UI components live under [ui/control-plane-admin/src/components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui).
- Example pages include [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx), [AdminUsersPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminUsersPage.tsx), [AdminSessionsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminSessionsPage.tsx), [FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx), and [ObservabilityPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/ObservabilityPage.tsx).
---

## **Problem Statement**

Operating the platform without a unified control plane forces operators to:

- Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
- Manage tenants, placements, and deployments without a consistent audit trail
- Correlate production incidents across services with incomplete dashboards and unsafe levels of access

The platform needs a control plane that:

- Centralizes **admin workflows** and **production operability**
- Enforces **least-privilege RBAC**, **step-up**, and **auditing**
- Provides a consistent, safe abstraction over **tenant placement**, **scale**, and **production operations**

---

## **Goals**

- Deliver an Admin UI with full admin management over:
  - users, sessions, roles/permissions
  - configuration (global + per-tenant)
  - definitions (aggregates, projections, sagas, effects, manifests)
  - scaling and production management (tenant placement, drains, migrations, deployments)
- Package production-grade monitoring:
  - metrics via VictoriaMetrics
  - logs via Loki
  - dashboards and alerting via Grafana (+ vmalert where used)
- Make production operations observable, auditable, and safe by default:
  - strong change logging + approvals where needed
  - idempotent operations + dry runs + rollback paths

---

## **Non-Goals**

- Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
- Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
- Provide an arbitrary “general API gateway” for third-party upstreams.

---

## **Primary Users**

- **Platform Owner / SRE**: fleet operations, incident response, production change management.
- **Platform Admin**: tenant provisioning, RBAC, config/definition promotion.
- **Security Admin**: access reviews, session revocation, audit trails.
- **Support / On-call**: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).
---

## **Key Concepts**

### Control Plane Scope

- The control plane is the authoritative interface for production operations and admin management.
- The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.

### Tenant-Aware Operations

- All tenant-scoped operations are keyed by `tenant_id` (consistent with `x-tenant-id` usage across nodes and Gateway).
- Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L188-L226), [tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L41-L104)).

### Safe Change Management

- Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
- All high-impact operations support:
  - validation and preflight checks
  - dry-run planning
  - idempotency keys
  - explicit rollback guidance

### Control Plane Components (In This Repo)

- **Admin UI (React)**:
  - Reuse UltraBase’s control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements ([components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)).
  - The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
- **Control Plane API (BFF / Admin API)**:
  - A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
  - Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
- **Observability Stack**:
  - Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after UltraBase’s baseline ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)).

---

## **Functional Requirements**

### 1) Admin IAM (Users, Sessions, Roles)

#### 1.1 Users

- CRUD users with lifecycle states:
  - invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
- Identity attributes:
  - email (primary), optional secondary identities
  - display name, avatar, metadata tags
  - auth methods enabled (password, OIDC providers), MFA state
- Administrative actions:
  - invite/resend invite
  - reset password flow initiation
  - force MFA reset / revoke recovery codes
  - disable login / suspend user
  - impersonation (break-glass, audited, time-boxed)
- Security constraints:
  - privileged actions require step-up / recent auth
  - sensitive events must be audit logged (who, what, when, why, from where)

#### 1.2 Sessions

- View active sessions and refresh token families:
  - by user, by tenant, by IP / geo, by device, by time range
- Revoke capabilities:
  - revoke a single session
  - revoke all sessions for a user
  - revoke all sessions for a tenant (incident response)
- Detection surfaces:
  - unusual session fanout (many sessions per user)
  - repeated failed logins / MFA failures
  - suspicious IP changes

#### 1.3 Roles & Permissions (RBAC)

- Roles are sets of permissions; assignments bind principals to roles in a scope.
- Scopes:
  - global (platform-level)
  - tenant-scoped
  - environment-scoped (dev/staging/prod) when applicable
- Required permission domains (minimum):
  - `iam.users.*` (create/update/suspend/delete)
  - `iam.sessions.*` (list/revoke)
  - `iam.roles.*` (create/update/assign)
  - `tenants.*` (create/update/archive)
  - `configs.*` (read/write/approve/apply)
  - `definitions.*` (read/write/validate/promote/rollback)
  - `scale.*` (view/apply/migrate/drain)
  - `ops.*` (deploy/rollback/restart/drain)
  - `observability.*` (view dashboards, manage alert rules)
  - `audit.*` (view/export)
- Role templates:
  - owner, admin, operator, support, read-only, security-admin, break-glass

---

### 2) Tenant Management

- Create, list, and archive tenants.
- Tenant status model:
  - provisioning, active, draining, migrating, degraded, suspended, archived
- Tenant metadata:
  - plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
- Tenant operational actions:
  - trigger provisioning workflows (create streams/buckets, seed configs, create placement)
  - rotate tenant secrets (as definitions/config allow)
  - pause/resume workload (soft kill switch via config flags)

Tenant pages should mirror UltraBase’s “Tenant Overview + subpages” navigation patterns (example: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx) and [TenantOverviewPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantOverviewPage.tsx)).
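The scoped RBAC model in 1.3 (permission domains × global/tenant/environment scopes, deny-by-default) could be sketched roughly as follows; all type and function names here are illustrative assumptions, not an existing API:

```typescript
// Illustrative sketch of scoped RBAC checks; names are assumptions.
type Scope =
  | { kind: "global" }
  | { kind: "tenant"; tenantId: string }
  | { kind: "env"; env: "dev" | "staging" | "prod" };

interface RoleAssignment {
  role: string;            // e.g. "operator"
  permissions: string[];   // e.g. ["scale.view", "scale.apply"] or ["scale.*"]
  scope: Scope;
}

// A granted permission matches if it is equal to the required one,
// or if it is a wildcard domain covering it (e.g. "scale.*" -> "scale.apply").
function permissionMatches(granted: string, required: string): boolean {
  if (granted === required) return true;
  if (granted.endsWith(".*")) return required.startsWith(granted.slice(0, -1));
  return false;
}

// A global grant covers every scope; tenant/env grants cover only their own.
function scopeCovers(granted: Scope, target: Scope): boolean {
  if (granted.kind === "global") return true;
  if (granted.kind === "tenant" && target.kind === "tenant")
    return granted.tenantId === target.tenantId;
  if (granted.kind === "env" && target.kind === "env")
    return granted.env === target.env;
  return false;
}

// Deny-by-default: access requires at least one matching assignment.
function hasPermission(
  assignments: RoleAssignment[],
  required: string,
  target: Scope
): boolean {
  return assignments.some(
    (a) =>
      scopeCovers(a.scope, target) &&
      a.permissions.some((p) => permissionMatches(p, required))
  );
}
```

The same check would run at the Admin API boundary for every endpoint group, never only in the UI.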
---

### 3) Configuration Management (Global + Per-Tenant)

#### 3.1 Config Model

- Config items are versioned, typed documents with:
  - scope (global / tenant / environment)
  - schema version
  - provenance (who/what wrote it)
  - effective date and rollout strategy
- Config must support:
  - validation against a schema
  - diff view (previous vs next)
  - staged rollout (preview → apply)
  - rollback to a prior version

#### 3.2 Node-Related Configuration

Required config surfaces (minimum):

- **Gateway**: routing/placement sources, auth policies, rate limits (see routing expectations in [gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L154-L175)).
- **Aggregate / Projection / Runner**:
  - shard identifiers and tenant allowlists/placement settings
  - drain/reload toggles and safety thresholds
  - resource limits / concurrency caps

---

### 4) Definition Management (System “Blueprints”)

Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.
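One hypothetical way to model a versioned definition bundle and the “impact analysis” question (which tenants does a definition change touch?) is sketched below; the types, id format, and helpers are assumptions for illustration only:

```typescript
// Illustrative sketch; all names and the "id@version" convention are assumptions.
type Env = "dev" | "staging" | "prod";

interface DefinitionBundle {
  bundleId: string;
  version: number;
  definitionIds: string[]; // e.g. ["agg.order@2", "proj.orders@3", "saga.fulfil@1"]
  promotedTo: Env[];       // environments this bundle version has reached
}

// Symmetric difference of definition ids: anything added, removed, or
// re-versioned between the current and next bundle counts as changed.
function changedDefinitions(current: string[], next: string[]): Set<string> {
  const a = new Set(current);
  const b = new Set(next);
  const changed = new Set<string>();
  for (const id of b) if (!a.has(id)) changed.add(id);
  for (const id of a) if (!b.has(id)) changed.add(id);
  return changed;
}

// Impact analysis: tenants whose applied definitions intersect the change set.
function affectedTenants(
  changed: Set<string>,
  tenantDefinitions: Map<string, string[]> // tenant_id -> definition ids in use
): string[] {
  const hit: string[] = [];
  for (const [tenantId, ids] of tenantDefinitions) {
    if (ids.some((id) => changed.has(id))) hit.push(tenantId);
  }
  return hit;
}
```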
Required capabilities:

- Upload/edit versioned definitions with:
  - validation (schema + semantic checks)
  - “impact analysis” (which tenants/services are affected)
  - promotion workflow (dev → staging → prod)
- Change controls:
  - approvals (role-based) for production promotion
  - emergency rollback path (one-click revert to last-known-good definition bundle)
- Tenant overrides:
  - allow per-tenant definition overrides only when explicitly permitted by policy

The control plane must present definitions in a way that maps to the node runtime responsibilities:

- Aggregates and deterministic decide/apply programs ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L155-L160))
- Projections and deterministic project programs ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L36-L55))
- Runner sagas and effect provider manifests ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L41-L57))

---

### 5) Scale Management (Tenant Placement, Shards, Fleet)

#### 5.1 Placement Model

- Placement is modeled as:
  - a set of nodes/shards and their attributes (labels, capacity, region)
  - tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
- The control plane supports both:
  - static placement (development)
  - dynamic placement (production) backed by NATS KV (consistent with existing client patterns in [swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L79-L227))

#### 5.2 Tenant Migration

- Provide guided migration planning and execution:
  - show current assignment, target assignment, and a sequenced action plan
  - execute “graceful drain → update placement → reload” style plans (see [plan_graceful_tenant_migration](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L41-L65))
- Migration safety:
  - require explicit confirmation and reason
  - block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
  - time-box and alert if drains do not converge

#### 5.3 Fleet View

- Fleet inventory:
  - nodes (labels, region, capacity, version)
  - services (replicas, image version, health)
  - per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
- Operator actions:
  - scale replicas, restart services, cordon/drain nodes (when supported by the orchestrator)

UX should align with the UltraBase “Fleet” and “Topology” navigation patterns ([FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx)).

---

### 6) Production Operations (Deployments, Maintenance, Safety)

#### 6.1 Deployments

- Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
  - environment-specific rollout policies
  - canary/rolling deploy support (when the orchestrator supports it)
  - automatic health-check gates and rollback triggers
- Track releases:
  - “what is running where” (service version matrix)
  - change log links and approvals

#### 6.2 Maintenance Operations

- Drain operations:
  - tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in [TenantGate](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L106-L200))
  - node drain (aggregate tenant ranges, projection consumers, runner workers)
- Replay / rebuild operations:
  - projection rebuild triggers (dangerous, must be guarded and audited)
  - workflow replay controls (reset checkpoints only with explicit intent)

#### 6.3 Incident Response Toolkit

- “Safe switches”:
  - per-tenant kill switch (disable commands/effects via config)
  - global degrade modes (rate limit reductions, disable expensive features)
- Run actions:
  - revoke sessions at scale
  - freeze deployments
  - trigger drain/migrate with guided plan

---

### 7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards

#### 7.1 Stack Requirements

Adopt a production-ready stack consistent with UltraBase’s operational baseline:

- **VictoriaMetrics** for metrics storage and Prometheus-compatible query
- **vmagent** for scraping and remote_write
- **Grafana** for dashboards and alert routing
- **Loki** (+ optional **Promtail**) for logs
- Optional **vmalert** for rule evaluation against VictoriaMetrics

UltraBase’s observability design is a direct reference implementation to mirror and adapt:

- Stack overview and conventions: [observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)
- Provisioned dashboards and datasources: [grafana provisioning](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning)

#### 7.2 Metrics Conventions

- Every service exports `/metrics` in Prometheus format.
- Required labels:
  - `service` (stable, low cardinality)
  - `env` (dev/staging/prod)
  - `tenant_id` only where safe and bounded; avoid `tenant_id` on high-frequency per-request series unless cardinality is controlled.
- HTTP metrics must avoid unbounded `path` cardinality; prefer route templates (pattern-based paths).

Tenant-aware metrics guidelines:

- Prefer **tenant-only aggregates** for “who is hurting us?” views:
  - `..._requests_total{tenant_id,service,status_class}` (no `path`)
  - `..._request_duration_seconds{tenant_id,service}` (no `path`, limited bucket count)
- Prefer **route-only aggregates** for “what endpoint is hurting us?” views:
  - `..._requests_total{service,path,status}` (no `tenant_id`)
- Where per-tenant and per-route both matter, implement a **top-k sampling** policy:
  - emit `(tenant_id,path)` series only for the top N tenants, or only for a fixed allowlist of routes.
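The top-k sampling policy above can be sketched as a small label-gating helper: full `(tenant_id, path)` labels are emitted only for the current top-N tenants by request volume, and every other tenant is folded into a single `tenant_id="other"` series to bound cardinality. This is an illustrative sketch, not tied to any particular metrics client:

```typescript
// Illustrative top-k gate for (tenant_id, path) series; names are assumptions.
class TopKTenantGate {
  private counts = new Map<string, number>();
  constructor(private readonly k: number) {}

  record(tenantId: string): void {
    this.counts.set(tenantId, (this.counts.get(tenantId) ?? 0) + 1);
  }

  // Labels for one request: full labels only for top-k tenants by volume;
  // all other tenants collapse into tenant_id="other".
  labelsFor(tenantId: string, routeTemplate: string): { tenant_id: string; path: string } {
    this.record(tenantId);
    const topK = [...this.counts.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, this.k)
      .map(([t]) => t);
    return topK.includes(tenantId)
      ? { tenant_id: tenantId, path: routeTemplate }
      : { tenant_id: "other", path: routeTemplate };
  }
}
```

A production version would use a decaying counter or sliding window rather than an ever-growing count, and would pass the returned labels to whatever metrics client the service already uses.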
#### 7.3 Required Dashboards (Production)

Minimum set of dashboards (provisioned on startup):

- **Platform — Operations overview**
  - `up` for core services and observability stack
  - RPS, 4xx/5xx ratio, p95/p99 latency per service
  - saturation indicators (CPU/mem, inflight, queue depth)
- **Platform — HTTP detail**
  - per-service request breakdown by route template, method, status
  - top failing paths and latency outliers
- **Platform — Logs**
  - Loki stream filtering by `service`, `tenant_id` (where present), and correlation identifiers
- **Platform — Event bus / JetStream**
  - consumer lag, redeliveries, ack latency, stream storage pressure
- **Platform — Workers (Runner)**
  - outbox depth, effect latency, poison message counts, schedules backlog
- **Platform — Storage (libmdbx)**
  - DB size growth, write stalls, fsync latency (where exported), disk usage
- **Platform — Cluster / Orchestrator**
  - node health, container restarts, placement distribution by tenant range

Dashboards should be modeled after UltraBase’s default set (for structure, not content), e.g. [ultrabase-operations.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-operations.json) and [ultrabase-http-detail.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-http-detail.json).

Additional production-operability dashboards (chosen and adapted):

- **Platform — Noisy Neighbor & Tenant Health**
  - Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
  - Panels (minimum):
    - Top tenants by Gateway RPS (topk of tenant-only request counters).
    - Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
    - Tenant error ratio (5xx and 429) per tenant.
    - Aggregate in-flight commands by tenant (already exported: `aggregate_in_flight_commands{tenant_id}`).
    - Projection processing error rate by tenant (from `projection_processing_errors_total{tenant_id,view_type}` aggregated per tenant).
    - Loki logs panel with a `tenant_id` variable selector; selecting a tenant syncs RPS/latency/errors + logs.
  - Required instrumentation:
    - Gateway must expose **tenant-level** HTTP counters/histograms (tenant + status class + service, without `path`) in addition to existing route-level metrics.
- **Platform — API Regression & Deployment**
  - Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
  - Panels (minimum):
    - Error rate comparison “old vs new” by `service` and `version` (or `image_tag`) labels.
    - Latency comparison “old vs new” (p95/p99) per service.
    - Restart / flapping rate per service (container restarts, crash loops).
    - Dependency latency correlation:
      - Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
    - Loki “new errors” panel:
      - errors seen in the last 10m that were not present in the prior 60m window, grouped by `service`.
    - Deployment annotations:
      - vertical markers when Swarm service updates started/finished (via annotations or a deploy event metric).
  - Required instrumentation:
    - Every service exports a `*_build_info{service,version,git_sha}` gauge (value=1) or equivalent, and scrape relabeling adds `image_tag` where possible.
    - Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes to Grafana annotations).
- **Platform — Storage & Event Bus Bottlenecks**
  - Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
  - Panels (minimum):
    - NATS/JetStream health:
      - stream storage pressure, publish/ack latency, consumer lag, redeliveries.
    - Projection lag and throughput:
      - events processed rate, processing duration, error rate.
    - Aggregate write-path pressure:
      - command duration, version conflicts, in-flight commands, tenant errors.
    - Runner pressure:
      - outbox dispatch failure rate, effect timeout rate, deadletter writes.
    - Disk saturation on nodes hosting libmdbx:
      - disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
    - Optional Postgres/Autobase panels, only when a managed DB backs any control-plane metadata:
      - pool saturation, replica lag, slow queries, long transactions.
  - Required instrumentation:
    - Ensure JetStream metrics are scraped (NATS server `/varz` exporter or native Prometheus endpoint, depending on deployment).
    - Ensure node-level disk/IO metrics are scraped (node exporter / cadvisor / equivalent).
- **Platform — Infrastructure Exhaustion**
  - Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
  - Panels (minimum):
    - CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
    - OOM kill tracker across the cluster.
    - Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
    - vmagent health:
      - scrape error rate, remote_write errors, queue backlog.
    - Loki ingestion health:
      - dropped log lines (promtail) and ingestion errors (loki).
    - Swarm task hygiene:
      - desired_state vs current_state mismatches, orphaned tasks, restart loops.
  - Required instrumentation:
    - node exporter / cadvisor (or equivalent) must be part of the production scrape plan.
    - promtail (or an alternative) must expose drop/error metrics when logs are enabled.
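The `*_build_info{service,version,git_sha}` convention required by the regression dashboard above amounts to a constant gauge whose labels carry version identity. A minimal sketch of rendering such a gauge in the Prometheus text exposition format (the helper itself is illustrative, not an existing library API):

```typescript
// Illustrative sketch of a build_info gauge in Prometheus text format.
interface BuildInfo {
  service: string;
  version: string;
  gitSha: string;
}

// Renders one exposition line, e.g.:
//   controlplane_build_info{service="controlplane",version="1.4.2",git_sha="abc123"} 1
function renderBuildInfo(metricPrefix: string, b: BuildInfo): string {
  const labels = `service="${b.service}",version="${b.version}",git_sha="${b.gitSha}"`;
  return `${metricPrefix}_build_info{${labels}} 1`;
}
```

Because the gauge is always 1, joining it against request or error rates in PromQL lets dashboards split any series by `version` without adding version labels to the hot-path metrics themselves.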
#### 7.4 Alerting Requirements

Minimum alert classes:

- Availability:
  - service down (`up == 0`)
  - scrape failures, vmagent remote_write errors
- Reliability:
  - sustained elevated 5xx ratio
  - sustained elevated p95 latency per service
- Backlogs:
  - JetStream consumer lag above threshold
  - Runner outbox depth above threshold
- Data safety:
  - disk usage near full (nodes hosting libmdbx)
  - abnormal restart loops
- Security:
  - login anomaly detection signals (where instrumented)
  - suspicious spikes in session revocations / failed MFA

Alert rules can follow UltraBase’s approach of version-controlled rules in YAML (reference: [alerts/](file:///Users/vlad/Developer/madapes/ultrabase/observability/alerts)).

#### 7.5 Control Plane → Observability Linking

The Admin UI must embed or deep-link into observability tools:

- per-tenant and per-service quick links to Grafana dashboards and Loki queries
- incident triage shortcuts (operations overview → HTTP detail → logs)

This mirrors UltraBase’s “observability links JSON” concept ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L65-L75)), but adapted to Cloudlysis services and dashboards.

---

### 8) Audit, Compliance, and Change History

- The audit log is an append-only stream of security and operations events:
  - authentication and session events
  - RBAC changes and permission grants
  - config/definition changes and promotions
  - scaling, drain, and migration operations
  - deployments and rollbacks
- The audit log must support:
  - search and export (bounded and access controlled)
  - correlation to production incidents (request ids, trace ids)
  - retention policy controls

---

### 9) Control Plane API Surface (Admin API)

The control plane requires a stable API surface for the Admin UI and automation.
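One way to realize the idempotent, job-based mutation pattern this API surface needs is sketched below: a mutating request carries an idempotency key, and replays of the same key return the same job instead of starting a new one. The store, job shape, and naming are assumptions for illustration:

```typescript
// Illustrative sketch of idempotent jobs; all names are assumptions.
type JobStatus = "pending" | "running" | "succeeded" | "failed" | "cancelled";

interface Job {
  jobId: string;
  operation: string;  // e.g. "tenant.migrate"
  status: JobStatus;
  trace: string[];    // execution steps, also mirrored to the audit log
}

class JobStore {
  private byKey = new Map<string, Job>();
  private seq = 0;

  // The same (operation, idempotencyKey) pair always yields the same job,
  // so UI retries and automation replays cannot double-execute a drain/migrate.
  submit(operation: string, idempotencyKey: string): Job {
    const key = `${operation}:${idempotencyKey}`;
    const existing = this.byKey.get(key);
    if (existing) return existing;
    const job: Job = {
      jobId: `job-${++this.seq}`,
      operation,
      status: "pending",
      trace: [`accepted ${operation}`],
    };
    this.byKey.set(key, job);
    return job;
  }
}
```

In production the store would be durable (one of the open questions at the end of this document), and each job would carry its full execution trace for the audit log.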
Minimum API capabilities:

- **Idempotent jobs for multi-step operations**:
  - every mutating operation returns a `job_id`, supports polling and cancellation, and records a full execution trace in the audit log.
- **Preflight endpoints**:
  - validate an intended change and return a plan (and a “would-change” diff) without applying it.
- **RBAC-first access model**:
  - all endpoints enforce permission checks at the API boundary (the UI is not trusted).

Minimum endpoint groups:

- `/admin/v1/iam/*` (users, roles, assignments, sessions)
- `/admin/v1/tenants/*` (tenant lifecycle, status, metadata)
- `/admin/v1/config/*` (versioned config, diff, apply, rollback)
- `/admin/v1/definitions/*` (bundles, validate, promote, rollback)
- `/admin/v1/scale/*` (placement, migrations, drain status)
- `/admin/v1/ops/*` (deployments, rollbacks, service actions)
- `/admin/v1/observability/*` (links, saved queries, dashboard registry)
- `/admin/v1/audit/*` (search, export)

Authentication/authorization integration:

- Prefer using the **Gateway** as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.

---

### 10) Secrets and Credentials Management

The control plane must treat secrets as first-class operational data with strict handling.

Requirements:

- Secret values must never be logged and must be redacted in UI/API responses.
- Secrets must support:
  - creation and rotation workflows
  - scoped access (global/tenant/environment)
  - staged rollout (write new → verify → promote → retire old)
- Rendering rules:
  - after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
- Integrations:
  - support referencing secrets from config/definitions without embedding values (secret refs).
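The “secret refs” idea above can be sketched as follows: config documents carry references to secrets rather than values, display paths never resolve them, and only the point of use does. The `secret://` ref syntax and function names are assumptions for illustration, not an existing convention:

```typescript
// Illustrative sketch of secret refs; the "secret://" syntax is an assumption.
type ConfigValue = string; // either a literal or a "secret://scope/name" ref

function isSecretRef(v: ConfigValue): boolean {
  return v.startsWith("secret://");
}

// Display path: literals pass through; secret refs stay refs and are never
// resolved to plaintext in UI/API responses.
function renderForDisplay(config: Record<string, ConfigValue>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [k, v] of Object.entries(config)) {
    out[k] = isSecretRef(v) ? `${v} (redacted)` : v;
  }
  return out;
}

// Use path: refs are resolved only here, via an injected resolver backed by
// the secret store, so plaintext never lives inside the config document.
function resolveForUse(
  config: Record<string, ConfigValue>,
  resolve: (ref: string) => string
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [k, v] of Object.entries(config)) {
    out[k] = isSecretRef(v) ? resolve(v) : v;
  }
  return out;
}
```

Keeping the resolver injected (rather than baked into the config layer) also makes the write-only rendering rule above easy to enforce: the display path simply never receives a resolver.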
---

### 11) Backups, Restore, and Disaster Recovery (Production Operability)

The control plane must provide explicit visibility and guardrails for data safety operations.

Minimum requirements:

- **Backup status**:
  - show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
- **Restore readiness**:
  - preflight checks that validate a restore plan (target environment, versions, dependencies).
- **Operational playbooks**:
  - link to the exact restore procedure and post-restore verification checklist.
- **Key rotation**:
  - explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.

This should align with the platform’s existing operational patterns (e.g., the explicit “restore / post-restore checks” concept used in UltraBase observability docs).

---

## **Admin UI Requirements (Information Architecture + UX)**

### Navigation (Minimum)

Left navigation sections:

- Overview
- Tenants
- Users
- Sessions
- Roles & Permissions
- Config
- Definitions
- Scale & Placement
- Deployments
- Observability
- Audit Log
- Settings

### Page Patterns (Reuse UltraBase UI)

Adopt the UltraBase component system and page layout patterns:

- Layout, styling tokens, UI primitives: [components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)
- Table + search + action dropdown pattern: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx#L94-L203)

Required page types:

- List pages:
  - searchable table, bulk actions, row actions menu, status pills, empty states
- Detail pages:
  - header with primary actions (drain, migrate, rollback)
  - sub-nav tabs for domain-specific views
- Mutation flows:
  - modal confirmation + explicit reason entry for high-impact changes
  - toast notifications and “busy” state handling consistent with UltraBase patterns

### Tenant Detail Subpages (Minimum)

- Overview (status, assignments, SLO highlights)
- Placement (per service: Aggregate/Projection/Runner)
- Health (node readiness and dependency checks)
- Config (effective config + diffs)
- Definitions (applied definition bundle + version)
- Activity (audit trail filtered to tenant)
- Observability (embedded links / panels)

---

## **Non-Functional Requirements**

- **Security**:
  - strict RBAC everywhere; deny-by-default
  - audit every privileged operation
  - step-up for sensitive actions
  - CSRF protection for browser sessions
  - safe secret handling (no secret values rendered after creation unless explicitly permitted)
  - allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
- **Reliability**:
  - control plane operations are idempotent and resilient to partial failures
  - operations have a clear “current state” and do not rely on UI assumptions
- **Performance**:
  - list pages paginate and filter server-side for large fleets
  - dashboards load with bounded query costs and controlled label cardinality
- **Operability**:
  - the control plane itself must be observable (metrics/logs, dashboards, alerts)
  - every operation must surface preflight checks and post-conditions

---

## **Open Questions / Design Constraints (To Resolve During Implementation)**

- Where does the source of truth live for:
  - users/sessions/roles (Gateway vs control-plane backing store)?
  - configs/definitions (NATS KV vs database vs GitOps)?
- How should production promotions be modeled:
  - environment branches, approval workflow, and rollback semantics?
- What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
- Where should the job/execution state for long-running operations live:
  - embedded in the control plane API process, a durable store, or NATS workflows?