Monorepo consolidation: workspace, shared types, transport plans, docker/swarm assets

2026-03-30 11:40:42 +03:00
parent 7e7041cf8b
commit 1298d9a3df
246 changed files with 55434 additions and 0 deletions
--- a/control/prd.md
+++ b/control/prd.md
@@ -0,0 +1,601 @@
+### 🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)
+
+**Definition:**  
+This repository hosts the **platform control plane**:
+1) the **Admin UI** used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and  
+2) the **observability stack** and **production dashboards** (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.
+
+The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.
+
+---
+
+## **Context: Existing Node Repositories (../)**
+
+This PRD is derived from the currently implemented node repos in `../`:
+- **Aggregate**: expects a control node to manage tenant placement and scaling operations, including tenant migrations ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L82-L151)). Tenant placement primitives and a KV helper already exist ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L5-L227)).
+- **Gateway**: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production ([gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L13-L175)).
+- **Projection**: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L7-L96)).
+- **Runner**: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics ([tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L11-L104)) and exposes admin endpoints for drain/reload in its PRD ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L199-L210)).
+
+The control plane also adopts the proven **Admin UI UX + component library** from UltraBase’s control-plane admin UI, adapting screens and information architecture to Cloudlysis needs:
+- Reusable UI components live under [ui/control-plane-admin/src/components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui).
+- Example pages include [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx), [AdminUsersPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminUsersPage.tsx), [AdminSessionsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/AdminSessionsPage.tsx), [FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx), and [ObservabilityPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/ObservabilityPage.tsx).
+
+---
+
+## **Problem Statement**
+
+Operating the platform without a unified control plane forces operators to:
+- Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
+- Manage tenants, placements, and deployments without a consistent audit trail
+- Correlate production incidents across services with incomplete dashboards and unsafe levels of access
+
+The platform needs a control plane that:
+- Centralizes **admin workflows** and **production operability**
+- Enforces **least-privilege RBAC**, **step-up**, and **auditing**
+- Provides a consistent, safe abstraction over **tenant placement**, **scale**, and **production operations**
+
+---
+
+## **Goals**
+
+- Deliver an Admin UI with full admin management over:
+  - users, sessions, roles/permissions
+  - configuration (global + per-tenant)
+  - definitions (aggregates, projections, sagas, effects, manifests)
+  - scaling and production management (tenant placement, drains, migrations, deployments)
+- Package production-grade monitoring:
+  - metrics via VictoriaMetrics
+  - logs via Loki
+  - dashboards and alerting via Grafana (+ vmalert where used)
+- Make production operations observable, auditable, and safe by default:
+  - strong change logging + approvals where needed
+  - idempotent operations + dry runs + rollback paths
+
+---
+
+## **Non-Goals**
+
+- Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
+- Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
+- Provide an arbitrary “general API gateway” for third-party upstreams.
+
+---
+
+## **Primary Users**
+
+- **Platform Owner / SRE**: fleet operations, incident response, production change management.
+- **Platform Admin**: tenant provisioning, RBAC, config/definition promotion.
+- **Security Admin**: access reviews, session revocation, audit trails.
+- **Support / On-call**: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).
+
+---
+
+## **Key Concepts**
+
+### Control Plane Scope
+
+- The control plane is the authoritative interface for production operations and admin management.
+- The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.
+
+### Tenant-Aware Operations
+
+- All tenant-scoped operations are keyed by `tenant_id` (consistent with `x-tenant-id` usage across nodes and Gateway).
+- Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns ([swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L188-L226), [tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L41-L104)).
+
+### Safe Change Management
+
+- Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
+- All high-impact operations support:
+  - validation and preflight checks
+  - dry-run planning
+  - idempotency keys
+  - explicit rollback guidance
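A minimal sketch of how dry-run planning and idempotency keys could combine for a single placement change. `OperationRunner`, `set_placement`, and the in-memory result cache are hypothetical names used for illustration only; a real control plane would persist results durably.

```python
from dataclasses import dataclass, field


@dataclass
class OperationResult:
    applied: bool
    plan: list


@dataclass
class OperationRunner:
    """Hypothetical in-memory runner; not part of any existing repo."""
    state: dict = field(default_factory=dict)
    results: dict = field(default_factory=dict)

    def set_placement(self, tenant_id, shard, *, idempotency_key, dry_run=False):
        # Preflight validation happens before anything else.
        if not tenant_id or not shard:
            raise ValueError("tenant_id and shard are required")
        # A replayed key returns the recorded result instead of re-applying.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        current = self.state.get(tenant_id)
        plan = [] if current == shard else [f"move {tenant_id}: {current} -> {shard}"]
        if dry_run:
            # Dry runs report the plan but neither mutate state nor consume the key.
            return OperationResult(applied=False, plan=plan)
        self.state[tenant_id] = shard
        result = OperationResult(applied=bool(plan), plan=plan)
        self.results[idempotency_key] = result
        return result
```

Calling `set_placement` with `dry_run=True` first and then repeating the call without it mirrors the "preview, then apply" flow; retrying the second call with the same key is safe.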
+
+### Control Plane Components (In This Repo)
+
+- **Admin UI (React)**:
+  - Reuse UltraBase’s control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements ([components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)).
+  - The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
+- **Control Plane API (BFF / Admin API)**:
+  - A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
+  - Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
+- **Observability Stack**:
+  - Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after UltraBase’s baseline ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)).
+
+---
+
+## **Functional Requirements**
+
+### 1) Admin IAM (Users, Sessions, Roles)
+
+#### 1.1 Users
+
+- CRUD users with lifecycle states:
+  - invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
+- Identity attributes:
+  - email (primary), optional secondary identities
+  - display name, avatar, metadata tags
+  - auth methods enabled (password, OIDC providers), MFA state
+- Administrative actions:
+  - invite/resend invite
+  - reset password flow initiation
+  - force MFA reset / revoke recovery codes
+  - disable login / suspend user
+  - impersonation (break-glass, audited, time-boxed)
+- Security constraints:
+  - privileged actions require step-up / recent auth
+  - sensitive events must be audit logged (who, what, when, why, from where)
+
+#### 1.2 Sessions
+
+- View active sessions and refresh token families:
+  - by user, by tenant, by IP / geo, by device, by time range
+- Revoke capabilities:
+  - revoke a single session
+  - revoke all sessions for a user
+  - revoke all sessions for a tenant (incident response)
+- Detection surfaces:
+  - unusual session fanout (many sessions per user)
+  - repeated failed logins / MFA failures
+  - suspicious IP changes
+
+#### 1.3 Roles & Permissions (RBAC)
+
+- Roles are sets of permissions; assignments bind principals to roles in a scope.
+- Scopes:
+  - global (platform-level)
+  - tenant-scoped
+  - environment-scoped (dev/staging/prod) when applicable
+- Required permission domains (minimum):
+  - iam.users.* (create/update/suspend/delete)
+  - iam.sessions.* (list/revoke)
+  - iam.roles.* (create/update/assign)
+  - tenants.* (create/update/archive)
+  - configs.* (read/write/approve/apply)
+  - definitions.* (read/write/validate/promote/rollback)
+  - scale.* (view/apply/migrate/drain)
+  - ops.* (deploy/rollback/restart/drain)
+  - observability.* (view dashboards, manage alert rules)
+  - audit.* (view/export)
+- Role templates:
+  - owner, admin, operator, support, read-only, security-admin, break-glass
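One way the role, scope, and permission-domain model above could be evaluated is sketched below. The role contents, the `tenant:<id>` scope string format, and the `is_allowed` helper are assumptions made for the example, not a defined contract.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Assignment:
    principal: str
    role: str
    scope: str  # assumed format: "global" or "tenant:<tenant_id>"


# Hypothetical role contents; wildcards follow the `domain.*` style used above.
ROLES = {
    "operator": {"scale.*", "ops.drain", "observability.view"},
    "support": {"observability.view", "audit.view"},
}


def is_allowed(assignments, principal, permission, scope):
    """Deny by default; allow only if a matching role is bound in scope."""
    for a in assignments:
        if a.principal != principal:
            continue
        # A global assignment applies everywhere; otherwise scopes must match.
        if a.scope not in ("global", scope):
            continue
        if any(fnmatchcase(permission, pattern) for pattern in ROLES.get(a.role, ())):
            return True
    return False
```

The function returns `False` unless an explicit grant matches, which keeps the check deny-by-default.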
+
+---
+
+### 2) Tenant Management
+
+- Create, list, and archive tenants.
+- Tenant status model:
+  - provisioning, active, draining, migrating, degraded, suspended, archived
+- Tenant metadata:
+  - plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
+- Tenant operational actions:
+  - trigger provisioning workflows (create streams/buckets, seed configs, create placement)
+  - rotate tenant secrets (as definitions/config allow)
+  - pause/resume workload (soft kill switch via config flags)
+
+Tenant pages should mirror UltraBase’s “Tenant Overview + subpages” navigation patterns (example: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx) and [TenantOverviewPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantOverviewPage.tsx)).
+
+---
+
+### 3) Configuration Management (Global + Per-Tenant)
+
+#### 3.1 Config Model
+
+- Config items are versioned, typed documents with:
+  - scope (global / tenant / environment)
+  - schema version
+  - provenance (who/what wrote it)
+  - effective date and rollout strategy
+- Config must support:
+  - validation against a schema
+  - diff view (previous vs next)
+  - staged rollout (preview → apply)
+  - rollback to a prior version
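The diff and rollback requirements above can be illustrated with a small in-memory version history. `ConfigHistory` is a hypothetical class; the storage backend is still an open question in this PRD, so nothing here implies a chosen store. Rollback is modeled as appending a new version so that history stays append-only.

```python
import difflib
import json


class ConfigHistory:
    """Hypothetical append-only history of config documents (version = index + 1)."""

    def __init__(self):
        self._versions = []

    def apply(self, doc):
        # Round-trip through JSON to store an independent copy.
        self._versions.append(json.loads(json.dumps(doc)))
        return len(self._versions)

    def current(self):
        return self._versions[-1]

    def diff(self, version, proposed):
        """Preview: unified diff of a stored version against a proposed document."""
        before = json.dumps(self._versions[version - 1], indent=2, sort_keys=True).splitlines()
        after = json.dumps(proposed, indent=2, sort_keys=True).splitlines()
        return list(difflib.unified_diff(before, after, f"v{version}", "proposed", lineterm=""))

    def rollback(self, version):
        # Rolling back re-applies an old document as a new version.
        return self.apply(self._versions[version - 1])
```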
+
+#### 3.2 Node-Related Configuration
+
+Required config surfaces (minimum):
+- **Gateway**: routing/placement sources, auth policies, rate limits (see routing expectations in [gateway/prd.md](file:///Users/vlad/Developer/cloudlysis/gateway/prd.md#L154-L175)).
+- **Aggregate / Projection / Runner**:
+  - shard identifiers and tenant allowlists/placement settings
+  - drain/reload toggles and safety thresholds
+  - resource limits / concurrency caps
+
+---
+
+### 4) Definition Management (System “Blueprints”)
+
+Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.
+
+Required capabilities:
+- Upload/edit versioned definitions with:
+  - validation (schema + semantic checks)
+  - “impact analysis” (which tenants/services are affected)
+  - promotion workflow (dev → staging → prod)
+- Change controls:
+  - approvals (role-based) for production promotion
+  - emergency rollback path (one-click revert to last-known-good definition bundle)
+- Tenant overrides:
+  - allow per-tenant definition overrides only when explicitly permitted by policy
+
+The control plane must present definitions in a way that maps to the node runtime responsibilities:
+- Aggregates and deterministic decide/apply programs ([aggregate/prd.md](file:///Users/vlad/Developer/cloudlysis/aggregate/prd.md#L155-L160))
+- Projections and deterministic project programs ([projection/prd.md](file:///Users/vlad/Developer/cloudlysis/projection/prd.md#L36-L55))
+- Runner sagas and effect provider manifests ([runner/prd.md](file:///Users/vlad/Developer/cloudlysis/runner/prd.md#L41-L57))
+
+---
+
+### 5) Scale Management (Tenant Placement, Shards, Fleet)
+
+#### 5.1 Placement Model
+
+- Placement is modeled as:
+  - a set of nodes/shards and their attributes (labels, capacity, region)
+  - tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
+- Control plane supports both:
+  - static placement (development)
+  - dynamic placement (production) backed by NATS KV (consistent with existing client patterns in [swarm.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L79-L227))
+
+#### 5.2 Tenant Migration
+
+- Provide guided migration planning and execution:
+  - show current assignment, target assignment, and a sequenced action plan
+  - execute “graceful drain → update placement → reload” style plans (see [plan_graceful_tenant_migration](file:///Users/vlad/Developer/cloudlysis/aggregate/src/swarm.rs#L41-L65))
+- Migration safety:
+  - require explicit confirmation and reason
+  - block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
+  - time-box and alert if drains do not converge
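As a rough illustration of the "drain, update placement, reload" sequencing and the safety blocks listed above, the sketch below builds a plan only when preflight passes. It does not reproduce `plan_graceful_tenant_migration` from `swarm.rs`; the thresholds, field names, and step strings are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TenantHealth:
    inflight: int
    consumer_lag: int
    storage_healthy: bool


# Hypothetical thresholds; real values would come from config.
MAX_INFLIGHT = 100
MAX_CONSUMER_LAG = 1000


def preflight(health):
    """Return the list of reasons a drain would be unsafe (empty means safe)."""
    blockers = []
    if health.inflight > MAX_INFLIGHT:
        blockers.append("inflight work too high")
    if not health.storage_healthy:
        blockers.append("storage unhealthy")
    if health.consumer_lag > MAX_CONSUMER_LAG:
        blockers.append("consumer lag too high")
    return blockers


def plan_migration(tenant_id, source, target, health, reason):
    if not reason.strip():
        raise ValueError("an explicit reason is required")
    blockers = preflight(health)
    if blockers:
        raise RuntimeError("migration blocked: " + "; ".join(blockers))
    if source == target:
        return []  # nothing to do
    return [
        f"drain {tenant_id} on {source}",
        f"update placement {tenant_id} -> {target}",
        f"reload {target}",
    ]
```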
+
+#### 5.3 Fleet View
+
+- Fleet inventory:
+  - nodes (labels, region, capacity, version)
+  - services (replicas, image version, health)
+  - per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
+- Operator actions:
+  - scale replicas, restart services, cordon/drain nodes (when supported by orchestrator)
+
+UX should align with the UltraBase “Fleet” and “Topology” navigation patterns ([FleetPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/FleetPage.tsx), [TopologyPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TopologyPage.tsx)).
+
+---
+
+### 6) Production Operations (Deployments, Maintenance, Safety)
+
+#### 6.1 Deployments
+
+- Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
+  - environment-specific rollout policies
+  - canary/rolling deploy support (when orchestrator supports it)
+  - automatic health-check gates and rollback triggers
+- Track releases:
+  - “what is running where” (service version matrix)
+  - change log links and approvals
+
+#### 6.2 Maintenance Operations
+
+- Drain operations:
+  - tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in [TenantGate](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L106-L200))
+  - node drain (aggregate tenant ranges, projection consumers, runner workers)
+- Replay / rebuild operations:
+  - projection rebuild triggers (dangerous, must be guarded and audited)
+  - workflow replay controls (reset checkpoints only with explicit intent)
+
+#### 6.3 Incident Response Toolkit
+
+- “Safe switches”:
+  - per-tenant kill switch (disable commands/effects via config)
+  - global degrade modes (rate limit reductions, disable expensive features)
+- Run actions:
+  - revoke sessions at scale
+  - freeze deployments
+  - trigger drain/migrate with guided plan
+
+---
+
+### 7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards
+
+#### 7.1 Stack Requirements
+
+Adopt a production-ready stack consistent with UltraBase’s operational baseline:
+- **VictoriaMetrics** for metrics storage and Prometheus-compatible query
+- **vmagent** for scraping and remote_write
+- **Grafana** for dashboards and alert routing
+- **Loki** (+ optional **Promtail**) for logs
+- Optional **vmalert** for rule evaluation against VictoriaMetrics
+
+UltraBase’s observability design is a direct reference implementation to mirror and adapt:
+- Stack overview and conventions: [observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L1-L47)
+- Provisioned dashboards and datasources: [grafana provisioning](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning)
+
+#### 7.2 Metrics Conventions
+
+- Every service exports `/metrics` in Prometheus format.
+- Required labels:
+  - `service` (stable, low cardinality)
+  - `env` (dev/staging/prod)
+  - `tenant_id` only where safe and bounded; avoid `tenant_id` on high-frequency per-request series unless cardinality is controlled.
+- HTTP metrics must avoid unbounded `path` cardinality; prefer route templates (pattern-based paths).
+
+Tenant-aware metrics guidelines:
+- Prefer **tenant-only aggregates** for “who is hurting us?” views:
+  - `..._requests_total{tenant_id,service,status_class}` (no `path`)
+  - `..._request_duration_seconds{tenant_id,service}` (no `path`, limited bucket count)
+- Prefer **route-only aggregates** for “what endpoint is hurting us?” views:
+  - `..._requests_total{service,path,status}` (no `tenant_id`)
+- Where per-tenant and per-route both matter, implement a **top-k sampling** policy:
+  - emit `(tenant_id,path)` series only for top N tenants, or only for a fixed allowlist of routes.
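The top-k policy above could be expressed as a label-selection step that runs before any series is emitted. This is a sketch only: `top_tenants` and `series_labels` are hypothetical helpers, and how the per-tenant counts are collected is left open.

```python
from collections import Counter


def top_tenants(request_counts, n):
    """Pick the N busiest tenants from a Counter of tenant_id -> request count."""
    return {tenant for tenant, _ in request_counts.most_common(n)}


def series_labels(service, tenant_id, route, top):
    """Return the label sets to emit for one request."""
    labels = [
        {"service": service, "tenant_id": tenant_id},  # tenant-only aggregate
        {"service": service, "path": route},           # route-only aggregate
    ]
    # The combined (tenant_id, path) series exists only for top-N tenants,
    # which bounds cardinality at N * number of route templates.
    if tenant_id in top:
        labels.append({"service": service, "tenant_id": tenant_id, "path": route})
    return labels
```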
+
+#### 7.3 Required Dashboards (Production)
+
+Minimum set of dashboards (provisioned on startup):
+- **Platform — Operations overview**
+  - `up` for core services and observability stack
+  - RPS, 4xx/5xx ratio, p95/p99 latency per service
+  - saturation indicators (CPU/mem, inflight, queue depth)
+- **Platform — HTTP detail**
+  - per-service request breakdown by route template, method, status
+  - top failing paths and latency outliers
+- **Platform — Logs**
+  - Loki stream filtering by `service`, `tenant_id` (where present), and correlation identifiers
+- **Platform — Event bus / JetStream**
+  - consumer lag, redeliveries, ack latency, stream storage pressure
+- **Platform — Workers (Runner)**
+  - outbox depth, effect latency, poison-message counts, schedule backlog
+- **Platform — Storage (libmdbx)**
+  - DB size growth, write stalls, fsync latency (where exported), disk usage
+- **Platform — Cluster / Orchestrator**
+  - node health, container restarts, placement distribution by tenant range
+
+Dashboards should be modeled after UltraBase’s default set (for structure, not content), e.g. [ultrabase-operations.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-operations.json) and [ultrabase-http-detail.json](file:///Users/vlad/Developer/madapes/ultrabase/observability/grafana/provisioning/dashboards/default/ultrabase-http-detail.json).
+
+Additional production-operability dashboards (chosen and adapted):
+- **Platform — Noisy Neighbor & Tenant Health**
+  - Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
+  - Panels (minimum):
+    - Top tenants by Gateway RPS (topk of tenant-only request counters).
+    - Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
+    - Tenant error ratio (5xx and 429) per tenant.
+    - Aggregate in-flight commands by tenant (already exported: `aggregate_in_flight_commands{tenant_id}`).
+    - Projection processing error rate by tenant (from `projection_processing_errors_total{tenant_id,view_type}` aggregated per tenant).
+    - Loki logs panel with a `tenant_id` variable selector; selecting a tenant syncs RPS/latency/errors + logs.
+  - Required instrumentation:
+    - Gateway must expose **tenant-level** HTTP counters/histograms (tenant + status class + service, without `path`) in addition to existing route-level metrics.
+
+- **Platform — API Regression & Deployment**
+  - Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
+  - Panels (minimum):
+    - Error rate comparison “old vs new” by `service` and `version` (or `image_tag`) labels.
+    - Latency comparison “old vs new” (p95/p99) per service.
+    - Restart / flapping rate per service (container restarts, crash loops).
+    - Dependency latency correlation:
+      - Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
+    - Loki “new errors” panel:
+      - errors seen in the last 10m that were not present in the prior 60m window, grouped by `service`.
+    - Deployment annotations:
+      - vertical markers when Swarm service updates started/finished (via annotations or a deploy event metric).
+  - Required instrumentation:
+    - Every service exports a `*_build_info{service,version,git_sha}` gauge (value=1) or equivalent, and scrape relabeling adds `image_tag` where possible.
+    - Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes to Grafana annotations).
+
+- **Platform — Storage & Event Bus Bottlenecks**
+  - Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
+  - Panels (minimum):
+    - NATS/JetStream health:
+      - stream storage pressure, publish/ack latency, consumer lag, redeliveries.
+    - Projection lag and throughput:
+      - events processed rate, processing duration, error rate.
+    - Aggregate write-path pressure:
+      - command duration, version conflicts, in-flight commands, tenant errors.
+    - Runner pressure:
+      - outbox dispatch failure rate, effect timeout rate, dead-letter writes.
+    - Disk saturation on nodes hosting libmdbx:
+      - disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
+    - Optional Postgres/Autobase panels only when a managed DB backs any control-plane metadata:
+      - pool saturation, replica lag, slow queries, long transactions.
+  - Required instrumentation:
+    - Ensure JetStream metrics are scraped (NATS server `/varz` exporter or native Prometheus endpoint depending on deployment).
+    - Ensure node-level disk/IO metrics are scraped (node exporter / cadvisor / equivalent).
+
+- **Platform — Infrastructure Exhaustion**
+  - Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
+  - Panels (minimum):
+    - CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
+    - OOM kill tracker across the cluster.
+    - Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
+    - vmagent health:
+      - scrape error rate, remote_write errors, queue backlog.
+    - Loki ingestion health:
+      - dropped log lines (promtail) and ingestion errors (loki).
+    - Swarm task hygiene:
+      - desired_state vs current_state mismatches, orphaned tasks, restart loops.
+  - Required instrumentation:
+    - node exporter / cadvisor (or equivalent) must be part of the production scrape plan.
+    - promtail (or alternative) must expose drop/error metrics when logs are enabled.
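The Loki "new errors" panel described for the API Regression & Deployment dashboard amounts to a set difference between two time windows. The sketch below shows that logic on plain tuples; in practice it would likely be a log query rather than application code, and the `(service, signature)` shape is an assumption.

```python
def new_errors(recent, baseline):
    """Group error signatures seen in `recent` but absent from `baseline`, by service.

    Both arguments are iterables of (service, error_signature) pairs, e.g. the
    last 10 minutes and the prior 60-minute window.
    """
    seen_before = set(baseline)
    fresh = {}
    for service, signature in recent:
        if (service, signature) not in seen_before:
            fresh.setdefault(service, set()).add(signature)
    return fresh
```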
+
+#### 7.4 Alerting Requirements
+
+Minimum alert classes:
+- Availability:
+  - service down (`up == 0`)
+  - scrape failures, vmagent remote_write errors
+- Reliability:
+  - sustained elevated 5xx ratio
+  - sustained elevated p95 latency per service
+- Backlogs:
+  - JetStream consumer lag above threshold
+  - Runner outbox depth above threshold
+- Data safety:
+  - disk usage near full (nodes hosting libmdbx)
+  - abnormal restart loops
+- Security:
+  - login anomaly detection signals (where instrumented)
+  - suspicious spike in session revocations / failed MFA
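"Sustained" in the reliability alerts above means the condition must hold across consecutive evaluations, which is what the `for` clause of a Prometheus-style alerting rule expresses. The sketch below only illustrates those semantics; the function names and sample-based input are assumptions, and real rules would live in version-controlled rule files.

```python
def error_ratio(total, errors):
    """5xx ratio for one evaluation interval; zero traffic counts as healthy."""
    return errors / total if total else 0.0


def sustained_breach(ratios, threshold, min_consecutive):
    """True when the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    tail = ratios[-min_consecutive:]
    return len(tail) == min_consecutive and all(r > threshold for r in tail)
```

A single spike does not fire, because one healthy sample inside the window resets the condition.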
+
+Alert rules can follow UltraBase’s approach of version-controlled rules in YAML (reference: [alerts/](file:///Users/vlad/Developer/madapes/ultrabase/observability/alerts)).
+
+#### 7.5 Control Plane → Observability Linking
+
+The Admin UI must embed or deep-link into observability tools:
+- per-tenant and per-service quick links to Grafana dashboards and Loki queries
+- incident triage shortcuts (operations overview → HTTP detail → logs)
+
+This mirrors UltraBase’s “observability links JSON” concept ([observability/README.md](file:///Users/vlad/Developer/madapes/ultrabase/observability/README.md#L65-L75)), but adapted to Cloudlysis services and dashboards.
+
+---
+
+### 8) Audit, Compliance, and Change History
+
+- Audit log is an append-only stream of security and operations events:
+  - authentication and session events
+  - RBAC changes and permission grants
+  - config/definition changes and promotions
+  - scaling, drain, and migration operations
+  - deployments and rollbacks
+- Audit log must support:
+  - search and export (bounded and access controlled)
+  - correlation to production incidents (request ids, trace ids)
+  - retention policy controls
+
+---
+
+### 9) Control Plane API Surface (Admin API)
+
+The control plane requires a stable API surface for the Admin UI and automation.
+
+Minimum API capabilities:
+- **Idempotent jobs for multi-step operations**:
+  - every mutating operation returns a `job_id`, supports polling and cancellation, and records a full execution trace in the audit log.
+- **Preflight endpoints**:
+  - validate an intended change and return a plan (and “would-change” diff) without applying it.
+- **RBAC-first access model**:
+  - all endpoints enforce permission checks at the API boundary (UI is not trusted).
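The job model above (a `job_id` per mutating operation, polling, cancellation, and an execution trace) can be sketched as a small state machine. `JobStore`, the state names, and the in-memory trail are hypothetical; where job state should actually live is listed as an open question in this PRD.

```python
import uuid
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Terminal states have no outgoing transitions.
TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED},
}


class JobStore:
    def __init__(self):
        self._states = {}
        self.audit_trail = []  # append-only list of (job_id, event)

    def submit(self, operation, actor):
        job_id = str(uuid.uuid4())
        self._states[job_id] = JobState.PENDING
        self.audit_trail.append((job_id, f"submitted {operation} by {actor}"))
        return job_id

    def poll(self, job_id):
        return self._states[job_id]

    def transition(self, job_id, new_state):
        current = self._states[job_id]
        if new_state not in TRANSITIONS.get(current, set()):
            raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
        self._states[job_id] = new_state
        self.audit_trail.append((job_id, f"{current.value} -> {new_state.value}"))

    def cancel(self, job_id):
        self.transition(job_id, JobState.CANCELLED)
```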
+
+Minimum endpoint groups:
+- `/admin/v1/iam/*` (users, roles, assignments, sessions)
+- `/admin/v1/tenants/*` (tenants lifecycle, status, metadata)
+- `/admin/v1/config/*` (versioned config, diff, apply, rollback)
+- `/admin/v1/definitions/*` (bundles, validate, promote, rollback)
+- `/admin/v1/scale/*` (placement, migrations, drain status)
+- `/admin/v1/ops/*` (deployments, rollbacks, service actions)
+- `/admin/v1/observability/*` (links, saved queries, dashboard registry)
+- `/admin/v1/audit/*` (search, export)
+
+Authentication/authorization integration:
+- Prefer using the **Gateway** as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.
+
+---
+
+### 10) Secrets and Credentials Management
+
+The control plane must treat secrets as first-class operational data with strict handling.
+
+Requirements:
+- Secret values must never be logged and must be redacted in UI/API responses.
+- Secrets must support:
+  - creation and rotation workflows
+  - scoped access (global/tenant/environment)
+  - staged rollout (write new → verify → promote → retire old)
+- Rendering rules:
+  - after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
+- Integrations:
+  - support referencing secrets from config/definitions without embedding values (secret refs).
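A sketch of secret references and redaction, assuming a `${secret:<name>}` placeholder syntax. That syntax, `resolve_refs`, and `redact` are illustrative inventions rather than an agreed format. Config documents carry only the reference, and plaintext appears only at resolution time.

```python
import re

# Hypothetical reference syntax: ${secret:path/to/name}
SECRET_REF = re.compile(r"\$\{secret:([A-Za-z0-9_./-]+)\}")


def resolve_refs(value, store):
    """Replace secret references with values from `store` (name -> plaintext)."""
    def lookup(match):
        name = match.group(1)
        if name not in store:
            raise KeyError(f"unknown secret ref: {name}")
        return store[name]
    return SECRET_REF.sub(lookup, value)


def redact(text, store):
    """Mask any known secret value before text reaches logs or API responses."""
    for plaintext in store.values():
        if plaintext:
            text = text.replace(plaintext, "[REDACTED]")
    return text
```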
+
+---
+
+### 11) Backups, Restore, and Disaster Recovery (Production Operability)
+
+The control plane must provide explicit visibility and guardrails for data safety operations.
+
+Minimum requirements:
+- **Backup status**:
+  - show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
+- **Restore readiness**:
+  - preflight checks that validate a restore plan (target environment, versions, dependencies).
+- **Operational playbooks**:
+  - link to the exact restore procedure and post-restore verification checklist.
+- **Key rotation**:
+  - explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.
+
+This should align with the platform’s existing operational patterns (e.g., the explicit “restore / post-restore checks” concept used in UltraBase observability docs).
+
+---
+
+## **Admin UI Requirements (Information Architecture + UX)**
+
+### Navigation (Minimum)
+
+Left navigation sections:
+- Overview
+- Tenants
+- Users
+- Sessions
+- Roles & Permissions
+- Config
+- Definitions
+- Scale & Placement
+- Deployments
+- Observability
+- Audit Log
+- Settings
+
+### Page Patterns (Reuse UltraBase UI)
+
+Adopt the UltraBase component system and page layout patterns:
+- Layout, styling tokens, UI primitives: [components/ui](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/components/ui)
+- Table + search + action dropdown pattern: [TenantsPage](file:///Users/vlad/Developer/madapes/ultrabase/ui/control-plane-admin/src/pages/TenantsPage.tsx#L94-L203)
+
+Required page types:
+- List pages:
+  - searchable table, bulk actions, row actions menu, status pills, empty states
+- Detail pages:
+  - header with primary actions (drain, migrate, rollback)
+  - sub-nav tabs for domain-specific views
+- Mutation flows:
+  - modal confirmation + explicit reason entry for high-impact changes
+  - toast notifications and “busy” state handling consistent with UltraBase patterns
+
+### Tenant Detail Subpages (Minimum)
+
+- Overview (status, assignments, SLO highlights)
+- Placement (per service: Aggregate/Projection/Runner)
+- Health (node readiness and dependency checks)
+- Config (effective config + diffs)
+- Definitions (applied definition bundle + version)
+- Activity (audit trail filtered to tenant)
+- Observability (embedded links / panels)
+
+---
+
+## **Non-Functional Requirements**
+
+- **Security**:
+  - strict RBAC everywhere; deny-by-default
+  - audit every privileged operation
+  - step-up for sensitive actions
+  - CSRF protection for browser sessions
+  - safe secret handling (no secret values rendered after creation unless explicitly permitted)
+  - allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
+- **Reliability**:
+  - control plane operations are idempotent and resilient to partial failures
+  - operations have clear “current state” and do not rely on UI assumptions
+- **Performance**:
+  - list pages paginate and filter server-side for large fleets
+  - dashboards load with bounded query costs and controlled label cardinality
+- **Operability**:
+  - control plane itself must be observable (metrics/logs, dashboards, alerts)
+  - every operation must surface preflight checks and post-conditions
+
+---
+
+## **Open Questions / Design Constraints (To Resolve During Implementation)**
+
+- Where does the source of truth live for:
+  - users/sessions/roles (Gateway vs control-plane backing store)?
+  - configs/definitions (NATS KV vs database vs GitOps)?
+- How should production promotions be modeled:
+  - environment branches, approval workflow, and rollback semantics?
+- What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
+- Where should the job/execution state for long-running operations live:
+  - embedded in the control plane API process, durable store, or NATS workflows?