🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)

Definition:
This repository hosts the platform control plane:

  1. the Admin UI used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and
  2. the observability stack and production dashboards (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.

The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.


Context: Existing Node Repositories (../)

This PRD is derived from the currently implemented node repos in ../:

  • Aggregate: expects a control node to manage tenant placement and scaling operations, including tenant migrations (aggregate/prd.md). Tenant placement primitives and a KV helper exist (swarm.rs).
  • Gateway: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production (gateway/prd.md).
  • Projection: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) (projection/prd.md).
  • Runner: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics (tenant_placement.rs) and exposes admin endpoints for drain/reload in its PRD (runner/prd.md).

The control plane also adopts the proven Admin UI UX + component library from the UltraBase control-plane admin UI, adapting screens and information architecture to Cloudlysis needs.


Problem Statement

Operating the platform without a unified control plane forces operators to:

  • Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
  • Manage tenants, placements, and deployments without a consistent audit trail
  • Correlate production incidents across services with incomplete dashboards and unsafe levels of access

The platform needs a control plane that:

  • Centralizes admin workflows and production operability
  • Enforces least-privilege RBAC, step-up, and auditing
  • Provides a consistent, safe abstraction over tenant placement, scale, and production operations

Goals

  • Deliver an Admin UI with full admin management over:
    • users, sessions, roles/permissions
    • configuration (global + per-tenant)
    • definitions (aggregates, projections, sagas, effects, manifests)
    • scaling and production management (tenant placement, drains, migrations, deployments)
  • Package production-grade monitoring:
    • metrics via VictoriaMetrics
    • logs via Loki
    • dashboards and alerting via Grafana (+ vmalert where used)
  • Make production operations observable, auditable, and safe by default:
    • strong change logging + approvals where needed
    • idempotent operations + dry runs + rollback paths

Non-Goals

  • Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
  • Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
  • Provide an arbitrary “general API gateway” for third-party upstreams.

Primary Users

  • Platform Owner / SRE: fleet operations, incident response, production change management.
  • Platform Admin: tenant provisioning, RBAC, config/definition promotion.
  • Security Admin: access reviews, session revocation, audit trails.
  • Support / On-call: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).

Key Concepts

Control Plane Scope

  • The control plane is the authoritative interface for production operations and admin management.
  • The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.

Tenant-Aware Operations

  • All tenant-scoped operations are keyed by tenant_id (consistent with x-tenant-id usage across nodes and Gateway).
  • Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns (swarm.rs, tenant_placement.rs).

Safe Change Management

  • Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
  • All high-impact operations support:
    • validation and preflight checks
    • dry-run planning
    • idempotency keys
    • explicit rollback guidance

Control Plane Components (In This Repo)

  • Admin UI (React):
    • Reuse the UltraBase control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements (components/ui).
    • The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
  • Control Plane API (BFF / Admin API):
    • A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
    • Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
  • Observability Stack:
    • Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after the UltraBase baseline (observability/README.md).

Functional Requirements

1) Admin IAM (Users, Sessions, Roles)

1.1 Users

  • CRUD users with lifecycle states:
    • invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
  • Identity attributes:
    • email (primary), optional secondary identities
    • display name, avatar, metadata tags
    • auth methods enabled (password, OIDC providers), MFA state
  • Administrative actions:
    • invite/resend invite
    • reset password flow initiation
    • force MFA reset / revoke recovery codes
    • disable login / suspend user
    • impersonation (break-glass, audited, time-boxed)
  • Security constraints:
    • privileged actions require step-up / recent auth
    • sensitive events must be audit logged (who, what, when, why, from where)

1.2 Sessions

  • View active sessions and refresh token families:
    • by user, by tenant, by IP / geo, by device, by time range
  • Revoke capabilities:
    • revoke a single session
    • revoke all sessions for a user
    • revoke all sessions for a tenant (incident response)
  • Detection surfaces:
    • unusual session fanout (many sessions per user)
    • repeated failed logins / MFA failures
    • suspicious IP changes

1.3 Roles & Permissions (RBAC)

  • Roles are sets of permissions; assignments bind principals to roles in a scope.
  • Scopes:
    • global (platform-level)
    • tenant-scoped
    • environment-scoped (dev/staging/prod) when applicable
  • Required permission domains (minimum):
    • iam.users.* (create/update/suspend/delete)
    • iam.sessions.* (list/revoke)
    • iam.roles.* (create/update/assign)
    • tenants.* (create/update/archive)
    • configs.* (read/write/approve/apply)
    • definitions.* (read/write/validate/promote/rollback)
    • scale.* (view/apply/migrate/drain)
    • ops.* (deploy/rollback/restart/drain)
    • observability.* (view dashboards, manage alert rules)
    • audit.* (view/export)
  • Role templates:
    • owner, admin, operator, support, read-only, security-admin, break-glass
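As a sketch of how the permission domains above compose with scopes, the following illustrates deny-by-default wildcard matching ("iam.users.*" covering "iam.users.suspend") against scoped role assignments. Type and function names are assumptions for illustration, not the actual control-plane API.

```rust
// Hypothetical sketch: wildcard permission matching for domains such as
// "iam.users.*" against a concrete action like "iam.users.suspend".

/// Returns true when `granted` (possibly ending in ".*") covers `required`.
fn permission_covers(granted: &str, required: &str) -> bool {
    match granted.strip_suffix(".*") {
        // "iam.users.*" covers "iam.users.suspend" but not "iam.usersx.y".
        Some(prefix) => required
            .strip_prefix(prefix)
            .map_or(false, |rest| rest.starts_with('.')),
        None => granted == required,
    }
}

/// A role assignment binds permissions to a scope (global or tenant-scoped;
/// environment scopes would extend this enum).
#[derive(PartialEq)]
enum Scope {
    Global,
    Tenant(String),
}

struct Assignment {
    scope: Scope,
    permissions: Vec<&'static str>,
}

/// Deny-by-default: allowed only if some assignment matches both the
/// requested scope and the required permission.
fn is_allowed(assignments: &[Assignment], scope: &Scope, required: &str) -> bool {
    assignments.iter().any(|a| {
        (a.scope == Scope::Global || &a.scope == scope)
            && a.permissions.iter().any(|g| permission_covers(g, required))
    })
}
```

A global-scoped assignment covers every tenant scope, which matches the owner/admin role templates; tenant-scoped assignments stay confined to their tenant.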

2) Tenant Management

  • Create, list, and archive tenants.
  • Tenant status model:
    • provisioning, active, draining, migrating, degraded, suspended, archived
  • Tenant metadata:
    • plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
  • Tenant operational actions:
    • trigger provisioning workflows (create streams/buckets, seed configs, create placement)
    • rotate tenant secrets (as definitions/config allow)
    • pause/resume workload (soft kill switch via config flags)

Tenant pages should mirror UltraBase's “Tenant Overview + subpages” navigation patterns (example: TenantsPage and TenantOverviewPage).
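The tenant status model above can be enforced as an explicit transition table so that operational actions (drain, migrate, archive) only fire from valid states. The status names come from this PRD; the specific allowed transitions below are an illustrative assumption.

```rust
// Hypothetical sketch of a tenant lifecycle guard. Archived is terminal,
// and archiving requires going through a completed drain.

#[derive(Clone, Copy, PartialEq, Debug)]
enum TenantStatus {
    Provisioning,
    Active,
    Draining,
    Migrating,
    Degraded,
    Suspended,
    Archived,
}

use TenantStatus::*;

/// Returns true when the control plane should accept the transition.
fn can_transition(from: TenantStatus, to: TenantStatus) -> bool {
    matches!(
        (from, to),
        (Provisioning, Active)
            | (Active, Draining | Migrating | Degraded | Suspended)
            | (Draining, Archived | Active)
            | (Migrating, Active | Degraded)
            | (Degraded, Active | Draining)
            | (Suspended, Active | Draining)
    )
}
```

Encoding the table this way keeps the UI and the Admin API in agreement about which row actions to offer for a tenant in a given state.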


3) Configuration Management (Global + Per-Tenant)

3.1 Config Model

  • Config items are versioned, typed documents with:
    • scope (global / tenant / environment)
    • schema version
    • provenance (who/what wrote it)
    • effective date and rollout strategy
  • Config must support:
    • validation against a schema
    • diff view (previous vs next)
    • staged rollout (preview → apply)
    • rollback to a prior version
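One way to realize the versioning and rollback requirements above is an append-only version history per config item, where rollback appends a new version copying an older body rather than rewriting history, so provenance survives. Names and the `String` body type are assumptions for this sketch.

```rust
// Hypothetical sketch: append-only config versions with provenance-preserving
// rollback. A real store would hold a validated, typed document, not a String.

struct ConfigVersion {
    version: u64,
    body: String,   // serialized config document (validated before write)
    author: String, // provenance: who wrote it
}

struct ConfigItem {
    versions: Vec<ConfigVersion>, // append-only; last entry is effective
}

impl ConfigItem {
    fn effective(&self) -> Option<&ConfigVersion> {
        self.versions.last()
    }

    /// Rollback appends a new version with the old body; nothing is deleted,
    /// so the diff view and audit trail keep the full lineage.
    fn rollback_to(&mut self, version: u64, author: &str) -> Result<u64, String> {
        let old = self
            .versions
            .iter()
            .find(|v| v.version == version)
            .ok_or_else(|| format!("no such version: {version}"))?;
        let body = old.body.clone();
        let next = self.versions.last().map_or(1, |v| v.version + 1);
        self.versions.push(ConfigVersion {
            version: next,
            body,
            author: author.to_string(),
        });
        Ok(next)
    }
}
```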

Required config surfaces (minimum):

  • Gateway: routing/placement sources, auth policies, rate limits (see routing expectations in gateway/prd.md).
  • Aggregate / Projection / Runner:
    • shard identifiers and tenant allowlists/placement settings
    • drain/reload toggles and safety thresholds
    • resource limits / concurrency caps

4) Definition Management (System “Blueprints”)

Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.

Required capabilities:

  • Upload/edit versioned definitions with:
    • validation (schema + semantic checks)
    • “impact analysis” (which tenants/services are affected)
    • promotion workflow (dev → staging → prod)
  • Change controls:
    • approvals (role-based) for production promotion
    • emergency rollback path (one-click revert to last-known-good definition bundle)
  • Tenant overrides:
    • allow per-tenant definition overrides only when explicitly permitted by policy

The control plane must present definitions in a way that maps to the node runtime responsibilities.


5) Scale Management (Tenant Placement, Shards, Fleet)

5.1 Placement Model

  • Placement is modeled as:
    • a set of nodes/shards and their attributes (labels, capacity, region)
    • tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
  • Control plane supports both:
    • static placement (development)
    • dynamic placement (production) backed by NATS KV (consistent with existing client patterns in swarm.rs)

5.2 Tenant Migration

  • Provide guided migration planning and execution:
    • show current assignment, target assignment, and a sequenced action plan
    • execute “graceful drain → update placement → reload” style plans (see plan_graceful_tenant_migration)
  • Migration safety:
    • require explicit confirmation and reason
    • block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
    • time-box and alert if drains do not converge
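The “graceful drain → update placement → reload” plan above can be computed up front as a sequenced, inspectable value, which is what makes dry-run display possible before execution. This sketch is in the spirit of plan_graceful_tenant_migration; the step names and types are assumptions.

```rust
// Hypothetical sketch of a sequenced tenant migration plan. Computing the
// full plan first lets the UI show it as a dry run before anything mutates.

#[derive(PartialEq, Debug)]
enum MigrationStep {
    Drain { tenant: String, from_shard: String },
    UpdatePlacement { tenant: String, to_shard: String },
    Reload { shard: String },
}

fn plan_migration(tenant: &str, from: &str, to: &str) -> Vec<MigrationStep> {
    vec![
        // 1. Stop new work on the source shard and wait for inflight to finish.
        MigrationStep::Drain { tenant: tenant.into(), from_shard: from.into() },
        // 2. Flip the KV-backed assignment to the target shard.
        MigrationStep::UpdatePlacement { tenant: tenant.into(), to_shard: to.into() },
        // 3. Reload both shards so they pick up the new placement.
        MigrationStep::Reload { shard: from.into() },
        MigrationStep::Reload { shard: to.into() },
    ]
}
```

The safety checks listed above (inflight work, storage health, consumer lag) would gate the transition from the planned to the executing state of such a plan.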

5.3 Fleet View

  • Fleet inventory:
    • nodes (labels, region, capacity, version)
    • services (replicas, image version, health)
    • per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
  • Operator actions:
    • scale replicas, restart services, cordon/drain nodes (when supported by orchestrator)

UX should align with the UltraBase “Fleet” and “Topology” navigation patterns (FleetPage, TopologyPage).


6) Production Operations (Deployments, Maintenance, Safety)

6.1 Deployments

  • Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
    • environment-specific rollout policies
    • canary/rolling deploy support (when orchestrator supports it)
    • automatic health-check gates and rollback triggers
  • Track releases:
    • “what is running where” (service version matrix)
    • change log links and approvals

6.2 Maintenance Operations

  • Drain operations:
    • tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in TenantGate)
    • node drain (aggregate tenant ranges, projection consumers, runner workers)
  • Replay / rebuild operations:
    • projection rebuild triggers (dangerous, must be guarded and audited)
    • workflow replay controls (reset checkpoints only with explicit intent)

6.3 Incident Response Toolkit

  • “Safe switches”:
    • per-tenant kill switch (disable commands/effects via config)
    • global degrade modes (rate limit reductions, disable expensive features)
  • Run actions:
    • revoke sessions at scale
    • freeze deployments
    • trigger drain/migrate with guided plan

7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards

7.1 Stack Requirements

Adopt a production-ready stack consistent with the UltraBase operational baseline:

  • VictoriaMetrics for metrics storage and Prometheus-compatible query
  • vmagent for scraping and remote_write
  • Grafana for dashboards and alert routing
  • Loki (+ optional Promtail) for logs
  • Optional vmalert for rule evaluation against VictoriaMetrics

The UltraBase observability design is a direct reference implementation to mirror and adapt.

7.2 Metrics Conventions

  • Every service exports /metrics in Prometheus format.
  • Required labels:
    • service (stable, low cardinality)
    • env (dev/staging/prod)
    • tenant_id only where safe and bounded; avoid tenant_id on high-frequency per-request series unless cardinality is controlled.
  • HTTP metrics must avoid unbounded path cardinality; prefer route templates (pattern-based paths).
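A minimal sketch of the route-template rule: concrete id-like path segments collapse into placeholders so the `path` label stays bounded. The heuristic below (all-digit or long hex/UUID-like segments) is an assumption; a real Gateway would use its router's matched route template instead.

```rust
// Hypothetical sketch: normalize "/tenants/42/users/9" to
// "/tenants/:id/users/:id" so HTTP metrics avoid unbounded path cardinality.

fn route_template(path: &str) -> String {
    let normalized: Vec<String> = path
        .split('/')
        .map(|seg| {
            let id_like = !seg.is_empty()
                && (seg.chars().all(|c| c.is_ascii_digit())
                    // UUID-like: long segments of hex digits and dashes.
                    || (seg.len() >= 16
                        && seg.chars().all(|c| c.is_ascii_hexdigit() || c == '-')));
            if id_like { ":id".to_string() } else { seg.to_string() }
        })
        .collect();
    normalized.join("/")
}
```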

Tenant-aware metrics guidelines:

  • Prefer tenant-only aggregates for “who is hurting us?” views:
    • ..._requests_total{tenant_id,service,status_class} (no path)
    • ..._request_duration_seconds{tenant_id,service} (no path, limited bucket count)
  • Prefer route-only aggregates for “what endpoint is hurting us?” views:
    • ..._requests_total{service,path,status} (no tenant_id)
  • Where per-tenant and per-route both matter, implement a top-k sampling policy:
    • emit (tenant_id,path) series only for top N tenants, or only for a fixed allowlist of routes.
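The top-k policy above amounts to a labeling decision at emit time: a tenant keeps its own `tenant_id` label on (tenant, path) series only while it is among the N busiest, and is otherwise folded into a single "other" bucket. This sketch uses assumed names and a snapshot of request counts; a real exporter would use a rolling window.

```rust
// Hypothetical sketch: bound (tenant_id, path) cardinality by labeling only
// the top-k tenants individually and folding the long tail into "other".

use std::collections::HashMap;

/// Label to use for `tenant` given request counts: its own id if in the
/// top `k`, otherwise "other".
fn tenant_label(counts: &HashMap<String, u64>, tenant: &str, k: usize) -> String {
    let mut tenants: Vec<(&String, &u64)> = counts.iter().collect();
    // Sort descending by count; break ties by id for determinism.
    tenants.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
    let in_top_k = tenants.iter().take(k).any(|(id, _)| id.as_str() == tenant);
    if in_top_k { tenant.to_string() } else { "other".to_string() }
}
```

Membership in the top-k set should change slowly (e.g. recomputed per scrape interval) to avoid series churn, which itself inflates cardinality.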

7.3 Required Dashboards (Production)

Minimum set of dashboards (provisioned on startup):

  • Platform — Operations overview
    • up for core services and observability stack
    • RPS, 4xx/5xx ratio, p95/p99 latency per service
    • saturation indicators (CPU/mem, inflight, queue depth)
  • Platform — HTTP detail
    • per-service request breakdown by route template, method, status
    • top failing paths and latency outliers
  • Platform — Logs
    • Loki stream filtering by service, tenant_id (where present), and correlation identifiers
  • Platform — Event bus / JetStream
    • consumer lag, redeliveries, ack latency, stream storage pressure
  • Platform — Workers (Runner)
    • outbox depth, effect latency, poison message counts, schedules backlog
  • Platform — Storage (libmdbx)
    • DB size growth, write stalls, fsync latency (where exported), disk usage
  • Platform — Cluster / Orchestrator
    • node health, container restarts, placement distribution by tenant range

Dashboards should be modeled after the UltraBase default set (for structure, not content), e.g. ultrabase-operations.json and ultrabase-http-detail.json.

Additional production-operability dashboards (chosen and adapted):

  • Platform — Noisy Neighbor & Tenant Health

    • Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
    • Panels (minimum):
      • Top tenants by Gateway RPS (topk of tenant-only request counters).
      • Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
      • Tenant error ratio (5xx and 429) per tenant.
      • Aggregate in-flight commands by tenant (already exported: aggregate_in_flight_commands{tenant_id}).
      • Projection processing error rate by tenant (from projection_processing_errors_total{tenant_id,view_type} aggregated per tenant).
      • Loki logs panel with a tenant_id variable selector; selecting a tenant syncs RPS/latency/errors + logs.
    • Required instrumentation:
      • Gateway must expose tenant-level HTTP counters/histograms (tenant + status class + service, without path) in addition to existing route-level metrics.
  • Platform — API Regression & Deployment

    • Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
    • Panels (minimum):
      • Error rate comparison “old vs new” by service and version (or image_tag) labels.
      • Latency comparison “old vs new” (p95/p99) per service.
      • Restart / flapping rate per service (container restarts, crash loops).
      • Dependency latency correlation:
        • Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
      • Loki “new errors” panel:
        • errors seen in the last 10m that were not present in the prior 60m window, grouped by service.
      • Deployment annotations:
        • vertical markers when Swarm service updates started/finished (via annotations or a deploy event metric).
    • Required instrumentation:
      • Every service exports a *_build_info{service,version,git_sha} gauge (value=1) or equivalent, and scrape relabeling adds image_tag where possible.
      • Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes to Grafana annotations).
  • Platform — Storage & Event Bus Bottlenecks

    • Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
    • Panels (minimum):
      • NATS/JetStream health:
        • stream storage pressure, publish/ack latency, consumer lag, redeliveries.
      • Projection lag and throughput:
        • events processed rate, processing duration, error rate.
      • Aggregate write-path pressure:
        • command duration, version conflicts, in-flight commands, tenant errors.
      • Runner pressure:
        • outbox dispatch failure rate, effect timeout rate, deadletter writes.
      • Disk saturation on nodes hosting libmdbx:
        • disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
      • Optional Postgres/Autobase panels only when a managed DB backs any control-plane metadata:
        • pool saturation, replica lag, slow queries, long transactions.
    • Required instrumentation:
      • Ensure JetStream metrics are scraped (NATS server /varz exporter or native Prometheus endpoint depending on deployment).
      • Ensure node-level disk/IO metrics are scraped (node exporter / cadvisor / equivalent).
  • Platform — Infrastructure Exhaustion

    • Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
    • Panels (minimum):
      • CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
      • OOM kill tracker across the cluster.
      • Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
      • vmagent health:
        • scrape error rate, remote_write errors, queue backlog.
      • Loki ingestion health:
        • dropped log lines (promtail) and ingestion errors (loki).
      • Swarm task hygiene:
        • desired_state vs current_state mismatches, orphaned tasks, restart loops.
    • Required instrumentation:
      • node exporter / cadvisor (or equivalent) must be part of the production scrape plan.
      • promtail (or alternative) must expose drop/error metrics when logs are enabled.
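The `*_build_info{service,version,git_sha}` gauge required for the API Regression & Deployment dashboard renders as a single line of Prometheus text exposition format. The helper below is an illustrative sketch (the function name is an assumption); it assumes label values are already sanitized and need no escaping.

```rust
// Hypothetical sketch: render a build-info gauge (value = 1) in Prometheus
// text exposition format, e.g. for a /metrics endpoint.

fn build_info_line(service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "{service}_build_info{{service=\"{service}\",version=\"{version}\",git_sha=\"{git_sha}\"}} 1"
    )
}
```

Because the value is constantly 1, PromQL can join it onto rate queries (`... * on(service) group_left(version) gateway_build_info`) to split error and latency panels by version during a rollout.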

7.4 Alerting Requirements

Minimum alert classes:

  • Availability:
    • service down (up == 0)
    • scrape failures, vmagent remote_write errors
  • Reliability:
    • sustained elevated 5xx ratio
    • sustained elevated p95 latency per service
  • Backlogs:
    • JetStream consumer lag above threshold
    • Runner outbox depth above threshold
  • Data safety:
    • disk usage near full (nodes hosting libmdbx)
    • abnormal restart loops
  • Security:
    • login anomaly detection signals (where instrumented)
    • suspicious spike in session revocations / failed MFA

Alert rules can follow UltraBase's approach of version-controlled rules in YAML (reference: alerts/).
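As a minimal sketch of the availability class in vmalert/Prometheus rule-file syntax (thresholds, group names, and severity labels below are placeholder assumptions):

```yaml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target is down"
      - alert: RemoteWriteFailing
        expr: rate(vmagent_remotewrite_retries_count_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "vmagent remote_write is retrying (possible metrics loss)"
```

The remaining classes (reliability, backlogs, data safety, security) follow the same shape with their own `expr` thresholds.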

7.5 Control Plane → Observability Linking

The Admin UI must embed or deep-link into observability tools:

  • per-tenant and per-service quick links to Grafana dashboards and Loki queries
  • incident triage shortcuts (operations overview → HTTP detail → logs)

This mirrors UltraBase's “observability links JSON” concept (observability/README.md), but adapted to Cloudlysis services and dashboards.


8) Audit, Compliance, and Change History

  • Audit log is an append-only stream of security and operations events:
    • authentication and session events
    • RBAC changes and permission grants
    • config/definition changes and promotions
    • scaling, drain, and migration operations
    • deployments and rollbacks
  • Audit log must support:
    • search and export (bounded and access controlled)
    • correlation to production incidents (request ids, trace ids)
    • retention policy controls
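The append-only requirement above can be made structural: the audit store exposes append and search, and deliberately nothing else. Field names in this sketch follow the “who, what, when, why, from where” rule from the IAM section and are otherwise assumptions.

```rust
// Hypothetical sketch: an append-only audit record and store. There is
// intentionally no update or delete operation.

struct AuditEvent {
    actor: String,      // who
    action: String,     // what (e.g. "iam.sessions.revoke")
    at_unix: u64,       // when
    reason: String,     // why (explicit reason entry from the UI)
    source_ip: String,  // from where
    request_id: String, // correlation to incidents / traces
}

struct AuditLog {
    events: Vec<AuditEvent>,
}

impl AuditLog {
    fn append(&mut self, e: AuditEvent) {
        self.events.push(e);
    }

    /// Bounded, access-controlled search would layer on top of filters
    /// like this one.
    fn search_by_actor(&self, actor: &str) -> Vec<&AuditEvent> {
        self.events.iter().filter(|e| e.actor == actor).collect()
    }
}
```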

9) Control Plane API Surface (Admin API)

The control plane requires a stable API surface for the Admin UI and automation.

Minimum API capabilities:

  • Idempotent jobs for multi-step operations:
    • every mutating operation returns a job_id, supports polling and cancellation, and records a full execution trace in the audit log.
  • Preflight endpoints:
    • validate an intended change and return a plan (and “would-change” diff) without applying it.
  • RBAC-first access model:
    • all endpoints enforce permission checks at the API boundary (UI is not trusted).
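The idempotent-jobs requirement above boils down to: submitting the same idempotency key twice must return the existing `job_id` rather than start a second execution. A minimal sketch, with assumed type names and an in-memory store standing in for a durable one:

```rust
// Hypothetical sketch: a job store keyed by idempotency key. Replays return
// the original job; polling and cancellation operate on the job_id.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum JobState {
    Running,
    Succeeded,
    Cancelled,
}

struct JobStore {
    next_id: u64,
    by_key: HashMap<String, u64>, // idempotency key -> job_id
    states: HashMap<u64, JobState>,
}

impl JobStore {
    fn new() -> Self {
        JobStore { next_id: 1, by_key: HashMap::new(), states: HashMap::new() }
    }

    /// Returns (job_id, created); created is false on an idempotent replay.
    fn submit(&mut self, idempotency_key: &str) -> (u64, bool) {
        if let Some(&id) = self.by_key.get(idempotency_key) {
            return (id, false);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_key.insert(idempotency_key.to_string(), id);
        self.states.insert(id, JobState::Running);
        (id, true)
    }

    fn poll(&self, job_id: u64) -> Option<JobState> {
        self.states.get(&job_id).copied()
    }

    fn cancel(&mut self, job_id: u64) {
        if let Some(s) = self.states.get_mut(&job_id) {
            if *s == JobState::Running {
                *s = JobState::Cancelled;
            }
        }
    }
}
```

In the real API each state change would also append an entry to the job's execution trace in the audit log.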

Minimum endpoint groups:

  • /admin/v1/iam/* (users, roles, assignments, sessions)
  • /admin/v1/tenants/* (tenants lifecycle, status, metadata)
  • /admin/v1/config/* (versioned config, diff, apply, rollback)
  • /admin/v1/definitions/* (bundles, validate, promote, rollback)
  • /admin/v1/scale/* (placement, migrations, drain status)
  • /admin/v1/ops/* (deployments, rollbacks, service actions)
  • /admin/v1/observability/* (links, saved queries, dashboard registry)
  • /admin/v1/audit/* (search, export)

Authentication/authorization integration:

  • Prefer using the Gateway as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.

10) Secrets and Credentials Management

The control plane must treat secrets as first-class operational data with strict handling.

Requirements:

  • Secret values must never be logged and must be redacted in UI/API responses.
  • Secrets must support:
    • creation and rotation workflows
    • scoped access (global/tenant/environment)
    • staged rollout (write new → verify → promote → retire old)
  • Rendering rules:
    • after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
  • Integrations:
    • support referencing secrets from config/definitions without embedding values (secret refs).
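A sketch of the secret-ref mechanism: config documents carry an indirect reference instead of plaintext, and resolution happens only at apply time inside the control plane. The `secretref://` scheme and function names below are assumptions for illustration.

```rust
// Hypothetical sketch: resolve config values, passing plain values through
// and looking up "secretref://" URIs in the secret store. Plaintext is
// returned only to the applying service, never rendered to the UI/API.

use std::collections::HashMap;

fn resolve(value: &str, secrets: &HashMap<String, String>) -> Result<String, String> {
    match value.strip_prefix("secretref://") {
        Some(name) => secrets
            .get(name)
            .cloned()
            .ok_or_else(|| format!("unknown secret ref: {name}")),
        None => Ok(value.to_string()),
    }
}
```

Because the config document only ever stores the ref, diff views, audit logs, and exports stay free of secret material by construction.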

11) Backups, Restore, and Disaster Recovery (Production Operability)

The control plane must provide explicit visibility and guardrails for data safety operations.

Minimum requirements:

  • Backup status:
    • show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
  • Restore readiness:
    • preflight checks that validate a restore plan (target environment, versions, dependencies).
  • Operational playbooks:
    • link to the exact restore procedure and post-restore verification checklist.
  • Key rotation:
    • explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.

This should align with the platform's existing operational patterns (e.g., the explicit “restore / post-restore checks” concept used in UltraBase observability docs).


Admin UI Requirements (Information Architecture + UX)

Navigation (Minimum)

Left navigation sections:

  • Overview
  • Tenants
  • Users
  • Sessions
  • Roles & Permissions
  • Config
  • Definitions
  • Scale & Placement
  • Deployments
  • Observability
  • Audit Log
  • Settings

Page Patterns (Reuse UltraBase UI)

Adopt the UltraBase component system and page layout patterns.

Required page types:

  • List pages:
    • searchable table, bulk actions, row actions menu, status pills, empty states
  • Detail pages:
    • header with primary actions (drain, migrate, rollback)
    • sub-nav tabs for domain-specific views
  • Mutation flows:
    • modal confirmation + explicit reason entry for high-impact changes
    • toast notifications and “busy” state handling consistent with UltraBase patterns

Tenant Detail Subpages (Minimum)

  • Overview (status, assignments, SLO highlights)
  • Placement (per service: Aggregate/Projection/Runner)
  • Health (node readiness and dependency checks)
  • Config (effective config + diffs)
  • Definitions (applied definition bundle + version)
  • Activity (audit trail filtered to tenant)
  • Observability (embedded links / panels)

Non-Functional Requirements

  • Security:
    • strict RBAC everywhere; deny-by-default
    • audit every privileged operation
    • step-up for sensitive actions
    • CSRF protection for browser sessions
    • safe secret handling (no secret values rendered after creation unless explicitly permitted)
    • allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
  • Reliability:
    • control plane operations are idempotent and resilient to partial failures
    • operations have clear “current state” and do not rely on UI assumptions
  • Performance:
    • list pages paginate and filter server-side for large fleets
    • dashboards load with bounded query costs and controlled label cardinality
  • Operability:
    • control plane itself must be observable (metrics/logs, dashboards, alerts)
    • every operation must surface preflight checks and post-conditions

Open Questions / Design Constraints (To Resolve During Implementation)

  • Where does the source of truth live for:
    • users/sessions/roles (Gateway vs control-plane backing store)?
    • configs/definitions (NATS KV vs database vs GitOps)?
  • How should production promotions be modeled:
    • environment branches, approval workflow, and rollback semantics?
  • What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
  • Where should the job/execution state for long-running operations live:
    • embedded in the control plane API process, durable store, or NATS workflows?