🧱 Component: Control Plane (Admin UI + Monitoring + Production Ops)

Definition:
This repository hosts the platform control plane:

  1. the Admin UI used by platform operators and admins to manage users/roles/sessions, tenants, configuration, definitions, and production scaling; and
  2. the observability stack and production dashboards (VictoriaMetrics + Loki + Grafana, plus alerting/scrape config) required to operate the platform in production.

The control plane is the “single pane of glass” and the “safe hands” layer: it does not replace node runtime logic; it coordinates existing node capabilities and exposes them with strict RBAC, auditability, and operational guardrails.


Context: Existing Node Repositories (../)

This PRD is derived from the currently implemented node repos in ../:

  • Aggregate: expects a control node to manage tenant placement and scaling operations, including tenant migrations (aggregate/prd.md). Tenant placement primitives and a KV helper exist (swarm.rs).
  • Gateway: provides the platform ingress, authn/authz, and tenant-aware routing; it explicitly expects NATS KV-based tenant placement and hot reload in production (gateway/prd.md).
  • Projection: consumes events, stores read models, and expects tenant-scoped query isolation and operational monitoring (consumer lag, checkpoints) (projection/prd.md).
  • Runner: executes sagas + effects, includes tenant assignment watching via NATS KV and tenant draining semantics (tenant_placement.rs) and exposes admin endpoints for drain/reload in its PRD (runner/prd.md).

The control plane also adopts the proven Admin UI UX + component library from the UltraBase control-plane admin UI, adapting screens and information architecture to Cloudlysis needs.


Problem Statement

Operating the platform without a unified control plane forces operators to:

  • Use ad-hoc scripts, direct cluster access, or service-local admin endpoints
  • Manage tenants, placements, and deployments without a consistent audit trail
  • Correlate production incidents across services with incomplete dashboards and unsafe levels of access

The platform needs a control plane that:

  • Centralizes admin workflows and production operability
  • Enforces least-privilege RBAC, step-up, and auditing
  • Provides a consistent, safe abstraction over tenant placement, scale, and production operations

Goals

  • Deliver an Admin UI with full admin management over:
    • users, sessions, roles/permissions
    • configuration (global + per-tenant)
    • definitions (aggregates, projections, sagas, effects, manifests)
    • scaling and production management (tenant placement, drains, migrations, deployments)
  • Package production-grade monitoring:
    • metrics via VictoriaMetrics
    • logs via Loki
    • dashboards and alerting via Grafana (+ vmalert where used)
  • Make production operations observable, auditable, and safe by default:
    • strong change logging + approvals where needed
    • idempotent operations + dry runs + rollback paths

Non-Goals

  • Re-implement node business logic (Aggregate / Projection / Runner) or platform ingress (Gateway).
  • Replace NATS JetStream, libmdbx storage responsibilities, or per-service runtime concerns.
  • Provide an arbitrary “general API gateway” for third-party upstreams.

Primary Users

  • Platform Owner / SRE: fleet operations, incident response, production change management.
  • Platform Admin: tenant provisioning, RBAC, config/definition promotion.
  • Security Admin: access reviews, session revocation, audit trails.
  • Support / On-call: triage dashboards, logs/metrics correlation, safe mitigations (drain, disable, rollback).

Key Concepts

Control Plane Scope

  • The control plane is the authoritative interface for production operations and admin management.
  • The control plane uses node APIs, the Gateway, and NATS KV as its operational substrate rather than bypassing them.

Tenant-Aware Operations

  • All tenant-scoped operations are keyed by tenant_id (consistent with x-tenant-id usage across nodes and Gateway).
  • Tenant placement is treated as a first-class “control plane state” (NATS KV-backed in production; file/static in development), consistent with existing code patterns (swarm.rs, tenant_placement.rs).

Safe Change Management

  • Mutating actions require explicit intent, are recorded in audit logs, and should be reversible where possible.
  • All high-impact operations support:
    • validation and preflight checks
    • dry-run planning
    • idempotency keys
    • explicit rollback guidance

Control Plane Components (In This Repo)

  • Admin UI (React):
    • Reuse the UltraBase control-plane admin UI component system and interaction patterns, adapting routes and pages to Cloudlysis requirements (components/ui).
    • The UI should prefer “table + detail pages + action dropdown + modals” patterns to keep ops workflows fast and consistent.
  • Control Plane API (BFF / Admin API):
    • A thin API layer that enforces RBAC, writes audit logs, and orchestrates multi-step operations (drain/migrate/rollout) as idempotent jobs.
    • Integrates with the Gateway for platform authn/authz and with node admin endpoints for operational actions.
  • Observability Stack:
    • Version-controlled provisioning for Grafana dashboards/datasources, scrape configs for vmagent, and alert rules (vmalert or Grafana Alerting), modeled after the UltraBase baseline (observability/README.md).

Functional Requirements

1) Admin IAM (Users, Sessions, Roles)

1.1 Users

  • CRUD users with lifecycle states:
    • invited (pending acceptance), active, suspended, disabled, deleted (tombstoned)
  • Identity attributes:
    • email (primary), optional secondary identities
    • display name, avatar, metadata tags
    • auth methods enabled (password, OIDC providers), MFA state
  • Administrative actions:
    • invite/resend invite
    • reset password flow initiation
    • force MFA reset / revoke recovery codes
    • disable login / suspend user
    • impersonation (break-glass, audited, time-boxed)
  • Security constraints:
    • privileged actions require step-up / recent auth
    • sensitive events must be audit logged (who, what, when, why, from where)

1.2 Sessions

  • View active sessions and refresh token families:
    • by user, by tenant, by IP / geo, by device, by time range
  • Revoke capabilities:
    • revoke a single session
    • revoke all sessions for a user
    • revoke all sessions for a tenant (incident response)
  • Detection surfaces:
    • unusual session fanout (many sessions per user)
    • repeated failed logins / MFA failures
    • suspicious IP changes

1.3 Roles & Permissions (RBAC)

  • Roles are sets of permissions; assignments bind principals to roles in a scope.
  • Scopes:
    • global (platform-level)
    • tenant-scoped
    • environment-scoped (dev/staging/prod) when applicable
  • Required permission domains (minimum):
    • iam.users.* (create/update/suspend/delete)
    • iam.sessions.* (list/revoke)
    • iam.roles.* (create/update/assign)
    • tenants.* (create/update/archive)
    • configs.* (read/write/approve/apply)
    • definitions.* (read/write/validate/promote/rollback)
    • scale.* (view/apply/migrate/drain)
    • ops.* (deploy/rollback/restart/drain)
    • observability.* (view dashboards, manage alert rules)
    • audit.* (view/export)
  • Role templates:
    • owner, admin, operator, support, read-only, security-admin, break-glass
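As a sketch of how the permission domains above compose with scopes, the following illustrates deny-by-default wildcard matching ("iam.users.*" covering "iam.users.suspend") against scoped role assignments. Type and function names are assumptions for illustration, not the actual control-plane API.

```rust
// Hypothetical sketch: wildcard permission matching for domains such as
// "iam.users.*" against a concrete action like "iam.users.suspend".

/// Returns true when `granted` (possibly ending in ".*") covers `required`.
fn permission_covers(granted: &str, required: &str) -> bool {
    match granted.strip_suffix(".*") {
        // "iam.users.*" covers "iam.users.suspend" but not "iam.usersx.y".
        Some(prefix) => required
            .strip_prefix(prefix)
            .map_or(false, |rest| rest.starts_with('.')),
        None => granted == required,
    }
}

/// A role assignment binds permissions to a scope (global or tenant-scoped;
/// environment scopes would extend this enum).
#[derive(PartialEq)]
enum Scope {
    Global,
    Tenant(String),
}

struct Assignment {
    scope: Scope,
    permissions: Vec<&'static str>,
}

/// Deny-by-default: allowed only if some assignment matches both the
/// requested scope and the required permission.
fn is_allowed(assignments: &[Assignment], scope: &Scope, required: &str) -> bool {
    assignments.iter().any(|a| {
        (a.scope == Scope::Global || &a.scope == scope)
            && a.permissions.iter().any(|g| permission_covers(g, required))
    })
}
```

A global-scoped assignment covers every tenant scope, which matches the owner/admin role templates; tenant-scoped assignments stay confined to their tenant.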

2) Tenant Management

  • Create, list, and archive tenants.
  • Tenant status model:
    • provisioning, active, draining, migrating, degraded, suspended, archived
  • Tenant metadata:
    • plan/tier, quotas, feature flags, contact + billing metadata, environment(s)
  • Tenant operational actions:
    • trigger provisioning workflows (create streams/buckets, seed configs, create placement)
    • rotate tenant secrets (as definitions/config allow)
    • pause/resume workload (soft kill switch via config flags)

Tenant pages should mirror UltraBase's “Tenant Overview + subpages” navigation patterns (example: TenantsPage and TenantOverviewPage).
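The tenant status model above can be enforced as an explicit transition table so that operational actions (drain, migrate, archive) only fire from valid states. The status names come from this PRD; the specific allowed transitions below are an illustrative assumption.

```rust
// Hypothetical sketch of a tenant lifecycle guard. Archived is terminal,
// and archiving requires going through a completed drain.

#[derive(Clone, Copy, PartialEq, Debug)]
enum TenantStatus {
    Provisioning,
    Active,
    Draining,
    Migrating,
    Degraded,
    Suspended,
    Archived,
}

use TenantStatus::*;

/// Returns true when the control plane should accept the transition.
fn can_transition(from: TenantStatus, to: TenantStatus) -> bool {
    matches!(
        (from, to),
        (Provisioning, Active)
            | (Active, Draining | Migrating | Degraded | Suspended)
            | (Draining, Archived | Active)
            | (Migrating, Active | Degraded)
            | (Degraded, Active | Draining)
            | (Suspended, Active | Draining)
    )
}
```

Encoding the table this way keeps the UI and the Admin API in agreement about which row actions to offer for a tenant in a given state.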


3) Configuration Management (Global + Per-Tenant)

3.1 Config Model

  • Config items are versioned, typed documents with:
    • scope (global / tenant / environment)
    • schema version
    • provenance (who/what wrote it)
    • effective date and rollout strategy
  • Config must support:
    • validation against a schema
    • diff view (previous vs next)
    • staged rollout (preview → apply)
    • rollback to a prior version
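One way to realize the versioning and rollback requirements above is an append-only version history per config item, where rollback appends a new version copying an older body rather than rewriting history, so provenance survives. Names and the `String` body type are assumptions for this sketch.

```rust
// Hypothetical sketch: append-only config versions with provenance-preserving
// rollback. A real store would hold a validated, typed document, not a String.

struct ConfigVersion {
    version: u64,
    body: String,   // serialized config document (validated before write)
    author: String, // provenance: who wrote it
}

struct ConfigItem {
    versions: Vec<ConfigVersion>, // append-only; last entry is effective
}

impl ConfigItem {
    fn effective(&self) -> Option<&ConfigVersion> {
        self.versions.last()
    }

    /// Rollback appends a new version with the old body; nothing is deleted,
    /// so the diff view and audit trail keep the full lineage.
    fn rollback_to(&mut self, version: u64, author: &str) -> Result<u64, String> {
        let old = self
            .versions
            .iter()
            .find(|v| v.version == version)
            .ok_or_else(|| format!("no such version: {version}"))?;
        let body = old.body.clone();
        let next = self.versions.last().map_or(1, |v| v.version + 1);
        self.versions.push(ConfigVersion {
            version: next,
            body,
            author: author.to_string(),
        });
        Ok(next)
    }
}
```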

Required config surfaces (minimum):

  • Gateway: routing/placement sources, auth policies, rate limits (see routing expectations in gateway/prd.md).
  • Aggregate / Projection / Runner:
    • shard identifiers and tenant allowlists/placement settings
    • drain/reload toggles and safety thresholds
    • resource limits / concurrency caps

4) Definition Management (System “Blueprints”)

Definitions are the declarative “what the platform is” and “what runs” layer: aggregates, projections, sagas, effect providers, and any manifests that tie runtime-function programs to entity types.

Required capabilities:

  • Upload/edit versioned definitions with:
    • validation (schema + semantic checks)
    • “impact analysis” (which tenants/services are affected)
    • promotion workflow (dev → staging → prod)
  • Change controls:
    • approvals (role-based) for production promotion
    • emergency rollback path (one-click revert to last-known-good definition bundle)
  • Tenant overrides:
    • allow per-tenant definition overrides only when explicitly permitted by policy

The control plane must present definitions in a way that maps to the node runtime responsibilities.


5) Scale Management (Tenant Placement, Shards, Fleet)

5.1 Placement Model

  • Placement is modeled as:
    • a set of nodes/shards and their attributes (labels, capacity, region)
    • tenant → shard assignments per service kind (Aggregate, Projection, Runner, optionally Gateway when relevant)
  • Control plane supports both:
    • static placement (development)
    • dynamic placement (production) backed by NATS KV (consistent with existing client patterns in swarm.rs)

5.2 Tenant Migration

  • Provide guided migration planning and execution:
    • show current assignment, target assignment, and a sequenced action plan
    • execute “graceful drain → update placement → reload” style plans (see plan_graceful_tenant_migration)
  • Migration safety:
    • require explicit confirmation and reason
    • block if draining is unsafe (inflight work too high, storage unhealthy, consumer lag too high)
    • time-box and alert if drains do not converge
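The “graceful drain → update placement → reload” plan above can be computed up front as a sequenced, inspectable value, which is what makes dry-run display possible before execution. This sketch is in the spirit of plan_graceful_tenant_migration; the step names and types are assumptions.

```rust
// Hypothetical sketch of a sequenced tenant migration plan. Computing the
// full plan first lets the UI show it as a dry run before anything mutates.

#[derive(PartialEq, Debug)]
enum MigrationStep {
    Drain { tenant: String, from_shard: String },
    UpdatePlacement { tenant: String, to_shard: String },
    Reload { shard: String },
}

fn plan_migration(tenant: &str, from: &str, to: &str) -> Vec<MigrationStep> {
    vec![
        // 1. Stop new work on the source shard and wait for inflight to finish.
        MigrationStep::Drain { tenant: tenant.into(), from_shard: from.into() },
        // 2. Flip the KV-backed assignment to the target shard.
        MigrationStep::UpdatePlacement { tenant: tenant.into(), to_shard: to.into() },
        // 3. Reload both shards so they pick up the new placement.
        MigrationStep::Reload { shard: from.into() },
        MigrationStep::Reload { shard: to.into() },
    ]
}
```

The safety checks listed above (inflight work, storage health, consumer lag) would gate the transition from the planned to the executing state of such a plan.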

5.3 Fleet View

  • Fleet inventory:
    • nodes (labels, region, capacity, version)
    • services (replicas, image version, health)
    • per-node and per-service load indicators (CPU/mem, request rate, consumer lag)
  • Operator actions:
    • scale replicas, restart services, cordon/drain nodes (when supported by orchestrator)

UX should align with the UltraBase “Fleet” and “Topology” navigation patterns (FleetPage, TopologyPage).


6) Production Operations (Deployments, Maintenance, Safety)

6.1 Deployments

  • Manage deployable artifacts per service (Aggregate/Gateway/Projection/Runner) with:
    • environment-specific rollout policies
    • canary/rolling deploy support (when orchestrator supports it)
    • automatic health-check gates and rollback triggers
  • Track releases:
    • “what is running where” (service version matrix)
    • change log links and approvals

6.2 Maintenance Operations

  • Drain operations:
    • tenant drain (stop acquiring new work, finish inflight; required by Runner semantics in TenantGate)
    • node drain (aggregate tenant ranges, projection consumers, runner workers)
  • Replay / rebuild operations:
    • projection rebuild triggers (dangerous, must be guarded and audited)
    • workflow replay controls (reset checkpoints only with explicit intent)

6.3 Incident Response Toolkit

  • “Safe switches”:
    • per-tenant kill switch (disable commands/effects via config)
    • global degrade modes (rate limit reductions, disable expensive features)
  • Run actions:
    • revoke sessions at scale
    • freeze deployments
    • trigger drain/migrate with guided plan

7) Observability (VictoriaMetrics + Loki + Grafana) and Dashboards

7.1 Stack Requirements

Adopt a production-ready stack consistent with the UltraBase operational baseline:

  • VictoriaMetrics for metrics storage and Prometheus-compatible query
  • vmagent for scraping and remote_write
  • Grafana for dashboards and alert routing
  • Loki (+ optional Promtail) for logs
  • Optional vmalert for rule evaluation against VictoriaMetrics

The UltraBase observability design is a direct reference implementation to mirror and adapt.

7.2 Metrics Conventions

  • Every service exports /metrics in Prometheus format.
  • Required labels:
    • service (stable, low cardinality)
    • env (dev/staging/prod)
    • tenant_id only where safe and bounded; avoid tenant_id on high-frequency per-request series unless cardinality is controlled.
  • HTTP metrics must avoid unbounded path cardinality; prefer route templates (pattern-based paths).
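A minimal sketch of the route-template rule: concrete id-like path segments collapse into placeholders so the `path` label stays bounded. The heuristic below (all-digit or long hex/UUID-like segments) is an assumption; a real Gateway would use its router's matched route template instead.

```rust
// Hypothetical sketch: normalize "/tenants/42/users/9" to
// "/tenants/:id/users/:id" so HTTP metrics avoid unbounded path cardinality.

fn route_template(path: &str) -> String {
    let normalized: Vec<String> = path
        .split('/')
        .map(|seg| {
            let id_like = !seg.is_empty()
                && (seg.chars().all(|c| c.is_ascii_digit())
                    // UUID-like: long segments of hex digits and dashes.
                    || (seg.len() >= 16
                        && seg.chars().all(|c| c.is_ascii_hexdigit() || c == '-')));
            if id_like { ":id".to_string() } else { seg.to_string() }
        })
        .collect();
    normalized.join("/")
}
```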

Tenant-aware metrics guidelines:

  • Prefer tenant-only aggregates for “who is hurting us?” views:
    • ..._requests_total{tenant_id,service,status_class} (no path)
    • ..._request_duration_seconds{tenant_id,service} (no path, limited bucket count)
  • Prefer route-only aggregates for “what endpoint is hurting us?” views:
    • ..._requests_total{service,path,status} (no tenant_id)
  • Where per-tenant and per-route both matter, implement a top-k sampling policy:
    • emit (tenant_id,path) series only for top N tenants, or only for a fixed allowlist of routes.
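The top-k policy above amounts to a labeling decision at emit time: a tenant keeps its own `tenant_id` label on (tenant, path) series only while it is among the N busiest, and is otherwise folded into a single "other" bucket. This sketch uses assumed names and a snapshot of request counts; a real exporter would use a rolling window.

```rust
// Hypothetical sketch: bound (tenant_id, path) cardinality by labeling only
// the top-k tenants individually and folding the long tail into "other".

use std::collections::HashMap;

/// Label to use for `tenant` given request counts: its own id if in the
/// top `k`, otherwise "other".
fn tenant_label(counts: &HashMap<String, u64>, tenant: &str, k: usize) -> String {
    let mut tenants: Vec<(&String, &u64)> = counts.iter().collect();
    // Sort descending by count; break ties by id for determinism.
    tenants.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
    let in_top_k = tenants.iter().take(k).any(|(id, _)| id.as_str() == tenant);
    if in_top_k { tenant.to_string() } else { "other".to_string() }
}
```

Membership in the top-k set should change slowly (e.g. recomputed per scrape interval) to avoid series churn, which itself inflates cardinality.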

7.3 Required Dashboards (Production)

Minimum set of dashboards (provisioned on startup):

  • Platform — Operations overview
    • up for core services and observability stack
    • RPS, 4xx/5xx ratio, p95/p99 latency per service
    • saturation indicators (CPU/mem, inflight, queue depth)
  • Platform — HTTP detail
    • per-service request breakdown by route template, method, status
    • top failing paths and latency outliers
  • Platform — Logs
    • Loki stream filtering by service, tenant_id (where present), and correlation identifiers
  • Platform — Event bus / JetStream
    • consumer lag, redeliveries, ack latency, stream storage pressure
  • Platform — Workers (Runner)
    • outbox depth, effect latency, poison message counts, schedules backlog
  • Platform — Storage (libmdbx)
    • DB size growth, write stalls, fsync latency (where exported), disk usage
  • Platform — Cluster / Orchestrator
    • node health, container restarts, placement distribution by tenant range

Dashboards should be modeled after the UltraBase default set (for structure, not content), e.g. ultrabase-operations.json and ultrabase-http-detail.json.

Additional production-operability dashboards (chosen and adapted):

  • Platform — Noisy Neighbor & Tenant Health

    • Purpose: identify a tenant causing cluster instability (attack, runaway job, bad config) and quickly pivot all panels to that tenant.
    • Panels (minimum):
      • Top tenants by Gateway RPS (topk of tenant-only request counters).
      • Tenant latency distribution (p95/p99 per tenant) from tenant-only latency histograms.
      • Tenant error ratio (5xx and 429) per tenant.
      • Aggregate in-flight commands by tenant (already exported: aggregate_in_flight_commands{tenant_id}).
      • Projection processing error rate by tenant (from projection_processing_errors_total{tenant_id,view_type} aggregated per tenant).
      • Loki logs panel with a tenant_id variable selector; selecting a tenant syncs RPS/latency/errors + logs.
    • Required instrumentation:
      • Gateway must expose tenant-level HTTP counters/histograms (tenant + status class + service, without path) in addition to existing route-level metrics.
  • Platform — API Regression & Deployment

    • Purpose: determine whether a newly rolled out image caused regressions, and correlate changes with deployment events.
    • Panels (minimum):
      • Error rate comparison “old vs new” by service and version (or image_tag) labels.
      • Latency comparison “old vs new” (p95/p99) per service.
      • Restart / flapping rate per service (container restarts, crash loops).
      • Dependency latency correlation:
        • Gateway request duration vs Aggregate command duration vs Projection processing duration vs Runner effect latency.
      • Loki “new errors” panel:
        • errors seen in the last 10m that were not present in the prior 60m window, grouped by service.
      • Deployment annotations:
        • vertical markers when Swarm service updates started/finished (via annotations or a deploy event metric).
    • Required instrumentation:
      • Every service exports a *_build_info{service,version,git_sha} gauge (value=1) or equivalent, and scrape relabeling adds image_tag where possible.
      • Control plane emits deployment annotations/events (or pulls them from the orchestrator and writes to Grafana annotations).
  • Platform — Storage & Event Bus Bottlenecks

    • Purpose: debug timeouts when the API is “up” but underlying storage/eventing is saturated (the Cloudlysis equivalent of DB firefighting).
    • Panels (minimum):
      • NATS/JetStream health:
        • stream storage pressure, publish/ack latency, consumer lag, redeliveries.
      • Projection lag and throughput:
        • events processed rate, processing duration, error rate.
      • Aggregate write-path pressure:
        • command duration, version conflicts, in-flight commands, tenant errors.
      • Runner pressure:
        • outbox dispatch failure rate, effect timeout rate, deadletter writes.
      • Disk saturation on nodes hosting libmdbx:
        • disk usage, read/write latency, IOPS; correlate with spikes in command/query latency.
      • Optional Postgres/Autobase panels only when a managed DB backs any control-plane metadata:
        • pool saturation, replica lag, slow queries, long transactions.
    • Required instrumentation:
      • Ensure JetStream metrics are scraped (NATS server /varz exporter or native Prometheus endpoint depending on deployment).
      • Ensure node-level disk/IO metrics are scraped (node exporter / cadvisor / equivalent).
  • Platform — Infrastructure Exhaustion

    • Purpose: detect node/resource pressure earlier than raw CPU% and catch observability blind spots.
    • Panels (minimum):
      • CPU/memory pressure (PSI) per node (when available), plus load average and CPU saturation.
      • OOM kill tracker across the cluster.
      • Disk usage + IO wait/latency on data volumes (libmdbx, Loki, VictoriaMetrics).
      • vmagent health:
        • scrape error rate, remote_write errors, queue backlog.
      • Loki ingestion health:
        • dropped log lines (promtail) and ingestion errors (loki).
      • Swarm task hygiene:
        • desired_state vs current_state mismatches, orphaned tasks, restart loops.
    • Required instrumentation:
      • node exporter / cadvisor (or equivalent) must be part of the production scrape plan.
      • promtail (or alternative) must expose drop/error metrics when logs are enabled.
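The `*_build_info{service,version,git_sha}` gauge required for the API Regression & Deployment dashboard renders as a single line of Prometheus text exposition format. The helper below is an illustrative sketch (the function name is an assumption); it assumes label values are already sanitized and need no escaping.

```rust
// Hypothetical sketch: render a build-info gauge (value = 1) in Prometheus
// text exposition format, e.g. for a /metrics endpoint.

fn build_info_line(service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "{service}_build_info{{service=\"{service}\",version=\"{version}\",git_sha=\"{git_sha}\"}} 1"
    )
}
```

Because the value is constantly 1, PromQL can join it onto rate queries (`... * on(service) group_left(version) gateway_build_info`) to split error and latency panels by version during a rollout.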

7.4 Alerting Requirements

Minimum alert classes:

  • Availability:
    • service down (up == 0)
    • scrape failures, vmagent remote_write errors
  • Reliability:
    • sustained elevated 5xx ratio
    • sustained elevated p95 latency per service
  • Backlogs:
    • JetStream consumer lag above threshold
    • Runner outbox depth above threshold
  • Data safety:
    • disk usage near full (nodes hosting libmdbx)
    • abnormal restart loops
  • Security:
    • login anomaly detection signals (where instrumented)
    • suspicious spike in session revocations / failed MFA

Alert rules can follow UltraBase's approach of version-controlled rules in YAML (reference: alerts/).
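As a minimal sketch of the availability class in vmalert/Prometheus rule-file syntax (thresholds, group names, and severity labels below are placeholder assumptions):

```yaml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target is down"
      - alert: RemoteWriteFailing
        expr: rate(vmagent_remotewrite_retries_count_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "vmagent remote_write is retrying (possible metrics loss)"
```

The remaining classes (reliability, backlogs, data safety, security) follow the same shape with their own `expr` thresholds.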

7.5 Control Plane → Observability Linking

The Admin UI must embed or deep-link into observability tools:

  • per-tenant and per-service quick links to Grafana dashboards and Loki queries
  • incident triage shortcuts (operations overview → HTTP detail → logs)

This mirrors UltraBase's “observability links JSON” concept (observability/README.md), but adapted to Cloudlysis services and dashboards.


8) Audit, Compliance, and Change History

  • Audit log is an append-only stream of security and operations events:
    • authentication and session events
    • RBAC changes and permission grants
    • config/definition changes and promotions
    • scaling, drain, and migration operations
    • deployments and rollbacks
  • Audit log must support:
    • search and export (bounded and access controlled)
    • correlation to production incidents (request ids, trace ids)
    • retention policy controls
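The append-only requirement above can be made structural: the audit store exposes append and search, and deliberately nothing else. Field names in this sketch follow the “who, what, when, why, from where” rule from the IAM section and are otherwise assumptions.

```rust
// Hypothetical sketch: an append-only audit record and store. There is
// intentionally no update or delete operation.

struct AuditEvent {
    actor: String,      // who
    action: String,     // what (e.g. "iam.sessions.revoke")
    at_unix: u64,       // when
    reason: String,     // why (explicit reason entry from the UI)
    source_ip: String,  // from where
    request_id: String, // correlation to incidents / traces
}

struct AuditLog {
    events: Vec<AuditEvent>,
}

impl AuditLog {
    fn append(&mut self, e: AuditEvent) {
        self.events.push(e);
    }

    /// Bounded, access-controlled search would layer on top of filters
    /// like this one.
    fn search_by_actor(&self, actor: &str) -> Vec<&AuditEvent> {
        self.events.iter().filter(|e| e.actor == actor).collect()
    }
}
```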

9) Control Plane API Surface (Admin API)

The control plane requires a stable API surface for the Admin UI and automation.

Minimum API capabilities:

  • Idempotent jobs for multi-step operations:
    • every mutating operation returns a job_id, supports polling and cancellation, and records a full execution trace in the audit log.
  • Preflight endpoints:
    • validate an intended change and return a plan (and “would-change” diff) without applying it.
  • RBAC-first access model:
    • all endpoints enforce permission checks at the API boundary (UI is not trusted).
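The idempotent-jobs requirement above boils down to: submitting the same idempotency key twice must return the existing `job_id` rather than start a second execution. A minimal sketch, with assumed type names and an in-memory store standing in for a durable one:

```rust
// Hypothetical sketch: a job store keyed by idempotency key. Replays return
// the original job; polling and cancellation operate on the job_id.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum JobState {
    Running,
    Succeeded,
    Cancelled,
}

struct JobStore {
    next_id: u64,
    by_key: HashMap<String, u64>, // idempotency key -> job_id
    states: HashMap<u64, JobState>,
}

impl JobStore {
    fn new() -> Self {
        JobStore { next_id: 1, by_key: HashMap::new(), states: HashMap::new() }
    }

    /// Returns (job_id, created); created is false on an idempotent replay.
    fn submit(&mut self, idempotency_key: &str) -> (u64, bool) {
        if let Some(&id) = self.by_key.get(idempotency_key) {
            return (id, false);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_key.insert(idempotency_key.to_string(), id);
        self.states.insert(id, JobState::Running);
        (id, true)
    }

    fn poll(&self, job_id: u64) -> Option<JobState> {
        self.states.get(&job_id).copied()
    }

    fn cancel(&mut self, job_id: u64) {
        if let Some(s) = self.states.get_mut(&job_id) {
            if *s == JobState::Running {
                *s = JobState::Cancelled;
            }
        }
    }
}
```

In the real API each state change would also append an entry to the job's execution trace in the audit log.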

Minimum endpoint groups:

  • /admin/v1/iam/* (users, roles, assignments, sessions)
  • /admin/v1/tenants/* (tenants lifecycle, status, metadata)
  • /admin/v1/config/* (versioned config, diff, apply, rollback)
  • /admin/v1/definitions/* (bundles, validate, promote, rollback)
  • /admin/v1/scale/* (placement, migrations, drain status)
  • /admin/v1/ops/* (deployments, rollbacks, service actions)
  • /admin/v1/observability/* (links, saved queries, dashboard registry)
  • /admin/v1/audit/* (search, export)

Authentication/authorization integration:

  • Prefer using the Gateway as the system of record for admin identities and sessions, with the control plane API validating requests using Gateway-issued tokens and enforcing platform-specific permissions.

10) Secrets and Credentials Management

The control plane must treat secrets as first-class operational data with strict handling.

Requirements:

  • Secret values must never be logged and must be redacted in UI/API responses.
  • Secrets must support:
    • creation and rotation workflows
    • scoped access (global/tenant/environment)
    • staged rollout (write new → verify → promote → retire old)
  • Rendering rules:
    • after creation, secret plaintext must not be retrievable unless explicitly enabled by policy (default: write-only).
  • Integrations:
    • support referencing secrets from config/definitions without embedding values (secret refs).
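A sketch of the secret-ref mechanism: config documents carry an indirect reference instead of plaintext, and resolution happens only at apply time inside the control plane. The `secretref://` scheme and function names below are assumptions for illustration.

```rust
// Hypothetical sketch: resolve config values, passing plain values through
// and looking up "secretref://" URIs in the secret store. Plaintext is
// returned only to the applying service, never rendered to the UI/API.

use std::collections::HashMap;

fn resolve(value: &str, secrets: &HashMap<String, String>) -> Result<String, String> {
    match value.strip_prefix("secretref://") {
        Some(name) => secrets
            .get(name)
            .cloned()
            .ok_or_else(|| format!("unknown secret ref: {name}")),
        None => Ok(value.to_string()),
    }
}
```

Because the config document only ever stores the ref, diff views, audit logs, and exports stay free of secret material by construction.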

11) Backups, Restore, and Disaster Recovery (Production Operability)

The control plane must provide explicit visibility and guardrails for data safety operations.

Minimum requirements:

  • Backup status:
    • show last successful backup timestamps per critical store (metadata DB, NATS state if applicable, Grafana provisioning state as code, tenant placement/config stores).
  • Restore readiness:
    • preflight checks that validate a restore plan (target environment, versions, dependencies).
  • Operational playbooks:
    • link to the exact restore procedure and post-restore verification checklist.
  • Key rotation:
    • explicit workflows and audit logs for rotating signing keys, service credentials, and secret backends.

This should align with the platform's existing operational patterns (e.g., the explicit “restore / post-restore checks” concept used in UltraBase observability docs).


Admin UI Requirements (Information Architecture + UX)

Navigation (Minimum)

Left navigation sections:

  • Overview
  • Tenants
  • Users
  • Sessions
  • Roles & Permissions
  • Config
  • Definitions
  • Scale & Placement
  • Deployments
  • Observability
  • Audit Log
  • Settings

Page Patterns (Reuse UltraBase UI)

Adopt the UltraBase component system and page layout patterns.

Required page types:

  • List pages:
    • searchable table, bulk actions, row actions menu, status pills, empty states
  • Detail pages:
    • header with primary actions (drain, migrate, rollback)
    • sub-nav tabs for domain-specific views
  • Mutation flows:
    • modal confirmation + explicit reason entry for high-impact changes
    • toast notifications and “busy” state handling consistent with UltraBase patterns

Tenant Detail Subpages (Minimum)

  • Overview (status, assignments, SLO highlights)
  • Placement (per service: Aggregate/Projection/Runner)
  • Health (node readiness and dependency checks)
  • Config (effective config + diffs)
  • Definitions (applied definition bundle + version)
  • Activity (audit trail filtered to tenant)
  • Observability (embedded links / panels)

Non-Functional Requirements

  • Security:
    • strict RBAC everywhere; deny-by-default
    • audit every privileged operation
    • step-up for sensitive actions
    • CSRF protection for browser sessions
    • safe secret handling (no secret values rendered after creation unless explicitly permitted)
    • allowlist outbound integrations (Grafana/Loki/VM URLs, orchestration API endpoints) to prevent SSRF-style abuse
  • Reliability:
    • control plane operations are idempotent and resilient to partial failures
    • operations have clear “current state” and do not rely on UI assumptions
  • Performance:
    • list pages paginate and filter server-side for large fleets
    • dashboards load with bounded query costs and controlled label cardinality
  • Operability:
    • control plane itself must be observable (metrics/logs, dashboards, alerts)
    • every operation must surface preflight checks and post-conditions

Open Questions / Design Constraints (To Resolve During Implementation)

  • Where does the source of truth live for:
    • users/sessions/roles (Gateway vs control-plane backing store)?
    • configs/definitions (NATS KV vs database vs GitOps)?
  • How should production promotions be modeled:
    • environment branches, approval workflow, and rollback semantics?
  • What orchestrator is the production baseline (Docker Swarm per existing PRDs, or will Kubernetes be introduced)?
  • Where should the job/execution state for long-running operations live:
    • embedded in the control plane API process, durable store, or NATS workflows?