feat(billing): implement tenant subscription entitlements system (milestones 0-6)

2026-03-30 18:41:23 +03:00
parent 5992044b7e
commit 2595e7f1c5
63 changed files with 8448 additions and 321 deletions


@@ -339,3 +339,119 @@ This plan is intentionally aligned with the style and gating discipline used in
  - verify Grafana dashboards provisioned and VictoriaMetrics receives samples
- [x] **T7.3** End-to-end “control plane can see the fleet” test (requires docker)
  - UI/API can query placement + health snapshots for all services
---
## Milestone 8: Config Registry + Safe Change Management (Plan/Apply/Rollback)
**Goal:** Make configuration first-class, versioned, validated, and safely mutable from the control plane, while keeping production and development sources consistent.
### Dependencies
- Milestone 2 (Control Plane API foundation)
- Milestone 5 (safe mutations baseline)
- Milestone 7 (Swarm deployment baseline)
### Exit Criteria
- Operators can list, view, validate, and safely apply config changes, with audit trails and idempotent jobs
- Config changes have revision semantics and can be rolled back
- Gatekeeper safety checks prevent applying invalid or unsafe configs
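One way to picture the "idempotent jobs" named in the exit criteria: the server remembers each idempotency key and returns the already-recorded job when the same key is submitted again. The sketch below is an in-memory illustration only. `Job`, `JobRegistry`, and `submit` are hypothetical names, not types from the Control API, and a real implementation would persist the mapping.

```rust
use std::collections::HashMap;

/// Hypothetical job record; field names are illustrative, not the Control API's.
#[derive(Debug, Clone, PartialEq)]
struct Job {
    id: u64,
    kind: String,   // e.g. "config/apply"
    domain: String, // e.g. "routing"
}

/// In-memory registry keyed by idempotency key (assumption: the real store is durable).
#[derive(Default)]
struct JobRegistry {
    next_id: u64,
    by_key: HashMap<String, Job>,
}

impl JobRegistry {
    /// Returns the existing job for a known key; otherwise records and returns a new one.
    fn submit(&mut self, idempotency_key: &str, kind: &str, domain: &str) -> Job {
        if let Some(existing) = self.by_key.get(idempotency_key) {
            return existing.clone();
        }
        self.next_id += 1;
        let job = Job {
            id: self.next_id,
            kind: kind.to_string(),
            domain: domain.to_string(),
        };
        self.by_key.insert(idempotency_key.to_string(), job.clone());
        job
    }
}
```

Submitting twice with the same key yields the same job, so a client retry after a timeout does not trigger a second apply.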
### Tasks
- [x] **8.1** Inventory and classify configuration surfaces (platform-wide)
  - classify as: static boot config (env/secrets), dynamic runtime config (KV), large immutable artifacts (S3/docs)
  - map current sources per domain:
    - Gateway routing config (`config/routing/dev.json` / production KV)
    - Placement config (`config/placement/dev.json` / production KV)
    - Runner definitions (effects/sagas) (documents/S3) and activation config (KV)
    - Observability provisioning (Swarm configs + repo-managed assets)
    - Control plane feature flags (KV)
- [~] **8.2** Define a Config Registry contract in the Control API
  - **Implemented (initial)**:
    - config identity: `{domain}` (routing|placement)
    - metadata: `revision` (KV revision when using NATS) and `source` info (file vs NATS)
    - storage policy per config: `source=dev_file | nats_kv`
  - **Still needed**:
    - richer config identity `{domain, name, scope}` and richer metadata (`updated_at`, `updated_by`, `sha256`)
    - history API for KV-backed configs
- [x] **8.3** Implement config storage abstraction (dev + prod)
  - dev: file-backed, atomic write (tmp + rename), hot-reload where applicable
  - prod: NATS KV for dynamic configs (revisioned values + watch streams)
  - consistent error model: decode/validate/source errors are distinguishable and safe
- [x] **8.4** Add read-only config APIs
  - `GET /admin/v1/config` lists domains
  - `GET /admin/v1/config/{domain}` fetches current value + revision + source
  - (history not implemented yet)
- [~] **8.5** Add validate/plan/apply/rollback mutation workflows as jobs
  - **Implemented**:
    - `POST /admin/v1/jobs/config/validate` (job, idempotency key required)
    - `POST /admin/v1/jobs/config/apply` (job, idempotency key required, backup + apply)
    - `POST /admin/v1/jobs/config/rollback` (job, idempotency key required, restore last backup)
    - per-domain locking to avoid concurrent config mutations
  - **Still needed**:
    - `POST /admin/v1/plan/config/apply` deterministic plan (diff + impacted services)
    - richer post-conditions (routing resolution sampling, fleet consistency checks, etc.)
- [~] **8.6** Implement initial config domains end-to-end
  - **Gateway routing config**:
    - implemented: schema validation via JSON decode
    - still needed: semantic validation (tenant entries/shard directories/endpoints URL parsing) + sampled routing verification
  - **Placement config**:
    - implemented: schema validation via JSON decode
    - still needed: semantic validation (targets non-empty, etc.) + fleet snapshot consistency checks
- [x] **8.7** Implement Admin UI “Config” page for safe operations
  - list + view configs with revision/sha/audit linkage
  - editor for JSON (and YAML when supported by the domain)
  - validate button (server-side) and apply/rollback flows as jobs with reason required
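The dev-side "atomic write (tmp + rename)" from task 8.3 can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: `write_atomic` and the `.tmp` suffix are assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `bytes` to `path` so readers see either the old or the new content,
/// never a half-written file. `write_atomic` is a hypothetical helper name.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build a sibling temp path ("<path>.tmp") so the rename stays on one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    {
        let mut file = File::create(&tmp_path)?;
        file.write_all(bytes)?;
        // Flush file contents and metadata to disk before the rename publishes them.
        file.sync_all()?;
    }

    // Replace the target in a single step.
    fs::rename(&tmp_path, path)
}
```

A hot-reloading reader that re-opens the file on change then never observes a truncated JSON document, which keeps decode errors meaningful rather than transient.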
### Tests
- [x] **T8.1** Unit tests: config decode/encode stability for each config domain
  - routing/placement decode is enforced by server-side validate job (schema-level)
- [ ] **T8.2** Unit tests: validation rejects unsafe configs with stable error codes/messages
- [ ] **T8.3** Unit tests: plan generation is deterministic for same inputs
- [x] **T8.4** Integration tests (env-gated):
  - NATS KV config apply + rollback via Control API (requires `CONTROL_TEST_NATS=1` + `CONTROL_TEST_NATS_URL`)
  - (Gateway route-resolution E2E verification still pending)
- [x] **T8.5** UI tests: config page renders, validate/apply/rollback flows navigate to job progress
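The env gating in T8.4 could look roughly like the helper below. Only the two variable names come from the plan; `gated_url` and `nats_test_url` are hypothetical names, and treating an empty URL as "gate off" is an assumption of this sketch.

```rust
use std::env;

/// Pure gate logic, kept separate so it can be checked without touching the
/// process environment. The gate is on only when the flag is exactly "1"
/// and a non-empty URL is provided.
fn gated_url(flag: Option<&str>, url: Option<&str>) -> Option<String> {
    match (flag, url) {
        (Some("1"), Some(u)) if !u.is_empty() => Some(u.to_string()),
        _ => None,
    }
}

/// Reads `CONTROL_TEST_NATS` and `CONTROL_TEST_NATS_URL` and applies the gate.
fn nats_test_url() -> Option<String> {
    let flag = env::var("CONTROL_TEST_NATS").ok();
    let url = env::var("CONTROL_TEST_NATS_URL").ok();
    gated_url(flag.as_deref(), url.as_deref())
}
```

An integration test would call `nats_test_url()` first and return early when it yields `None`, so a plain test run without NATS neither fails nor needs network access.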
---
## Milestone 9: Control Node Management (Inventory, Drift, and Safer Ops)
**Goal:** Improve how the control plane understands and manages the live control node and platform state: node inventory, config drift detection, and safer operational guardrails.
### Dependencies
- Milestone 7 (Swarm deployment baseline)
- Milestone 8 (config registry + safe change management)
### Exit Criteria
- Control plane provides a reliable “what is running vs what should be running” view
- Config drift is detectable and actionable
- Core operational actions are guarded by preflight checks and produce audit trails
### Tasks
- [x] **9.1** Define a “desired vs observed” model for platform state
  - desired: Swarm stacks + config registry revisions
  - observed: live service/task state + effective runtime configs
  - drift categories: missing, extra, version mismatch, config mismatch, unhealthy
- [~] **9.2** Improve Swarm observation fidelity
  - implemented (initial): docker-cli-backed Swarm observation (`CONTROL_SWARM_MODE=docker`)
  - still needed: direct Docker API client (avoid shelling out), richer normalization, and wiring into production stacks
  - keep file source as a dev fallback for deterministic tests
  - normalize service identity: `{service, image_tag, git_sha, updated_at}`
- [x] **9.3** Add drift APIs and UI views
  - `GET /admin/v1/platform/drift` returns drift summary + actionable items
  - UI: “Platform Drift” page with filters and links to remediation jobs
- [ ] **9.4** Add safer operational guardrails as reusable checks
  - preflight checks for:
    - service unhealthy / crashloop
    - tenant migration safety thresholds (lag/inflight)
    - config apply safety (impact radius, sampled verify)
  - consistent failure modes: clear reason + audit entry, no partial side effects
- [ ] **9.5** Add operational playbooks as executable checks
  - post-deploy verification suite callable as an idempotent job
  - rollback verification suite callable as an idempotent job
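The “desired vs observed” model and the five drift categories from task 9.1 can be illustrated with a small classifier. This is a sketch under stated assumptions: `ServiceState`, `Drift`, and `classify` are hypothetical names, and comparing `image_tag` and a numeric `config_revision` is a simplification of whatever the control plane actually compares.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-service snapshot; `healthy` is only meaningful on the observed side.
#[derive(Debug, Clone, PartialEq)]
struct ServiceState {
    image_tag: String,
    config_revision: u64,
    healthy: bool,
}

/// The drift categories listed in task 9.1.
#[derive(Debug, PartialEq)]
enum Drift {
    Missing,
    Extra,
    VersionMismatch,
    ConfigMismatch,
    Unhealthy,
}

/// Compares desired and observed state and returns one entry per detected drift.
/// BTreeMap keeps the output ordered by service name, so results are deterministic.
fn classify(
    desired: &BTreeMap<String, ServiceState>,
    observed: &BTreeMap<String, ServiceState>,
) -> Vec<(String, Drift)> {
    let mut out = Vec::new();
    for (name, want) in desired {
        match observed.get(name) {
            None => out.push((name.clone(), Drift::Missing)),
            Some(got) => {
                if got.image_tag != want.image_tag {
                    out.push((name.clone(), Drift::VersionMismatch));
                }
                if got.config_revision != want.config_revision {
                    out.push((name.clone(), Drift::ConfigMismatch));
                }
                if !got.healthy {
                    out.push((name.clone(), Drift::Unhealthy));
                }
            }
        }
    }
    // Anything running that nobody asked for is reported after the desired services.
    for name in observed.keys() {
        if !desired.contains_key(name) {
            out.push((name.clone(), Drift::Extra));
        }
    }
    out
}
```

A service can carry several drift entries at once (for example, a stale image that is also unhealthy), which matches the idea of a drift summary with separate actionable items.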
### Tests
- [x] **T9.1** Unit tests: drift classification for synthetic desired/observed fixtures
- [x] **T9.2** Integration tests (docker-gated): drift view detects intentional mismatches in a local Swarm
  - requires `CONTROL_TEST_DOCKER=1` and an active local Swarm node
- [x] **T9.3** UI tests: drift page renders in route smoke test