cloudlysis/plans/SUBSCRIPTIONS_PLAN.md
Vlad Durnea 2595e7f1c5
feat(billing): implement tenant subscription entitlements system (milestones 0-6)
2026-03-30 18:41:23 +03:00

# Tenant Subscriptions Plan (1 Tenant = 1 Subscription)
## Principles
- Tenant-based billing is built-in and enforced consistently:
  - Exactly one “primary” subscription per tenant.
  - Subscription state is authoritative for entitlements.
- Provider-agnostic core with a single “billing provider” adapter:
  - Stripe or Polar can be plugged in without rewriting the rest of the platform.
- Tasks are prioritized by ordering:
  - Within each milestone, tasks are listed top-to-bottom in priority order.
- Each milestone is stop-the-line gated:
  - All tasks completed
  - All milestone tests pass
  - Workspace verification commands pass
- Webhooks are treated as untrusted input:
  - Verified signatures
  - Idempotent processing
  - No secrets are ever committed or logged
- Fluent development progression:
  - Start with local-only, file-backed state + mocked provider
  - Add real provider sandbox integration behind env-gated tests
  - Add UI self-service once the state machine is stable
  - Enforce entitlements only after billing state is reliable
## Goals
- Allow a tenant admin to self-serve billing:
  - Start a subscription (checkout)
  - Manage subscription and payment method (customer portal)
  - View current plan and billing status
- Support Stripe or Polar as the billing backend.
- Provide a strict, test-gated integration that is safe to deploy incrementally.
- Keep API routes consistent with existing Control API conventions:
  - Tenant-scoped routes are under `/admin/v1/tenants/{tenant_id}/...` and require auth + tenant header.
  - Provider webhooks are unauthenticated but signature-verified.
## Non-Goals (Initial)
- Multiple subscriptions per tenant.
- Per-seat billing.
- Multiple concurrent plans per tenant.
- Usage-based metered billing (can be added later as a separate plan).
## Definitions
### Tenant
A logical customer boundary identified by `tenant_id` (UUID) and carried via the tenant header already used by Control API endpoints.
### Tenant Admin (Actor)
An authenticated principal with permission to manage billing for a tenant:
- Read: requires `control:read`
- Mutate (checkout/portal): requires `control:write`
### Subscription
The provider subscription object mapped 1:1 to a tenant, with a local cached state:
- `status`: `trialing | active | past_due | paused | canceled | incomplete`
- `plan`: internal plan identifier (maps to provider price/product)
- `current_period_end` / `cancel_at_period_end`
### Entitlements
An internal set of feature gates derived from the subscription plan and status:
- Examples: max deployments, max runners, S3 docs enabled, support tier, etc.
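The derivation described above can be kept as a pure function of plan and status, which makes it straightforward to unit test. The following is a minimal sketch only: the `Free` and `Pro` plan names, the `Entitlements` field names, and every numeric limit are assumptions chosen for illustration, not values taken from this plan.

```rust
/// Internal subscription status, mirroring the `status` values listed under "Subscription".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Hypothetical plan identifiers; the real set comes from the plan mapping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    Free,
    Pro,
}

/// Feature gates. Field names and limits are illustrative assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entitlements {
    pub max_deployments: u32,
    pub max_runners: u32,
    pub s3_docs_enabled: bool,
}

/// Derive entitlements from (plan, status) with no I/O, so the result is deterministic.
pub fn derive_entitlements(plan: Plan, status: SubscriptionStatus) -> Entitlements {
    // Assumption: only trialing/active subscriptions unlock the paid plan;
    // every other status falls back to the free tier.
    let effective = match status {
        SubscriptionStatus::Trialing | SubscriptionStatus::Active => plan,
        _ => Plan::Free,
    };
    match effective {
        Plan::Free => Entitlements { max_deployments: 1, max_runners: 1, s3_docs_enabled: false },
        Plan::Pro => Entitlements { max_deployments: 20, max_runners: 10, s3_docs_enabled: true },
    }
}
```

Grace-period handling for `past_due` is deliberately left out of this sketch, since the plan defines it later as part of enforcement.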
### Billing Provider
An adapter that supplies:
- Checkout session creation
- Portal session creation
- Webhook event verification + parsing
- Optional reconciliation reads (fetch subscription/customer state)
## Configuration Contract (Control API)
### Common Settings
- `CONTROL_BILLING_PROVIDER` = `stripe | polar`
- `CONTROL_BILLING_STATE_PATH` (default `billing/dev.json`)
- `CONTROL_BILLING_SELF_URL` (default `CONTROL_SELF_URL`, used for return URLs)
- `CONTROL_BILLING_ENFORCEMENT` = `0 | 1` (default `0`, gates entitlement enforcement)
- `CONTROL_BILLING_WEBHOOK_PUBLIC_URL` (optional; if unset, derive from `CONTROL_BILLING_SELF_URL`)
- `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS` (comma-separated; optional safety check for return URLs)
### Stripe Settings (if provider = stripe)
- `CONTROL_STRIPE_SECRET_KEY` (secret)
- `CONTROL_STRIPE_WEBHOOK_SECRET` (secret)
- `CONTROL_STRIPE_PRICE_ID_<PLAN>` (e.g. `CONTROL_STRIPE_PRICE_ID_PRO`, env mapping per plan)
- Optional:
  - `CONTROL_STRIPE_CUSTOMER_PORTAL_CONFIGURATION_ID`
### Polar Settings (if provider = polar)
- `CONTROL_POLAR_ACCESS_TOKEN` (secret)
- `CONTROL_POLAR_WEBHOOK_SECRET` (secret; used if Polar provides a webhook signing secret)
- `CONTROL_POLAR_PRODUCT_ID_<PLAN>` or equivalent plan mapping
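As a rough illustration of strict parsing for the common settings, the sketch below reads values through a lookup closure rather than `std::env` directly, so tests stay deterministic. The `BillingConfig` struct, the `parse_billing_config` name, and the choice to return `Ok(None)` when no provider is set (treated here as "billing not configured") are assumptions, not the actual Control API types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Stripe,
    Polar,
}

/// Hypothetical parsed view of the common billing settings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BillingConfig {
    pub provider: Provider,
    pub state_path: String,
    pub enforcement: bool,
    pub allowed_return_origins: Vec<String>,
}

/// Parse settings via `get`, a lookup from variable name to value.
/// Returns `Ok(None)` when no provider is set (assumed to mean "not configured").
pub fn parse_billing_config(
    get: impl Fn(&str) -> Option<String>,
) -> Result<Option<BillingConfig>, String> {
    let provider = match get("CONTROL_BILLING_PROVIDER").as_deref() {
        None => return Ok(None),
        Some("stripe") => Provider::Stripe,
        Some("polar") => Provider::Polar,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_PROVIDER must be `stripe` or `polar`, got `{other}`"
            ))
        }
    };
    let enforcement = match get("CONTROL_BILLING_ENFORCEMENT").as_deref() {
        None | Some("0") => false,
        Some("1") => true,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_ENFORCEMENT must be `0` or `1`, got `{other}`"
            ))
        }
    };
    // Documented default for the state file path.
    let state_path =
        get("CONTROL_BILLING_STATE_PATH").unwrap_or_else(|| "billing/dev.json".to_string());
    // Comma-separated list; blank entries are dropped.
    let allowed_return_origins = get("CONTROL_BILLING_ALLOWED_RETURN_ORIGINS")
        .map(|raw| {
            raw.split(',')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .map(String::from)
                .collect()
        })
        .unwrap_or_default();
    Ok(Some(BillingConfig { provider, state_path, enforcement, allowed_return_origins }))
}
```

In a real process the closure would typically be `|name| std::env::var(name).ok()`; provider-specific secrets and plan mappings would be validated in a similar way.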
## Data Model (MVP: File-Backed, Tenant-Scoped)
Persist subscription mappings in a JSON file, similar to `PlacementStore`'s atomic write pattern, to support:
- Local development without requiring a database
- Deterministic integration tests
- Simple operational inspection

*Note: For production, this should eventually adopt the `ConfigRegistry` pattern (e.g. backed by NATS KV) to avoid reliance on persistent file storage in Docker Swarm.*
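For reference, an atomic write of the file-backed state can be sketched with the standard library alone: write a sibling temporary file, flush it, then rename it over the target. This is not `PlacementStore`'s actual code; the `write_atomic` name and the `.tmp` suffix are assumptions.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `path` with `contents` so readers see either the old or the new file,
/// never a partially written one.
pub fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }
    // Keep the temp file next to the target so the rename stays on one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);
    {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush file data to disk before the rename makes it visible.
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

Concurrent writers inside one process would still need a lock around the read-modify-write cycle; this helper only covers the final write step.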

Suggested persisted structure:
- `BillingStateFile`:
  - `revision` (uuid-based)
  - `tenants: { <tenant_id>: TenantBillingState }`
- `TenantBillingState`:
  - `provider: stripe | polar`
  - `provider_customer_id`
  - `provider_subscription_id`
  - `provider_checkout_session_id` (last initiated; optional)
  - `status`
  - `plan`
  - `current_period_end`
  - `cancel_at_period_end`
  - `processed_webhook_event_ids` (bounded set; for idempotency)
  - `updated_at`

Idempotency constraints:
- Webhook event IDs are stored per tenant, capped to a fixed size (e.g. last 256 IDs) to prevent unbounded growth.
- Updates are monotonic:
  - Compare provider event timestamps and ignore out-of-order (older) state transitions.
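The two idempotency constraints can be sketched together as one state update. This is a simplified illustration: the struct carries only the fields needed for the example, `last_event_at` is a hypothetical field used for the monotonic check, and `ApplyOutcome` is an assumed name.

```rust
use std::collections::VecDeque;

/// Cap on remembered event IDs per tenant (the "last 256 IDs" example).
const MAX_EVENT_IDS: usize = 256;

/// Reduced state for illustration; not the full persisted structure.
#[derive(Debug, Default)]
pub struct TenantBillingState {
    pub status: String,
    /// Hypothetical: timestamp (seconds) of the newest provider event applied so far.
    pub last_event_at: u64,
    /// Most recent event IDs, oldest first.
    pub processed_webhook_event_ids: VecDeque<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    Applied,
    Duplicate,
    Stale,
}

impl TenantBillingState {
    pub fn apply(&mut self, event_id: &str, event_at: u64, new_status: &str) -> ApplyOutcome {
        // Idempotency: an already-seen event ID never mutates state again.
        if self.processed_webhook_event_ids.iter().any(|id| id == event_id) {
            return ApplyOutcome::Duplicate;
        }
        // Record the ID (even for stale events) and evict the oldest when full.
        if self.processed_webhook_event_ids.len() == MAX_EVENT_IDS {
            self.processed_webhook_event_ids.pop_front();
        }
        self.processed_webhook_event_ids.push_back(event_id.to_string());
        // Monotonicity: events older than the newest applied one are ignored.
        if event_at < self.last_event_at {
            return ApplyOutcome::Stale;
        }
        self.last_event_at = event_at;
        self.status = new_status.to_string();
        ApplyOutcome::Applied
    }
}
```

A webhook handler would map both `Duplicate` and `Stale` to a `2xx` response, since neither is an error from the provider's point of view.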
## Target Architecture
### Control API (Rust)
- New billing routes:
  - `GET /admin/v1/tenants/{tenant_id}/billing` (read current billing + entitlements)
  - `POST /admin/v1/tenants/{tenant_id}/billing/checkout` (create checkout session URL)
  - `POST /admin/v1/tenants/{tenant_id}/billing/portal` (create portal session URL)
  - `POST /billing/v1/webhooks/{provider}` (provider webhook ingress; does not require auth)
- Billing policy enforcement:
  - Entitlements derived server-side
  - Per-endpoint enforcement can be introduced gradually behind a feature flag
### Control UI (Vite + React)
- New “Billing” page scoped to a tenant:
  - Current plan + status
  - “Upgrade / Subscribe” (checkout)
  - “Manage billing” (portal)
  - Clear error states when billing is not configured
## Provider Contract (Adapter Surface)
Define a small provider interface so the platform remains stable even if switching providers:
- `create_checkout_session(tenant_id, plan, return_url) -> url`
- `create_portal_session(tenant_id, return_url) -> url`
- `verify_and_parse_webhook(headers, body) -> BillingEvent`
- `apply_event(event) -> TenantBillingState mutation`
- Optional: `reconcile(tenant_id) -> TenantBillingState` (periodic correction)
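One possible shape for this adapter surface is a small Rust trait plus a deterministic mock. The sketch below is an assumption about how the interface might look: the `BillingEvent` variant, the `ProviderError` type, the `x-mock-signature` header, and the `tenant|event|status` body format exist only for the mock, and `apply_event` and `reconcile` are omitted for brevity.

```rust
/// Hypothetical provider-agnostic event; a real enum would have more variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BillingEvent {
    SubscriptionUpdated { tenant_id: String, event_id: String, status: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    InvalidSignature,
    Malformed,
}

pub trait BillingProvider {
    fn create_checkout_session(&self, tenant_id: &str, plan: &str, return_url: &str)
        -> Result<String, ProviderError>;
    fn create_portal_session(&self, tenant_id: &str, return_url: &str)
        -> Result<String, ProviderError>;
    fn verify_and_parse_webhook(&self, headers: &[(String, String)], body: &[u8])
        -> Result<BillingEvent, ProviderError>;
}

/// Mock provider for tests: deterministic URLs, no network, no real signatures.
pub struct MockProvider;

impl BillingProvider for MockProvider {
    fn create_checkout_session(&self, tenant_id: &str, plan: &str, _return_url: &str)
        -> Result<String, ProviderError> {
        Ok(format!("https://mock.invalid/checkout/{tenant_id}/{plan}"))
    }

    fn create_portal_session(&self, tenant_id: &str, _return_url: &str)
        -> Result<String, ProviderError> {
        Ok(format!("https://mock.invalid/portal/{tenant_id}"))
    }

    fn verify_and_parse_webhook(&self, headers: &[(String, String)], body: &[u8])
        -> Result<BillingEvent, ProviderError> {
        // Mock "signature": a fixed header value stands in for a real HMAC check.
        let signed = headers
            .iter()
            .any(|(k, v)| k.eq_ignore_ascii_case("x-mock-signature") && v.as_str() == "ok");
        if !signed {
            return Err(ProviderError::InvalidSignature);
        }
        // Mock body format: "tenant_id|event_id|status". Parsed only after verification.
        let text = std::str::from_utf8(body).map_err(|_| ProviderError::Malformed)?;
        let mut parts = text.split('|');
        match (parts.next(), parts.next(), parts.next(), parts.next()) {
            (Some(t), Some(e), Some(s), None) => Ok(BillingEvent::SubscriptionUpdated {
                tenant_id: t.to_string(),
                event_id: e.to_string(),
                status: s.to_string(),
            }),
            _ => Err(ProviderError::Malformed),
        }
    }
}
```

Real Stripe or Polar adapters would implement the same trait, so route handlers depend only on `BillingProvider` and can be tested against the mock.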

Provider mapping requirements:
- Persist tenant identity at the provider level:
  - Prefer setting `tenant_id` as provider customer metadata.
  - If customer metadata is not available, store an internal mapping from `provider_customer_id -> tenant_id`.
- Ensure subscription creation is single-flight per tenant:
  - Prevent duplicate active subscriptions by checking local state before creating new sessions.
  - Use provider idempotency keys where supported (or internal idempotency per tenant+plan).
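The single-flight rule can be expressed as a small decision function evaluated against local state before any provider call. This is a sketch under assumptions: `CheckoutDecision`, `decide_checkout`, and the `checkout:{tenant}:{plan}` key format are hypothetical, and only `active`/`trialing` block a new checkout, matching the rule used for the checkout task.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CheckoutDecision {
    /// Proceed, passing this key to the provider where idempotency keys are supported.
    Create { idempotency_key: String },
    /// The tenant already has a subscription in good standing; do not create a session.
    AlreadySubscribed,
}

/// Decide from local state whether a new checkout session may be created.
pub fn decide_checkout(
    tenant_id: &str,
    plan: &str,
    current: Option<SubscriptionStatus>,
) -> CheckoutDecision {
    match current {
        Some(SubscriptionStatus::Trialing | SubscriptionStatus::Active) => {
            CheckoutDecision::AlreadySubscribed
        }
        // Assumption: any other status (or no subscription) may start a checkout.
        _ => CheckoutDecision::Create {
            // Stable per tenant+plan, so a retried request maps to the same key.
            idempotency_key: format!("checkout:{tenant_id}:{plan}"),
        },
    }
}
```

For the check to be truly single-flight, it would need to run while holding the store's lock, so two concurrent requests for the same tenant cannot both observe "no subscription".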
## Security & Abuse Controls
- AuthZ:
  - Tenant routes require the existing tenant header to match the path tenant ID.
  - `control:read` required for viewing billing status.
  - `control:write` required for checkout and portal actions.
- Return URL safety:
  - Only allow return URLs whose origin is in `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS`.
  - Default return URL points to Control UI, derived from `CONTROL_BILLING_SELF_URL`.
- Webhook safety & observability:
  - Verify signatures before parsing payloads.
  - Enforce JSON size limits on webhook bodies.
  - Always return `2xx` for already-processed events (idempotency).
  - Never log full webhook payloads.
  - Propagate provider event IDs as `x-correlation-id` in logs and spans to integrate seamlessly with the platform's VictoriaMetrics/Loki/Tempo observability stack (as standard in `DEVELOPMENT_PLAN.md`).
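The return URL rule can be illustrated with a standard-library-only check. This is a deliberately simplified sketch, not a full URL parser: `origin_of` and `build_return_url` are hypothetical helpers, and treating an empty allowlist as "check skipped" is an assumption about what the optional setting means.

```rust
/// Extract `scheme://host[:port]` from an absolute URL (simplified; no userinfo allowed).
pub fn origin_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")?;
    let rest = &url[scheme_end + 3..];
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if authority_len == 0 || rest[..authority_len].contains('@') {
        return None;
    }
    Some(&url[..scheme_end + 3 + authority_len])
}

/// Build the return URL from the configured self URL plus an optional path,
/// rejecting anything that could resolve to a different origin.
pub fn build_return_url(
    self_url: &str,
    return_path: Option<&str>,
    allowed_origins: &[&str],
) -> Result<String, String> {
    let path = return_path.unwrap_or("/");
    // Accept only plain absolute paths: "//host" and backslashes can be
    // interpreted by browsers as pointing at another host.
    if !path.starts_with('/') || path.starts_with("//") || path.contains('\\') {
        return Err("return_path must be an absolute path".to_string());
    }
    let origin = origin_of(self_url).ok_or("self URL is not an absolute URL")?;
    // Assumption: an empty allowlist means the optional check is disabled.
    if !allowed_origins.is_empty()
        && !allowed_origins.iter().any(|a| a.eq_ignore_ascii_case(origin))
    {
        return Err(format!("origin `{origin}` is not an allowed return origin"));
    }
    Ok(format!("{}{}", self_url.trim_end_matches('/'), path))
}
```

Because the client supplies only a path and the origin always comes from server configuration, an attacker cannot steer the post-checkout redirect to an arbitrary host.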
## API Contract (MVP)
### GET /admin/v1/tenants/{tenant_id}/billing
Returns a stable shape whether billing is configured or not:
- `configured: bool`
- `provider: stripe | polar | null`
- `plan: string | null`
- `status: string | null`
- `current_period_end: string | null`
- `cancel_at_period_end: bool | null`
- `entitlements: { ... }`
### POST /admin/v1/tenants/{tenant_id}/billing/checkout
Request:
- `plan: string`
- `return_path: string` (optional; appended to `CONTROL_BILLING_SELF_URL`)

Response:
- `url: string`
### POST /admin/v1/tenants/{tenant_id}/billing/portal
Request:
- `return_path: string` (optional)

Response:
- `url: string`
### POST /billing/v1/webhooks/{provider}
Provider-defined payload; must:
- verify signature
- map to internal events
- update local billing state atomically
## Development Plan (Milestones by Dependency)
## Milestone 0: Billing Domain + Storage + Read API
### Dependencies
- None
### Goal
Ship a provider-agnostic billing domain model and a safe persistence mechanism without contacting Stripe/Polar yet.
### Tasks
- [x] Add billing domain types in Control API:
  - [x] `Plan`, `SubscriptionStatus`, `Entitlements`
  - [x] provider-agnostic `BillingEvent` enum for webhook mapping
- [x] Add `BillingStore` patterned after `PlacementStore`/`ConfigRegistry`:
  - [x] atomic write (tmp + rename) for dev file fallback
  - [x] in-process locking
  - [x] stable JSON schema + `revision`
- [x] Add `GET /admin/v1/tenants/{tenant_id}/billing`:
  - [x] permission gate: requires `control:read`
  - [x] tenant header enforcement consistent with existing routes
  - [x] returns “not configured” when no subscription exists
- [x] Add a mock billing provider for tests:
  - [x] deterministic checkout/portal URLs
  - [x] deterministic webhook events without real signatures
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] billing state read/write roundtrip (atomic update)
  - [x] entitlement derivation from `status + plan`
  - [x] tenant isolation checks for billing routes (header vs path mismatch)
  - [x] permission gates: `control:read` vs `control:write`
## Milestone 1: Checkout Flow (Create Subscription)
### Dependencies
- Milestone 0
### Goal
Allow tenant admins to initiate a subscription via the provider's hosted checkout.
### Tasks
- [x] Add provider configuration parsing and validation:
  - [x] strict env parsing with actionable errors
  - [x] plan-to-price/product mapping via env
- [x] Add `POST /admin/v1/tenants/{tenant_id}/billing/checkout`:
  - [x] permission gate: requires `control:write`
  - [x] create or reuse provider customer for the tenant
  - [x] create checkout session and return redirect URL
  - [x] include tenant identifier in provider metadata (for webhook routing)
  - [x] internal idempotency: do not create a new checkout if tenant already has an active/trialing subscription
- [x] Define return URL contract:
  - [x] checkout success/cancel landing routes in Control UI
  - [x] validate the return URL built from `return_path` against `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS`
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] config validation (missing keys, invalid mapping)
  - [x] provider request construction (return URLs, metadata)
  - [x] checkout idempotency rules per tenant
- [x] Env-gated integration tests (sandbox; auto-skip unless env vars are set):
  - [x] `CONTROL_TEST_STRIPE=1` or `CONTROL_TEST_POLAR=1` starts checkout and returns a valid URL
  - [x] tenant metadata roundtrips through the provider (where supported)
## Milestone 2: Webhook Ingestion + Subscription State Sync
### Dependencies
- Milestone 1
### Goal
Make subscription state reliable and idempotent by processing provider webhooks.
### Tasks
- [x] Add `POST /billing/v1/webhooks/{provider}` endpoint:
  - [x] signature verification
  - [x] event parsing to `BillingEvent`
  - [x] idempotency by provider event ID
  - [x] tenant mapping via provider metadata or stored `provider_customer_id`
- [x] Map provider statuses to internal `SubscriptionStatus`:
  - [x] `trialing`, `active`, `past_due`, `canceled`, etc.
- [x] Store updates in `BillingStore` and expose via `GET /admin/v1/tenants/{tenant_id}/billing`
  - [x] ensure updates are monotonic (ignore older provider event timestamps)
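A status mapping table for one provider might look like the sketch below. The input strings are Stripe's subscription `status` values; folding `unpaid` into `PastDue` and `incomplete_expired` into `Canceled` is an assumption made here for illustration, not a decision recorded in this plan. Unknown values return `None` so new provider statuses surface as explicit errors instead of being guessed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Map a Stripe subscription `status` string to the internal status.
pub fn map_stripe_status(raw: &str) -> Option<SubscriptionStatus> {
    use SubscriptionStatus::*;
    Some(match raw {
        "trialing" => Trialing,
        "active" => Active,
        "past_due" => PastDue,
        "paused" => Paused,
        "canceled" => Canceled,
        "incomplete" => Incomplete,
        // Assumption: no dedicated internal states exist for these two,
        // so they are folded into the closest internal status.
        "unpaid" => PastDue,
        "incomplete_expired" => Canceled,
        // Unknown statuses are rejected rather than mapped to a default.
        _ => return None,
    })
}
```

Keeping the table as a single `match` lets a unit test pin every row, which is one way to satisfy the "status mapping tables are stable" gate.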
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] webhook signature verification (good/bad signatures)
  - [x] idempotency behavior (same event twice does not double-apply)
  - [x] status mapping tables are stable
  - [x] out-of-order events do not regress state
- [x] Docker/local integration (optional, if a provider CLI is used; env-gated):
  - [x] `CONTROL_TEST_STRIPE_CLI=1` runs a local webhook-forward flow and verifies state update
## Milestone 3: Customer Portal (Self-Management)
### Dependencies
- Milestone 2
### Goal
Provide a “Manage billing” path for tenants to self-serve changes without operator involvement.
### Tasks
- [x] Add `POST /admin/v1/tenants/{tenant_id}/billing/portal`:
  - [x] create provider portal session and return URL
  - [x] ensure tenant ownership checks (header vs path)
  - [x] permission gate: requires `control:write`
- [ ] Add Control UI billing page:
  - [ ] show plan/status + renewal date
  - [ ] “Subscribe / Upgrade” and “Manage billing” actions
  - [ ] show “Billing not configured” when provider is disabled
### Required Tests (Gate)
- [x] Workspace verification commands
- [ ] UI unit tests (Vitest):
  - [ ] billing page renders from mocked API state
  - [ ] action buttons call the expected API endpoints
- [x] Env-gated integration tests:
  - [x] portal session URL is generated and is HTTPS
## Milestone 4: Entitlements + Enforcement (Controlled Rollout)
### Dependencies
- Milestone 2 (Milestone 3 recommended for admin UX)
### Goal
Gate selected platform capabilities by tenant subscription state while maintaining a safe rollout path.
### Tasks
- [x] Define initial entitlement set and defaults:
  - [x] choose “free/trial” behavior (read-only vs limited capability)
  - [x] define grace period behavior for `past_due`
- [x] Add enforcement points in Control API:
  - [x] middleware/helper to require entitlement per route
  - [x] first enforcement target: a low-risk, tenant-scoped “write” capability
  - [x] feature flag to disable enforcement globally during rollout
- [x] Add audit log entries for billing enforcement denials (no PII, no secrets)
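The interaction between the global feature flag, subscription status, and the `past_due` grace period can be sketched as one decision helper. The names `Decision` and `check_write_entitlement` are hypothetical, and the grace length is a parameter because this plan does not fix a value.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    /// Carries a safe, non-sensitive reason suitable for an audit entry.
    Deny(&'static str),
}

/// Decide whether a gated write may proceed.
/// `enforcement_enabled` corresponds to `CONTROL_BILLING_ENFORCEMENT`.
pub fn check_write_entitlement(
    enforcement_enabled: bool,
    status: Option<SubscriptionStatus>,
    past_due_for: Duration,
    grace: Duration,
) -> Decision {
    use SubscriptionStatus::*;
    // Rollout safety: with the flag off, billing never blocks a request.
    if !enforcement_enabled {
        return Decision::Allow;
    }
    match status {
        Some(Trialing | Active) => Decision::Allow,
        // Assumption: past-due tenants keep access until the grace period elapses.
        Some(PastDue) if past_due_for <= grace => Decision::Allow,
        Some(PastDue) => Decision::Deny("subscription past due beyond grace period"),
        _ => Decision::Deny("no active subscription"),
    }
}
```

A route helper would translate `Deny` into the chosen HTTP status and write the audit entry; keeping the decision pure makes the grace-period tests independent of the HTTP layer.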
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] entitlement checks per route return correct HTTP status
  - [x] grace period handling
- [x] Integration tests:
  - [x] a tenant without active subscription cannot perform the gated operation
  - [x] an active tenant can perform the same operation
## Milestone 5: Reconciliation + Operational Hardening
### Dependencies
- Milestone 2
### Goal
Make billing state resilient against missed webhooks and operational drift.
### Tasks
- [x] Add a reconciliation job:
  - [x] periodically fetch subscription state from provider for tenants
  - [x] correct local state and emit audit entries
- [x] Add metrics:
  - [x] webhook processing latency, verification failures, idempotency hits
  - [x] tenant count by subscription status
- [x] Add robust error handling:
  - [x] structured errors with safe messages
  - [x] no provider payloads logged verbatim
- [x] Add provider API timeout/retry policy:
  - [x] short timeouts with bounded retries
  - [x] no retries on webhook signature failures
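A bounded retry policy of this kind can be sketched as a small synchronous helper. The `ProviderError` variants, the `with_retries` name, and the exponential backoff schedule are assumptions; an async implementation would follow the same structure with a non-blocking sleep.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error classification for provider calls.
#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    Timeout,
    Unavailable,
    InvalidSignature,
    Rejected,
}

impl ProviderError {
    /// Only transient failures are worth retrying; a bad signature
    /// or a rejected request will not succeed on a second attempt.
    fn is_retryable(&self) -> bool {
        matches!(self, ProviderError::Timeout | ProviderError::Unavailable)
    }
}

/// Run `op` up to `max_attempts` times, backing off exponentially between tries.
pub fn with_retries<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, ProviderError>,
) -> Result<T, ProviderError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if err.is_retryable() && attempt < max_attempts => {
                // Delays grow as base, 2*base, 4*base, ...
                sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
            // Non-retryable error, or the attempt budget is exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

The per-request timeout itself would be configured on the HTTP client; this helper only bounds how many times a call is repeated.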
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests:
  - [x] reconciliation updates state correctly
  - [x] provider errors do not corrupt local state
## Milestone 6: Production Rollout
### Dependencies
- Milestone 3 (recommended), Milestone 4 (if enforcing)
### Goal
Deploy billing in production with safe secret handling and verifiable smoke checks.
### Tasks
- [x] Provision provider configuration (operator):
  - [x] create products/prices (Stripe) or products/plans (Polar)
  - [x] configure webhook endpoint + secret
  - [x] set up customer portal settings (Stripe) if used
- [x] Configure Swarm secrets and stack env:
  - [x] provider API keys and webhook secret stored as Swarm secrets
  - [x] `CONTROL_BILLING_PROVIDER`, `CONTROL_BILLING_STATE_PATH`
  - [x] `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS` set to production UI origins
- [x] Define rollback plan:
  - [x] disable enforcement feature flag
  - [x] keep billing read-only operational
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Production smoke (env-gated):
  - [x] create checkout session for a test tenant
  - [x] process a webhook event and verify tenant state updates
  - [x] generate a portal session URL
## Workspace Verification Commands
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build`