feat(billing): implement tenant subscription entitlements system (milestones 0-6)
Some checks failed
ci / ui (push) Failing after 28s
ci / rust (push) Failing after 2m40s
images / build-and-push (push) Failing after 19s

2026-03-30 18:41:23 +03:00
parent 5992044b7e
commit 2595e7f1c5
63 changed files with 8448 additions and 321 deletions

plans/SUBSCRIPTIONS_PLAN.md Normal file

@@ -0,0 +1,399 @@
# Tenant Subscriptions Plan (1 Tenant = 1 Subscription)
## Principles
- Tenant-based billing is built-in and enforced consistently:
- Exactly one “primary” subscription per tenant.
- Subscription state is authoritative for entitlements.
- Provider-agnostic core with a single “billing provider” adapter:
- Stripe or Polar can be plugged in without rewriting the rest of the platform.
- Tasks are prioritized by ordering:
- Within each milestone, tasks are listed top-to-bottom in priority order.
- Each milestone is stop-the-line gated:
- All tasks completed
- All milestone tests pass
- Workspace verification commands pass
- Webhooks are treated as untrusted input:
- Verified signatures
- Idempotent processing
- No secrets are ever committed or logged
- Incremental development progression:
- Start with local-only, file-backed state + mocked provider
- Add real provider sandbox integration behind env-gated tests
- Add UI self-service once the state machine is stable
- Enforce entitlements only after billing state is reliable
## Goals
- Allow a tenant admin to self-serve billing:
- Start a subscription (checkout)
- Manage subscription and payment method (customer portal)
- View current plan and billing status
- Support Stripe or Polar as the billing backend.
- Provide a strict, test-gated integration that is safe to deploy incrementally.
- Keep API routes consistent with existing Control API conventions:
- Tenant-scoped routes are under `/admin/v1/tenants/{tenant_id}/...` and require auth + tenant header.
- Provider webhooks are unauthenticated but signature-verified.
## Non-Goals (Initial)
- Multiple subscriptions per tenant.
- Per-seat billing.
- Multiple concurrent plans per tenant.
- Usage-based metered billing (can be added later as a separate plan).
## Definitions
### Tenant
A logical customer boundary identified by `tenant_id` (UUID) and carried via the tenant header already used by Control API endpoints.
### Tenant Admin (Actor)
An authenticated principal with permission to manage billing for a tenant:
- Read: requires `control:read`
- Mutate (checkout/portal): requires `control:write`
### Subscription
The provider subscription object mapped 1:1 to a tenant, with a local cached state:
- `status`: `trialing | active | past_due | paused | canceled | incomplete`
- `plan`: internal plan identifier (maps to provider price/product)
- `current_period_end` / `cancel_at_period_end`
### Entitlements
An internal set of feature gates derived from the subscription plan and status:
- Examples: max deployments, max runners, S3 docs enabled, support tier, etc.
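The derivation of entitlements from `status + plan` can be sketched as below. This is a minimal illustration, not the committed implementation: the `Plan` variants, the `Entitlements` fields, and every numeric limit are hypothetical placeholders, and the sketch assumes that only `trialing` and `active` unlock paid limits (grace handling for `past_due` is left out here).

```rust
/// Internal subscription status, mirroring the `status` values listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Hypothetical plan identifiers; the real plan set is not specified in this plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    Free,
    Pro,
}

/// Hypothetical entitlement set based on the examples above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entitlements {
    pub max_deployments: u32,
    pub max_runners: u32,
    pub s3_docs_enabled: bool,
}

/// Derive entitlements server-side. All limits are placeholder values.
pub fn derive_entitlements(plan: Plan, status: SubscriptionStatus) -> Entitlements {
    // Assumption: only trialing and active subscriptions unlock the paid plan.
    let entitled = matches!(
        status,
        SubscriptionStatus::Trialing | SubscriptionStatus::Active
    );
    match (plan, entitled) {
        (Plan::Pro, true) => Entitlements {
            max_deployments: 20,
            max_runners: 10,
            s3_docs_enabled: true,
        },
        // Everything else falls back to the free baseline.
        _ => Entitlements {
            max_deployments: 1,
            max_runners: 1,
            s3_docs_enabled: false,
        },
    }
}
```

Keeping the derivation a pure function of `(plan, status)` makes the mapping easy to cover with table-style unit tests.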
### Billing Provider
An adapter that supplies:
- Checkout session creation
- Portal session creation
- Webhook event verification + parsing
- Optional reconciliation reads (fetch subscription/customer state)
## Configuration Contract (Control API)
### Common Settings
- `CONTROL_BILLING_PROVIDER` = `stripe | polar`
- `CONTROL_BILLING_STATE_PATH` (default `billing/dev.json`)
- `CONTROL_BILLING_SELF_URL` (default `CONTROL_SELF_URL`, used for return URLs)
- `CONTROL_BILLING_ENFORCEMENT` = `0 | 1` (default `0`, gates entitlement enforcement)
- `CONTROL_BILLING_WEBHOOK_PUBLIC_URL` (optional; if unset, derive from `CONTROL_BILLING_SELF_URL`)
- `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS` (comma-separated; optional safety check for return URLs)
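A minimal sketch of strict parsing for the common settings follows. The `BillingConfig` struct, the `parse_billing_config` function, and the error strings are hypothetical names chosen for illustration. The sketch also assumes that an unset `CONTROL_BILLING_PROVIDER` means "billing not configured" rather than an error, and it does not cover `CONTROL_BILLING_WEBHOOK_PUBLIC_URL`.

```rust
/// Billing provider selected by `CONTROL_BILLING_PROVIDER`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Stripe,
    Polar,
}

/// Hypothetical in-memory form of the common settings listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BillingConfig {
    pub provider: Provider,
    pub state_path: String,
    pub self_url: String,
    pub enforcement: bool,
    pub allowed_return_origins: Vec<String>,
}

/// Parse the common settings. `get` abstracts the environment so tests stay deterministic.
/// Assumption: an unset provider yields `Ok(None)` ("billing not configured").
pub fn parse_billing_config(
    get: impl Fn(&str) -> Option<String>,
) -> Result<Option<BillingConfig>, String> {
    let provider = match get("CONTROL_BILLING_PROVIDER").as_deref() {
        None => return Ok(None),
        Some("stripe") => Provider::Stripe,
        Some("polar") => Provider::Polar,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_PROVIDER must be `stripe` or `polar`, got `{other}`"
            ))
        }
    };
    // CONTROL_BILLING_SELF_URL defaults to CONTROL_SELF_URL.
    let self_url = get("CONTROL_BILLING_SELF_URL")
        .or_else(|| get("CONTROL_SELF_URL"))
        .ok_or_else(|| "set CONTROL_BILLING_SELF_URL or CONTROL_SELF_URL".to_string())?;
    let enforcement = match get("CONTROL_BILLING_ENFORCEMENT").as_deref() {
        None | Some("0") => false,
        Some("1") => true,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_ENFORCEMENT must be `0` or `1`, got `{other}`"
            ))
        }
    };
    // Comma-separated list; blank entries are ignored.
    let allowed_return_origins = get("CONTROL_BILLING_ALLOWED_RETURN_ORIGINS")
        .unwrap_or_default()
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect();
    Ok(Some(BillingConfig {
        provider,
        state_path: get("CONTROL_BILLING_STATE_PATH")
            .unwrap_or_else(|| "billing/dev.json".to_string()),
        self_url,
        enforcement,
        allowed_return_origins,
    }))
}
```

In a real process the lookup would typically be `parse_billing_config(|k| std::env::var(k).ok())`.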
### Stripe Settings (if provider = stripe)
- `CONTROL_STRIPE_SECRET_KEY` (secret)
- `CONTROL_STRIPE_WEBHOOK_SECRET` (secret)
- `CONTROL_STRIPE_PRICE_ID_<PLAN>` (e.g. `CONTROL_STRIPE_PRICE_ID_PRO`, env mapping per plan)
- Optional:
- `CONTROL_STRIPE_CUSTOMER_PORTAL_CONFIGURATION_ID`
### Polar Settings (if provider = polar)
- `CONTROL_POLAR_ACCESS_TOKEN` (secret)
- `CONTROL_POLAR_WEBHOOK_SECRET` (secret, if Polar provides a webhook signing secret)
- `CONTROL_POLAR_PRODUCT_ID_<PLAN>` or equivalent plan mapping
## Data Model (MVP: File-Backed, Tenant-Scoped)
Persist subscription mappings in a JSON file, similar to `PlacementStore`'s atomic write pattern, to support:
- Local development without requiring a database
- Deterministic integration tests
- Simple operational inspection
*Note: For production, this should eventually adopt the `ConfigRegistry` pattern (e.g. backed by NATS KV) to avoid reliance on persistent file storage in Docker Swarm.*
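The atomic write pattern (write to a temporary file, then rename over the target) might look like the sketch below. It is not `PlacementStore`'s actual code; `write_atomic` and the `.tmp` suffix are hypothetical, and the sketch assumes the temporary file lives on the same filesystem as the target so that the rename replaces the file in one step.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to `path` via a sibling temp file followed by a rename,
/// so readers never observe a half-written state file.
pub fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Make sure the parent directory exists (e.g. `billing/` for `billing/dev.json`).
    if let Some(parent) = path.parent() {
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }
    // Hypothetical naming scheme: `<target>.tmp` next to the target.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);
    {
        let mut file = File::create(&tmp_path)?;
        file.write_all(contents)?;
        // Flush file data to disk before the rename makes it visible.
        file.sync_all()?;
    }
    fs::rename(&tmp_path, path)
}
```

Concurrent writers within one process would still need the in-process lock mentioned in Milestone 0, since two writers would share the same temp path.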
Suggested persisted structure:
- `BillingStateFile`:
- `revision` (UUID-based)
- `tenants: { <tenant_id>: TenantBillingState }`
- `TenantBillingState`:
- `provider: stripe | polar`
- `provider_customer_id`
- `provider_subscription_id`
- `provider_checkout_session_id` (last initiated; optional)
- `status`
- `plan`
- `current_period_end`
- `cancel_at_period_end`
- `processed_webhook_event_ids` (bounded set; for idempotency)
- `updated_at`
Idempotency constraints:
- Webhook event IDs are stored per tenant, capped to a fixed size (e.g. last 256 IDs) to prevent unbounded growth.
- Updates are monotonic:
- prefer provider event timestamps to ignore out-of-order “older” state transitions.
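The two idempotency constraints above (a bounded set of processed event IDs and monotonic updates) can be sketched together. `TenantEventLog`, `Outcome`, and the use of Unix-second timestamps are assumptions made for illustration; the cap of 256 comes from the example figure above.

```rust
use std::collections::VecDeque;

/// Cap on remembered webhook event IDs per tenant (the "last 256 IDs" example).
pub const MAX_PROCESSED_EVENT_IDS: usize = 256;

/// Hypothetical per-tenant bookkeeping for webhook idempotency.
#[derive(Debug, Default)]
pub struct TenantEventLog {
    processed: VecDeque<String>,
    last_event_ts: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The event is new and not older than the latest applied one.
    Applied,
    /// The event ID was already seen; acknowledge without re-applying.
    Duplicate,
    /// The event is older than the latest applied state; record it but do not apply.
    Stale,
}

impl TenantEventLog {
    /// Decide what to do with an event, given its provider ID and timestamp.
    pub fn observe(&mut self, event_id: &str, event_ts: u64) -> Outcome {
        if self.processed.iter().any(|id| id == event_id) {
            return Outcome::Duplicate;
        }
        // Evict the oldest ID once the cap is reached to bound growth.
        if self.processed.len() == MAX_PROCESSED_EVENT_IDS {
            self.processed.pop_front();
        }
        self.processed.push_back(event_id.to_string());
        match self.last_event_ts {
            Some(last) if event_ts < last => Outcome::Stale,
            _ => {
                self.last_event_ts = Some(event_ts);
                Outcome::Applied
            }
        }
    }

    /// Number of event IDs currently remembered.
    pub fn len(&self) -> usize {
        self.processed.len()
    }
}
```

A caller would mutate `TenantBillingState` only on `Outcome::Applied` and still answer the provider with a `2xx` for the other two outcomes.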
## Target Architecture
### Control API (Rust)
- New billing routes:
- `GET /admin/v1/tenants/{tenant_id}/billing` (read current billing + entitlements)
- `POST /admin/v1/tenants/{tenant_id}/billing/checkout` (create checkout session URL)
- `POST /admin/v1/tenants/{tenant_id}/billing/portal` (create portal session URL)
- `POST /billing/v1/webhooks/{provider}` (provider webhook ingress; does not require auth)
- Billing policy enforcement:
- Entitlements derived server-side
- Per-endpoint enforcement can be introduced gradually behind a feature flag
### Control UI (Vite + React)
- New “Billing” page scoped to a tenant:
- Current plan + status
- “Upgrade / Subscribe” (checkout)
- “Manage billing” (portal)
- Clear error states when billing is not configured
## Provider Contract (Adapter Surface)
Define a small provider interface so the platform remains stable even if switching providers:
- `create_checkout_session(tenant_id, plan, return_url) -> url`
- `create_portal_session(tenant_id, return_url) -> url`
- `verify_and_parse_webhook(headers, body) -> BillingEvent`
- `apply_event(event) -> TenantBillingState mutation`
- Optional: `reconcile(tenant_id) -> TenantBillingState` (periodic correction)
Provider mapping requirements:
- Persist tenant identity at the provider level:
- Prefer setting `tenant_id` as provider customer metadata.
- If customer metadata is not available, store an internal mapping from `provider_customer_id -> tenant_id`.
- Ensure subscription creation is single-flight per tenant:
- Prevent duplicate active subscriptions by checking local state before creating new sessions.
- Use provider idempotency keys where supported (or internal idempotency per tenant+plan).
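As a rough sketch of the adapter surface, the trait below covers the three provider-facing operations, alongside a deterministic mock of the kind Milestone 0 calls for. The concrete Rust signatures, the `BillingEvent` variant, the `x-mock-signature` header, and the `tenant:status` body format are all hypothetical; `apply_event` and `reconcile` are omitted for brevity.

```rust
/// Hypothetical provider-agnostic event; the real enum likely has more variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BillingEvent {
    SubscriptionUpdated { tenant_id: String, status: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    InvalidSignature,
    Malformed,
}

/// Adapter surface that the rest of the platform depends on.
pub trait BillingProvider {
    fn create_checkout_session(
        &self,
        tenant_id: &str,
        plan: &str,
        return_url: &str,
    ) -> Result<String, ProviderError>;

    fn create_portal_session(
        &self,
        tenant_id: &str,
        return_url: &str,
    ) -> Result<String, ProviderError>;

    fn verify_and_parse_webhook(
        &self,
        headers: &[(String, String)],
        body: &[u8],
    ) -> Result<BillingEvent, ProviderError>;
}

/// Deterministic mock for tests: no network calls and no real signatures.
pub struct MockProvider;

impl BillingProvider for MockProvider {
    fn create_checkout_session(
        &self,
        tenant_id: &str,
        plan: &str,
        return_url: &str,
    ) -> Result<String, ProviderError> {
        Ok(format!("https://mock.invalid/checkout/{tenant_id}/{plan}?return={return_url}"))
    }

    fn create_portal_session(
        &self,
        tenant_id: &str,
        return_url: &str,
    ) -> Result<String, ProviderError> {
        Ok(format!("https://mock.invalid/portal/{tenant_id}?return={return_url}"))
    }

    fn verify_and_parse_webhook(
        &self,
        headers: &[(String, String)],
        body: &[u8],
    ) -> Result<BillingEvent, ProviderError> {
        // Stand-in for signature verification: check a fixed header before touching the body.
        let signed = headers
            .iter()
            .any(|(k, v)| k.eq_ignore_ascii_case("x-mock-signature") && v == "ok");
        if !signed {
            return Err(ProviderError::InvalidSignature);
        }
        // Mock body format: `<tenant_id>:<status>`.
        let text = std::str::from_utf8(body).map_err(|_| ProviderError::Malformed)?;
        let (tenant_id, status) = text.split_once(':').ok_or(ProviderError::Malformed)?;
        Ok(BillingEvent::SubscriptionUpdated {
            tenant_id: tenant_id.to_string(),
            status: status.to_string(),
        })
    }
}
```

Because route handlers only see the trait, swapping Stripe for Polar (or the mock) should not require changes outside the adapter.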
## Security & Abuse Controls
- AuthZ:
- Tenant routes require the existing tenant header to match the path tenant ID.
- `control:read` required for viewing billing status.
- `control:write` required for checkout and portal actions.
- Return URL safety:
- Only allow return URLs whose origin is in `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS`.
- Default return URL points to Control UI, derived from `CONTROL_BILLING_SELF_URL`.
- Webhook safety & observability:
- Verify signatures before parsing payloads.
- Enforce JSON size limits on webhook bodies.
- Always return `2xx` for already-processed events (idempotency).
- Never log full webhook payloads.
- Propagate provider event IDs as `x-correlation-id` in logs and spans so that webhook processing can be traced in the platform's VictoriaMetrics/Loki/Tempo observability stack (the standard set in `DEVELOPMENT_PLAN.md`).
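The return URL origin check can be illustrated with a small, dependency-free sketch. `origin_of` and `is_allowed_return_url` are hypothetical helpers; the hand-rolled parsing deliberately fails closed and is only an illustration, since a real implementation would more likely rely on a full URL parser.

```rust
/// Extract `scheme://host[:port]` from an absolute http(s) URL.
/// Returns `None` for anything the sketch does not understand (fail closed).
pub fn origin_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")?;
    let scheme = &url[..scheme_end];
    if scheme != "https" && scheme != "http" {
        return None;
    }
    let rest = &url[scheme_end + 3..];
    // The authority ends at the first path, query, or fragment delimiter.
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let authority = &rest[..authority_len];
    // Reject empty hosts, userinfo (`user@host`), and backslashes, which are common spoofing tricks.
    if authority.is_empty() || authority.contains('@') || authority.contains('\\') {
        return None;
    }
    Some(&url[..scheme_end + 3 + authority_len])
}

/// True only if the URL's origin exactly matches one of the allowed origins.
pub fn is_allowed_return_url(url: &str, allowed_origins: &[&str]) -> bool {
    match origin_of(url) {
        Some(origin) => allowed_origins
            .iter()
            .any(|allowed| allowed.trim_end_matches('/') == origin),
        None => false,
    }
}
```

Exact origin comparison (rather than prefix or substring matching) is what keeps look-alike hosts such as `control.example.test.evil.test` out.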
## API Contract (MVP)
### GET /admin/v1/tenants/{tenant_id}/billing
Returns a stable shape whether billing is configured or not:
- `configured: bool`
- `provider: stripe | polar | null`
- `plan: string | null`
- `status: string | null`
- `current_period_end: string | null`
- `cancel_at_period_end: bool | null`
- `entitlements: { ... }`
### POST /admin/v1/tenants/{tenant_id}/billing/checkout
Request:
- `plan: string`
- `return_path: string` (optional; appended to `CONTROL_BILLING_SELF_URL`)
Response:
- `url: string`
### POST /admin/v1/tenants/{tenant_id}/billing/portal
Request:
- `return_path: string` (optional)
Response:
- `url: string`
### POST /billing/v1/webhooks/{provider}
Provider-defined payload; must:
- verify signature
- map to internal events
- update local billing state atomically
## Development Plan (Milestones by Dependency)
## Milestone 0: Billing Domain + Storage + Read API
### Dependencies
- None
### Goal
Ship a provider-agnostic billing domain model and a safe persistence mechanism without contacting Stripe/Polar yet.
### Tasks
- [x] Add billing domain types in Control API:
- [x] `Plan`, `SubscriptionStatus`, `Entitlements`
- [x] provider-agnostic `BillingEvent` enum for webhook mapping
- [x] Add `BillingStore` patterned after `PlacementStore`/`ConfigRegistry`:
- [x] atomic write (tmp + rename) for dev file fallback
- [x] in-process locking
- [x] stable JSON schema + `revision`
- [x] Add `GET /admin/v1/tenants/{tenant_id}/billing`:
- [x] permission gate: requires `control:read`
- [x] tenant header enforcement consistent with existing routes
- [x] returns “not configured” when no subscription exists
- [x] Add a mock billing provider for tests:
- [x] deterministic checkout/portal URLs
- [x] deterministic webhook events without real signatures
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
- [x] billing state read/write roundtrip (atomic update)
- [x] entitlement derivation from `status + plan`
- [x] tenant isolation checks for billing routes (header vs path mismatch)
- [x] permission gates: `control:read` vs `control:write`
## Milestone 1: Checkout Flow (Create Subscription)
### Dependencies
- Milestone 0
### Goal
Allow tenant admins to initiate a subscription via the provider's hosted checkout.
### Tasks
- [x] Add provider configuration parsing and validation:
- [x] strict env parsing with actionable errors
- [x] plan-to-price/product mapping via env
- [x] Add `POST /admin/v1/tenants/{tenant_id}/billing/checkout`:
- [x] permission gate: requires `control:write`
- [x] create or reuse provider customer for the tenant
- [x] create checkout session and return redirect URL
- [x] include tenant identifier in provider metadata (for webhook routing)
- [x] internal idempotency: do not create a new checkout if tenant already has an active/trialing subscription
- [x] Define return URL contract:
- [x] checkout success/cancel landing routes in Control UI
- [x] validate the return URL built from `return_path` against `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS`
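Two of the checkout rules above lend themselves to a short sketch: the internal idempotency check and the construction of a return URL from `return_path`. `decide_checkout`, `CheckoutDecision`, and `build_return_url` are hypothetical names, and the rule that `return_path` must be a single-slash absolute path is an assumption added to keep the origin fixed to `CONTROL_BILLING_SELF_URL`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CheckoutDecision {
    /// No blocking subscription: a new checkout session may be created.
    Create,
    /// The tenant already has an active or trialing subscription.
    AlreadySubscribed,
}

/// Internal idempotency: refuse a new checkout for active/trialing tenants.
pub fn decide_checkout(current: Option<SubscriptionStatus>) -> CheckoutDecision {
    match current {
        Some(SubscriptionStatus::Active) | Some(SubscriptionStatus::Trialing) => {
            CheckoutDecision::AlreadySubscribed
        }
        _ => CheckoutDecision::Create,
    }
}

/// Append `return_path` to the self URL.
/// Assumption: only paths that start with exactly one `/` are accepted, so a
/// value such as `//evil.test` cannot change the origin.
pub fn build_return_url(self_url: &str, return_path: Option<&str>) -> Option<String> {
    let base = self_url.trim_end_matches('/');
    match return_path {
        None => Some(format!("{base}/")),
        Some(p) if p.starts_with('/') && !p.starts_with("//") && !p.contains('\\') => {
            Some(format!("{base}{p}"))
        }
        Some(_) => None,
    }
}
```

How statuses such as `past_due` or `incomplete` should interact with a new checkout is not settled in this plan; the sketch simply lets them through.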
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
- [x] config validation (missing keys, invalid mapping)
- [x] provider request construction (return URLs, metadata)
- [x] checkout idempotency rules per tenant
- [x] Env-gated integration tests (sandbox; auto-skip unless env vars are set):
- [x] `CONTROL_TEST_STRIPE=1` or `CONTROL_TEST_POLAR=1` starts checkout and returns a valid URL
- [x] tenant metadata roundtrips through the provider (where supported)
## Milestone 2: Webhook Ingestion + Subscription State Sync
### Dependencies
- Milestone 1
### Goal
Make subscription state reliable and idempotent by processing provider webhooks.
### Tasks
- [x] Add `POST /billing/v1/webhooks/{provider}` endpoint:
- [x] signature verification
- [x] event parsing to `BillingEvent`
- [x] idempotency by provider event ID
- [x] tenant mapping via provider metadata or stored `provider_customer_id`
- [x] Map provider statuses to internal `SubscriptionStatus`:
- [x] `trialing`, `active`, `past_due`, `canceled`, etc.
- [x] Store updates in `BillingStore` and expose via `GET /admin/v1/tenants/{tenant_id}/billing`
- [x] ensure updates are monotonic (ignore older provider event timestamps)
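A status mapping table for one provider might be sketched as follows. The input strings are Stripe's documented subscription status values; how each one folds into the internal `SubscriptionStatus` (for example `unpaid` into `PastDue`) is an assumption of this sketch, and `map_stripe_status` is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Map a Stripe subscription status string to the internal status.
/// Unknown values return `None` so the caller can decide how to handle them
/// instead of silently guessing.
pub fn map_stripe_status(raw: &str) -> Option<SubscriptionStatus> {
    let status = match raw {
        "trialing" => SubscriptionStatus::Trialing,
        "active" => SubscriptionStatus::Active,
        // Assumption: `unpaid` is treated like `past_due`.
        "past_due" | "unpaid" => SubscriptionStatus::PastDue,
        "paused" => SubscriptionStatus::Paused,
        // Assumption: an expired incomplete subscription counts as canceled.
        "canceled" | "incomplete_expired" => SubscriptionStatus::Canceled,
        "incomplete" => SubscriptionStatus::Incomplete,
        _ => return None,
    };
    Some(status)
}
```

Keeping the table in one exhaustive `match` makes the "status mapping tables are stable" gate a matter of asserting each pair.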
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
- [x] webhook signature verification (good/bad signatures)
- [x] idempotency behavior (same event twice does not double-apply)
- [x] status mapping tables are stable
- [x] out-of-order events do not regress state
- [x] Docker/local integration (optional, if a provider CLI is used; env-gated):
- [x] `CONTROL_TEST_STRIPE_CLI=1` runs a local webhook-forward flow and verifies state update
## Milestone 3: Customer Portal (Self-Management)
### Dependencies
- Milestone 2
### Goal
Provide a “Manage billing” path for tenants to self-serve changes without operator involvement.
### Tasks
- [x] Add `POST /admin/v1/tenants/{tenant_id}/billing/portal`:
- [x] create provider portal session and return URL
- [x] ensure tenant ownership checks (header vs path)
- [x] permission gate: requires `control:write`
- [ ] Add Control UI billing page:
- [ ] show plan/status + renewal date
- [ ] “Subscribe / Upgrade” and “Manage billing” actions
- [ ] show “Billing not configured” when provider is disabled
### Required Tests (Gate)
- [x] Workspace verification commands
- [ ] UI unit tests (Vitest):
- [ ] billing page renders from mocked API state
- [ ] action buttons call the expected API endpoints
- [x] Env-gated integration tests:
- [x] portal session URL is generated and is HTTPS
## Milestone 4: Entitlements + Enforcement (Controlled Rollout)
### Dependencies
- Milestone 2 (Milestone 3 recommended for admin UX)
### Goal
Gate selected platform capabilities by tenant subscription state while maintaining a safe rollout path.
### Tasks
- [x] Define initial entitlement set and defaults:
- [x] choose “free/trial” behavior (read-only vs limited capability)
- [x] define grace period behavior for `past_due`
- [x] Add enforcement points in Control API:
- [x] middleware/helper to require entitlement per route
- [x] first enforcement target: a low-risk, tenant-scoped “write” capability
- [x] feature flag to disable enforcement globally during rollout
- [x] Add audit log entries for billing enforcement denials (no PII, no secrets)
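The enforcement helper, its global feature flag, and the `past_due` grace period could fit together roughly as below. `EnforcementInput`, `Denial`, `check_gated_write`, and the seven-day grace window are hypothetical; the plan does not state the actual grace duration or which HTTP status a denial maps to.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Denial {
    SubscriptionRequired,
    GracePeriodExpired,
}

/// Hypothetical grace window for `past_due` tenants (7 days, in seconds).
pub const PAST_DUE_GRACE_SECS: u64 = 7 * 24 * 60 * 60;

/// Inputs to an enforcement decision; timestamps are Unix seconds.
pub struct EnforcementInput {
    /// Mirrors `CONTROL_BILLING_ENFORCEMENT`.
    pub enforcement_enabled: bool,
    pub status: Option<SubscriptionStatus>,
    pub past_due_since: Option<u64>,
    pub now: u64,
}

/// Decide whether a gated write operation may proceed.
pub fn check_gated_write(input: &EnforcementInput) -> Result<(), Denial> {
    // Global kill switch: with enforcement off, nothing is denied.
    if !input.enforcement_enabled {
        return Ok(());
    }
    match input.status {
        Some(SubscriptionStatus::Trialing) | Some(SubscriptionStatus::Active) => Ok(()),
        Some(SubscriptionStatus::PastDue) => match input.past_due_since {
            Some(since) if input.now.saturating_sub(since) <= PAST_DUE_GRACE_SECS => Ok(()),
            // Unknown start time or an elapsed window both deny (fail closed).
            _ => Err(Denial::GracePeriodExpired),
        },
        _ => Err(Denial::SubscriptionRequired),
    }
}
```

Returning a typed `Denial` also gives the audit log a stable reason code without carrying PII or secrets.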
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests (Control API):
- [x] entitlement checks per route return correct HTTP status
- [x] grace period handling
- [x] Integration tests:
- [x] a tenant without active subscription cannot perform the gated operation
- [x] an active tenant can perform the same operation
## Milestone 5: Reconciliation + Operational Hardening
### Dependencies
- Milestone 2
### Goal
Make billing state resilient against missed webhooks and operational drift.
### Tasks
- [x] Add a reconciliation job:
- [x] periodically fetch subscription state from provider for tenants
- [x] correct local state and emit audit entries
- [x] Add metrics:
- [x] webhook processing latency, verification failures, idempotency hits
- [x] tenant count by subscription status
- [x] Add robust error handling:
- [x] structured errors with safe messages
- [x] no provider payloads logged verbatim
- [x] Add provider API timeout/retry policy:
- [x] short timeouts with bounded retries
- [x] no retries on webhook signature failures
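A bounded retry policy that never retries signature failures can be sketched as below. `with_retries`, the `ProviderError` variants, and the exponential backoff are hypothetical; the sketch uses a blocking `thread::sleep` for simplicity, whereas an async service would presumably use its runtime's timer instead.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    Timeout,
    Unavailable,
    InvalidSignature,
    BadRequest,
}

impl ProviderError {
    /// Only transient failures are worth retrying.
    fn is_retryable(&self) -> bool {
        matches!(self, ProviderError::Timeout | ProviderError::Unavailable)
    }
}

/// Run `op` up to `max_attempts` times with exponential backoff.
/// Non-retryable errors (such as signature failures) are returned immediately.
pub fn with_retries<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, ProviderError>,
) -> Result<T, ProviderError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if err.is_retryable() && attempt < max_attempts => {
                // Backoff doubles per attempt: base, 2x base, 4x base, ...
                thread::sleep(base_delay * 2u32.saturating_pow(attempt - 1));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

The per-request timeout itself would be configured on the HTTP client; this helper only bounds how many times a call is attempted.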
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Unit tests:
- [x] reconciliation updates state correctly
- [x] provider errors do not corrupt local state
## Milestone 6: Production Rollout
### Dependencies
- Milestone 3 (recommended), Milestone 4 (if enforcing)
### Goal
Deploy billing in production with safe secret handling and verifiable smoke checks.
### Tasks
- [x] Provision provider configuration (operator):
- [x] create products/prices (Stripe) or products/plans (Polar)
- [x] configure webhook endpoint + secret
- [x] set up customer portal settings (Stripe) if used
- [x] Configure Swarm secrets and stack env:
- [x] provider API keys and webhook secret stored as Swarm secrets
- [x] `CONTROL_BILLING_PROVIDER`, `CONTROL_BILLING_STATE_PATH`
- [x] `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS` set to production UI origins
- [x] Define rollback plan:
- [x] disable enforcement feature flag
- [x] keep billing read-only operational
### Required Tests (Gate)
- [x] Workspace verification commands
- [x] Production smoke (env-gated):
- [x] create checkout session for a test tenant
- [x] process a webhook event and verify tenant state updates
- [x] generate a portal session URL
## Workspace Verification Commands
- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build`