# Tenant Subscriptions Plan (1 Tenant = 1 Subscription)

## Principles

- Tenant-based billing is built in and enforced consistently:
  - Exactly one “primary” subscription per tenant.
  - Subscription state is authoritative for entitlements.
- Provider-agnostic core with a single “billing provider” adapter:
  - Stripe or Polar can be plugged in without rewriting the rest of the platform.
- Tasks are prioritized by ordering:
  - Within each milestone, tasks are listed top to bottom in priority order.
- Each milestone is stop-the-line gated:
  - All tasks completed
  - All milestone tests pass
  - Workspace verification commands pass
- Webhooks are treated as untrusted input:
  - Verified signatures
  - Idempotent processing
- No secrets are ever committed or logged.
- Incremental development progression:
  - Start with local-only, file-backed state + a mocked provider
  - Add real provider sandbox integration behind env-gated tests
  - Add UI self-service once the state machine is stable
  - Enforce entitlements only after billing state is reliable

## Goals

- Allow a tenant admin to self-serve billing:
  - Start a subscription (checkout)
  - Manage the subscription and payment method (customer portal)
  - View the current plan and billing status
- Support Stripe or Polar as the billing backend.
- Provide a strict, test-gated integration that is safe to deploy incrementally.
- Keep API routes consistent with existing Control API conventions:
  - Tenant-scoped routes are under `/admin/v1/tenants/{tenant_id}/...` and require auth + tenant header.
  - Provider webhooks are unauthenticated but signature-verified.

## Non-Goals (Initial)

- Multiple subscriptions per tenant.
- Per-seat billing.
- Multiple concurrent plans per tenant.
- Usage-based metered billing (can be added later as a separate plan).

## Definitions

### Tenant

A logical customer boundary identified by `tenant_id` (UUID) and carried via the tenant header already used by Control API endpoints.

### Tenant Admin (Actor)

An authenticated principal with permission to manage billing for a tenant:

- Read: requires `control:read`
- Mutate (checkout/portal): requires `control:write`

### Subscription

The provider subscription object, mapped 1:1 to a tenant, with locally cached state:

- `status`: `trialing | active | past_due | paused | canceled | incomplete`
- `plan`: internal plan identifier (maps to a provider price/product)
- `current_period_end` / `cancel_at_period_end`

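The following is a minimal sketch of the internal status type. It assumes the wire names are exactly the six values listed above. The `is_live` helper is a hypothetical convenience for the checkout rule ("active or trialing"), not an existing Control API function.

```rust
use std::fmt;
use std::str::FromStr;

/// Internal subscription status (the six values from the definition above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

impl SubscriptionStatus {
    /// Wire name used in persisted state and API responses.
    pub fn as_str(self) -> &'static str {
        match self {
            SubscriptionStatus::Trialing => "trialing",
            SubscriptionStatus::Active => "active",
            SubscriptionStatus::PastDue => "past_due",
            SubscriptionStatus::Paused => "paused",
            SubscriptionStatus::Canceled => "canceled",
            SubscriptionStatus::Incomplete => "incomplete",
        }
    }

    /// Hypothetical helper: statuses that block starting a second checkout.
    pub fn is_live(self) -> bool {
        matches!(self, SubscriptionStatus::Trialing | SubscriptionStatus::Active)
    }
}

impl fmt::Display for SubscriptionStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for SubscriptionStatus {
    type Err = String;

    /// Strict parse: unknown strings are an error, never a silent default.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "trialing" => Ok(SubscriptionStatus::Trialing),
            "active" => Ok(SubscriptionStatus::Active),
            "past_due" => Ok(SubscriptionStatus::PastDue),
            "paused" => Ok(SubscriptionStatus::Paused),
            "canceled" => Ok(SubscriptionStatus::Canceled),
            "incomplete" => Ok(SubscriptionStatus::Incomplete),
            other => Err(format!("unknown subscription status `{other}`")),
        }
    }
}
```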
### Entitlements

An internal set of feature gates derived from the subscription plan and status:

- Examples: max deployments, max runners, S3 docs enabled, support tier, etc.

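Deriving entitlements from `status + plan` can be sketched as a pure function. The `Plan` variants, the limit values and the field names below are hypothetical placeholders. Only one rule comes from this plan: a paid plan takes effect while the subscription is `trialing` or `active`. Grace handling for `past_due` (Milestone 4) is deliberately left out here.

```rust
/// Hypothetical internal plans; real identifiers come from the plan mapping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    Free,
    Pro,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Example feature gates (placeholder names and limits).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entitlements {
    pub max_deployments: u32,
    pub max_runners: u32,
    pub s3_docs_enabled: bool,
}

impl Entitlements {
    /// Baseline for tenants without a live paid subscription.
    pub fn free_tier() -> Self {
        Entitlements { max_deployments: 1, max_runners: 1, s3_docs_enabled: false }
    }
}

/// Derived server-side: the paid plan only applies while trialing or active;
/// every other status, or no subscription at all, falls back to the free tier.
pub fn derive_entitlements(plan: Option<Plan>, status: Option<Status>) -> Entitlements {
    match (plan, status) {
        (Some(Plan::Pro), Some(Status::Trialing | Status::Active)) => Entitlements {
            max_deployments: 50,
            max_runners: 10,
            s3_docs_enabled: true,
        },
        _ => Entitlements::free_tier(),
    }
}
```

Keeping derivation pure and free of I/O makes the Milestone 0 test "entitlement derivation from `status + plan`" a simple table check.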
### Billing Provider

An adapter that supplies:

- Checkout session creation
- Portal session creation
- Webhook event verification + parsing
- Optional reconciliation reads (fetch subscription/customer state)

## Configuration Contract (Control API)

### Common Settings

- `CONTROL_BILLING_PROVIDER` = `stripe | polar`
- `CONTROL_BILLING_STATE_PATH` (default `billing/dev.json`)
- `CONTROL_BILLING_SELF_URL` (defaults to `CONTROL_SELF_URL`; used for return URLs)
- `CONTROL_BILLING_ENFORCEMENT` = `0 | 1` (default `0`; gates entitlement enforcement)
- `CONTROL_BILLING_WEBHOOK_PUBLIC_URL` (optional; if unset, derived from `CONTROL_BILLING_SELF_URL`)
- `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS` (comma-separated; optional safety check for return URLs)

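Below is a sketch of strict parsing for the common settings. It assumes billing counts as disabled when `CONTROL_BILLING_PROVIDER` is unset. The environment is passed in as a lookup closure, a hypothetical design choice that lets tests avoid mutating process-wide env vars. `CONTROL_BILLING_WEBHOOK_PUBLIC_URL` is omitted for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProviderKind {
    Stripe,
    Polar,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BillingConfig {
    pub provider: ProviderKind,
    pub state_path: String,
    pub self_url: String,
    pub enforcement: bool,
    pub allowed_return_origins: Vec<String>,
}

/// Parse the common billing settings. `Ok(None)` means billing is disabled
/// (no provider configured); any malformed value is a hard, actionable error.
pub fn parse_billing_config(
    get: impl Fn(&str) -> Option<String>,
) -> Result<Option<BillingConfig>, String> {
    let provider = match get("CONTROL_BILLING_PROVIDER").as_deref() {
        None | Some("") => return Ok(None),
        Some("stripe") => ProviderKind::Stripe,
        Some("polar") => ProviderKind::Polar,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_PROVIDER must be `stripe` or `polar`, got `{other}`"
            ))
        }
    };
    // Falls back to CONTROL_SELF_URL, as the configuration contract specifies.
    let self_url = get("CONTROL_BILLING_SELF_URL")
        .or_else(|| get("CONTROL_SELF_URL"))
        .ok_or("CONTROL_BILLING_SELF_URL (or CONTROL_SELF_URL) must be set")?;
    let enforcement = match get("CONTROL_BILLING_ENFORCEMENT").as_deref() {
        None | Some("") | Some("0") => false,
        Some("1") => true,
        Some(other) => {
            return Err(format!("CONTROL_BILLING_ENFORCEMENT must be 0 or 1, got `{other}`"))
        }
    };
    // Comma-separated origins, trimmed and normalized without a trailing slash.
    let allowed_return_origins: Vec<String> = get("CONTROL_BILLING_ALLOWED_RETURN_ORIGINS")
        .map(|raw| {
            raw.split(',')
                .map(|s| s.trim().trim_end_matches('/').to_string())
                .filter(|s| !s.is_empty())
                .collect()
        })
        .unwrap_or_default();
    Ok(Some(BillingConfig {
        provider,
        state_path: get("CONTROL_BILLING_STATE_PATH")
            .unwrap_or_else(|| "billing/dev.json".to_string()),
        self_url,
        enforcement,
        allowed_return_origins,
    }))
}
```

In the service, the lookup could simply be `|k| std::env::var(k).ok()`.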
### Stripe Settings (if provider = stripe)

- `CONTROL_STRIPE_SECRET_KEY` (secret)
- `CONTROL_STRIPE_WEBHOOK_SECRET` (secret)
- `CONTROL_STRIPE_PRICE_ID_<PLAN>` (one env var per plan, e.g. `CONTROL_STRIPE_PRICE_ID_PRO`)
- Optional:
  - `CONTROL_STRIPE_CUSTOMER_PORTAL_CONFIGURATION_ID`

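One way to build the per-plan Stripe price mapping is to do it once at startup. A request-supplied `plan` is then only looked up in a fixed table and never used to form an env var name. Two details are assumptions: the `<PLAN>` suffix is lowercased into the internal plan ID, and at least one plan is required.

```rust
use std::collections::BTreeMap;

const PRICE_PREFIX: &str = "CONTROL_STRIPE_PRICE_ID_";

/// Build the internal plan -> Stripe price ID table from env-style pairs.
/// `CONTROL_STRIPE_PRICE_ID_PRO=price_123` becomes `"pro" -> "price_123"`.
pub fn stripe_plan_prices<I>(vars: I) -> Result<BTreeMap<String, String>, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut prices = BTreeMap::new();
    for (key, value) in vars {
        // Ignore unrelated variables such as CONTROL_STRIPE_SECRET_KEY.
        if let Some(plan) = key.strip_prefix(PRICE_PREFIX) {
            let price_id = value.trim();
            if plan.is_empty() || price_id.is_empty() {
                return Err(format!("{key} must name a plan and have a non-empty price ID"));
            }
            prices.insert(plan.to_ascii_lowercase(), price_id.to_string());
        }
    }
    if prices.is_empty() {
        return Err(format!(
            "no {}<PLAN> variables set; at least one plan is required",
            PRICE_PREFIX
        ));
    }
    Ok(prices)
}
```

At startup this could be called as `stripe_plan_prices(std::env::vars())`, since `std::env::vars()` yields `(String, String)` pairs. Note that its iterator panics if any variable is not valid Unicode.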
### Polar Settings (if provider = polar)

- `CONTROL_POLAR_ACCESS_TOKEN` (secret)
- `CONTROL_POLAR_WEBHOOK_SECRET` (secret, if Polar provides a webhook signing secret)
- `CONTROL_POLAR_PRODUCT_ID_<PLAN>` or an equivalent plan mapping

## Data Model (MVP: File-Backed, Tenant-Scoped)

Persist subscription mappings in a JSON file, similar to `PlacementStore`’s atomic write pattern, to support:

- Local development without requiring a database
- Deterministic integration tests
- Simple operational inspection

*Note: For production, this should eventually adopt the `ConfigRegistry` pattern (e.g. backed by NATS KV) to avoid relying on persistent file storage in Docker Swarm.*

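The tmp + rename write can be sketched with `std` alone. This minimal version makes two assumptions. First, the state file lives on a local filesystem where a rename within one directory is atomic. Second, callers already hold the store's in-process lock, so writers in the same process never share the temp file name. It does not fsync the parent directory, which a stricter implementation would add for durability across power loss.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `path` with `contents` so readers see either the old or the new
/// file, never a partial write: write a sibling temp file, fsync it, rename.
/// Assumes the caller holds the store's in-process lock.
pub fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    let dir = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    fs::create_dir_all(dir)?;
    let name = path.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "state path has no file name")
    })?;
    // Same directory as the target, so the rename never crosses filesystems.
    let tmp = dir.join(format!(".{}.tmp-{}", name.to_string_lossy(), std::process::id()));
    {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        file.sync_all()?; // flush data to disk before it becomes visible
    }
    if let Err(err) = fs::rename(&tmp, path) {
        let _ = fs::remove_file(&tmp); // best-effort cleanup of the temp file
        return Err(err);
    }
    Ok(())
}
```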
Suggested persisted structure:

- `BillingStateFile`:
  - `revision` (UUID-based)
  - `tenants: { <tenant_id>: TenantBillingState }`
- `TenantBillingState`:
  - `provider: stripe | polar`
  - `provider_customer_id`
  - `provider_subscription_id`
  - `provider_checkout_session_id` (last initiated; optional)
  - `status`
  - `plan`
  - `current_period_end`
  - `cancel_at_period_end`
  - `processed_webhook_event_ids` (bounded set; for idempotency)
  - `updated_at`

Idempotency constraints:

- Webhook event IDs are stored per tenant, capped at a fixed size (e.g. the last 256 IDs) to prevent unbounded growth.
- Updates are monotonic:
  - Compare provider event timestamps and ignore out-of-order events that would apply an older state transition.

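Both idempotency constraints fit in a few lines. This sketch assumes a hypothetical `last_event_at` field (provider event time, in Unix seconds) alongside the persisted state. The listed structure only has `updated_at`, which presumably records local write time rather than provider ordering. Events with an equal timestamp are applied, which is one possible tie-break.

```rust
use std::collections::{HashSet, VecDeque};

/// Bounded, insertion-ordered set of processed webhook event IDs.
pub struct ProcessedEvents {
    cap: usize,
    order: VecDeque<String>,
    seen: HashSet<String>,
}

impl ProcessedEvents {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0, "cap must be positive");
        ProcessedEvents { cap, order: VecDeque::new(), seen: HashSet::new() }
    }

    pub fn contains(&self, event_id: &str) -> bool {
        self.seen.contains(event_id)
    }

    /// Record an ID; once over capacity, forget the oldest one.
    pub fn insert(&mut self, event_id: &str) {
        if self.seen.insert(event_id.to_string()) {
            self.order.push_back(event_id.to_string());
            if self.order.len() > self.cap {
                if let Some(oldest) = self.order.pop_front() {
                    self.seen.remove(&oldest);
                }
            }
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    Applied,
    Duplicate, // same event ID seen before: acknowledge with 2xx, change nothing
    Stale,     // older than the last applied event: ignore
}

/// Illustrative slice of TenantBillingState.
pub struct TenantState {
    pub status: String,
    pub last_event_at: i64, // hypothetical: provider event time, Unix seconds
    pub processed: ProcessedEvents,
}

impl TenantState {
    pub fn apply(&mut self, event_id: &str, event_at: i64, status: &str) -> ApplyOutcome {
        if self.processed.contains(event_id) {
            return ApplyOutcome::Duplicate;
        }
        // Remember stale events too, so redeliveries short-circuit above.
        self.processed.insert(event_id);
        if event_at < self.last_event_at {
            return ApplyOutcome::Stale;
        }
        self.status = status.to_string();
        self.last_event_at = event_at;
        ApplyOutcome::Applied
    }
}
```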
## Target Architecture

### Control API (Rust)

- New billing routes:
  - `GET /admin/v1/tenants/{tenant_id}/billing` (read current billing + entitlements)
  - `POST /admin/v1/tenants/{tenant_id}/billing/checkout` (create checkout session URL)
  - `POST /admin/v1/tenants/{tenant_id}/billing/portal` (create portal session URL)
  - `POST /billing/v1/webhooks/{provider}` (provider webhook ingress; does not require auth)
- Billing policy enforcement:
  - Entitlements derived server-side
  - Per-endpoint enforcement can be introduced gradually behind a feature flag

### Control UI (Vite + React)

- New “Billing” page scoped to a tenant:
  - Current plan + status
  - “Upgrade / Subscribe” (checkout)
  - “Manage billing” (portal)
  - Clear error states when billing is not configured

## Provider Contract (Adapter Surface)

Define a small provider interface so the rest of the platform remains stable when switching providers:

- `create_checkout_session(tenant_id, plan, return_url) -> url`
- `create_portal_session(tenant_id, return_url) -> url`
- `verify_and_parse_webhook(headers, body) -> BillingEvent`
- `apply_event(event) -> TenantBillingState mutation`
- Optional: `reconcile(tenant_id) -> TenantBillingState` (periodic correction)

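The adapter surface could look like the Rust trait below, shown with a deterministic mock of the kind Milestone 0 calls for. Several parts are simplifications:

- The method signatures are synchronous for brevity; a real Control API adapter would probably be async.
- `BillingEvent` has only two illustrative variants.
- The mock's header name and space-separated body format are invented for testing.
- `apply_event` and `reconcile` are left out.

```rust
use std::collections::HashMap;

/// Provider-agnostic event produced by webhook parsing (illustrative subset).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BillingEvent {
    SubscriptionUpdated {
        event_id: String,
        tenant_id: String,
        status: String,
        occurred_at: i64,
    },
    /// Verified but irrelevant event; still recorded for idempotency.
    Ignored { event_id: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    InvalidSignature,
    Malformed(String),
}

/// Adapter surface (sync for brevity; `apply_event`/`reconcile` omitted).
pub trait BillingProvider {
    fn create_checkout_session(&self, tenant_id: &str, plan: &str, return_url: &str)
        -> Result<String, ProviderError>;
    fn create_portal_session(&self, tenant_id: &str, return_url: &str)
        -> Result<String, ProviderError>;
    fn verify_and_parse_webhook(&self, headers: &HashMap<String, String>, body: &[u8])
        -> Result<BillingEvent, ProviderError>;
}

/// Deterministic test double: fixed URLs and a fake "signature" header.
pub struct MockProvider;

impl BillingProvider for MockProvider {
    fn create_checkout_session(&self, tenant_id: &str, plan: &str, _return_url: &str)
        -> Result<String, ProviderError> {
        Ok(format!("https://mock-billing.invalid/checkout/{tenant_id}/{plan}"))
    }

    fn create_portal_session(&self, tenant_id: &str, _return_url: &str)
        -> Result<String, ProviderError> {
        Ok(format!("https://mock-billing.invalid/portal/{tenant_id}"))
    }

    fn verify_and_parse_webhook(&self, headers: &HashMap<String, String>, body: &[u8])
        -> Result<BillingEvent, ProviderError> {
        // Stand-in for real signature checks: a fixed header value.
        if headers.get("x-mock-signature").map(String::as_str) != Some("valid") {
            return Err(ProviderError::InvalidSignature);
        }
        let text = std::str::from_utf8(body)
            .map_err(|e| ProviderError::Malformed(e.to_string()))?;
        // Mock wire format: "<event_id> <tenant_id> <status> <unix_ts>" or "<event_id>".
        let fields: Vec<&str> = text.split_whitespace().collect();
        match fields.as_slice() {
            [event_id, tenant_id, status, ts] => Ok(BillingEvent::SubscriptionUpdated {
                event_id: event_id.to_string(),
                tenant_id: tenant_id.to_string(),
                status: status.to_string(),
                occurred_at: ts
                    .parse()
                    .map_err(|_| ProviderError::Malformed("bad timestamp".into()))?,
            }),
            [event_id] => Ok(BillingEvent::Ignored { event_id: event_id.to_string() }),
            _ => Err(ProviderError::Malformed("unexpected field count".into())),
        }
    }
}
```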
Provider mapping requirements:

- Persist tenant identity at the provider level:
  - Prefer setting `tenant_id` as provider customer metadata.
  - If customer metadata is not available, store an internal mapping from `provider_customer_id -> tenant_id`.
- Ensure subscription creation is single-flight per tenant:
  - Prevent duplicate active subscriptions by checking local state before creating new sessions.
  - Use provider idempotency keys where supported (or internal idempotency per tenant+plan).

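The single-flight rule can be reduced to a decision made from local state before any provider call. In this sketch, `generation` is a hypothetical per-tenant counter that the store would bump whenever a fresh checkout is genuinely needed. Retries of the same request then reuse one idempotency key, while a later, deliberate attempt gets a new one. The key format is an assumption; Stripe, for example, accepts a client-chosen idempotency key on create requests. For the rule to be truly single-flight, the check and the session creation would still need to run under the per-tenant lock.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum CheckoutDecision {
    /// Create a checkout session, passing this key as the provider idempotency key.
    Create { idempotency_key: String },
    /// Tenant already has an active/trialing subscription: send them to the portal.
    UsePortal,
}

/// Decide, from local state only, whether a new checkout may be created.
/// `current_status` is the cached internal status (None = no subscription).
/// `generation` is a hypothetical per-tenant counter bumped when a fresh
/// checkout is genuinely needed, so retries of one request share a key.
pub fn decide_checkout(
    tenant_id: &str,
    plan: &str,
    current_status: Option<&str>,
    generation: u64,
) -> CheckoutDecision {
    match current_status {
        Some("active") | Some("trialing") => CheckoutDecision::UsePortal,
        _ => CheckoutDecision::Create {
            idempotency_key: format!("checkout:{tenant_id}:{plan}:{generation}"),
        },
    }
}
```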
## Security & Abuse Controls

- AuthZ:
  - Tenant routes require the existing tenant header to match the path tenant ID.
  - `control:read` is required for viewing billing status.
  - `control:write` is required for checkout and portal actions.
- Return URL safety:
  - Only allow return URLs whose origin is in `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS`.
  - The default return URL points to the Control UI, derived from `CONTROL_BILLING_SELF_URL`.
- Webhook safety & observability:
  - Verify signatures before parsing payloads.
  - Enforce a maximum body size on webhook requests.
  - Always return `2xx` for already-processed events (idempotency).
  - Never log full webhook payloads.
  - Propagate provider event IDs as `x-correlation-id` in logs and spans, so each webhook can be traced through the platform's VictoriaMetrics/Loki/Tempo observability stack (the standard convention in `DEVELOPMENT_PLAN.md`).

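Return URL resolution and the origin check can be sketched as follows. The origin extraction is intentionally minimal and hand-rolled to stay dependency-free; a real implementation would use a proper URL parser. The default path `/billing` is a hypothetical Control UI route. An empty allowlist skips the check, matching the “optional” wording in the configuration contract.

```rust
/// Lowercased `scheme://host[:port]` of an absolute http(s) URL.
/// Deliberately minimal; real code should use a proper URL parser.
fn origin_of(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();
    if scheme != "https" && scheme != "http" {
        return None;
    }
    let end = rest.find(|c: char| c == '/' || c == '?' || c == '#').unwrap_or(rest.len());
    let authority = &rest[..end];
    // Reject empty hosts and userinfo ("user@host"), which can disguise the host.
    if authority.is_empty() || authority.contains('@') {
        return None;
    }
    Some(format!("{scheme}://{}", authority.to_ascii_lowercase()))
}

/// Resolve `return_path` against the billing self URL and apply the allowlist.
/// An empty allowlist skips the origin check (the setting is optional).
pub fn resolve_return_url(
    self_url: &str,
    return_path: Option<&str>,
    allowed_origins: &[String],
) -> Result<String, String> {
    let path = return_path.unwrap_or("/billing"); // hypothetical default UI route
    // Only same-site absolute paths: "//host" would be protocol-relative.
    if !path.starts_with('/') || path.starts_with("//") || path.contains('\\') {
        return Err("return_path must be an absolute path such as /billing".into());
    }
    let url = format!("{}{}", self_url.trim_end_matches('/'), path);
    let origin = origin_of(&url).ok_or("CONTROL_BILLING_SELF_URL is not an http(s) URL")?;
    if !allowed_origins.is_empty()
        && !allowed_origins.iter().any(|o| o.eq_ignore_ascii_case(&origin))
    {
        return Err(format!("return origin {origin} is not allowed"));
    }
    Ok(url)
}
```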
## API Contract (MVP)

### GET /admin/v1/tenants/{tenant_id}/billing

Returns a stable shape whether or not billing is configured:

- `configured: bool`
- `provider: stripe | polar | null`
- `plan: string | null`
- `status: string | null`
- `current_period_end: string | null`
- `cancel_at_period_end: bool | null`
- `entitlements: { ... }`

### POST /admin/v1/tenants/{tenant_id}/billing/checkout

Request:

- `plan: string`
- `return_path: string` (optional; appended to `CONTROL_BILLING_SELF_URL`)

Response:

- `url: string`

### POST /admin/v1/tenants/{tenant_id}/billing/portal

Request:

- `return_path: string` (optional)

Response:

- `url: string`

### POST /billing/v1/webhooks/{provider}

Provider-defined payload; the handler must:

- verify the signature
- map the payload to internal events
- update local billing state atomically

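For Stripe, signature verification works on the raw body. The `Stripe-Signature` header carries a timestamp `t` and one or more `v1` signatures. Each signature is a hex HMAC-SHA256 of `"{t}.{body}"`, keyed with the endpoint secret.

The sketch below parses the header, enforces a timestamp tolerance (Stripe's libraries default to 300 seconds) and compares signatures in constant time. To stay dependency-free, the HMAC itself is injected as a closure; in practice it would come from crates such as `hmac` with `sha2`. Polar verification is not covered here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SigError {
    MissingTimestamp,
    MissingSignature,
    Expired,
    Mismatch,
}

/// Parsed `Stripe-Signature` header: `t=<unix>,v1=<hex>[,v1=<hex>...]`.
pub struct StripeSignature {
    pub timestamp: i64,
    pub v1: Vec<String>,
}

pub fn parse_stripe_signature(header: &str) -> Result<StripeSignature, SigError> {
    let mut timestamp = None;
    let mut v1 = Vec::new();
    for part in header.split(',') {
        match part.trim().split_once('=') {
            Some(("t", value)) => timestamp = value.parse::<i64>().ok(),
            Some(("v1", value)) => v1.push(value.to_string()),
            _ => {} // ignore other schemes such as v0
        }
    }
    let timestamp = timestamp.ok_or(SigError::MissingTimestamp)?;
    if v1.is_empty() {
        return Err(SigError::MissingSignature);
    }
    Ok(StripeSignature { timestamp, v1 })
}

/// Equal-length comparison that does not stop at the first differing byte.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len() && a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Verify the raw body before any JSON parsing. `hmac_sha256_hex(key, msg)`
/// is injected (e.g. backed by the `hmac` + `sha2` crates in real code).
pub fn verify_stripe_signature(
    header: &str,
    body: &[u8],
    secret: &[u8],
    now: i64,
    tolerance_secs: i64,
    hmac_sha256_hex: impl Fn(&[u8], &[u8]) -> String,
) -> Result<(), SigError> {
    let sig = parse_stripe_signature(header)?;
    if (now - sig.timestamp).abs() > tolerance_secs {
        return Err(SigError::Expired); // replay protection
    }
    // Signed payload is "<t>.<raw body>".
    let mut signed = sig.timestamp.to_string().into_bytes();
    signed.push(b'.');
    signed.extend_from_slice(body);
    let expected = hmac_sha256_hex(secret, signed.as_slice());
    if sig.v1.iter().any(|s| constant_time_eq(s.as_bytes(), expected.as_bytes())) {
        Ok(())
    } else {
        Err(SigError::Mismatch)
    }
}
```

Verification failures should map to a `4xx` response without retry or payload logging, consistent with the webhook safety rules above.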
## Development Plan (Milestones by Dependency)

## Milestone 0: Billing Domain + Storage + Read API

### Dependencies

- None

### Goal

Ship a provider-agnostic billing domain model and a safe persistence mechanism without contacting Stripe/Polar yet.

### Tasks

- [x] Add billing domain types in Control API:
  - [x] `Plan`, `SubscriptionStatus`, `Entitlements`
  - [x] provider-agnostic `BillingEvent` enum for webhook mapping
- [x] Add `BillingStore` patterned after `PlacementStore`/`ConfigRegistry`:
  - [x] atomic write (tmp + rename) for the dev file fallback
  - [x] in-process locking
  - [x] stable JSON schema + `revision`
- [x] Add `GET /admin/v1/tenants/{tenant_id}/billing`:
  - [x] permission gate: requires `control:read`
  - [x] tenant header enforcement consistent with existing routes
  - [x] returns “not configured” when no subscription exists
- [x] Add a mock billing provider for tests:
  - [x] deterministic checkout/portal URLs
  - [x] deterministic webhook events without real signatures

### Required Tests (Gate)

- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] billing state read/write roundtrip (atomic update)
  - [x] entitlement derivation from `status + plan`
  - [x] tenant isolation checks for billing routes (header vs path mismatch)
  - [x] permission gates: `control:read` vs `control:write`

## Milestone 1: Checkout Flow (Create Subscription)

### Dependencies

- Milestone 0

### Goal

Allow tenant admins to initiate a subscription via the provider’s hosted checkout.

### Tasks

- [x] Add provider configuration parsing and validation:
  - [x] strict env parsing with actionable errors
  - [x] plan-to-price/product mapping via env
- [x] Add `POST /admin/v1/tenants/{tenant_id}/billing/checkout`:
  - [x] permission gate: requires `control:write`
  - [x] create or reuse the provider customer for the tenant
  - [x] create a checkout session and return the redirect URL
  - [x] include the tenant identifier in provider metadata (for webhook routing)
  - [x] internal idempotency: do not create a new checkout if the tenant already has an active/trialing subscription
- [x] Define the return URL contract:
  - [x] checkout success/cancel landing routes in Control UI
  - [x] validate the resolved return URL (`CONTROL_BILLING_SELF_URL` + `return_path`) against `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS`

### Required Tests (Gate)

- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] config validation (missing keys, invalid mapping)
  - [x] provider request construction (return URLs, metadata)
  - [x] checkout idempotency rules per tenant
- [x] Env-gated integration tests (sandbox; auto-skip unless env vars are set):
  - [x] with `CONTROL_TEST_STRIPE=1` or `CONTROL_TEST_POLAR=1`, starting a checkout returns a valid URL
  - [x] tenant metadata roundtrips through the provider (where supported)

## Milestone 2: Webhook Ingestion + Subscription State Sync

### Dependencies

- Milestone 1

### Goal

Make subscription state reliable and idempotent by processing provider webhooks.

### Tasks

- [x] Add `POST /billing/v1/webhooks/{provider}` endpoint:
  - [x] signature verification
  - [x] event parsing to `BillingEvent`
  - [x] idempotency by provider event ID
  - [x] tenant mapping via provider metadata or stored `provider_customer_id`
- [x] Map provider statuses to internal `SubscriptionStatus`:
  - [x] `trialing`, `active`, `past_due`, `canceled`, etc.
- [x] Store updates in `BillingStore` and expose them via `GET /admin/v1/tenants/{tenant_id}/billing`:
  - [x] ensure updates are monotonic (ignore older provider event timestamps)

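Stripe reports a subscription's `status` as one of `incomplete`, `incomplete_expired`, `trialing`, `active`, `past_due`, `canceled`, `unpaid` or `paused`. Two of those have no internal counterpart. A table-style mapping keeps the “status mapping tables are stable” test simple.

Two choices in this sketch are judgment calls, not decisions recorded in this plan: `unpaid` maps to `past_due`, and `incomplete_expired` maps to `canceled`. Unknown values are rejected rather than silently mapped. Polar would get its own table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Map a Stripe subscription `status` string to the internal status.
/// Unknown values are errors so new provider states are noticed, not guessed.
pub fn map_stripe_status(stripe_status: &str) -> Result<SubscriptionStatus, String> {
    let mapped = match stripe_status {
        "trialing" => SubscriptionStatus::Trialing,
        "active" => SubscriptionStatus::Active,
        "past_due" => SubscriptionStatus::PastDue,
        // Assumption: retries are exhausted but the subscription still exists
        // at Stripe, so treat it as an unresolved payment, not an ended plan.
        "unpaid" => SubscriptionStatus::PastDue,
        "paused" => SubscriptionStatus::Paused,
        "canceled" => SubscriptionStatus::Canceled,
        "incomplete" => SubscriptionStatus::Incomplete,
        // Assumption: the first payment window lapsed, so the subscription is over.
        "incomplete_expired" => SubscriptionStatus::Canceled,
        other => return Err(format!("unmapped Stripe subscription status `{other}`")),
    };
    Ok(mapped)
}
```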
### Required Tests (Gate)

- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] webhook signature verification (good/bad signatures)
  - [x] idempotency behavior (same event twice does not double-apply)
  - [x] status mapping tables are stable
  - [x] out-of-order events do not regress state
- [x] Docker/local integration (optional, if a provider CLI is used; env-gated):
  - [x] `CONTROL_TEST_STRIPE_CLI=1` runs a local webhook-forward flow and verifies the state update

## Milestone 3: Customer Portal (Self-Management)

### Dependencies

- Milestone 2

### Goal

Provide a “Manage billing” path for tenants to self-serve changes without operator involvement.

### Tasks

- [x] Add `POST /admin/v1/tenants/{tenant_id}/billing/portal`:
  - [x] create provider portal session and return URL
  - [x] ensure tenant ownership checks (header vs path)
  - [x] permission gate: requires `control:write`
- [ ] Add Control UI billing page:
  - [ ] show plan/status + renewal date
  - [ ] “Subscribe / Upgrade” and “Manage billing” actions
  - [ ] show “Billing not configured” when provider is disabled

### Required Tests (Gate)

- [x] Workspace verification commands
- [ ] UI unit tests (Vitest):
  - [ ] billing page renders from mocked API state
  - [ ] action buttons call the expected API endpoints
- [x] Env-gated integration tests:
  - [x] portal session URL is generated and is HTTPS

## Milestone 4: Entitlements + Enforcement (Controlled Rollout)

### Dependencies

- Milestone 2 (Milestone 3 recommended for admin UX)

### Goal

Gate selected platform capabilities by tenant subscription state while maintaining a safe rollout path.

### Tasks

- [x] Define initial entitlement set and defaults:
  - [x] choose “free/trial” behavior (read-only vs limited capability)
  - [x] define grace period behavior for `past_due`
- [x] Add enforcement points in Control API:
  - [x] middleware/helper to require an entitlement per route
  - [x] first enforcement target: a low-risk, tenant-scoped “write” capability
  - [x] feature flag to disable enforcement globally during rollout
- [x] Add audit log entries for billing enforcement denials (no PII, no secrets)

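A per-route entitlement check with the global feature flag and a `past_due` grace window might look like this. Several elements are assumptions: the 7-day window, the `past_due_since` field, and `402 Payment Required` as the denial status. The plan itself only says the check must return the correct HTTP status. A missing `past_due_since` fails closed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny { http_status: u16, reason: &'static str },
}

/// Hypothetical grace window for `past_due` tenants (7 days, in seconds).
pub const PAST_DUE_GRACE_SECS: i64 = 7 * 24 * 60 * 60;

/// Check a gated "write" capability for one tenant.
/// `enforcement_enabled` mirrors CONTROL_BILLING_ENFORCEMENT; `past_due_since`
/// (Unix seconds) is a hypothetical field recording when past_due began.
pub fn check_write_entitlement(
    enforcement_enabled: bool,
    status: Option<Status>,
    past_due_since: Option<i64>,
    now: i64,
) -> Decision {
    if !enforcement_enabled {
        return Decision::Allow; // global kill switch for rollout/rollback
    }
    match status {
        Some(Status::Trialing | Status::Active) => Decision::Allow,
        Some(Status::PastDue) => match past_due_since {
            Some(since) if now - since <= PAST_DUE_GRACE_SECS => Decision::Allow,
            // Grace expired, or start time unknown: fail closed.
            _ => Decision::Deny { http_status: 402, reason: "past_due outside grace period" },
        },
        _ => Decision::Deny { http_status: 402, reason: "no active subscription" },
    }
}
```

Returning a `Decision` instead of writing the HTTP response directly keeps the rule unit-testable. The same value can also feed the audit log entry for denials.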
### Required Tests (Gate)

- [x] Workspace verification commands
- [x] Unit tests (Control API):
  - [x] entitlement checks per route return the correct HTTP status
  - [x] grace period handling
- [x] Integration tests:
  - [x] a tenant without an active subscription cannot perform the gated operation
  - [x] an active tenant can perform the same operation

## Milestone 5: Reconciliation + Operational Hardening

### Dependencies

- Milestone 2

### Goal

Make billing state resilient against missed webhooks and operational drift.

### Tasks

- [x] Add a reconciliation job:
  - [x] periodically fetch subscription state from the provider for each tenant
  - [x] correct local state and emit audit entries
- [x] Add metrics:
  - [x] webhook processing latency, verification failures, idempotency hits
  - [x] tenant count by subscription status
- [x] Add robust error handling:
  - [x] structured errors with safe messages
  - [x] no provider payloads logged verbatim
- [x] Add provider API timeout/retry policy:
  - [x] short timeouts with bounded retries
  - [x] no retries on webhook signature failures

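The timeout/retry policy can be sketched as a small wrapper. It retries only transient failures, uses exponential backoff and enforces a hard attempt cap. Per-request timeouts would be configured on the HTTP client and surface here as `Transient` errors. The error classification is hypothetical. The sketch blocks the calling thread (an async service would use its runtime's sleep instead) and omits jitter.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProviderCallError {
    /// Timeout, connection error, 429 or 5xx: a retry may succeed.
    Transient(String),
    /// 4xx, bad signature, bad config: retrying cannot help.
    Permanent(String),
}

/// Run `op` (given the 1-based attempt number) at most `max_attempts` times.
/// Transient errors back off base, 2*base, 4*base, ...; permanent errors
/// return immediately without another attempt.
pub fn call_with_retry<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, ProviderCallError>,
) -> Result<T, ProviderCallError> {
    assert!(max_attempts >= 1, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(ProviderCallError::Transient(_)) if attempt < max_attempts => {
                // Exponent is capped so the shift and multiplication stay small.
                thread::sleep(base_delay * (1u32 << (attempt - 1).min(16)));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```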
### Required Tests (Gate)

- [x] Workspace verification commands
- [x] Unit tests:
  - [x] reconciliation updates state correctly
  - [x] provider errors do not corrupt local state

## Milestone 6: Production Rollout

### Dependencies

- Milestone 3 (recommended), Milestone 4 (if enforcing)

### Goal

Deploy billing in production with safe secret handling and verifiable smoke checks.

### Tasks

- [x] Provision provider configuration (operator):
  - [x] create products/prices (Stripe) or products/plans (Polar)
  - [x] configure webhook endpoint + secret
  - [x] set up customer portal settings (Stripe) if used
- [x] Configure Swarm secrets and stack env:
  - [x] provider API keys and webhook secret stored as Swarm secrets
  - [x] `CONTROL_BILLING_PROVIDER`, `CONTROL_BILLING_STATE_PATH`
  - [x] `CONTROL_BILLING_ALLOWED_RETURN_ORIGINS` set to production UI origins
- [x] Define rollback plan:
  - [x] disable the enforcement feature flag
  - [x] keep billing operational in read-only mode

### Required Tests (Gate)

- [x] Workspace verification commands
- [x] Production smoke (env-gated):
  - [x] create a checkout session for a test tenant
  - [x] process a webhook event and verify tenant state updates
  - [x] generate a portal session URL

## Workspace Verification Commands

- `cargo fmt --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `cargo test --workspace`
- `cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build`