cloudlysis/plans/SUBSCRIPTIONS_PLAN.md
Vlad Durnea 2595e7f1c5
feat(billing): implement tenant subscription entitlements system (milestones 0-6)
2026-03-30 18:41:23 +03:00


Tenant Subscriptions Plan (1 Tenant = 1 Subscription)

Principles

  • Tenant-based billing is built-in and enforced consistently:
    • Exactly one “primary” subscription per tenant.
    • Subscription state is authoritative for entitlements.
  • Provider-agnostic core with a single “billing provider” adapter:
    • Stripe or Polar can be plugged in without rewriting the rest of the platform.
  • Tasks are prioritized by ordering:
    • Within each milestone, tasks are listed top-to-bottom in priority order.
  • Each milestone is stop-the-line gated:
    • All tasks completed
    • All milestone tests pass
    • Workspace verification commands pass
  • Webhooks are treated as untrusted input:
    • Verified signatures
    • Idempotent processing
    • No secrets are ever committed or logged
  • Incremental development progression:
    • Start with local-only, file-backed state + mocked provider
    • Add real provider sandbox integration behind env-gated tests
    • Add UI self-service once the state machine is stable
    • Enforce entitlements only after billing state is reliable

Goals

  • Allow a tenant admin to self-serve billing:
    • Start a subscription (checkout)
    • Manage subscription and payment method (customer portal)
    • View current plan and billing status
  • Support Stripe or Polar as the billing backend.
  • Provide a strict, test-gated integration that is safe to deploy incrementally.
  • Keep API routes consistent with existing Control API conventions:
    • Tenant-scoped routes are under /admin/v1/tenants/{tenant_id}/... and require auth + tenant header.
    • Provider webhooks are unauthenticated but signature-verified.

Non-Goals (Initial)

  • Multiple subscriptions per tenant.
  • Per-seat billing.
  • Multiple concurrent plans per tenant.
  • Usage-based metered billing (can be added later as a separate plan).

Definitions

Tenant

A logical customer boundary identified by tenant_id (UUID) and carried via the tenant header already used by Control API endpoints.

Tenant Admin (Actor)

An authenticated principal with permission to manage billing for a tenant:

  • Read: requires control:read
  • Mutate (checkout/portal): requires control:write

Subscription

The provider subscription object mapped 1:1 to a tenant, with a local cached state:

  • status: trialing | active | past_due | paused | canceled | incomplete
  • plan: internal plan identifier (maps to provider price/product)
  • current_period_end / cancel_at_period_end

Entitlements

An internal set of feature gates derived from the subscription plan and status:

  • Examples: max deployments, max runners, S3 docs enabled, support tier, etc.
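A minimal sketch of how entitlements might be derived from plan and status. The `Plan` variants, the `Entitlements` fields, and every numeric limit below are illustrative assumptions; the plan only names the status values and gives examples of entitlements.

```rust
/// Cached subscription status, mirroring the values listed under "Subscription".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Hypothetical internal plan identifiers (the real plan set is not defined here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    Free,
    Pro,
}

/// Hypothetical entitlement set, based on the examples in this section.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entitlements {
    pub max_deployments: u32,
    pub max_runners: u32,
    pub s3_docs_enabled: bool,
}

impl Entitlements {
    /// Baseline applied when there is no usable subscription (assumed limits).
    pub fn baseline() -> Self {
        Entitlements { max_deployments: 1, max_runners: 1, s3_docs_enabled: false }
    }
}

/// Derive entitlements server-side from the cached plan and status.
/// Only trialing/active unlock plan limits in this sketch; how past_due is
/// treated is a grace-period decision deferred to Milestone 4.
pub fn derive_entitlements(
    plan: Option<Plan>,
    status: Option<SubscriptionStatus>,
) -> Entitlements {
    use SubscriptionStatus::*;
    match (plan, status) {
        (Some(Plan::Pro), Some(Trialing | Active)) => Entitlements {
            max_deployments: 20,
            max_runners: 10,
            s3_docs_enabled: true,
        },
        _ => Entitlements::baseline(),
    }
}
```

Keeping the derivation a pure function of `(plan, status)` makes the "entitlement derivation from status + plan" unit test in Milestone 0 a simple table test.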

Billing Provider

An adapter that supplies:

  • Checkout session creation
  • Portal session creation
  • Webhook event verification + parsing
  • Optional reconciliation reads (fetch subscription/customer state)

Configuration Contract (Control API)

Common Settings

  • CONTROL_BILLING_PROVIDER = stripe | polar
  • CONTROL_BILLING_STATE_PATH (default billing/dev.json)
  • CONTROL_BILLING_SELF_URL (default CONTROL_SELF_URL, used for return URLs)
  • CONTROL_BILLING_ENFORCEMENT = 0 | 1 (default 0, gates entitlement enforcement)
  • CONTROL_BILLING_WEBHOOK_PUBLIC_URL (optional; if unset, derive from CONTROL_BILLING_SELF_URL)
  • CONTROL_BILLING_ALLOWED_RETURN_ORIGINS (comma-separated; optional safety check for return URLs)

Stripe Settings (if provider = stripe)

  • CONTROL_STRIPE_SECRET_KEY (secret)
  • CONTROL_STRIPE_WEBHOOK_SECRET (secret)
  • CONTROL_STRIPE_PRICE_ID_<PLAN> (e.g. CONTROL_STRIPE_PRICE_ID_PRO, env mapping per plan)
  • Optional:
    • CONTROL_STRIPE_CUSTOMER_PORTAL_CONFIGURATION_ID

Polar Settings (if provider = polar)

  • CONTROL_POLAR_ACCESS_TOKEN (secret)
  • CONTROL_POLAR_WEBHOOK_SECRET (secret; required if Polar provides a webhook signing secret)
  • CONTROL_POLAR_PRODUCT_ID_<PLAN> or equivalent plan mapping
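The common settings could be parsed strictly along these lines. This is a sketch under assumptions: `BillingConfig`, `Provider`, and `parse_billing_config` are hypothetical names, and treating an unset `CONTROL_BILLING_PROVIDER` as "billing not configured" (rather than an error) is an assumption made to match the read API's `configured: false` shape.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Stripe,
    Polar,
}

/// Hypothetical parsed form of the common settings.
#[derive(Debug, PartialEq, Eq)]
pub struct BillingConfig {
    pub provider: Provider,
    pub state_path: String,
    pub enforcement: bool,
    pub allowed_return_origins: Vec<String>,
}

/// Parse settings from a key/value map (e.g. collected from `std::env::vars()`).
/// Returns Ok(None) when no provider is set, i.e. billing is not configured.
pub fn parse_billing_config(
    env: &HashMap<String, String>,
) -> Result<Option<BillingConfig>, String> {
    let provider = match env.get("CONTROL_BILLING_PROVIDER").map(String::as_str) {
        None => return Ok(None),
        Some("stripe") => Provider::Stripe,
        Some("polar") => Provider::Polar,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_PROVIDER must be 'stripe' or 'polar', got '{other}'"
            ))
        }
    };
    let state_path = env
        .get("CONTROL_BILLING_STATE_PATH")
        .cloned()
        .unwrap_or_else(|| "billing/dev.json".to_string());
    let enforcement = match env.get("CONTROL_BILLING_ENFORCEMENT").map(String::as_str) {
        None | Some("0") => false,
        Some("1") => true,
        Some(other) => {
            return Err(format!(
                "CONTROL_BILLING_ENFORCEMENT must be 0 or 1, got '{other}'"
            ))
        }
    };
    // Comma-separated list; surrounding whitespace and empty entries are dropped.
    let allowed_return_origins: Vec<String> = env
        .get("CONTROL_BILLING_ALLOWED_RETURN_ORIGINS")
        .map(|raw| {
            raw.split(',')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .map(String::from)
                .collect()
        })
        .unwrap_or_default();
    Ok(Some(BillingConfig { provider, state_path, enforcement, allowed_return_origins }))
}
```

Taking a map instead of reading the process environment directly keeps the parser deterministic in unit tests; provider-specific secrets would be validated in a second step and never echoed in error messages.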

Data Model (MVP: File-Backed, Tenant-Scoped)

Persist subscription mappings in a JSON file, similar to PlacementStore's atomic write pattern, to support:

  • Local development without requiring a database
  • Deterministic integration tests
  • Simple operational inspection

Note: For production, this should eventually adopt the ConfigRegistry pattern (e.g. backed by NATS KV) to avoid reliance on persistent file storage in Docker Swarm.

Suggested persisted structure:

  • BillingStateFile:
    • revision (uuid-based)
    • tenants: { <tenant_id>: TenantBillingState }
  • TenantBillingState:
    • provider: stripe | polar
    • provider_customer_id
    • provider_subscription_id
    • provider_checkout_session_id (last initiated; optional)
    • status
    • plan
    • current_period_end
    • cancel_at_period_end
    • processed_webhook_event_ids (bounded set; for idempotency)
    • updated_at

Idempotency constraints:

  • Webhook event IDs are stored per tenant, capped to a fixed size (e.g. last 256 IDs) to prevent unbounded growth.
  • Updates are monotonic:
    • prefer provider event timestamps to ignore out-of-order “older” state transitions.
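The two idempotency constraints above can be sketched as follows. This is illustrative only: the `last_event_at` field and the `ApplyOutcome` type are assumptions not present in the suggested structure, and the state is reduced to a single `status` string for brevity.

```rust
use std::collections::VecDeque;

/// Cap on remembered webhook event IDs per tenant (the "last 256 IDs" example).
const MAX_PROCESSED_EVENT_IDS: usize = 256;

#[derive(Debug, Default)]
pub struct TenantBillingState {
    pub status: Option<String>,
    /// Assumed field: provider timestamp (Unix seconds) of the newest applied event.
    pub last_event_at: Option<u64>,
    /// Oldest ID at the front, newest at the back.
    pub processed_webhook_event_ids: VecDeque<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    Applied,
    Duplicate,
    Stale,
}

impl TenantBillingState {
    /// Apply a status change carried by a provider event.
    pub fn apply(&mut self, event_id: &str, event_at: u64, status: &str) -> ApplyOutcome {
        // Idempotency: an already-seen event ID never changes state again.
        if self.processed_webhook_event_ids.iter().any(|id| id == event_id) {
            return ApplyOutcome::Duplicate;
        }
        // Record the ID even for stale events so redeliveries short-circuit,
        // then evict the oldest entries to keep the set bounded.
        self.processed_webhook_event_ids.push_back(event_id.to_string());
        while self.processed_webhook_event_ids.len() > MAX_PROCESSED_EVENT_IDS {
            self.processed_webhook_event_ids.pop_front();
        }
        // Monotonicity: ignore events older than the newest one already applied.
        if self.last_event_at.map_or(false, |seen| event_at < seen) {
            return ApplyOutcome::Stale;
        }
        self.status = Some(status.to_string());
        self.last_event_at = Some(event_at);
        ApplyOutcome::Applied
    }
}
```

Both `Duplicate` and `Stale` would still map to a 2xx webhook response, since the provider has nothing useful to retry.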

Target Architecture

Control API (Rust)

  • New billing routes:
    • GET /admin/v1/tenants/{tenant_id}/billing (read current billing + entitlements)
    • POST /admin/v1/tenants/{tenant_id}/billing/checkout (create checkout session URL)
    • POST /admin/v1/tenants/{tenant_id}/billing/portal (create portal session URL)
    • POST /billing/v1/webhooks/{provider} (provider webhook ingress; does not require auth)
  • Billing policy enforcement:
    • Entitlements derived server-side
    • Per-endpoint enforcement can be introduced gradually behind a feature flag

Control UI (Vite + React)

  • New “Billing” page scoped to a tenant:
    • Current plan + status
    • “Upgrade / Subscribe” (checkout)
    • “Manage billing” (portal)
    • Clear error states when billing is not configured

Provider Contract (Adapter Surface)

Define a small provider interface so the platform remains stable even if switching providers:

  • create_checkout_session(tenant_id, plan, return_url) -> url
  • create_portal_session(tenant_id, return_url) -> url
  • verify_and_parse_webhook(headers, body) -> BillingEvent
  • apply_event(event) -> TenantBillingState mutation
  • Optional: reconcile(tenant_id) -> TenantBillingState (periodic correction)
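As a rough illustration, the adapter surface could be expressed as a Rust trait with a mock implementation for tests. The concrete signatures, the `ProviderError` type, the `BillingEvent` variant, and the mock's header and body format are all assumptions; `apply_event` and `reconcile` are left out because they act on stored state rather than on the provider.

```rust
/// Provider-agnostic event (hypothetical shape with a single variant).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BillingEvent {
    SubscriptionUpdated { tenant_id: String, event_id: String, status: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    InvalidSignature,
    Malformed(String),
}

pub trait BillingProvider {
    fn create_checkout_session(
        &self,
        tenant_id: &str,
        plan: &str,
        return_url: &str,
    ) -> Result<String, ProviderError>;

    fn create_portal_session(
        &self,
        tenant_id: &str,
        return_url: &str,
    ) -> Result<String, ProviderError>;

    fn verify_and_parse_webhook(
        &self,
        headers: &[(String, String)],
        body: &[u8],
    ) -> Result<BillingEvent, ProviderError>;
}

/// Mock provider: deterministic URLs and a fake "signature" header, no network.
pub struct MockProvider;

impl BillingProvider for MockProvider {
    fn create_checkout_session(
        &self,
        tenant_id: &str,
        plan: &str,
        return_url: &str,
    ) -> Result<String, ProviderError> {
        Ok(format!("https://mock.invalid/checkout/{tenant_id}/{plan}?return={return_url}"))
    }

    fn create_portal_session(
        &self,
        tenant_id: &str,
        return_url: &str,
    ) -> Result<String, ProviderError> {
        Ok(format!("https://mock.invalid/portal/{tenant_id}?return={return_url}"))
    }

    fn verify_and_parse_webhook(
        &self,
        headers: &[(String, String)],
        body: &[u8],
    ) -> Result<BillingEvent, ProviderError> {
        // Verify first, parse second. The mock accepts a fixed header value
        // in place of a real signature check.
        let signed = headers
            .iter()
            .any(|(k, v)| k.eq_ignore_ascii_case("x-mock-signature") && v == "ok");
        if !signed {
            return Err(ProviderError::InvalidSignature);
        }
        // Mock body format (assumption): "tenant_id,event_id,status".
        let text = std::str::from_utf8(body)
            .map_err(|e| ProviderError::Malformed(e.to_string()))?;
        let parts: Vec<&str> = text.trim().split(',').collect();
        match parts.as_slice() {
            [tenant_id, event_id, status] => Ok(BillingEvent::SubscriptionUpdated {
                tenant_id: tenant_id.to_string(),
                event_id: event_id.to_string(),
                status: status.to_string(),
            }),
            _ => Err(ProviderError::Malformed(
                "expected tenant_id,event_id,status".to_string(),
            )),
        }
    }
}
```

Route handlers would depend only on `BillingProvider`, so swapping Stripe for Polar (or the mock) does not touch the rest of the Control API.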

Provider mapping requirements:

  • Persist tenant identity at the provider level:
    • Prefer setting tenant_id as provider customer metadata.
    • If customer metadata is not available, store an internal mapping from provider_customer_id -> tenant_id.
  • Ensure subscription creation is single-flight per tenant:
    • Prevent duplicate active subscriptions by checking local state before creating new sessions.
    • Use provider idempotency keys where supported (or internal idempotency per tenant+plan).
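The single-flight rule might reduce to a small guard evaluated before any provider call. The names `CheckoutDecision`, `checkout_decision`, and `checkout_idempotency_key` are hypothetical, and the key format is an assumption; a real key would probably also need an attempt or time component so that a tenant can retry after an abandoned checkout.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CheckoutDecision {
    /// No live subscription in local state: a checkout session may be created.
    Create,
    /// Tenant already has an active/trialing subscription: do not create another.
    RejectAlreadySubscribed,
}

/// Decide from locally cached state whether a new checkout may start.
pub fn checkout_decision(current: Option<SubscriptionStatus>) -> CheckoutDecision {
    match current {
        Some(SubscriptionStatus::Active | SubscriptionStatus::Trialing) => {
            CheckoutDecision::RejectAlreadySubscribed
        }
        _ => CheckoutDecision::Create,
    }
}

/// Deterministic per-tenant+plan key to pass to providers that accept
/// idempotency keys (format is an assumption).
pub fn checkout_idempotency_key(tenant_id: &str, plan: &str) -> String {
    format!("checkout:{tenant_id}:{plan}")
}
```

A tenant that is already subscribed would typically be pointed at the portal route instead of receiving a second checkout URL.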

Security & Abuse Controls

  • AuthZ:
    • Tenant routes require the existing tenant header to match the path tenant ID.
    • control:read required for viewing billing status.
    • control:write required for checkout and portal actions.
  • Return URL safety:
    • Only allow return URLs whose origin is in CONTROL_BILLING_ALLOWED_RETURN_ORIGINS.
    • Default return URL points to Control UI, derived from CONTROL_BILLING_SELF_URL.
  • Webhook safety & observability:
    • Verify signatures before parsing payloads.
    • Enforce JSON size limits on webhook bodies.
    • Always return 2xx for already-processed events (idempotency).
    • Never log full webhook payloads.
    • Propagate provider event IDs as x-correlation-id in logs and spans so webhook handling can be traced through the platform's VictoriaMetrics/Loki/Tempo observability stack (the standard set in DEVELOPMENT_PLAN.md).
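For the return URL safety rule, an origin allow-list check could look like the sketch below. It uses only the standard library and a deliberately conservative hand-rolled origin extraction; `origin_of` and `is_allowed_return_url` are hypothetical helpers, and a production implementation would more likely rely on a dedicated URL parser.

```rust
/// Extract "scheme://host[:port]" from an absolute http(s) URL.
/// Returns None for relative paths, other schemes, or suspicious authorities.
pub fn origin_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")?;
    let scheme = &url[..scheme_end];
    if scheme != "https" && scheme != "http" {
        return None;
    }
    let authority_start = scheme_end + 3;
    let rest = &url[authority_start..];
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let authority = &rest[..authority_len];
    // Reject empty hosts and userinfo ("user@host"), which can disguise the real host.
    if authority.is_empty() || authority.contains('@') {
        return None;
    }
    Some(&url[..authority_start + authority_len])
}

/// True only if the URL's origin exactly matches one of the allowed origins.
pub fn is_allowed_return_url(url: &str, allowed_origins: &[&str]) -> bool {
    match origin_of(url) {
        Some(origin) => allowed_origins
            .iter()
            .any(|allowed| allowed.trim_end_matches('/').eq_ignore_ascii_case(origin)),
        None => false,
    }
}
```

Comparing whole origins, rather than using a prefix or substring match, is what blocks look-alike hosts such as `https://control.example.com.evil.test`.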

API Contract (MVP)

GET /admin/v1/tenants/{tenant_id}/billing

Returns a stable shape whether billing is configured or not:

  • configured: bool
  • provider: stripe | polar | null
  • plan: string | null
  • status: string | null
  • current_period_end: string | null
  • cancel_at_period_end: bool | null
  • entitlements: { ... }

POST /admin/v1/tenants/{tenant_id}/billing/checkout

Request:

  • plan: string
  • return_path: string (optional; appended to CONTROL_BILLING_SELF_URL)

Response:

  • url: string

POST /admin/v1/tenants/{tenant_id}/billing/portal

Request:

  • return_path: string (optional)

Response:

  • url: string

POST /billing/v1/webhooks/{provider}

Provider-defined payload; must:

  • verify signature
  • map to internal events
  • update local billing state atomically

Development Plan (Milestones by Dependency)

Milestone 0: Billing Domain + Storage + Read API

Dependencies

  • None

Goal

Ship a provider-agnostic billing domain model and a safe persistence mechanism without contacting Stripe/Polar yet.

Tasks

  • Add billing domain types in Control API:
    • Plan, SubscriptionStatus, Entitlements
    • provider-agnostic BillingEvent enum for webhook mapping
  • Add BillingStore patterned after PlacementStore/ConfigRegistry:
    • atomic write (tmp + rename) for dev file fallback
    • in-process locking
    • stable JSON schema + revision
  • Add GET /admin/v1/tenants/{tenant_id}/billing:
    • permission gate: requires control:read
    • tenant header enforcement consistent with existing routes
    • returns “not configured” when no subscription exists
  • Add a mock billing provider for tests:
    • deterministic checkout/portal URLs
    • deterministic webhook events without real signatures

Required Tests (Gate)

  • Workspace verification commands
  • Unit tests (Control API):
    • billing state read/write roundtrip (atomic update)
    • entitlement derivation from status + plan
    • tenant isolation checks for billing routes (header vs path mismatch)
    • permission gates: control:read vs control:write

Milestone 1: Checkout Flow (Create Subscription)

Dependencies

  • Milestone 0

Goal

Allow tenant admins to initiate a subscription via the provider's hosted checkout.

Tasks

  • Add provider configuration parsing and validation:
    • strict env parsing with actionable errors
    • plan-to-price/product mapping via env
  • Add POST /admin/v1/tenants/{tenant_id}/billing/checkout:
    • permission gate: requires control:write
    • create or reuse provider customer for the tenant
    • create checkout session and return redirect URL
    • include tenant identifier in provider metadata (for webhook routing)
    • internal idempotency: do not create a new checkout if tenant already has an active/trialing subscription
  • Define return URL contract:
    • checkout success/cancel landing routes in Control UI
    • validate the resolved return URL (CONTROL_BILLING_SELF_URL + return_path) against CONTROL_BILLING_ALLOWED_RETURN_ORIGINS

Required Tests (Gate)

  • Workspace verification commands
  • Unit tests (Control API):
    • config validation (missing keys, invalid mapping)
    • provider request construction (return URLs, metadata)
    • checkout idempotency rules per tenant
  • Env-gated integration tests (sandbox; auto-skip unless env vars are set):
    • CONTROL_TEST_STRIPE=1 or CONTROL_TEST_POLAR=1 starts checkout and returns a valid URL
    • tenant metadata roundtrips through the provider (where supported)

Milestone 2: Webhook Ingestion + Subscription State Sync

Dependencies

  • Milestone 1

Goal

Make subscription state reliable and idempotent by processing provider webhooks.

Tasks

  • Add POST /billing/v1/webhooks/{provider} endpoint:
    • signature verification
    • event parsing to BillingEvent
    • idempotency by provider event ID
    • tenant mapping via provider metadata or stored provider_customer_id
  • Map provider statuses to internal SubscriptionStatus:
    • trialing, active, past_due, canceled, etc.
  • Store updates in BillingStore and expose via GET /admin/v1/tenants/{tenant_id}/billing
    • ensure updates are monotonic (ignore older provider event timestamps)
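A status mapping table for Stripe might look like the following sketch. The input strings are Stripe's subscription `status` values; folding `unpaid` into `PastDue` and `incomplete_expired` into `Canceled` is an assumption of this example, not a decision made in this plan. A Polar mapping would need its own table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

/// Map a Stripe subscription `status` string to the internal status.
/// Unknown values return None so the caller can log and skip the event
/// instead of guessing.
pub fn map_stripe_status(raw: &str) -> Option<SubscriptionStatus> {
    use SubscriptionStatus::*;
    Some(match raw {
        "trialing" => Trialing,
        "active" => Active,
        // Assumption: treat "unpaid" like past_due for entitlement purposes.
        "past_due" | "unpaid" => PastDue,
        "paused" => Paused,
        // Assumption: an expired incomplete subscription is effectively canceled.
        "canceled" | "incomplete_expired" => Canceled,
        "incomplete" => Incomplete,
        _ => return None,
    })
}
```

Keeping the mapping as one exhaustive `match` gives the "status mapping tables are stable" gate a single function to pin down with table-driven tests.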

Required Tests (Gate)

  • Workspace verification commands
  • Unit tests (Control API):
    • webhook signature verification (good/bad signatures)
    • idempotency behavior (same event twice does not double-apply)
    • status mapping tables are stable
    • out-of-order events do not regress state
  • Docker/local integration (optional, if a provider CLI is used; env-gated):
    • CONTROL_TEST_STRIPE_CLI=1 runs a local webhook-forward flow and verifies state update

Milestone 3: Customer Portal (Self-Management)

Dependencies

  • Milestone 2

Goal

Provide a “Manage billing” path for tenants to self-serve changes without operator involvement.

Tasks

  • Add POST /admin/v1/tenants/{tenant_id}/billing/portal:
    • create provider portal session and return URL
    • ensure tenant ownership checks (header vs path)
    • permission gate: requires control:write
  • Add Control UI billing page:
    • show plan/status + renewal date
    • “Subscribe / Upgrade” and “Manage billing” actions
    • show “Billing not configured” when provider is disabled

Required Tests (Gate)

  • Workspace verification commands
  • UI unit tests (Vitest):
    • billing page renders from mocked API state
    • action buttons call the expected API endpoints
  • Env-gated integration tests:
    • portal session URL is generated and is HTTPS

Milestone 4: Entitlements + Enforcement (Controlled Rollout)

Dependencies

  • Milestone 2 (Milestone 3 recommended for admin UX)

Goal

Gate selected platform capabilities by tenant subscription state while maintaining a safe rollout path.

Tasks

  • Define initial entitlement set and defaults:
    • choose “free/trial” behavior (read-only vs limited capability)
    • define grace period behavior for past_due
  • Add enforcement points in Control API:
    • middleware/helper to require entitlement per route
    • first enforcement target: a low-risk, tenant-scoped “write” capability
    • feature flag to disable enforcement globally during rollout
  • Add audit log entries for billing enforcement denials (no PII, no secrets)
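An enforcement helper honoring the global flag and a past_due grace period could be sketched as below. `EnforcementPolicy`, `Access`, `check_access`, and the `past_due_since` input are hypothetical; the grace length and the HTTP status returned on denial are left open by this plan.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionStatus {
    Trialing,
    Active,
    PastDue,
    Paused,
    Canceled,
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Access {
    Allowed,
    Denied,
}

pub struct EnforcementPolicy {
    /// Mirrors CONTROL_BILLING_ENFORCEMENT (0 = off, 1 = on).
    pub enabled: bool,
    /// How long a past_due tenant keeps access (assumed to be configurable).
    pub past_due_grace_secs: u64,
}

/// Decide whether a gated operation may proceed.
/// `past_due_since` and `now` are Unix timestamps in seconds.
pub fn check_access(
    policy: &EnforcementPolicy,
    status: Option<SubscriptionStatus>,
    past_due_since: Option<u64>,
    now: u64,
) -> Access {
    // Global kill switch: with enforcement off, nothing is gated.
    if !policy.enabled {
        return Access::Allowed;
    }
    match status {
        Some(SubscriptionStatus::Trialing | SubscriptionStatus::Active) => Access::Allowed,
        Some(SubscriptionStatus::PastDue) => match past_due_since {
            Some(since) if now.saturating_sub(since) <= policy.past_due_grace_secs => {
                Access::Allowed
            }
            // Unknown start of past_due, or grace exhausted: deny.
            _ => Access::Denied,
        },
        _ => Access::Denied,
    }
}
```

Because the flag is checked first, flipping `CONTROL_BILLING_ENFORCEMENT` back to 0 restores access for every tenant without touching billing state, which is the rollback path described in Milestone 6.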

Required Tests (Gate)

  • Workspace verification commands
  • Unit tests (Control API):
    • entitlement checks per route return correct HTTP status
    • grace period handling
  • Integration tests:
    • a tenant without active subscription cannot perform the gated operation
    • an active tenant can perform the same operation

Milestone 5: Reconciliation + Operational Hardening

Dependencies

  • Milestone 2

Goal

Make billing state resilient against missed webhooks and operational drift.

Tasks

  • Add a reconciliation job:
    • periodically fetch subscription state from provider for tenants
    • correct local state and emit audit entries
  • Add metrics:
    • webhook processing latency, verification failures, idempotency hits
    • tenant count by subscription status
  • Add robust error handling:
    • structured errors with safe messages
    • no provider payloads logged verbatim
  • Add provider API timeout/retry policy:
    • short timeouts with bounded retries
    • no retries on webhook signature failures
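The timeout/retry policy could be captured by a small helper that retries only transient failures, a bounded number of times. `CallError` and `with_bounded_retries` are hypothetical names, and exponential backoff is one reasonable choice rather than something this plan prescribes; per-request timeouts would be set on the HTTP client itself.

```rust
use std::thread::sleep;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    /// Worth retrying: timeouts, connection resets, provider 5xx.
    Transient(String),
    /// Never retried: signature failures, validation errors, provider 4xx.
    Permanent(String),
}

/// Run `op` up to `max_attempts` times (at least once), doubling the delay
/// after each transient failure. The closure receives the 1-based attempt number.
pub fn with_bounded_retries<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, CallError>,
) -> Result<T, CallError> {
    let mut attempt: u32 = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(CallError::Transient(_)) if attempt < max_attempts => {
                // Delay grows as base, 2*base, 4*base, ...
                sleep(base_delay * 2u32.saturating_pow(attempt - 1));
                attempt += 1;
            }
            // Permanent errors, or transient errors on the final attempt.
            Err(err) => return Err(err),
        }
    }
}
```

Classifying signature failures as `Permanent` implements the "no retries on webhook signature failures" rule directly in the error type.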

Required Tests (Gate)

  • Workspace verification commands
  • Unit tests:
    • reconciliation updates state correctly
    • provider errors do not corrupt local state

Milestone 6: Production Rollout

Dependencies

  • Milestone 3 (recommended), Milestone 4 (if enforcing)

Goal

Deploy billing in production with safe secret handling and verifiable smoke checks.

Tasks

  • Provision provider configuration (operator):
    • create products/prices (Stripe) or products/plans (Polar)
    • configure webhook endpoint + secret
    • set up customer portal settings (Stripe) if used
  • Configure Swarm secrets and stack env:
    • provider API keys and webhook secret stored as Swarm secrets
    • CONTROL_BILLING_PROVIDER, CONTROL_BILLING_STATE_PATH
    • CONTROL_BILLING_ALLOWED_RETURN_ORIGINS set to production UI origins
  • Define rollback plan:
    • disable enforcement feature flag
    • keep billing read-only operational

Required Tests (Gate)

  • Workspace verification commands
  • Production smoke (env-gated):
    • create checkout session for a test tenant
    • process a webhook event and verify tenant state updates
    • generate a portal session URL

Workspace Verification Commands

  • cargo fmt --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace
  • cd control/ui && npm ci && npm run lint && npm run typecheck && npm run test && npm run build