cloudlysis/runner/external_prd.md

### External PRD: Changes Required in Aggregate, Projection, Runner

This document captures the work needed outside the Gateway to support:
- Tenant-aware routing via `x-tenant-id`
- Independent horizontal scalability of Aggregate, Projection, Runner
- A safe mechanism for tenant rebalancing per service kind

---

## **Target State**

### Independent Placements

Each service kind has its own placement map:
- `aggregate_placement[tenant_id] -> aggregate_shard_id`
- `projection_placement[tenant_id] -> projection_shard_id`
- `runner_placement[tenant_id] -> runner_shard_id`

Each shard is a replica set that can scale independently.

### Rebalancing Contract (Per Service Kind)

All nodes MUST support:
- Dynamic placement updates (watch NATS KV or reload config)
- A drain mechanism that can target a specific tenant (stop acquiring new work for that tenant, finish in-flight, report status)
- Clear readiness semantics that reflect whether the node will accept work for a tenant

Additionally, all nodes SHOULD converge on the same operational contract:
- A per-tenant “accepting” gate (can this shard accept new work/queries/commands for tenant X?)
- A per-tenant “drained” signal (no in-flight work remains for tenant X)
- A per-tenant warmup/catchup signal where relevant (projection lag, aggregate snapshot availability)

---

## **Aggregate: Required Changes**

### 1) Expose a Real Command API (Gateway Upstream)

Today, Aggregate has internal command handling types (e.g., `CommandServer`) but its running HTTP server only exposes health/metrics/admin endpoints ([aggregate/http_server.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/http_server.rs#L15-L82), [aggregate/server/mod.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/server/mod.rs#L81-L213)).

Aggregate MUST expose one of the following upstream APIs for the Gateway to call:
- **Option A (Recommended)**: gRPC server implementing `aggregate.gateway.v1.CommandService/SubmitCommand` compatible with [aggregate.proto](file:///Users/vlad/Developer/cloudlysis/aggregate/proto/aggregate.proto#L1-L31).
- **Option B**: HTTP endpoint for command submission (REST), with a stable request/response shape that the Gateway can proxy.

### 2) Tenant Placement Enforcement

Aggregate MUST enforce “hosted tenants” so independent scaling is safe:
- If an Aggregate shard/node is not assigned a tenant, it MUST reject commands for that tenant (e.g., `403` or `503` with retriable hint depending on whether the issue is authorization vs placement).
- Aggregate SHOULD maintain an in-memory allowlist of hosted tenants that is driven by:
  - NATS KV placement watcher (preferred), or
  - Hot-reloaded config pushed via `/admin/reload`

Aggregate already has admin hooks for drain/reload, but they are currently generic and/or illustrative ([aggregate/http_server.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/http_server.rs#L15-L72), [aggregate/server/mod.rs](file:///Users/vlad/Developer/cloudlysis/aggregate/src/server/mod.rs#L402-L442)). These need to become placement-aware.

### 3) Tenant Drain (Per Tenant)

Aggregate MUST provide a per-tenant drain mechanism to support rebalancing:
- Stop accepting new commands for the tenant.
- Allow in-flight commands to finish (bounded wait), then report drained.
- Expose drain status per tenant (admin endpoint).

### 4) Rebalancing State Strategy

Aggregate persists snapshots locally (MDBX) and uses JetStream for events. To move a tenant:
- **Approach 1 (Snapshot migration)**: copy tenant snapshot DB/state to the target shard, then switch placement.
- **Approach 2 (Cold rehydrate)**: switch placement and let the target shard rebuild state by replaying events from JetStream; expect higher latency during warmup.

The system should support both, with the rebalancer selecting the strategy based on tenant size/SLO.

### 5) Metrics for Placement Decisions

Aggregate SHOULD expose:
- Per-tenant command rate, error rate
- In-flight commands by tenant
- Rehydrate time / snapshot hit ratio
- Storage size per tenant (if feasible)

---

## **Projection: Required Changes**

### 1) Expose Query API Upstream for Gateway

Projection has a working `QueryService` with tenant-scoped prefix scans ([uqf.rs](file:///Users/vlad/Developer/cloudlysis/projection/src/query/uqf.rs#L121-L162)) but it is not exposed via HTTP/gRPC (current HTTP routes are health/ready/metrics/info only: [projection/http/mod.rs](file:///Users/vlad/Developer/cloudlysis/projection/src/http/mod.rs#L102-L109)).

Projection MUST add one upstream API the Gateway can route to:
- `POST /query/{view_type}` (HTTP) accepting `x-tenant-id` and a UQF payload, returning `QueryResponse`.
- Or a gRPC query service (new proto) if gRPC is preferred end-to-end.

### 2) Tenant Placement Filtering (Independent Scaling)

Projection MUST support running in one of these modes:
- **Multi-tenant shard**: consumes all tenants (simple, less isolated).
- **Tenant-filtered shard (required for rebalancing)**:
  - only consumes/serves queries for the tenants assigned to that shard
  - rejects queries for unassigned tenants (consistent error semantics)

Implementation direction:
- Add a placement watcher similar to Runner’s tenant filter ([runner/tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L8-L100)).
- Apply tenant filter to:
  - event consumption subject filters (preferred), and
  - query serving validation (always).

### 3) Drain + Warmup Endpoints

Projection SHOULD add:
- `/admin/drain?tenant_id=...` (stop consuming new events for that tenant, finish in-flight, flush checkpoints)
- `/admin/reload` (apply latest placement/config)
- Optional warmup status: whether the shard has caught up to JetStream tail for that tenant/view_types

### 4) Rebalancing Strategy for Projection

Projection can rebalance safely with “warm then cut over”:
- Assign tenant to the new projection shard while old shard still serves.
- New shard catches up (replay from JetStream, build view KV).
- Switch Gateway placement for query routing to new shard.
- Drain old shard for that tenant and optionally delete old tenant KV keys.

### 5) Metrics for Placement Decisions

Projection SHOULD expose:
- JetStream lag per tenant/view_type (tail minus checkpoint)
- Query latency and scan counts
- Storage size per tenant (if feasible)

---

## **Runner: Required Changes**

Runner already has:
- A tenant placement watcher capable of producing an allowlist ([tenant_placement.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/tenant_placement.rs#L8-L100))
- Admin endpoints including drain/reload/config ([runner/http/mod.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/http/mod.rs#L69-L86))
- Gateway client integration for aggregate command submission ([runner/gateway/mod.rs](file:///Users/vlad/Developer/cloudlysis/runner/src/gateway/mod.rs#L1-L47))

To support independent scalability + rebalancing, Runner needs the following.

### 1) Per-Tenant Drain (Not Only Global)

Runner’s current drain is global (`/admin/drain` toggles a single draining flag). Runner MUST support draining a specific tenant:
- Stop acquiring new saga/effect work for the tenant.
- Allow in-flight work for the tenant to finish (bounded).
- Flush outbox for the tenant (or guarantee idempotency on handoff).
- Persist final checkpoints so another shard can continue without duplication beyond at-least-once bounds.

### 2) Placement-Enforced Work Acquisition

Runner MUST validate tenant assignment at the boundary where it:
- consumes JetStream messages (saga triggers, effect commands), and
- dispatches outbox work.

If a tenant is not assigned to the shard, Runner must not process its work.

### 3) Handoff Safety Rules for Rebalancing

Runner rebalancing should follow:
- New shard begins processing only after it is assigned the tenant.
- Old shard stops acquiring new work for that tenant, then drains.
- Idempotency remains correct across handoff using checkpoints and dedupe markers.

### 4) Metrics for Placement Decisions

Runner SHOULD expose:
- Outbox depth by tenant
- Work processing latency and retries by tenant/effect
- Schedule due items by tenant
- Consumer lag by tenant (if the consumption model supports per-tenant lag)

### 5) Auth Delivery Side Effects (Email/SMS/Push)

If the platform’s AuthN flows require out-of-band delivery (password reset links, email verification, MFA codes), the Runner SHOULD be the standard place to execute those side effects:
- Define a stable effect interface for sending transactional emails (reset links, verification links, security alerts).
- Optionally add SMS/push providers later under the same effect contract.

This keeps the Gateway free of long-lived provider credentials and aligns with the existing “effects are executed by workers” pattern.

---

## **Gateway Integration Notes**

Once the above changes exist:
- Gateway routes per `(tenant_id, service_kind)` using independent placement maps.
- Gateway can implement “warm then cut over” rebalancing for Projection and Runner by switching only query/workflow routing after readiness conditions are met.
- Gateway can enforce consistent tenant validation, authn/authz, and error semantics at the edge even as placements move.

---

## **Gaps / Opportunities**

- **KV schema + ownership**: define the exact NATS KV bucket layout, key naming, revisioning rules, and who is allowed to write placement updates.
- **Rebalancer API**: define operator workflows (plan/apply/rollback), status reporting, and audit log requirements for placement changes.
- **Shard discovery**: define how shard endpoints are registered (static config vs KV directory entries) and how health is represented.
- **Consistency boundaries**: define rebalancing guarantees per service kind (projection can be warm-cutover; runner requires checkpoint handoff; aggregate requires single-writer and state availability).