# Runner Scaling Model

## Assumptions

- Runner state (saga state, dedupe markers, checkpoints, outbox, schedules) is stored in a local MDBX database via `edge_storage`.
- Correctness for a given tenant+saga depends on reading/writing the same storage instance over time.

## Practical Scaling Model

### 1) Scale by tenant partitioning (recommended)

Run multiple Runner instances, each responsible for a disjoint set of tenants, and give each instance its own storage volume.

- Use `RUNNER_TENANT_ALLOWLIST` to bind an instance to a fixed set of tenants.
- Or use NATS KV placement: set `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID`.
- Streams and consumers can be shared; subjects are tenant-qualified, and each instance's consumers filter by its tenants' subjects.

Example:

- Runner A: `RUNNER_TENANT_ALLOWLIST=t1,t2`
- Runner B: `RUNNER_TENANT_ALLOWLIST=t3,t4`

### NATS KV Placement (optional)

If `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID` are set, the Runner watches a NATS KV bucket where:

- key = tenant_id
- value = shard_id

and dynamically updates the set of per-tenant consumers it polls, without restarting.

### 2) Multiple replicas for the same tenant (not supported with local storage)

If two replicas for the same tenant use different local storage, they will not share:

- dedupe markers
- checkpoints
- saga state

and can duplicate work. Supporting same-tenant replicas would require shared or replicated storage, which is not implemented here.

## Rollout/Drain Strategy

Drain a Runner before stopping its process:

1. `POST /admin/drain` to stop accepting new work.
2. Then stop the container/process.

## Replay

Controlled replay exists for operational/debug use:

- `POST /admin/replay` with `tenant_id`, `saga_name`, and `mode`.
- Modes:
  - `checkpoint_only`
  - `checkpoint_and_dedupe`
  - `full_reset`
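The tenant-selection rules above (a static allowlist, or matching placement entries against this instance's shard id) can be sketched as two pure functions. This is an illustrative sketch, not the Runner's actual implementation; the function names are hypothetical, and only the env var names come from this document.

```python
def allowlisted_tenants(env: dict[str, str]) -> set[str]:
    """Parse RUNNER_TENANT_ALLOWLIST, a comma-separated list of tenant ids."""
    raw = env.get("RUNNER_TENANT_ALLOWLIST", "")
    return {t.strip() for t in raw.split(",") if t.strip()}


def owned_tenants(placement: dict[str, str], shard_id: str) -> set[str]:
    """Given a snapshot of the placement KV bucket (tenant_id -> shard_id),
    return the tenants this instance should poll per-tenant consumers for."""
    return {tenant for tenant, shard in placement.items() if shard == shard_id}
```

In KV-placement mode, the Runner would recompute `owned_tenants` on each watch update and add or remove per-tenant consumers to match, which is what allows rebalancing without a restart.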
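One plausible reading of the three replay modes is that each clears progressively more per-saga state. The mapping below is an assumption inferred from the mode names alone, not confirmed behavior; verify against the Runner's actual replay handler before relying on it.

```python
def replay_reset_scope(mode: str) -> set[str]:
    """Map a replay mode to the state it clears before replaying.

    ASSUMPTION: semantics inferred from the mode names in the docs,
    not from the Runner's source.
    """
    scopes = {
        "checkpoint_only": {"checkpoints"},
        "checkpoint_and_dedupe": {"checkpoints", "dedupe_markers"},
        "full_reset": {"checkpoints", "dedupe_markers", "saga_state"},
    }
    if mode not in scopes:
        raise ValueError(f"unknown replay mode: {mode}")
    return scopes[mode]
```

Under this reading, `checkpoint_and_dedupe` re-delivers work that dedupe markers would otherwise suppress, so it is the mode most likely to cause intentional duplicate processing during a debug replay.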