# Runner Scaling Model
## Assumptions
- Runner state (saga state, dedupe markers, checkpoints, outbox, schedules) is stored in a local MDBX database via `edge_storage`.
- Correctness for a given tenant+saga depends on reading/writing the same storage instance over time.

## Practical Scaling Model
### 1) Scale by tenant partitioning (recommended)
Run multiple Runner instances, each responsible for a disjoint set of tenants, and give each instance its own storage volume.

- Use `RUNNER_TENANT_ALLOWLIST` to bind an instance to tenants.
- Or use NATS KV placement: set `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID`.
- Streams/consumers can be shared; subjects are tenant-qualified, and per-instance consumers filter by tenant subjects.

Example:

- Runner A: `RUNNER_TENANT_ALLOWLIST=t1,t2`
- Runner B: `RUNNER_TENANT_ALLOWLIST=t3,t4`

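As a rough illustration of the static scheme above, the following sketch parses a `RUNNER_TENANT_ALLOWLIST` value into the tenant-qualified subjects an instance would filter on. The subject scheme `runner.<tenant>.>` and the helper names are assumptions for illustration, not the Runner's actual code.

```python
# Sketch (not the actual Runner implementation): how a static allowlist
# could map to the tenant-qualified consumer subjects one instance filters on.
# The "runner.<tenant>.>" subject scheme is an assumed example.

def parse_allowlist(raw: str) -> set[str]:
    """Parse a RUNNER_TENANT_ALLOWLIST value like "t1,t2" into a tenant set."""
    return {t.strip() for t in raw.split(",") if t.strip()}

def consumer_subjects(allowlist: set[str]) -> list[str]:
    """One filtered subject per allowed tenant, in stable order."""
    return sorted(f"runner.{tenant}.>" for tenant in allowlist)

# Runner A and Runner B cover disjoint tenant sets, as in the example above.
runner_a = consumer_subjects(parse_allowlist("t1,t2"))
runner_b = consumer_subjects(parse_allowlist("t3,t4"))
assert set(runner_a).isdisjoint(runner_b)  # no tenant handled twice
```

Because the tenant sets are disjoint, the two instances never claim work for the same tenant even when they share streams.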
### NATS KV Placement (optional)
If `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID` are set, the Runner watches a NATS KV bucket where:

- key = `tenant_id`
- value = `shard_id`

and dynamically updates the set of per-tenant consumers it is polling, without restarting.

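The reconciliation step described above can be sketched as a pure diff: given the bucket contents and this instance's `RUNNER_SHARD_ID`, compute which per-tenant consumers to start and which to stop. The function and variable names are illustrative assumptions, not the Runner's API.

```python
# Sketch of placement reconciliation: the KV bucket maps tenant_id -> shard_id;
# each instance keeps only the consumers for tenants assigned to its shard.
# Names here are illustrative, not the Runner's actual interface.

def reconcile(placement: dict[str, str], shard_id: str,
              current: set[str]) -> tuple[set[str], set[str]]:
    """Return (tenants_to_start, tenants_to_stop) for this shard."""
    desired = {t for t, s in placement.items() if s == shard_id}
    return desired - current, current - desired

# A KV update moves t2 from shard-1 to shard-2 and assigns t3 to shard-1;
# shard-1 adjusts its consumer set without restarting.
placement = {"t1": "shard-1", "t2": "shard-2", "t3": "shard-1"}
start, stop = reconcile(placement, "shard-1", current={"t1", "t2"})
```

Running the diff on every KV watch event keeps the consumer set convergent even if individual updates are missed, since it always compares full desired state against current state.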
### 2) Multiple replicas for the same tenant (not supported with local storage)
If two replicas for the same tenant use different local storage instances, they will not share:

- dedupe markers
- checkpoints
- saga state

and can therefore duplicate work.

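The failure mode above can be shown with a minimal sketch, using an in-memory set to stand in for the MDBX-backed dedupe markers (an assumption for illustration):

```python
# Sketch of the duplicate-work hazard: two replicas with separate local
# dedupe stores both process the same message, while replicas that shared
# one store would not. The in-memory set stands in for MDBX-backed markers.

def process(msg_id: str, dedupe: set[str]) -> bool:
    """Process a message unless its dedupe marker exists; True if processed."""
    if msg_id in dedupe:
        return False
    dedupe.add(msg_id)
    return True

replica_a, replica_b = set(), set()   # separate local storage per replica
assert process("m1", replica_a)       # replica A processes m1
assert process("m1", replica_b)       # replica B processes m1 again: duplicate

shared = set()                        # hypothetical shared storage
assert process("m1", shared)
assert not process("m1", shared)      # redelivery is deduped
```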
To support same-tenant replicas, storage must be shared/replicated (not implemented here).
## Rollout/Drain Strategy
Use the drain endpoint before stopping a process:

- `POST /admin/drain` to stop taking new work.
- Then stop the container/process.

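A minimal sketch of the two-step rollout, assuming `POST /admin/drain` simply flips a flag that the poll loop checks before claiming new work (the Runner's real endpoint behavior may differ):

```python
# Sketch of the drain sequence: the /admin/drain handler sets a flag, the
# poll loop stops claiming new work, and the process can then be stopped.
# This mirrors only the two-step rollout described above, not the real code.

class Runner:
    def __init__(self) -> None:
        self.draining = False

    def drain(self) -> None:
        """Handler body for POST /admin/drain: stop taking new work."""
        self.draining = True

    def poll(self, queue: list[str]):
        """Claim the next message, or None when draining or the queue is empty."""
        if self.draining or not queue:
            return None
        return queue.pop(0)

r, queue = Runner(), ["m1", "m2"]
assert r.poll(queue) == "m1"    # normal operation
r.drain()                       # step 1: stop taking new work
assert r.poll(queue) is None    # step 2: now safe to stop the process
```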
## Replay
Controlled replay exists for operational/debug use:

- `POST /admin/replay` with `tenant_id`, `saga_name`, and `mode`.
- Modes:
  - `checkpoint_only`
  - `checkpoint_and_dedupe`
  - `full_reset`
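One plausible reading of the three modes, sketched against the state kinds listed under Assumptions; the reset semantics and storage layout here are assumptions, not the Runner's actual implementation:

```python
# Sketch: what each replay mode could clear for a tenant+saga, assuming
# the state kinds from the Assumptions section. Illustrative only.

def replay(state: dict, mode: str) -> dict:
    """Return the state that would survive a replay in the given mode."""
    cleared = {
        "checkpoint_only": {"checkpoints"},
        "checkpoint_and_dedupe": {"checkpoints", "dedupe_markers"},
        "full_reset": {"checkpoints", "dedupe_markers", "saga_state"},
    }[mode]
    return {k: v for k, v in state.items() if k not in cleared}

state = {"checkpoints": 7, "dedupe_markers": {"m1"}, "saga_state": "running"}
assert "checkpoints" not in replay(state, "checkpoint_only")
assert replay(state, "full_reset") == {}
```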