cloudlysis/runner/SCALING.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00


# Runner Scaling Model
## Assumptions
- Runner state (saga state, dedupe markers, checkpoints, outbox, schedules) is stored in a local MDBX database via `edge_storage`.
- Correctness for a given tenant+saga depends on reading/writing the same storage instance over time.
## Practical Scaling Model
### 1) Scale by tenant partitioning (recommended)
Run multiple Runner instances, each responsible for a disjoint set of tenants, and give each instance its own storage volume.
- Use `RUNNER_TENANT_ALLOWLIST` to bind an instance to tenants.
- Or use NATS KV placement: set `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID`.
- Streams/consumers can be shared; subjects are tenant-qualified, and per-instance consumers filter by tenant subjects.
Example:
- Runner A: `RUNNER_TENANT_ALLOWLIST=t1,t2`
- Runner B: `RUNNER_TENANT_ALLOWLIST=t3,t4`
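
Because the model relies on the tenant sets being disjoint, it can help to check the allowlists before a deployment. The following is a minimal sketch, not part of the Runner: `parse_allowlist` and `find_overlaps` are hypothetical helpers, and the comma-separated format with whitespace trimming is assumed from the example above rather than taken from the Runner's actual parser.

```python
from itertools import combinations


def parse_allowlist(raw: str) -> set[str]:
    """Split a comma-separated allowlist value into a set of tenant IDs.

    Assumption: entries are comma-separated and surrounding whitespace is ignored.
    """
    return {item.strip() for item in raw.split(",") if item.strip()}


def find_overlaps(allowlists: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Return the tenants that appear in more than one instance's allowlist."""
    parsed = {name: parse_allowlist(raw) for name, raw in allowlists.items()}
    overlaps: dict[tuple[str, str], set[str]] = {}
    # Compare every pair of instances once, in a stable (sorted) order.
    for (a, tenants_a), (b, tenants_b) in combinations(sorted(parsed.items()), 2):
        shared = tenants_a & tenants_b
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result means no tenant is claimed by two instances; any non-empty entry names the pair of instances and the tenants they would both serve.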
### NATS KV Placement (optional)
If both `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID` are set, the Runner watches a NATS KV bucket in which each entry maps a tenant to a shard:
- key = `tenant_id`
- value = `shard_id`

As entries change, the Runner updates the set of per-tenant consumers it polls, without a restart.
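
The reconciliation this implies can be sketched as a set difference. The code below is an illustration only and makes no NATS calls. It assumes that an instance owns exactly the tenants whose value equals its own `RUNNER_SHARD_ID`, and the function names are hypothetical.

```python
def owned_tenants(placement: dict[str, str], shard_id: str) -> set[str]:
    """Tenants whose placement value matches this instance's shard ID."""
    return {tenant for tenant, shard in placement.items() if shard == shard_id}


def consumer_changes(
    current: set[str], placement: dict[str, str], shard_id: str
) -> tuple[set[str], set[str]]:
    """Return (tenants to start polling, tenants to stop polling).

    `current` is the set of tenants this instance polls right now;
    `placement` is a snapshot of the bucket (tenant_id -> shard_id).
    """
    desired = owned_tenants(placement, shard_id)
    return desired - current, current - desired
```

For example, an instance with shard ID `shard-a` that polls `t1` and `t2` would, after `t2` moves to `shard-b` and `t3` is assigned to `shard-a`, start a consumer for `t3` and stop the one for `t2`.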
### 2) Multiple replicas for the same tenant (not supported with local storage)
If two replicas serve the same tenant from separate local storage instances, they do not share:
- dedupe markers
- checkpoints
- saga state

As a result, they can duplicate work.
To support same-tenant replicas, storage must be shared/replicated (not implemented here).
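
The failure mode can be shown with a toy model. This is not Runner code: the `Replica` class is hypothetical and uses an in-memory set as a stand-in for the dedupe markers each instance keeps in its local database.

```python
class Replica:
    """Toy replica with its own private dedupe store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.dedupe: set[tuple[str, str]] = set()  # (tenant_id, message_id)
        self.side_effects: list[str] = []

    def handle(self, tenant_id: str, message_id: str) -> bool:
        """Process a message unless this replica has already seen it."""
        key = (tenant_id, message_id)
        if key in self.dedupe:
            return False  # duplicate suppressed by the local marker
        self.dedupe.add(key)
        self.side_effects.append(f"{self.name} processed {tenant_id}/{message_id}")
        return True


replica_a = Replica("a")
replica_b = Replica("b")

first = replica_a.handle("t1", "msg-1")             # processed
redelivered_same = replica_a.handle("t1", "msg-1")  # suppressed: marker is local to a
redelivered_other = replica_b.handle("t1", "msg-1") # processed again: b never saw the marker
```

A redelivery to the same replica is suppressed, but a redelivery to the other replica runs the side effect a second time, which is why same-tenant replicas need shared or replicated storage.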
## Rollout/Drain Strategy
Drain an instance before stopping it:
- Call `POST /admin/drain` so the instance stops taking new work.
- Then stop the container/process.
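
A drain call can be scripted as part of a rollout. The sketch below only builds the request. The base URL is a placeholder, `build_drain_request` is a hypothetical helper, and the empty request body is an assumption, since this document does not describe the endpoint's payload, response, or authentication.

```python
import urllib.request


def build_drain_request(base_url: str) -> urllib.request.Request:
    """Build a POST request for the drain endpoint of one Runner instance.

    Assumption: the endpoint accepts an empty body.
    """
    url = base_url.rstrip("/") + "/admin/drain"
    return urllib.request.Request(url, data=b"", method="POST")
```

Sending it is a matter of passing the result to `urllib.request.urlopen(request, timeout=10)` and only stopping the container once the call has succeeded.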
## Replay
Controlled replay exists for operational/debug use:
- `POST /admin/replay` with `tenant_id`, `saga_name`, and `mode`.
- Modes:
  - `checkpoint_only`
  - `checkpoint_and_dedupe`
  - `full_reset`
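
A replay request can be assembled and validated on the client side before it is sent. This is a sketch under stated assumptions: the field names and modes come from the list above, but the JSON encoding of the body, the base URL, and the `build_replay_request` helper are hypothetical and not confirmed by this document.

```python
import json
import urllib.request

# Modes listed in this document.
REPLAY_MODES = ("checkpoint_only", "checkpoint_and_dedupe", "full_reset")


def build_replay_request(
    base_url: str, tenant_id: str, saga_name: str, mode: str
) -> urllib.request.Request:
    """Build a POST request for the replay endpoint.

    Assumption: the endpoint expects a JSON object with these three fields.
    """
    if mode not in REPLAY_MODES:
        raise ValueError(f"unknown replay mode: {mode!r}")
    payload = {"tenant_id": tenant_id, "saga_name": saga_name, "mode": mode}
    return urllib.request.Request(
        base_url.rstrip("/") + "/admin/replay",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Rejecting an unknown mode locally avoids sending a request that the Runner would presumably refuse anyway.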