Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
59
runner/SCALING.md
Normal file
@@ -0,0 +1,59 @@
# Runner Scaling Model

## Assumptions

- Runner state (saga state, dedupe markers, checkpoints, outbox, schedules) is stored in a local MDBX database via `edge_storage`.
- Correctness for a given tenant+saga depends on all of its reads and writes going to the same storage instance over time.

## Practical Scaling Model

### 1) Scale by tenant partitioning (recommended)

Run multiple Runner instances, each responsible for a disjoint set of tenants, and give each instance its own storage volume.

- Use `RUNNER_TENANT_ALLOWLIST` to bind an instance to tenants.
- Or use NATS KV placement: set `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID`.
- Streams/consumers can be shared; subjects are tenant-qualified, and per-instance consumers filter by tenant subjects.

Example:

- Runner A: `RUNNER_TENANT_ALLOWLIST=t1,t2`
- Runner B: `RUNNER_TENANT_ALLOWLIST=t3,t4`

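Because the model relies on allowlists being disjoint, it can help to check a planned deployment before rolling it out. The following is a minimal sketch, not part of the Runner: it assumes the comma-separated format shown in the example above, and `parse_allowlist` and `find_overlaps` are hypothetical helper names.

```python
from itertools import combinations


def parse_allowlist(value: str) -> set[str]:
    """Split a comma-separated allowlist value into a set of tenant IDs.

    Assumes the `t1,t2` format from the example; surrounding whitespace
    and empty entries are ignored.
    """
    return {t.strip() for t in value.split(",") if t.strip()}


def find_overlaps(instances: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Return tenants claimed by more than one instance, keyed by instance pair.

    `instances` maps a (hypothetical) instance name to its
    RUNNER_TENANT_ALLOWLIST value. An empty result means the sets are disjoint.
    """
    parsed = {name: parse_allowlist(value) for name, value in instances.items()}
    overlaps: dict[tuple[str, str], set[str]] = {}
    for a, b in combinations(sorted(parsed), 2):
        shared = parsed[a] & parsed[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# The example deployment above is disjoint, so no overlaps are reported.
planned = {"runner-a": "t1,t2", "runner-b": "t3,t4"}
assert find_overlaps(planned) == {}
```

A non-empty result points at tenants that would be served by two instances with separate storage, which is the unsupported case described under "2)" in this document.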
### NATS KV Placement (optional)

If `RUNNER_TENANT_PLACEMENT_BUCKET` and `RUNNER_SHARD_ID` are both set, the Runner watches a NATS KV bucket in which:

- key = tenant_id
- value = shard_id

As entries change, the Runner dynamically updates the set of per-tenant consumers it polls, without restarting.

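The update step can be pictured as a reconciliation between the bucket contents and the consumers currently running. The sketch below is an illustration only: it models the bucket as a plain dictionary rather than a NATS client, and it assumes that a Runner serves exactly the tenants whose value equals its own `RUNNER_SHARD_ID`, which this document implies but does not state. The function names are hypothetical.

```python
def owned_tenants(placement: dict[str, str], shard_id: str) -> set[str]:
    """Tenants whose placement value (shard_id) matches this Runner's shard."""
    return {tenant for tenant, shard in placement.items() if shard == shard_id}


def reconcile(
    active: set[str], placement: dict[str, str], shard_id: str
) -> tuple[set[str], set[str]]:
    """Compare running per-tenant consumers with the desired placement.

    Returns (to_start, to_stop): tenants that need a consumer started and
    tenants whose consumer should be stopped.
    """
    desired = owned_tenants(placement, shard_id)
    return desired - active, active - desired


# Snapshot of the bucket: key = tenant_id, value = shard_id.
placement = {"t1": "shard-a", "t2": "shard-b", "t3": "shard-a"}

# This Runner is shard-a and currently polls consumers for t1 and t2.
to_start, to_stop = reconcile({"t1", "t2"}, placement, "shard-a")
assert to_start == {"t3"}
assert to_stop == {"t2"}
```

In a real watcher, the same comparison would run each time the bucket reports a change.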
### 2) Multiple replicas for the same tenant (not supported with local storage)

If two replicas for the same tenant use different local storages, they will not share:

- dedupe markers
- checkpoints
- saga state

As a result, they can duplicate work.

To support same-tenant replicas, storage must be shared/replicated (not implemented here).

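The duplication can be shown with a toy model. This is a simplified sketch rather than Runner code: an in-memory set stands in for the dedupe markers in local storage, and the `Replica` class is hypothetical.

```python
from typing import Optional


class Replica:
    """Toy replica whose dedupe markers live in a set (stand-in for local storage)."""

    def __init__(self, name: str, dedupe: Optional[set] = None):
        self.name = name
        # Each replica gets its own marker set unless one is passed in explicitly.
        self.dedupe = set() if dedupe is None else dedupe
        self.processed: list[str] = []

    def handle(self, message_id: str) -> bool:
        """Process a message unless its dedupe marker is already present."""
        if message_id in self.dedupe:
            return False
        self.dedupe.add(message_id)
        self.processed.append(message_id)
        return True


# Separate storage: both replicas process the same message.
a, b = Replica("a"), Replica("b")
assert a.handle("m1") is True
assert b.handle("m1") is True  # duplicate work

# Shared storage (not implemented in the Runner): the second replica skips it.
shared: set = set()
c, d = Replica("c", shared), Replica("d", shared)
assert c.handle("m1") is True
assert d.handle("m1") is False
```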
## Rollout/Drain Strategy

Use the drain endpoint before stopping a process:

- Call `POST /admin/drain` to stop taking new work.
- Then stop the container/process.

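A rollout script might issue the drain call along these lines. This is a sketch under assumptions: the base URL is a placeholder, the document does not say whether the endpoint expects a body or authentication, and `build_drain_request` is a hypothetical helper.

```python
import urllib.request


def build_drain_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the drain endpoint."""
    url = base_url.rstrip("/") + "/admin/drain"
    # An empty body is an assumption; the expected payload, if any, is not documented here.
    return urllib.request.Request(url, data=b"", method="POST")


request = build_drain_request("http://localhost:8080")  # placeholder address
assert request.get_method() == "POST"
assert request.full_url == "http://localhost:8080/admin/drain"

# To actually send it before stopping the process:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```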
## Replay

Controlled replay exists for operational/debug use:

- `POST /admin/replay` with `tenant_id`, `saga_name`, and `mode`.
- Modes:
  - `checkpoint_only`
  - `checkpoint_and_dedupe`
  - `full_reset`
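
A client could validate the mode before sending a replay request. The sketch below assumes the three fields are sent as a JSON object, which this document does not specify; `build_replay_body` and the sample saga name are hypothetical.

```python
import json

# Mode names as listed in this document.
REPLAY_MODES = ("checkpoint_only", "checkpoint_and_dedupe", "full_reset")


def build_replay_body(tenant_id: str, saga_name: str, mode: str) -> bytes:
    """Encode replay parameters as JSON, rejecting unknown modes early."""
    if mode not in REPLAY_MODES:
        raise ValueError(f"unknown replay mode {mode!r}; expected one of {REPLAY_MODES}")
    payload = {"tenant_id": tenant_id, "saga_name": saga_name, "mode": mode}
    return json.dumps(payload).encode("utf-8")


body = build_replay_body("t1", "example_saga", "checkpoint_only")
assert json.loads(body)["mode"] == "checkpoint_only"
```

The resulting bytes would be sent as the body of `POST /admin/replay`.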