# Instrumentation Requirements (Control Plane Dashboards)

## Global Conventions
- Build/version labeling: every service exports a gauge metric `*_build_info{service,version,git_sha} 1`
- Correlation identifiers:
  - Logs include `correlation_id` and `request_id` fields for all request spans
  - Traces propagate `traceparent` end-to-end and expose trace IDs in logs
- Cardinality safety:
  - `tenant_id` must not be a label on high-frequency metrics unless bounded (sampling, rollups, or explicit allowlist)

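
The correlation and cardinality conventions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the services' actual code: the allowlist contents, the `other` fallback value, and both function names are hypothetical. The `traceparent` layout follows the W3C Trace Context format, and the parser deliberately accepts only version `00`.

```python
import re

# Hypothetical allowlist; a real deployment would load this from configuration.
TENANT_ALLOWLIST = frozenset({"tenant-a", "tenant-b"})
# Hypothetical fallback label value for tenants outside the allowlist.
OTHER_TENANT = "other"

# W3C Trace Context, version 00:
# "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def bounded_tenant_label(tenant_id, allowlist=TENANT_ALLOWLIST):
    """Return a label value whose cardinality is capped by the allowlist."""
    return tenant_id if tenant_id in allowlist else OTHER_TENANT


def trace_id_from_traceparent(header):
    """Extract the trace ID for log correlation, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id = match.group(1)
    # An all-zero trace ID is invalid per the Trace Context format.
    return None if trace_id == "0" * 32 else trace_id
```

With this shape, the number of distinct `tenant_id` label values can never exceed the allowlist size plus one, which is what makes the label safe on a high-frequency metric.
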
## Dashboard: Noisy Neighbor & Tenant Health
Required metrics (examples):
- `http_request_duration_ms_bucket{service,route,method,status}` histogram
- `job_duration_ms_bucket{job_kind,status}` histogram for control-plane jobs
- Optional bounded tenant signals:
  - `tenant_active_jobs{tenant_id}` only if tenant count is bounded and enforced

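
Latency panels on this dashboard would typically read quantiles off the `*_bucket` series. The sketch below shows one way to estimate a quantile from cumulative bucket counts using linear interpolation within the matching bucket. It assumes the buckets are cumulative, sorted by upper bound, and end with a `+Inf` bucket; the function name and input shape are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_ms, cumulative_count) pairs sorted by
    upper bound, with math.inf as the last upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No finite upper bound to interpolate towards; fall back to
                # the highest finite bound seen.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Example input: 100 observations in buckets <=100 ms, <=200 ms, <=500 ms, +Inf.
example = [(100.0, 50), (200.0, 90), (500.0, 100), (math.inf, 100)]
p95_ms = estimate_quantile(0.95, example)
```

The estimate is only as fine as the bucket boundaries, so bucket layout matters more than the interpolation method when the goal is to spot a noisy neighbor.
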
## Dashboard: API Regression & Deployment
Required metrics (examples):
- `*_build_info{service,version,git_sha} 1`
- `http_request_duration_ms_bucket{service,route,method,status}`
- `http_requests_total{service,route,method,status}`

Deploy markers:
- Grafana annotations for deploy events (vertical markers) or a low-cardinality metric such as:
  - `deploy_event{service,version,git_sha} 1` (sparse, emitted once per deploy)

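
One use of the deploy marker is to compare request outcomes on either side of it. The sketch below assumes per-status request counts have already been extracted from `http_requests_total` for a window before and a window after the deploy. The function names, the input shape, and the one-percentage-point tolerance are all hypothetical.

```python
def error_ratio(requests_by_status):
    """Fraction of requests that returned a 5xx status.

    `requests_by_status` maps a status code string (e.g. "200") to the number
    of requests observed in the window.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in requests_by_status.items() if status.startswith("5"))
    return errors / total


def is_regression(before, after, tolerance=0.01):
    """True when the 5xx ratio rose by more than `tolerance` across the deploy."""
    return error_ratio(after) - error_ratio(before) > tolerance


# Example windows on either side of a deploy marker.
before_deploy = {"200": 990, "500": 10}
after_deploy = {"200": 950, "503": 50}
regressed = is_regression(before_deploy, after_deploy)
```

Pairing this comparison with the `version` label from `*_build_info` identifies which release introduced the change.
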
## Dashboard: Storage & Event Bus Bottlenecks
Required metrics (examples):
- Storage:
  - `process_resident_memory_bytes{service}`
  - `node_disk_io_time_seconds_total{device}` (node-exporter)
  - `mdbx_*` or equivalent libmdbx metrics if exposed by storage services
- Event bus / JetStream:
  - `nats_*` / `jetstream_*` metrics for consumer lag, ack latency, stream bytes, and redeliveries

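
The disk I/O time metric listed above is a counter of seconds spent doing I/O, so a panel would show its rate rather than its raw value. The sketch below computes the increase over a window while tolerating counter resets, then converts it to a busy fraction. The function names and the sampling layout are hypothetical, and the reset handling is a simplified version of what a metrics backend would do.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples.

    A drop between two samples is treated as a counter reset (for example a
    process or host restart), in which case the new value counts as the increase.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase


def disk_busy_fraction(samples, window_seconds):
    """Share of the window the device spent doing I/O (0.0 idle, 1.0 saturated)."""
    return counter_increase(samples) / window_seconds


# Example: three samples taken 30 s apart, covering a 60 s window.
busy = disk_busy_fraction([10.0, 16.0, 22.0], window_seconds=60.0)
```

A busy fraction that stays near 1.0 while consumer lag grows is a reasonable hint that storage, not the event bus, is the bottleneck.
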
## Dashboard: Infrastructure Exhaustion
Required metrics (examples):
- Node exporter:
  - `node_cpu_seconds_total`
  - `node_memory_MemAvailable_bytes`
  - `node_filesystem_avail_bytes`
- Container/service level:
  - `process_open_fds`
  - `tokio_*` runtime metrics (if enabled) for saturation indicators
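
Exhaustion panels are most useful when they project forward instead of only showing the current level. As a minimal sketch, the function below extrapolates two samples of `node_filesystem_avail_bytes` linearly to estimate the time until the filesystem is full. The function name is hypothetical, and a real dashboard would more likely fit a trend over many samples.

```python
def seconds_until_full(avail_then, avail_now, elapsed_seconds):
    """Linear projection of when available bytes reach zero.

    Returns None when free space is flat or growing, because there is then
    no projected exhaustion time.
    """
    drain_per_second = (avail_then - avail_now) / elapsed_seconds
    if drain_per_second <= 0:
        return None
    return avail_now / drain_per_second


# Example: free space fell from 100 GB to 90 GB over one hour.
eta_seconds = seconds_until_full(100e9, 90e9, 3600.0)
```

The same projection applies to any gauge with a hard floor or ceiling, such as available memory.
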