1.9 KiB
1.9 KiB
Instrumentation Requirements (Control Plane Dashboards)
Global Conventions
- Build/version labeling: every service exports a gauge metric
*_build_info{service,version,git_sha} 1 - Correlation identifiers:
- Logs include
correlation_idandrequest_idfields for all request spans - Traces propagate
traceparentend-to-end and expose trace IDs in logs
- Logs include
- Cardinality safety:
tenant_idmust not be a label on high-frequency metrics unless bounded (sampling, rollups, or explicit allowlist)
Dashboard: Noisy Neighbor & Tenant Health
Required metrics (examples):
http_request_duration_ms_bucket{service,route,method,status}histogramjob_duration_ms_bucket{job_kind,status}histogram for control-plane jobs- Optional bounded tenant signals:
tenant_active_jobs{tenant_id}only if tenant count is bounded and enforced
Dashboard: API Regression & Deployment
Required metrics (examples):
*_build_info{service,version,git_sha} 1http_request_duration_ms_bucket{service,route,method,status}http_requests_total{service,route,method,status}Deploy markers:- Grafana annotations for deploy events (vertical markers) or a low-cardinality metric like:
deploy_event{service,version,git_sha} 1(sparse, emitted once per deploy)
Dashboard: Storage & Event Bus Bottlenecks
Required metrics (examples):
- Storage:
process_resident_memory_bytes{service}disk_io_time_seconds_total{device}(node-exporter)mdbx_*or equivalent libmdbx metrics if exposed by storage services
- Event bus / JetStream:
nats_*/jetstream_*metrics for consumer lag, ack latency, stream bytes, and redeliveries
Dashboard: Infrastructure Exhaustion
Required metrics (examples):
- Node exporter:
node_cpu_seconds_totalnode_memory_MemAvailable_bytesnode_filesystem_avail_bytes
- Container/service level:
process_open_fdstokio_*runtime metrics (if enabled) for saturation indicators