# Instrumentation Requirements (Control Plane Dashboards)

## Global Conventions

- Build/version labeling: every service exports a gauge metric `*_build_info{service,version,git_sha} 1`
- Correlation identifiers:
  - Logs include `correlation_id` and `request_id` fields for all request spans
  - Traces propagate `traceparent` end-to-end and expose trace IDs in logs
- Cardinality safety:
  - `tenant_id` must not be a label on high-frequency metrics unless bounded (sampling, rollups, or explicit allowlist)

## Dashboard: Noisy Neighbor & Tenant Health

Required metrics (examples):

- `http_request_duration_ms_bucket{service,route,method,status}` histogram
- `job_duration_ms_bucket{job_kind,status}` histogram for control-plane jobs
- Optional bounded tenant signals:
  - `tenant_active_jobs{tenant_id}` only if tenant count is bounded and enforced

## Dashboard: API Regression & Deployment

Required metrics (examples):

- `*_build_info{service,version,git_sha} 1`
- `http_request_duration_ms_bucket{service,route,method,status}`
- `http_requests_total{service,route,method,status}`

Deploy markers:

- Grafana annotations for deploy events (vertical markers) or a low-cardinality metric like:
  - `deploy_event{service,version,git_sha} 1` (sparse, emitted once per deploy)

## Dashboard: Storage & Event Bus Bottlenecks

Required metrics (examples):

- Storage:
  - `process_resident_memory_bytes{service}`
  - `disk_io_time_seconds_total{device}` (node-exporter)
  - `mdbx_*` or equivalent libmdbx metrics if exposed by storage services
- Event bus / JetStream:
  - `nats_*` / `jetstream_*` metrics for consumer lag, ack latency, stream bytes, and redeliveries

## Dashboard: Infrastructure Exhaustion

Required metrics (examples):

- Node exporter:
  - `node_cpu_seconds_total`
  - `node_memory_MemAvailable_bytes`
  - `node_filesystem_avail_bytes`
- Container/service level:
  - `process_open_fds`
  - `tokio_*` runtime metrics (if enabled) for saturation indicators
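
Two of the Global Conventions, the `*_build_info` gauge and the bounded `tenant_id` label, can be illustrated with a small standard-library sketch. It is not a prescribed implementation: the helper names, the `other` overflow value, and the sample service, version, and tenant values are assumptions made for illustration. A real service would more likely emit these through its metrics client library.

```python
from typing import Collection

# Hypothetical catch-all value for tenants outside the allowlist (an assumption,
# not a name taken from the requirements above).
OVERFLOW_TENANT = "other"


def bounded_tenant_label(tenant_id: str, allowlist: Collection[str]) -> str:
    """Return tenant_id if it is allowlisted, else the shared overflow value.

    This keeps the number of distinct label values at len(allowlist) + 1.
    """
    return tenant_id if tenant_id in allowlist else OVERFLOW_TENANT


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_info_line(prefix: str, service: str, version: str, git_sha: str) -> str:
    """Render one `<prefix>_build_info{...} 1` sample in text exposition format."""
    labels = {"service": service, "version": version, "git_sha": git_sha}
    rendered = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
    return f"{prefix}_build_info{{{rendered}}} 1"


if __name__ == "__main__":
    # Sample values below are made up for the example.
    allow = {"tenant-a", "tenant-b"}
    print(bounded_tenant_label("tenant-z", allow))
    print(build_info_line("controlplane", "api", "1.4.2", "abc1234"))
```

Mapping unlisted tenants to a single overflow value is one way to satisfy the "explicit allowlist" option. Sampling or rollups, the other options named above, would need a different mechanism.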