cloudlysis/observability/INSTRUMENTATION.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
2026-03-30 11:40:42 +03:00


# Instrumentation Requirements (Control Plane Dashboards)
## Global Conventions
- Build/version labeling: every service exports a gauge metric `*_build_info{service,version,git_sha} 1`
- Correlation identifiers:
  - Logs include `correlation_id` and `request_id` fields for all request spans
  - Traces propagate `traceparent` end-to-end and expose trace IDs in logs
- Cardinality safety:
  - `tenant_id` must not be a label on high-frequency metrics unless bounded (sampling, rollups, or explicit allowlist)
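
One way to meet the cardinality rule above is an explicit allowlist: listed tenants keep their own `tenant_id` label value and every other tenant collapses into a shared bucket. The sketch below is illustrative and uses only the standard library. `TenantLabeler`, `label_for`, and the `"other"` bucket name are hypothetical and do not come from any existing service or metrics crate.

```rust
use std::collections::HashSet;

/// Label value shared by every tenant outside the allowlist (hypothetical name).
const OTHER_TENANT: &str = "other";

/// Bounds the number of distinct `tenant_id` label values with an allowlist.
pub struct TenantLabeler {
    allowlist: HashSet<String>,
}

impl TenantLabeler {
    pub fn new<I, S>(tenants: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        TenantLabeler {
            allowlist: tenants.into_iter().map(Into::into).collect(),
        }
    }

    /// Returns the tenant ID when it is allowlisted, otherwise the shared bucket.
    pub fn label_for<'a>(&self, tenant_id: &'a str) -> &'a str {
        if self.allowlist.contains(tenant_id) {
            tenant_id
        } else {
            OTHER_TENANT
        }
    }

    /// Upper bound on distinct label values: the allowlist plus the shared bucket.
    pub fn max_label_values(&self) -> usize {
        self.allowlist.len() + 1
    }
}
```

With an allowlist of `acme` and `globex`, `label_for("acme")` returns `acme`, any other tenant ID returns `other`, and the label can never take more than three values.
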
## Dashboard: Noisy Neighbor & Tenant Health
Required metrics (examples):
- `http_request_duration_ms_bucket{service,route,method,status}` histogram
- `job_duration_ms_bucket{job_kind,status}` histogram for control-plane jobs
- Optional bounded tenant signals:
  - `tenant_active_jobs{tenant_id}` only if tenant count is bounded and enforced
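
The `_bucket` series listed above are cumulative: each `le` bucket counts every observation at or below its bound, and `+Inf` counts all of them. The following std-only sketch shows those semantics and the resulting text exposition lines. The `Histogram` type, the bucket bounds, and the pre-formatted `labels` string are assumptions for illustration. A real service would more likely rely on a metrics library than hand-roll this.

```rust
/// Minimal cumulative histogram mirroring Prometheus `_bucket` semantics.
pub struct Histogram {
    /// Upper bounds in milliseconds, ascending (the `le` label values).
    bounds_ms: Vec<f64>,
    /// counts[i] holds observations <= bounds_ms[i]; the last slot is `+Inf`.
    counts: Vec<u64>,
    sum_ms: f64,
}

impl Histogram {
    pub fn new(bounds_ms: Vec<f64>) -> Self {
        let counts = vec![0; bounds_ms.len() + 1];
        Histogram { bounds_ms, counts, sum_ms: 0.0 }
    }

    /// Records one duration in every bucket whose bound it does not exceed.
    pub fn observe(&mut self, duration_ms: f64) {
        for (i, bound) in self.bounds_ms.iter().enumerate() {
            if duration_ms <= *bound {
                self.counts[i] += 1;
            }
        }
        // The `+Inf` bucket counts every observation.
        let last = self.counts.len() - 1;
        self.counts[last] += 1;
        self.sum_ms += duration_ms;
    }

    pub fn cumulative_counts(&self) -> &[u64] {
        &self.counts
    }

    /// Renders text exposition lines. `labels` must be a non-empty,
    /// already formatted label list such as `route="/v1/jobs"`.
    pub fn render(&self, name: &str, labels: &str) -> String {
        let mut out = String::new();
        for (bound, count) in self.bounds_ms.iter().zip(&self.counts) {
            out.push_str(&format!("{}_bucket{{{},le=\"{}\"}} {}\n", name, labels, bound, count));
        }
        let total = self.counts[self.counts.len() - 1];
        out.push_str(&format!("{}_bucket{{{},le=\"+Inf\"}} {}\n", name, labels, total));
        out.push_str(&format!("{}_sum{{{}}} {}\n", name, labels, self.sum_ms));
        out.push_str(&format!("{}_count{{{}}} {}\n", name, labels, total));
        out
    }
}
```

Observing 12 ms, 80 ms, and 700 ms against bounds of 50, 100, and 500 ms yields cumulative counts of 1, 2, 2, and 3 (the last being `+Inf`).
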
## Dashboard: API Regression & Deployment
Required metrics (examples):
- `*_build_info{service,version,git_sha} 1`
- `http_request_duration_ms_bucket{service,route,method,status}`
- `http_requests_total{service,route,method,status}`
Deploy markers:
- Grafana annotations for deploy events (vertical markers) or a low-cardinality metric like:
  - `deploy_event{service,version,git_sha} 1` (sparse, emitted once per deploy)
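
Both `*_build_info` and `deploy_event` carry their information in labels and always have the value `1`. As a rough illustration, the sketch below renders such a sample line in the Prometheus text exposition format, escaping label values along the way. The function names, the `api_build_info` metric name, and the sample values in the usage note are hypothetical.

```rust
/// Escapes a label value for the Prometheus text exposition format:
/// backslash, double quote, and newline must be escaped.
fn escape_label_value(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Renders a constant-1 info-style sample such as `*_build_info` or
/// `deploy_event` with the three labels this document requires.
pub fn render_info_metric(name: &str, service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "{}{{service=\"{}\",version=\"{}\",git_sha=\"{}\"}} 1",
        name,
        escape_label_value(service),
        escape_label_value(version),
        escape_label_value(git_sha)
    )
}
```

For example, `render_info_metric("api_build_info", "api", "1.4.2", "abc1234")` produces `api_build_info{service="api",version="1.4.2",git_sha="abc1234"} 1`.
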
## Dashboard: Storage & Event Bus Bottlenecks
Required metrics (examples):
- Storage:
  - `process_resident_memory_bytes{service}`
  - `disk_io_time_seconds_total{device}` (node-exporter)
  - `mdbx_*` or equivalent libmdbx metrics if exposed by storage services
- Event bus / JetStream:
  - `nats_*` / `jetstream_*` metrics for consumer lag, ack latency, stream bytes, and redeliveries
## Dashboard: Infrastructure Exhaustion
Required metrics (examples):
- Node exporter:
  - `node_cpu_seconds_total`
  - `node_memory_MemAvailable_bytes`
  - `node_filesystem_avail_bytes`
- Container/service level:
  - `process_open_fds`
  - `tokio_*` runtime metrics (if enabled) for saturation indicators
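
`node_cpu_seconds_total` is a counter, so a dashboard has to derive utilization from the difference between two scrapes instead of reading the raw value. The sketch below shows that arithmetic for a single CPU. It assumes the samples come from the `mode="idle"` series of that counter. `CounterSample`, `counter_rate`, and `cpu_busy_fraction` are hypothetical names, and in practice a dashboard query would normally perform this calculation.

```rust
/// One scrape of a monotonically increasing counter.
pub struct CounterSample {
    pub timestamp_secs: f64,
    pub value: f64,
}

/// Per-second increase between two samples. Returns `None` when time did not
/// advance or the counter went backwards (for example after a restart).
pub fn counter_rate(prev: &CounterSample, curr: &CounterSample) -> Option<f64> {
    let elapsed = curr.timestamp_secs - prev.timestamp_secs;
    if elapsed <= 0.0 || curr.value < prev.value {
        return None;
    }
    Some((curr.value - prev.value) / elapsed)
}

/// Busy fraction of one CPU, derived from its idle-seconds counter.
/// The idle rate is idle seconds per wall-clock second, so busy = 1 - idle.
pub fn cpu_busy_fraction(idle_prev: &CounterSample, idle_curr: &CounterSample) -> Option<f64> {
    counter_rate(idle_prev, idle_curr).map(|idle_rate| (1.0 - idle_rate).max(0.0).min(1.0))
}
```

If the idle counter advances by 45 seconds over a 60 second window, the idle rate is 0.75 and the busy fraction is 0.25.
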