cloudlysis/observability/INSTRUMENTATION.md
Vlad Durnea 1298d9a3df
Monorepo consolidation: workspace, shared types, transport plans, docker/swarm assets
2026-03-30 11:40:42 +03:00


Instrumentation Requirements (Control Plane Dashboards)

Global Conventions

  • Build/version labeling: every service exports a gauge metric *_build_info{service,version,git_sha} with a constant value of 1
  • Correlation identifiers:
    • Logs include correlation_id and request_id fields for all request spans
    • Traces propagate the W3C traceparent header end-to-end, and every log line written inside a traced request includes the trace ID
  • Cardinality safety:
    • tenant_id must not be a label on high-frequency metrics unless bounded (sampling, rollups, or explicit allowlist)
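
One way to enforce the explicit-allowlist option above is to map every raw tenant ID through a small policy object before it is used as a label value. The following is a minimal sketch using only the Rust standard library. `TenantLabelPolicy` and the `"other"` overflow value are hypothetical names chosen for illustration, not part of any existing crate or of this codebase.

```rust
use std::collections::HashSet;

/// Hypothetical helper: maps raw tenant IDs onto a bounded set of label values.
pub struct TenantLabelPolicy {
    allowed: HashSet<String>,
}

impl TenantLabelPolicy {
    /// Label value shared by every tenant outside the allowlist (an assumption).
    pub const OVERFLOW: &'static str = "other";

    pub fn new<I, S>(allowed: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            allowed: allowed.into_iter().map(Into::into).collect(),
        }
    }

    /// Returns the tenant ID if it is allowlisted, otherwise the overflow value.
    pub fn label_for<'a>(&'a self, tenant_id: &'a str) -> &'a str {
        if self.allowed.contains(tenant_id) {
            tenant_id
        } else {
            Self::OVERFLOW
        }
    }

    /// Upper bound on distinct `tenant_id` label values this policy can emit.
    pub fn max_series(&self) -> usize {
        self.allowed.len() + 1
    }
}
```

With an allowlist of two tenants, `label_for` can return at most three distinct values, so the series count per metric stays fixed no matter how many tenants exist.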

Dashboard: Noisy Neighbor & Tenant Health

Required metrics (examples):

  • http_request_duration_ms_bucket{service,route,method,status} histogram
  • job_duration_ms_bucket{job_kind,status} histogram for control-plane jobs
  • Optional bounded tenant signals:
    • tenant_active_jobs{tenant_id} only if tenant count is bounded and enforced
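
The `_bucket` suffix on the histograms above implies Prometheus-style cumulative buckets: each `le` bucket counts every observation at or below its upper bound, and a final `+Inf` bucket counts all observations. The sketch below illustrates that bookkeeping with the standard library only. `DurationHistogramMs` and the bucket bounds used with it are illustrative assumptions; a real service would more likely use a metrics crate than hand-roll this.

```rust
/// Minimal cumulative histogram for millisecond durations (illustrative only).
pub struct DurationHistogramMs {
    /// Upper bounds in milliseconds, sorted ascending; `+Inf` is implicit.
    bounds: Vec<f64>,
    /// Per-bucket (non-cumulative) counts; the last slot is the `+Inf` bucket.
    counts: Vec<u64>,
}

impl DurationHistogramMs {
    pub fn new(mut bounds: Vec<f64>) -> Self {
        bounds.sort_by(|a, b| a.total_cmp(b));
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    /// Records one observation in the first bucket whose bound is >= the value.
    pub fn observe(&mut self, duration_ms: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|bound| duration_ms <= *bound)
            .unwrap_or(self.bounds.len()); // falls through to `+Inf`
        self.counts[idx] += 1;
    }

    /// Cumulative counts as exposed by `*_bucket{le="..."}`, ending with `+Inf`.
    pub fn cumulative(&self) -> Vec<(String, u64)> {
        let mut running: u64 = 0;
        let mut out = Vec::with_capacity(self.counts.len());
        for (i, count) in self.counts.iter().enumerate() {
            running += *count;
            let le = match self.bounds.get(i) {
                Some(bound) => bound.to_string(),
                None => "+Inf".to_string(),
            };
            out.push((le, running));
        }
        out
    }
}
```

Because the buckets are cumulative, the `+Inf` count always equals the total number of observations, which is what lets dashboards derive quantiles and request counts from the same series.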

Dashboard: API Regression & Deployment

Required metrics (examples):

  • *_build_info{service,version,git_sha} 1
  • http_request_duration_ms_bucket{service,route,method,status}
  • http_requests_total{service,route,method,status}

Deploy markers:

  • Grafana annotations for deploy events (vertical markers) or a low-cardinality metric like:
    • deploy_event{service,version,git_sha} 1 (sparse, emitted once per deploy)
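
Both `*_build_info` and `deploy_event` are constant-1 samples whose information lives entirely in the labels. As a rough illustration of what such a sample looks like in the Prometheus text exposition format, the sketch below renders one line by hand. The function names are hypothetical, and the service name, version, and SHA in the usage note are placeholders.

```rust
/// Escapes a label value for the Prometheus text exposition format
/// (backslash, double quote, and newline must be escaped).
fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Renders one constant-1 info sample, e.g. `<prefix>_build_info` or `deploy_event`.
/// Hypothetical helper; a metrics library would normally produce this line.
pub fn render_info_sample(metric: &str, service: &str, version: &str, git_sha: &str) -> String {
    format!(
        "{}{{service=\"{}\",version=\"{}\",git_sha=\"{}\"}} 1",
        metric,
        escape_label_value(service),
        escape_label_value(version),
        escape_label_value(git_sha),
    )
}
```

For a placeholder service called `gateway`, `render_info_sample("gateway_build_info", "gateway", "1.4.2", "abc1234")` produces `gateway_build_info{service="gateway",version="1.4.2",git_sha="abc1234"} 1`. Since the label set only changes on deploy, the series count stays low.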

Dashboard: Storage & Event Bus Bottlenecks

Required metrics (examples):

  • Storage:
    • process_resident_memory_bytes{service}
    • node_disk_io_time_seconds_total{device} (node-exporter)
    • mdbx_* or equivalent libmdbx metrics if exposed by storage services
  • Event bus / JetStream:
    • nats_* / jetstream_* metrics for consumer lag, ack latency, stream bytes, and redeliveries
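
The node-exporter disk I/O time counter listed under Storage is cumulative seconds, so a bottleneck panel needs its rate of change, not its raw value. The sketch below shows the arithmetic for two scrapes of one device. `CounterSample` and `busy_fraction` are hypothetical names, and the reset handling is deliberately simpler than what PromQL's `rate()` does.

```rust
/// One scrape of a monotonically increasing counter.
#[derive(Clone, Copy)]
pub struct CounterSample {
    pub unix_seconds: f64,
    pub value: f64,
}

/// Fraction of wall-clock time the device spent doing I/O between two scrapes,
/// clamped to at most 1.0. Returns `None` if the samples are out of order or
/// the counter went backwards (for example after a reboot).
pub fn busy_fraction(prev: CounterSample, curr: CounterSample) -> Option<f64> {
    let elapsed = curr.unix_seconds - prev.unix_seconds;
    let delta = curr.value - prev.value;
    if elapsed <= 0.0 || delta < 0.0 {
        return None;
    }
    Some((delta / elapsed).min(1.0))
}
```

A result close to 1.0 means the device was busy for nearly the whole interval, which is the saturation signal this dashboard is meant to surface.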

Dashboard: Infrastructure Exhaustion

Required metrics (examples):

  • Node exporter:
    • node_cpu_seconds_total
    • node_memory_MemAvailable_bytes
    • node_filesystem_avail_bytes
  • Container/service level:
    • process_open_fds
    • tokio_* runtime metrics (if enabled) for saturation indicators
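
For exhaustion dashboards, the useful question is often how long until a gauge such as node_filesystem_avail_bytes reaches zero. The sketch below is a two-point linear extrapolation of that idea. The function name is hypothetical, and this is a simplification: PromQL's `predict_linear` fits a regression over a whole range of samples instead of using just two.

```rust
/// Estimates the seconds remaining until available bytes reach zero, measured
/// from the second sample, by extending the straight line through two samples.
/// Returns `None` when timestamps are not increasing or free space is flat or growing.
pub fn seconds_until_exhausted(t0: f64, avail0: f64, t1: f64, avail1: f64) -> Option<f64> {
    let elapsed = t1 - t0;
    if elapsed <= 0.0 {
        return None;
    }
    // Bytes per second; negative while the filesystem is filling up.
    let slope = (avail1 - avail0) / elapsed;
    if slope >= 0.0 {
        return None;
    }
    Some(avail1 / -slope)
}
```

If free space drops from 2000 to 1000 bytes over 100 seconds, the slope is -10 bytes per second and the estimate is 100 seconds remaining. The same arithmetic could be applied to node_memory_MemAvailable_bytes.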